### 🗣️ [#201](https://github.com/ikawrakow/ik_llama.cpp/discussions/201) - What is the NUMA situation ?
| **Author** | `bhugueney` |
| :--- | :--- |
| **Created** | 2025-02-11 |
| **Updated** | 2025-05-21 |
---
#### Description
It seems to me that, since output generation is memory-bandwidth bound and LLMs require a lot of RAM, a cheap way to increase both RAM capacity and bandwidth is to go for NUMA.
For instance, a dual Epyc server can have 16 or 24 memory channels, and each CPU can also be split into up to 4 NUMA domains for the best theoretical performance (also, on Gen 2 Epyc at least, L3 cache is shared only amongst cores on the same CCX).
However, there are many pitfalls to efficient NUMA programming, especially when trying to minimize cross-NUMA-domain memory and PCIe accesses.
It is my understanding that llama.cpp tries to avoid the most basic problems (e.g. allocating everything in one NUMA domain) but that more work needs to be done.
[KTransformers](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md#some-explanations) just duplicates matrices on each NUMA domain !
[vLLM](https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html#other-considerations) can do tensor parallelism on NUMA : «In general each NUMA node is treated as one GPU card. »
Is ik_llama.cpp NUMA aware ? If not, are there plans to make it NUMA aware ?
Thx !
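
For reference, llama.cpp already exposes a `--numa` option (`distribute`, `isolate`, `numactl`), and instances can be pinned explicitly with `numactl`. The sketch below only illustrates the two basic strategies discussed below (one interleaved instance vs. one independent instance per NUMA node, each with its own copy of the weights); the model path, thread counts and ports are placeholders, and exact flag behaviour may differ between builds.

```bash
# Strategy 1: a single instance spanning both sockets, spreading pages across nodes
./llama-server -m ./model.gguf --numa distribute -t 64

# Strategy 2: one independent instance per NUMA node, each pinned to its own
# cores and memory (so each node keeps a full copy of the weights)
numactl --cpunodebind=0 --membind=0 ./llama-server -m ./model.gguf --numa numactl -t 32 --port 8080 &
numactl --cpunodebind=1 --membind=1 ./llama-server -m ./model.gguf --numa numactl -t 32 --port 8081 &
```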
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-02-11** at **06:09:03**:
In `ik_llama.cpp`, being a fork of `llama.cpp`, the NUMA situation is the same as in `llama.cpp`.
Improving performance on NUMA systems is something I would be interested in looking into, but I don't have a dual socket system available (with enough memory bandwidth to make it interesting), and I'm just a lonely guy hacking here for fun without the resources to go and rent/buy such a system.
> 👤 **bhugueney** replied the **2025-02-11** at **10:56:00**:
> Thx !
> I sure hope my message didn't come off as complaining: I'm very grateful for what you already did!
> If you are interested, I will try to provide you full access to my dual Epyc server with 16 × 64 GB of DDR4 @ 3200.
>
> 👤 **ikawrakow** replied the **2025-02-11** at **14:47:10**:
> This would be of course great, but I'm hesitant to promise to tackle the NUMA issue right away.
>
> When you say "full access", you mean you are not going to be using the system while I'm using it? Which Epycs do you have?
>
> 👤 **bhugueney** replied the **2025-02-11** at **23:17:06**:
> I'm not expecting any promises, especially as I'm afraid llama.cpp cannot be patched to become NUMA efficient. My (very) limited understanding is that people ran the llama.cpp CPU backend on NUMA systems and got bad performance because one thread was doing all the memory allocation (so everything ended up in one NUMA domain), and they started trying to address that by patching the CPU backend. Unfortunately, such an approach seems doomed to hit a wall, as NUMA efficiency probably requires a different architecture, more like a multi-GPU backend with tensor parallelism where each NUMA domain is treated like a GPU with respect to minimizing inter-GPU communication and maximizing parallelism. This is the vLLM approach for NUMA, if I'm not mistaken.
>
> When I say "full access", I mean IPMI access while I'm not using it. But I have to figure things out first. Epycs would be 7R32 (same as AWS c5a instances).
>
> 👤 **saood06** replied the **2025-02-11** at **23:58:26**:
> So, regarding the current state of llama.cpp/ik_llama.cpp NUMA performance, I don't think it's that bad. I've seen reports from a few users on more modern NUMA machines than mine comparing multiple instances of llama.cpp, each isolated to its own NUMA domain, against one larger instance spanning all NUMA domains, and although there was some gain to be had, it wasn't a dramatic difference. My older NUMA machine also gets decent performance for its bandwidth.
>
> I'm looking into expert parallelism for the Deepseek V3/R1 MoE model, which should benefit NUMA systems. The plan is to port over the PR that allows you to specify which tensor is loaded onto which backend, and to change the tensor representation of this model so that the experts are not consolidated. At that point I'd test performance with each NUMA node on a separate RPC backend (a rough sketch of what that could look like is below), since changing ik_llama.cpp to create a backend for each NUMA domain might require a lot more work, but I'd look into that once I get there.
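>
> A minimal sketch of what a per-node RPC setup might look like, assuming the standard llama.cpp RPC example (`rpc-server` built with `GGML_RPC=ON`); host/port values and the model path are placeholders, and flag names should be checked against `rpc-server --help`:
>
> ```bash
> # Hypothetical: one RPC worker per NUMA node, each pinned to its own cores/memory
> numactl --cpunodebind=0 --membind=0 ./build/bin/rpc-server --host 127.0.0.1 --port 50052 &
> numactl --cpunodebind=1 --membind=1 ./build/bin/rpc-server --host 127.0.0.1 --port 50053 &
>
> # The main instance then offloads work to the per-node workers
> ./build/bin/llama-cli -m ./model.gguf --rpc 127.0.0.1:50052,127.0.0.1:50053 -p "Hello"
> ```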
---
👤 **saood06** replied the **2025-03-13** at **05:53:54**:
There is actually a good discussion on mainline: https://github.com/ggml-org/llama.cpp/discussions/12088
They did test ik_llama.cpp (but only with a single NUMA node on a single CPU at Q8_0), where it still outperformed mainline for CPU-only inference.
Also, you can look at zts9989's comment [here](https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2716225570), where he talks about NUMA and what llama.cpp could improve on, after he found that "approximately 50% of CPU usage is spent on thread synchronization" when running Deepseek R1 with multiple NUMA nodes.
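
For anyone wanting to reproduce that kind of measurement, a rough way to check synchronization overhead with standard Linux `perf` (generic tooling, not something specific to either repo; the model path and thread count are placeholders) would be:

```bash
# Profile a short benchmark run, then look at how much time lands in
# barrier / spin-wait symbols (e.g. ggml_barrier or futex-related paths).
perf record -g -- ./build/bin/llama-bench -m ./model.gguf -t 128 -p 512 -n 128
perf report --sort symbol
```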
> 👤 **ikawrakow** replied the **2025-03-13** at **07:27:34**:
> > They did test ik_llama.cpp (but only with a single NUMA node on a single CPU at Q8_0), where it still outperformed mainline for CPU-only inference.
>
> Where can I find the test results?
>
> 👤 **saood06** replied the **2025-03-13** at **07:44:42**:
> In the linked post, the second table under "6980P Benchmarks" has it, but I'm pasting it here for reference:
>
> | Quantization | Tokens/Second | NUMA Configuration |
> | -- | -- | -- |
> | Q8_0 | 6.6 | 1x NUMA Node on 1x CPU ik_llama |
> | Q8_0 | 6.2 | 1x NUMA Node on 1x CPU |
>
> This is the only published result for ik_llama but they do state "Keep an eye on [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) fork which has interesting optimizations." so they may run more.
>
> 👤 **saood06** replied the **2025-03-13** at **08:45:24**:
> I forgot he had much more detailed results under "Methodology and Notes"; there is a section for ik_llama.cpp showing the command and the bench numbers. Interestingly, ik_llama.cpp performance peaked at 128 threads for both PP and TG, compared to mainline peaking at 86 threads for TG and 128 threads for PP. He also shares PP numbers, where ik_llama again shows better performance than mainline. He does explicitly state a TODO for testing ik_llama.cpp with 2x CPU at Q8_0.
>
> Again pasting the segment of his post featuring ik_llama.cpp for reference:
>
>
> ```
> numactl -N 0 -m 0 \
>     ./build/bin/llama-bench \
>     --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
>     --cache-type-k f16 \
>     --cache-type-v f16 \
>     --numa numactl \
>     --threads 64,43,64,86,128,172
> ```
>
> Results:
>
> | model | size | params | backend | threads | test | t/s |
> | -- | -- | -- | -- | -- | -- | -- |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | pp512 | 56.86 ± 7.21 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | tg128 | 4.86 ± 0.01 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 43 | pp512 | 40.62 ± 0.02 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 43 | tg128 | 3.69 ± 0.00 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | pp512 | 57.67 ± 4.62 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | tg128 | 4.89 ± 0.00 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 86 | pp512 | 62.21 ± 13.63 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 86 | tg128 | 5.69 ± 0.00 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 128 | pp512 | 78.89 ± 21.46 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 128 | tg128 | 6.60 ± 0.00 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 172 | pp512 | 70.63 ± 0.58 |
> | deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 172 | tg128 | 5.05 ± 0.00 |
---
👤 **ikawrakow** replied the **2025-03-13** at **11:55:55**: