🐛 #390 - Fix build for Xeon Gold 6226R
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-05-07 |
| Updated | 2025-05-08 |
Description
I got access to a Xeon Gold 6226R system. The PR fixes the compilation errors that arise because this CPU supports all the AVX512 extensions necessary to define HAVE_FANCY_SIMD, but does not support SIMD popcnt.
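For illustration, here is a minimal sketch of the kind of guard this situation calls for (the helper name is hypothetical, not the repository's actual code): Cascade Lake class CPUs such as the Xeon Gold 6226R have AVX512 F/BW/DQ/VL/VNNI but lack AVX512-VPOPCNTDQ, so a vector popcount has to fall back to scalar popcnt per lane.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: per-lane 64-bit popcount for a 512-bit vector.
static inline __m512i popcnt_epi64_compat(__m512i x) {
#if defined(__AVX512VPOPCNTDQ__)
    return _mm512_popcnt_epi64(x);                // native SIMD popcount
#else
    // Fallback for CPUs (e.g. Cascade Lake) that have the "fancy" AVX512
    // extensions but lack VPOPCNTDQ: spill to memory, popcount each lane.
    alignas(64) uint64_t v[8];
    _mm512_store_si512(v, x);
    for (int i = 0; i < 8; ++i) v[i] = (uint64_t)__builtin_popcountll(v[i]);
    return _mm512_load_si512(v);
#endif
}
```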
After fixing the build, I did a quick test with Gemma3-27B-It. It is a dual-socket system, but even without numactl and without dropping caches I get quite respectable results:
./bin/llama-sweep-bench -m /LLM/google_gemma-3-27b-it-Q8_0.gguf -c 4096 -t 32 -fa -rtr
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 9.622 | 53.21 | 47.765 | 2.68 |
| 512 | 128 | 512 | 8.314 | 61.58 | 24.781 | 5.17 |
| 512 | 128 | 1024 | 8.389 | 61.03 | 25.398 | 5.04 |
| 512 | 128 | 1536 | 9.834 | 52.06 | 26.493 | 4.83 |
| 512 | 128 | 2048 | 9.072 | 56.44 | 27.823 | 4.60 |
| 512 | 128 | 2560 | 8.931 | 57.33 | 26.041 | 4.92 |
| 512 | 128 | 3072 | 9.195 | 55.68 | 25.953 | 4.93 |
| 512 | 128 | 3584 | 9.360 | 54.70 | 26.807 | 4.77 |
I guess the lower performance for the first entry in the table is due to the system not having properly warmed up yet.
💬 Conversation
👤 Ph0rk0z commented the 2025-05-07 at 13:39:54:
I have the generation prior to this chip. If you set the BIOS to 1 NUMA node per CPU, the best results are from --numa distribute. Messing with numactl and interleave gives worse results across the board, regardless of what the warning says when you run.
👤 ikawrakow commented the 2025-05-07 at 13:45:05:
> I have the generation prior to this chip. If you set the BIOS to 1 NUMA node per CPU, the best results are from --numa distribute. Messing with numactl and interleave gives worse results across the board, regardless of what the warning says when you run.
Thanks for the tip!
Unfortunately I don't have physical access to the box (it belongs to somebody else in a different country), and no sudo privileges (so I can't drop caches, play with huge pages, install missing software, etc.).
👤 Ph0rk0z commented the 2025-05-07 at 14:43:33:
Run with --numa distribute and see if your benchie goes up. I might buy an 8260 ES since they're cheap. Does the extra AVX512-VNNI really help much?
👤 ikawrakow commented the 2025-05-07 at 15:15:42:
> Does the extra AVX512-VNNI really help much?
It does not for TG as we are memory bound, and most x86_64 CPUs will saturate memory bandwidth with fewer threads than available.
But it does make a difference for prompt processing. I get about the same PP speed on a 16-core Ryzen-7950X (Zen4 core with AVX512F and quite a few AVX512 extensions) as on a 32-core Ryzen-5975WX (Zen3 core, so vanilla AVX2), despite the fact that the Zen4 core executes 512-bit instructions as two separate 256-bit instructions and the Zen3 is a "Pro" variant. Having 32 instead of 16 vector registers alone helps quite a bit. The _mm512_dpbusds_epi32 instruction that one gets with AVX512_VNNI is a huge help for quants that use the full int8 range (Q8_0, IQ4_XS/NL, plus several of the IQK quants from this repository). AVX2 is a real pain for those (I sometimes like to think that the _mm256_maddubs_epi16 instruction one has available for int8 dot products was designed after a 7-day brainstorming marathon whose goal was to come up with the most unhelpful instruction possible).
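As a rough illustration (not the repository's actual kernels), here is the same unsigned-int8 × signed-int8 dot-product step written with AVX512-VNNI and with plain AVX2; the AVX2 path needs an extra instruction and goes through a saturating 16-bit intermediate, which is exactly what makes full-range int8 quants painful there:

```cpp
#include <immintrin.h>

#if defined(__AVX512VNNI__) && defined(__AVX512VL__)
// One instruction: multiplies u8*i8, sums groups of four products,
// and accumulates into 32-bit lanes (with signed saturation).
static inline __m512i dot_step(__m512i acc, __m512i u8, __m512i i8) {
    return _mm512_dpbusds_epi32(acc, u8, i8);
}
#else
// AVX2: maddubs produces saturating 16-bit pair sums, which must then be
// widened with a second multiply-add before accumulating in 32 bits.
static inline __m256i dot_step(__m256i acc, __m256i u8, __m256i i8) {
    __m256i p16 = _mm256_maddubs_epi16(u8, i8);                   // u8*i8 -> 16-bit (saturating!)
    __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));   // widen to 32-bit partial sums
    return _mm256_add_epi32(acc, p32);
}
#endif
```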
👤 Ph0rk0z commented the 2025-05-07 at 15:55:30:
Thanks. I already have AVX-512, but I guess my prompt processing will see a slight boost, and of course I can upgrade my memory. With 6-channel 2400 MT/s memory I only get 180 GB/s, which is a 30% haircut per proc from theoretical.
👤 ikawrakow commented the 2025-05-07 at 16:12:52:
> Thanks. I already have AVX-512, but
I haven't done a fine-grained implementation depending on which AVX512 extensions are available. The CPU must support AVX512_VNNI, AVX512VL, AVX512BW and AVX512DQ to enable the faster matrix multiplication implementation. As your CPU does not have AVX512_VNNI, matrix multiplications will be done using the vanilla AVX2 implementation. You only benefit from AVX512 in the flash attention implementation (but the K*Q multiplication, which is about half of the total FA computation cost, still uses AVX2).
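A minimal sketch of that gate (the actual preprocessor logic in the repository may differ):

```cpp
// Enable the fast matrix multiplication path only when all
// required AVX512 extensions are available at compile time.
#if defined(__AVX512F__) && defined(__AVX512VNNI__) && defined(__AVX512VL__) && \
    defined(__AVX512BW__) && defined(__AVX512DQ__)
#define HAVE_FANCY_SIMD
#endif
```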
👤 gereoffy commented the 2025-05-07 at 16:41:26:
> > I have the generation prior to this chip. If you set the BIOS to 1 NUMA node per CPU, the best results are from --numa distribute. Messing with numactl and interleave gives worse results across the board, regardless of what the warning says when you run.
>
> Thanks for the tip!
> Unfortunately I don't have physical access to the box (it belongs to somebody else in a different country), and no sudo privileges (so I can't drop caches, play with huge pages, install missing software, etc.).
Hi! That box is mine; I can give you DRAC access, so it's almost like physical access except that you cannot kick the box :) Anyway, thanks for fixing the compile!
👤 gereoffy commented the 2025-05-07 at 17:16:49:
> Oh, hi, nice to meet you virtually! And thanks for letting me use your box, it has been very helpful. Hope I didn't annoy you too much by running a lot of benchmarks.

No problem at all! This is a test/dev system...

> DRAC will give me access to the BIOS?

Yes, full console (remote monitor/keyboard/USB access in the browser)... but I forgot that this box cannot boot from the NVMe SSD, so it's a bit tricky to start it using an SD card (or virtual USB) and custom grub options :(

> But I'm not sure what I want to do with it, as none of the nodes has enough RAM to fit the DeepSeek models, so I need to use both CPUs.

Yep, and the network card and the NVMe card are also wired to different CPUs, I think...

Is it possible to run the model somehow split, with each part of the model running on the CPU wired to the memory containing its weights? Like a cluster?
👤 Ph0rk0z commented the 2025-05-07 at 18:37:04:
Pass --numa distribute; it splits the memory between both CPUs evenly. I think all the NUMA stuff here and in mainline is the same. You can also put it on one node only, i.e. the one you launch from.
When I did my tests I didn't have llama-sweep-bench, so maybe it's worth trying again? I simply used both gemma and llama 3 70b and checked generation speed.
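For example, the sweep-bench command from the description could be rerun with NUMA distribution enabled (same binary, model and flags as above, with --numa distribute added):
./bin/llama-sweep-bench -m /LLM/google_gemma-3-27b-it-Q8_0.gguf -c 4096 -t 32 -fa -rtr --numa distribute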
👤 Gaolingx commented the 2025-05-07 at 18:41:35:
Thank you for fixing it. When I run llama-server with the -fa and -rtr parameters, the speed is a little faster than with only -fa; both prefill and decode are improved. That is a good beginning!
-c 8192 -t 16 -fa:
INFO [ print_timings] prompt eval time = 6958.30 ms / 36 tokens ( 193.29 ms per token, 5.17 tokens per second) | tid="52596" timestamp=1746491529 id_slot=0 id_task=31856 t_prompt_processing=6958.3 n_prompt_tokens_processed=36 t_token=193.28611111111113 n_tokens_second=5.173677478694509
INFO [ print_timings] generation eval time = 617799.88 ms / 1700 runs ( 363.41 ms per token, 2.75 tokens per second) | tid="52596" timestamp=1746491529 id_slot=0 id_task=31856 t_token_generation=617799.884 n_decoded=1700 t_token=363.4116964705882 n_tokens_second=2.7517000958193774
-c 8192 -t 16 -fa -rtr:
INFO [ print_timings] prompt eval time = 11499.35 ms / 148 tokens ( 77.70 ms per token, 12.87 tokens per second) | tid="66164" timestamp=1746643229 id_slot=0 id_task=859 t_prompt_processing=11499.349 n_prompt_tokens_processed=148 t_token=77.69830405405405 n_tokens_second=12.8702937879353
INFO [ print_timings] generation eval time = 755894.69 ms / 2074 runs ( 364.46 ms per token, 2.74 tokens per second) | tid="66164" timestamp=1746643229 id_slot=0 id_task=859 t_token_generation=755894.69 n_decoded=2074 t_token=364.4622420443587 n_tokens_second=2.7437684474275117