ik_llama.cpp/github-data/discussions/613 - Pathological Quant_CUDA combinations -- How to know what works_.md at ca03d07bb6bcd8efa60b2b7ebab3d282a21decd3 - ik_llama.cpp

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-30 11:21:56 +00:00

Files

Thomas 0451f10a42 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

3.8 KiB

Raw Blame History

🗣️ #613 - Pathological Quant/CUDA combinations -- How to know what works?

Author	`usrlocalben`
Created	2025-07-15
Updated	2025-07-15

Description

Some quants/tensors seem to be incompatible with CUDA. My current example is a Q6_K (unsloth) quant of Kimi K2. If I leave all routed exp on CPU, I can get e.g. TG=~9tps. There's some VRAM remaining (RTX 8000, Turing, 48GB) so I can put a few e.g. up_exps on GPU. When doing this TG drops to 1tps or worse.

I've seen this phenomena before, trying to offload routed experts with some other quant types (w DeepSeek R1/V3) My understanding (I think somewhere @ubergarm explained it) is that some quants are not supported on CUDA and therefore must be converted before use per token.

PP throughput (~80tps) is not noticeably affected, presumably because of batching. (b=ub=4096)

Good outcome, ~9tps TG

-mla 2 -fa -fmoe
-b 4096 -ub 4096
-ctk f16 -c 64000
--n-gpu-layers 99
-ot exps=CPU
-op 26,0,27,0,29,0
-m /path/to/Kimi-K2-Instruct-Q6_K-00001-of-00018.gguf

if I change to

-ot "blk\.(1|2|3|4|5|6)\.ffn_up.*=CUDA0"
-ot exps=CPU

TG drops to 1tps or worse.

Assuming the idea is correct, Q6_K is a pathological quant type (at least on Turing) -- how to know this? How can I know what my options are when building GGUFs that match my offload/cpu arrangement?

edit: I shouldn't say they are not supported, but they aren't integrated into a kernel for the required op.

🗣️ Discussion

👤 ikawrakow replied the 2025-07-15 at 17:52:16:

Q6_K has been around forever and hence is a well supported quant on all platforms. So, it is not that.

Instead, you absolutely do not want to split up ffn_up and ffn_gate when using -fmoe. Try

-ot "blk\.(1|2|3)\.ffn_up_exps=CUDA0,blk\.(1|2|3)\.ffn_gate_exps=CUDA0"

instead.

If you split ffn_up and ffn_gate and there is a fused ffn_up/ffn_gate op where ffn_up is on the GPU but ffn_gate is on the CPU, whatever the back-end decides to do (run the op on the GPU or the CPU), the tensors need to be copied from the GPU to the CPU or vice versa. This totally kills TG performance.

👤 usrlocalben replied the 2025-07-15 at 18:37:07:

Q6_K has been around forever and hence is a well supported quant on all platforms.

I did find it surprising. If instead it were IQxxx I proably might not have been inspired to ask/write.

Instead, you absolutely do not want to split up ffn_up and ffn_gate when using -fmoe. Try

Makes so much sense it should have been obvious 😥

Thanks

👤 usrlocalben replied the 2025-07-15 at 18:42:37:
Furthermore, now I see why I've observed that particular offload pattern mentioned in various places.

I'll have to revisit some of my previous quant layouts and invocations. I had mixed gate/up offloads arbitrarily to optimally fill VRAM and didn't realize I was creating pathological arrangements.

👤 ikawrakow replied the 2025-07-15 at 18:07:47:

One more thing: if you have enough VRAM to use batch and u-batch of 4096, you should try removing -op 26,0,27,0,29,0 to see how this affects your PP performance. Depending on GPU vs CPU speed, this may give you a non-negligible boost in PP performance for long prompts (longer than 4k tokens).

👤 usrlocalben replied the 2025-07-15 at 18:39:48:
In the same testing prior to posting I did a fresh a/b test w & w/o this and it still improve things, maybe 1.5x (I just tossed the measurements). I did notice the recent change to the heuristics wrt. offloading but enforcing the -op policy is still an improvement for my hw combo.

3.8 KiB Raw Blame History

🗣️ #613 - Pathological Quant/CUDA combinations -- How to know what works?

Description

🗣️ Discussion

3.8 KiB

Raw Blame History