
🗣️ #350 - Maverick slow prompt with GPU

Author justinjja
Created 2025-04-27
Updated 2025-04-27

Description

Any idea what the deal is with prompt speeds on Maverick?

1× RTX 3090 and a 56-core DDR4 EPYC, Q4.5 quant, ~3500-token prompt: Prompt 6.24 T/s, Generation 31.7 T/s

Same setup but with the GPU disabled: Prompt 95 T/s, Generation 5.6 T/s

Is it possible to leave prompt processing on the CPU and still use the GPU for generation?


🗣️ Discussion

👤 saood06 replied on 2025-04-27 at 04:22:52:

Do you mind providing the exact commands used to get those numbers (and any details about the quant used)?


👤 ikawrakow replied on 2025-04-27 at 06:45:38:

Please tell us your command line parameters.

I cannot run Maverick, but here is how I run Scout on a 32-core Ryzen-5975WX with a 16 GB RTX-4080:

```
./bin/llama-sweep-bench -m $model -t 32 -ngl 100 \
    -ot "blk\.[0-9]\.ffn_up=CUDA0,blk\.[0-9]\.ffn_gate=CUDA0,exps=CPU" \
    -rtr -fa -fmoe -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048
```

where $model roughly corresponds in size to Unsloth's UD-Q2_K-XL (~40 GiB). And here is what I get in terms of performance, as measured by llama-sweep-bench:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 5.798 | 353.23 | 24.503 | 20.90 |
| 2048 | 512 | 2048 | 5.779 | 354.36 | 25.474 | 20.10 |
| 2048 | 512 | 4096 | 5.868 | 349.04 | 26.436 | 19.37 |
| 2048 | 512 | 6144 | 5.958 | 343.76 | 27.480 | 18.63 |
| 2048 | 512 | 8192 | 6.041 | 339.04 | 28.457 | 17.99 |
| 2048 | 512 | 10240 | 6.121 | 334.60 | 29.508 | 17.35 |
| 2048 | 512 | 12288 | 6.206 | 329.99 | 30.540 | 16.76 |
| 2048 | 512 | 14336 | 6.297 | 325.25 | 31.513 | 16.25 |
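
(As a sanity check on how to read the columns, going by the usual llama-sweep-bench output: S_PP is PP / T_PP and S_TG is TG / T_TG, so for the first row 2048 / 5.798 ≈ 353.2 t/s and 512 / 24.503 ≈ 20.9 t/s, matching the printed values.)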

The above command puts all attention tensors, shared experts, and the first 10 layers (blk.0 through blk.9) of the ffn_up_exps and ffn_gate_exps tensors on the GPU; all remaining experts stay on the CPU. With 16k context, this requires about 14 GiB of VRAM. You can use something similar, adapting to the 24 GiB of VRAM you have and the different size of the Maverick model.
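
For illustration, here is a minimal sketch of such an adaptation, assuming the same tensor-naming scheme; the extended layer range (0-15) and the 56-thread count are guesses to be tuned against actual VRAM use, and the flag readings in the comments reflect one understanding of ik_llama.cpp's CLI rather than authoritative documentation:

```
# Sketch only, not a tested command. "blk\.([0-9]|1[0-5])\." matches layers
# 0-15; widen or narrow the range until VRAM is full at your context size.
# -ngl 100  : offload all layers by default (then refined per-tensor by -ot)
# -ot       : regex=device overrides; leftover "exps" tensors stay on CPU
# -rtr      : run-time tensor repacking for faster CPU matmuls
# -fa/-fmoe : flash attention / fused MoE kernels
# -ctk/-ctv : quantize the KV cache to Q8_0
# -ub 2048  : large u-batch, which helps prompt-processing speed
./bin/llama-sweep-bench -m $model -t 56 -ngl 100 \
    -ot "blk\.([0-9]|1[0-5])\.ffn_up=CUDA0,blk\.([0-9]|1[0-5])\.ffn_gate=CUDA0,exps=CPU" \
    -rtr -fa -fmoe -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048
```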


👤 justinjja replied on 2025-04-27 at 16:51:53:

Nice, thank you! My command must have been bad.

Your command 5×'ed my prompt speed, and upgrading my PCIe link from Gen3 x4 to Gen4 x16 got me another 4× on top of that.

I'm running Unsloth's 4.5-bit dynamic GGUF.

On my original test I now get 128 T/s prompt and 34 T/s generation.

New command:

```
CUDA_VISIBLE_DEVICES=0 ./llama-server -m mav.gguf -t 32 --n-gpu-layers 100 \
    -ot "blk.[0-1].ffn_up=CUDA0,blk.[0-1].ffn_gate=CUDA0,exps=CPU" \
    -fa -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048 --host 0.0.0.0 --port 8000
```
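
Once the server is up, it can be queried like any llama-server instance; a minimal example, assuming ik_llama.cpp exposes the same OpenAI-compatible /v1/chat/completions route as upstream llama.cpp:

```
# Hypothetical smoke test against the server started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello, Maverick!"}],
        "max_tokens": 64
      }'
```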