
🗣️ #350 - Maverick slow prompt with GPU

Author justinjja
Created 2025-04-27
Updated 2025-04-27

Description

Any idea what the deal is with prompt speeds on Maverick?

1× RTX 3090 and a 56-core DDR4 EPYC, Q4.5 quant, ~3500-token prompt: Prompt 6.24 T/s, Generation 31.7 T/s

Same setup but with the GPU disabled: Prompt 95 T/s, Generation 5.6 T/s

Is it possible to leave prompt processing on the CPU and still use the GPU for generation?


🗣️ Discussion

👤 saood06 replied on 2025-04-27 at 04:22:52:

Do you mind providing the exact commands used to get those numbers (and any details about the quant used)?


👤 ikawrakow replied on 2025-04-27 at 06:45:38:

Please tell us your command line parameters.

I cannot run Maverick, but here is how I run Scout on a 32-core Ryzen-5975WX with a 16 GB RTX-4080:

```
./bin/llama-sweep-bench -m $model -t 32 -ngl 100 \
    -ot "blk\.[0-9]\.ffn_up=CUDA0,blk\.[0-9]\.ffn_gate=CUDA0,exps=CPU" \
    -rtr -fa -fmoe -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048
```

where $model roughly corresponds in size to Unsloth's UD-Q2_K-XL (~40 GiB). And here is what I get in terms of performance, as measured by llama-sweep-bench:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 5.798 | 353.23 | 24.503 | 20.90 |
| 2048 | 512 | 2048 | 5.779 | 354.36 | 25.474 | 20.10 |
| 2048 | 512 | 4096 | 5.868 | 349.04 | 26.436 | 19.37 |
| 2048 | 512 | 6144 | 5.958 | 343.76 | 27.480 | 18.63 |
| 2048 | 512 | 8192 | 6.041 | 339.04 | 28.457 | 17.99 |
| 2048 | 512 | 10240 | 6.121 | 334.60 | 29.508 | 17.35 |
| 2048 | 512 | 12288 | 6.206 | 329.99 | 30.540 | 16.76 |
| 2048 | 512 | 14336 | 6.297 | 325.25 | 31.513 | 16.25 |
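
(As a sanity check on how to read the columns, going by the usual llama-sweep-bench output: S_PP is PP / T_PP and S_TG is TG / T_TG, so for the first row 2048 / 5.798 ≈ 353.2 t/s and 512 / 24.503 ≈ 20.9 t/s, matching the printed values.)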

The above command puts all attention tensors, shared experts, and the first 10 layers (blk.0 through blk.9) of the ffn_up_exps and ffn_gate_exps tensors on the GPU; all remaining experts stay on the CPU. With 16k context, this requires about 14 GiB of VRAM. You can use something similar, adapting to the 24 GiB of VRAM you have and the different size of the Maverick model.
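
For illustration, here is a minimal sketch of such an adaptation, assuming the same tensor-naming scheme; the extended layer range (0-15) and the 56-thread count are guesses to be tuned against actual VRAM use, and the flag readings in the comments reflect one understanding of ik_llama.cpp's CLI rather than authoritative documentation:

```
# Sketch only, not a tested command. "blk\.([0-9]|1[0-5])\." matches layers
# 0-15; widen or narrow the range until VRAM is full at your context size.
# -ngl 100  : offload all layers by default (then refined per-tensor by -ot)
# -ot       : regex=device overrides; leftover "exps" tensors stay on CPU
# -rtr      : run-time tensor repacking for faster CPU matmuls
# -fa/-fmoe : flash attention / fused MoE kernels
# -ctk/-ctv : quantize the KV cache to Q8_0
# -ub 2048  : large u-batch, which helps prompt-processing speed
./bin/llama-sweep-bench -m $model -t 56 -ngl 100 \
    -ot "blk\.([0-9]|1[0-5])\.ffn_up=CUDA0,blk\.([0-9]|1[0-5])\.ffn_gate=CUDA0,exps=CPU" \
    -rtr -fa -fmoe -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048
```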


👤 justinjja replied on 2025-04-27 at 16:51:53:

Nice, thank you! My command must have been bad.

Your command 5×'ed my prompt speed, and upgrading my PCIe link from Gen3 x4 to Gen4 x16 got me another 4× on top of that.

I'm running Unsloth's 4.5-bit dynamic GGUF.

On my original test I now get 128 T/s prompt and 34 T/s generation.

New command:

```
CUDA_VISIBLE_DEVICES=0 ./llama-server -m mav.gguf -t 32 --n-gpu-layers 100 \
    -ot "blk.[0-1].ffn_up=CUDA0,blk.[0-1].ffn_gate=CUDA0,exps=CPU" \
    -fa -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048 --host 0.0.0.0 --port 8000
```
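
Once the server is up, it can be queried like any llama-server instance; a minimal example, assuming ik_llama.cpp exposes the same OpenAI-compatible /v1/chat/completions route as upstream llama.cpp:

```
# Hypothetical smoke test against the server started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello, Maverick!"}],
        "max_tokens": 64
      }'
```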