
🔀 #589 - CUDA: small PP performance improvement for MoE models

Author ikawrakow
State Closed
Created 2025-07-06
Updated 2025-07-07

Description

This PR brings a small (2-3%) prompt processing performance improvement on CUDA for quantized MoE models (when -fmoe is used).

Instead of first copying the activations to contiguous memory and then quantizing, quantization is done directly using the row mapping IDs, thus saving the associated kernel launch overhead.
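
To make the change concrete, here is a minimal, self-contained sketch of the idea. It is not the actual ik_llama.cpp kernel: the name `quantize_mapped_rows_q8`, the plain per-row int8 scheme, and the host-side setup are illustrative only. The point it demonstrates is that each block reads its source row through the row-mapping table and quantizes straight into the destination buffer, so no separate gather-to-contiguous copy kernel has to be launched first.

```cpp
// Hypothetical sketch (not the ik_llama.cpp kernel): quantize routed MoE
// activation rows directly through a row-mapping table, avoiding a separate
// "copy selected rows to contiguous memory" kernel launch.
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>
#include <vector>

// One thread block per mapped row; row_map[b] gives the source row to read.
__global__ void quantize_mapped_rows_q8(const float * src, int n_cols,
                                        const int * row_map,
                                        signed char * dst_q, float * dst_scale)
{
    const int dst_row = blockIdx.x;
    const int src_row = row_map[dst_row];
    const float * x = src + (size_t)src_row * n_cols;

    // Per-row absolute maximum via a simple shared-memory reduction.
    __shared__ float smax[256];
    float amax = 0.0f;
    for (int i = threadIdx.x; i < n_cols; i += blockDim.x) {
        amax = fmaxf(amax, fabsf(x[i]));
    }
    smax[threadIdx.x] = amax;
    __syncthreads();
    for (int s = blockDim.x/2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    const float scale  = smax[0] / 127.0f;
    const float iscale = scale > 0.0f ? 1.0f/scale : 0.0f;
    if (threadIdx.x == 0) dst_scale[dst_row] = scale;

    // Quantize the gathered row directly into its destination slot.
    signed char * q = dst_q + (size_t)dst_row * n_cols;
    for (int i = threadIdx.x; i < n_cols; i += blockDim.x) {
        q[i] = (signed char)lrintf(x[i] * iscale);
    }
}

int main() {
    const int n_cols = 256, n_src_rows = 8, n_mapped = 3;
    const int h_map[n_mapped] = {5, 1, 6};   // rows routed to one expert

    std::vector<float> h_src((size_t)n_src_rows * n_cols);
    for (size_t i = 0; i < h_src.size(); ++i) h_src[i] = sinf(0.01f * (float)i);

    float *d_src; int *d_map; signed char *d_q; float *d_scale;
    cudaMalloc(&d_src,   h_src.size()*sizeof(float));
    cudaMalloc(&d_map,   n_mapped*sizeof(int));
    cudaMalloc(&d_q,     (size_t)n_mapped*n_cols);
    cudaMalloc(&d_scale, n_mapped*sizeof(float));
    cudaMemcpy(d_src, h_src.data(), h_src.size()*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_map, h_map, n_mapped*sizeof(int), cudaMemcpyHostToDevice);

    // Single launch: gather-through-indices and quantization fused together.
    quantize_mapped_rows_q8<<<n_mapped, 256>>>(d_src, n_cols, d_map, d_q, d_scale);
    cudaDeviceSynchronize();

    float h_scale[n_mapped];
    cudaMemcpy(h_scale, d_scale, sizeof(h_scale), cudaMemcpyDeviceToHost);
    for (int r = 0; r < n_mapped; ++r) printf("src row %d -> scale %g\n", h_map[r], h_scale[r]);

    cudaFree(d_src); cudaFree(d_map); cudaFree(d_q); cudaFree(d_scale);
    return 0;
}
```

In the real code the destination is one of the existing quantized block formats rather than plain int8 with a per-row scale; the relevant part is only that the row indirection happens inside the quantization kernel itself, removing one kernel launch per matrix multiplication.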

Here is a performance comparison for Q4_0-quantized DeepSeek-Lite on an RTX-4080 using `-mla 3 -fa -fmoe -b 4096 -ub 4096`:

Main branch

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|-----:|---------:|-----------:|---------:|-----------:|
| 4096 | 1024 | 0 | 0.480 | 8532.52 | 5.640 | 181.55 |
| 4096 | 1024 | 4096 | 0.566 | 7240.62 | 5.904 | 173.43 |
| 4096 | 1024 | 8192 | 0.674 | 6073.99 | 6.143 | 166.68 |
| 4096 | 1024 | 12288 | 0.789 | 5189.61 | 6.421 | 159.47 |

PR

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|-----:|---------:|-----------:|---------:|-----------:|
| 4096 | 1024 | 0 | 0.469 | 8738.41 | 5.638 | 181.61 |
| 4096 | 1024 | 4096 | 0.554 | 7388.85 | 5.909 | 173.29 |
| 4096 | 1024 | 8192 | 0.670 | 6117.30 | 6.148 | 166.57 |
| 4096 | 1024 | 12288 | 0.779 | 5256.86 | 6.435 | 159.14 |