🔀 #404 - TG improvements for MoE models
| | |
|---|---|
| Author | ikawrakow |
| State | ❌ Closed |
| Created | 2025-05-10 |
| Updated | 2025-05-10 |
Description
This PR does three things:
- Removes an unnecessary device-to-host copy of the selected expert IDs on CUDA. This results in a few percent improvement in CUDA TG speed for MoE models.
- Fixes bugs related to Smart Experts Reduction (SER, see #239). The issue was that the `GGML_OP_GET_ROWS` op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs.
- Further improves CUDA TG performance with SER enabled. Here the `ggml_cuda_op_mul_mat_vec_q_id` function did not consider that an expert may be disabled, and needlessly calculated the matrix-vector multiplication for disabled experts (a sketch of this skip logic follows the list).
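To make the SER fixes concrete, here is a minimal CUDA sketch of the disabled-expert handling described above. It is an illustration, not the actual ggml-cuda code: it assumes disabled experts are marked with a negative ID in the selection tensor (a hypothetical convention here), reads the IDs directly on the device (no device-to-host copy, as in the first point), and zero-fills the rows of disabled experts instead of gathering garbage weights.

```cpp
// Illustrative sketch only; not the actual ggml-cuda implementation.
// Assumption: a disabled expert is marked with a negative ID.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Gather one row per selected expert ID. The IDs stay on the device
// (no device->host copy); a disabled expert (id < 0) yields a zero row
// instead of reading garbage memory.
__global__ void get_rows_f32(const float * src, const int * ids, float * dst, int ncols) {
    const int row = blockIdx.x;
    const int id  = ids[row];
    float * out = dst + (size_t)row * ncols;
    if (id < 0) {                                    // disabled expert -> zero fill
        for (int c = threadIdx.x; c < ncols; c += blockDim.x) out[c] = 0.0f;
        return;                                      // skip the gather entirely
    }
    const float * in = src + (size_t)id * ncols;
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) out[c] = in[c];
}

int main() {
    const int nexperts = 4, ncols = 8, nsel = 3;
    std::vector<float> h_src(nexperts * ncols);
    for (int i = 0; i < nexperts * ncols; ++i) h_src[i] = (float)i;
    const int h_ids[nsel] = {2, -1, 0};              // -1: disabled expert

    float *d_src, *d_dst; int *d_ids;
    cudaMalloc(&d_src, h_src.size() * sizeof(float));
    cudaMalloc(&d_dst, nsel * ncols * sizeof(float));
    cudaMalloc(&d_ids, nsel * sizeof(int));
    cudaMemcpy(d_src, h_src.data(), h_src.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ids, h_ids, nsel * sizeof(int), cudaMemcpyHostToDevice);

    get_rows_f32<<<nsel, 32>>>(d_src, d_ids, d_dst, ncols);

    std::vector<float> h_dst(nsel * ncols);
    cudaMemcpy(h_dst.data(), d_dst, h_dst.size() * sizeof(float), cudaMemcpyDeviceToHost);
    for (int r = 0; r < nsel; ++r) printf("row %d starts with %g\n", r, h_dst[r * ncols]);

    cudaFree(d_src); cudaFree(d_dst); cudaFree(d_ids);
    return 0;
}
```

The same early-out applies to the quantized matrix-vector kernel: returning immediately for a disabled expert saves the entire multiplication, not just the final combine step.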
Prompt processing is not affected by these changes.
Here is a graph obtained with sweep-bench showing TG performance as a function of the number of tokens in the KV cache, `N_KV`. The model is DeepSeek-Lite quantized to `Q4_0`. The GPU is an RTX-4080. Black symbols are without SER, red symbols are with `-ser 4,1`. The command line is
```
./bin/llama-sweep-bench -m $model -t 1 -ngl 100 -fmoe -mla 3 -fa -b 4096 -ub 4096 [-ser 4,1]
```