🔀 #404 - TG improvements for MoE models

Author ikawrakow
State Closed
Created 2025-05-10
Updated 2025-05-10

Description

This PR does 3 things:

  • Removes an unnecessary device-to-host copy of the selected expert IDs on CUDA. This gives a few percent improvement in CUDA TG speed for MoE models.
  • Fixes bugs related to Smart Experts Reduction (SER, see #239). The GGML_OP_GET_ROWS op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs (see the first sketch below).
  • Further improves CUDA TG performance with SER enabled. Here the ggml_cuda_op_mul_mat_vec_q_id function did not consider that an expert may be disabled, and needlessly computed the matrix-vector multiplication for disabled experts (see the second sketch below).
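
For illustration, here is a minimal sketch of the GGML_OP_GET_ROWS fix. This is not the actual ggml code: the kernel name, the row layout, and the use of a negative ID to mark a slot disabled by SER are assumptions made for the example.

```cuda
// Minimal sketch (not the ggml implementation): gather float rows according
// to the selected expert IDs. A negative ID marks a slot disabled by SER;
// its destination row is zero-filled so the combine step never reads garbage.
__global__ void get_rows_f32_id_sketch(const float * src, float * dst,
                                       const int * ids, int row_size) {
    const int slot = blockIdx.x;   // one block per selected-expert slot
    const int id   = ids[slot];
    for (int j = threadIdx.x; j < row_size; j += blockDim.x) {
        dst[(size_t)slot * row_size + j] = id < 0 ? 0.0f
                                                  : src[(size_t)id * row_size + j];
    }
}
```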

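A similarly hedged sketch of the matrix-vector guard: disabled slots are zero-filled cheaply instead of paying for a full matrix-vector product. The kernel name only loosely mirrors ggml_cuda_op_mul_mat_vec_q_id; the data layout and the negative-ID convention are again assumptions.

```cuda
// Minimal sketch: one y-block per expert slot, x-blocks cover output rows.
// Disabled experts get a cheap zero-fill instead of a full product.
__global__ void mul_mat_vec_id_sketch(const float * experts, const float * x,
                                      float * y, const int * ids,
                                      int n_cols, int n_rows) {
    const int slot   = blockIdx.y;
    const int expert = ids[slot];
    const int row    = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) {
        return;
    }
    if (expert < 0) {
        y[(size_t)slot * n_rows + row] = 0.0f;  // disabled by SER: skip the product
        return;
    }
    const float * W = experts + (size_t)expert * n_rows * n_cols;
    float sum = 0.0f;
    for (int col = 0; col < n_cols; ++col) {
        sum += W[(size_t)row * n_cols + col] * x[(size_t)slot * n_cols + col];
    }
    y[(size_t)slot * n_rows + row] = sum;
}
```
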
Prompt processing is not affected by these changes.

Here is a graph obtained with sweep-bench showing TG performance as a function of the number of tokens in the KV cache (N_KV). The model is DeepSeek-Lite quantized to Q4_0; the GPU is an RTX-4080. Black symbols are without SER, red symbols are with -ser 4,1. The command line is

./bin/llama-sweep-bench -m $model -t 1 -ngl 100 -fmoe -mla 3 -fa -b 4096 -ub 4096 [-ser 4,1]

[Graph: TG speed vs. N_KV for DeepSeek-Lite Q4_0 on RTX-4080, without SER (black) and with -ser 4,1 (red)]