🔀 #404 - TG improvements for MoE models

Author ikawrakow
State Closed
Created 2025-05-10
Updated 2025-05-10

Description

This PR does 3 things:

  • Removes an unnecessary device-to-host copy of the selected expert IDs on CUDA. This gives a few percent improvement in CUDA TG speed for MoE models.
  • Fixes bugs related to Smart Experts Reduction (SER, see #239). The GGML_OP_GET_ROWS op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs (see the first sketch below).
  • Further improves CUDA TG performance with SER enabled. Here the ggml_cuda_op_mul_mat_vec_q_id function did not consider that an expert may be disabled, and needlessly computed the matrix-vector multiplication for disabled experts (see the second sketch below).
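
For illustration, here is a minimal sketch of the GGML_OP_GET_ROWS fix. This is not the actual ggml code: the kernel name, the row layout, and the use of a negative ID to mark a slot disabled by SER are assumptions made for the example.

```cuda
// Minimal sketch (not the ggml implementation): gather float rows according
// to the selected expert IDs. A negative ID marks a slot disabled by SER;
// its destination row is zero-filled so the combine step never reads garbage.
__global__ void get_rows_f32_id_sketch(const float * src, float * dst,
                                       const int * ids, int row_size) {
    const int slot = blockIdx.x;   // one block per selected-expert slot
    const int id   = ids[slot];
    for (int j = threadIdx.x; j < row_size; j += blockDim.x) {
        dst[(size_t)slot * row_size + j] = id < 0 ? 0.0f
                                                  : src[(size_t)id * row_size + j];
    }
}
```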

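A similarly hedged sketch of the matrix-vector guard: disabled slots are zero-filled cheaply instead of paying for a full matrix-vector product. The kernel name only loosely mirrors ggml_cuda_op_mul_mat_vec_q_id; the data layout and the negative-ID convention are again assumptions.

```cuda
// Minimal sketch: one y-block per expert slot, x-blocks cover output rows.
// Disabled experts get a cheap zero-fill instead of a full product.
__global__ void mul_mat_vec_id_sketch(const float * experts, const float * x,
                                      float * y, const int * ids,
                                      int n_cols, int n_rows) {
    const int slot   = blockIdx.y;
    const int expert = ids[slot];
    const int row    = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) {
        return;
    }
    if (expert < 0) {
        y[(size_t)slot * n_rows + row] = 0.0f;  // disabled by SER: skip the product
        return;
    }
    const float * W = experts + (size_t)expert * n_rows * n_cols;
    float sum = 0.0f;
    for (int col = 0; col < n_cols; ++col) {
        sum += W[(size_t)row * n_cols + col] * x[(size_t)slot * n_cols + col];
    }
    y[(size_t)slot * n_rows + row] = sum;
}
```
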
Prompt processing is not affected by these changes.

Here is a graph obtained with sweep-bench showing TG performance as a function of the number of tokens in the KV cache (N_KV). The model is DeepSeek-Lite quantized to Q4_0; the GPU is an RTX-4080. Black symbols are without SER, red symbols are with -ser 4,1. The command line is

./bin/llama-sweep-bench -m $model -t 1 -ngl 100 -fmoe -mla 3 -fa -b 4096 -ub 4096 [-ser 4,1]

[Graph: TG speed vs. N_KV for DeepSeek-Lite Q4_0 on RTX-4080, without SER (black) and with -ser 4,1 (red)]