Faster MoE token generation on CUDA (#248)

* This gives us a ~20% token generation (TG) speedup for DeepSeek on CUDA

* Slightly better

* Also do it for plain (not fused) mul_mat_id (see the sketch before the diff below)

* Guard against numerical precision issues for MLA on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Authored by Kawrakow on 2025-03-10 16:16:51 +02:00, committed by GitHub
parent 46b526c2c4, commit fcd1e124e0
6 changed files with 488 additions and 209 deletions
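
For context on the "plain (not fused) mul_mat_id" bullet, here is a minimal sketch of the non-fused MoE expert multiplication that the CUDA speedup also applies to. It is an illustration only, assuming the ggml API as used by this repo; the tensor names and sizes are made up. ggml_mul_mat_id() multiplies each token's activations by the expert weight matrices selected by the router ids.

#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // illustrative sizes, not taken from the commit
    const int n_embd = 16, n_ff = 32, n_expert = 4, n_expert_used = 2, n_tokens = 1;

    // one [n_embd, n_ff] weight matrix per expert, stacked along dim 2
    struct ggml_tensor * up_exps = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, n_ff, n_expert);
    // token activations, broadcast over the experts selected for each token
    struct ggml_tensor * cur     = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, 1, n_tokens);
    // router output: indices of the experts each token routes to
    struct ggml_tensor * ids     = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_expert_used, n_tokens);

    // the non-fused MoE matmul; result is [n_ff, n_expert_used, n_tokens]
    struct ggml_tensor * out = ggml_mul_mat_id(ctx, up_exps, cur, ids);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    // fill up_exps/cur/ids with real data, then:
    // ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    ggml_free(ctx);
    return 0;
}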


@@ -13734,6 +13734,9 @@ struct llm_build_context {
         }
         ggml_tensor * kq = ggml_mul_mat(ctx0, kv_cache, q);
+        if (kv_cache->ne[1] < 256) {
+            ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
+        }
         cb(kq, "kq", il);
         if (!pp_opt) {
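
For the MLA guard added in the hunk above, a stand-alone sketch of the same pattern (an illustration, not the commit's code; the function name and comments here are made up): ggml_mul_mat_set_prec() marks the K*Q node so backends that support it accumulate in F32, which this commit requests when the KV cache holds fewer than 256 tokens.

#include "ggml.h"

// Build K*Q and request F32 accumulation for short KV caches, where the
// default precision caused numerical issues for MLA on CUDA.
static struct ggml_tensor * build_kq(struct ggml_context * ctx0,
                                     struct ggml_tensor * kv_cache,
                                     struct ggml_tensor * q) {
    struct ggml_tensor * kq = ggml_mul_mat(ctx0, kv_cache, q);
    if (kv_cache->ne[1] < 256) {  // ne[1] = number of cached tokens
        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
    }
    return kq;
}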