mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-31 11:39:52 +00:00
* This gives us ~20% TG speedup for DeepSeek on CUDA
* Slightly better
* Also do it for plain (not fused) mul_mat_id
* Guard against numerical precision issues for MLA on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>