CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864)

* Args for MMVQ functions
* WIP
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

  We see nearly 2% TG speedup for Ling-mini-2.0 and about 1% for DeepSeek-Lite.
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)
* Fusing also for iqk/trellis/repacked quants
* Fusing mmvq also in non-MoE up+gate
* Fuse mul_mat_id and add_id into a single kernel for mmvq
* Also iqk quants
* Split mmvq.cu and iqk_mmvq.cu into separate template instances
* Put iqk mmvq implementations into template instances
* Somehow I forgot to change the ggml_type in the legacy template calls
* Add diagnostics
* Disable assert
* Fix TG fused up*unary(gate) when down cannot be fused

  The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-05-01 03:41:53 +00:00 · 2025-10-26 17:08:50 +02:00
parent 41d6c42b96
commit e34399c116
55 changed files with 2611 additions and 2334 deletions
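
For readers unfamiliar with the fusion named in the commit title, here is a minimal host-side sketch of computing ffn_up*unary_op(ffn_gate) in a single pass. It is not the code from this commit: the real kernels run on the GPU over quantized weight blocks, whereas this sketch uses plain float weights, assumes SiLU as the unary op, and uses hypothetical function names (`ffn_unfused`, `ffn_fused`).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed unary op for illustration; the commit supports a generic unary_op.
static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// Unfused reference: two separate matrix-vector products write intermediate
// vectors, then a third pass applies the unary op and multiplies.
// `up` and `gate` are row-major (rows x cols) float matrices (hypothetical layout).
std::vector<float> ffn_unfused(const std::vector<float>& up,
                               const std::vector<float>& gate,
                               const std::vector<float>& x,
                               std::size_t rows) {
    const std::size_t cols = x.size();
    std::vector<float> u(rows, 0.0f), g(rows, 0.0f), out(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            u[r] += up[r * cols + c] * x[c];
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            g[r] += gate[r * cols + c] * x[c];
    for (std::size_t r = 0; r < rows; ++r)
        out[r] = u[r] * silu(g[r]);
    return out;
}

// Fused version: both dot products for a row are accumulated in one loop over
// x, and the unary op and multiply are applied before the single store, so no
// intermediate up/gate vectors are materialized.
std::vector<float> ffn_fused(const std::vector<float>& up,
                             const std::vector<float>& gate,
                             const std::vector<float>& x,
                             std::size_t rows) {
    const std::size_t cols = x.size();
    std::vector<float> out(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float u = 0.0f, g = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            u += up[r * cols + c] * x[c];
            g += gate[r * cols + c] * x[c];
        }
        out[r] = u * silu(g);
    }
    return out;
}
```

Both functions should produce the same values up to floating-point rounding. The fused form reads the activation vector once per row and skips the intermediate writes, which is one plausible source of the token-generation speedup reported in the commit message.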
--- a/ggml/src/CMakeLists.txt
+++ b/ggml/src/CMakeLists.txt
@@ -357,6 +357,8 @@ if (GGML_CUDA)
        list(APPEND GGML_SOURCES_CUDA ${SRCS})
        file(GLOB   SRCS "ggml-cuda/template-instances/mmq*.cu")
        list(APPEND GGML_SOURCES_CUDA ${SRCS})
+        file(GLOB   SRCS "ggml-cuda/template-instances/mmvq-instance*.cu")
+        list(APPEND GGML_SOURCES_CUDA ${SRCS})

        if (GGML_CUDA_FA_ALL_QUANTS)
            file(GLOB   SRCS "ggml-cuda/template-instances/fattn-vec*.cu")