ik_llama.cpp/ggml/src/ggml-cuda/template-instances/mmvq-instance-q5_0.cu at f76e98536f9377b29e2058cf83e0aa0450b631c8 - ik_llama.cpp - Public git mirror

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-01 03:41:53 +00:00

Files

Kawrakow f76e98536f CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864 )

* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add disagnostics

* Disable assert

* Fix TG fused up*nary(gate) when down cannot be fused

The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-10-26 17:08:50 +02:00

7 lines

170 B

Plaintext

Raw Blame History

 #include "../mmvq-templates.cuh"
 void mul_mat_vec_q5_0_q8_1_cuda(const mmvq_args & args, cudaStream_t stream) {
     mul_mat_vec_q_cuda<GGML_TYPE_Q5_0>(args, stream);
 }