Files
ik_llama.cpp/ggml/src/ggml-cuda/template-instances/mmvq-instance-q5_0.cu
Kawrakow f76e98536f CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864)
* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add disagnostics

* Disable assert

* Fix TG fused up*nary(gate) when down cannot be fused

The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-26 17:08:50 +02:00

7 lines
170 B
Plaintext

#include "../mmvq-templates.cuh"
void mul_mat_vec_q5_0_q8_1_cuda(const mmvq_args & args, cudaStream_t stream) {
mul_mat_vec_q_cuda<GGML_TYPE_Q5_0>(args, stream);
}