mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-01 03:41:53 +00:00
* Args for MMVQ functions
* WIP
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

  We see nearly 2% TG speedup for Ling-mini-2.0 and about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)
* Fusing also for iqk/trellis/repacked quants
* Fusing mmvq also in non-MoE up+gate
* Fuse mul_mat_id and add_id into a single kernel for mmvq
* Also iqk quants
* Split mmvq.cu and iqk_mmvq.cu into separate template instances
* Put iqk mmvq implementations into template instances
* Somehow I forgot to change the ggml_type in the legacy template calls
* Add diagnostics
* Disable assert
* Fix TG fused up*unary(gate) when down cannot be fused

  The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
7 lines
170 B
Plaintext
#include "../mmvq-templates.cuh"

void mul_mat_vec_q5_0_q8_1_cuda(const mmvq_args & args, cudaStream_t stream) {
    mul_mat_vec_q_cuda<GGML_TYPE_Q5_0>(args, stream);
}