* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias) (a kernel sketch of the fusion idea follows this list)

  We see nearly a 2% TG speedup for Ling-mini-2.0 and about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add diagnostics

* Disable assert

* Fix TG fused up*unary(gate) when down cannot be fused

  The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
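The core observation behind the fused ffn_up*unary_op(ffn_gate) kernels is that both projections read the same activation vector, so a single kernel can accumulate both dot products per output row and apply the activation and multiplication in place, instead of launching separate matrix-vector kernels and an elementwise op. Below is a minimal, hypothetical CUDA sketch of that idea, using fp32 weights and SiLU for clarity. The actual mmvq kernels in this PR operate on quantized blocks (including iqk/trellis/repacked types) and also cover the bias and mul_mat_id/add_id cases, none of which is shown here; every name in the sketch is illustrative and not part of the repo's API.

```cuda
// Hypothetical sketch: one warp (32 threads) per output row computes the
// up and gate dot products together and applies SiLU(gate) * up before
// writing the result, so the intermediate up/gate vectors never reach
// global memory.
#include <cuda_runtime.h>
#include <math.h>

__device__ inline float silu(float x) { return x / (1.0f + expf(-x)); }

__global__ void fused_up_gate_silu(const float * __restrict__ w_up,
                                   const float * __restrict__ w_gate,
                                   const float * __restrict__ x,
                                   float * __restrict__ dst,
                                   int ncols) {
    const int row = blockIdx.x;                       // one block (= one warp) per row
    const float * up_row   = w_up   + (size_t)row * ncols;
    const float * gate_row = w_gate + (size_t)row * ncols;

    float sum_up = 0.0f, sum_gate = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum_up   += up_row[col]   * x[col];           // up projection partial sum
        sum_gate += gate_row[col] * x[col];           // gate projection partial sum
    }

    // Warp-level reduction of both partial sums (assumes blockDim.x == 32).
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum_up   += __shfl_down_sync(0xffffffff, sum_up,   offset);
        sum_gate += __shfl_down_sync(0xffffffff, sum_gate, offset);
    }

    if (threadIdx.x == 0) {
        dst[row] = silu(sum_gate) * sum_up;           // fused activation * up
    }
}
```

A launch such as `fused_up_gate_silu<<<n_rows, 32>>>(w_up, w_gate, x, dst, n_cols)` produces the gated row outputs in one pass; avoiding the separate gate/up intermediates and the extra kernel launches is the saving this kind of fusion targets.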