* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias) (a kernel sketch of the fusion idea follows this list)

  We see nearly a 2% TG speedup for Ling-mini-2.0 and about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add diagnostics

* Disable assert

* Fix TG fused up*unary(gate) when down cannot be fused

  The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
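The core observation behind the fused ffn_up*unary_op(ffn_gate) kernels is that both projections read the same activation vector, so a single kernel can accumulate both dot products per output row and apply the activation and multiplication in place, instead of launching separate matrix-vector kernels and an elementwise op. Below is a minimal, hypothetical CUDA sketch of that idea, using fp32 weights and SiLU for clarity. The actual mmvq kernels in this PR operate on quantized blocks (including iqk/trellis/repacked types) and also cover the bias and mul_mat_id/add_id cases, none of which is shown here; every name in the sketch is illustrative and not part of the repo's API.

```cuda
// Hypothetical sketch: one warp (32 threads) per output row computes the
// up and gate dot products together and applies SiLU(gate) * up before
// writing the result, so the intermediate up/gate vectors never reach
// global memory.
#include <cuda_runtime.h>
#include <math.h>

__device__ inline float silu(float x) { return x / (1.0f + expf(-x)); }

__global__ void fused_up_gate_silu(const float * __restrict__ w_up,
                                   const float * __restrict__ w_gate,
                                   const float * __restrict__ x,
                                   float * __restrict__ dst,
                                   int ncols) {
    const int row = blockIdx.x;                       // one block (= one warp) per row
    const float * up_row   = w_up   + (size_t)row * ncols;
    const float * gate_row = w_gate + (size_t)row * ncols;

    float sum_up = 0.0f, sum_gate = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum_up   += up_row[col]   * x[col];           // up projection partial sum
        sum_gate += gate_row[col] * x[col];           // gate projection partial sum
    }

    // Warp-level reduction of both partial sums (assumes blockDim.x == 32).
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum_up   += __shfl_down_sync(0xffffffff, sum_up,   offset);
        sum_gate += __shfl_down_sync(0xffffffff, sum_gate, offset);
    }

    if (threadIdx.x == 0) {
        dst[row] = silu(sum_gate) * sum_up;           // fused activation * up
    }
}
```

A launch such as `fused_up_gate_silu<<<n_rows, 32>>>(w_up, w_gate, x, dst, n_cols)` produces the gated row outputs in one pass; avoiding the separate gate/up intermediates and the extra kernel launches is the saving this kind of fusion targets.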