* Slight MLA TG performance improvement on CUDA

  The low MLA performance on CUDA is due to the wk_b * q_nope operation. It turns into n_head matrix multiplications with n_head separate quantization and GEMV steps. The associated overhead is just too much for TG, where each GEMV is very fast (512 x 128 = 131 KFLOP for DeepSeek-Lite, 4X that for DeepSeekV3/R1). The way it was done, there was also a copy of each q_nope row before quantization, which I have now eliminated. This results in a ~2.5% speedup.

  What needs to happen instead is to launch a single computation that quantizes all heads, and then have a kernel that does the GEMV for all heads instead of n_head sequential GEMVs (a rough sketch of that batched approach follows after this list).

* Slightly better

* CUDA: Quantize non-contiguous tensors

* Much better MLA

  It is a total hack, but it works.

* Cleanup

  Remove duplicated GEMVs.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
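A minimal sketch of the batched idea described above: a single kernel launch covers the GEMV for all heads, rather than n_head sequential launches. This is an illustration in plain fp32, not the actual ik_llama.cpp code path, which quantizes q_nope first; the kernel name `batched_gemv_all_heads` and the sizes used in `main` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// One block per (head, output row): each block reduces a length-k dot
// product for one output element of one head. grid.y indexes the head,
// grid.x the output row, so the whole per-token GEMV is one launch.
__global__ void batched_gemv_all_heads(const float * __restrict__ W,   // [n_head, n, k]
                                       const float * __restrict__ x,   // [n_head, k]
                                       float       * __restrict__ y,   // [n_head, n]
                                       int n, int k) {
    const int head = blockIdx.y;
    const int row  = blockIdx.x;
    const float * w = W + (size_t)head * n * k + (size_t)row * k;
    const float * v = x + (size_t)head * k;

    float sum = 0.0f;
    for (int i = threadIdx.x; i < k; i += blockDim.x) sum += w[i] * v[i];

    // Block-wide tree reduction of the partial sums.
    __shared__ float partial[256];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[(size_t)head * n + row] = partial[0];
}

int main() {
    // Toy sizes loosely modeled on the DeepSeek-Lite numbers in the text
    // (k = 512, n = 128); n_head here is arbitrary.
    const int n_head = 16, n = 128, k = 512;
    std::vector<float> hW((size_t)n_head * n * k, 0.01f);
    std::vector<float> hx((size_t)n_head * k, 1.0f);
    std::vector<float> hy((size_t)n_head * n);

    float *W, *x, *y;
    cudaMalloc(&W, hW.size() * sizeof(float));
    cudaMalloc(&x, hx.size() * sizeof(float));
    cudaMalloc(&y, hy.size() * sizeof(float));
    cudaMemcpy(W, hW.data(), hW.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(x, hx.data(), hx.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Single launch for all heads instead of n_head sequential GEMVs.
    dim3 grid(n, n_head);
    batched_gemv_all_heads<<<grid, 256>>>(W, x, y, n, k);

    cudaMemcpy(hy.data(), y, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect %f)\n", hy[0], 0.01f * k);
    cudaFree(W); cudaFree(x); cudaFree(y);
    return 0;
}
```

The point of the single launch is to amortize the per-call overhead (kernel launch plus the per-head quantization setup) that dominates when each individual GEMV is only ~131 KFLOP; the same reasoning applies to quantizing all heads in one pass before the GEMV.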