ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-05 19:40:19 +00:00

Files

Kawrakow 30536ee369 FlashMLA-3 for DeepSeek models on CUDA (#386)

* CUDA WIP: support for FlashMLA-3

* Much better

The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite).

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-05-07 17:38:22 +03:00

cmake

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

include

Add copyright notices (#317)

2025-04-07 10:43:26 +02:00

src

FlashMLA-3 for DeepSeek models on CUDA (#386)

2025-05-07 17:38:22 +03:00

.gitignore

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

CMakeLists.txt

Compile time option to use bf16 for quants without MMQ kernels (#261)

2025-03-18 07:37:10 +01:00