ik_llama.cpp/ggml
Kawrakow fc6a65dda4 MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252)
* FlashMLA(CUDA): WIP to allow q8_0 quantized cache

* WIP

* FlashMLA(CUDA) - allow q8_0 for KV cache

This works, and prompt processing (PP) speed is not bad, but token
generation (TG) is still quite a bit slower than with an f16 cache.

* FlashMLA(CUDA) - allow q8_0 for KV cache

This is better: ~9% slower than with an f16 cache for short contexts,
and nearly on par at 16k tokens.
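
Not part of the original message, but as context for the PP/TG numbers
above: in ggml's q8_0 format each group of 32 values is stored as one
f16 scale plus 32 int8 quants, so every read from the quantized cache
pays an extra dequantization multiply that an f16 cache avoids; that
per-value work is one plausible source of the TG slowdown. A minimal
illustrative CUDA sketch of the dequantization (not the actual FlashMLA
kernel from this commit):

    // Illustrative only -- mirrors ggml's block_q8_0 layout, not the
    // FlashMLA kernel from this commit.
    #include <cuda_fp16.h>
    #include <stdint.h>

    #define QK8_0 32

    typedef struct {
        half   d;          // per-block f16 scale
        int8_t qs[QK8_0];  // 32 quantized values
    } block_q8_0;

    // Dequantize n values (n a multiple of QK8_0), one thread per value.
    __global__ void dequantize_q8_0(const block_q8_0 * x, float * y, int n) {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        const block_q8_0 * b = &x[i / QK8_0];
        y[i] = __half2float(b->d) * (float) b->qs[i % QK8_0];
    }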

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-12 07:21:46 +02:00
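
A quick sizing note (ggml arithmetic, not from the commit) on why a
q8_0 KV cache is worth the extra dequantization work: a q8_0 block
packs 32 values into 34 bytes (a 2-byte f16 scale plus 32 int8 quants),
versus 64 bytes for the same 32 values in f16.

    /* Sizing check based on ggml's q8_0 layout:
       one f16 scale + 32 int8 quants = 34 bytes per 32 values. */
    #include <stdio.h>

    int main(void) {
        const double f16_bpv  = 2.0;         /* bytes per value, f16 cache  */
        const double q8_0_bpv = 34.0 / 32.0; /* bytes per value, q8_0 cache */
        printf("q8_0 cache is %.1f%% of the f16 cache size\n",
               100.0 * q8_0_bpv / f16_bpv);  /* prints 53.1 */
        return 0;
    }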