ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-28 00:54:09 +00:00

Files

Kawrakow 305fabfc3b FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)

* FlashMLA-2: eliminate intermediate f32 tensors

This works on the CPU. PP performance is ~13% better for 16k tokens
and the compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.

* FlashMLA-2: slightly smaller compute buffer size

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-03-13 12:07:43 +02:00

cmake

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

include

SER - Smart Expert Reduction (#239)

2025-03-02 13:47:38 +02:00

src

FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)

2025-03-13 12:07:43 +02:00

.gitignore

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

CMakeLists.txt

FA: Add option to build all FA kernels (#197)

2025-02-09 18:59:33 +02:00