Relatively painless to implement for soft_max and soft_cap_max. We gain 11.5% for LLaMA-8B and ~14% for Gemma-2-2b at 32k tokens. The KQ mask is prepared on the CPU and copied to the GPU, so my guess is that most of the gain comes from the 32X reduction in the amount of data copied to the GPU (see the sketch below).

TODO: flash attention
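
For illustration only, here is a minimal sketch of the kind of CPU-side packing that would account for a 32X reduction: one bit per mask position instead of one f32. The helper name `pack_kq_mask` and the 0.0f / -INFINITY mask convention are assumptions for the example, not taken from this commit.

```cpp
// Illustrative sketch (not the actual implementation): pack the f32 KQ mask
// into one bit per position before the host->device copy. One uint32_t then
// holds 32 mask entries, i.e. 32X less data to transfer than n*sizeof(float).
#include <cstdint>
#include <vector>

// Assumed convention: mask[i] == 0.0f means "attend", anything else (-INF) means "masked".
std::vector<uint32_t> pack_kq_mask(const float * mask, size_t n) {
    std::vector<uint32_t> packed((n + 31) / 32, 0u);
    for (size_t i = 0; i < n; ++i) {
        if (mask[i] == 0.0f) {                 // position is visible
            packed[i >> 5] |= 1u << (i & 31);  // set the corresponding bit
        }
    }
    return packed;  // copy this buffer to the GPU instead of the full f32 mask
}
```

The GPU kernels (soft_max, soft_cap_max) would then test the bit and substitute -INF locally instead of reading a per-position float.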