ik_llama.cpp/ggml
Iwan Kawrakow, commit 1216a43719: WIP KQ binary mask: CUDA
Relatively painless to implement for soft_max and soft_cap_max.
We gain 11.5% for LLaMA-8B and ~14% for Gemma-2-2b at 32k tokens.
The KQ mask is prepared on the CPU and copied to the GPU, so
my guess is that most of the gain comes from the 32x reduction
in the amount of data copied to the GPU.
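
The 32x figure follows from replacing one fp32 mask value per KQ entry with a
single bit. Below is a minimal sketch of how such a bit-packed mask could be
prepared on the CPU; the function names and bit layout (`pack_kq_mask`,
`kq_mask_bit`, 1 = visible) are assumptions for illustration, not the actual
ggml/CUDA code in this commit.

```c
// Hypothetical sketch of a binary KQ mask, not the actual ggml layout.
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// Pack mask[i] (0.0f = visible, -INFINITY = masked) into bits: 1 = visible.
// n_kv fp32 values (4 bytes each) become ceil(n_kv/32) uint32_t words,
// i.e. roughly a 32x reduction in the data copied to the GPU.
static void pack_kq_mask(const float * mask, uint32_t * bits, int n_kv) {
    memset(bits, 0, ((n_kv + 31) / 32) * sizeof(uint32_t));
    for (int i = 0; i < n_kv; ++i) {
        if (mask[i] == 0.0f) {
            bits[i >> 5] |= 1u << (i & 31);
        }
    }
}

// How a soft_max kernel might test column i against the packed mask.
static inline int kq_mask_bit(const uint32_t * bits, int i) {
    return (bits[i >> 5] >> (i & 31)) & 1;
}

int main(void) {
    enum { N_KV = 64 };
    float mask[N_KV];
    // causal mask for a token at position 40: later positions are masked out
    for (int i = 0; i < N_KV; ++i) mask[i] = (i <= 40) ? 0.0f : -INFINITY;

    uint32_t bits[(N_KV + 31) / 32];
    pack_kq_mask(mask, bits, N_KV);

    printf("fp32 mask: %zu bytes, packed mask: %zu bytes\n",
           sizeof(mask), sizeof(bits)); // 256 vs 8 bytes in this toy case
    printf("bit 40 = %d, bit 41 = %d\n",
           kq_mask_bit(bits, 40), kq_mask_bit(bits, 41));
    return 0;
}
```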

TODO: flash attention
2024-08-28 10:03:10 +03:00