Relatively painless to implement for soft_max and soft_cap_max. We gain 11.5% for LLaMA-8B and ~14% for Gemma-2-2b at 32k tokens. The KQ mask is prepared on the CPU and copied to the GPU, so my guess is that most of the gain comes from the 32X reduction in the amount of data copied to the GPU (see the sketch below).

TODO: flash attention
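
For illustration only, here is a minimal sketch of the kind of CPU-side packing that would account for a 32X reduction: one bit per mask position instead of one f32. The helper name `pack_kq_mask` and the 0.0f / -INFINITY mask convention are assumptions for the example, not taken from this commit.

```cpp
// Illustrative sketch (not the actual implementation): pack the f32 KQ mask
// into one bit per position before the host->device copy. One uint32_t then
// holds 32 mask entries, i.e. 32X less data to transfer than n*sizeof(float).
#include <cstdint>
#include <vector>

// Assumed convention: mask[i] == 0.0f means "attend", anything else (-INF) means "masked".
std::vector<uint32_t> pack_kq_mask(const float * mask, size_t n) {
    std::vector<uint32_t> packed((n + 31) / 32, 0u);
    for (size_t i = 0; i < n; ++i) {
        if (mask[i] == 0.0f) {                 // position is visible
            packed[i >> 5] |= 1u << (i & 31);  // set the corresponding bit
        }
    }
    return packed;  // copy this buffer to the GPU instead of the full f32 mask
}
```

The GPU kernels (soft_max, soft_cap_max) would then test the bit and substitute -INF locally instead of reading a per-position float.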