mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-26 09:29:27 +00:00
Here we get a small speedup: Gemma-2-2b with a 32k context is ~4% faster on Zen4. On Zen4 we can use _mm512_mask_mul_ps(-infinity, mask, s_after, tanh(x*s_before)) to apply the scale and the mask in a single op that has the same latency and throughput as _mm512_mul_ps. Combined with fewer memory loads when the mask is stored as fp32 (or fp16), this gives some performance improvement for very large masks (contexts). It will be much trickier on the other platforms, which lack masked instructions.