ik_llama.cpp/ggml
commit 511c459232 (Iwan Kawrakow)
WIP: play with KQ mask - make it binary
Here we get a small speedup: Gemma-2-2b with a 32k context is
~4% faster on Zen4. In addition, on Zen4 we can use
  _mm512_mask_mul_ps(-infinity, mask, s_after, tanh(x*s_before))
to scale and apply the mask in a single op that has the same
latency and throughput as _mm512_mul_ps. Combined with the reduced
memory loads compared to a mask stored as fp32 (or fp16), this
gives us some performance improvement for very large masks (contexts).
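For illustration, a minimal sketch of the fused step (the helper
names and the scalar tanh fallback are placeholders, not the actual
ggml kernel; a real kernel would use a vectorized tanh approximation):

  #include <immintrin.h>
  #include <math.h>

  static inline __m512 vtanh_ps(__m512 x) {
      // placeholder: scalar tanh per lane just to keep the sketch
      // self-contained; real code would vectorize this
      float tmp[16];
      _mm512_storeu_ps(tmp, x);
      for (int i = 0; i < 16; ++i) tmp[i] = tanhf(tmp[i]);
      return _mm512_loadu_ps(tmp);
  }

  // out[i] = bit i of mask set ? s_after * tanh(x[i]*s_before) : -inf
  static inline __m512 softcap_and_mask(__m512 x, __mmask16 mask,
                                        float s_before, float s_after) {
      __m512 t = vtanh_ps(_mm512_mul_ps(x, _mm512_set1_ps(s_before)));
      // one masked multiply scales the kept lanes and writes -inf to
      // the masked-out lanes, same latency/throughput as a plain mul
      return _mm512_mask_mul_ps(_mm512_set1_ps(-INFINITY), mask,
                                _mm512_set1_ps(s_after), t);
  }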

It will be much trickier on other platforms that do not have
masked instructions (see the AVX2 sketch below).
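For comparison, a hedged AVX2 sketch (an assumption, not code from
this commit): without masked instructions the scale and the mask need
separate instructions, plus the cost of expanding the binary mask
into a full-width vector mask first:

  // maskv must hold all-ones bits for kept lanes, zeros otherwise;
  // t is assumed to already hold tanh(x * s_before)
  static inline __m256 softcap_and_mask_avx2(__m256 t, __m256 maskv,
                                             float s_after) {
      __m256 scaled = _mm256_mul_ps(_mm256_set1_ps(s_after), t);
      // extra blend selects -inf where the mask lane is zero
      return _mm256_blendv_ps(_mm256_set1_ps(-INFINITY), scaled, maskv);
  }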
2024-08-28 09:08:49 +03:00