mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-26 09:29:27 +00:00
Here we get a small speedup: Gemma-2-2b with a 32k context is ~4% faster on Zen4. On Zen4 we can use _mm512_mask_mul_ps(-infinity, mask, s_after, tanh(x*s_before)) to apply the scale and the mask in a single op that has the same latency and throughput as _mm512_mul_ps. Combined with fewer memory loads when the mask is stored as fp32 (or fp16), this gives some performance improvement for very large masks (contexts). It will be much trickier on the other platforms, which lack masked instructions.