ik_llama.cpp/ggml
Iwan Kawrakow 4ff2c6d188 NEON Flash Attention: quantized K*Q for q8_0
This makes quite a bit of difference: for Gemma2-2b, PP-8192 is 228 t/s with
quantized K*Q vs. 178 t/s when converting to fp16 and using fp16
matrix multiplication.
With PP-512 = 307 t/s, PP-8192 now runs at ~75% of the PP-512
performance. In contrast, llama.cpp with a Q8_0 cache reaches
only 38% of its PP-512.
2024-09-12 10:38:44 +02:00