ik_llama.cpp/ggml
Iwan Kawrakow 4ff2c6d188 NEON Flash Attention: quantized K*Q for q8_0
This makes quite a bit of difference: for Gemma2-2b, PP-8192 is 228 t/s with
quantized K*Q vs. 178 t/s when converting to fp16 and using fp16
matrix multiplication.
With PP-512 = 307 t/s, PP-8192 now runs at ~75% of the PP-512
performance. In contrast, llama.cpp with a Q8_0 cache reaches
only 38% of its PP-512.
2024-09-12 10:38:44 +02:00