ik_llama.cpp/ggml/include
Iwan Kawrakow 48974c7acd iq4_k: Rearrange blocks for faster matrix multiplications
On Zen4 we get PP-512(LLaMA-3.1-8B) = 216.7 t/s.
In comparison, the original bit arrangement gave 180 t/s.
The trick is to have quants
  0...3,  64...67,  128...131, 192...195 in block 0,
  4...7,  68...71,  132...135, 196...199 in block 2, etc.
With that, we can simply sum the integer dot products
and multiply with the block scales, without needing
to shuffle scales/dot products and such.
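
The arrangement described above can be illustrated with a minimal scalar sketch in C. It is not the code from this commit: the names `block_of_quant` and `dot_super_block` are hypothetical, and the sketch assumes a super-block of 256 quants split into 16 blocks of 16, with the blocks numbered consecutively (0, 1, 2, ...), which may differ from the block numbering used in the message. It also works on already unpacked 8-bit values rather than the packed 4-bit storage. The idea is that each run of four adjacent quants (the unit a SIMD integer dot product typically sums into one 32-bit lane) belongs to a single block, so per-block integer sums can be scaled directly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
enum { SUPER_BLOCK = 256, GROUP = 4, STRIDE = 64, N_BLOCKS = 16 };

/* Block that holds quant i under the assumed interleaving:
 *   0..3, 64..67, 128..131, 192..195 -> block 0
 *   4..7, 68..71, 132..135, 196..199 -> next block, and so on. */
static int block_of_quant(int i) {
    assert(i >= 0 && i < SUPER_BLOCK);
    return (i % STRIDE) / GROUP;
}

/* Dot product over one super-block: accumulate an integer sum per block,
 * then apply each block scale exactly once. No reshuffling of scales or
 * partial sums is needed, because every group of four adjacent products
 * maps to one block. */
static float dot_super_block(const int8_t *q, const int8_t *y,
                             const float *scales) {
    int32_t sums[N_BLOCKS] = {0};
    for (int i = 0; i < SUPER_BLOCK; ++i) {
        sums[block_of_quant(i)] += (int32_t)q[i] * (int32_t)y[i];
    }
    float result = 0.0f;
    for (int b = 0; b < N_BLOCKS; ++b) {
        result += scales[b] * (float)sums[b];
    }
    return result;
}
```

In a vectorized version, the loop over `i` would presumably become a few wide loads whose per-lane sums are added together and multiplied by a vector of block scales; the scalar form above only shows why the index mapping makes that possible.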

iq4_k is now the fastest quantization type on Zen4, so
time to see how this will work on the other platforms.
2024-11-04 10:22:59 +02:00