ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-26 08:04:09 +00:00

Files

Iwan Kawrakow 48974c7acd iq4_k: Rearrange blocks for faster matrix multiplications

On Zen4 we get PP-512(LLaMA-3.1-8B) = 216.7 t/s.
In comparison, the original bit arrangement gave 180 t/s.
The trick is to have quants
  0...3,  64...67,  128...131, 192...195 in block 0,
  4...7,  68...71,  132...135, 196...199 in block 2, etc.
With that, we can simply sum the integer dot products
and multiply with the block scales, without needing
to shuffle scales/dot products and such.
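
The arrangement described above can be sketched as an index mapping. This is a minimal illustration, not the code from the commit: it assumes a 256-quant super-block split into 16 groups of 16 quants, numbered consecutively from 0, with each group taking four runs of 4 consecutive quants spaced 64 apart. The names source_index, build_layout and the constants are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical constants: 256 quants per super-block, 16 quants per group.
constexpr std::size_t kSuperBlock = 256;
constexpr std::size_t kGroupSize  = 16;
constexpr std::size_t kRun        = 4;   // consecutive quants taken per run
constexpr std::size_t kStride     = 64;  // distance between runs in one group

// Original quant index stored at position j (0..15) of group g (0..15).
// Group 0 yields 0..3, 64..67, 128..131, 192..195;
// the next group yields 4..7, 68..71, 132..135, 196..199; and so on.
constexpr std::size_t source_index(std::size_t g, std::size_t j) {
    return kStride * (j / kRun) + kRun * g + (j % kRun);
}

// Full permutation: layout[g * kGroupSize + j] is the original quant index.
inline std::array<std::size_t, kSuperBlock> build_layout() {
    std::array<std::size_t, kSuperBlock> layout{};
    for (std::size_t g = 0; g < kSuperBlock / kGroupSize; ++g) {
        for (std::size_t j = 0; j < kGroupSize; ++j) {
            layout[g * kGroupSize + j] = source_index(g, j);
        }
    }
    return layout;
}
```

Under these assumptions every group draws 4 quants from each 64-quant quarter of the super-block, which is consistent with the commit's remark that the integer dot products can be summed and scaled without further shuffling.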

iq4_k is now the fastest quantization type on Zen4, so
time to see how this will work on the other platforms.

2024-11-04 10:22:59 +02:00

ggml-alloc.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-backend.h

Bitnet changes (#106)

2024-10-25 13:08:43 +02:00

ggml-blas.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-cann.h

Merge mainline llama.cpp (#3)