ik_llama.cpp/ggml
Iwan Kawrakow f3a823ce72 iq6_k: Zen4 iqk_mul_mat
We need 4 shuffles to get the non-uniform values, which makes it
slower than the other iqX_k quants.

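(For illustration only — not the actual iqk_mul_mat code. A 6-bit quant
indexes a 64-entry table of non-uniform values, but vpshufb can only look
up 16 bytes per 128-bit lane, so the table has to be split into four
16-byte chunks: four shuffles plus blends. A sketch of that idea, with a
made-up helper name and a hypothetical 64-entry int8 table:)

    #include <stdint.h>
    #include <immintrin.h>

    // idx holds 64 six-bit indices, one per byte, values 0..63.
    // values is the 64-entry table of non-uniform int8 quant values.
    static inline __m512i lookup_iq6_values(__m512i idx, const int8_t * values) {
        // broadcast each 16-byte chunk of the table to all four 128-bit lanes
        __m512i t0 = _mm512_broadcast_i32x4(_mm_loadu_si128((const __m128i *)(values +  0)));
        __m512i t1 = _mm512_broadcast_i32x4(_mm_loadu_si128((const __m128i *)(values + 16)));
        __m512i t2 = _mm512_broadcast_i32x4(_mm_loadu_si128((const __m128i *)(values + 32)));
        __m512i t3 = _mm512_broadcast_i32x4(_mm_loadu_si128((const __m128i *)(values + 48)));
        __m512i lo = _mm512_and_si512(idx, _mm512_set1_epi8(0x0f)); // low 4 bits: position within a chunk
        __m512i v0 = _mm512_shuffle_epi8(t0, lo);                   // shuffle 1
        __m512i v1 = _mm512_shuffle_epi8(t1, lo);                   // shuffle 2
        __m512i v2 = _mm512_shuffle_epi8(t2, lo);                   // shuffle 3
        __m512i v3 = _mm512_shuffle_epi8(t3, lo);                   // shuffle 4
        // bits 4 and 5 of the index select which of the four chunk results to keep
        __mmask64 b4 = _mm512_test_epi8_mask(idx, _mm512_set1_epi8(0x10));
        __mmask64 b5 = _mm512_test_epi8_mask(idx, _mm512_set1_epi8(0x20));
        return _mm512_mask_blend_epi8(b5, _mm512_mask_blend_epi8(b4, v0, v1),
                                          _mm512_mask_blend_epi8(b4, v2, v3));
    }
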
And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits and then multiplies
with the 16-bit block scales. But this can overflow for iq4_k, iq5_k,
and iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot pack the
_mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit
iqX_k quants.

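(A back-of-the-envelope illustration of why the 16-bit packing overflows —
not the actual kernel layout; the 16-value slice and the value ranges are
assumptions. If the quant side is fed to vpdpbusds as unsigned bytes of up
to 255 and the Q8_K side as signed bytes of magnitude up to 127, a slice
sum can reach ~16*255*127 ≈ 518k, far outside the int16 range of ±32767:)

    #include <stdint.h>

    // scale * dot-product of one 16-value slice, done correctly in 32 bits
    int32_t dot_ok(const uint8_t * q, const int8_t * q8, int16_t scale) {
        int32_t acc = 0;
        for (int i = 0; i < 16; ++i) acc += (int32_t)q[i] * q8[i];
        return (int32_t)scale * acc;   // block scale applied to the full 32-bit sum
    }

    // same thing, but narrowing the accumulator to 16 bits first, which is
    // effectively what the standard Zen4 template does before the scale multiply
    int32_t dot_overflows(const uint8_t * q, const int8_t * q8, int16_t scale) {
        int32_t acc = 0;
        for (int i = 0; i < 16; ++i) acc += (int32_t)q[i] * q8[i];
        int16_t acc16 = (int16_t)acc;  // wraps whenever acc does not fit in int16
        return (int32_t)scale * acc16;
    }
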
For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
2024-08-08 11:17:42 +03:00