ik_llama.cpp/github-data/pull_requests/427-Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K.md at ab7d193fe0f2ca1f99713e858f976eefb28224a5

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-01 03:41:53 +00:00

Files

Thomas ab7d193fe0 Add GitHub data (#637 )

2025-07-22 18:18:40 +02:00

1.7 KiB

Raw Blame History

🐛 #427 - Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K

Author	`ikawrakow`
State	❌ Closed
Created	2025-05-16
Updated	2025-05-16

Description

I have made the exact same mistake a number of times.

On AVX2 the instruction to perform dot products of int8_t vectors (as needed in quantized matrix multiplications) is _mm256_maddubs_epi8(x, y), where x must be unsigned and y signed, and the result is a SIMD vector of signed int16_t values z_i = x_{2i} y_{2i} + x_{2i+1} y_{2i+1}. The quant values x and quantized activations y are signed, so one way to deal with the the strangeness of this instruction is to add a suitable constant value c to x so that it becomes unsigned, use _mm256_maddubs_epi8(c+x, y) to accumulate the dot product, and at the end subtract c \cdot b, where b = \sum y_i has been pre-computed when quantizing the activations y. The issue arises when the x values span the full int8_t range as it is the case with the non-linear quants IQ4_NL, IQ4_XS, IQ4_K, IQ4_KS, IQ5_K, IQ5_KS, IQ6_K. In that case c = 128, the c+x values span the full uint8_t range, and hence it is possible to overflow the signed int16_t range.

I had though that I had fixed this mistake, but while working on the IQ5_KS type added in PR #422 I noticed that the issue still exists IQ4_K, IQ4_KS, IQ5_K, IQ6_K and was only fixed for the corresponding repacked variants.

The PR corrects the problem. There will be a slight (a few percent) PP performance degradation on AVX2 for these quantization types.

1.7 KiB Raw Blame History

🐛 #427 - Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K

Description

1.7 KiB

Raw Blame History