
🔀 #83 - New SOTA quantization: 4.25 bpw IQ4_KS

Author ikawrakow
State Closed
Created 2024-10-09
Updated 2024-10-09

Description

It is similar to IQ4_K with the following differences:

  • Blocks of 32 instead of blocks of 16
  • Row-wise float scale instead of the per super-block ggml_half scale
  • 7-bit block scales instead of 6-bit, needed to ensure enough precision when using a per-row float scale (a rough layout sketch follows this list)
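
As a rough sketch of how these pieces add up (the struct name and field layout below are illustrative assumptions, not the actual ik_llama.cpp definitions): 32 weights at 4 bits cost 128 bits, and storing the 7-bit block scale in one byte adds another 8, for 136 bits per block of 32, i.e. 4.25 bpw; the single per-row float scale is amortized over the whole row and adds almost nothing.

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative only: field names and exact packing are assumptions,
// not the actual ik_llama.cpp block definition.
#define IQ4_KS_BLOCK 32

typedef struct {
    uint8_t scale;                    // 7-bit block scale (1 bit to spare)
    uint8_t qs[IQ4_KS_BLOCK / 2];     // 32 x 4-bit quants, two per byte
} block_iq4_ks_sketch;                // 17 bytes = 136 bits per 32 weights

int main(void) {
    double bpw = 8.0 * sizeof(block_iq4_ks_sketch) / IQ4_KS_BLOCK; // 4.25
    // One float row scale amortized over, say, a 4096-wide row.
    double row_overhead = 32.0 / 4096.0;                           // ~0.008 bpw
    printf("bpw = %.2f (+ %.4f for the row scale)\n", bpw, row_overhead);
    return 0;
}
```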

It ends up being 4.25 bpw, so the same as IQ4_XS. Why add it then? Because it has a lower quantization error than IQ4_XS. For some models the difference is quite significant. The following table gives some examples. Quantization error Qerr is defined as PPL(Q)/PPL(f16)-1

| Model | Qerr(IQ4_XS) | Qerr(IQ4_KS) |
|---|---|---|
| LLaMA-3.1-8B | 2.82% | 2.68% |
| LLaMA-3.1-8B-Instruct | 2.54% | 1.85% |
| LLaMA-3.2-3B-Instruct | 2.45% | 2.13% |
| Qwen-2.5-7B-Instruct | 2.31% | 1.62% |
| Qwen-2.5-32B-Instruct | 2.17% | 1.82% |
| Nemo-Instruct-2407 | 1.592% | 1.579% |
| Gemma-2-9B | 1.33% | 0.92% |
| Gemma-2-27B-Instruct | 1.23% | 0.72% |
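
As a reminder of how the Qerr values above are computed, here is a minimal helper; the perplexity values in it are made up for illustration and are not from this PR.

```c
#include <stdio.h>

// Quantization error as defined above: PPL(Q)/PPL(f16) - 1.
static double qerr(double ppl_q, double ppl_f16) {
    return ppl_q / ppl_f16 - 1.0;
}

int main(void) {
    double ppl_f16 = 6.20;   // hypothetical f16 perplexity
    double ppl_q   = 6.37;   // hypothetical quantized perplexity
    printf("Qerr = %.2f%%\n", 100.0 * qerr(ppl_q, ppl_f16)); // ~2.74%
    return 0;
}
```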

Performance is similar to IQ4_XS or even slightly better, except for TG on the M2-Max GPU, where it is ~2% slower (Apple Silicon does not like non-sequential memory access, and having the row scale stored at the beginning of the row causes an additional memory jump in the dot product kernel).

The PR also adds a new quantization mix - IQ3_KL (L for "large"). It fills the gap between IQ3_K and IQ4_K (and now IQ4_KS). The following graph illustrates where this new mix sits for LLaMA-3.1-8B-Instruct.

(figure il31_8B: quantization mix comparison for LLaMA-3.1-8B-Instruct)