🔀 #83 - New SOTA quantization: 4.25 bpw IQ4_KS
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-10-09 |
| Updated | 2024-10-09 |
Description
It is similar to IQ4_K with the following differences:
- Blocks of 32 instead of blocks of 16
- Row-wise `float` scale instead of a per super-block `ggml_half`
- 7-bit block scales instead of 6-bit - needed to ensure enough precision when using a per-row float scale
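To see where the 4.25 bpw figure comes from, here is a small sketch of the bit accounting (the exact struct layout in the PR may differ; the field sizes and names below are illustrative only, assuming the 7-bit block scale occupies a full byte):

```cpp
#include <cstdio>

int main() {
    // IQ4_KS-style bit accounting (illustrative; actual layout may differ).
    const int block_size      = 32;   // weights per block
    const int bits_per_quant  = 4;    // 4-bit non-linear quants
    const int bits_per_bscale = 8;    // 7-bit block scale stored in a byte
    const int row_size        = 4096; // example row length
    const int bits_row_scale  = 32;   // one float scale per row

    double per_block = block_size * bits_per_quant + bits_per_bscale; // 136 bits
    double bpw = per_block / block_size                               // 4.25
               + double(bits_row_scale) / row_size;                   // ~0.008, amortized
    printf("block part: %.2f bpw, with row scale: %.4f bpw\n",
           per_block / block_size, bpw);
    return 0;
}
```

The per-row float scale adds only a few thousandths of a bit per weight for typical row sizes, so the effective size is the same 4.25 bpw as IQ4_XS.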
It ends up being 4.25 bpw, i.e., the same as IQ4_XS. Why add it then? Because it has a lower quantization error than IQ4_XS. For some models the difference is quite significant. The following table gives some examples; quantization error Qerr is defined as PPL(Q)/PPL(f16) - 1.
| Model | Qerr(IQ4_XS) | Qerr(IQ4_KS) |
|---|---|---|
| LLaMA-3.1-8B | 2.82% | 2.68% |
| LLaMA-3.1-8B-Instruct | 2.54% | 1.85% |
| LLaMA-3.2-3B-Instruct | 2.45% | 2.13% |
| Qwen-2.5-7B-Instruct | 2.31% | 1.62% |
| Qwen-2.5-32B-Instruct | 2.17% | 1.82% |
| Nemo-Instruct-2407 | 1.592% | 1.579% |
| Gemma-2-9B | 1.33% | 0.92% |
| Gemma-2-27B-Instruct | 1.23% | 0.72% |
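As a concrete reading of the Qerr definition above, a minimal helper (the perplexity values here are made-up placeholders, not measurements from this PR):

```cpp
#include <cstdio>

// Quantization error as defined above: Qerr = PPL(Q)/PPL(f16) - 1
static double qerr(double ppl_q, double ppl_f16) {
    return ppl_q / ppl_f16 - 1.0;
}

int main() {
    // Hypothetical perplexities, for illustration only.
    double ppl_f16 = 6.50;
    double ppl_q   = 6.60;
    printf("Qerr = %.2f%%\n", 100.0 * qerr(ppl_q, ppl_f16)); // ~1.54%
    return 0;
}
```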
Performance is similar to IQ4_XS or even slightly better, except for TG on the M2-Max GPU, where it is ~2% slower (Apple Silicon does not like non-sequential memory access, but having the row scale stored at the beginning of the row causes an additional memory jump in the dot product kernel).
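The TG remark refers to the layout seen by the dot-product kernel: with IQ4_KS the row scale sits at the start of the row, so each dot product makes one extra load away from the block data it is otherwise streaming through. A rough CPU-side sketch of the idea (not the actual Metal or CUDA kernel; struct layout and names are made up for illustration):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Sketch of an IQ4_KS-style row dot product: blocks of 32 4-bit quants with
// 7-bit block scales, plus a single float scale stored at the row start.
struct BlockKS32 {
    uint8_t scale;   // 7-bit block scale (top bit unused in this sketch)
    uint8_t qs[16];  // 32 x 4-bit quant indices, two per byte
};

float dot_iq4ks_like(const uint8_t* row, size_t n_blocks,
                     const float* lut16,   // 16-entry non-linear value table
                     const float* x) {     // activations, 32 * n_blocks long
    float row_scale;                                  // extra, non-sequential load:
    std::memcpy(&row_scale, row, sizeof(float));      // the scale precedes the blocks
    const BlockKS32* blocks =
        reinterpret_cast<const BlockKS32*>(row + sizeof(float));

    float acc = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b) {
        float bsum = 0.0f;
        for (int j = 0; j < 16; ++j) {
            const uint8_t q = blocks[b].qs[j];
            bsum += lut16[q & 0x0f] * x[32*b + 2*j + 0];
            bsum += lut16[q >>   4] * x[32*b + 2*j + 1];
        }
        acc += (blocks[b].scale & 0x7f) * bsum;       // apply 7-bit block scale
    }
    return row_scale * acc;                           // apply per-row float scale once
}
```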
The PR also adds a new quantization mix - IQ3_KL (L for "large"). It fills the gap between IQ3_K and IQ4_K (and now IQ4_KS). The following graph illustrates where this new mix sits for LLaMA-3.1-8B-Instruct.