
🔀 #85 - IQ2_KS: 2.1875 bpw non-linear quantization

Author ikawrakow
State Closed
Created 2024-10-13
Updated 2024-10-13

Description

IQ2_KS ends up somewhere in the middle between IQ2_XXS and IQ2_XS in both quantized model size and quantization accuracy.

(Graph: quantization error vs. bpw for LLaMA-3.1-8B-Instruct.)

What is the point, then? Two points:

  • Another proof that one can extend quantization to very low bpw without using a codebook. My previous attempts to do that have not been successful, so I'm quite pleased with this outcome (a sketch of the idea follows this list).
  • Much better CPU performance than IQ2_XXS or IQ2_XS (or any of the i-quants that use a codebook); see the tables below.
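To make the first point concrete, here is a minimal sketch of what codebook-free non-linear dequantization can look like. The grid values, block size, and bit packing below are illustrative assumptions, not the actual IQ2_KS layout introduced in this PR:

```cpp
#include <cstdint>

// Small fixed non-linear grid: the 2-bit index selects one of 4 values.
// These values are assumed for illustration, not taken from the PR.
static const int8_t kGrid[4] = {-31, -13, 1, 17};

// Dequantize one hypothetical block of 32 weights: 8 packed bytes of
// 2-bit indices plus one per-block scale. Unlike IQ2_XXS/IQ2_XS, there is
// no lookup of 8-weight patterns in a large codebook; every weight decodes
// independently, which is easy to vectorize on NEON/AVX.
static void dequantize_block_2bit(const uint8_t qs[8], float scale, float out[32]) {
    for (int i = 0; i < 32; ++i) {
        const int idx = (qs[i >> 2] >> (2 * (i & 3))) & 3; // i-th 2-bit index
        out[i] = scale * (float)kGrid[idx];
    }
}
```

The independent per-weight decode is what plausibly explains the speedups in the tables below: the codebook-based i-quants have to gather codebook entries, which is comparatively expensive on CPUs.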

M2-Max CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |

Ryzen-7950X CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |

The only caveat is that quantization is quite slow: it takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.
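As a back-of-the-envelope check on the 2.1875 bpw figure, one accounting that produces it exactly, assuming super-blocks of 256 weights with a 16-bit super-block scale and eight 4-bit sub-block scales (a hypothetical layout for illustration, not taken from the PR), is

$$\frac{256 \cdot 2 + 16 + 8 \cdot 4}{256} = \frac{560}{256} = 2.1875 \ \text{bpw}.$$

The file sizes in the tables come out at a somewhat higher effective rate (2.30 GiB for 8.03 B parameters is about 2.46 bpw) because, as usual for llama.cpp-style quantizations, some tensors such as the token embeddings and output layer are kept at higher precision.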