
🔀 #85 - IQ2_KS: 2.1875 bpw non-linear quantization

Author ikawrakow
State Closed
Created 2024-10-13
Updated 2024-10-13

Description

IQ2_KS ends up somewhere in the middle between IQ2_XXS and IQ2_XS in both quantized model size and quantization accuracy.

(Graph: quantization error vs. bpw for LLaMA-3.1-8B-Instruct.)

What is the point, then? Two points:

  • Another proof that one can extend quantization to very low bpw without using a codebook. My previous attempts to do that have not been successful, so I'm quite pleased with this outcome (a sketch of the idea follows this list).
  • Much better CPU performance than IQ2_XXS or IQ2_XS (or any of the i-quants that use a codebook); see the tables below.
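To make the first point concrete, here is a minimal sketch of what codebook-free non-linear dequantization can look like. The grid values, block size, and bit packing below are illustrative assumptions, not the actual IQ2_KS layout introduced in this PR:

```cpp
#include <cstdint>

// Small fixed non-linear grid: the 2-bit index selects one of 4 values.
// These values are assumed for illustration, not taken from the PR.
static const int8_t kGrid[4] = {-31, -13, 1, 17};

// Dequantize one hypothetical block of 32 weights: 8 packed bytes of
// 2-bit indices plus one per-block scale. Unlike IQ2_XXS/IQ2_XS, there is
// no lookup of 8-weight patterns in a large codebook; every weight decodes
// independently, which is easy to vectorize on NEON/AVX.
static void dequantize_block_2bit(const uint8_t qs[8], float scale, float out[32]) {
    for (int i = 0; i < 32; ++i) {
        const int idx = (qs[i >> 2] >> (2 * (i & 3))) & 3; // i-th 2-bit index
        out[i] = scale * (float)kGrid[idx];
    }
}
```

The independent per-weight decode is what plausibly explains the speedups in the tables below: the codebook-based i-quants have to gather codebook entries, which is comparatively expensive on CPUs.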

M2-Max CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |

Ryzen-7950X CPU

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |

The only caveat is that quantization is quite slow: it takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.
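As a back-of-the-envelope check on the 2.1875 bpw figure, one accounting that produces it exactly, assuming super-blocks of 256 weights with a 16-bit super-block scale and eight 4-bit sub-block scales (a hypothetical layout for illustration, not taken from the PR), is

$$\frac{256 \cdot 2 + 16 + 8 \cdot 4}{256} = \frac{560}{256} = 2.1875 \ \text{bpw}.$$

The file sizes in the tables come out at a somewhat higher effective rate (2.30 GiB for 8.03 B parameters is about 2.46 bpw) because, as usual for llama.cpp-style quantizations, some tensors such as the token embeddings and output layer are kept at higher precision.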