🔀 #85 - IQ2_KS: 2.1875 bpw non-linear quantization
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-10-13 |
| Updated | 2024-10-13 |
Description
`IQ2_KS` ends up somewhere in the middle between `IQ2_XXS` and `IQ2_XS` in terms of quantized model size and quantization accuracy. (Graph: quantization error vs. bpw for LLaMA-3.1-8B-Instruct.)
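For reference, here is one block layout consistent with 2.1875 bpw (illustrative only; the PR does not spell out the format, and the field names below are assumptions): a 256-weight super-block of 70 bytes, i.e. 512 bits of 2-bit quants plus 48 bits of per-block metadata.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical super-block layout that yields exactly 2.1875 bpw.
// 70 bytes per 256 weights: 560 bits / 256 = 2.1875.
struct block_iq2_ks_sketch {
    uint16_t extra;      //  16 bits of per-group flags (assumed)
    uint8_t  scales[4];  //  32 bits of block scales (assumed)
    uint8_t  qs[64];     // 512 bits: 256 weights x 2 bits each
};

int main() {
    static_assert(sizeof(block_iq2_ks_sketch) == 70, "70 bytes per 256 weights");
    printf("bpw = %.4f\n", 8.0 * sizeof(block_iq2_ks_sketch) / 256.0); // 2.1875
    return 0;
}
```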
What is the point, then? Two points:
- Another proof that one can extend quantization to very low bpw without using a codebook. My previous attempts to do that had not been successful, so I'm quite pleased with this outcome.
- Much better CPU performance compared to `IQ2_XXS` or `IQ2_XS` (or any of the i-quants that use a codebook); see the tables below and the sketch after this list.
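To see why the codebook matters for CPU speed, here is a deliberately simplified C++ sketch (not the actual kernels; the layout and the four non-linear levels are placeholder assumptions). A codebook quant has to gather a packed group from a large table for every few weights, while a codebook-free quant decodes each 2-bit index with shifts, masks, and a tiny table that fits in a single SIMD register.

```cpp
#include <cstdint>

// Codebook-style decode (IQ2_XXS/IQ2_XS family, schematically): each group
// of 8 weights is an index into a large grid -> one dependent table load
// per group, which is awkward to vectorize.
static const uint64_t grid[256] = {}; // stand-in for the real codebook
static void decode_codebook(const uint8_t *idx, int ngroups, int8_t *out) {
    for (int g = 0; g < ngroups; ++g) {
        uint64_t packed = grid[idx[g]];              // gather from memory
        for (int j = 0; j < 8; ++j)
            out[8*g + j] = (int8_t)((packed >> 8*j) & 0xff);
    }
}

// Codebook-free decode (IQ2_KS-like, schematically): every weight is a
// 2-bit code mapped through a 4-entry non-linear table, so the decode is
// just shifts, masks, and an in-register shuffle.
static const int8_t values[4] = {-24, -8, 4, 20};   // assumed levels
static void decode_direct(const uint8_t *qs, int nbytes, int8_t *out) {
    for (int i = 0; i < nbytes; ++i)
        for (int j = 0; j < 4; ++j)
            out[4*i + j] = values[(qs[i] >> 2*j) & 3];
}

int main() {
    uint8_t idx[2] = {0, 1};
    uint8_t qs[4]  = {0x1b, 0xe4, 0x00, 0xff};
    int8_t a[16], b[16];
    decode_codebook(idx, 2, a);
    decode_direct(qs, 4, b);
    (void)a; (void)b;
    return 0;
}
```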
M2-Max CPU
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |
Ryzen-7950X CPU
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |
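As a sanity check on the sizes above (illustrative arithmetic, not part of the benchmark): the files come out somewhat larger than the nominal bpw suggests because some tensors, typically the token embeddings and the output matrix, are kept at higher precision.

```cpp
#include <cstdio>

int main() {
    // Numbers taken from the tables above.
    const double params   = 8.03e9;  // model parameters
    const double size_gib = 2.30;    // IQ2_KS file size
    const double GiB      = 1024.0 * 1024.0 * 1024.0;

    // Effective bits per weight over the whole file: ~2.46 bpw.
    printf("effective bpw: %.2f\n", size_gib * GiB * 8.0 / params);

    // What a pure 2.1875 bpw model would occupy: ~2.04 GiB.
    printf("pure IQ2_KS size: %.2f GiB\n", params * 2.1875 / 8.0 / GiB);
    return 0;
}
```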
The only caveat: quantization is really slow. It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.