### 🔀 [#85](https://github.com/ikawrakow/ik_llama.cpp/pull/85) - IQ2_KS: 2.1875 bpw non-linear quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-13 |
| **Updated** | 2024-10-13 |
---
#### Description
The new `IQ2_KS` quantization type ends up roughly in the middle between `IQ2_XXS` and `IQ2_XS` in terms of quantized model size and quantization accuracy. The graph below shows quantization error vs. bpw for LLaMA-3.1-8B-Instruct:
![il31a](https://github.com/user-attachments/assets/6656173b-075e-4e50-a849-86a326561e10)
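(As a sanity check on the model sizes in the tables below: 8.03 B weights at 2.1875 bpw come to 8.03e9 × 2.1875 / 8 ≈ 2.20e9 bytes ≈ 2.05 GiB. The reported 2.30 GiB is somewhat larger, presumably because, as is usual for such quantization mixes, some tensors are kept at a higher bpw.)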
So what is the point? Two things:
* Another proof that one can extend quantization to very low bpw **without using a codebook**. My previous attempts to do that have not been successful, so I'm quite pleased with this outcome (the sketch after this list illustrates the general idea).
* Much better CPU performance compared to `IQ2_XXS` or `IQ2_XS` (or any of the i-quants that use a codebook); see the tables below.
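To make the "no codebook" point concrete: in codebook-based i-quants such as `IQ2_XXS`/`IQ2_XS`, groups of eight weights are jointly constrained to entries of a large lookup table, whereas a codebook-free non-linear quant lets every weight independently index a small non-uniform value grid. Below is a minimal C++ sketch of the latter idea; the 4-entry grid, the block size of 32, and all names are illustrative assumptions, not the actual `IQ2_KS` implementation:

```cpp
// Minimal sketch of codebook-free ("non-linear") 2-bit quantization.
// The grid values, block size, and names are assumptions for illustration.
#include <algorithm>
#include <cmath>
#include <cstdint>

// A small non-uniform value grid: each weight maps directly to one of
// these values, scaled per block -- no codebook over groups of weights.
static const float kGrid[4] = {-2.4f, -0.8f, 0.6f, 2.2f};

struct BlockQ2 {
    float   scale;    // one scale per block of 32 weights
    uint8_t idx[32];  // 2-bit index per weight (kept unpacked for clarity)
};

// Quantize one block: pick the nearest grid value for each weight,
// then refit the block scale by least squares.
static BlockQ2 quantize_block(const float *x) {
    BlockQ2 b{};
    float amax = 0.f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    // Initial scale maps the largest-magnitude weight onto the grid edge.
    float scale = amax > 0.f ? amax / 2.4f : 1.f;
    float sxg = 0.f, sgg = 0.f;
    for (int i = 0; i < 32; ++i) {
        int   best  = 0;
        float bestd = std::fabs(x[i] - scale * kGrid[0]);
        for (int j = 1; j < 4; ++j) {
            float d = std::fabs(x[i] - scale * kGrid[j]);
            if (d < bestd) { bestd = d; best = j; }
        }
        b.idx[i] = (uint8_t)best;
        sxg += x[i] * kGrid[best];
        sgg += kGrid[best] * kGrid[best];
    }
    b.scale = sgg > 0.f ? sxg / sgg : scale; // least-squares scale refit
    return b;
}

// Dequantize: a plain per-weight table lookup, with no codebook decoding
// over weight groups -- this is what keeps the CPU kernels simple and fast.
static void dequantize_block(const BlockQ2 &b, float *y) {
    for (int i = 0; i < 32; ++i) y[i] = b.scale * kGrid[b.idx[i]];
}
```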
**M2-Max CPU**
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |
**Ryzen-7950X CPU**
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |
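For context, rows like these are what `llama-bench` reports; a hypothetical invocation for the 8-thread rows would be `./llama-bench -m llama-8b-iq2_ks.gguf -t 8 -p 512 -n 128` (model file name assumed). Here `pp512` and `tg128` denote the 512-token prompt-processing and 128-token text-generation tests, and `t/s` is tokens per second.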
The only caveat: quantization itself is really slow. It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.
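(For completeness: producing such a model goes through the repo's usual quantization tool, along the lines of `./llama-quantize Meta-Llama-3.1-8B-f16.gguf out-iq2_ks.gguf IQ2_KS`; the binary and file names here are assumptions.)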