### 🔀 [#85](https://github.com/ikawrakow/ik_llama.cpp/pull/85) - IQ2_KS: 2.1875 bpw non-linear quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-13 |
| **Updated** | 2024-10-13 |
---
#### Description
The new `IQ2_KS` quantization type ends up roughly in the middle between `IQ2_XXS` and `IQ2_XS` in terms of quantized model size and quantization accuracy. The graph below shows quantization error vs. bpw for LLaMA-3.1-8B-Instruct:
![il31a](https://github.com/user-attachments/assets/6656173b-075e-4e50-a849-86a326561e10)
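(As a sanity check on the model sizes in the tables below: 8.03 B weights at 2.1875 bpw come to 8.03e9 × 2.1875 / 8 ≈ 2.20e9 bytes ≈ 2.05 GiB. The reported 2.30 GiB is somewhat larger, presumably because, as is usual for such quantization mixes, some tensors are kept at a higher bpw.)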
So what is the point? Two things:
* Another proof that one can extend quantization to very low bpw **without using a codebook**. My previous attempts to do that have not been successful, so I'm quite pleased with this outcome (the sketch after this list illustrates the general idea).
* Much better CPU performance compared to `IQ2_XXS` or `IQ2_XS` (or any of the i-quants that use a codebook); see the tables below.
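To make the "no codebook" point concrete: in codebook-based i-quants such as `IQ2_XXS`/`IQ2_XS`, groups of eight weights are jointly constrained to entries of a large lookup table, whereas a codebook-free non-linear quant lets every weight independently index a small non-uniform value grid. Below is a minimal C++ sketch of the latter idea; the 4-entry grid, the block size of 32, and all names are illustrative assumptions, not the actual `IQ2_KS` implementation:

```cpp
// Minimal sketch of codebook-free ("non-linear") 2-bit quantization.
// The grid values, block size, and names are assumptions for illustration.
#include <algorithm>
#include <cmath>
#include <cstdint>

// A small non-uniform value grid: each weight maps directly to one of
// these values, scaled per block -- no codebook over groups of weights.
static const float kGrid[4] = {-2.4f, -0.8f, 0.6f, 2.2f};

struct BlockQ2 {
    float   scale;    // one scale per block of 32 weights
    uint8_t idx[32];  // 2-bit index per weight (kept unpacked for clarity)
};

// Quantize one block: pick the nearest grid value for each weight,
// then refit the block scale by least squares.
static BlockQ2 quantize_block(const float *x) {
    BlockQ2 b{};
    float amax = 0.f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    // Initial scale maps the largest-magnitude weight onto the grid edge.
    float scale = amax > 0.f ? amax / 2.4f : 1.f;
    float sxg = 0.f, sgg = 0.f;
    for (int i = 0; i < 32; ++i) {
        int   best  = 0;
        float bestd = std::fabs(x[i] - scale * kGrid[0]);
        for (int j = 1; j < 4; ++j) {
            float d = std::fabs(x[i] - scale * kGrid[j]);
            if (d < bestd) { bestd = d; best = j; }
        }
        b.idx[i] = (uint8_t)best;
        sxg += x[i] * kGrid[best];
        sgg += kGrid[best] * kGrid[best];
    }
    b.scale = sgg > 0.f ? sxg / sgg : scale; // least-squares scale refit
    return b;
}

// Dequantize: a plain per-weight table lookup, with no codebook decoding
// over weight groups -- this is what keeps the CPU kernels simple and fast.
static void dequantize_block(const BlockQ2 &b, float *y) {
    for (int i = 0; i < 32; ++i) y[i] = b.scale * kGrid[b.idx[i]];
}
```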
**M2-Max CPU**
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 |
**Ryzen-7950X CPU**
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 |
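For context, rows like these are what `llama-bench` reports; a hypothetical invocation for the 8-thread rows would be `./llama-bench -m llama-8b-iq2_ks.gguf -t 8 -p 512 -n 128` (model file name assumed). Here `pp512` and `tg128` denote the 512-token prompt-processing and 128-token text-generation tests, and `t/s` is tokens per second.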
The only caveat: quantization itself is really slow. It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.
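(For completeness: producing such a model goes through the repo's usual quantization tool, along the lines of `./llama-quantize Meta-Llama-3.1-8B-f16.gguf out-iq2_ks.gguf IQ2_KS`; the binary and file names here are assumptions.)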