### 🔀 [#85](https://github.com/ikawrakow/ik_llama.cpp/pull/85) - IQ2_KS: 2.1875 bpw non-linear quantization | **Author** | `ikawrakow` | | :--- | :--- | | **State** | ❌ **Closed** | | **Created** | 2024-10-13 | | **Updated** | 2024-10-13 | --- #### Description It ends up being somewhere in the middle between `IQ2_XXS` and `IQ2_XS` in terms of quantized model size and quantization accuracy. This graph shows quantization error vs bpw for LLaMA-3.1-8B-Instruct ![il31a](https://github.com/user-attachments/assets/6656173b-075e-4e50-a849-86a326561e10) What is the point, then? Two points: * Another proof that one can extend quantization to very low bpw **without using a codebook**. My previous attempts to do that have not been successful, so I'm quite pleased with this outcome * Much better CPU performance compared to `IQ2_XXS` or `IQ2_XS` (or any of the i-quants that uses a codebook), see tables. **M2-Max CPU** | model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: | | llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 46.86 ± 0.05 | | llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 72.27 ± 0.19 | | llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 18.83 ± 0.06 | | llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | ARM_NEON | 8 | tg128 | 34.50 ± 0.30 | **Ryzen-7950X CPU** | model | size | params | backend | threads | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: | | llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 16 | pp512 | 128.88 ± 0.21 | | llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 16 | pp512 | 187.56 ± 1.01 | | llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 4 | tg128 | 11.91 ± 0.01 | | llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 4 | tg128 | 21.05 ± 0.01 | | llama 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8.03 B | Zen4 | 8 | tg128 | 20.55 ± 0.01 | | llama 8B IQ2_KS - 2.1875 bpw | 2.30 GiB | 8.03 B | Zen4 | 8 | tg128 | 23.61 ± 0.20 | The only caveat: quantization is really slow: It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.