### 🔀 [#85](https://github.com/ikawrakow/ik_llama.cpp/pull/85) - IQ2_KS: 2.1875 bpw non-linear quantization

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-13 |
| **Updated** | 2024-10-13 |

---

#### Description
The new `IQ2_KS` quantization ends up somewhere in the middle between `IQ2_XXS` and `IQ2_XS`, both in quantized model size and in quantization accuracy. This graph shows quantization error vs. bpw for LLaMA-3.1-8B-Instruct:


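
As a rough sanity check on what 2.1875 bpw means in file size, the sketch below redoes the arithmetic against the 2.30 GiB figure reported in the tables further down; the remark about which tensors stay at higher precision reflects the usual llama.cpp quant-mix convention and is not taken from this PR.

```cpp
#include <cstdio>

// Back-of-the-envelope size estimate for a 2.1875 bpw quantization of an
// 8.03B-parameter model (LLaMA-3.1-8B-Instruct).
int main() {
    const double n_params = 8.03e9;
    const double bpw      = 2.1875;   // nominal IQ2_KS bits per weight

    const double gib       = 1024.0 * 1024.0 * 1024.0;
    const double ideal_gib = n_params * bpw / 8.0 / gib;
    std::printf("all tensors at %.4f bpw: %.2f GiB\n", bpw, ideal_gib);  // ~2.05 GiB

    // The reported size is 2.30 GiB (~2.46 bpw effective), because quant mixes
    // usually keep a few tensors (e.g. output/embeddings) at higher precision.
    // Which tensors those are here is an assumption, not stated in the PR.
    return 0;
}
```
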
What is the point, then? Two points:
* Another proof that one can extend quantization to very low bpw **without using a codebook**. My previous attempts to do that have not been successful, so I'm quite pleased with this outcome.
* Much better CPU performance compared to `IQ2_XXS` or `IQ2_XS` (or any of the i-quants that use a codebook); see the tables below, and the sketch after this list for a rough picture of why.
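
To make the codebook point more concrete, here is a minimal sketch under stated assumptions: the struct layout, the group/scale split, and the `kLevels` values are hypothetical illustrations, not the actual `IQ2_KS` format. The only hard number used is the bit budget, 2.1875 bpw × 256 weights = 560 bits = 70 bytes per super-block, which presumably splits into 64 bytes of 2-bit indices plus 6 bytes of scales/metadata. A codebook quant such as `IQ2_XXS` instead spends its index bits addressing a grid of several hundred entries in memory, so dequantization needs table gathers; a 4-entry non-linear value set fits in a register (e.g. a single NEON `vqtbl1q_u8` shuffle).

```cpp
#include <cstdint>

// Hypothetical 2.1875 bpw super-block: 256 weights in 70 bytes.
// 256 x 2 bits = 64 bytes of packed indices; the remaining 6 bytes are
// assumed here to hold a shared fp16 scale plus 4-bit group scales.
struct BlockSketch {
    uint16_t d;           // super-block scale as fp16 bits (assumption)
    uint8_t  scales[4];   // two 4-bit scales per byte, 8 groups of 32 (assumption)
    uint8_t  qs[64];      // 256 x 2-bit indices
};
static_assert(sizeof(BlockSketch) == 70, "2.1875 bpw * 256 weights = 70 bytes");

// Codebook-free decode: a 2-bit index selects one of 4 non-linear levels.
// The 4-entry table fits in a SIMD register, so no memory gathers are needed.
// (The values themselves are made up for illustration.)
static const int8_t kLevels[4] = {-24, -9, 4, 21};

void dequant_sketch(const BlockSketch& b, float d_super, float* out) {
    // d_super would come from converting b.d out of fp16; it is passed in
    // here to keep the sketch free of an fp16 helper.
    for (int g = 0; g < 8; ++g) {                 // 8 groups of 32 weights
        int   nib = (g & 1) ? (b.scales[g/2] >> 4) : (b.scales[g/2] & 0xf);
        float s   = d_super * (nib - 8);          // centred 4-bit group scale
        for (int i = 0; i < 32; ++i) {
            int idx  = 32*g + i;
            int q    = (b.qs[idx/4] >> (2*(idx & 3))) & 3;   // unpack 2 bits
            out[idx] = s * kLevels[q];
        }
    }
}
```

Whether the real implementation uses exactly this grouping does not matter for the argument; the point is that no large in-memory grid is involved in the decode.
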
**M2-Max CPU**

| model                          |       size |     params | backend  | threads |  test |          t/s |
| ------------------------------ | ---------: | ---------: | -------- | ------: | ----: | -----------: |
| llama 8B IQ2_XS - 2.3125 bpw   |   2.42 GiB |     8.03 B | ARM_NEON |       8 | pp512 | 46.86 ± 0.05 |
| llama 8B IQ2_KS - 2.1875 bpw   |   2.30 GiB |     8.03 B | ARM_NEON |       8 | pp512 | 72.27 ± 0.19 |
| llama 8B IQ2_XS - 2.3125 bpw   |   2.42 GiB |     8.03 B | ARM_NEON |       8 | tg128 | 18.83 ± 0.06 |
| llama 8B IQ2_KS - 2.1875 bpw   |   2.30 GiB |     8.03 B | ARM_NEON |       8 | tg128 | 34.50 ± 0.30 |

**Ryzen-7950X CPU**

| model                          |       size |     params | backend | threads |  test |           t/s |
| ------------------------------ | ---------: | ---------: | ------- | ------: | ----: | ------------: |
| llama 8B IQ2_XS - 2.3125 bpw   |   2.42 GiB |     8.03 B | Zen4    |      16 | pp512 | 128.88 ± 0.21 |
| llama 8B IQ2_KS - 2.1875 bpw   |   2.30 GiB |     8.03 B | Zen4    |      16 | pp512 | 187.56 ± 1.01 |
| llama 8B IQ2_XS - 2.3125 bpw   |   2.42 GiB |     8.03 B | Zen4    |       4 | tg128 |  11.91 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw   |   2.30 GiB |     8.03 B | Zen4    |       4 | tg128 |  21.05 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw   |   2.42 GiB |     8.03 B | Zen4    |       8 | tg128 |  20.55 ± 0.01 |
| llama 8B IQ2_KS - 2.1875 bpw   |   2.30 GiB |     8.03 B | Zen4    |       8 | tg128 |  23.61 ± 0.20 |

The only caveat: quantization is really slow. It takes 270 seconds on a Ryzen-7950X to quantize LLaMA-3.1-8B.