mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-03 12:51:53 +00:00
35 lines
1.8 KiB
Markdown
35 lines
1.8 KiB
Markdown
### 🔀 [#83](https://github.com/ikawrakow/ik_llama.cpp/pull/83) - New SOTA quantization: 4.25 bpw IQ4_KS
|
|
|
|
| **Author** | `ikawrakow` |
|
|
| :--- | :--- |
|
|
| **State** | ❌ **Closed** |
|
|
| **Created** | 2024-10-09 |
|
|
| **Updated** | 2024-10-09 |
|
|
|
|
---
|
|
|
|
#### Description
|
|
|
|
It is similar to `IQ4_K` with the following difference
|
|
* Blocks of 32 instead of blocks of 16
|
|
* Row-wise `float` scale instead of per block instead of per super-block `ggml_half`
|
|
* 7-bit block scales instead of 6-bit - needed to ensure enough precision when using per row float scale
|
|
|
|
It ends up being 4.25 bpw, so the same as `IQ4_XS`. Why add it then? Because it has a lower quantization error than `IQ4_XS`. For some models the difference is quite significant. The following table gives some examples. Quantization error `Qerr` is defined as `PPL(Q)/PPL(f16)-1`
|
|
|
|
| Model | Qerr(IQ4_XS) | Qerr(IQ4_KS) |
|
|
| :------- | ---: | ---: |
|
|
| LLaMA-3.1-8B | 2.82% | 2.68% |
|
|
| LLaMA-3.1-8B-Instruct | 2.54% | 1.85% |
|
|
| LLaMA-3.2-3B-Instruct | 2.45% | 2.13% |
|
|
| Qwen-2.5-7B-Instruct | 2.31% | 1.62% |
|
|
| Qwen-2.5-32B-Instruct | 2.17% | 1.82% |
|
|
| Nemo-Instruct-2407 | 1.592% | 1.579% |
|
|
| Gemma-2-9B | 1.33% | 0.92% |
|
|
| Gemma-2-27B-Instruct | 1.23% | 0.72% |
|
|
|
|
Performance is similar to `IQ4_XS` or even slightly better, except for TG on the M2-Max GPU, where it is ~2% slower (Apple Silicon does not like non-sequential memory access, but having the row scale stored at the beginning of the row causes an additional memory jump in the dot product kernel).
|
|
|
|
The PR also adds a new quantization mix - `IQ3_KL` (`L` for "large"). It fills the gap between `IQ4_K` and `IQ4_K` (and now `IQ4_KS`). The following graph illustrates where this new mix sits for LLaMA-3.1-8B-Instruct.
|
|
|
|
 |