ik_llama.cpp/github-data/pull_requests/566 - Adding IQ3_KS quants.md
2025-07-23 13:31:53 +02:00

### 🔀 [#566](https://github.com/ikawrakow/ik_llama.cpp/pull/566) - Adding IQ3_KS quants
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-07-01 |
| **Updated** | 2025-07-02 |
---
#### Description
This PR adds `IQ3_KS`, a 3.1875 bpw quantization type with a block size of 32. This completes the `IQX_KS` quant series:
| type | bpw |
| ---: | ---: |
| IQ2_KS | 2.1875 |
| IQ3_KS | 3.1875 |
| IQ4_KS | 4.25 |
| IQ5_KS | 5.25 |
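The fractional bpw values above follow directly from the stated block size of 32: each block carries a few bits of metadata on top of the base per-weight bits. A small sketch of that arithmetic, plus rough weight-only size estimates (the 8.03e9 parameter count for an 8B model is an assumption for illustration; real GGUF files also contain non-quantized tensors and metadata, so actual files are somewhat larger):

```python
BLOCK = 32  # block size stated in the PR

quants = {
    "IQ2_KS": 2.1875,
    "IQ3_KS": 3.1875,
    "IQ4_KS": 4.25,
    "IQ5_KS": 5.25,
}

def metadata_bits_per_block(bpw: float, base_bits: int) -> float:
    """Extra bits per 32-weight block beyond the base per-weight bits."""
    return (bpw - base_bits) * BLOCK

def approx_size_gib(bpw: float, n_params: float = 8.03e9) -> float:
    """Back-of-the-envelope weight-only size: bpw * params / 8 bytes, in GiB."""
    return bpw * n_params / 8 / 2**30

for name, bpw in quants.items():
    base = int(bpw)  # 2, 3, 4, 5 for the four KS types
    print(f"{name}: {metadata_bits_per_block(bpw, base):.0f} extra bits/block, "
          f"~{approx_size_gib(bpw):.2f} GiB of quantized weights for 8B params")
```

For example, `IQ3_KS` at 3.1875 bpw works out to 6 metadata bits per block and roughly 3 GiB of quantized weights for an 8B model.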
CUDA and CPU performance are very good; Metal is not so great.
Here are a few sweep-bench results for Llama-3.1-8B-Instruct.
### RTX-4080
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 512 | 0.065 | 7932.94 | 0.887 | 144.38 |
| 512 | 128 | 1024 | 0.066 | 7725.27 | 0.893 | 143.35 |
| 512 | 128 | 1536 | 0.068 | 7551.51 | 0.908 | 141.02 |
| 512 | 128 | 2048 | 0.069 | 7404.30 | 0.924 | 138.59 |
| 512 | 128 | 2560 | 0.072 | 7098.39 | 0.939 | 136.30 |
| 512 | 128 | 3072 | 0.074 | 6873.96 | 0.955 | 134.08 |
| 512 | 128 | 3584 | 0.074 | 6890.43 | 0.969 | 132.07 |
| 512 | 128 | 4096 | 0.077 | 6620.20 | 0.987 | 129.64 |
| 512 | 128 | 4608 | 0.079 | 6445.44 | 1.000 | 128.00 |
| 512 | 128 | 5120 | 0.081 | 6350.94 | 1.026 | 124.82 |
| 512 | 128 | 5632 | 0.083 | 6175.82 | 1.033 | 123.97 |
| 512 | 128 | 6144 | 0.084 | 6071.67 | 1.043 | 122.77 |
| 512 | 128 | 6656 | 0.086 | 5944.16 | 1.057 | 121.15 |
| 512 | 128 | 7168 | 0.088 | 5810.65 | 1.071 | 119.46 |
| 512 | 128 | 7680 | 0.090 | 5693.89 | 1.087 | 117.77 |
### Ryzen-7950X (Zen4)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.423 | 359.79 | 7.616 | 16.81 |
| 512 | 128 | 512 | 1.479 | 346.15 | 7.800 | 16.41 |
| 512 | 128 | 1024 | 1.537 | 333.06 | 7.979 | 16.04 |
| 512 | 128 | 1536 | 1.603 | 319.47 | 7.939 | 16.12 |
| 512 | 128 | 2048 | 1.661 | 308.29 | 7.984 | 16.03 |
| 512 | 128 | 2560 | 1.722 | 297.39 | 8.071 | 15.86 |
| 512 | 128 | 3072 | 1.778 | 287.90 | 8.154 | 15.70 |
| 512 | 128 | 3584 | 1.841 | 278.04 | 8.241 | 15.53 |
### Ryzen-5975WX
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.697 | 301.64 | 6.933 | 18.46 |
| 512 | 128 | 512 | 1.760 | 290.91 | 7.062 | 18.13 |
| 512 | 128 | 1024 | 1.834 | 279.19 | 7.217 | 17.74 |
| 512 | 128 | 1536 | 1.910 | 268.03 | 7.414 | 17.26 |
| 512 | 128 | 2048 | 1.985 | 257.88 | 7.555 | 16.94 |
| 512 | 128 | 2560 | 2.062 | 248.26 | 7.666 | 16.70 |
| 512 | 128 | 3072 | 2.140 | 239.29 | 7.810 | 16.39 |
| 512 | 128 | 3584 | 2.217 | 230.98 | 7.987 | 16.03 |
### M2-Max CPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.119 | 164.13 | 5.410 | 23.66 |
| 512 | 128 | 512 | 3.322 | 154.14 | 5.487 | 23.33 |
| 512 | 128 | 1024 | 3.614 | 141.66 | 5.658 | 22.62 |
| 512 | 128 | 1536 | 3.872 | 132.23 | 5.735 | 22.32 |
| 512 | 128 | 2048 | 4.089 | 125.21 | 5.911 | 21.65 |
### M2-Max 30-core GPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.088 | 470.79 | 3.255 | 39.33 |
| 512 | 128 | 512 | 1.106 | 462.77 | 3.411 | 37.53 |
| 512 | 128 | 1024 | 1.126 | 454.85 | 3.579 | 35.77 |
| 512 | 128 | 1536 | 1.153 | 444.08 | 3.762 | 34.03 |
| 512 | 128 | 2048 | 1.178 | 434.48 | 3.965 | 32.28 |
| 512 | 128 | 2560 | 1.207 | 424.23 | 4.118 | 31.08 |
| 512 | 128 | 3072 | 1.235 | 414.51 | 4.290 | 29.84 |
| 512 | 128 | 3584 | 1.265 | 404.69 | 4.461 | 28.69 |
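For readers unfamiliar with the sweep-bench output: `S_PP` and `S_TG` are simply tokens divided by elapsed time for the prompt-processing and token-generation phases at each KV-cache depth `N_KV`. A quick sanity check of that relationship, using two rows copied from the M2-Max 30-core GPU table (small mismatches come from the timings being rounded to three decimals in the printout):

```python
# (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s) from the table above
rows = [
    (512, 128, 0,    1.088, 470.79, 3.255, 39.33),
    (512, 128, 2048, 1.178, 434.48, 3.965, 32.28),
]

for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in rows:
    # prompt-processing throughput: tokens processed / time taken
    assert abs(pp / t_pp - s_pp) / s_pp < 0.01
    # token-generation throughput: tokens generated / time taken
    assert abs(tg / t_tg - s_tg) / s_tg < 0.01

print("sweep-bench columns are self-consistent")
```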
---
#### 💬 Conversation
👤 **ikawrakow** commented the **2025-07-02** at **07:27:42**:<br>
Let's merge this so people don't get crashes when trying to run `IQ3_KS` models with the main branch.
---
👤 **Nexesenex** commented the **2025-07-02** at **15:01:59**:<br>
Thanks for the explanation; I understand that the alternatives you have at the moment are quite impractical.
In any case, thank you for `IQ3_KS` (and for the CUDA MMQ kernels you kindly provided for most quants). It completes the KS quant family, which is more practical to quantize than the already very demanding trellis quants. I'm very happy with all of this, compared to what mainline limits itself to at the moment.