### 🔀 [#516](https://github.com/ikawrakow/ik_llama.cpp/pull/516) - Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2)

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-11 |
| **Updated** | 2025-06-11 |

---

#### Description

This PR is a follow up of #515, and applies the same technique to `IQ3_XXS`. We see nearly 3X increase in prompt processing speed compared to `IQ3_XXS`, and over 2X compared to `IQ3_XXS_R4`.

Sweep-bench for pure `IQ3_XXS` quantization of LlaMA-3.1-8B on a Ryzen-7950X CPU:

### IQ3_XXS, main branch

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|  
|   512 |    128 |      0 |    5.023 |   101.94 |    7.365 |    17.38 |   
|   512 |    128 |    512 |    5.281 |    96.96 |    8.088 |    15.83 |   
|   512 |    128 |   1024 |    5.170 |    99.03 |    7.977 |    16.05 |   
|   512 |    128 |   1536 |    5.324 |    96.16 |    7.942 |    16.12 |   
|   512 |    128 |   2048 |    5.389 |    95.02 |    8.043 |    15.91 |

### IQ3_XXS_R4, main branch

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    3.836 |   133.47 |    7.675 |    16.68 |
|   512 |    128 |    512 |    3.687 |   138.87 |    8.279 |    15.46 |
|   512 |    128 |   1024 |    3.805 |   134.57 |    8.245 |    15.53 |
|   512 |    128 |   1536 |    3.906 |   131.08 |    8.252 |    15.51 |
|   512 |    128 |   2048 |    4.076 |   125.61 |    8.545 |    14.98 |

### IQ3_XXS, PR

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    1.730 |   296.01 |    7.641 |    16.75 |
|   512 |    128 |    512 |    1.807 |   283.30 |    8.333 |    15.36 |
|   512 |    128 |   1024 |    1.896 |   269.98 |    8.070 |    15.86 |
|   512 |    128 |   1536 |    1.978 |   258.78 |    8.481 |    15.09 |
|   512 |    128 |   2048 |    2.062 |   248.32 |    8.514 |    15.03 |