mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-03 04:41:47 +00:00
34 lines
3.4 KiB
Markdown
### 🔀 [#139](https://github.com/ikawrakow/ik_llama.cpp/pull/139) - Faster R4 quants on Zen4
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-13 |
| **Updated** | 2024-12-13 |
---
#### Description
Use integer accumulators for dot products within superblocks. I did not use this originally because, according to [this Intel reference](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6440,3715,4851,465,488,6424,488,4200,6554,83,4843,5760,5740,6548,6548,852,3669,6205,6205,3669,3675,5750,6375,6437,3869,2675,2675,3850,3869,2946,2946,308,1741,6044,6073,6585,7030,4851,4874,6196,6068,1741,4760,6077,4236,3667,4236,488,4044,3669,5741,6009,3869,691,5303,3843,3667,4843,110,5743,4772,1741,4046,4044,6077,4860,4860,3715,1866,1866,1866,4044,1863,1866,1866,3707,3715,5114,3667,3667,3667,5831,5738,3669,92,2692,4110,4203,4239,3869,94,853,856,1598,4953,6068,5997,4851,5997,4953,4931,6571,420,5068,488,488,4998,5010,3847,3842,4897,114,6007,4863,4761,6005,6008,3910,882,3921,6008,5002,6007,6598,1159,1159,144,828,486,823,299,337,823,4838,4239,2692,1607,6077,6006,4860,828,486,5704,6007,6007,6009,882,2692,2705,473,6007,3866,6007,4239,114,84,344,6006,5002,3869,5824,4690,143,4874,5234,5251,823,5234,2103,2662,2936,3670,2124,1664,5234,2632,5256,5234,5234,1622,461,1583,2252,4772,823,674,344,5234,2629,4175,5506,5512,5500,6189,6424,2692,2705,2671,5997,4986,679,2943,4960,4990,6068,6059,3667,6068,1750,1753,6189,2962,6053,4949,7003,7021,2930,3667,6077,782,6604,5086,6000,6047,6000,5997,6006,6000,6009,6000,6411,770,2938,4236,2965,6053,1753,1866,463,6050,2932,5798,6050,2932,6050,2930,5997,5053,4953,5994,6000,5056,2962,5056,6053,613,6000,6000,5056,2962,4642,4772,6601,1619,4772,6053,5041,4772&text=_mm256_mullo_epi32) the `_mm256_mullo_epi32()` instruction has an extremely high latency. But given that on `ARM_NEON` the use of integer dot-product accumulation resulted in a significant performance boost (see #135), I decided to try it anyway. Outcome: it is faster, despite the high latency of the integer multiplication.
Here are PP-512 and TG-128 measurements for LLaMA-3.1-8B on Zen4 (Ryzen-7950X CPU):
| Quant | Threads | Task | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: | ---: | ---: |
| Q2_K_R4 | 16 | pp512 | 256.19 ± 0.26 | 272.69 ± 0.13 | 1.064 |
| | 1 | tg128 | 9.08 ± 0.12 | 9.95 ± 0.0 | 1.096 |
| | 2 | tg128 | 16.40 ± 0.00 | 17.44 ± 0.01 | 1.063 |
| | 4 | tg128 | 20.72 ± 0.12 | 20.97 ± 0.08 | 1.012 |
| Q3_K_R4 | 16 | pp512 | 236.77 ± 0.35 | 255.84 ± 0.20 | 1.081 |
| | 1 | tg128 | 6.78 ± 0.00 | 7.16 ± 0.07 | 1.056 |
| | 2 | tg128 | 12.46 ± 0.00 | 13.00 ± 0.01 | 1.043 |
| | 4 | tg128 | 17.02 ± 0.09 | 17.20 ± 0.24 | 1.012 |
| Q4_K_R4 | 16 | pp512 | 262.40 ± 0.28 | 268.09 ± 0.12 | 1.022 |
| IQ4_XS_R4 | 16 | pp512 | 256.80 ± 0.35 | 271.95 ± 0.39 | 1.059 |
| Q5_K_R4 | 16 | pp512 | 248.30 ± 0.29 | 256.68 ± 0.31 | 1.034 |
| Q6_K_R4 | 16 | pp512 | 243.25 ± 0.31 | 261.33 ± 0.38 | 1.074 |
| | 1 | tg128 | 7.94 ± 0.00 | 8.34 ± 0.00 | 1.050 |
| | 2 | tg128 | 10.38 ± 0.00 | 10.38 ± 0.00 | 1.000 |
For `Q4_K_R4`, `Q5_K_R4`, and `IQ4_XS_R4`, matrix-vector multiplications are done with a different implementation where this change is not applicable, so there are no TG results for those.