### 🔀 [#37](https://github.com/ikawrakow/ik_llama.cpp/pull/37) - Performance improvements for legacy quants on ARM_NEON

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-03 |
| **Updated** | 2024-09-04 |

---

#### Description

If we process 2 rows of the left matrix at a time, we get a performance boost in the range of 20% for PP-512 (except for `Q8_0`, where performance was already higher than for the other quants). The table summarizes the results for LLaMA-3.1-8B on an M2-Max CPU. As I like keeping track of how we perform relative to mainline `llama.cpp`, the table includes results for the current `llama.cpp` build (`69a480a (3660)`). tinyBLAS is enabled in `llama.cpp`, so the 33% (`Q4_0`) or 16.6% (`Q8_0`) improvement is relative to tinyBLAS. tinyBLAS does not provide an implementation for `Q4_1`, `Q5_0` and `Q5_1`, and correspondingly the performance gap for those quants is much larger.

| Quants | t/s (llama.cpp) | t/s (main) | t/s (PR) | Speedup vs main | Speedup vs llama.cpp |
| ------- | -------------------: | ---------------: | ---------------: | ----------------: | --------------------: |
| Q4_0 | 65.45 ± 0.01 | 72.88 ± 0.61 | 87.22 ± 0.85 | 1.197 | 1.333 |
| Q4_1 | 35.18 ± 0.51 | 59.95 ± 1.26 | 73.87 ± 0.47 | 1.232 | 2.100 |
| Q5_0 | 26.69 ± 0.35 | 62.63 ± 1.47 | 74.32 ± 0.13 | 1.187 | 2.785 |
| Q5_1 | 23.33 ± 0.06 | 52.83 ± 1.32 | 60.79 ± 0.19 | 1.151 | 2.606 |
| Q8_0 | 75.44 ± 1.84 | 85.08 ± 1.74 | 88.01 ± 0.11 | 1.034 | 1.166 |
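
The idea of processing 2 rows of the left matrix at a time can be illustrated with a plain scalar sketch. This is not the NEON kernel from this PR: the function names `matmul_one_row` and `matmul_two_rows`, the `float` data type, and the row-major layout are all assumptions made for illustration, and no quantization is involved. The sketch only shows the loop structure, where each value loaded from the right matrix feeds two accumulators instead of one. That reuse is presumably where the gain comes from, although the description does not say so.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: out[i][j] = dot(row i of x, row j of y).
// x is nx-by-k, y is ny-by-k, both row-major. One row of x at a time.
std::vector<float> matmul_one_row(const std::vector<float>& x, const std::vector<float>& y,
                                  std::size_t nx, std::size_t ny, std::size_t k) {
    assert(x.size() == nx * k && y.size() == ny * k);
    std::vector<float> out(nx * ny, 0.0f);
    for (std::size_t i = 0; i < nx; ++i) {
        const float* xi = x.data() + i * k;
        for (std::size_t j = 0; j < ny; ++j) {
            const float* yj = y.data() + j * k;
            float acc = 0.0f;
            for (std::size_t l = 0; l < k; ++l) acc += xi[l] * yj[l];
            out[i * ny + j] = acc;
        }
    }
    return out;
}

// Same result, but two rows of x are processed together so that every
// value read from y is used for two accumulators.
std::vector<float> matmul_two_rows(const std::vector<float>& x, const std::vector<float>& y,
                                   std::size_t nx, std::size_t ny, std::size_t k) {
    assert(x.size() == nx * k && y.size() == ny * k);
    std::vector<float> out(nx * ny, 0.0f);
    std::size_t i = 0;
    for (; i + 1 < nx; i += 2) {
        const float* x0 = x.data() + i * k;
        const float* x1 = x.data() + (i + 1) * k;
        for (std::size_t j = 0; j < ny; ++j) {
            const float* yj = y.data() + j * k;
            float acc0 = 0.0f, acc1 = 0.0f;
            for (std::size_t l = 0; l < k; ++l) {
                const float v = yj[l];  // loaded once, used twice
                acc0 += x0[l] * v;
                acc1 += x1[l] * v;
            }
            out[i * ny + j] = acc0;
            out[(i + 1) * ny + j] = acc1;
        }
    }
    if (i < nx) {  // leftover row when nx is odd
        const float* xi = x.data() + i * k;
        for (std::size_t j = 0; j < ny; ++j) {
            const float* yj = y.data() + j * k;
            float acc = 0.0f;
            for (std::size_t l = 0; l < k; ++l) acc += xi[l] * yj[l];
            out[i * ny + j] = acc;
        }
    }
    return out;
}
```

Both functions return the same values. In a real kernel the inner loop would operate on quantized blocks with SIMD registers, and the benefit would depend on register pressure and memory bandwidth on the target CPU.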