mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-01-26 09:09:50 +00:00
1.7 KiB
🔀 #37 - Performance improvements for legacy quants on ARM_NEON
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-09-03 |
| Updated | 2024-09-04 |
Description
If we process 2 rows of the left matrix at a time, we get a performance boost in the range of 20% for PP-512 (except for Q8_0, where performance was already higher than for the other quants). The table summarizes results for LLaMA-3.1-8B on an M2-Max CPU. As I like keeping track of how we perform relative to mainline llama.cpp, the table also includes results for the current llama.cpp build (69a480a (3660)). tinyBLAS is enabled in llama.cpp, so the 33.3% (Q4_0) and 16.6% (Q8_0) improvements are relative to tinyBLAS; tinyBLAS does not provide an implementation for Q4_1, Q5_0 and Q5_1, so the performance gap is correspondingly much larger for those quants.
| Quants | t/s (llama.cpp) | t/s (main) | t/s (PR) | Speedup vs main | Speedup vs llama.cpp |
|---|---|---|---|---|---|
| Q4_0 | 65.45 ± 0.01 | 72.88 ± 0.61 | 87.22 ± 0.85 | 1.197 | 1.333 |
| Q4_1 | 35.18 ± 0.51 | 59.95 ± 1.26 | 73.87 ± 0.47 | 1.232 | 2.100 |
| Q5_0 | 26.69 ± 0.35 | 62.63 ± 1.47 | 74.32 ± 0.13 | 1.187 | 2.785 |
| Q5_1 | 23.33 ± 0.06 | 52.83 ± 1.32 | 60.79 ± 0.19 | 1.151 | 2.606 |
| Q8_0 | 75.44 ± 1.84 | 85.08 ± 1.74 | 88.01 ± 0.11 | 1.034 | 1.166 |