Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-03-09 05:20:01 +00:00
🔀 #179 - Minor performance improvements
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-01-27 |
| Updated | 2025-01-27 |
Description
This PR does two things:
- It changes `Q4_0_R4` to 8 interleaved rows.
- It adds the ability to apply platform-specific transformations to the tensor data while repacking.

Examples of the usage of 2.:
- On `ARM_NEON` it is useful to apply a `XOR` operation with the mask `0x88` to the `Q4_0` quants. This way one does not need to subtract `8` at run time. The tweak improves `Q4_0` PP performance by nearly 5% on my M2-Max CPU. It is not useful on `AVX2`/`Zen4`, so it becomes a platform-specific transformation applied when run-time-repacking on an `ARM_NEON` CPU.
- On `Zen4` one can add `128` to the signed `Q8` quants to make them unsigned, so they can be used directly in `_mmXXX_dpbusd_epi32()`. This improves `Q8_0` and `Q8_K_R8` performance by about 3%. The transformation is not useful on `ARM_NEON` (one needs signed `int8_t`'s) or on vanilla `AVX2` (the `_mm256_maddubs_epi16` dot product may overflow), so it only gets applied when repacking on `Zen4`.
The table below shows some PP-512 comparisons for LLaMA-3.1-8B for the affected quantization types, using Flash Attention and a Q8_0 KV-cache.
| model | backend | test | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|
| llama 8B Q4_0 | NEON | pp512 | 130.92 ± 0.10 | 137.39 ± 0.32 | 1.049 |
| llama 8B Q8_K_R8 | Zen4 | pp512 | 380.75 ± 1.52 | 390.40 ± 0.88 | 1.025 |
| llama 8B Q8_0 | Zen4 | pp512 | 295.62 ± 0.80 | 307.80 ± 0.34 | 1.041 |
| llama 8B Q4_0 | Zen4 | pp512 | 281.38 ± 0.73 | 294.43 ± 0.68 | 1.046 |
| llama 8B Q4_0 | AVX2 | pp512 | 302.61 ± 0.29 | 316.23 ± 0.31 | 1.045 |
I really wanted to hit 400 t/s for Q8_K_R8, but that will have to wait for another day.