
🔀 #179 - Minor performance improvements

Author ikawrakow
State Closed
Created 2025-01-27
Updated 2025-01-27

Description

This PR does two things:

  1. It changes Q4_0_R4 to use 8 interleaved rows.
  2. It adds the ability to apply platform-specific transformations to the tensor data while repacking.

Examples of how 2. is used:

  • On ARM_NEON it is useful to XOR the Q4_0 quants with the mask 0x88. This converts the stored excess-8 nibbles into signed two's-complement nibbles, so one does not need to subtract 8 at run time. The tweak improves Q4_0 PP performance by nearly 5% on my M2-Max CPU. It is of no use on AVX2/Zen4, so it becomes a platform-specific transformation applied when run-time repacking on an ARM_NEON CPU (a sketch of this follows the list).
  • On Zen4 one can add 128 to the signed Q8 quants to make them unsigned, so they can be used directly in _mmXXX_dpbusd_epi32(). This improves Q8_0 and Q8_K_R8 performance by about 3%. The transformation is not useful on ARM_NEON (one needs signed int8_t's) or on vanilla AVX2 (the _mm256_maddubs_epi16 dot product may overflow), so it only gets applied when repacking on Zen4 (see the second sketch below).
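To make the first tweak concrete, here is a minimal sketch of the repack-time transformation; the function name is hypothetical and this is not the PR's actual code. Q4_0 packs two 4-bit quants per byte with an offset of 8 (stored nibble = q + 8), and XOR-ing each byte with 0x88 flips the high bit of both nibbles, turning the excess-8 encoding into two's-complement signed nibbles that the kernel can sign-extend directly:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical illustration: apply the 0x88 mask to a buffer of packed
// Q4_0 nibbles while repacking on ARM_NEON. Stored nibble v = q + 8 with
// q in [-8, 7]; v ^ 8 equals the 4-bit two's-complement encoding of q,
// so the dot-product kernel no longer has to subtract 8 per quant.
static void repack_xor_q4_nibbles(uint8_t * qs, size_t n_bytes) {
    for (size_t i = 0; i < n_bytes; ++i) {
        qs[i] ^= 0x88;  // flip the high bit of both the low and high nibble
    }
}
```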
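Similarly, a sketch of the second tweak (again with hypothetical naming, and under the assumption that the resulting constant bias is corrected via the per-block sums): in two's complement, adding 128 to a signed byte is the same as XOR-ing it with 0x80, which turns int8_t quants into the unsigned bytes that _mmXXX_dpbusd_epi32() expects.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical illustration: convert signed Q8 quants to unsigned while
// repacking on Zen4. The constant bias this introduces (128 times the sum
// of the other dot-product operand) must be subtracted elsewhere, e.g.
// folded into the block sums stored alongside the quants.
static void repack_q8_to_unsigned(int8_t * q, size_t n) {
    uint8_t * u = reinterpret_cast<uint8_t *>(q);
    for (size_t i = 0; i < n; ++i) {
        u[i] ^= 0x80;  // equivalent to adding 128 modulo 256
    }
}
```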

The table below compares PP-512 performance for LLaMA-3.1-8B for the affected quantization types, using Flash Attention and a Q8_0 KV-cache.

| model | backend | test | t/s (main) | t/s (PR) | Speedup |
| ----- | ------- | ---- | ---------- | -------- | ------- |
| llama 8B Q4_0 | NEON | pp512 | 130.92 ± 0.10 | 137.39 ± 0.32 | 1.049 |
| llama 8B Q8_K_R8 | Zen4 | pp512 | 380.75 ± 1.52 | 390.40 ± 0.88 | 1.025 |
| llama 8B Q8_0 | Zen4 | pp512 | 295.62 ± 0.80 | 307.80 ± 0.34 | 1.041 |
| llama 8B Q4_0 | Zen4 | pp512 | 281.38 ± 0.73 | 294.43 ± 0.68 | 1.046 |
| llama 8B Q4_0 | AVX2 | pp512 | 302.61 ± 0.29 | 316.23 ± 0.31 | 1.045 |

I really wanted to hit 400 t/s for Q8_K_R8, but that will have to wait for another day.