
🔀 #207 - Faster CPU TG for GQA models

Author ikawrakow
State Closed
Created 2025-02-15
Updated 2025-02-15

Description

This PR

  • Absorbs the iqk matrix multiplication logic in ggml into a new iqk function iqk_mul_mat_4d. The changes to ggml needed to incorporate the iqk functionality are now much less intrusive.
  • Adds special handling of the TG case with GQA to iqk_mul_mat_4d. In this case the K and V tensors have shape N x M x Lkv (N is the head size, M is the number of cached tokens, Lkv is the number of KV heads), and they multiply a tensor with a single column per head (Q for the K*Q product, softmax(K*Q) for the V product; for Q the shape is N x 1 x L, where L is the number of attention heads, L > Lkv). Rearranging Q as N x L/Lkv x Lkv turns the L GEMVs into Lkv GEMMs, which is significantly faster (see the sketch after this list).
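
To illustrate the regrouping, here is a minimal standalone sketch; it is not the actual iqk_mul_mat_4d implementation, and the helper gemm_nt, the row-major layouts, and the dimensions are simplified assumptions for the example. It computes the K*Q scores for one layer both as L GEMVs and as Lkv GEMMs and checks that they agree:

```cpp
// Minimal sketch of the GEMV -> GEMM regrouping for GQA token generation.
// NOT the actual iqk_mul_mat_4d code; row-major layouts chosen for brevity.
#include <cassert>
#include <cstdio>
#include <vector>

// C (ra x rb) = A (ra x N) * B^T, with B stored row-major as (rb x N)
static void gemm_nt(const float *A, const float *B, float *C,
                    int ra, int rb, int N) {
    for (int i = 0; i < ra; ++i)
        for (int j = 0; j < rb; ++j) {
            float sum = 0;
            for (int k = 0; k < N; ++k) sum += A[i*N + k]*B[j*N + k];
            C[i*rb + j] = sum;
        }
}

int main() {
    const int N = 128, M = 4096, L = 32, Lkv = 8; // head size, KV length, heads, KV heads
    const int G  = L/Lkv;                         // query heads per KV head
    std::vector<float> K(size_t(Lkv)*M*N), Q(size_t(L)*N), S1(size_t(L)*M), S2(size_t(L)*M);
    for (int i = 0; i < (int)K.size(); ++i) K[i] = 0.01f*(i%97);
    for (int i = 0; i < (int)Q.size(); ++i) Q[i] = 0.02f*(i%53);

    // GEMV view: L kernel calls with a single query row each.
    // The K slice of each KV head is streamed from memory G times.
    for (int h = 0; h < L; ++h)
        gemm_nt(&Q[h*N], &K[size_t(h/G)*M*N], &S1[h*M], 1, M, N);

    // GEMM view: the G query heads sharing KV head g become one G x N
    // right-hand side -> Lkv kernel calls, each K slice read once.
    for (int g = 0; g < Lkv; ++g)
        gemm_nt(&Q[g*G*N], &K[size_t(g)*M*N], &S2[size_t(g)*G*M], G, M, N);

    for (int i = 0; i < L*M; ++i) assert(S1[i] == S2[i]);
    printf("GEMV and GEMM views agree: %d scores\n", L*M);
    return 0;
}
```

The arithmetic is identical in both views; the benefit is memory reuse: each row of K (and likewise V) is loaded once and used for L/Lkv query columns instead of being streamed L/Lkv times, which is exactly what matters when TG is memory-bound on a large KV cache.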

The new approach gives a noticeable TG speedup only at long context (large KV cache): at short context the fraction of time spent on the K*Q and V*softmax(K*Q) multiplications is small (see the rough estimate below). The table that follows compares TG performance on main and with this PR for LLaMA-3.1-8B at different prompt lengths; the model is quantized with IQ4_XS and runs on a Ryzen-7950X (Zen4) or an M2-Max CPU.
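
As a back-of-the-envelope estimate, assuming the standard LLaMA-3.1-8B attention configuration (32 layers, 8 KV heads of size 128) and an f16 KV cache, the attention step reads per generated token roughly

$$
\underbrace{2}_{K,\,V}\times M\times \underbrace{128}_{\text{head size}}\times \underbrace{8}_{\text{KV heads}}\times \underbrace{32}_{\text{layers}}\times \underbrace{2}_{\text{f16 bytes}} \;=\; 131072\,M \ \text{bytes}.
$$

At M = 8192 cached tokens that is about 1 GiB per token, comparable to the roughly 4.3 GB of IQ4_XS model weights, while at M = 128 it is only 16 MiB; this is consistent with the speedups in the table growing with prompt length.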

| model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
|-------|---------|--------:|------|-----------:|---------:|--------:|
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp128 | 13.85 ± 0.01 | 13.88 ± 0.00 | 1.002 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp256 | 13.72 ± 0.01 | 13.80 ± 0.00 | 1.006 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp512 | 13.48 ± 0.02 | 13.63 ± 0.02 | 1.011 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp1024 | 13.05 ± 0.02 | 13.33 ± 0.00 | 1.021 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp2048 | 12.21 ± 0.01 | 12.77 ± 0.00 | 1.046 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp4096 | 10.72 ± 0.00 | 11.82 ± 0.00 | 1.103 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp8192 | 8.60 ± 0.00 | 10.26 ± 0.01 | 1.193 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp128 | 26.82 ± 0.07 | 28.01 ± 0.06 | 1.044 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp256 | 26.49 ± 0.04 | 27.90 ± 0.01 | 1.053 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp512 | 25.94 ± 0.00 | 27.47 ± 0.00 | 1.059 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp1024 | 24.80 ± 0.00 | 26.28 ± 0.40 | 1.060 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp2048 | 22.66 ± 0.01 | 25.17 ± 0.00 | 1.111 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp4096 | 18.99 ± 0.01 | 23.12 ± 0.02 | 1.217 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp8192 | 14.07 ± 0.00 | 19.66 ± 0.02 | 1.397 |

On the M2-Max, which has higher memory bandwidth (hence better TG performance) but lower compute throughput than the Ryzen-7950X, the speedup is significantly larger.