🔀 #207 - Faster CPU TG for GQA models
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-02-15 |
| Updated | 2025-02-15 |
Description
This PR:

- Absorbs the `iqk` matrix multiplication logic in `ggml` into a new `iqk` function, `iqk_mul_mat_4d`. The change to `ggml` needed to incorporate the `iqk`-added functionality is now much less intrusive.
- Adds to `iqk_mul_mat_4d` special handling of the TG case with GQA. In this case the `K` and `V` tensors have shape `N x M x Lkv` (`N` is the head size, `M` is the number of cached tokens, `Lkv` is the number of KV heads), and they multiply a tensor (`Q` or `K*Q`) of shape `N x 1 x L` (`L` is the number of heads, with `L > Lkv`). If we rearrange `Q` as `N x L/Lkv x Lkv`, we have GEMM instead of GEMV, and this is significantly faster; the sketch after this list illustrates the idea.
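To make the rearrangement concrete, here is a minimal, self-contained C++ sketch (a toy illustration only, not the actual `iqk_mul_mat_4d` code) that computes the `K*Q` product for one generated token both ways and checks that they agree. The point of the GEMM view is that each KV head's `K` slab is read from memory once instead of `L/Lkv` times:

```cpp
// Toy illustration of why grouping query heads turns per-head GEMVs into
// per-KV-head GEMMs under GQA. Notation follows the PR: N = head size,
// M = cached tokens, Lkv = KV heads, L = query heads, G = L/Lkv.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int N = 8, M = 16, Lkv = 2, L = 8, G = L / Lkv;
    std::vector<float> K(Lkv * M * N), Q(L * N);  // K: [Lkv][M][N], Q: [L][N]
    for (size_t i = 0; i < K.size(); ++i) K[i] = std::sin(0.1f * i);
    for (size_t i = 0; i < Q.size(); ++i) Q[i] = std::cos(0.2f * i);

    // GEMV view: one matrix-vector product per query head. Each head l uses
    // the K slab of its KV group l/G, so the same slab is streamed G times.
    std::vector<float> out_gemv(L * M, 0.f);
    for (int l = 0; l < L; ++l) {
        const float * k = K.data() + (l / G) * M * N;
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n)
                out_gemv[l * M + m] += k[m * N + n] * Q[l * N + n];
    }

    // GEMM view: treat Q as N x G x Lkv, i.e. the G query heads that share a
    // KV head become the columns of one right-hand side. Each K slab is now
    // loaded once and multiplied against G columns.
    std::vector<float> out_gemm(L * M, 0.f);
    for (int h = 0; h < Lkv; ++h) {
        const float * k = K.data() + h * M * N;
        for (int m = 0; m < M; ++m)
            for (int g = 0; g < G; ++g) {
                const int l = h * G + g;  // original query head index
                float acc = 0.f;
                for (int n = 0; n < N; ++n)
                    acc += k[m * N + n] * Q[l * N + n];
                out_gemm[l * M + m] = acc;
            }
    }

    float maxdiff = 0.f;
    for (size_t i = 0; i < out_gemv.size(); ++i)
        maxdiff = std::max(maxdiff, std::fabs(out_gemv[i] - out_gemm[i]));
    printf("max |gemv - gemm| = %g\n", maxdiff);  // expected: 0
    return 0;
}
```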
This approach only gives a noticeable TG speedup for long contexts (large KV cache); without that, the fraction of time spent on the `K*Q` and `V*softmax(K*Q)` multiplications is small. The table below compares TG performance on main and with this PR for LLaMA-3.1-8B at different prompt lengths. The model is quantized with `IQ4_XS` and runs on a Ryzen-7950X (Zen4) or an M2-Max CPU.
| model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|---|---|
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp128 | 13.85 ± 0.01 | 13.88 ± 0.00 | 1.002 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp256 | 13.72 ± 0.01 | 13.80 ± 0.00 | 1.006 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp512 | 13.48 ± 0.02 | 13.63 ± 0.02 | 1.011 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp1024 | 13.05 ± 0.02 | 13.33 ± 0.00 | 1.021 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp2048 | 12.21 ± 0.01 | 12.77 ± 0.00 | 1.046 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp4096 | 10.72 ± 0.00 | 11.82 ± 0.00 | 1.103 |
| llama 8B IQ4_XS | Zen4 | 8 | tg64@pp8192 | 8.60 ± 0.00 | 10.26 ± 0.01 | 1.193 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp128 | 26.82 ± 0.07 | 28.01 ± 0.06 | 1.044 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp256 | 26.49 ± 0.04 | 27.90 ± 0.01 | 1.053 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp512 | 25.94 ± 0.00 | 27.47 ± 0.00 | 1.059 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp1024 | 24.80 ± 0.00 | 26.28 ± 0.40 | 1.060 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp2048 | 22.66 ± 0.01 | 25.17 ± 0.00 | 1.111 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp4096 | 18.99 ± 0.01 | 23.12 ± 0.02 | 1.217 |
| llama 8B IQ4_XS | M2-Max | 8 | tg64@pp8192 | 14.07 ± 0.00 | 19.66 ± 0.02 | 1.397 |
On the M2-Max, which has higher memory bandwidth (and hence better TG performance) but lower compute throughput than the Ryzen-7950X, the speedup is significantly higher.
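The shape of these numbers matches a simple Amdahl-style estimate (my framing, not from the PR): if the attention multiplications take a fraction $f$ of total TG time and the GEMM rearrangement makes them $s\times$ faster, the overall speedup is

$$\text{speedup} = \frac{1}{(1-f) + f/s}.$$

With illustrative (not measured) values $f = 0.5$ and $s = 3$, this gives $1/(0.5 + 0.167) \approx 1.5$, in the neighborhood of the 1.397 observed on the M2-Max at pp8192. Since $f$ grows with context length, and is plausibly larger on the bandwidth-rich but compute-poorer M2-Max, the speedup grows along both axes of the table.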