🔀 #351 - CPU FA improvements

Author ikawrakow
State Closed
Created 2025-04-28
Updated 2025-04-29

Description

This PR further improves CPU flash attention (FA) performance for models that use grouped-query attention (GQA). It does not affect FlashMLA (relevant for DeepSeek models); the same strategy could be applied there as well, but I have left that for a future PR.

Here are some performance data and graphs for LLaMA-3.1-8B and Gemma3-12B. In all cases a Q8_0-quantized KV cache is used. The model weights are quantized with Q4_0, chosen specifically because it has the best performance in mainline llama.cpp, owing to the extraordinary amount of attention this quantization type receives.

Gemma3-12B, Ryzen-7950X CPU

[Figure: g3_tg_7950]

[Figure: g3_pp_7950]

Gemma3-12B, Ryzen-7950X CPU, mainline llama.cpp
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 4.855 105.46 15.816 8.09
512 128 512 5.743 89.15 16.529 7.74
512 128 1024 6.337 80.80 17.091 7.49
512 128 1536 6.516 78.58 17.199 7.44
512 128 2048 6.688 76.56 17.309 7.39
512 128 2560 6.882 74.40 17.416 7.35
512 128 3072 7.075 72.36 17.526 7.30
512 128 3584 7.291 70.22 17.638 7.26
512 128 4096 7.493 68.33 17.746 7.21
512 128 4608 7.751 66.05 17.769 7.20
512 128 5120 8.153 62.80 17.957 7.13
512 128 5632 8.658 59.13 18.072 7.08
512 128 6144 9.215 55.56 18.165 7.05
512 128 6656 9.792 52.29 18.264 7.01
512 128 7168 10.360 49.42 18.378 6.97
512 128 7680 10.964 46.70 18.484 6.92
512 128 8192 11.576 44.23 18.599 6.88
512 128 8704 12.193 41.99 18.687 6.85
512 128 9216 12.805 39.98 18.817 6.80
512 128 9728 13.402 38.20 18.923 6.76
512 128 10240 13.914 36.80 19.047 6.72
512 128 10752 14.442 35.45 19.226 6.66
512 128 11264 14.966 34.21 19.333 6.62
512 128 11776 15.517 33.00 19.372 6.61
512 128 12288 16.000 32.00 19.480 6.57
512 128 12800 16.504 31.02 19.593 6.53
512 128 13312 16.998 30.12 19.706 6.50
512 128 13824 17.607 29.08 19.810 6.46
512 128 14336 18.041 28.38 19.976 6.41
512 128 14848 18.543 27.61 20.092 6.37
512 128 15360 19.050 26.88 20.216 6.33
512 128 15872 19.514 26.24 20.393 6.28
Gemma3-12B, Ryzen-7950X CPU, ik_llama.cpp, main branch
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 2.913 175.75 15.638 8.18
512 128 512 2.998 170.78 15.889 8.06
512 128 1024 3.094 165.46 16.178 7.91
512 128 1536 3.180 160.99 16.474 7.77
512 128 2048 3.269 156.61 16.668 7.68
512 128 2560 3.360 152.39 16.895 7.58
512 128 3072 3.447 148.55 17.145 7.47
512 128 3584 3.539 144.66 17.415 7.35
512 128 4096 3.627 141.16 17.672 7.24
512 128 4608 3.715 137.82 17.924 7.14
512 128 5120 3.805 134.58 18.184 7.04
512 128 5632 3.892 131.56 18.448 6.94
512 128 6144 3.985 128.47 18.702 6.84
512 128 6656 4.081 125.45 18.951 6.75
512 128 7168 4.180 122.50 19.199 6.67
512 128 7680 4.289 119.38 19.444 6.58
512 128 8192 4.376 117.00 19.689 6.50
512 128 8704 4.481 114.27 19.927 6.42
512 128 9216 4.570 112.04 20.185 6.34
512 128 9728 4.684 109.31 20.427 6.27
512 128 10240 4.766 107.42 20.689 6.19
512 128 10752 4.870 105.13 20.921 6.12
512 128 11264 4.983 102.75 21.177 6.04
512 128 11776 5.076 100.87 21.430 5.97
512 128 12288 5.213 98.21 21.661 5.91
512 128 12800 5.324 96.16 21.924 5.84
512 128 13312 5.356 95.59 22.439 5.70
512 128 13824 5.468 93.63 22.689 5.64
512 128 14336 5.558 92.11 22.964 5.57
512 128 14848 5.684 90.07 23.209 5.52
512 128 15360 5.829 87.84 23.803 5.38
512 128 15872 5.971 85.75 24.068 5.32
Gemma3-12B, Ryzen-7950X CPU, ik_llama.cpp, PR
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 2.871 178.35 15.620 8.19
512 128 512 2.952 173.46 15.752 8.13
512 128 1024 3.033 168.84 15.861 8.07
512 128 1536 3.112 164.53 15.995 8.00
512 128 2048 3.187 160.65 16.099 7.95
512 128 2560 3.265 156.82 16.227 7.89
512 128 3072 3.339 153.32 16.339 7.83
512 128 3584 3.419 149.75 16.463 7.77
512 128 4096 3.490 146.68 16.577 7.72
512 128 4608 3.566 143.60 16.701 7.66
512 128 5120 3.643 140.56 16.814 7.61
512 128 5632 3.721 137.61 16.940 7.56
512 128 6144 3.802 134.66 17.057 7.50
512 128 6656 3.884 131.84 17.165 7.46
512 128 7168 3.966 129.10 17.282 7.41
512 128 7680 4.051 126.38 17.402 7.36
512 128 8192 4.127 124.05 17.521 7.31
512 128 8704 4.208 121.68 17.631 7.26
512 128 9216 4.288 119.39 17.751 7.21
512 128 9728 4.366 117.28 17.861 7.17
512 128 10240 4.447 115.13 17.986 7.12
512 128 10752 4.526 113.13 18.099 7.07
512 128 11264 4.609 111.08 18.209 7.03
512 128 11776 4.698 108.99 18.330 6.98
512 128 12288 4.765 107.44 18.448 6.94
512 128 12800 4.843 105.71 18.559 6.90
512 128 13312 4.923 104.00 18.686 6.85
512 128 13824 4.999 102.42 18.797 6.81
512 128 14336 5.081 100.76 18.915 6.77
512 128 14848 5.160 99.23 19.029 6.73
512 128 15360 5.234 97.81 19.144 6.69
512 128 15872 5.320 96.24 19.265 6.64
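
Judging from the numbers, the throughput columns in these tables appear to be the token count divided by the elapsed time (for example, 512 / 4.855 ≈ 105.46 in the first mainline row). The following is a minimal sketch, not part of the PR, that checks this relation and computes the PR-versus-main-branch ratio at the smallest and largest N_KV. The helper names (`parse_rows`, `recomputed_speeds`, `speedup`) are hypothetical, and the rows are copied from the two ik_llama.cpp tables above.

```python
# Rows copied verbatim from the two Gemma3-12B / Ryzen-7950X tables above
# (ik_llama.cpp main branch and this PR), first and last N_KV only.
# Columns: PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
MAIN_ROWS = """
512 128 0 2.913 175.75 15.638 8.18
512 128 15872 5.971 85.75 24.068 5.32
"""
PR_ROWS = """
512 128 0 2.871 178.35 15.620 8.19
512 128 15872 5.320 96.24 19.265 6.64
"""

COLUMNS = ("PP", "TG", "N_KV", "T_PP", "S_PP", "T_TG", "S_TG")


def parse_rows(text):
    """Parse whitespace-separated rows into {N_KV: {column: value}}."""
    table = {}
    for line in text.strip().splitlines():
        row = dict(zip(COLUMNS, (float(x) for x in line.split())))
        table[int(row["N_KV"])] = row
    return table


def recomputed_speeds(row):
    """Recompute (S_PP, S_TG) in tokens/s from token counts and times."""
    return row["PP"] / row["T_PP"], row["TG"] / row["T_TG"]


def speedup(new, old, n_kv, column):
    """Ratio of a throughput column between two builds at the same N_KV."""
    return new[n_kv][column] / old[n_kv][column]


main_table = parse_rows(MAIN_ROWS)
pr_table = parse_rows(PR_ROWS)

for n_kv in sorted(pr_table):
    s_pp, s_tg = recomputed_speeds(pr_table[n_kv])
    print(
        f"N_KV={n_kv:5d}  recomputed S_PP={s_pp:7.2f}  S_TG={s_tg:5.2f}  "
        f"PP speedup={speedup(pr_table, main_table, n_kv, 'S_PP'):.3f}  "
        f"TG speedup={speedup(pr_table, main_table, n_kv, 'S_TG'):.3f}"
    )
```

With these rows, the PR-to-main ratio at N_KV = 15872 comes out to roughly 1.12 for PP and 1.25 for TG, while at N_KV = 0 the two builds are within about 1.5% of each other. Small differences between the recomputed and tabulated speeds are expected because the times are rounded to three decimals.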

LLaMA-3.1-8B, Ryzen-7950X CPU

[Figure: l3_tg_7950]

[Figure: l3_pp_7950]

LLaMA-3.1-8B, Ryzen-7950X CPU, mainline llama.cpp
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 3.142 162.97 9.757 13.12
512 128 512 3.843 133.21 10.188 12.56
512 128 1024 4.755 107.68 10.650 12.02
512 128 1536 5.603 91.37 11.111 11.52
512 128 2048 6.516 78.58 11.663 10.98
512 128 2560 7.336 69.79 11.965 10.70
512 128 3072 8.223 62.27 12.806 10.00
512 128 3584 8.933 57.32 13.365 9.58
512 128 4096 9.856 51.95 13.786 9.28
512 128 4608 10.706 47.82 14.193 9.02
512 128 5120 11.364 45.05 14.343 8.92
512 128 5632 12.454 41.11 14.798 8.65
512 128 6144 13.314 38.46 15.306 8.36
512 128 6656 14.295 35.82 16.040 7.98
512 128 7168 15.305 33.45 16.261 7.87
512 128 7680 16.176 31.65 16.296 7.85
512 128 8192 17.431 29.37 16.787 7.62
512 128 8704 18.729 27.34 17.301 7.40
512 128 9216 19.666 26.03 18.312 6.99
512 128 9728 20.288 25.24 18.825 6.80
512 128 10240 21.463 23.86 19.068 6.71
512 128 10752 23.474 21.81 19.701 6.50
512 128 11264 25.045 20.44 21.869 5.85
512 128 11776 27.214 18.81 21.128 6.06
512 128 12288 29.659 17.26 21.934 5.84
512 128 12800 32.139 15.93 22.233 5.76
512 128 13312 34.763 14.73 23.041 5.56
512 128 13824 34.760 14.73 24.010 5.33
512 128 14336 37.343 13.71 24.287 5.27
512 128 14848 42.109 12.16 25.254 5.07
512 128 15360 44.581 11.48 26.290 4.87
512 128 15872 45.159 11.34 25.655 4.99
LLaMA-3.1-8B, Ryzen-7950X CPU, ik_llama.cpp, main branch
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 1.812 282.53 9.859 12.98
512 128 512 1.856 275.84 9.971 12.84
512 128 1024 1.911 267.87 10.082 12.70
512 128 1536 1.976 259.05 10.207 12.54
512 128 2048 2.025 252.81 10.323 12.40
512 128 2560 2.078 246.34 10.442 12.26
512 128 3072 2.137 239.57 10.559 12.12
512 128 3584 2.210 231.72 10.674 11.99
512 128 4096 2.248 227.76 10.791 11.86
512 128 4608 2.299 222.75 10.909 11.73
512 128 5120 2.357 217.24 11.024 11.61
512 128 5632 2.408 212.60 11.140 11.49
512 128 6144 2.467 207.51 11.255 11.37
512 128 6656 2.519 203.22 11.369 11.26
512 128 7168 2.578 198.63 11.488 11.14
512 128 7680 2.628 194.79 11.607 11.03
512 128 8192 2.688 190.46 11.720 10.92
512 128 8704 2.742 186.70 11.842 10.81
512 128 9216 2.796 183.10 11.965 10.70
512 128 9728 2.848 179.75 12.078 10.60
512 128 10240 2.910 175.97 12.194 10.50
512 128 10752 2.964 172.76 12.319 10.39
512 128 11264 3.021 169.48 12.440 10.29
512 128 11776 3.077 166.40 12.547 10.20
512 128 12288 3.136 163.27 12.670 10.10
512 128 12800 3.193 160.33 12.799 10.00
512 128 13312 3.252 157.42 12.913 9.91
512 128 13824 3.309 154.71 13.018 9.83
512 128 14336 3.372 151.85 13.152 9.73
512 128 14848 3.429 149.30 13.270 9.65
512 128 15360 3.491 146.65 13.370 9.57
512 128 15872 3.554 144.08 13.496 9.48
LLaMA-3.1-8B, Ryzen-7950X CPU, ik_llama.cpp, PR
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 1.848 277.10 9.838 13.01
512 128 512 1.834 279.16 9.893 12.94
512 128 1024 1.891 270.70 9.971 12.84
512 128 1536 1.951 262.37 10.033 12.76
512 128 2048 2.003 255.62 10.082 12.70
512 128 2560 2.057 248.90 10.147 12.61
512 128 3072 2.111 242.51 10.200 12.55
512 128 3584 2.169 236.00 10.258 12.48
512 128 4096 2.217 230.97 10.314 12.41
512 128 4608 2.268 225.72 10.368 12.35
512 128 5120 2.322 220.51 10.423 12.28
512 128 5632 2.372 215.83 10.479 12.22
512 128 6144 2.430 210.68 10.538 12.15
512 128 6656 2.477 206.73 10.575 12.10
512 128 7168 2.530 202.39 10.626 12.05
512 128 7680 2.580 198.42 10.685 11.98
512 128 8192 2.637 194.15 10.738 11.92
512 128 8704 2.682 190.88 10.791 11.86
512 128 9216 2.740 186.87 10.847 11.80
512 128 9728 2.785 183.83 10.903 11.74
512 128 10240 2.849 179.69 10.959 11.68
512 128 10752 2.892 177.03 11.015 11.62
512 128 11264 2.949 173.60 11.068 11.56
512 128 11776 2.995 170.93 11.122 11.51
512 128 12288 3.058 167.45 11.179 11.45
512 128 12800 3.102 165.06 11.233 11.39
512 128 13312 3.164 161.82 11.285 11.34
512 128 13824 3.210 159.52 11.339 11.29
512 128 14336 3.271 156.54 11.394 11.23
512 128 14848 3.319 154.26 11.447 11.18
512 128 15360 3.380 151.49 11.504 11.13
512 128 15872 3.428 149.34 11.560 11.07
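
One way to read the three LLaMA-3.1-8B tables above is to ask how much of the empty-cache throughput each build retains once the KV cache holds 15872 tokens. The sketch below is an illustration added for the reader, not something taken from the PR; the `retention` helper is hypothetical and the numbers are copied from the first and last rows of each table.

```python
# (S at N_KV = 0, S at N_KV = 15872) in tokens/s, copied from the three
# LLaMA-3.1-8B / Ryzen-7950X tables above.
S_PP = {
    "mainline llama.cpp": (162.97, 11.34),
    "ik_llama.cpp main": (282.53, 144.08),
    "ik_llama.cpp PR": (277.10, 149.34),
}
S_TG = {
    "mainline llama.cpp": (13.12, 4.99),
    "ik_llama.cpp main": (12.98, 9.48),
    "ik_llama.cpp PR": (13.01, 11.07),
}


def retention(first, last):
    """Fraction of the empty-cache throughput still reached at the last N_KV."""
    return last / first


for build in S_TG:
    print(
        f"{build:20s} PP retained: {retention(*S_PP[build]):6.1%}  "
        f"TG retained: {retention(*S_TG[build]):6.1%}"
    )
```

For TG this gives roughly 38% for mainline llama.cpp, 73% for the ik_llama.cpp main branch, and 85% for the PR; for PP the corresponding figures are roughly 7%, 51%, and 54%.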

LLaMA-3.1-8B, M2-Max CPU

[Figure: l3_tg_m2]

[Figure: l3_pp_m2]

LLaMA-3.1-8B, M2-Max CPU, mainline llama.cpp
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 4.775 107.22 4.909 26.08
512 128 512 6.157 83.15 5.462 23.43
512 128 1024 8.047 63.63 5.981 21.40
512 128 1536 9.752 52.50 6.553 19.53
512 128 2048 11.760 43.54 7.078 18.08
512 128 2560 13.010 39.36 7.527 17.01
512 128 3072 13.878 36.89 8.051 15.90
512 128 3584 15.967 32.07 8.611 14.87
512 128 4096 17.357 29.50 9.099 14.07
512 128 4608 17.953 28.52 9.664 13.25
512 128 5120 20.917 24.48 10.123 12.64
512 128 5632 21.812 23.47 10.720 11.94
512 128 6144 24.313 21.06 11.310 11.32
512 128 6656 26.592 19.25 12.010 10.66
512 128 7168 28.705 17.84 12.549 10.20
512 128 7680 29.934 17.10 13.435 9.53
LLaMA-3.1-8B, M2-Max CPU, ik_llama.cpp, main branch
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 4.026 127.16 4.793 26.70
512 128 512 4.150 123.36 4.949 25.87
512 128 1024 4.322 118.45 5.292 24.19
512 128 1536 4.524 113.18 5.263 24.32
512 128 2048 4.740 108.01 5.415 23.64
512 128 2560 4.966 103.11 5.558 23.03
512 128 3072 5.154 99.34 5.708 22.42
512 128 3584 5.330 96.06 5.930 21.59
512 128 4096 5.471 93.59 6.072 21.08
512 128 4608 5.636 90.85 6.161 20.78
512 128 5120 5.755 88.96 6.449 19.85
512 128 5632 5.919 86.50 6.473 19.78
512 128 6144 6.142 83.36 6.672 19.19
512 128 6656 6.242 82.03 6.838 18.72
512 128 7168 6.287 81.44 6.923 18.49
512 128 7680 6.406 79.93 7.077 18.09
LLaMA-3.1-8B, M2-Max CPU, ik_llama.cpp, PR
PP TG N_KV T_PP(s) S_PP(t/s) T_TG(s) S_TG(t/s)
512 128 0 4.035 126.88 4.842 26.73
512 128 512 4.139 123.70 4.868 26.29
512 128 1024 4.250 120.46 4.955 25.83
512 128 1536 4.408 116.16 5.055 25.32
512 128 2048 4.605 111.19 5.181 24.70
512 128 2560 4.790 106.90 5.250 24.38
512 128 3072 5.022 101.96 5.362 23.87
512 128 3584 5.198 98.50 5.379 23.80
512 128 4096 5.395 94.90 5.460 23.44
512 128 4608 5.546 92.31 5.543 23.09
512 128 5120 5.671 90.28 5.717 22.39
512 128 5632 5.793 88.39 5.718 22.39
512 128 6144 5.967 85.80 5.820 21.99
512 128 6656 6.051 84.61 5.901 21.69
512 128 7168 6.147 83.29 5.972 21.43
512 128 7680 6.228 82.21 6.081 21.05
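
The M2-Max tables can also be summarized as a rough cost of context: how many extra milliseconds each generated token takes for every additional 1024 tokens in the KV cache. The sketch below is a two-point estimate between the N_KV = 512 and N_KV = 7680 rows, so it assumes the growth is approximately linear, which the tables only roughly support. The function name `ms_per_token_per_1k_kv` is hypothetical, and the values are copied from the T_TG column of the three tables above.

```python
TG = 128  # tokens generated per measurement, from the TG column

# ((N_KV, T_TG seconds) at the low point, same at the high point),
# copied from the three LLaMA-3.1-8B / M2-Max tables above.
ROWS = {
    "mainline llama.cpp": ((512, 5.462), (7680, 13.435)),
    "ik_llama.cpp main": ((512, 4.949), (7680, 7.077)),
    "ik_llama.cpp PR": ((512, 4.868), (7680, 6.081)),
}


def ms_per_token_per_1k_kv(low, high, tg=TG):
    """Two-point slope: extra ms per generated token per 1024 cached tokens."""
    (n_low, t_low), (n_high, t_high) = low, high
    per_token_delta_s = (t_high - t_low) / tg  # added seconds per token
    return per_token_delta_s / (n_high - n_low) * 1024 * 1000


for build, (low, high) in ROWS.items():
    print(f"{build:20s} {ms_per_token_per_1k_kv(low, high):5.2f} ms/token per 1024 KV")
```

Under that linearity assumption, the estimate is about 8.9 ms for mainline llama.cpp, 2.4 ms for the ik_llama.cpp main branch, and 1.35 ms for the PR.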