ik_llama.cpp/github-data/pull_requests/516-Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2).md
2025-07-22 18:18:40 +02:00

🔀 #516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2)

Author: ikawrakow
State: Closed
Created: 2025-06-11
Updated: 2025-06-11

Description

This PR is a follow-up to #515 and applies the same technique, repacking to q8_0_r8 for GEMM, to IQ3_XXS. Prompt processing is nearly 3X faster than IQ3_XXS on the main branch (296 vs. 102 t/s at N_KV = 0) and over 2X faster than IQ3_XXS_R4 (296 vs. 133 t/s at N_KV = 0).

Sweep-bench results for pure IQ3_XXS quantization of LLaMA-3.1-8B on a Ryzen-7950X CPU:

IQ3_XXS, main branch

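| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 5.023 | 101.94 | 7.365 | 17.38 |
| 512 | 128 | 512 | 5.281 | 96.96 | 8.088 | 15.83 |
| 512 | 128 | 1024 | 5.170 | 99.03 | 7.977 | 16.05 |
| 512 | 128 | 1536 | 5.324 | 96.16 | 7.942 | 16.12 |
| 512 | 128 | 2048 | 5.389 | 95.02 | 8.043 | 15.91 |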
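
As a reading aid: the speed columns appear to be the batch size divided by the measured time, i.e. S_PP = PP / T_PP and S_TG = TG / T_TG. This relation is inferred from the numbers and is not stated in the PR, and the helper name `throughput` is hypothetical. A minimal Python check against the first row:

```python
def throughput(n_tokens: int, seconds: float) -> float:
    """Tokens per second for n_tokens processed in `seconds` (assumed column relation)."""
    return n_tokens / seconds

# First row of the IQ3_XXS main-branch table:
# PP = 512 tokens in T_PP = 5.023 s, TG = 128 tokens in T_TG = 7.365 s.
s_pp = throughput(512, 5.023)
s_tg = throughput(128, 7.365)

print(f"S_PP ~ {s_pp:.2f} t/s, S_TG ~ {s_tg:.2f} t/s")
```

The computed values agree with the reported S_PP and S_TG up to rounding of the reported times.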

IQ3_XXS_R4, main branch

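| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 3.836 | 133.47 | 7.675 | 16.68 |
| 512 | 128 | 512 | 3.687 | 138.87 | 8.279 | 15.46 |
| 512 | 128 | 1024 | 3.805 | 134.57 | 8.245 | 15.53 |
| 512 | 128 | 1536 | 3.906 | 131.08 | 8.252 | 15.51 |
| 512 | 128 | 2048 | 4.076 | 125.61 | 8.545 | 14.98 |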

IQ3_XXS, PR

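| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 512 | 128 | 0 | 1.730 | 296.01 | 7.641 | 16.75 |
| 512 | 128 | 512 | 1.807 | 283.30 | 8.333 | 15.36 |
| 512 | 128 | 1024 | 1.896 | 269.98 | 8.070 | 15.86 |
| 512 | 128 | 1536 | 1.978 | 258.78 | 8.481 | 15.09 |
| 512 | 128 | 2048 | 2.062 | 248.32 | 8.514 | 15.03 |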
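
To make the prompt-processing comparison explicit, the speedup at each N_KV can be computed directly from the S_PP columns of the three tables above. The following is a small Python sketch, not part of the PR; the variable and function names are chosen here for illustration.

```python
# S_PP (t/s) per N_KV, copied from the three tables above.
n_kv = [0, 512, 1024, 1536, 2048]
main_iq3_xxs = [101.94, 96.96, 99.03, 96.16, 95.02]
main_iq3_xxs_r4 = [133.47, 138.87, 134.57, 131.08, 125.61]
pr_iq3_xxs = [296.01, 283.30, 269.98, 258.78, 248.32]

def speedups(new, old):
    """Element-wise ratio new/old: how many times faster `new` is than `old`."""
    return [n / o for n, o in zip(new, old)]

vs_main = speedups(pr_iq3_xxs, main_iq3_xxs)
vs_r4 = speedups(pr_iq3_xxs, main_iq3_xxs_r4)

for kv, a, b in zip(n_kv, vs_main, vs_r4):
    print(f"N_KV={kv:5d}  vs IQ3_XXS: {a:.2f}x  vs IQ3_XXS_R4: {b:.2f}x")
```

With these numbers the PR is roughly 2.6x to 2.9x faster than IQ3_XXS on the main branch and roughly 2.0x to 2.2x faster than IQ3_XXS_R4, with the largest gains at small N_KV.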