ik_llama.cpp/github-data/pull_requests/516-Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2).md at 94aa54df763aecde4cc5f528a26f72e9c5f2d4ff

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-02 12:21:42 +00:00

Files

Thomas 94aa54df76 Add GitHub data (#637 )

2025-07-22 18:18:40 +02:00

2.1 KiB

Raw Blame History

🔀 #516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2)

Author	`ikawrakow`
State	❌ Closed
Created	2025-06-11
Updated	2025-06-11

Description

This PR is a follow up of #515, and applies the same technique to IQ3_XXS. We see nearly 3X increase in prompt processing speed compared to IQ3_XXS, and over 2X compared to IQ3_XXS_R4.

Sweep-bench for pure IQ3_XXS quantization of LlaMA-3.1-8B on a Ryzen-7950X CPU:

IQ3_XXS, main branch

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	5.023	101.94	7.365	17.38
512	128	512	5.281	96.96	8.088	15.83
512	128	1024	5.170	99.03	7.977	16.05
512	128	1536	5.324	96.16	7.942	16.12
512	128	2048	5.389	95.02	8.043	15.91

IQ3_XXS_R4, main branch

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	3.836	133.47	7.675	16.68
512	128	512	3.687	138.87	8.279	15.46
512	128	1024	3.805	134.57	8.245	15.53
512	128	1536	3.906	131.08	8.252	15.51
512	128	2048	4.076	125.61	8.545	14.98

IQ3_XXS, PR

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	1.730	296.01	7.641	16.75
512	128	512	1.807	283.30	8.333	15.36
512	128	1024	1.896	269.98	8.070	15.86
512	128	1536	1.978	258.78	8.481	15.09
512	128	2048	2.062	248.32	8.514	15.03

2.1 KiB Raw Blame History