🔀 #453 - Faster IQ3_KT and IQ4_KT

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-24 |
| **Updated** | 2025-05-24 |

Description

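This PR improves AVX2 performance for the trellis quants IQ3_KT and IQ4_KT, which were recently added in PR #441. The results below are for LLaMA-3.1-8B on a Ryzen-5975WX CPU. S_PP and S_TG are the prompt processing (PP) and token generation (TG) speeds in tokens per second (t/s) with N_KV tokens already in the KV cache.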

IQ3_KT

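| N_KV | S_PP t/s (main) | S_PP t/s (PR) | PP speedup | S_TG t/s (main) | S_TG t/s (PR) | TG speedup |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 61.98 | 71.59 | 1.155 | 11.17 | 13.30 | 1.191 |
| 512 | 61.27 | 70.79 | 1.155 | 11.10 | 13.19 | 1.188 |
| 1024 | 60.48 | 69.93 | 1.156 | 11.04 | 13.10 | 1.187 |
| 1536 | 59.94 | 69.15 | 1.154 | 10.95 | 12.96 | 1.184 |
| 2048 | 59.48 | 68.55 | 1.152 | 10.87 | 12.85 | 1.182 |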

IQ4_KT

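| N_KV | S_PP t/s (main) | S_PP t/s (PR) | PP speedup | S_TG t/s (main) | S_TG t/s (PR) | TG speedup |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | 44.32 | 64.91 | 1.465 | 9.36 | 11.69 | 1.249 |
| 512 | 43.90 | 64.12 | 1.461 | 9.26 | 11.56 | 1.248 |
| 1024 | 43.60 | 63.39 | 1.454 | 9.19 | 11.47 | 1.248 |
| 1536 | 43.32 | 62.86 | 1.451 | 9.12 | 11.37 | 1.247 |
| 2048 | 43.07 | 62.37 | 1.448 | 9.06 | 11.28 | 1.245 |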
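
The speedup columns are the ratio of PR throughput to main-branch throughput. As a minimal sketch (the `speedup` helper and the `results` dictionary are illustrative names, not part of the PR), the N_KV = 0 rows of both tables can be reproduced like this:

```python
# Throughput in tokens/s at N_KV = 0, copied from the two tables above.
results = {
    "IQ3_KT": {"pp_main": 61.98, "pp_pr": 71.59, "tg_main": 11.17, "tg_pr": 13.30},
    "IQ4_KT": {"pp_main": 44.32, "pp_pr": 64.91, "tg_main": 9.36, "tg_pr": 11.69},
}


def speedup(main_tps: float, pr_tps: float) -> float:
    """Return the ratio of PR throughput to main-branch throughput."""
    return pr_tps / main_tps


for quant, r in results.items():
    pp = speedup(r["pp_main"], r["pp_pr"])
    tg = speedup(r["tg_main"], r["tg_pr"])
    print(f"{quant}: PP speedup {pp:.3f}, TG speedup {tg:.3f}")
```

The same ratio applies to every other row; only the throughput values change.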

CPU performance is still much lower than for other quantization types. However, memory bandwidth is far from saturated, so both PP and TG should be faster on a CPU with more or faster cores.
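
One rough way to check the bandwidth claim is to estimate how much weight data TG streams from memory per second. The sketch below assumes that every weight is read once per generated token and ignores the KV cache. The model size and peak bandwidth values are hypothetical placeholders, not numbers from this PR, and should be replaced with the actual file size and the measured bandwidth of the machine.

```python
def tg_bandwidth_gbs(tokens_per_s: float, model_size_gb: float) -> float:
    """Approximate weight traffic in GB/s during token generation.

    Assumes each weight is read from RAM once per generated token,
    ignoring the KV cache and any reuse from CPU caches.
    """
    return tokens_per_s * model_size_gb


# HYPOTHETICAL inputs, not taken from the PR:
model_size_gb = 4.0         # assumed size of the quantized 8B model file
peak_bandwidth_gbs = 100.0  # assumed usable memory bandwidth of the machine

# IQ4_KT TG speed at N_KV = 0 with this PR, from the table above.
used = tg_bandwidth_gbs(11.69, model_size_gb)
print(f"~{used:.1f} GB/s, {used / peak_bandwidth_gbs:.0%} of the assumed peak")
```

If the estimate comes out well below the machine's real bandwidth, TG is limited by compute rather than memory, which is consistent with the expectation that more or faster cores would help.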
