🔀 #566 - Adding IQ3_KS quants
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-07-01 |
| Updated | 2025-07-02 |
Description
This PR adds IQ3_KS: 3.1875 bpw quants with a block size of 32. This completes the IQX_KS quant series (a rough storage-cost sketch follows the table below):
| type | bpw |
|---|---|
| IQ2_KS | 2.1875 |
| IQ3_KS | 3.1875 |
| IQ4_KS | 4.25 |
| IQ5_KS | 5.25 |
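To put the bpw figures in perspective, here is a minimal sketch (not part of this PR, and not the actual packed layout) that estimates the raw weight-storage cost at each bpw for a hypothetical 8B-parameter model. As a side note on the arithmetic, 3.1875 bpw is exactly 3 bits per weight plus 6 extra bits per 32-weight block (6/32 = 0.1875); the real on-disk layout is defined by the quantization code in this PR.

```cpp
#include <cstdio>
#include <cstdint>

// Rough storage estimate for weights quantized at a given bits-per-weight.
// Illustrative arithmetic only - not the IQK block layout.
static double tensor_bytes(uint64_t n_weights, double bpw) {
    return n_weights * bpw / 8.0;
}

int main() {
    // Hypothetical example: an 8B-parameter model's weight tensors.
    const uint64_t n_weights = 8ull * 1000 * 1000 * 1000;
    struct { const char * name; double bpw; } quants[] = {
        { "IQ2_KS", 2.1875 }, { "IQ3_KS", 3.1875 },
        { "IQ4_KS", 4.25   }, { "IQ5_KS", 5.25   },
    };
    for (const auto & q : quants) {
        printf("%-7s %.4f bpw -> %.2f GiB of weights\n", q.name, q.bpw,
               tensor_bytes(n_weights, q.bpw) / (1024.0 * 1024.0 * 1024.0));
    }
    // 3.1875 = 3 bits per weight + 6 bits per 32-weight block (6/32 = 0.1875).
    printf("IQ3_KS bpw check: %.4f\n", 3.0 + 6.0 / 32.0);
}
```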
CUDA and CPU performance are very good; Metal performance is not as strong.
Here are a few sweep-bench results for LLaMA-3.1-8B-Instruct:
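For readers unfamiliar with the sweep-bench tables: my reading of the columns (inferred from the numbers, not taken from the tool's source) is that each row times a 512-token prompt-processing chunk (PP) and a 128-token generation run (TG) at a given KV-cache depth (N_KV), with the S_* columns being tokens divided by elapsed seconds. A tiny sanity-check sketch, using the first RTX-4080 row:

```cpp
#include <cstdio>

// Throughput = tokens / elapsed seconds. The times below are copied from the
// first RTX-4080 row; small differences vs. the table come from the printed
// times being rounded to three decimals.
int main() {
    const int    pp   = 512,   tg   = 128;   // tokens per PP chunk / TG run
    const double t_pp = 0.065, t_tg = 0.887; // seconds, from the table
    printf("S_PP ~ %.1f t/s (table: 7932.94)\n", pp / t_pp);
    printf("S_TG ~ %.1f t/s (table: 144.38)\n",  tg / t_tg);
}
```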
RTX-4080
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 512 | 0.065 | 7932.94 | 0.887 | 144.38 |
| 512 | 128 | 1024 | 0.066 | 7725.27 | 0.893 | 143.35 |
| 512 | 128 | 1536 | 0.068 | 7551.51 | 0.908 | 141.02 |
| 512 | 128 | 2048 | 0.069 | 7404.30 | 0.924 | 138.59 |
| 512 | 128 | 2560 | 0.072 | 7098.39 | 0.939 | 136.30 |
| 512 | 128 | 3072 | 0.074 | 6873.96 | 0.955 | 134.08 |
| 512 | 128 | 3584 | 0.074 | 6890.43 | 0.969 | 132.07 |
| 512 | 128 | 4096 | 0.077 | 6620.20 | 0.987 | 129.64 |
| 512 | 128 | 4608 | 0.079 | 6445.44 | 1.000 | 128.00 |
| 512 | 128 | 5120 | 0.081 | 6350.94 | 1.026 | 124.82 |
| 512 | 128 | 5632 | 0.083 | 6175.82 | 1.033 | 123.97 |
| 512 | 128 | 6144 | 0.084 | 6071.67 | 1.043 | 122.77 |
| 512 | 128 | 6656 | 0.086 | 5944.16 | 1.057 | 121.15 |
| 512 | 128 | 7168 | 0.088 | 5810.65 | 1.071 | 119.46 |
| 512 | 128 | 7680 | 0.090 | 5693.89 | 1.087 | 117.77 |
Ryzen-7950X (Zen4)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.423 | 359.79 | 7.616 | 16.81 |
| 512 | 128 | 512 | 1.479 | 346.15 | 7.800 | 16.41 |
| 512 | 128 | 1024 | 1.537 | 333.06 | 7.979 | 16.04 |
| 512 | 128 | 1536 | 1.603 | 319.47 | 7.939 | 16.12 |
| 512 | 128 | 2048 | 1.661 | 308.29 | 7.984 | 16.03 |
| 512 | 128 | 2560 | 1.722 | 297.39 | 8.071 | 15.86 |
| 512 | 128 | 3072 | 1.778 | 287.90 | 8.154 | 15.70 |
| 512 | 128 | 3584 | 1.841 | 278.04 | 8.241 | 15.53 |
Ryzen-5975WX
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.697 | 301.64 | 6.933 | 18.46 |
| 512 | 128 | 512 | 1.760 | 290.91 | 7.062 | 18.13 |
| 512 | 128 | 1024 | 1.834 | 279.19 | 7.217 | 17.74 |
| 512 | 128 | 1536 | 1.910 | 268.03 | 7.414 | 17.26 |
| 512 | 128 | 2048 | 1.985 | 257.88 | 7.555 | 16.94 |
| 512 | 128 | 2560 | 2.062 | 248.26 | 7.666 | 16.70 |
| 512 | 128 | 3072 | 2.140 | 239.29 | 7.810 | 16.39 |
| 512 | 128 | 3584 | 2.217 | 230.98 | 7.987 | 16.03 |
M2-Max CPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.119 | 164.13 | 5.410 | 23.66 |
| 512 | 128 | 512 | 3.322 | 154.14 | 5.487 | 23.33 |
| 512 | 128 | 1024 | 3.614 | 141.66 | 5.658 | 22.62 |
| 512 | 128 | 1536 | 3.872 | 132.23 | 5.735 | 22.32 |
| 512 | 128 | 2048 | 4.089 | 125.21 | 5.911 | 21.65 |
M2-Max 30-core GPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.088 | 470.79 | 3.255 | 39.33 |
| 512 | 128 | 512 | 1.106 | 462.77 | 3.411 | 37.53 |
| 512 | 128 | 1024 | 1.126 | 454.85 | 3.579 | 35.77 |
| 512 | 128 | 1536 | 1.153 | 444.08 | 3.762 | 34.03 |
| 512 | 128 | 2048 | 1.178 | 434.48 | 3.965 | 32.28 |
| 512 | 128 | 2560 | 1.207 | 424.23 | 4.118 | 31.08 |
| 512 | 128 | 3072 | 1.235 | 414.51 | 4.290 | 29.84 |
| 512 | 128 | 3584 | 1.265 | 404.69 | 4.461 | 28.69 |
💬 Conversation
👤 ikawrakow commented on 2025-07-02 at 07:27:42:
Let's merge this so people don't get crashes when trying to run IQ3_KS models with the main branch.
👤 Nexesenex commented on 2025-07-02 at 15:01:59:
Thanks for the explanation; I understand that the alternatives you have at the moment are quite impractical.
In any case, thank you for the IQ3_KS (and the CUDA MMQ kernels you kindly provided for most quants). It completes the KS quant set, which is more practical to quantize than the Trellis quants, which are indeed very demanding. I'm very happy with all of this, compared to what mainline currently limits itself to.