🔀 #53 - Quantization mixes tweaks

Author ikawrakow
State Closed
Created 2024-09-14
Updated 2024-09-14

Description

This PR changes the per-tensor quantization type selection for some quantization mixes. This leads to lower PPL and a smaller quantized model size for Gemma-2 models.

The following table compares the main branch and this PR for Gemma2-9b in terms of bits-per-weight (bpw) and quantization error (QError), defined as PPL(Q)/PPL(fp16) - 1.
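
As a quick illustration of the definition (using the IQ4_XS row of the table below), a QError of 1.33% means the quantized perplexity is about 1.33% above the fp16 baseline:

$$
\mathrm{QError} = \frac{\mathrm{PPL}(Q)}{\mathrm{PPL}(\mathrm{fp16})} - 1 = 0.0133
\quad\Longrightarrow\quad
\mathrm{PPL}(Q) \approx 1.0133 \times \mathrm{PPL}(\mathrm{fp16})
$$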

| Type | bpw (main) | QError (main) | bpw (PR) | QError (PR) |
|---|---|---|---|---|
| IQ1_M | 2.20 | 78.04% | 2.15 | 67.55% |
| IQ2_XXS | 2.44 | 41.79% | 2.37 | 38.64% |
| IQ2_XS | 2.65 | 29.58% | 2.58 | 26.72% |
| IQ2_S | 2.77 | 22.12% | 2.68 | 21.82% |
| IQ2_M | 2.97 | 15.22% | 2.87 | 15.12% |
| IQ3_XXS | 3.28 | 8.46% | 3.19 | 8.07% |
| IQ3_S | 3.75 | 4.79% | 3.68 | 3.97% |
| IQ4_XS | 4.48 | 1.56% | 4.42 | 1.33% |

Basically, Gemma models use the same tensor for token embeddings and output, so this tensor needs to be quantized with more bits; and because it is very large due to the large vocabulary, quantized models end up with significantly more bpw for the entire model compared to the bpw of the main quantization type. The idea here is to quantize output.weight with one of the new quantization types (IQ4_K for 2- and low-3-bit quantization, IQ5_K for the others), and to use a higher-bpw type for the attn_v tensor (IQ3_K, IQ4_K, or IQ5_K, depending on quantization type).
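
A minimal sketch of the kind of per-tensor override described above, assuming hypothetical names (the enum, the `pick_tensor_type` helper, and the tensor-name matching are illustrative only, not the actual ik_llama.cpp quantization code):

```cpp
#include <string>

// Illustrative subset of the quantization types mentioned in this PR.
enum class QType { IQ1_M, IQ2_XXS, IQ3_XXS, IQ3_K, IQ4_K, IQ5_K, IQ4_XS };

// Hypothetical helper: choose the type for one tensor given the default type
// of the requested quantization mix. `low_bit_mix` would be true for the
// 2-bit and low-3-bit mixes.
static QType pick_tensor_type(const std::string & name, QType mix_default, bool low_bit_mix) {
    if (name == "output.weight") {
        // Shared token-embedding/output tensor: spend more bits on it,
        // using the new IQ4_K / IQ5_K types.
        return low_bit_mix ? QType::IQ4_K : QType::IQ5_K;
    }
    if (name.find("attn_v.weight") != std::string::npos) {
        // attn_v also gets a higher-bpw type than the rest of the mix
        // (IQ3_K, IQ4_K, or IQ5_K in the PR, depending on the mix).
        return low_bit_mix ? QType::IQ3_K : QType::IQ5_K;
    }
    return mix_default; // all other tensors keep the mix's default type
}
```

The actual selection in the PR is more fine-grained across the individual mixes; this sketch only captures the two overrides discussed above.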