
🔀 #302 - Quantization improvements (2)

Author ikawrakow
State Closed
Created 2025-03-31
Updated 2025-04-02

Description

This PR is a follow-up to #295. It applies the same approach to type-1 quants (Q2_K, Q4_K, Q5_K, Q4_1, Q5_1) and to IQ3_K. Quantization speed for IQ3_K is improved by a significant margin (up to 40%), and quantization speed for type-1 quants is also slightly improved ($\le 15\%$). The changes do not improve PPL for all tested models, but they do improve PPL for the models that are more difficult to quantize (e.g., the LLaMA-3 series of models), and avoid a near-catastrophic failure of IQ3_K on DeepSeek-Lite.
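For context, the "type-1" quants listed above store a per-block (or, for the k-quants, per-sub-block) scale and minimum, so each weight is reconstructed as d·q + m with an unsigned integer q. The C sketch below is only a minimal round-to-nearest illustration of that layout, with hypothetical names and unpacked storage mirroring Q4_1's 32-weight, 4-bit blocks; it is not ik_llama.cpp's actual code, and the improvements in this PR and #295 concern how the scale and minimum are chosen, which this naive range fit does not attempt to reproduce.

```c
// Simplified illustration of a "type-1" block: per-block scale d and minimum m,
// weights reconstructed as x ≈ d*q + m (q unsigned). Names are illustrative only.
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 32
#define NMAX       15   // 4-bit quants: q in [0, 15]

typedef struct {
    float   d;              // scale
    float   m;              // minimum
    uint8_t q[BLOCK_SIZE];  // quantized values (kept unpacked for clarity)
} block_type1;

void quantize_block_type1(const float *x, block_type1 *b) {
    float min = FLT_MAX, max = -FLT_MAX;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    // Map [min, max] onto the integer grid [0, NMAX]. The PR's improvements
    // lie in a smarter search for d and m than this naive range fit.
    const float d  = (max - min) / NMAX;
    const float id = d > 0 ? 1.0f / d : 0.0f;
    b->d = d;
    b->m = min;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        int q = (int)roundf((x[i] - min) * id);
        b->q[i] = (uint8_t)(q < 0 ? 0 : q > NMAX ? NMAX : q);
    }
}

int main(void) {
    float x[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) x[i] = sinf(0.37f * i);  // dummy weights
    block_type1 b;
    quantize_block_type1(x, &b);
    double err = 0;
    for (int i = 0; i < BLOCK_SIZE; ++i) {
        float y = b.d * b.q[i] + b.m;   // reconstruction
        err += (x[i] - y) * (x[i] - y);
    }
    printf("RMS quantization error: %g\n", sqrt(err / BLOCK_SIZE));
    return 0;
}
```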

The following table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B¹ (L1-7B in the table), LLaMA-v2-7B¹ (L2-7B), Mistral-7B¹ (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). The context is always 512 tokens. Also given are the quantization times ("Q-time" in the table) in seconds on a Ryzen-7950X CPU. The tests use "pure" quantization (i.e., the --pure option of llama-quantize) with the token embeddings and output tensor set to Q8_0. The quantization command line is:

```
./bin/llama-quantize --imatrix $imatrix --token-embedding-type q8_0 --output-tensor-type q8_0 --pure $model $output $quant
```
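For reference, the PPL values below are standard perplexities over the evaluation text, computed with the stated context of 512 tokens (the evaluation dataset is not specified in the PR). For $N$ evaluated tokens with predicted probabilities $p(x_i \mid x_{<i})$:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$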
| Model | Quantization | PPL (main) | PPL (this PR) | Q-time (main), s | Q-time (this PR), s |
|-------|--------------|------------|---------------|------------------|---------------------|
| L1-7B | Q4_1  | 5.9773 | 5.9760 | N/A²   | N/A² |
| L2-7B | Q4_1  | 5.8676 | 5.8691 | 33.6   | 29.9 |
| M-7B  | Q4_1  | 5.7452 | 5.7471 | 36.7   | 32.3 |
| L3-8B | Q4_1  | 7.5309 | 7.5277 | 38.1   | 34.0 |
| DSL   | Q4_1  | 6.8639 | 6.8584 | 84.1   | 75.3 |
| L1-7B | Q5_1  | 5.9183 | 5.9182 | N/A²   | N/A² |
| L2-7B | Q5_1  | 5.8164 | 5.8175 | 35.6   | 30.8 |
| M-7B  | Q5_1  | 5.7067 | 5.7074 | 37.6   | 33.6 |
| L3-8B | Q5_1  | 7.3749 | 7.3759 | 38.7   | 34.7 |
| DSL   | Q5_1  | 6.7881 | 6.7875 | 86.4   | 76.5 |
| L1-7B | Q2_K  | 7.3154 | 7.2989 | N/A²,³ | N/A² |
| L2-7B | Q2_K  | 7.3044 | 7.2558 | 36.4   | 32.2 |
| M-7B  | Q2_K  | 6.9507 | 6.9273 | 38.4   | 35.0 |
| L3-8B | Q2_K  | 11.546 | 11.458 | 40.1   | 36.5 |
| DSL   | Q2_K  | 8.3822 | 8.3346 | 89.6   | 83.4 |
| L1-7B | Q4_K  | 5.9801 | 5.9779 | N/A²   | N/A² |
| L2-7B | Q4_K  | 5.8675 | 5.8673 | 34.1   | 30.7 |
| M-7B  | Q4_K  | 5.7449 | 5.7406 | 37.0   | 32.8 |
| L3-8B | Q4_K  | 7.5192 | 7.5157 | 38.2   | 34.5 |
| DSL   | Q4_K  | 6.8607 | 6.8570 | 75.7   | 68.5 |
| L1-7B | Q5_K  | 5.9314 | 5.9299 | N/A²   | N/A² |
| L2-7B | Q5_K  | 5.8144 | 5.8196 | 35.6   | 31.2 |
| M-7B  | Q5_K  | 5.7030 | 5.7064 | 37.3   | 34.1 |
| L3-8B | Q5_K  | 7.3941 | 7.3812 | 38.9   | 34.6 |
| DSL   | Q5_K  | 6.7929 | 6.7903 | 76.5   | 69.5 |
| L1-7B | IQ3_K | 6.1393 | 6.1377 | N/A²   | N/A² |
| L2-7B | IQ3_K | 6.0251 | 6.0227 | 44.7   | 36.9 |
| M-7B  | IQ3_K | 5.8835 | 5.8855 | 54.6   | 39.5 |
| L3-8B | IQ3_K | 7.9148 | 7.9189 | 56.3   | 41.4 |
| DSL   | IQ3_K | 7.3143 | 7.0409 | 116.4  | 92.5 |

¹ Why use such ancient models? The LLaMA-v1 models were the basis for k-quants development. I-quants were developed using LLaMA-v1, LLaMA-v2, and Mistral-7B. In my experience, if a quantization technique does well on all 3 of these, it is (almost) guaranteed to do well on any other model out there.

² I have this model on an old HDD, so the quantization time is dominated by the time needed to read the data from the HDD. I could have copied the model to the SSD drive, but I think the timings for the other models give enough indication of the relative performance.


💬 Conversation

👤 saood06 commented on 2025-04-02 at 10:55:25:

> and avoid a near-catastrophic failure of IQ3_K on DeepSeek-Lite.

Interestingly, for DSL, IQ3_K before this PR was actually worse than Q3_K before #295.