🗣️ #63 - LLaMA-3.2 quantization evaluation
| Author | ikawrakow |
|---|---|
| Created | 2024-09-26 |
| Updated | 2024-09-26 |
Description
LLaMA-3.2 is out. llama.cpp does not yet support the vision models, so this post focuses on the 1B and 3B text models, which could be very handy for local usage on low-end devices. These models are small enough to run even at full precision (bf16), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.
To reproduce the results reported here:
- Clone my validation dataset repository
git clone git@hf.co:datasets/ikawrakow/validation-datasets-for-llama.cpp
cd validation-datasets-for-llama.cpp
gunzip wiki.test.raw.gz
gunzip wiki.train.raw.gz
- Get one or more LLaMA-3.2 models. E.g.
git clone git@hf.co:meta-llama/Llama-3.2-3B
- Convert to GGUF. E.g.
python3 convert_hf_to_gguf.py --outtype bf16 Llama-3.2-3B/
- Create imatrix data. E.g.
./bin/llama-imatrix -m Llama-3.2-3B/Llama-3.2-3B-BF16.gguf -f validation-datasets-for-llama.cpp/wiki.train.raw --chunks 1000 -o l32_imatrix_c512.out
- Quantize. E.g. (see the sketch after this list for a loop over several quantization types)
./bin/llama-quantize --imatrix l32_imatrix_c512.out Llama-3.2-3B/Llama-3.2-3B-BF16.gguf iq4k.gguf iq4_k
- Compute perplexity
./bin/llama-perplexity -m iq4k.gguf -f validation-datasets-for-llama.cpp/wiki.test.raw -t 1 -ngl 100
- Compute HellaSwag
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/hellaswag-validation.bin --multiple-choice -t 1 -ngl 100 -c 2048
- Compute MMLU
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/mmlu-test.bin --multiple-choice -t 1 -ngl 100 -c 2048
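For convenience, the quantize and perplexity steps can be wrapped in a small loop over quantization types, along these lines (paths and the list of types are just examples):

```bash
#!/bin/bash
# Sketch: quantize the bf16 GGUF to several types and compute PPL for each.
# Adjust paths and the list of quantization types as needed.
MODEL=Llama-3.2-3B/Llama-3.2-3B-BF16.gguf
IMATRIX=l32_imatrix_c512.out
TEST=validation-datasets-for-llama.cpp/wiki.test.raw

for q in iq3_k iq4_k iq5_k q4_k_s q5_k_s q6_k; do
    ./bin/llama-quantize --imatrix $IMATRIX $MODEL l32-3b-$q.gguf $q
    ./bin/llama-perplexity -m l32-3b-$q.gguf -f $TEST -t 1 -ngl 100
done
```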
Perplexity
Perplexity (PPL in what follows) is not the best measure to compare different models, but it is extremely useful when comparing a quantized version of a model to the same full precision model. In the graphs below I use the quantization error defined as
quantization error = PPL(Q)/PPL(bf16) - 1
where PPL(Q) is the perplexity of quantization Q and PPL(bf16) is the perplexity of the full-precision model (the 3.2 models are released as bf16, so I use bf16 throughout; bf16 support was added here in PRs #39, #40, #41 and #56).
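As a concrete (purely illustrative) example of this definition, with placeholder PPL values rather than measured ones:

```bash
# Placeholder PPL values, for illustration only
PPL_BF16=9.40
PPL_Q=9.68
awk -v q=$PPL_Q -v f=$PPL_BF16 'BEGIN { printf "quantization error = %.2f%%\n", 100*(q/f - 1) }'
# -> quantization error = 2.98%
```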
The following graph shows the quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw, which includes the token_embedding.weight tensor. That tensor is quantized with more bits (typically Q6_K), and since it represents a sizable fraction of these small models, it has a noticeable impact on the overall bpw. The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are for the new quants IQ2_K, IQ3_K, IQ4_K, IQ5_K and IQ6_K that are not available in mainline llama.cpp. The black symbols are for i-quants, the red symbols for k-quants, and the blue symbols for legacy quants (Q4_0, Q4_1, Q5_0, Q5_1).
The next graph shows results for LLaMA-3.2-3B-Instruct. The results are qualitatively very similar to the base model, with slightly lower quantization errors.
My conclusions from these two graphs are:
- Going below 3 bpw with these models is not useful - the quantization error becomes too large. This is similar to the LLaMA-3.1 models.
- The new iqk-quants IQ4_K and IQ5_K are significantly better than k- or legacy quants in this bpw range.
- Legacy quants are mostly useless, as is so often the case.
The next graph is for the base LLaMA-3.2-1B model
Here the quantization error is significantly larger, dropping below 2% only for 5+ bpw. At about 4.95 bpw, IQ4_K has a quantization error of 3%, Q4_K_S is at 4.3%, and Q4_0 is at 12.5% (!) - nearly the same error as IQ3_K, which uses only 3.68 bpw.
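As an aside, the ~4.95 bpw quoted for IQ4_K can be understood with a back-of-the-envelope estimate (parameter counts and per-type bpw values below are approximate): the 1B model has roughly 263M token embedding parameters kept at Q6_K (~6.56 bpw), with the remaining ~973M parameters at the nominal 4.5 bpw of IQ4_K.

```bash
# Rough effective-bpw estimate for LLaMA-3.2-1B quantized with IQ4_K
# (approximate parameter counts and bpw values)
awk 'BEGIN {
    n_emb  = 262.7e6; bpw_emb  = 6.5625   # token_embd.weight at Q6_K
    n_rest = 973.1e6; bpw_rest = 4.5      # remaining tensors at IQ4_K
    printf "effective bpw = %.2f\n", (n_emb*bpw_emb + n_rest*bpw_rest) / (n_emb + n_rest)
}'
# -> effective bpw = 4.94
```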
HellaSwag
The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows
HellaSwag(bf16) - HellaSwag(Q)
for LLaMA-3.2-3B.
As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model's utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph.
We see that IQ4_K, IQ5_K, IQ6_K and Q6_K are basically indistinguishable from the bf16 model on the HellaSwag metric. But at less than 2 points below bf16, even IQ3_K and IQ3_S could be useful if HellaSwag is representative of the kind of tasks one intends to tackle.
MMLU
Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph
All quantizations above IQ3_K (3.6 bpw) are (nearly) indistinguishable from the full bf16 model according to this metric.
🗣️ Discussion
👤 ikawrakow replied on 2024-09-26 at 16:11:00:
Here are some performance numbers for the 1B model on a Ryzen-7950X CPU (a sketch of the benchmark invocation follows the table):
| model | size | backend | threads | test | t/s |
|---|---|---|---|---|---|
| llama 1B BF16 | 2.79 GiB | CPU | 16 | pp512 | 1217.13 ± 18.31 |
| llama 1B BF16 | 2.79 GiB | CPU | 1 | tg128 | 15.31 ± 0.19 |
| llama 1B BF16 | 2.79 GiB | CPU | 2 | tg128 | 22.97 ± 0.04 |
| llama 1B BF16 | 2.79 GiB | CPU | 4 | tg128 | 23.86 ± 0.08 |
| llama 1B BF16 | 2.79 GiB | CPU | 8 | tg128 | 23.45 ± 0.32 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 16 | pp512 | 1109.36 ± 24.77 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 1 | tg128 | 38.57 ± 0.24 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 2 | tg128 | 46.86 ± 0.04 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 4 | tg128 | 46.42 ± 0.11 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 8 | tg128 | 44.41 ± 0.07 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 16 | pp512 | 1211.41 ± 12.99 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 1 | tg128 | 30.81 ± 0.04 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 2 | tg128 | 57.37 ± 0.17 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 4 | tg128 | 76.93 ± 0.14 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 8 | tg128 | 74.61 ± 0.09 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 16 | pp512 | 982.76 ± 16.70 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 1 | tg128 | 24.76 ± 0.04 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 2 | tg128 | 46.39 ± 0.06 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 4 | tg128 | 66.47 ± 0.23 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 8 | tg128 | 64.73 ± 0.10 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 16 | pp512 | 1257.38 ± 13.08 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 1 | tg128 | 31.56 ± 0.55 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 2 | tg128 | 55.68 ± 0.28 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 4 | tg128 | 66.34 ± 0.27 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 8 | tg128 | 65.35 ± 0.23 |
| llama 1B Q6_K | 1.15 GiB | CPU | 16 | pp512 | 1271.25 ± 12.18 |
| llama 1B Q6_K | 1.15 GiB | CPU | 1 | tg128 | 31.43 ± 0.21 |
| llama 1B Q6_K | 1.15 GiB | CPU | 2 | tg128 | 51.40 ± 0.22 |
| llama 1B Q6_K | 1.15 GiB | CPU | 4 | tg128 | 58.25 ± 0.13 |
| llama 1B Q6_K | 1.15 GiB | CPU | 8 | tg128 | 57.64 ± 0.02 |
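The table above is in the format printed by llama-bench; something along these lines should produce similar rows (the model file name is just an example, and exact flags may differ between versions):

```bash
# pp512 with 16 threads, then tg128 with 1/2/4/8 threads
./bin/llama-bench -m l32-1b-iq4_k.gguf -p 512 -n 0 -t 16
./bin/llama-bench -m l32-1b-iq4_k.gguf -p 0 -n 128 -t 1,2,4,8
```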
👤 ikawrakow replied on 2024-09-26 at 16:18:44:
Here are some performance numbers for the 3B model on a Ryzen-7950X CPU:
| model | size | backend | threads | test | t/s |
|---|---|---|---|---|---|
| llama 3B BF16 | 6.72 GiB | CPU | 16 | pp512 | 482.81 ± 16.34 |
| llama 3B BF16 | 6.72 GiB | CPU | 1 | tg128 | 5.53 ± 0.05 |
| llama 3B BF16 | 6.72 GiB | CPU | 2 | tg128 | 8.65 ± 0.01 |
| llama 3B BF16 | 6.72 GiB | CPU | 4 | tg128 | 9.35 ± 0.02 |
| llama 3B BF16 | 6.72 GiB | CPU | 8 | tg128 | 9.14 ± 0.05 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 16 | pp512 | 383.82 ± 1.85 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 1 | tg128 | 14.93 ± 0.30 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 2 | tg128 | 18.66 ± 0.04 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 4 | tg128 | 18.03 ± 0.13 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 8 | tg128 | 17.20 ± 0.03 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 16 | pp512 | 409.30 ± 3.79 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 1 | tg128 | 11.58 ± 0.01 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 2 | tg128 | 22.28 ± 0.02 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 4 | tg128 | 39.25 ± 0.18 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 8 | tg128 | 37.45 ± 0.08 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 16 | pp512 | 418.06 ± 2.13 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 1 | tg128 | 12.23 ± 0.04 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 2 | tg128 | 23.16 ± 0.07 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 4 | tg128 | 30.55 ± 0.02 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 8 | tg128 | 29.41 ± 0.16 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 16 | pp512 | 445.79 ± 15.41 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 1 | tg128 | 13.85 ± 0.03 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 2 | tg128 | 22.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 4 | tg128 | 30.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 8 | tg128 | 29.77 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 16 | pp512 | 338.86 ± 7.69 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 1 | tg128 | 9.70 ± 0.12 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 2 | tg128 | 18.31 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 4 | tg128 | 26.21 ± 0.03 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 8 | tg128 | 25.18 ± 0.10 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 16 | pp512 | 432.96 ± 2.83 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 1 | tg128 | 12.89 ± 0.15 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 2 | tg128 | 22.54 ± 0.09 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 4 | tg128 | 26.37 ± 0.07 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 8 | tg128 | 25.55 ± 0.02 |
| llama 3B Q6_K | 2.76 GiB | CPU | 16 | pp512 | 439.73 ± 5.86 |
| llama 3B Q6_K | 2.76 GiB | CPU | 1 | tg128 | 12.90 ± 0.19 |
| llama 3B Q6_K | 2.76 GiB | CPU | 2 | tg128 | 21.05 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 4 | tg128 | 22.97 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 8 | tg128 | 22.20 ± 0.01 |