🗣️ #63 - LLaMA-3.2 quantization evaluation
| Author | ikawrakow |
|---|---|
| Created | 2024-09-26 |
| Updated | 2024-09-26 |
Description
LLaMA-3.2 is out. llama.cpp does not yet support the vision models, so this post focuses on the 1B and 3B text models, which could be very handy for local usage on low-end devices. These models are small enough to run even at full precision (bf16), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.
To reproduce the results reported here:
- Clone my validation dataset repository
git clone git@hf.co:datasets/ikawrakow/validation-datasets-for-llama.cpp
cd validation-datasets-for-llama.cpp
gunzip wiki.test.raw.gz
gunzip wiki.train.raw.gz
- Get one or more LLaMA-3.2 models. E.g.
git clone git@hf.co:meta-llama/Llama-3.2-3B
- Convert to GGUF. E.g.
python3 convert_hf_to_gguf.py --outtype bf16 Llama-3.2-3B/
- Create imatrix data. E.g.
./bin/llama-imatrix -m Llama-3.2-3B/Llama-3.2-3B-BF16.gguf -f validation-datasets-for-llama.cpp/wiki.train.raw --chunks 1000 -o l32_imatrix_c512.out
- Quantize. E.g. (see the sketch after this list for a loop over several quantization types)
./bin/llama-quantize --imatrix l32_imatrix_c512.out Llama-3.2-3B/Llama-3.2-3B-BF16.gguf iq4k.gguf iq4_k
- Compute perplexity
./bin/llama-perplexity -m iq4k.gguf -f validation-datasets-for-llama.cpp/wiki.test.raw -t 1 -ngl 100
- Compute HellaSwag
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/hellaswag-validation.bin --multiple-choice -t 1 -ngl 100 -c 2048
- Compute MMLU
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/mmlu-test.bin --multiple-choice -t 1 -ngl 100 -c 2048
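For convenience, the quantize and perplexity steps can be wrapped in a small loop over quantization types, along these lines (paths and the list of types are just examples):

```bash
#!/bin/bash
# Sketch: quantize the bf16 GGUF to several types and compute PPL for each.
# Adjust paths and the list of quantization types as needed.
MODEL=Llama-3.2-3B/Llama-3.2-3B-BF16.gguf
IMATRIX=l32_imatrix_c512.out
TEST=validation-datasets-for-llama.cpp/wiki.test.raw

for q in iq3_k iq4_k iq5_k q4_k_s q5_k_s q6_k; do
    ./bin/llama-quantize --imatrix $IMATRIX $MODEL l32-3b-$q.gguf $q
    ./bin/llama-perplexity -m l32-3b-$q.gguf -f $TEST -t 1 -ngl 100
done
```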
Perplexity
Perplexity (PPL in what follows) is not the best measure to compare different models, but it is extremely useful when comparing a quantized version of a model to the same full precision model. In the graphs below I use the quantization error defined as
quantization error = PPL(Q)/PPL(bf16) - 1
where PPL(Q) is the perplexity of quantization Q and PPL(bf16) is the perplexity of the full-precision model (the 3.2 models are released as bf16, so I use bf16 throughout; bf16 support was added here in PRs #39, #40, #41 and #56).
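As a concrete (purely illustrative) example of this definition, with placeholder PPL values rather than measured ones:

```bash
# Placeholder PPL values, for illustration only
PPL_BF16=9.40
PPL_Q=9.68
awk -v q=$PPL_Q -v f=$PPL_BF16 'BEGIN { printf "quantization error = %.2f%%\n", 100*(q/f - 1) }'
# -> quantization error = 2.98%
```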
The following graph shows the quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw, which includes the token_embedding.weight tensor. That tensor is quantized with more bits (typically Q6_K), and since it represents a sizable fraction of these small models, it has a noticeable impact on the overall bpw. The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are for the new quants IQ2_K, IQ3_K, IQ4_K, IQ5_K and IQ6_K that are not available in mainline llama.cpp. The black symbols are for i-quants, the red symbols for k-quants, and the blue symbols for legacy quants (Q4_0, Q4_1, Q5_0, Q5_1).
The next graph shows results for LLaMA-3.2-3B-Instruct. The results are qualitatively very similar to the base model, with slightly lower quantization errors.
My conclusions from these two graphs are:
- Going below 3 bpw with these models is not useful - the quantization error becomes too large. This is similar to the LLaMA-3.1 models.
- The new iqk-quants IQ4_K and IQ5_K are significantly better than k- or legacy quants in this bpw range.
- Legacy quants are mostly useless, as is so often the case.
The next graph is for the base LLaMA-3.2-1B model
Here the quantization error is significantly larger, dropping below 2% only for 5+ bpw. At about 4.95 bpw, IQ4_K has a quantization error of 3%, Q4_K_S is at 4.3%, and Q4_0 is at 12.5% (!) - nearly the same error as IQ3_K, which uses only 3.68 bpw.
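As an aside, the ~4.95 bpw quoted for IQ4_K can be understood with a back-of-the-envelope estimate (parameter counts and per-type bpw values below are approximate): the 1B model has roughly 263M token embedding parameters kept at Q6_K (~6.56 bpw), with the remaining ~973M parameters at the nominal 4.5 bpw of IQ4_K.

```bash
# Rough effective-bpw estimate for LLaMA-3.2-1B quantized with IQ4_K
# (approximate parameter counts and bpw values)
awk 'BEGIN {
    n_emb  = 262.7e6; bpw_emb  = 6.5625   # token_embd.weight at Q6_K
    n_rest = 973.1e6; bpw_rest = 4.5      # remaining tensors at IQ4_K
    printf "effective bpw = %.2f\n", (n_emb*bpw_emb + n_rest*bpw_rest) / (n_emb + n_rest)
}'
# -> effective bpw = 4.94
```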
HellaSwag
The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows
HellaSwag(bf16) - HellaSwag(Q)
for LLaMA-3.2-3B.
As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model's utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph.
We see that IQ4_K, IQ5_K, IQ6_K and Q6_K are basically indistinguishable from the bf16 model on the HellaSwag metric. But at less than 2 points below bf16, even IQ3_K and IQ3_S could be useful if HellaSwag is representative of the kind of tasks one intends to tackle.
MMLU
Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph
All quantizations above IQ3_K (3.6 bpw) are (nearly) indistinguishable from the full bf16 model according to this metric.
🗣️ Discussion
👤 ikawrakow replied on 2024-09-26 at 16:11:00:
Here are some performance numbers for the 1B model on a Ryzen-7950X CPU (a sketch of the benchmark invocation follows the table):
| model | size | backend | threads | test | t/s |
|---|---|---|---|---|---|
| llama 1B BF16 | 2.79 GiB | CPU | 16 | pp512 | 1217.13 ± 18.31 |
| llama 1B BF16 | 2.79 GiB | CPU | 1 | tg128 | 15.31 ± 0.19 |
| llama 1B BF16 | 2.79 GiB | CPU | 2 | tg128 | 22.97 ± 0.04 |
| llama 1B BF16 | 2.79 GiB | CPU | 4 | tg128 | 23.86 ± 0.08 |
| llama 1B BF16 | 2.79 GiB | CPU | 8 | tg128 | 23.45 ± 0.32 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 16 | pp512 | 1109.36 ± 24.77 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 1 | tg128 | 38.57 ± 0.24 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 2 | tg128 | 46.86 ± 0.04 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 4 | tg128 | 46.42 ± 0.11 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 8 | tg128 | 44.41 ± 0.07 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 16 | pp512 | 1211.41 ± 12.99 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 1 | tg128 | 30.81 ± 0.04 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 2 | tg128 | 57.37 ± 0.17 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 4 | tg128 | 76.93 ± 0.14 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 8 | tg128 | 74.61 ± 0.09 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 16 | pp512 | 982.76 ± 16.70 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 1 | tg128 | 24.76 ± 0.04 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 2 | tg128 | 46.39 ± 0.06 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 4 | tg128 | 66.47 ± 0.23 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 8 | tg128 | 64.73 ± 0.10 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 16 | pp512 | 1257.38 ± 13.08 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 1 | tg128 | 31.56 ± 0.55 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 2 | tg128 | 55.68 ± 0.28 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 4 | tg128 | 66.34 ± 0.27 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 8 | tg128 | 65.35 ± 0.23 |
| llama 1B Q6_K | 1.15 GiB | CPU | 16 | pp512 | 1271.25 ± 12.18 |
| llama 1B Q6_K | 1.15 GiB | CPU | 1 | tg128 | 31.43 ± 0.21 |
| llama 1B Q6_K | 1.15 GiB | CPU | 2 | tg128 | 51.40 ± 0.22 |
| llama 1B Q6_K | 1.15 GiB | CPU | 4 | tg128 | 58.25 ± 0.13 |
| llama 1B Q6_K | 1.15 GiB | CPU | 8 | tg128 | 57.64 ± 0.02 |
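The table above is in the format printed by llama-bench; something along these lines should produce similar rows (the model file name is just an example, and exact flags may differ between versions):

```bash
# pp512 with 16 threads, then tg128 with 1/2/4/8 threads
./bin/llama-bench -m l32-1b-iq4_k.gguf -p 512 -n 0 -t 16
./bin/llama-bench -m l32-1b-iq4_k.gguf -p 0 -n 128 -t 1,2,4,8
```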
👤 ikawrakow replied on 2024-09-26 at 16:18:44:
Here are some performance numbers for the 3B model on a Ryzen-7950X CPU:
| model | size | backend | threads | test | t/s |
|---|---|---|---|---|---|
| llama 3B BF16 | 6.72 GiB | CPU | 16 | pp512 | 482.81 ± 16.34 |
| llama 3B BF16 | 6.72 GiB | CPU | 1 | tg128 | 5.53 ± 0.05 |
| llama 3B BF16 | 6.72 GiB | CPU | 2 | tg128 | 8.65 ± 0.01 |
| llama 3B BF16 | 6.72 GiB | CPU | 4 | tg128 | 9.35 ± 0.02 |
| llama 3B BF16 | 6.72 GiB | CPU | 8 | tg128 | 9.14 ± 0.05 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 16 | pp512 | 383.82 ± 1.85 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 1 | tg128 | 14.93 ± 0.30 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 2 | tg128 | 18.66 ± 0.04 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 4 | tg128 | 18.03 ± 0.13 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 8 | tg128 | 17.20 ± 0.03 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 16 | pp512 | 409.30 ± 3.79 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 1 | tg128 | 11.58 ± 0.01 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 2 | tg128 | 22.28 ± 0.02 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 4 | tg128 | 39.25 ± 0.18 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 8 | tg128 | 37.45 ± 0.08 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 16 | pp512 | 418.06 ± 2.13 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 1 | tg128 | 12.23 ± 0.04 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 2 | tg128 | 23.16 ± 0.07 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 4 | tg128 | 30.55 ± 0.02 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 8 | tg128 | 29.41 ± 0.16 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 16 | pp512 | 445.79 ± 15.41 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 1 | tg128 | 13.85 ± 0.03 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 2 | tg128 | 22.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 4 | tg128 | 30.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 8 | tg128 | 29.77 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 16 | pp512 | 338.86 ± 7.69 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 1 | tg128 | 9.70 ± 0.12 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 2 | tg128 | 18.31 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 4 | tg128 | 26.21 ± 0.03 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 8 | tg128 | 25.18 ± 0.10 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 16 | pp512 | 432.96 ± 2.83 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 1 | tg128 | 12.89 ± 0.15 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 2 | tg128 | 22.54 ± 0.09 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 4 | tg128 | 26.37 ± 0.07 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 8 | tg128 | 25.55 ± 0.02 |
| llama 3B Q6_K | 2.76 GiB | CPU | 16 | pp512 | 439.73 ± 5.86 |
| llama 3B Q6_K | 2.76 GiB | CPU | 1 | tg128 | 12.90 ± 0.19 |
| llama 3B Q6_K | 2.76 GiB | CPU | 2 | tg128 | 21.05 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 4 | tg128 | 22.97 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 8 | tg128 | 22.20 ± 0.01 |