### 🗣️ [#63](https://github.com/ikawrakow/ik_llama.cpp/discussions/63) - LLaMA-3.2 quantization evaluation
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-09-26 |
| **Updated** | 2024-09-26 |
---
#### Description
LLaMA-3.2 is out. `llama.cpp` does not yet support the vision models, so this post focuses on the 1B and 3B text models, which could be very handy for local usage on low-end devices. The models are small enough to run even at full precision (`bf16`), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.
To reproduce the results reported here (a consolidated shell sketch follows step 8):
1. Clone my validation dataset repository
```
git clone git@hf.co:datasets/ikawrakow/validation-datasets-for-llama.cpp
cd validation-datasets-for-llama.cpp
gunzip wiki.test.raw.gz
gunzip wiki.train.raw.gz
```
2. Get one or more LLaMA-3.2 models. E.g.
```
git clone git@hf.co:meta-llama/Llama-3.2-3B
```
3. Convert to GGUF. E.g.
```
python3 convert_hf_to_gguf.py --outtype bf16 Llama-3.2-3B/
```
4. Create imatrix data. E.g.
```
./bin/llama-imatrix -m Llama-3.2-3B/Llama-3.2-3B-BF16.gguf -f validation-datasets-for-llama.cpp/wiki.train.raw --chunks 1000 -o l32_imatrix_c512.out
```
5. Quantize. E.g.
```
./bin/llama-quantize --imatrix l32_imatrix_c512.out Llama-3.2-3B/Llama-3.2-3B-BF16.gguf iq4k.gguf iq4_k
```
6. Compute perplexity
```
./bin/llama-perplexity -m iq4k.gguf -f validation-datasets-for-llama.cpp/wiki.test.raw -t 1 -ngl 100
```
7. Compute HellaSwag
```
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/hellaswag-validation.bin --multiple-choice -t 1 -ngl 100 -c 2048
```
8. Compute MMLU
```
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/mmlu-test.bin --multiple-choice -t 1 -ngl 100 -c 2048
```
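For sweeping several quantization types, steps 5 and 6 can be wrapped in a small shell loop. This is only a sketch: the paths, output names, and the list of types are placeholders, so adjust them to your setup.
```
#!/usr/bin/env bash
# Sketch only: loop steps 5 (quantize) and 6 (perplexity) over several quantization types.
# Paths, output names, and the list of types are placeholders.
set -e

MODEL=Llama-3.2-3B/Llama-3.2-3B-BF16.gguf
IMATRIX=l32_imatrix_c512.out
WIKI=validation-datasets-for-llama.cpp/wiki.test.raw

for q in iq3_k iq4_k iq5_k q4_k_s q5_k_s q6_k; do
    ./bin/llama-quantize --imatrix "$IMATRIX" "$MODEL" "$q.gguf" "$q"
    ./bin/llama-perplexity -m "$q.gguf" -f "$WIKI" -t 1 -ngl 100 2>&1 | tee "ppl_$q.log"
done
```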
### Perplexity
Perplexity (`PPL` in what follows) is not the best measure to compare *different* models, but it is extremely useful when comparing a quantized version of a model to the *same* full precision model. In the graphs below I use the quantization error defined as
```
quantization error = PPL(Q)/PPL(bf16) - 1
```
where `PPL(Q)` is the perplexity of quantization `Q` and `PPL(bf16)` is the perplexity of the full-precision model (the 3.2 models are released as `bf16`, and `bf16` support was added here in PRs #39, #40, #41 and #56, so I use `bf16` throughout).
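As an illustration with made-up numbers: if `PPL(bf16)` were 10.00 and `PPL(Q)` were 10.30, the quantization error would be 10.30/10.00 - 1 = 0.03, i.e. 3%.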
The following graph shows the quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw, which includes the `token_embedding.weight` tensor. That tensor is quantized with more bits (typically `Q6_K`) and accounts for a sizable fraction of the model size, so it has a noticeable impact on the overall bpw. The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are the new quants `IQ2_K, IQ3_K, IQ4_K, IQ5_K` and `IQ6_K` that are not available in mainline `llama.cpp`. The black symbols are i-quants, the red symbols are k-quants, and the blue symbols are legacy quants (`Q4_0, Q4_1, Q5_0, Q5_1`).
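To see why the token embedding matters for the effective bpw, take some purely illustrative numbers: if `token_embedding.weight` held 15% of the parameters and were kept at `Q6_K` (about 6.56 bpw) while the remaining 85% were quantized to 4.5 bpw, the effective bpw would be 0.15·6.56 + 0.85·4.5 ≈ 4.81.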

The next graph shows results for LLaMA-3.2-3B-Instruct. The results are qualitatively very similar to the base model, with a slightly lower quantization error.

My conclusions from these two graphs are:
1. Going below 3 bpw with these models is not useful - the quantization error becomes too large. This is similar to the 3.1 LLaMA models.
2. The new iqk-quants `IQ4_K` and `IQ5_K` are significantly better than k- or legacy quants in this bpw range.
3. Legacy quants are, as is so often the case, mostly useless.
The next graph is for the base LLaMA-3.2-1B model.

Here the quantization error is significantly larger, dropping below 2% only at 5+ bpw. At about 4.95 bpw `IQ4_K` has a quantization error of 3%, `Q4_K_S` is at 4.3%, and `Q4_0` is at 12.5% (!), nearly the same error as `IQ3_K` despite the latter needing only 3.68 bpw.
### HellaSwag
The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows
```
HellaSwag(bf16) - HellaSwag(Q)
```
for LLaMA-3.2-3B.

As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model's utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph.

We see that `IQ4_K, IQ5_K, IQ6_K` and `Q6_K` are basically indistinguishable from the `bf16` model for the HellaSwag metric. But at less than 2 points below `bf16`, even `IQ3_K` and `IQ3_S` could be useful if HellaSwag is representative of the kind of tasks one intends to tackle.
### MMLU
Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph

All quantizations above `IQ3_K` (3.6 bpw) are (nearly) indistinguishable from the full `bf16` model according to this metric.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-09-26** at **16:11:00**:
Here are some performance numbers for the 1B model on a Ryzen-7950X CPU:
| model | size | backend | threads | test | t/s |
| --------------- | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 1B BF16 | 2.79 GiB | CPU | 16 | pp512 | 1217.13 ± 18.31 |
| llama 1B BF16 | 2.79 GiB | CPU | 1 | tg128 | 15.31 ± 0.19 |
| llama 1B BF16 | 2.79 GiB | CPU | 2 | tg128 | 22.97 ± 0.04 |
| llama 1B BF16 | 2.79 GiB | CPU | 4 | tg128 | 23.86 ± 0.08 |
| llama 1B BF16 | 2.79 GiB | CPU | 8 | tg128 | 23.45 ± 0.32 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 16 | pp512 | 1109.36 ± 24.77 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 1 | tg128 | 38.57 ± 0.24 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 2 | tg128 | 46.86 ± 0.04 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 4 | tg128 | 46.42 ± 0.11 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 8 | tg128 | 44.41 ± 0.07 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 16 | pp512 | 1211.41 ± 12.99 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 1 | tg128 | 30.81 ± 0.04 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 2 | tg128 | 57.37 ± 0.17 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 4 | tg128 | 76.93 ± 0.14 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 8 | tg128 | 74.61 ± 0.09 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 16 | pp512 | 982.76 ± 16.70 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 1 | tg128 | 24.76 ± 0.04 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 2 | tg128 | 46.39 ± 0.06 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 4 | tg128 | 66.47 ± 0.23 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 8 | tg128 | 64.73 ± 0.10 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 16 | pp512 | 1257.38 ± 13.08 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 1 | tg128 | 31.56 ± 0.55 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 2 | tg128 | 55.68 ± 0.28 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 4 | tg128 | 66.34 ± 0.27 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 8 | tg128 | 65.35 ± 0.23 |
| llama 1B Q6_K | 1.15 GiB | CPU | 16 | pp512 | 1271.25 ± 12.18 |
| llama 1B Q6_K | 1.15 GiB | CPU | 1 | tg128 | 31.43 ± 0.21 |
| llama 1B Q6_K | 1.15 GiB | CPU | 2 | tg128 | 51.40 ± 0.22 |
| llama 1B Q6_K | 1.15 GiB | CPU | 4 | tg128 | 58.25 ± 0.13 |
| llama 1B Q6_K | 1.15 GiB | CPU | 8 | tg128 | 57.64 ± 0.02 |
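These tables have the shape of `llama-bench` output. A command along the following lines should produce a comparable sweep; the flags are assumed to match the mainline tool of the same name, and the model path is a placeholder:
```
# Sweep pp512 and tg128 over several thread counts for one quantized model (model path is a placeholder)
./bin/llama-bench -m iq4k.gguf -p 512 -n 128 -t 1,2,4,8,16
```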
---
👤 **ikawrakow** replied the **2024-09-26** at **16:18:44**:
Here are some performance numbers for the 3B model on a Ryzen-7950X CPU:
| model | size | backend | threads | test | t/s |
| --------------- | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 3B BF16 | 6.72 GiB | CPU | 16 | pp512 | 482.81 ± 16.34 |
| llama 3B BF16 | 6.72 GiB | CPU | 1 | tg128 | 5.53 ± 0.05 |
| llama 3B BF16 | 6.72 GiB | CPU | 2 | tg128 | 8.65 ± 0.01 |
| llama 3B BF16 | 6.72 GiB | CPU | 4 | tg128 | 9.35 ± 0.02 |
| llama 3B BF16 | 6.72 GiB | CPU | 8 | tg128 | 9.14 ± 0.05 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 16 | pp512 | 383.82 ± 1.85 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 1 | tg128 | 14.93 ± 0.30 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 2 | tg128 | 18.66 ± 0.04 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 4 | tg128 | 18.03 ± 0.13 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 8 | tg128 | 17.20 ± 0.03 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 16 | pp512 | 409.30 ± 3.79 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 1 | tg128 | 11.58 ± 0.01 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 2 | tg128 | 22.28 ± 0.02 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 4 | tg128 | 39.25 ± 0.18 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 8 | tg128 | 37.45 ± 0.08 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 16 | pp512 | 418.06 ± 2.13 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 1 | tg128 | 12.23 ± 0.04 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 2 | tg128 | 23.16 ± 0.07 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 4 | tg128 | 30.55 ± 0.02 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 8 | tg128 | 29.41 ± 0.16 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 16 | pp512 | 445.79 ± 15.41 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 1 | tg128 | 13.85 ± 0.03 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 2 | tg128 | 22.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 4 | tg128 | 30.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 8 | tg128 | 29.77 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 16 | pp512 | 338.86 ± 7.69 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 1 | tg128 | 9.70 ± 0.12 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 2 | tg128 | 18.31 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 4 | tg128 | 26.21 ± 0.03 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 8 | tg128 | 25.18 ± 0.10 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 16 | pp512 | 432.96 ± 2.83 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 1 | tg128 | 12.89 ± 0.15 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 2 | tg128 | 22.54 ± 0.09 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 4 | tg128 | 26.37 ± 0.07 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 8 | tg128 | 25.55 ± 0.02 |
| llama 3B Q6_K | 2.76 GiB | CPU | 16 | pp512 | 439.73 ± 5.86 |
| llama 3B Q6_K | 2.76 GiB | CPU | 1 | tg128 | 12.90 ± 0.19 |
| llama 3B Q6_K | 2.76 GiB | CPU | 2 | tg128 | 21.05 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 4 | tg128 | 22.97 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 8 | tg128 | 22.20 ± 0.01 |