### 🗣️ [#63](https://github.com/ikawrakow/ik_llama.cpp/discussions/63) - LLaMA-3.2 quantization evaluation

| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-09-26 |
| **Updated** | 2024-09-26 |

---

#### Description

LLaMA-3.2 is out. `llama.cpp` does not yet support the vision models, so this post focuses on the 1B and 3B text models, which could be very handy for local usage on low-end devices. The models are small enough to run even at full precision (`bf16`), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.

To reproduce the results reported here (a combined script sketch is given after the steps):

1. Clone my validation dataset repository
```
git clone git@hf.co:datasets/ikawrakow/validation-datasets-for-llama.cpp
cd validation-datasets-for-llama.cpp
gunzip wiki.test.raw.gz
gunzip wiki.train.raw.gz
```

2. Get one or more LLaMA-3.2 models. E.g.
```
git clone git@hf.co:meta-llama/Llama-3.2-3B
```

3. Convert to GGUF. E.g.
```
python3 convert_hf_to_gguf.py --outtype bf16 Llama-3.2-3B/
```

4. Create imatrix data. E.g.
```
./bin/llama-imatrix -m Llama-3.2-3B/Llama-3.2-3B-BF16.gguf -f validation-datasets-for-llama.cpp/wiki.train.raw --chunks 1000 -o l32_imatrix_c512.out
```

5. Quantize. E.g.
```
./bin/llama-quantize --imatrix l32_imatrix_c512.out Llama-3.2-3B/Llama-3.2-3B-BF16.gguf iq4k.gguf iq4_k
```

6. Compute perplexity
```
./bin/llama-perplexity -m iq4k.gguf -f validation-datasets-for-llama.cpp/wiki.test.raw -t 1 -ngl 100
```

7. Compute HellaSwag
```
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/hellaswag-validation.bin --multiple-choice -t 1 -ngl 100 -c 2048
```

8. Compute MMLU
```
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/mmlu-test.bin --multiple-choice -t 1 -ngl 100 -c 2048
```
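
For convenience, here is a minimal Python sketch that strings steps 5-8 together for several quantization types. The output file names and the particular list of quantization types (with their lowercase CLI spellings, following the `iq4_k` example above) are my assumptions for illustration; adjust paths to your own setup.

```
# Sketch: quantize the bf16 GGUF to several types, then run PPL / HellaSwag / MMLU for each.
# File names and the quant list below are assumptions, not part of the original post.
import subprocess

BF16_GGUF = "Llama-3.2-3B/Llama-3.2-3B-BF16.gguf"
IMATRIX   = "l32_imatrix_c512.out"
DATA      = "validation-datasets-for-llama.cpp"
QUANTS    = ["iq3_k", "iq4_k", "iq5_k", "q4_k_s", "q6_k"]  # example selection

def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

for q in QUANTS:
    gguf = f"l32-3B-{q}.gguf"
    # Step 5: quantize with the imatrix
    run(["./bin/llama-quantize", "--imatrix", IMATRIX, BF16_GGUF, gguf, q])
    # Step 6: perplexity on wiki.test.raw
    run(["./bin/llama-perplexity", "-m", gguf, "-f", f"{DATA}/wiki.test.raw", "-t", "1", "-ngl", "100"])
    # Steps 7 and 8: HellaSwag and MMLU multiple-choice scores
    for binfile in ("hellaswag-validation.bin", "mmlu-test.bin"):
        run(["./bin/llama-perplexity", "-m", gguf, "-bf", f"{DATA}/{binfile}",
             "--multiple-choice", "-t", "1", "-ngl", "100", "-c", "2048"])
```
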
### Perplexity

Perplexity (`PPL` in what follows) is not the best measure to compare *different* models, but it is extremely useful when comparing a quantized version of a model to the *same* full precision model. In the graphs below I use the quantization error, defined as

```
quantization error = PPL(Q)/PPL(bf16) - 1
```

where `PPL(Q)` is the perplexity of quantization `Q` and `PPL(bf16)` is the perplexity of the full-precision model (the 3.2 models are released as `bf16`, so I use `bf16` throughout; `bf16` support was added here in PRs #39, #40, #41 and #56).
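
Just to spell the formula out, here is a trivial Python sketch; the two perplexity values are made-up placeholders, not measurements from this post.

```
# Quantization error from two perplexity values (placeholder numbers, for illustration only).
def quantization_error(ppl_q: float, ppl_bf16: float) -> float:
    return ppl_q / ppl_bf16 - 1.0

ppl_bf16 = 10.00   # hypothetical PPL of the bf16 model
ppl_q    = 10.35   # hypothetical PPL of a quantized model
err = quantization_error(ppl_q, ppl_bf16)
print(f"quantization error: {err:.4f} ({100*err:.2f}%)")  # -> 0.0350 (3.50%)
```
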

The following graph shows the quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw, which includes the `token_embedding.weight` tensor; that tensor is quantized with more bits (typically `Q6_K`) and represents a significant fraction of the model size, so it has a noticeable impact on the overall bpw. The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are for the new quants `IQ2_K, IQ3_K, IQ4_K, IQ5_K` and `IQ6_K` that are not available in mainline `llama.cpp`. The black symbols are for i-quants, the red for k-quants, and the blue symbols are legacy quants (`Q4_0, Q4_1, Q5_0, Q5_1`).



The next graph shows results for LLaMA-3.2-3B-Instruct. They are qualitatively very similar to the base model, with a slightly lower quantization error.



My conclusions from these two graphs are:

1. Going below 3 bpw with these models is not useful - the quantization error becomes too large. This is similar to the LLaMA-3.1 models.
2. The new iqk-quants `IQ4_K` and `IQ5_K` are significantly better than k-quants or legacy quants in this bpw range.
3. Legacy quants are mostly useless, as is so often the case.

The next graph is for the base LLaMA-3.2-1B model



Here the quantization error is significantly larger, going below 2% only for 5+ bpw. At about 4.95 bpw `IQ4_K` has a quantization error of 3%, `Q4_K_S` is at 4.3%, and `Q4_0` is at 12.5% (!), nearly the same error as `IQ3_K` gives at just 3.68 bpw.

### HellaSwag

The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows

```
HellaSwag(bf16) - HellaSwag(Q)
```

for LLaMA-3.2-3B.

![l32_hellaswag](https://github.com/user-attachments/assets/75e5bd88-1c6c-4d8c-9a3e-0d18d52a7996)

As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model's utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph.

![l32_hellaswag1](https://github.com/user-attachments/assets/df4aca0c-b4be-4809-9d31-7b1a3869bf5c)

We see that `IQ4_K, IQ5_K, IQ6_K` and `Q6_K` are basically indistinguishable from the `bf16` model on the HellaSwag metric. But at less than 2 points below `bf16`, even `IQ3_K` and `IQ3_S` could be useful if HellaSwag is representative of the kind of tasks one intends to tackle.

### MMLU

Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph

![l32_mmlu](https://github.com/user-attachments/assets/b966c6a5-c35a-4325-b345-2ab85aba5906)

All quantizations above `IQ3_K` (3.6 bpw) are (nearly) indistinguishable from the full `bf16` model according to this metric.

---

#### 🗣️ Discussion

👤 **ikawrakow** replied on **2024-09-26** at **16:11:00**:<br>

Here are some performance numbers for the 1B model on a Ryzen-7950X CPU

| model | size | backend | threads | test | t/s |
| --------------- | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 1B BF16 | 2.79 GiB | CPU | 16 | pp512 | 1217.13 ± 18.31 |
| llama 1B BF16 | 2.79 GiB | CPU | 1 | tg128 | 15.31 ± 0.19 |
| llama 1B BF16 | 2.79 GiB | CPU | 2 | tg128 | 22.97 ± 0.04 |
| llama 1B BF16 | 2.79 GiB | CPU | 4 | tg128 | 23.86 ± 0.08 |
| llama 1B BF16 | 2.79 GiB | CPU | 8 | tg128 | 23.45 ± 0.32 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 16 | pp512 | 1109.36 ± 24.77 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 1 | tg128 | 38.57 ± 0.24 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 2 | tg128 | 46.86 ± 0.04 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 4 | tg128 | 46.42 ± 0.11 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 8 | tg128 | 44.41 ± 0.07 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 16 | pp512 | 1211.41 ± 12.99 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 1 | tg128 | 30.81 ± 0.04 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 2 | tg128 | 57.37 ± 0.17 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 4 | tg128 | 76.93 ± 0.14 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 8 | tg128 | 74.61 ± 0.09 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 16 | pp512 | 982.76 ± 16.70 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 1 | tg128 | 24.76 ± 0.04 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 2 | tg128 | 46.39 ± 0.06 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 4 | tg128 | 66.47 ± 0.23 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 8 | tg128 | 64.73 ± 0.10 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 16 | pp512 | 1257.38 ± 13.08 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 1 | tg128 | 31.56 ± 0.55 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 2 | tg128 | 55.68 ± 0.28 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 4 | tg128 | 66.34 ± 0.27 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 8 | tg128 | 65.35 ± 0.23 |
| llama 1B Q6_K | 1.15 GiB | CPU | 16 | pp512 | 1271.25 ± 12.18 |
| llama 1B Q6_K | 1.15 GiB | CPU | 1 | tg128 | 31.43 ± 0.21 |
| llama 1B Q6_K | 1.15 GiB | CPU | 2 | tg128 | 51.40 ± 0.22 |
| llama 1B Q6_K | 1.15 GiB | CPU | 4 | tg128 | 58.25 ± 0.13 |
| llama 1B Q6_K | 1.15 GiB | CPU | 8 | tg128 | 57.64 ± 0.02 |

---

👤 **ikawrakow** replied on **2024-09-26** at **16:18:44**:<br>

Here are some performance numbers for the 3B model on a Ryzen-7950X CPU

| model | size | backend | threads | test | t/s |
| --------------- | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 3B BF16 | 6.72 GiB | CPU | 16 | pp512 | 482.81 ± 16.34 |
| llama 3B BF16 | 6.72 GiB | CPU | 1 | tg128 | 5.53 ± 0.05 |
| llama 3B BF16 | 6.72 GiB | CPU | 2 | tg128 | 8.65 ± 0.01 |
| llama 3B BF16 | 6.72 GiB | CPU | 4 | tg128 | 9.35 ± 0.02 |
| llama 3B BF16 | 6.72 GiB | CPU | 8 | tg128 | 9.14 ± 0.05 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 16 | pp512 | 383.82 ± 1.85 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 1 | tg128 | 14.93 ± 0.30 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 2 | tg128 | 18.66 ± 0.04 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 4 | tg128 | 18.03 ± 0.13 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 8 | tg128 | 17.20 ± 0.03 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 16 | pp512 | 409.30 ± 3.79 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 1 | tg128 | 11.58 ± 0.01 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 2 | tg128 | 22.28 ± 0.02 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 4 | tg128 | 39.25 ± 0.18 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 8 | tg128 | 37.45 ± 0.08 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 16 | pp512 | 418.06 ± 2.13 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 1 | tg128 | 12.23 ± 0.04 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 2 | tg128 | 23.16 ± 0.07 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 4 | tg128 | 30.55 ± 0.02 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 8 | tg128 | 29.41 ± 0.16 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 16 | pp512 | 445.79 ± 15.41 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 1 | tg128 | 13.85 ± 0.03 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 2 | tg128 | 22.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 4 | tg128 | 30.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 8 | tg128 | 29.77 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 16 | pp512 | 338.86 ± 7.69 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 1 | tg128 | 9.70 ± 0.12 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 2 | tg128 | 18.31 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 4 | tg128 | 26.21 ± 0.03 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 8 | tg128 | 25.18 ± 0.10 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 16 | pp512 | 432.96 ± 2.83 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 1 | tg128 | 12.89 ± 0.15 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 2 | tg128 | 22.54 ± 0.09 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 4 | tg128 | 26.37 ± 0.07 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 8 | tg128 | 25.55 ± 0.02 |
| llama 3B Q6_K | 2.76 GiB | CPU | 16 | pp512 | 439.73 ± 5.86 |
| llama 3B Q6_K | 2.76 GiB | CPU | 1 | tg128 | 12.90 ± 0.19 |
| llama 3B Q6_K | 2.76 GiB | CPU | 2 | tg128 | 21.05 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 4 | tg128 | 22.97 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 8 | tg128 | 22.20 ± 0.01 |