quantize
You can also use the GGUF-my-repo space on Hugging Face to build your own quants without any setup.
Note: The space is synced from the llama.cpp main branch every 6 hours.
Example usage:
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>
# install Python dependencies
python3 -m pip install -r requirements.txt
# convert the model to GGUF FP16 format
python3 convert_hf_to_gguf.py models/mymodel/
# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M
# update the GGUF file type to the current version if the older version is no longer supported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
Run the quantized model:
# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
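The convert, quantize, and run steps above can also be scripted. The sketch below only assembles the same command lines as argument lists. The helper name `build_pipeline` is hypothetical, and it assumes the script and binaries sit in the current directory and that the conversion step writes `ggml-model-f16.gguf` into the model folder, as in the example.

```python
import shlex
from pathlib import Path


def build_pipeline(model_dir, quant_type="Q4_K_M", n_predict=128):
    """Return the convert, quantize, and run commands as argument lists.

    Hypothetical helper: it mirrors the shell commands in this README and
    assumes the FP16 file is named ggml-model-f16.gguf inside model_dir.
    """
    model_dir = Path(model_dir)
    f16 = (model_dir / "ggml-model-f16.gguf").as_posix()
    quantized = (model_dir / f"ggml-model-{quant_type}.gguf").as_posix()
    return [
        ["python3", "convert_hf_to_gguf.py", model_dir.as_posix()],
        ["./llama-quantize", f16, quantized, quant_type],
        ["./llama-cli", "-m", quantized, "-n", str(n_predict)],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in build_pipeline("models/mymodel"):
        print(shlex.join(cmd))
```

To execute the steps rather than print them, each list can be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing command.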
When running the larger models, make sure you have enough disk space to store all the intermediate files.
Memory/Disk Requirements
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
| Model | Original size | Quantized size (Q4_0) |
|---|---|---|
| 7B | 13 GB | 3.9 GB |
| 13B | 24 GB | 7.8 GB |
| 30B | 60 GB | 19.5 GB |
| 65B | 120 GB | 38.5 GB |
Quantization
Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
(The table below is outdated.)
| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
| 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
| 7B | ms/tok @ 4 threads | 127 | 55 | 54 | 76 | 83 | 72 |
| 7B | ms/tok @ 8 threads | 122 | 43 | 45 | 52 | 56 | 67 |
| 7B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
| 13B | perplexity | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
| 13B | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G |
| 13B | ms/tok @ 4 threads | - | 103 | 105 | 148 | 160 | 131 |
| 13B | ms/tok @ 8 threads | - | 73 | 82 | 98 | 105 | 128 |
| 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
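The bits/weight row can be reproduced from the block layout of these formats. The sketch below assumes each block holds 32 quantized weights plus one 16-bit scale, and that the `_1` variants store an additional 16-bit minimum. This layout is not described in this README and is stated here as an assumption; the names `LAYOUTS` and `bits_per_weight` are illustrative.

```python
BLOCK_SIZE = 32  # weights per block (assumed layout, not stated in this README)

# name -> (bits per quantized weight, number of 16-bit fields per block)
# One field is the scale; the "_1" variants are assumed to add a minimum.
LAYOUTS = {
    "Q4_0": (4, 1),
    "Q4_1": (4, 2),
    "Q5_0": (5, 1),
    "Q5_1": (5, 2),
    "Q8_0": (8, 1),
}


def bits_per_weight(quant_bits, fp16_fields, block_size=BLOCK_SIZE):
    """Average storage per weight, including the per-block metadata."""
    total_bits = block_size * quant_bits + 16 * fp16_fields
    return total_bits / block_size


if __name__ == "__main__":
    for name, (bits, fields) in LAYOUTS.items():
        print(f"{name}: {bits_per_weight(bits, fields):.1f} bits/weight")
```

Under these assumptions the per-block metadata accounts for the extra 0.5 or 1.0 bits above the nominal bit width.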
- k-quants
- recent k-quants improvements and new i-quants
- #2707
- #2807
- #4773 - 2-bit i-quants (inference)
- #4856 - 2-bit i-quants (inference)
- #4861 - importance matrix
- #4872 - MoE models
- #4897 - 2-bit quantization
- #4930 - imatrix for all k-quants
- #4951 - imatrix on the GPU
- #4969 - imatrix for legacy quants
- #4996 - k-quants tuning
- #5060 - Q3_K_XS
- #5196 - 3-bit i-quants
- quantization tuning, another one, and another one
Llama 2 7B
| Quantization | Bits per Weight (BPW) |
|---|---|
| Q2_K | 3.35 |
| Q3_K_S | 3.50 |
| Q3_K_M | 3.91 |
| Q3_K_L | 4.27 |
| Q4_K_S | 4.58 |
| Q4_K_M | 4.84 |
| Q5_K_S | 5.52 |
| Q5_K_M | 5.68 |
| Q6_K | 6.56 |
Llama 2 13B
| Quantization | Bits per Weight (BPW) |
|---|---|
| Q2_K | 3.34 |
| Q3_K_S | 3.48 |
| Q3_K_M | 3.89 |
| Q3_K_L | 4.26 |
| Q4_K_S | 4.56 |
| Q4_K_M | 4.83 |
| Q5_K_S | 5.51 |
| Q5_K_M | 5.67 |
| Q6_K | 6.56 |
Llama 2 70B
| Quantization | Bits per Weight (BPW) |
|---|---|
| Q2_K | 3.40 |
| Q3_K_S | 3.47 |
| Q3_K_M | 3.85 |
| Q3_K_L | 4.19 |
| Q4_K_S | 4.53 |
| Q4_K_M | 4.80 |
| Q5_K_S | 5.50 |
| Q5_K_M | 5.65 |
| Q6_K | 6.56 |
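The BPW tables can be turned into a rough file-size estimate: size in bytes is approximately the parameter count times BPW divided by 8. The sketch below applies this to the Llama 2 7B table and picks the largest quantization that fits a given size budget. The parameter count of about 6.74 billion and the helper names are assumptions that do not come from this README, and the estimate ignores file metadata and runtime overhead such as the KV cache.

```python
# BPW values copied from the Llama 2 7B table above.
LLAMA2_7B_BPW = {
    "Q2_K": 3.35,
    "Q3_K_S": 3.50,
    "Q3_K_M": 3.91,
    "Q3_K_L": 4.27,
    "Q4_K_S": 4.58,
    "Q4_K_M": 4.84,
    "Q5_K_S": 5.52,
    "Q5_K_M": 5.68,
    "Q6_K": 6.56,
}

N_PARAMS_7B = 6.74e9  # assumed parameter count, not taken from this README


def estimated_size_gb(n_params, bpw):
    """Rough model size in GB (10^9 bytes) for a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9


def largest_fit(bpw_table, n_params, budget_gb):
    """Return the highest-BPW quantization whose estimate fits the budget."""
    fitting = {
        name: bpw
        for name, bpw in bpw_table.items()
        if estimated_size_gb(n_params, bpw) <= budget_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for name, bpw in LLAMA2_7B_BPW.items():
        print(f"{name}: ~{estimated_size_gb(N_PARAMS_7B, bpw):.2f} GB")
    print("largest under 4.5 GB:", largest_fit(LLAMA2_7B_BPW, N_PARAMS_7B, 4.5))
```

Treat the result as a lower bound on what is needed at inference time, since context buffers add to the memory footprint.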