Files
ik_llama.cpp/examples/quantize
Kawrakow 1140b4568d Q8_KV: 8-bit quantization type targeting the KV cache (#208)
* Adding q8_KV - Basics + AVX2 gemm/gemv

* q8_KV: Better AVX2 gemm

* q8_KV: Better Zen4 gemm

We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-tinme-repacking is at 169 t/s.

* q8_KV: AVX2 gemm/gemv

We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.

* q8_KV: be able to use it for K cache

This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
  removed most of it when implementing the quants with per row scale,
  bit it was stull lurking in ggml_copy. Not sure if these were the last
  remnants of ggmil-style row sizes, or if there are still places left
* llama.cpp: get rid of the the 1d K cache assumption. Create and manage
  the K-cache as a 2D tensor so we can have per row meta data as needed
  by q8_KV.

Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.

* q8_KV: be able to use it for K cache in FA

* q8_KV: repack it for K*Q in FA

* q8_KV: slightly faster gemv on Zen4

* q8_KV: slightly faster gemv on Zen4

* q8_KV: ARM_NEON

We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.

* q8_KV: use it in FA on NEON

* q8_KV_r8 - repacked q8_KV

On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).

* q8_KV_r8: don't use nrc_y = 16 on Zen4

This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.

* q8_KV: nrc_y = 16 also doesn't pay off in FA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-19 11:47:07 +02:00
..
2024-07-27 07:55:01 +02:00

quantize

You can also use the GGUF-my-repo space on Hugging Face to build your own quants without any setup.

Note: It is synced from llama.cpp main every 6 hours.

Example usage:

# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert_hf_to_gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY

Run the quantized model:

# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128

When running the larger models, make sure you have enough disk space to store all the intermediate files.

Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

Model Original size Quantized size (Q4_0)
7B 13 GB 3.9 GB
13B 24 GB 7.8 GB
30B 60 GB 19.5 GB
65B 120 GB 38.5 GB

Quantization

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

(outdated)

Model Measure F16 Q4_0 Q4_1 Q5_0 Q5_1 Q8_0
7B perplexity 5.9066 6.1565 6.0912 5.9862 5.9481 5.9070
7B file size 13.0G 3.5G 3.9G 4.3G 4.7G 6.7G
7B ms/tok @ 4th 127 55 54 76 83 72
7B ms/tok @ 8th 122 43 45 52 56 67
7B bits/weight 16.0 4.5 5.0 5.5 6.0 8.5
13B perplexity 5.2543 5.3860 5.3608 5.2856 5.2706 5.2548
13B file size 25.0G 6.8G 7.6G 8.3G 9.1G 13G
13B ms/tok @ 4th - 103 105 148 160 131
13B ms/tok @ 8th - 73 82 98 105 128
13B bits/weight 16.0 4.5 5.0 5.5 6.0 8.5

Llama 2 7B

Quantization Bits per Weight (BPW)
Q2_K 3.35
Q3_K_S 3.50
Q3_K_M 3.91
Q3_K_L 4.27
Q4_K_S 4.58
Q4_K_M 4.84
Q5_K_S 5.52
Q5_K_M 5.68
Q6_K 6.56

Llama 2 13B

Quantization Bits per Weight (BPW)
Q2_K 3.34
Q3_K_S 3.48
Q3_K_M 3.89
Q3_K_L 4.26
Q4_K_S 4.56
Q4_K_M 4.83
Q5_K_S 5.51
Q5_K_M 5.67
Q6_K 6.56

Llama 2 70B

Quantization Bits per Weight (BPW)
Q2_K 3.40
Q3_K_S 3.47
Q3_K_M 3.85
Q3_K_L 4.19
Q4_K_S 4.53
Q4_K_M 4.80
Q5_K_S 5.50
Q5_K_M 5.65
Q6_K 6.56