mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-11 08:30:19 +00:00


Kawrakow 1140b4568d Q8_KV: 8-bit quantization type targeting the KV cache (#208)

* Adding q8_KV - Basics + AVX2 gemm/gemv

* q8_KV: Better AVX2 gemm

* q8_KV: Better Zen4 gemm

We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-time repacking is at 169 t/s.

* q8_KV: AVX2 gemm/gemv

We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.

* q8_KV: be able to use it for K cache

This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
  removed most of it when implementing the quants with per row scale,
  but it was still lurking in ggml_copy. Not sure if these were the last
  remnants of ggml-style row sizes, or if there are still places left.
* llama.cpp: get rid of the 1D K cache assumption. Create and manage
  the K-cache as a 2D tensor so we can have per row meta data as needed
  by q8_KV.

Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.
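
The row-size problem described in the ggml bullet above can be illustrated with a short sketch. It uses the q8_0 layout (blocks of 32 int8 values plus one fp16 scale, 34 bytes per block) and, for the per-row type, a hypothetical layout of an 8-byte row header followed by one int8 per value. The header size and both function names are assumptions made for illustration; they are not taken from the ggml sources.

```python
def blocked_row_size(n: int, block_size: int, type_size: int) -> int:
    """Classic block-based row size: whole blocks of type_size bytes each."""
    assert n % block_size == 0, "row length must be a multiple of the block size"
    return n // block_size * type_size


def per_row_meta_row_size(n: int, header_bytes: int = 8) -> int:
    """Row size for a type with per-row metadata (header_bytes is an assumption)."""
    return header_bytes + n  # one int8 per value after the row header


n = 4096
q8_0_bytes = blocked_row_size(n, block_size=32, type_size=34)
naive_bytes = blocked_row_size(n, block_size=1, type_size=1)  # formula misapplied
actual_bytes = per_row_meta_row_size(n)
print(q8_0_bytes, naive_bytes, actual_bytes)
```

Under these assumptions the block formula undercounts the per-row type by exactly the header size, so any code path that still derives row sizes from it would address rows incorrectly. The same reasoning suggests why the cache has to be a 2D tensor: the metadata belongs to a row, so the row boundaries must be known.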

* q8_KV: be able to use it for K cache in FA

* q8_KV: repack it for K*Q in FA

* q8_KV: slightly faster gemv on Zen4

* q8_KV: slightly faster gemv on Zen4

* q8_KV: ARM_NEON

We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.

* q8_KV: use it in FA on NEON

* q8_KV_r8 - repacked q8_KV

On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s).
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).

* q8_KV_r8: don't use nrc_y = 16 on Zen4

This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.

* q8_KV: nrc_y = 16 also doesn't pay off in FA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-02-19 11:47:07 +02:00

File	Last commit	Date
CMakeLists.txt	build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)	2024-06-13 00:41:52 +01:00
quantize.cpp	Q8_KV: 8-bit quantization type targeting the KV cache (#208)	2025-02-19 11:47:07 +02:00
README.md	Merge mainline llama.cpp (#3)	2024-07-27 07:55:01 +02:00
tests.sh	build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)	2024-06-13 00:41:52 +01:00

README.md

quantize

You can also use the GGUF-my-repo space on Hugging Face to build your own quants without any setup.

Note: The space is synced from llama.cpp main every 6 hours.

Example usage:

# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to GGUF FP16 format
python3 convert_hf_to_gguf.py models/mymodel/

# quantize the model to 4 bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the GGUF file to the current version if the older version is no longer supported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY

Run the quantized model:

# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
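
The convert, quantize, and run steps above can also be scripted. The sketch below only builds the three command lines shown in this README as argument lists. The helper name `build_pipeline` is hypothetical, and it assumes, as the commands above imply, that the conversion step leaves `ggml-model-f16.gguf` in the model directory.

```python
import shlex


def build_pipeline(model_dir: str, qtype: str = "Q4_K_M", n_predict: int = 128):
    """Return the convert, quantize, and run commands as argv lists."""
    d = model_dir.rstrip("/")
    f16 = f"{d}/ggml-model-f16.gguf"       # assumed output name of the conversion
    quant = f"{d}/ggml-model-{qtype}.gguf"
    return [
        ["python3", "convert_hf_to_gguf.py", f"{d}/"],
        ["./llama-quantize", f16, quant, qtype],
        ["./llama-cli", "-m", quant, "-n", str(n_predict)],
    ]


commands = build_pipeline("./models/mymodel")
for cmd in commands:
    print(shlex.join(cmd))  # print each command in shell-quoted form
```

To actually execute the steps, each list can be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing command.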

When running the larger models, make sure you have enough disk space to store all the intermediate files.

Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

Model	Original size	Quantized size (Q4_0)
7B	13 GB	3.9 GB
13B	24 GB	7.8 GB
30B	60 GB	19.5 GB
65B	120 GB	38.5 GB
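
As a quick way to read the table, the following sketch checks which model sizes fit into a given amount of RAM. The sizes are copied from the table above; the helper name `models_that_fit` is hypothetical, and the check ignores any additional memory needed at run time (for example the KV cache), so treat it as a lower bound.

```python
# (original size in GB, Q4_0 size in GB), copied from the table above
SIZES_GB = {
    "7B": (13.0, 3.9),
    "13B": (24.0, 7.8),
    "30B": (60.0, 19.5),
    "65B": (120.0, 38.5),
}


def models_that_fit(ram_gb: float, quantized: bool = True) -> list[str]:
    """List the models whose file size does not exceed ram_gb."""
    idx = 1 if quantized else 0
    return [name for name, sizes in SIZES_GB.items() if sizes[idx] <= ram_gb]


print(models_that_fit(16))                   # Q4_0 files
print(models_that_fit(16, quantized=False))  # original files
```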

Quantization

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

(outdated)

Model	Measure	F16	Q4_0	Q4_1	Q5_0	Q5_1	Q8_0
7B	perplexity	5.9066	6.1565	6.0912	5.9862	5.9481	5.9070
7B	file size	13.0G	3.5G	3.9G	4.3G	4.7G	6.7G
7B	ms/tok @ 4 threads	127	55	54	76	83	72
7B	ms/tok @ 8 threads	122	43	45	52	56	67
7B	bits/weight	16.0	4.5	5.0	5.5	6.0	8.5
13B	perplexity	5.2543	5.3860	5.3608	5.2856	5.2706	5.2548
13B	file size	25.0G	6.8G	7.6G	8.3G	9.1G	13G
13B	ms/tok @ 4 threads	-	103	105	148	160	131
13B	ms/tok @ 8 threads	-	73	82	98	105	128
13B	bits/weight	16.0	4.5	5.0	5.5	6.0	8.5
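
The bits/weight rows in the table follow from the block layout of each type: every block of 32 weights stores the quantized values plus an fp16 scale, and the `_1` variants add an fp16 minimum. The sketch below reproduces those numbers and turns them into a rough file-size estimate. The parameter count of about 6.74 billion for the 7B model is an assumption, and the estimate ignores file metadata and any tensors kept in a different type.

```python
BLOCK_SIZE = 32  # weights per block for these quantization types

# (bits per quantized weight, extra bits per block for scale / minimum)
LAYOUTS = {
    "Q4_0": (4, 16),  # fp16 scale
    "Q4_1": (4, 32),  # fp16 scale + fp16 minimum
    "Q5_0": (5, 16),
    "Q5_1": (5, 32),
    "Q8_0": (8, 16),
}


def bits_per_weight(quant_bits: int, extra_bits: int) -> float:
    """Average storage cost per weight, including per-block metadata."""
    return (BLOCK_SIZE * quant_bits + extra_bits) / BLOCK_SIZE


def estimated_size_gib(n_params: float, bpw: float) -> float:
    """Rough file size in GiB for n_params weights stored at bpw bits each."""
    return n_params * bpw / 8 / 2**30


N_PARAMS_7B = 6.74e9  # assumed approximate parameter count of the 7B model

for name, (qbits, extra) in LAYOUTS.items():
    bpw = bits_per_weight(qbits, extra)
    print(f"{name}: {bpw} bpw, ~{estimated_size_gib(N_PARAMS_7B, bpw):.1f} GiB")
```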

k-quants
recent k-quants improvements and new i-quants

Llama 2 7B

Quantization	Bits per Weight (BPW)
Q2_K	3.35
Q3_K_S	3.50
Q3_K_M	3.91
Q3_K_L	4.27
Q4_K_S	4.58
Q4_K_M	4.84
Q5_K_S	5.52
Q5_K_M	5.68
Q6_K	6.56

Llama 2 13B

Quantization	Bits per Weight (BPW)
Q2_K	3.34
Q3_K_S	3.48
Q3_K_M	3.89
Q3_K_L	4.26
Q4_K_S	4.56
Q4_K_M	4.83
Q5_K_S	5.51
Q5_K_M	5.67
Q6_K	6.56

Llama 2 70B

Quantization	Bits per Weight (BPW)
Q2_K	3.40
Q3_K_S	3.47
Q3_K_M	3.85
Q3_K_L	4.19
Q4_K_S	4.53
Q4_K_M	4.80
Q5_K_S	5.50
Q5_K_M	5.65
Q6_K	6.56
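
The fractional BPW values in these tables arise because a quantization mix stores different tensors in different types, so the reported figure is a weighted average. As a simplified illustration, the sketch below assumes a model made of only two types: Q4_K (144 bytes per 256 weights) and Q6_K (210 bytes per 256 weights). The real mixes are chosen by the quantization tool and involve more tensor types, so the computed fraction is illustrative only.

```python
SUPER_BLOCK = 256  # weights per k-quant super-block

Q4_K_BPW = 144 * 8 / SUPER_BLOCK  # 4.5 bits per weight
Q6_K_BPW = 210 * 8 / SUPER_BLOCK  # 6.5625 bits per weight


def mixed_bpw(parts):
    """Weighted average BPW for a list of (weight_count_or_share, bpw) pairs."""
    total = sum(n for n, _ in parts)
    return sum(n * bpw for n, bpw in parts) / total


def fraction_for_target(target: float, low: float, high: float) -> float:
    """Share of weights at `high` BPW needed to reach `target` on average."""
    return (target - low) / (high - low)


# Share of Q6_K weights that would give the 4.84 BPW listed for Q4_K_M (7B)
f = fraction_for_target(4.84, Q4_K_BPW, Q6_K_BPW)
check = mixed_bpw([(1 - f, Q4_K_BPW), (f, Q6_K_BPW)])
print(f"Q6_K share: {f:.3f}, resulting BPW: {check:.2f}")
```

In this two-type model, roughly one sixth of the weights at Q6_K is enough to lift the average from 4.5 to 4.84 BPW.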