mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-02-07 06:50:09 +00:00
### 🗣️ [#434](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) - Quant Cookers Basic Guide
| **Author** | `ubergarm` |
| :--- | :--- |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-21 |

---

#### Description

Quant Cooking Basic Guide
===

Example workflow for cooking custom quants with ik_llama.cpp that I used to generate [ubergarm/Qwen3-14B-GGUF](https://huggingface.co/ubergarm/Qwen3-14B-GGUF).

## Goal

The goal is to provide a specific example of a methodology that can be adapted for future LLMs and quant types in general.

In this guide we will download and quantize the dense model [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) on a gaming rig with a single 3090TI FE 24GB VRAM GPU.

We will use the latest [ik_llama.cpp quants](https://github.com/ikawrakow/ik_llama.cpp/pull/422) to target running this 14B model in GGUF format fully offloaded on <=16GB VRAM systems with 32k context.

This guide does *not* get into more complex topics such as MLA methodology, e.g. converting fp8 to bf16 on older GPU hardware.

## Dependencies

This is all run on a Linux rig, but feel free to use WSL for a similar experience if you're limited to a Windows-based OS.

Install any build essentials, git, etc. We will use `uv` for python virtual environment management to keep everything clean.

```bash
# Setup a folder to do your work and hold the models etc
mkdir /mnt/llms
cd /mnt/llms

# Install uv and python packages
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install huggingface_hub[hf-xet]

# Start downloading the bf16 safetensors from huggingface
mkdir -p Qwen/Qwen3-14B
cd Qwen/Qwen3-14B
huggingface-cli download --local-dir ./ Qwen/Qwen3-14B

# Make a target directory to hold your finished quants for uploading to huggingface
cd /mnt/llms
mkdir -p ubergarm/Qwen3-14B-GGUF # use your name obviously

# Install mainline llama.cpp (or evshiron's fork) just for the python conversion scripts.
cd /mnt/llms
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# Install and build ik_llama.cpp for the heavy lifting and SOTA GGUF quants.
cd /mnt/llms
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# Download your imatrix corpus and the wiki.test.raw test corpus.
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt

wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# Okay, now your folders should look something like this, and you are ready to begin cooking!
cd /mnt/llms
tree

.
├── venv
├── ik_llama.cpp
├── llama.cpp
├── Qwen
│   └── Qwen3-14B
└── ubergarm
    └── Qwen3-14B-GGUF
```

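Before moving on, a quick optional sanity check (not part of the original workflow, just a sketch) to confirm the safetensors download finished and the builds produced the binaries used later in this guide:

```bash
# Optional sanity checks: the safetensors shards should all be present,
# and the binaries used in the rest of this guide should exist.
ls -lh Qwen/Qwen3-14B/*.safetensors
ls llama.cpp/convert_hf_to_gguf.py
ls ik_llama.cpp/build/bin/ | grep -E 'llama-(quantize|imatrix|perplexity|sweep-bench|server)'
```
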
## Convert bf16 safetensors to bf16 gguf

I generally use mainline llama.cpp or evshiron's fork for doing the conversion with the python script.

```bash
# This took less than 12GiB RAM and about 30 seconds
cd /mnt/llms
uv pip install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match

python \
    llama.cpp/convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile ./ubergarm/Qwen3-14B-GGUF/ \
    ./Qwen/Qwen3-14B/

du -hc ./ubergarm/Qwen3-14B-GGUF/*.gguf
28G     ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
```

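If you want to double-check the tensor names and shapes you'll be targeting in the quant recipe below, one option (a sketch, assuming you also install the `gguf` Python package, which provides the `gguf-dump` CLI) is to dump the freshly converted file:

```bash
# Assumption: the gguf python package is installed into the same venv
uv pip install gguf

# Dump metadata plus the tensor name/shape list of the converted GGUF
gguf-dump ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf | less
```

The tensor names printed here are what the regular expressions in the quant recipe below are matched against.
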
## Generate imatrix

Notes:

1. This took just over 5 minutes on my high end gaming rig.
2. If you can't run the bf16 you could make a q8_0 without imatrix and then use that as the "baseline" instead (see the sketch after the command below).
3. I could offload 32 layers naively with `-ngl 32`, but do whatever you need to run inferencing, e.g. `-ngl 99 -ot ...` etc.
4. I don't bother with a fancy calibration corpus nor extra context length, as it isn't clearly proven to always improve results afaict.
5. Assuming you're offloading some to CPU, adjust threads as needed, or set it to exactly 1 if you are fully offloading to VRAM.

```bash
cd ik_llama.cpp
./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    -f calibration_data_v5_rc.txt \
    -o ./Qwen3-14B-BF16-imatrix.dat \
    -ngl 32 \
    --layer-similarity \
    --ctx-size 512 \
    --threads 16

mv ./Qwen3-14B-BF16-imatrix.dat ../ubergarm/Qwen3-14B-GGUF/
```

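Regarding note 2 above: if the bf16 is too heavy for your hardware, a minimal sketch of first cooking a plain Q8_0 (no imatrix required) and then pointing `llama-imatrix` at it instead:

```bash
# Sketch: make a Q8_0 "baseline" quant straight from the bf16, no imatrix needed
./build/bin/llama-quantize \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf \
    Q8_0 \
    16

# Then run the llama-imatrix command above with -m pointing at the Q8_0 instead
```

The same Q8_0 also doubles as a comparison point in the perplexity and KLD measurements further down.
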
## Create Quant Recipe

I personally like to make a bash script for each quant recipe. You can explore different mixes using layer-similarity or [other imatrix statistics tools](https://github.com/ggml-org/llama.cpp/pull/12718). Keep log files around with `./blah 2>&1 | tee -a logs/version-blah.log` (see the example after the recipe below).

I often like to start off with a pure q8_0 for benchmarking and then tweak as desired for target VRAM breakpoints.

```bash
#!/usr/bin/env bash

# token_embd.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}
#
# blk.28.ffn_down.weight, torch.bfloat16 --> BF16, shape = {17408, 5120}
# blk.28.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
# blk.28.ffn_up.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
#
# blk.28.attn_output.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_q.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_k.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
# blk.28.attn_v.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
#
# blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
# blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
#
# output_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# output.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}

custom="
# Attention
blk\.[0-9]\.attn_.*\.weight=iq5_ks
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks

# FFN
blk\.[0-9]\.ffn_down\.weight=iq5_ks
blk\.[1-3][0-9]\.ffn_down\.weight=iq5_ks

blk\.[0-9]\.ffn_(gate|up)\.weight=iq4_ks
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks

# Token embedding/output
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-imatrix.dat \
    --custom-q "$custom" \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf \
    IQ4_KS \
    16
```

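For example, if you save the recipe above as a script, keeping logs around looks something like this (the file and version names are just placeholders, use whatever naming scheme you like):

```bash
# Hypothetical filenames; adjust to your own naming/versioning scheme
mkdir -p logs
chmod +x recipe-IQ4_KS.sh
./recipe-IQ4_KS.sh 2>&1 | tee -a logs/v0.1-recipe-IQ4_KS.log
```
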
## Perplexity

Run some benchmarks to compare your various quant recipes.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf

./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
```

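Since only the model path changes between runs, a small wrapper loop (a sketch with hypothetical filenames) makes it easy to collect the numbers for the comparison below:

```bash
# Sketch: loop llama-perplexity over a few quants and keep the logs
mkdir -p logs
for model in /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-{Q8_0,IQ4_KS}.gguf; do
    ./build/bin/llama-perplexity \
        -m "$model" \
        --ctx-size 512 \
        --ubatch-size 512 \
        -f wiki.test.raw \
        -fa \
        -ngl 99 \
        --seed 1337 \
        --threads 1 2>&1 | tee -a "logs/ppl-$(basename "$model" .gguf).log"
done
```
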
* BF16
  - `Final estimate: PPL = 9.0128 +/- 0.07114`
* Q8_0
  - `Final estimate: PPL = 9.0281 +/- 0.07136`
* [ubergarm/IQ4_KS](https://huggingface.co/ubergarm/Qwen3-14B-GGUF#qwen3-14b-iq4_ks)
  - `Final estimate: PPL = 9.0505 +/- 0.07133`
* [unsloth/UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-14B-GGUF?show_file_info=Qwen3-14B-UD-Q4_K_XL.gguf)
  - `Final estimate: PPL = 9.1034 +/- 0.07189`
* [bartowski/Q4_K_L](https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_K_L.gguf)
  - `Final estimate: PPL = 9.1395 +/- 0.07236`

## KL-Divergence

You can run KLD if you want to measure how much smaller quants diverge from the unquantized model's outputs.

I have a custom ~1.6MiB `ubergarm-kld-test-corpus.txt` made from plain-text whisper-large-v3 transcriptions of some recent episodes of the [Buddha at the Gas Pump BATGAP YT Channel](https://www.youtube.com/c/batgap/videos).

#### Pass 1 Generate KLD Baseline File

The output KLD base file can be quite large; in this case it is ~55GiB. If you can't run the BF16, you could use the Q8_0 as your baseline if necessary.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 32 \
    --seed 1337 \
    --threads 16
```

#### Pass 2 Measure KLD

This uses the above KLD base file as the input baseline.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
```

This will report perplexity on this corpus as well as various other statistics.

* BF16
  - `Final estimate: PPL = 14.8587 +/- 0.09987`
* Q8_0
  - `Mean PPL(Q) : 14.846724 ± 0.099745`
  - `Median KLD: 0.000834`
  - `99.0% KLD: 0.004789`
  - `RMS Δp: 0.920 ± 0.006 %`
  - `99.0% Δp: 2.761%`
* [ubergarm/IQ4_KS](https://huggingface.co/ubergarm/Qwen3-14B-GGUF#qwen3-14b-iq4_ks)
  - `Mean PPL(Q) : 14.881428 ± 0.099779`
  - `Median KLD: 0.004756`
  - `99.0% KLD: 0.041509`
  - `RMS Δp: 2.267 ± 0.013 %`
  - `99.0% Δp: 6.493%`
* [unsloth/UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-14B-GGUF?show_file_info=Qwen3-14B-UD-Q4_K_XL.gguf)
  - `Mean PPL(Q) : 14.934694 ± 0.100320`
  - `Median KLD: 0.006275`
  - `99.0% KLD: 0.060005`
  - `RMS Δp: 2.545 ± 0.015 %`
  - `99.0% Δp: 7.203%`
* [bartowski/Q4_K_L](https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_K_L.gguf)
  - `Mean PPL(Q) : 14.922353 ± 0.100054`
  - `Median KLD: 0.006195`
  - `99.0% KLD: 0.063428`
  - `RMS Δp: 2.581 ± 0.015 %`
  - `99.0% Δp: 7.155%`

## Speed Benchmarks

Run some `llama-sweep-bench` to see how fast your quants are over various context lengths.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa \
    -c 32768 \
    -ngl 99 \
    --warmup-batch \
    --threads 1
```

## Vibe Check

Always remember to actually *run* your model to confirm it is working properly and generating valid responses.

```bash
#!/usr/bin/env bash

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf

./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Qwen3-14B-IQ4_KS \
    -fa \
    -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080
```

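With the server up, a quick smoke test against its OpenAI-compatible chat endpoint might look like this (a sketch; the prompt is arbitrary):

```bash
# Sketch: hit the chat completions endpoint of the llama-server started above
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Qwen3-14B-IQ4_KS",
          "messages": [{"role": "user", "content": "Write one sentence about quantization."}]
        }'
```
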
## References

* [ik_llama.cpp old getting started guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [gist with some benchmarking methodology](https://gist.github.com/ubergarm/0f9663fd56fc181a00ec9f634635eb38#methodology)
* [ubergarm/Qwen3-14B-GGUF](https://huggingface.co/ubergarm/Qwen3-14B-GGUF)

---

#### 🗣️ Discussion

👤 **VinnyG9** replied the **2025-05-19** at **14:48:32**:<br>

thanks for this, can you point me where can i read a description of:
-DGGML_RPC=OFF
--seed 1337

> 👤 **ubergarm** replied the **2025-05-19** at **15:07:31**:<br>
> > -DGGML_RPC=OFF
> > --seed 1337
>
> I had turned off building the RPC backend at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run [a client and server(s) distributed across multiple machines or processes](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.
>
> > --seed 1337
>
> I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speak for [leet](https://www.urbandictionary.com/define.php?term=leet).

> 👤 **VinnyG9** replied the **2025-05-21** at **03:42:57**:<br>
> > > -DGGML_RPC=OFF
> > > --seed 1337
> >
> > I had turned off building the RPC backend at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run [a client and server(s) distributed across multiple machines or processes](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.
> >
> > > --seed 1337
> >
> > I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speak for [leet](https://www.urbandictionary.com/define.php?term=leet).
>
> you nerds speak like i know what you're talking about xD
> what is it "seeding"?
> i thought it was a reference to the universe's "fine-structure constant"