mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-02-07 06:50:09 +00:00
### 🗣️ [#434](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) - Quant Cookers Basic Guide
| **Author** | `ubergarm` |
| :--- | :--- |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-21 |

---

#### Description

Quant Cooking Basic Guide
===

Example workflow for cooking custom quants with ik_llama.cpp that I used to generate [ubergarm/Qwen3-14B-GGUF](https://huggingface.co/ubergarm/Qwen3-14B-GGUF).

## Goal

The goal is to provide a specific example of a methodology that can be adapted for future LLMs and quant types in general.

In this guide we will download and quantize the dense model [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) on a gaming rig with a single 3090TI FE 24GB VRAM GPU.

We will use the latest [ik_llama.cpp quants](https://github.com/ikawrakow/ik_llama.cpp/pull/422) to target running this 14B model in GGUF format fully offloaded on <=16GB VRAM systems with 32k context.

This guide does *not* get into more complex topics such as MLA methodology, e.g. converting fp8 to bf16 on older GPU hardware.

## Dependencies

This is all run on a Linux rig, but feel free to use WSL for a similar experience if you're limited to a Windows-based OS.

Install any build essentials, git, etc. We will use `uv` for python virtual environment management to keep everything clean.

```bash
# Setup a folder to do your work and hold the models etc
mkdir /mnt/llms
cd /mnt/llms

# Install uv and python packages
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install huggingface_hub[hf-xet]

# Start downloading the bf16 safetensors from huggingface
mkdir -p Qwen/Qwen3-14B
cd Qwen/Qwen3-14B
huggingface-cli download --local-dir ./ Qwen/Qwen3-14B

# Make a target directory to hold your finished quants for uploading to huggingface
cd /mnt/llms
mkdir -p ubergarm/Qwen3-14B-GGUF # use your name obviously

# Install mainline llama.cpp (or evshiron's fork) just for the python conversion scripts.
cd /mnt/llms
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# Install and build ik_llama.cpp for the heavy lifting and SOTA GGUF quants.
cd /mnt/llms
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# Download your imatrix corpus and the wiki.test.raw test corpus.
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt

wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# Okay, now your folders should look something like this, and you are ready to begin cooking!
cd /mnt/llms
tree

.
├── venv
├── ik_llama.cpp
├── llama.cpp
├── Qwen
│   └── Qwen3-14B
└── ubergarm
    └── Qwen3-14B-GGUF
```

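Before moving on, a quick optional sanity check (not part of the original workflow, just a sketch) to confirm the safetensors download finished and the builds produced the binaries used later in this guide:

```bash
# Optional sanity checks: the safetensors shards should all be present,
# and the binaries used in the rest of this guide should exist.
ls -lh Qwen/Qwen3-14B/*.safetensors
ls llama.cpp/convert_hf_to_gguf.py
ls ik_llama.cpp/build/bin/ | grep -E 'llama-(quantize|imatrix|perplexity|sweep-bench|server)'
```
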
## Convert bf16 safetensors to bf16 gguf

I generally use mainline llama.cpp or evshiron's fork for doing the conversion with the python script.

```bash
# This took less than 12GiB RAM and about 30 seconds
cd /mnt/llms
uv pip install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match

python \
    llama.cpp/convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile ./ubergarm/Qwen3-14B-GGUF/ \
    ./Qwen/Qwen3-14B/

du -hc ./ubergarm/Qwen3-14B-GGUF/*.gguf
28G     ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
```

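If you want to double-check the tensor names and shapes you'll be targeting in the quant recipe below, one option (a sketch, assuming you also install the `gguf` Python package, which provides the `gguf-dump` CLI) is to dump the freshly converted file:

```bash
# Assumption: the gguf python package is installed into the same venv
uv pip install gguf

# Dump metadata plus the tensor name/shape list of the converted GGUF
gguf-dump ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf | less
```

The tensor names printed here are what the regular expressions in the quant recipe below are matched against.
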
## Generate imatrix

Notes:

1. This took just over 5 minutes on my high end gaming rig.
2. If you can't run the bf16 you could make a q8_0 without imatrix and then use that as the "baseline" instead (see the sketch after the command below).
3. I could offload 32 layers naively with `-ngl 32`, but do whatever you need to run inferencing, e.g. `-ngl 99 -ot ...` etc.
4. I don't bother with a fancy calibration corpus nor extra context length, as it isn't clearly proven to always improve results afaict.
5. Assuming you're offloading some to CPU, adjust threads as needed, or set it to exactly 1 if you are fully offloading to VRAM.

```bash
cd ik_llama.cpp
./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    -f calibration_data_v5_rc.txt \
    -o ./Qwen3-14B-BF16-imatrix.dat \
    -ngl 32 \
    --layer-similarity \
    --ctx-size 512 \
    --threads 16

mv ./Qwen3-14B-BF16-imatrix.dat ../ubergarm/Qwen3-14B-GGUF/
```

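Regarding note 2 above: if the bf16 is too heavy for your hardware, a minimal sketch of first cooking a plain Q8_0 (no imatrix required) and then pointing `llama-imatrix` at it instead:

```bash
# Sketch: make a Q8_0 "baseline" quant straight from the bf16, no imatrix needed
./build/bin/llama-quantize \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf \
    Q8_0 \
    16

# Then run the llama-imatrix command above with -m pointing at the Q8_0 instead
```

The same Q8_0 also doubles as a comparison point in the perplexity and KLD measurements further down.
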
## Create Quant Recipe

I personally like to make a bash script for each quant recipe. You can explore different mixes using layer-similarity or [other imatrix statistics tools](https://github.com/ggml-org/llama.cpp/pull/12718). Keep log files around with `./blah 2>&1 | tee -a logs/version-blah.log` (see the example after the recipe below).

I often like to start off with a pure q8_0 for benchmarking and then tweak as desired for target VRAM breakpoints.

```bash
#!/usr/bin/env bash

# token_embd.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}
#
# blk.28.ffn_down.weight, torch.bfloat16 --> BF16, shape = {17408, 5120}
# blk.28.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
# blk.28.ffn_up.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
#
# blk.28.attn_output.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_q.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_k.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
# blk.28.attn_v.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
#
# blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
# blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
#
# output_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# output.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}

custom="
# Attention
blk\.[0-9]\.attn_.*\.weight=iq5_ks
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks

# FFN
blk\.[0-9]\.ffn_down\.weight=iq5_ks
blk\.[1-3][0-9]\.ffn_down\.weight=iq5_ks

blk\.[0-9]\.ffn_(gate|up)\.weight=iq4_ks
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks

# Token embedding/output
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-imatrix.dat \
    --custom-q "$custom" \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf \
    IQ4_KS \
    16
```

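For example, if you save the recipe above as a script, keeping logs around looks something like this (the file and version names are just placeholders, use whatever naming scheme you like):

```bash
# Hypothetical filenames; adjust to your own naming/versioning scheme
mkdir -p logs
chmod +x recipe-IQ4_KS.sh
./recipe-IQ4_KS.sh 2>&1 | tee -a logs/v0.1-recipe-IQ4_KS.log
```
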
## Perplexity

Run some benchmarks to compare your various quant recipes.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf

./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
```

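Since only the model path changes between runs, a small wrapper loop (a sketch with hypothetical filenames) makes it easy to collect the numbers for the comparison below:

```bash
# Sketch: loop llama-perplexity over a few quants and keep the logs
mkdir -p logs
for model in /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-{Q8_0,IQ4_KS}.gguf; do
    ./build/bin/llama-perplexity \
        -m "$model" \
        --ctx-size 512 \
        --ubatch-size 512 \
        -f wiki.test.raw \
        -fa \
        -ngl 99 \
        --seed 1337 \
        --threads 1 2>&1 | tee -a "logs/ppl-$(basename "$model" .gguf).log"
done
```
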
* BF16
  - `Final estimate: PPL = 9.0128 +/- 0.07114`
* Q8_0
  - `Final estimate: PPL = 9.0281 +/- 0.07136`
* [ubergarm/IQ4_KS](https://huggingface.co/ubergarm/Qwen3-14B-GGUF#qwen3-14b-iq4_ks)
  - `Final estimate: PPL = 9.0505 +/- 0.07133`
* [unsloth/UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-14B-GGUF?show_file_info=Qwen3-14B-UD-Q4_K_XL.gguf)
  - `Final estimate: PPL = 9.1034 +/- 0.07189`
* [bartowski/Q4_K_L](https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_K_L.gguf)
  - `Final estimate: PPL = 9.1395 +/- 0.07236`

## KL-Divergence

You can run KLD if you want to measure how much smaller quants diverge from the unquantized model's outputs.

I have a custom ~1.6MiB `ubergarm-kld-test-corpus.txt` made from plain-text whisper-large-v3 transcriptions of some recent episodes of the [Buddha at the Gas Pump BATGAP YT Channel](https://www.youtube.com/c/batgap/videos).

#### Pass 1 Generate KLD Baseline File

The output KLD base file can be quite large; in this case it is ~55GiB. If you can't run the BF16, you could use the Q8_0 as your baseline if necessary.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 32 \
    --seed 1337 \
    --threads 16
```

#### Pass 2 Measure KLD

This uses the above KLD base file as the input baseline.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
```

This will report perplexity on this corpus as well as various other statistics.

* BF16
  - `Final estimate: PPL = 14.8587 +/- 0.09987`
* Q8_0
  - `Mean PPL(Q) : 14.846724 ± 0.099745`
  - `Median KLD: 0.000834`
  - `99.0% KLD: 0.004789`
  - `RMS Δp: 0.920 ± 0.006 %`
  - `99.0% Δp: 2.761%`
* [ubergarm/IQ4_KS](https://huggingface.co/ubergarm/Qwen3-14B-GGUF#qwen3-14b-iq4_ks)
  - `Mean PPL(Q) : 14.881428 ± 0.099779`
  - `Median KLD: 0.004756`
  - `99.0% KLD: 0.041509`
  - `RMS Δp: 2.267 ± 0.013 %`
  - `99.0% Δp: 6.493%`
* [unsloth/UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-14B-GGUF?show_file_info=Qwen3-14B-UD-Q4_K_XL.gguf)
  - `Mean PPL(Q) : 14.934694 ± 0.100320`
  - `Median KLD: 0.006275`
  - `99.0% KLD: 0.060005`
  - `RMS Δp: 2.545 ± 0.015 %`
  - `99.0% Δp: 7.203%`
* [bartowski/Q4_K_L](https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_K_L.gguf)
  - `Mean PPL(Q) : 14.922353 ± 0.100054`
  - `Median KLD: 0.006195`
  - `99.0% KLD: 0.063428`
  - `RMS Δp: 2.581 ± 0.015 %`
  - `99.0% Δp: 7.155%`

## Speed Benchmarks

Run some `llama-sweep-bench` to see how fast your quants are over various context lengths.

```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa \
    -c 32768 \
    -ngl 99 \
    --warmup-batch \
    --threads 1
```

## Vibe Check

Always remember to actually *run* your model to confirm it is working properly and generating valid responses.

```bash
#!/usr/bin/env bash

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf

./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Qwen3-14B-IQ4_KS \
    -fa \
    -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080
```

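With the server up, a quick smoke test against its OpenAI-compatible chat endpoint might look like this (a sketch; the prompt is arbitrary):

```bash
# Sketch: hit the chat completions endpoint of the llama-server started above
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Qwen3-14B-IQ4_KS",
          "messages": [{"role": "user", "content": "Write one sentence about quantization."}]
        }'
```
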
## References

* [ik_llama.cpp old getting started guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [gist with some benchmarking methodology](https://gist.github.com/ubergarm/0f9663fd56fc181a00ec9f634635eb38#methodology)
* [ubergarm/Qwen3-14B-GGUF](https://huggingface.co/ubergarm/Qwen3-14B-GGUF)

---

#### 🗣️ Discussion

👤 **VinnyG9** replied the **2025-05-19** at **14:48:32**:<br>

thanks for this, can you point me where can i read a description of:
-DGGML_RPC=OFF
--seed 1337

> 👤 **ubergarm** replied the **2025-05-19** at **15:07:31**:<br>
> > -DGGML_RPC=OFF
> > --seed 1337
>
> I had turned off building the RPC backend at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run [a client and server(s) distributed across multiple machines or processes](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.
>
> > --seed 1337
>
> I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speak for [leet](https://www.urbandictionary.com/define.php?term=leet).

> 👤 **VinnyG9** replied the **2025-05-21** at **03:42:57**:<br>
> > > -DGGML_RPC=OFF
> > > --seed 1337
> >
> > I had turned off building the RPC backend at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run [a client and server(s) distributed across multiple machines or processes](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.
> >
> > > --seed 1337
> >
> > I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speak for [leet](https://www.urbandictionary.com/define.php?term=leet).
>
> you nerds speak like i know what you're talking about xD
> what is it "seeding"?
> i thought it was a reference to the universe's "fine-structure constant"