🗣️ #434 - Quant Cookers Basic Guide
| Author | ubergarm |
|---|---|
| Created | 2025-05-18 |
| Updated | 2025-05-21 |
Description
Quant Cooking Basic Guide
Example workflow for cooking custom quants with ik_llama.cpp that I used to generate ubergarm/Qwen3-14B-GGUF.
Goal
The goal is to provide a specific example of methodology that can be adapted for future LLMs and quant types in general.
In this guide we will download and quantize the dense model Qwen/Qwen3-14B on a gaming rig with a single 3090TI FE 24GB VRAM GPU.
We will use the latest ik_llama.cpp quants to target running this 14B model in GGUF format fully offloaded on <=16GB VRAM systems with 32k context.
This guide does not get into more complex topics like MLA methodology, e.g. converting fp8 to bf16 on older GPU hardware.
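As a rough sanity check on that VRAM target, here is a back-of-envelope sketch. The parameter count (~14.8B), the ~4.5 bits-per-weight average for a 4-bit-ish mix, and the f16 KV cache are my own rough assumptions; the 40 blocks and the K/V projection width of 1024 match the tensor shapes shown later in the recipe.

```bash
# Back-of-envelope VRAM budget for the <=16GB / 32k-context target (rough assumptions,
# not measured values): ~14.8B params at ~4.5 bits per weight, f16 KV cache,
# 40 blocks, K/V projection width 1024.
awk 'BEGIN {
  weights = 14.8e9 * 4.5 / 8 / 2^30            # ~7.8 GiB of quantized weights
  kv      = 2 * 40 * 1024 * 2 * 32768 / 2^30   # ~5.0 GiB of f16 KV cache at 32k context
  printf "weights ~%.1f GiB + KV ~%.1f GiB = ~%.1f GiB (plus compute buffers)\n",
         weights, kv, weights + kv
}'
```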
Dependencies
This is all run on a Linux rig, but feel free to use WSL for a similar experience if you're limited to a Windows-based OS.
Install any build essentials, git, etc. We will use uv for python virtual environment management to keep everything clean.
# Setup folder to do your work and hold the models etc
mkdir /mnt/llms
cd /mnt/llms
# Install uv and python packages
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install huggingface_hub[hf-xet]
# Start downloading the bf16 safetensors from huggingface
mkdir -p Qwen/Qwen3-14B
cd Qwen/Qwen3-14B
huggingface-cli download --local-dir ./ Qwen/Qwen3-14B
# Make a target directory to hold your finished quants for uploading to huggingface
mkdir -p ubergarm/Qwen3-14B-GGUF # use your name obviously
# Install mainline llama.cpp or evshiron's fork, just for the python scripts.
cd /mnt/llms
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# Install and build ik_llama.cpp for the heavy lifting and SOTA GGUF quants.
cd /mnt/llms
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
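Once both builds finish, a quick optional smoke test that the ik_llama.cpp binaries used throughout the rest of this guide are actually present:

```bash
# Still inside ik_llama.cpp: confirm the tools used below were built
ls build/bin/ | grep -E 'llama-(imatrix|quantize|perplexity|sweep-bench|server)$'
```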
# Download your imatrix corpus and wiki.test.raw test corpus.
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt
wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz
# Okay, now your folders should look something like this, and you are ready to begin cooking!
cd /mnt/llms
tree
.
├── venv
├── ik_llama.cpp
├── llama.cpp
├── Qwen
│ └── Qwen3-14B
└── ubergarm
└── Qwen3-14B-GGUF
Convert bf16 safetensors to bf16 gguf
I generally use mainline llama.cpp or evshiron's fork for doing the conversion with the python script.
# This took less than 12GiB RAM and about 30 seconds
cd /mnt/llms
uv pip install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match
python \
llama.cpp/convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile ./ubergarm/Qwen3-14B-GGUF/ \
./Qwen/Qwen3-14B/
du -hc ./ubergarm/Qwen3-14B-GGUF/*.gguf
28G ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
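The tensor names, types, and shapes that appear as comments in the recipe script further down come straight out of the conversion log; if you didn't keep it, you can dump the same information from the finished GGUF. A minimal sketch, assuming the gguf-dump tool that ships with the gguf python package:

```bash
# Optional: inspect tensor names/types/shapes in the converted GGUF
# (gguf-dump comes from the gguf python package -- install it into the same venv)
uv pip install gguf
gguf-dump ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf | less
```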
Generate imatrix
Notes:
- This took just over 5 minutes on my high end gaming rig.
- If you can't run the bf16, you could make a q8_0 without imatrix and then use that as the "baseline" instead.
- I could offload 32 layers naively with `-ngl 32`, but do whatever you need to run inferencing, e.g. `-ngl 99 -ot ...` etc.
- I don't bother with a fancy calibration corpus nor extra context length, as it isn't clearly proven to always improve results afaict.
- Assuming you're offloading some to CPU, adjust threads as needed or set to exactly 1 if you are fully offloading to VRAM.
cd ik_llama.cpp
./build/bin/llama-imatrix \
--verbosity 1 \
-m /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
-f calibration_data_v5_rc.txt \
-o ./Qwen3-14B-BF16-imatrix.dat \
-ngl 32 \
--layer-similarity \
--ctx-size 512 \
--threads 16
mv ./Qwen3-14B-BF16-imatrix.dat ../ubergarm/Qwen3-14B-GGUF/
Create Quant Recipe
I personally like to make a bash script for each quant recipe. You can explore different mixes using layer-similarity or other imatrix statistics tools. Keep log files around with ./blah 2>&1 | tee -a logs/version-blah.log.
I often like to start off with a pure q8_0 for benchmarking (a sketch of that follows the recipe below) and then tweak as desired for target VRAM breakpoints.
#!/usr/bin/env bash
# token_embd.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}
#
# blk.28.ffn_down.weight, torch.bfloat16 --> BF16, shape = {17408, 5120}
# blk.28.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
# blk.28.ffn_up.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
#
# blk.28.attn_output.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_q.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_k.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
# blk.28.attn_v.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
#
# blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
# blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
#
# output_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# output.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}
custom="
# Attention
blk\.[0-9]\.attn_.*\.weight=iq5_ks
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks
# FFN
blk\.[0-9]\.ffn_down\.weight=iq5_ks
blk\.[1-3][0-9]\.ffn_down\.weight=iq5_ks
blk\.[0-9]\.ffn_(gate|up)\.weight=iq4_ks
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks
# Token embedding/output
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--imatrix /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-imatrix.dat \
--custom-q "$custom" \
/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf \
IQ4_KS \
16
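The Q8_0 used as a baseline in the benchmarks below is not shown above; a minimal sketch to produce it (a pure Q8_0 needs no imatrix or custom recipe, so the plain positional arguments are enough):

```bash
# Pure Q8_0 baseline for benchmarking (no --imatrix or --custom-q needed)
./build/bin/llama-quantize \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf \
    Q8_0 \
    16
```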
Perplexity
Run some benchmarks to compare your various quant recipes.
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
- BF16: `Final estimate: PPL = 9.0128 +/- 0.07114`
- Q8_0: `Final estimate: PPL = 9.0281 +/- 0.07136`
- ubergarm/IQ4_KS: `Final estimate: PPL = 9.0505 +/- 0.07133`
- unsloth/UD-Q4_K_XL: `Final estimate: PPL = 9.1034 +/- 0.07189`
- bartowski/Q4_K_L: `Final estimate: PPL = 9.1395 +/- 0.07236`
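To sweep the same test over several quants without re-typing the command (and to keep logs around, per the tee habit mentioned earlier), a minimal loop sketch; the file names are assumed to follow this guide's naming:

```bash
# Run the identical perplexity test over each quant and log the output
mkdir -p logs
for model in /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-{Q8_0,IQ4_KS}.gguf; do
    name=$(basename "$model" .gguf)
    ./build/bin/llama-perplexity \
        -m "$model" \
        --ctx-size 512 \
        --ubatch-size 512 \
        -f wiki.test.raw \
        -fa \
        -ngl 99 \
        --seed 1337 \
        --threads 1 2>&1 | tee -a "logs/ppl-$name.log"
done
```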
KL-Divergence
You can run KLD if you want to measure how much smaller quants diverge from the unquantized model's outputs.
I have a custom ~1.6MiB ubergarm-kld-test-corpus.txt made from whisper-large-v3 transcriptions, in plain text format, of some recent episodes of the Buddha at the Gas Pump (BATGAP) YT channel.
Pass 1 Generate KLD Baseline File
The output kld base file can be quite large; in this case it is ~55GiB. If you can't run BF16, you could use Q8_0 as your baseline if necessary.
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 32 \
--seed 1337 \
--threads 16
Pass 2 Measure KLD
This uses the above kld base file as input baseline.
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
--kl-divergence \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
This will report Perplexity on this corpus as well as various other statistics.
| Model | Mean PPL(Q) | Median KLD | 99.0% KLD | RMS Δp | 99.0% Δp |
|---|---|---|---|---|---|
| BF16 (baseline) | 14.8587 ± 0.09987 | - | - | - | - |
| Q8_0 | 14.846724 ± 0.099745 | 0.000834 | 0.004789 | 0.920 ± 0.006 % | 2.761% |
| ubergarm/IQ4_KS | 14.881428 ± 0.099779 | 0.004756 | 0.041509 | 2.267 ± 0.013 % | 6.493% |
| unsloth/UD-Q4_K_XL | 14.934694 ± 0.100320 | 0.006275 | 0.060005 | 2.545 ± 0.015 % | 7.203% |
| bartowski/Q4_K_L | 14.922353 ± 0.100054 | 0.006195 | 0.063428 | 2.581 ± 0.015 % | 7.155% |
Speed Benchmarks
Run some llama-sweep-bench to see how fast your quants are over various context lengths.
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-c 32768 \
-ngl 99 \
--warmup-batch \
--threads 1
Vibe Check
Always remember to actually run your model to confirm it is working properly and generating valid responses.
#!/usr/bin/env bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Qwen3-14B-IQ4_KS \
-fa \
-ctk f16 -ctv f16 \
-c 32768 \
-ngl 99 \
--threads 1 \
--host 127.0.0.1 \
--port 8080
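Once the server is up, a quick sanity query; this assumes the OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes, and the prompt is just an arbitrary example:

```bash
# Quick vibe check against the running server
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Qwen3-14B-IQ4_KS",
          "messages": [{"role": "user", "content": "Briefly introduce yourself and count to five."}],
          "temperature": 0.6
        }'
```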
References
- ik_llama.cpp old getting started guide
- gist with some benchmarking methodology
- ubergarm/Qwen3-14B-GGUF
🗣️ Discussion
👤 VinnyG9 replied the 2025-05-19 at 14:48:32:
thanks for this, can you point me to where I can read a description of: -DGGML_RPC=OFF --seed 1337
👤 ubergarm replied the 2025-05-19 at 15:07:31:
> -DGGML_RPC=OFF --seed 1337

I had turned off the RPC backend building at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run a client and server(s) distributed across multiple machines or processes for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.

> --seed 1337

I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. 1337 is leet speak for leet.
👤 VinnyG9 replied the 2025-05-21 at 03:42:57:
> > -DGGML_RPC=OFF --seed 1337
>
> I had turned off the RPC backend building at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run a client and server(s) distributed across multiple machines or processes for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.
>
> > --seed 1337
>
> I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. 1337 is leet speak for leet.
you nerds speak like i know what you're talking about xD what is it "seeding"? i thought it was a reference to the universe's "fine-structure constant"