🗣️ #434 - Quant Cookers Basic Guide

Author ubergarm
Created 2025-05-18
Updated 2025-05-21

Description

Quant Cooking Basic Guide

Example workflow for cooking custom quants with ik_llama.cpp that I used to generate ubergarm/Qwen3-14B-GGUF.

Goal

The goal is to provide a specific example of methodology that can be adapted for future LLMs and quant types in general.

In this guide we will download and quantize the dense model Qwen/Qwen3-14B on a gaming rig with a single 3090TI FE 24GB VRAM GPU.

We will use the latest ik_llama.cpp quants to target running this 14B model in GGUF format fully offloaded on <=16GB VRAM systems with 32k context.

This guide does not get into more complex things like MLA methodology e.g. converting fp8 to bf16 on older GPU hardware.

Dependencies

This is all run on a Linux rig, but feel free to use WSL for a similar experience if you're limited to a Windows-based OS.

Install build essentials, git, etc. We will use uv for Python virtual environment management to keep everything clean.

# Setup folder to do your work and hold the models etc
mkdir /mnt/llms
cd /mnt/llms

# Install uv and python packages
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install huggingface_hub[hf-xet]

# Start downloading the bf16 safetensors from huggingface
mkdir -p Qwen/Qwen3-14B
cd Qwen/Qwen3-14B
huggingface-cli download --local-dir ./ Qwen/Qwen3-14B

# Make a target directory to hold your finished quants for uploading to huggingface
cd /mnt/llms
mkdir -p ubergarm/Qwen3-14B-GGUF # use your name obviously

# Install mainline or evshiron llama.cpp forks just for the python scripts.
cd /mnt/llms
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# Install and build ik_llama.cpp for the heavy lifting and SOTA GGUF quants.
cd /mnt/llms
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# Download your imatrix corpus and wiki.test.raw test corpus.
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt

wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz

# Okay, now your folders should look something like this, and you are ready to begin cooking!
cd /mnt/llms
tree

.
├── venv
├── ik_llama.cpp
├── llama.cpp
├── Qwen
│  └── Qwen3-14B
└── ubergarm
   └── Qwen3-14B-GGUF

Convert bf16 safetensors to bf16 gguf

I generally use mainline llama.cpp or evshiron's fork for the conversion, as they provide the convert_hf_to_gguf.py Python script.

# This took less than 12GiB RAM and about 30 seconds
cd /mnt/llms
uv pip install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match

python \
    llama.cpp/convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile ./ubergarm/Qwen3-14B-GGUF/ \
    ./Qwen/Qwen3-14B/

du -hc ./ubergarm/Qwen3-14B-GGUF/*.gguf
28G ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf

Generate imatrix

Notes:

  1. This took just over 5 minutes on my high-end gaming rig.
  2. If you can't run the bf16, you could make a q8_0 without an imatrix and then use that as the "baseline" instead.
  3. I could naively offload 32 layers with -ngl 32, but do whatever you need to run inference, e.g. -ngl 99, -ot ..., etc.
  4. I don't bother with a fancy calibration corpus or extra context length, as it isn't clearly proven to always improve results as far as I can tell.
  5. Assuming you're offloading some layers to CPU, adjust threads as needed, or set it to exactly 1 if you are fully offloading to VRAM.

cd ik_llama.cpp
./build/bin/llama-imatrix \
    --verbosity 1 \
    -m /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    -f calibration_data_v5_rc.txt \
    -o ./Qwen3-14B-BF16-imatrix.dat \
    -ngl 32 \
    --layer-similarity \
    --ctx-size 512 \
    --threads 16

mv ./Qwen3-14B-BF16-imatrix.dat ../ubergarm/Qwen3-14B-GGUF/

Create Quant Recipe

I personally like to make a bash script for each quant recipe. You can explore different mixes using layer-similarity or other imatrix statistics tools. Keep log files around with ./blah 2>&1 | tee -a logs/version-blah.log.

I often like to start off with a pure q8_0 for benchmarking and then tweak as desired for target VRAM breakpoints.
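
As a baseline, here is a minimal sketch of cooking that pure Q8_0 (assuming the same paths as above; a plain Q8_0 doesn't need the --custom-q recipe, and this is the Qwen3-14B-Q8_0.gguf referenced in the Perplexity section below):

# Hedged sketch: plain Q8_0 straight from the bf16 GGUF, no custom recipe
./build/bin/llama-quantize \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf \
    Q8_0 \
    16

If you can't run the bf16 for the imatrix pass (see note 2 above), this same Q8_0 can also be fed to llama-imatrix in place of the BF16 GGUF.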

#!/usr/bin/env bash

# token_embd.weight,         torch.bfloat16 --> BF16, shape = {5120, 151936}
#
# blk.28.ffn_down.weight,    torch.bfloat16 --> BF16, shape = {17408, 5120}
# blk.28.ffn_gate.weight,    torch.bfloat16 --> BF16, shape = {5120, 17408}
# blk.28.ffn_up.weight,      torch.bfloat16 --> BF16, shape = {5120, 17408}
#
# blk.28.attn_output.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_q.weight,      torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_k.weight,      torch.bfloat16 --> BF16, shape = {5120, 1024}
# blk.28.attn_v.weight,      torch.bfloat16 --> BF16, shape = {5120, 1024}
#
# blk.28.attn_norm.weight,   torch.bfloat16 --> F32, shape = {5120}
# blk.28.ffn_norm.weight,    torch.bfloat16 --> F32, shape = {5120}
# blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
# blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
#
# output_norm.weight,        torch.bfloat16 --> F32, shape = {5120}
# output.weight,             torch.bfloat16 --> BF16, shape = {5120, 151936}

custom="
# Attention
blk\.[0-9]\.attn_.*\.weight=iq5_ks
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks

# FFN
blk\.[0-9]\.ffn_down\.weight=iq5_ks
blk\.[1-3][0-9]\.ffn_down\.weight=iq5_ks

blk\.[0-9]\.ffn_(gate|up)\.weight=iq4_ks
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks

# Token embedding/output
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

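# Strip the comment lines and collapse the remaining newline-separated rules
# into the single comma-separated string that --custom-q expects.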
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-imatrix.dat \
    --custom-q "$custom" \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf \
    IQ4_KS \
    16
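
To actually run a recipe script while keeping logs around as mentioned above, something like this works (the script and log file names here are just hypothetical examples):

mkdir -p logs
chmod +x recipe-iq4_ks.sh
./recipe-iq4_ks.sh 2>&1 | tee -a logs/v0.1-recipe-iq4_ks.log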

Perplexity

Run some benchmarks to compare your various quant recipes.

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf

./build/bin/llama-perplexity \
    -m "$model" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1
  • BF16
    • Final estimate: PPL = 9.0128 +/- 0.07114
  • Q8_0
    • Final estimate: PPL = 9.0281 +/- 0.07136
  • ubergarm/IQ4_KS
    • Final estimate: PPL = 9.0505 +/- 0.07133
  • unsloth/UD-Q4_K_XL
    • Final estimate: PPL = 9.1034 +/- 0.07189
  • bartowski/Q4_K_L
    • Final estimate: PPL = 9.1395 +/- 0.07236

KL-Divergence

You can run KLD if you want to measure how much smaller quants diverge from the unquantized model's outputs.

I have a custom ~1.6MiB ubergarm-kld-test-corpus.txt made from whisper-large-v3 transcriptions, in plain text format, of some recent episodes of the Buddha at the Gas Pump (BATGAP) YouTube channel.

Pass 1 Generate KLD Baseline File

The output KLD base file can be quite large; in this case it is ~55GiB. If you can't run the BF16, you could use the Q8_0 as your baseline instead.

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 32 \
    --seed 1337 \
    --threads 16

Pass 2 Measure KLD

This uses the above kld base file as input baseline.

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
    -m "$model" \
    --kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
    --kl-divergence \
    -f ubergarm-kld-test-corpus.txt \
    -fa \
    -ngl 99 \
    --seed 1337 \
    --threads 1

This will report Perplexity on this corpus as well as various other statistics.

  • BF16
    • Final estimate: PPL = 14.8587 +/- 0.09987
  • Q8_0
    • Mean PPL(Q) : 14.846724 ± 0.099745
    • Median KLD: 0.000834
    • 99.0% KLD: 0.004789
    • RMS Δp: 0.920 ± 0.006 %
    • 99.0% Δp: 2.761%
  • ubergarm/IQ4_KS
    • Mean PPL(Q) : 14.881428 ± 0.099779
    • Median KLD: 0.004756
    • 99.0% KLD: 0.041509
    • RMS Δp: 2.267 ± 0.013 %
    • 99.0% Δp: 6.493%
  • unsloth/UD-Q4_K_XL
    • Mean PPL(Q) : 14.934694 ± 0.100320
    • Median KLD: 0.006275
    • 99.0% KLD: 0.060005
    • RMS Δp: 2.545 ± 0.015 %
    • 99.0% Δp: 7.203%
  • bartowski/Q4_K_L
    • Mean PPL(Q) : 14.922353 ± 0.100054
    • Median KLD: 0.006195
    • 99.0% KLD: 0.063428
    • RMS Δp: 2.581 ± 0.015 %
    • 99.0% Δp: 7.155%

Speed Benchmarks

Run some llama-sweep-bench to see how fast your quants are over various context lengths.

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa \
  -c 32768 \
  -ngl 99 \
  --warmup-batch \
  --threads 1

[Image: sweep-bench-qwen3-14b-gguf-more-q4]

Vibe Check

Always remember to actually run your model to confirm it is working properly and generating valid responses.

#!/usr/bin/env bash

model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf

./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Qwen3-14B-IQ4_KS \
    -fa \
    -ctk f16 -ctv f16 \
    -c 32768 \
    -ngl 99 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080
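
With the server up, a quick way to vibe check generations is to hit the OpenAI-compatible chat completions endpoint (a minimal sketch; the prompt is just an example, and any OpenAI-style client pointed at the same host and port should work too):

curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Qwen3-14B-IQ4_KS",
          "messages": [{"role": "user", "content": "Tell me a fun fact about llamas."}],
          "max_tokens": 256
        }'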


🗣️ Discussion

👤 VinnyG9 replied on 2025-05-19 at 14:48:32:

thanks for this, can you point me to where I can read a description of: -DGGML_RPC=OFF --seed 1337

👤 ubergarm replied on 2025-05-19 at 15:07:31:

-DGGML_RPC=OFF --seed 1337

I had turned off building the RPC backend at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, RPC ("remote procedure call") allows you to run a client and server(s) distributed across multiple machines or processes for distributed inference. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.

--seed 1337

I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. 1337 is leet speak for "leet".

👤 VinnyG9 replied on 2025-05-21 at 03:42:57:

you nerds speak like i know what you're talking about xD what is it "seeding"? i thought it was a reference to the universe's "fine-structure constant"