mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-12 23:10:01 +00:00

Andrew Chan 25d34e3d2f Trellis quants with CPU inference (#441 )

* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
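
The apples-to-oranges issue above comes down to which error weights enter the rmse. A minimal sketch of the two metrics follows. It assumes sigma^2 is the mean square of the weights being quantized and that the weighted error is normalized by the sum of the weights; the commit message states neither, and the function names are hypothetical.

```python
import math

def importance_weights(x):
    """Per-element weights sigma^2/4 + x_i^2 (sigma^2 assumed = mean of x_i^2)."""
    sigma2 = sum(v * v for v in x) / len(x)
    return [sigma2 / 4 + v * v for v in x]

def weighted_rmse(x, q, w=None):
    """sqrt(sum w_i (x_i - q_i)^2 / sum w_i); w=None means weight = 1."""
    if w is None:
        w = [1.0] * len(x)
    num = sum(wi * (xi - qi) ** 2 for wi, xi, qi in zip(w, x, q))
    return math.sqrt(num / sum(w))

x = [0.1, -0.2, 1.5, -0.05]   # original weights (made-up values)
q = [0.0, -0.25, 1.0, 0.0]    # some quantized approximation (made-up values)
plain = weighted_rmse(x, q)                            # weight = 1
weighted = weighted_rmse(x, q, importance_weights(x))  # sigma^2/4 + x^2
```

With these made-up values the largest error sits on the largest weight, so the weighted metric is noticeably higher than the plain one. This is why two quantizers can only be ranked when both are scored with the same weighting.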

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.
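
The clustering trick described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: the centroids are simply a random subset of the codebook, the codebook is 4096 random 8-dimensional points rather than the 65536 points implied by 16 bits, and all names are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, target):
    """Index of the point closest to target (brute force)."""
    return min(range(len(points)), key=lambda i: dist2(points[i], target))

def build_clusters(codebook, centroids):
    """Assign every codebook index to its nearest centroid."""
    clusters = [[] for _ in centroids]
    for idx, p in enumerate(codebook):
        clusters[nearest(centroids, p)].append(idx)
    return clusters

def clustered_search(codebook, centroids, clusters, target):
    """Approximate nearest neighbor: pick the nearest cluster, search only inside it."""
    members = clusters[nearest(centroids, target)]
    return min(members, key=lambda i: dist2(codebook[i], target))

rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4096)]
centroids = rng.sample(codebook, 64)   # assumption: centroids drawn from the codebook
clusters = build_clusters(codebook, centroids)
target = [rng.gauss(0, 1) for _ in range(8)]
best = clustered_search(codebook, centroids, clusters, target)
```

Each query costs one pass over the centroids plus one pass over a single cluster instead of a pass over the whole codebook. The result is approximate: the true nearest point may sit in a neighboring cluster.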

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
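
The bpw figures quoted above can be checked with a few lines of arithmetic. The sketch assumes one 4-bit scale per block of 32 weights (4-bit block scales are mentioned earlier in this commit message; that they apply to the blocks of 32 is inferred because it reproduces the quoted differences) and ignores any per-row scale. The helper name is hypothetical.

```python
def bits_per_weight(bits_per_group, group_size, scale_bits, block_size):
    """Index bits per weight plus the block-scale overhead per weight."""
    return bits_per_group / group_size + scale_bits / block_size

IQ2_XXS_BPW = 2.0625                   # figure quoted in the commit message

bpw16 = bits_per_weight(16, 8, 4, 32)  # 16 bits per group of 8
bpw15 = bits_per_weight(15, 8, 4, 32)  # 15 bits per group of 8

diff16 = bpw16 - IQ2_XXS_BPW           # positive: larger than iq2_xxs
diff15 = bpw15 - IQ2_XXS_BPW           # negative: smaller than iq2_xxs
```

Under these assumptions the 16-bit variant comes out at 2.125 bpw and the 15-bit variant at 2.0 bpw, that is 0.0625 bpw above and below iq2_xxs respectively.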

* iq2_kt - even better

Re-quantize after determining block scales
(at the expense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S when forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around 180 t/s on their GPU, so almost matching
their performance.

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
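
The layout change described above can be sketched as follows: instead of following a list of indices into the full codebook, each cluster's points are copied once into a flat buffer and then scanned front to back. This is only an illustration of the data layout with hypothetical names. Python will not show the cache effect that the native code benefits from.

```python
from array import array

def flatten_clusters(codebook, clusters):
    """Copy each cluster's points into one contiguous float buffer.

    Returns (data, offsets, ids): cluster c owns points offsets[c]..offsets[c+1]-1,
    and ids[k] is the original codebook index of the k-th stored point.
    """
    data, ids, offsets = array("f"), [], [0]
    for members in clusters:
        for idx in members:
            data.extend(codebook[idx])
            ids.append(idx)
        offsets.append(len(ids))
    return data, offsets, ids

def search_cluster(data, offsets, ids, dim, c, target):
    """Sequential scan over the contiguous points of cluster c."""
    best_id, best_d = -1, float("inf")
    for k in range(offsets[c], offsets[c + 1]):
        base = k * dim
        d = 0.0
        for j in range(dim):
            diff = data[base + j] - target[j]
            d += diff * diff
        if d < best_d:
            best_id, best_d = ids[k], d
    return best_id

codebook = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.0]]
clusters = [[0, 2], [1, 3]]            # codebook indices per cluster
data, offsets, ids = flatten_clusters(codebook, clusters)
found = search_cluster(data, offsets, ids, 2, 1, [0.92, 1.0])
```

The copy is paid once during initialization, while the sequential scan runs for every group of weights being quantized.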

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
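
The 4.0 bpw figure follows from the bit budget: a block of 32 weights holds eight groups of 4, so 8 x 15 + 8 = 128 bits, or exactly 16 bytes. The sketch below packs such a block to make the accounting concrete. The bit layout (scale in the low 8 bits, indices above it) is purely hypothetical and is not claimed to be the layout used by iq4_kt.

```python
GROUP_BITS, GROUPS_PER_BLOCK, SCALE_BITS, BLOCK_SIZE = 15, 8, 8, 32

def pack_block(scale, indices):
    """Pack an 8-bit scale and eight 15-bit indices into 16 bytes (hypothetical layout)."""
    assert 0 <= scale < (1 << SCALE_BITS) and len(indices) == GROUPS_PER_BLOCK
    word = scale
    for k, idx in enumerate(indices):
        assert 0 <= idx < (1 << GROUP_BITS)
        word |= idx << (SCALE_BITS + k * GROUP_BITS)
    return word.to_bytes(16, "little")

def unpack_block(raw):
    """Inverse of pack_block."""
    word = int.from_bytes(raw, "little")
    scale = word & ((1 << SCALE_BITS) - 1)
    mask = (1 << GROUP_BITS) - 1
    indices = [(word >> (SCALE_BITS + k * GROUP_BITS)) & mask
               for k in range(GROUPS_PER_BLOCK)]
    return scale, indices

total_bits = SCALE_BITS + GROUPS_PER_BLOCK * GROUP_BITS  # bits per block of 32
bpw = total_bits / BLOCK_SIZE                            # bits per weight
raw = pack_block(200, [0, 1, 32767, 12345, 7, 16384, 999, 2])
```

Any per-row scale would add a small amount on top of this, depending on the row length.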

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attempt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
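
One way to see why a generator of this kind cannot be exactly Gaussian is to look at its support. The sketch below follows the 3INST recipe as I understand it from the QTIP paper: a linear congruential step, a mask and XOR, then the sum of the two fp16 halves. The constants are recalled from that paper and have not been checked against this commit, so treat them, the seeding scheme, and the function names as assumptions.

```python
import struct

KA, KB = 89226354, 64248484   # LCG multiplier and increment (assumed)
KMASK = 0x8FFF8FFF            # keep sign bit and low 12 bits of each fp16 half (assumed)
KM32 = 0x3B603B60             # fp16 bit pattern of about 0.922, twice (assumed)

def half_to_float(bits16):
    """Reinterpret a 16-bit integer as an IEEE fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits16))[0]

def inst3(state):
    """One generator step: returns (new_state, value)."""
    x = (KA * state + KB) & 0xFFFFFFFF
    s = (x & KMASK) ^ KM32
    return x, half_to_float(s & 0xFFFF) + half_to_float(s >> 16)

samples = []
for seed in range(4096):      # a few seeds, 8 values each (one group of 8)
    state = seed
    for _ in range(8):
        state, v = inst3(state)
        samples.append(v)

largest = max(abs(v) for v in samples)
```

With this mask and XOR pattern each fp16 half has a magnitude between 0.125 and 2, so every generated value lies strictly inside (-4, 4). A true Gaussian has unbounded tails, so the output can at best approximate one.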

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slightly faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-05-23 09:17:52 +03:00

.devops

Merge mainline - Aug 12 2024 (#17 )

2024-08-12 15:14:32 +02:00

.github

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

cmake

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

common

Add batch warmup to sweep-bench (#375 )

2025-05-12 07:50:26 +03:00

docs

Merge mainline - Aug 12 2024 (#17 )

2024-08-12 15:14:32 +02:00

examples

Trellis quants with CPU inference (#441 )

2025-05-23 09:17:52 +03:00

ggml

Trellis quants with CPU inference (#441 )

2025-05-23 09:17:52 +03:00

gguf-py

Fix missing rope_freqs with convert_hf_to_gguf (#402 )

2025-05-09 09:17:41 -05:00

grammars

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

include

Trellis quants with CPU inference (#441 )

2025-05-23 09:17:52 +03:00

media

README: add graphic for matrix multiplication (#6881 )

2024-04-24 21:29:13 +02:00

models

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

pocs

build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 )

2024-06-13 00:41:52 +01:00

prompts

llama : add Qwen support (#4281 )

2023-12-01 20:16:31 +02:00

requirements

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

scripts

Merge mainline - Aug 12 2024 (#17 )

2024-08-12 15:14:32 +02:00

spm-headers

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

src

Trellis quants with CPU inference (#441 )

2025-05-23 09:17:52 +03:00

tests

Faster Gemma2 (#27 )

2024-08-27 17:40:59 +03:00

.dockerignore

build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 )

2024-06-13 00:41:52 +01:00

.ecrc

Nomic Vulkan backend (#4456 )

2024-01-29 15:50:50 -05:00

.editorconfig

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

.flake8

py : logging and flake8 suppression refactoring (#7081 )

2024-05-05 08:07:48 +03:00

.gitignore

Merge mainline - Aug 12 2024 (#17 )

2024-08-12 15:14:32 +02:00

.gitmodules

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

.pre-commit-config.yaml

convert.py : add python logging instead of print() (#6511 )

2024-05-03 22:36:41 +03:00

AUTHORS

Update AUTHORS

2025-04-29 07:22:06 +02:00

CMakeLists.txt

fix some MSVC build problem. (#392 )

2025-05-07 17:04:39 +03:00

CMakePresets.json

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

CONTRIBUTING.md

Merge mainline - Aug 12 2024 (#17 )

2024-08-12 15:14:32 +02:00

convert_hf_to_gguf_update.py

Deepseek V3 support added (#176 )

2025-01-23 18:24:10 +02:00

convert_hf_to_gguf.py

Fix missing rope_freqs with convert_hf_to_gguf (#402 )

2025-05-09 09:17:41 -05:00

convert_llama_ggml_to_gguf.py

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

convert_lora_to_gguf.py

Fix missing rope_freqs with convert_hf_to_gguf (#402 )

2025-05-09 09:17:41 -05:00

flake.lock

Merge mainline - Aug 12 2024 (#17 )

2024-08-12 15:14:32 +02:00

flake.nix

build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809 )

2024-06-13 00:41:52 +01:00

LICENSE

Use links for ggml/llama.cpp authors (#318 )

2025-04-07 17:25:06 +02:00

Makefile

Enable q6_0 for flash attention (#101 )

2024-10-22 11:34:49 +02:00

mypy.ini

convert : partially revert PR #4818 (#5041 )

2024-01-20 18:14:18 -05:00

Package.swift

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

poetry.lock

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

pyproject.toml

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

pyrightconfig.json

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

README.md

Update README.md

2025-05-12 15:48:37 +03:00

requirements.txt

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

README.md

ik_llama.cpp: llama.cpp fork with better CPU performance

TL;DR

This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.

Latest News

May 12 2025: User can now control if/which operations with tensors held in RAM are offloaded to the GPU. See PR 405
May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using llama.cpp-style MLA GGUFs was recovered in PR 409.
May 11 2025: 🚀 Slightly faster flash attention for DeepSeek models on CUDA, along with extending compatibility to Turing or newer GPUs. See PR 408
May 9 2025: Support for LLaMA-3-Nemotron models added, see PR 377
May 7 2025: 🚀 Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. See PR 386 for details. Caveat: Ampere or newer Nvidia GPU required
May 4 2025: 🚀 Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR #370
April 29 2025: Qwen3 support added, see PR 355
April 26 2025: GLM-4 support added, see PR 344
April 26 2025: Command-A support added, see PR 341
April 22 2025: Support for the latest Microsoft Bitnet model added, see PR 337
April 21 2025: ik_llama.cpp builds and runs successfully on Android (using termux), see PR 336
April 17 2025: 🚀 Better CPU Flash Attention token generation performance, see PR 332
April 13 2025: IQ1_M quantization improvements, see PR 327
April 10 2025: LLaMA-4 support added, see PR 321. In the PR there are also some custom quantization recipes for L4-Scout provided.
April 7 2025: IQ2_XS quantization improvements, see PR 312
April 3 2025: 🚀 Much faster MoE implementation on Metal, see PR 307
April 1 2025: Quantization improvements for Q2_K, Q4_K, Q5_K, Q4_1, Q5_1, see PR 302
March 28 2025: Quantization improvements for Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL, see PR 295
March 25 2025: 🚀 Better MoE performance on CUDA
March 23 2025: 🚀 Better batched processing speed for DeepSeek models
March 22 2025: Gemma3 support added
March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models
March 18 2025: Reduce compute buffer size
March 17 2025: 🚀 FlashMLA-2 performance improvements
March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
March 10 2025: 🚀 Better TG performance for MoE models on CUDA
March 9 2025: 🚀 FlashMLA on CUDA
March 8 2025: 🚀 Faster FlashMLA CPU implementation
March 7 2025: Custom quantization mixes using regular expressions
March 5 2025: 🚀 FlashMLA on CUDA
March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention
March 1 2025: Smart Expert Reduction for faster DeepSeek inference
Feb 27 2025: MLA without transposed cache
Feb 25 2025: Tensor overrides for better control where model weights are stored (GPU or CPU)
Feb 23 2025: 🚀 Fused FFN ops for faster MoE inference
Feb 23 2025: sweep-bench - better performance benchmarking
Feb 20 2025: 🚀 Fast GEMM/GEMV for IQ1_S
Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
Feb 13 2025: Allow Q8_0 quantized cache with MLA
Feb 11 2025: 🚀 Flash Attention support for DeepSeek models
Feb 9 2025: 🚀 MLA for DeepSeek models
Jan 23 2025: DeepSeek-V3 support added

Resources

There is no single point of reference describing all new ik_llama.cpp features. Pull requests often contain detailed information, so browsing the PRs is often the best way to learn about new features and how to use them. In addition:

The Wiki page has performance comparisons to mainline llama.cpp
This guide is a good place to start if you came here because of DeepSeek models
This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
This discussion describes the new quantization types available in ik_llama.cpp

Contributing

Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.

License

MIT

Languages

C++ 57.5%

C 15.4%

Cuda 13.7%

Python 5.2%

Metal 2.8%

Other 5.4%