Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-01-26 17:20:01 +00:00
Latest commit 25d34e3d2f8cc494067958c909f71453a833cd4e - Trellis quantization (squashed commit log):
- WIP (x3)
- Testing Trellis quantization: using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2-SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable.
- Testing Trellis quantization: with 4-bit quantized block scales, rmse increases by just 3%, so this beats iq2_xxs in terms of rmse at the same 2.0625 bpw.
- Testing Trellis quantization: playing with scales and generators.
- iq2_kt: quantize/dequantize. I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so it does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).
- iq2_kt: CUDA dequantize, so we can run perplexity calculations. As already indicated by rmse, the 2-bit Trellis approach is quite a bit worse than iq2_xxs.
- WIP (x2)
- WIP - try larger blocks: with blocks of 32 and 16 bits per group of 8, the brute-force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within that cluster.
- iq2_kt - this is better: using blocks of 32 and 16 bits per group of 8 weights, it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.
- iq2_kt - even better: re-quantize after determining block scales (at the expense of much longer quantization time).
- iq2_kt: CUDA dot product. Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.
- iq2_kt: very slightly faster CUDA dot product.
- iq2_kt: f16 CUDA dot product. We arrive at 112 t/s.
- iq2_kt: faster f16 CUDA dot product. We arrive at 139 t/s (no FA) and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, almost matching their performance.
- iq2_kt: faster f16 CUDA dot product. We arrive at 146 t/s (no FA) and 158 t/s (FA), measured for LLaMA-3.1-8B with output.weight left as f16.
- Minor
- Adding iq3_kt: 3.125 bpw. So far it does not look good on the PPL vs bpw plot.
- Forgotten change
- WIP (x2)
- iq3_kt WIP, slowly improving: PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive with/slightly better than other quants.
- WIP
- iq3_kt WIP, slowly improving: PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892.
- iq3_kt WIP, slowly improving: PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689, after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
- iq3_kt WIP, speed up quantization: nearly 60% improvement of quantization speed by copying the points belonging to a cluster to contiguous memory during initialization, and then accessing them sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.
- iq2_kt, speed up quantization: same trick as in the last commit, applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops from 375 seconds to 195 seconds!
- iq3_kt: CUDA dot product
- iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406, PPL(LLaMA-2-7B, 4096) = 6.4179.
- iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642, PPL(LLaMA-2-7B, 4096) = 6.3920.
- Adding iq4_kt - not competitive at this point
- WIP (x2)
- iq4_kt: CUDA dot product
- iq4_kt: minor tweaks
- iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642, PPL(LLaMA-2-7B, 4096) = 6.3920.
- iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297, PPL(LLaMA-2-7B, 4096) = 6.3913. Ah, quantization is faster too - about 20% faster.
- iq3_kt: small improvements and faster quantization
- iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627, PPL(LLaMA-2-7B, 4096) = 6.3825. Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on the Ryzen-5975WX.
- iq3_kt: small progress
- WIP
- iq4_kt: go to 4.0 bpw. 15 bits per group of 4, plus 8-bit scales for blocks of 32. This gives a slightly better PPL than iq4_kss.
- iq4_kt: very slightly better, at the expense of much longer quantization time.
- iq4_kt: failed attempt to adjust the CUDA dot product. It was working at 4.125 bpw, but after changing to 4.0 bpw there is something wrong and I don't see the bug.
- DRY (x2)
- iq4_kt: CUDA dot product works
- DRY
- Report actual bpw
- Minor tweaks
- Checkpoint: go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude, plus 1 bpw for the sign. This gives a visible improvement in the PPL vs bpw plot, but it comes at the expense of much longer quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX). I also noticed that the 3INST generator is not actually generating a Gaussian distribution. But going to a better generator means readjusting all the hyper-parameters, so that is left for later.
- WIP for IQ2_KT
- WIP - working basic iq2_kt
- Still super slow (0.17 t/s eval)
- Flatten 3INST iterations + AVX2 (0.3 t/s eval)
- iq3_kt (0.3 t/s eval) and renames
- WIP: buggy iq4_kt
- Fix (0.22 t/s eval)
- Naming, and remove an unused function
- Cleanup
- More cleanup
- Delete unused and non-compiling MMVQ functions
- Some performance tweaks
- Slightly faster iq2_kt
- Port the Trellis struct to iq3_kt and iq4_kt
- Oops - untracked files

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
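The commit log above leans on a QTIP-style "3INST" generator: each group of 8 weights is stored as a single 16-bit codeword, and the dequantized values are re-generated on the fly from that codeword. The sketch below shows the general shape of such a generator in C++. The LCG constants, bit masks, and seeding are illustrative assumptions, not values confirmed from the ik_llama.cpp sources; as the log itself notes, the resulting distribution is only approximately Gaussian.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert a normal (non-denormal) IEEE fp16 bit pattern to float.
// All bit patterns produced below have their exponent forced into a safe
// range, so the simple exponent re-bias (15 -> 127) is sufficient.
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t em   = h & 0x7fffu;                       // exponent + mantissa
    uint32_t f32  = sign | ((em + (112u << 10)) << 13);
    float out;
    std::memcpy(&out, &f32, sizeof out);
    return out;
}

// One "3INST"-style step: a 32-bit LCG update, then mask/XOR the state so
// that its two 16-bit halves are valid fp16 numbers with small exponents,
// and return the sum of those halves, which is roughly bell-shaped.
static float trellis_next(uint32_t &state) {
    constexpr uint32_t ka    = 89226354u;   // LCG multiplier (assumed value)
    constexpr uint32_t kb    = 64248484u;   // LCG increment  (assumed value)
    constexpr uint32_t kmask = 0x8fff8fffu; // keep sign + low bits per half
    constexpr uint32_t km32  = 0x3b603b60u; // pin both fp16 exponents
    state = ka * state + kb;
    uint32_t bits = (state & kmask) ^ km32;
    return half_to_float(bits & 0xffffu) + half_to_float(bits >> 16);
}

int main() {
    // One 16-bit codeword per group of 8 weights: the codeword seeds the
    // generator, and 8 successive steps reproduce the 8 dequantized values
    // (up to the block scale, which is omitted here).
    for (uint32_t codeword : {0u, 1u, 42u, 65535u}) {
        uint32_t state = codeword;
        std::printf("codeword %5u:", codeword);
        for (int i = 0; i < 8; ++i) std::printf(" %+.3f", trellis_next(state));
        std::printf("\n");
    }
    return 0;
}
```

The appeal of this construction is that the "codebook" costs no memory at all: the quantizer searches for the 16-bit seed whose generated sequence best matches the 8 weights, which is what makes the clustered-search speedups in the log matter so much.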
ik_llama.cpp: llama.cpp fork with better CPU performance
TL;DR
This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA and FlashMLA, fused MoE operations, tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, and more.
Latest News
- May 12 2025: Users can now control whether, and which, operations on tensors held in RAM are offloaded to the GPU. See PR 405
- May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using llama.cpp-style MLA GGUFs was recovered in PR 409.
- May 11 2025: 🚀 Slightly faster flash attention for DeepSeek models on CUDA, along with extended compatibility to Turing or newer GPUs. See PR 408
- May 9 2025: Support for LLaMA-3-Nemotron models added, see PR 377
- May 7 2025: 🚀 Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. See PR 386 for details. Caveat: Ampere or newer Nvidia GPU required
- May 4 2025: 🚀 Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR #370
- April 29 2025: Qwen3 support added, see PR 355
- April 26 2025: GLM-4 support added, see PR 344
- April 26 2025: Command-A support added, see PR 341
- April 22 2025: Support for the latest Microsoft Bitnet model added, see PR 337
- April 21 2025: ik_llama.cpp builds and runs successfully on Android (using termux), see PR 336
- April 17 2025: 🚀 Better CPU Flash Attention token generation performance, see PR 332
- April 13 2025: IQ1_M quantization improvements, see PR 327
- April 10 2025: LLaMA-4 support added, see PR 321, which also provides some custom quantization recipes for L4-Scout
- April 7 2025: IQ2_XS quantization improvements, see PR 312
- April 3 2025: 🚀 Much faster MoE implementation on Metal, see PR 307
- April 1 2025: Quantization improvements for Q2_K, Q4_K, Q5_K, Q4_1, Q5_1, see PR 302
- March 28 2025: Quantization improvements for Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL, see PR 295
- March 25 2025: 🚀 Better MoE performance on CUDA
- March 23 2025: 🚀 Better batched processing speed for DeepSeek models
- March 22 2025: Gemma3 support added
- March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models
- March 18 2025: Reduce compute buffer size
- March 17 2025: 🚀 FlashMLA-2 performance improvements
- March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
- March 10 2025: 🚀 Better TG performance for MoE models on CUDA
- March 9 2025: 🚀 FlashMLA on CUDA
- March 8 2025: 🚀 Faster FlashMLA CPU implementation
- March 7 2025: Custom quantization mixes using regular expressions (see the sketch after this news list)
- March 5 2025: 🚀 FlashMLA on CUDA
- March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention
- March 1 2025: Smart Expert Reduction for faster DeepSeek inference
- Feb 27 2025: MLA without transposed cache
- Feb 25 2025: Tensor overrides for better control where model weights are stored (GPU or CPU)
- Feb 23 2025: 🚀 Fused FFN ops for faster MoE inference
- Feb 23 2025: sweep-bench - better performance benchmarking
- Feb 20 2025: 🚀 Fast GEMM/GEMV for IQ1_S
- Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
- Feb 13 2025: Allow Q8_0 quantized cache with MLA
- Feb 11 2025: 🚀 Flash Attention support for DeepSeek models
- Feb 9 2025: 🚀 MLA for DeepSeek models
- Jan 23 2025: DeepSeek-V3 support added
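As referenced in the March 7 entry, quantization mixes can be customized with regular expressions over tensor names. The C++ sketch below is a simplified illustration of the idea, not the actual ik_llama.cpp implementation; the rule syntax, helper names, and quant type names are assumptions. Each rule pairs a regex with a target type, and the first matching rule wins.

```cpp
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// A rule pairs a regular expression over tensor names with a target
// quantization type; the first matching rule wins.
struct QuantRule {
    std::regex  pattern;
    std::string qtype;
};

static std::string resolve_qtype(const std::string &tensor_name,
                                 const std::vector<QuantRule> &rules,
                                 const std::string &fallback) {
    for (const auto &r : rules)
        if (std::regex_search(tensor_name, r.pattern)) return r.qtype;
    return fallback;   // default mix for everything not matched by a rule
}

int main() {
    // Hypothetical mix: keep attention output in 8 bits, MoE experts in 4.
    std::vector<QuantRule> rules = {
        { std::regex("attn_output"), "q8_0"  },
        { std::regex("ffn_.*_exps"), "iq4_k" },
    };
    const char *names[] = {
        "blk.0.attn_output.weight",
        "blk.7.ffn_up_exps.weight",
        "blk.3.ffn_down.weight",
    };
    for (const char *name : names)
        std::printf("%-26s -> %s\n", name,
                    resolve_qtype(name, rules, "iq3_k").c_str());
    return 0;
}
```

The value of this approach is that a handful of ordered rules can express a whole per-tensor recipe (e.g. heavier quantization for MoE experts, lighter for attention) without enumerating every tensor by hand.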
Resources
There is no single point of reference describing all of the new ik_llama.cpp features. Pull requests often contain detailed information, so browsing them is usually the best way to learn about new features and how to use them. In addition:
- The Wiki page has performance comparisons to mainline llama.cpp
- This guide is a good place to start if you came here because of DeepSeek models
- This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
- This discussion describes the new quantization types available in ik_llama.cpp (the sketch below shows the basic block layout such quants build on)
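As background for that discussion: most ggml-style quants, including many of the new types, share the pattern of grouping weights into fixed-size blocks that carry their own scales. The sketch below mirrors the well-known Q8_0 layout (blocks of 32 weights with one scale each, i.e. 8.5 bpw); the real format stores the scale as fp16, while this self-contained version uses a float.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// One block in the style of ggml's Q8_0: 32 weights stored as int8 plus a
// per-block scale. The real format stores the scale as fp16; a float keeps
// this sketch self-contained.
constexpr int QK = 32;
struct BlockQ8 {
    float  d;       // scale, so that weight ~= d * q
    int8_t qs[QK];  // quantized values in [-127, 127]
};

static BlockQ8 quantize_block(const float *x) {
    BlockQ8 b{};
    float amax = 0.0f;                      // absolute maximum of the block
    for (int i = 0; i < QK; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    b.d = amax / 127.0f;
    const float id = b.d > 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK; ++i) b.qs[i] = (int8_t)std::lround(x[i] * id);
    return b;
}

int main() {
    std::vector<float> row(QK);
    for (int i = 0; i < QK; ++i) row[i] = std::sin(0.1f * (float)i); // toy data
    BlockQ8 b = quantize_block(row.data());
    float max_err = 0.0f;  // worst absolute round-trip error in the block
    for (int i = 0; i < QK; ++i)
        max_err = std::fmax(max_err, std::fabs(row[i] - b.d * (float)b.qs[i]));
    std::printf("scale = %.6f, max |error| = %.6f\n", b.d, max_err);
    return 0;
}
```

The newer types discussed in that thread differ mainly in how aggressively they shrink this layout: smaller or nested block scales, sub-4-bit codes, or (as in the trellis types above) replacing stored codes with generated ones.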
Contributing
Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.
License
MIT
Languages
C++ 55.4%, C 16.4%, Cuda 14%, Python 5.5%, Metal 3%, Other 5.6%