ik_llama.cpp: llama.cpp fork with better CPU performance
TL;DR
This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.
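As a rough illustration of how these pieces combine, the sketch below starts the server with as many layers as possible on the GPU while keeping the MoE expert tensors in RAM. The -mla, -fmoe and -ot flags, the binary name, and the model path are assumptions based on the PRs linked further down, not verified syntax; check those PRs or the --help output of your build.

```bash
# Sketch only: -mla (MLA/FlashMLA level), -fmoe (fused MoE ops) and
# -ot (tensor overrides, "regex=buffer") are assumed flag names taken from the
# PRs referenced in this README; the model path is a placeholder.
./build/bin/llama-server \
  -m /models/DeepSeek-V3-IQ4_KS.gguf \
  -ngl 99 \
  -fa -mla 3 \
  -fmoe \
  -ot "ffn_.*_exps=CPU" \
  -c 32768 --port 8080
```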
Latest News
Model Support
LlaMA-3-Nemotron PR 377, Qwen3 PR 355, GLM-4 PR 344, Command-A PR 341, bitnet-b1.58-2B-4T PR 337, LLaMA-4 PR 321, Gemma3 PR 276, DeepSeek-V3 PR 176
Quantization
Quantization additions
Trellis quants (IQ2_KT, IQ3_KT, IQ4_KT)
Information and the original CUDA implementation are in PR 113. Additional implementations: Metal PR 475, NEON PR 471, CPU PR 441
IQK quants
Information can be found in Discussion 8.
Initial implementations (Zen4, AVX2, NEON): IQ5_KS_R4 PR 426, IQ5_KS PR 422, IQ4_KS_R4 PR 150, IQ5_K_R4 PR 149, IQ2_K_R4 PR 146, IQ3_K_R4 PR 145, IQ4_K_R4 PR 138, IQ4_KSS PR 89, IQ2_KS PR 85, IQ4_KS PR 83, IQ6_K PR 14, IQ2_K, IQ3_K and IQ5_K PR 7, IQ4_K PR 6
CUDA implementations: IQ4_KS_R4 and IQ5_KS_R4 PR 493, IQ1_S_R4 PR 492, IQ1_M_R4 PR 494, IQ4_KS_R4 and IQ5_KS_R4 PR 462, IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 PR 461, IQ4_K, IQ5_K, IQ6_K PR 417, IQ2_KS, IQ2_K, IQ3_K PR 418
Quantization improvements
IQ1_M PR 327, IQ2_XS PR 312, Q2_K, Q4_K, Q5_K, Q4_1, Q5_1 PR 302, Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL PR 295
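To actually produce models with the improved or newly added quants, the workflow is the standard llama.cpp one: convert to GGUF, compute an imatrix, quantize. The sketch below is illustrative; the binary names follow current llama.cpp conventions and IQ4_K is only an example target type, so check the help output of your build for the supported type list.

```bash
# Illustrative workflow (assumed binary names, placeholder paths):
# convert to GGUF, compute an importance matrix, then quantize to an IQK type.
python convert_hf_to_gguf.py /models/my-model --outfile /models/my-model-f16.gguf
./build/bin/llama-imatrix -m /models/my-model-f16.gguf -f calibration.txt -o imatrix.dat
./build/bin/llama-quantize --imatrix imatrix.dat \
  /models/my-model-f16.gguf /models/my-model-IQ4_K.gguf IQ4_K
```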
Quantization performance improvements
- Faster CPU prompt processing for Trellis quants and MoE models PR 488
- Trellis quants: faster CPU prompt processing PR 482
- Minor (~2%) iq2_ks TG performance improvement on CUDA PR 468
- Faster IQ3_KT and IQ4_KT PR 453
- Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS PR 428
- Fast GEMM/GEMV for IQ1_S PR 212
Features
- Legacy quants conversion schemes in convert_hf_to_gguf.py PR 449, Q6_0 in PR 483
- June 8 2025: Webui updated (legacy still available when --path ./examples/server/public_legacy is passed) PR 481
- June 8 2025: RPC improvements PR 480
- June 7 2025: Add an endpoint that lists all the saved prompt caches to server PR 502
- June 6 2025: Make prompt cache saving and restoring MLA aware PR 497
- June 3 2025: Added new samplers: XTC PR 486, top-n σ PR 489
- May 22 2025: Refactor iqk_mul_mat.cpp, which speeds up compilation time significantly PR 435
- May 17 2025: Option to enable or disable the CPU FA kernels PR 429
- May 12 2025: User can now control if/which operations with tensors held in RAM are offloaded to the GPU. See PR 405
- May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using llama.cpp-style MLA GGUFs was recovered in PR 409.
- April 21 2025: ik_llama.cpp builds and runs successfully on Android (using termux), see PR 336
- March 1 2025: Smart Expert Reduction for faster DeepSeek inference PR 239
- Feb 25 2025: Tensor overrides for better control where model weights are stored (GPU or CPU) PR 232
- Feb 23 2025: sweep-bench - better performance benchmarking PR 225
- Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization PR 208
- March 7 2025: Custom quantization mixes using regular expressions PR 244 (see the sketch after this list)
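The regex-based custom mixes from PR 244 are applied at quantization time. The sketch below is only a guess at the usage; the --custom-q flag name and its regex=type syntax are assumptions, so treat PR 244 as the authoritative reference.

```bash
# Hypothetical custom mix: the --custom-q flag and its "regex=type" pairs are
# assumptions recalled from PR 244; verify the real option name and syntax there.
./build/bin/llama-quantize --imatrix imatrix.dat \
  --custom-q "ffn_down=iq5_k,ffn_(gate|up)_exps=iq4_k" \
  /models/my-model-f16.gguf /models/my-model-mix.gguf IQ4_K
```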
Performance improvements
- May 13 2025: Better CPU FA performance for DeepSeek-Lite. PR 410
- May 11 2025: Slightly faster flash attention for DeepSeek models on CUDA, along with extending compatibility to Turing or newer GPUs. PR 408
- May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks, see PR 370.
- April 17 2025: Better CPU Flash Attention token generation performance. PR 332
- April 3 2025: Much faster MoE implementation on Metal. PR 307
- March 25 2025: Better MoE performance on CUDA PR 283
- March 23 2025: Better batched processing speed for DeepSeek models PR 282
- March 18 2025: Reduce compute buffer size PR 237
- March 10 2025: Better TG performance for MoE models on CUDA PR 248
- Feb 23 2025: Fused FFN ops for faster MoE inference PR 229
Flash-MLA
- May 7 2025: 🚀 FlashMLA-3 for DeepSeek models on CUDA. PR 386. Caveat: Ampere or newer Nvidia GPU required
- March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models PR 273
- March 17 2025: 🚀 FlashMLA-2 performance improvements PR 253
- March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA PR 265
- March 9 2025: 🚀 FlashMLA on CUDA PR 247
- March 8 2025: 🚀 Faster FlashMLA CPU implementation PR 243
- March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention PR 240
- Feb 27 2025: MLA without transposed cache PR 235
- Feb 13 2025: Allow Q8_0 quantized cache with MLA PR 206
- Feb 11 2025: 🚀 Flash Attention support for DeepSeek models PR 200
- Feb 9 2025: 🚀 MLA for DeepSeek models PR 188
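Taken together, these items let DeepSeek models run with MLA-based flash attention and a quantized K cache. The sketch below assumes the -mla and -ctk flag names; PR 386 and PR 265 describe the actual options and hardware requirements (Ampere or newer for FlashMLA-3 on CUDA).

```bash
# Sketch: FlashMLA-3 with flash attention and a Q8_0 K cache on a CUDA build.
# -mla is an assumed ik_llama.cpp flag; -fa and -ctk follow usual llama.cpp naming.
./build/bin/llama-cli \
  -m /models/DeepSeek-R1-IQ4_KS.gguf \
  -ngl 99 -fa -mla 3 \
  -ctk q8_0 \
  -p "Hello"
```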
Fixes
- Fix bug in MMVQ kernel PR 446
- Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K PR 427
- Fix standard attention on the CPU PR 421
- Fix imatrix calculation for MLA models PR 411
- Fix new CUDA FA on Turing PR 413
- Fix SER (Smart Expert Reduction). CPU: PR 415, CUDA: PR 416
Resources
There is no single point of reference describing all new ik_llama.cpp features. Pull requests often contain detailed information, so browsing them is usually the best way to learn about new features and how to use them. In addition:
- The Wiki page has performance comparisons to mainline llama.cpp
- This guide is a good place to start if you came here because of DeepSeek models
- This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
- This discussion describes the new quantization types available in ik_llama.cpp
Contributing
Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.
License
MIT