ik_llama.cpp/README.md
2025-06-09 07:26:47 -05:00

ik_llama.cpp: llama.cpp fork with better CPU performance

License: MIT

TL;DR

This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.

Latest News

Model Support

LLaMA-3-Nemotron PR 377, Qwen3 PR 355, GLM-4 PR 344, Command-A PR 341, bitnet-b1.58-2B-4T PR 337, LLaMA-4 PR 321, Gemma3 PR 276, DeepSeek-V3 PR 176

Quantization improvements

IQ1_M PR 327, IQ2_XS PR 312, Q2_K, Q4_K, Q5_K, Q4_1, Q5_1 PR 302, Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL PR 295

Features

  • May 12 2025: Users can now control whether, and which, operations on tensors held in RAM are offloaded to the GPU. See PR 405
  • May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using llama.cpp-style MLA GGUFs was recovered in PR 409.
  • April 21 2025: ik_llama.cpp builds and runs successfully on Android (using Termux), see PR 336
  • March 1 2025: Smart Expert Reduction for faster DeepSeek inference, see
  • Feb 25 2025: Tensor overrides for better control over where model weights are stored (GPU or CPU)
  • Feb 23 2025: sweep-bench - better performance benchmarking
  • Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
  • March 7 2025: Custom quantization mixes using regular expressions
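
The tensor overrides and regex-based quantization mixes listed above both come down to matching tensor names against patterns. The Python sketch below illustrates that idea only; it is not the ik_llama.cpp implementation. The `resolve` helper, the rule lists, the first-match-wins policy, the defaults, and the example tensor names are all assumptions made for illustration.

```python
import re

def resolve(tensor_name, rules, default):
    """Return the value of the first rule whose regex matches tensor_name.

    Assumption: first match wins; the real tool may resolve conflicts differently.
    """
    for pattern, value in rules:
        if re.search(pattern, tensor_name):
            return value
    return default

# Hypothetical placement rules: keep MoE expert weights in RAM (CPU),
# let everything else fall through to the default (GPU).
placement_rules = [(r"ffn_.*_exps", "CPU")]

# Hypothetical quantization mix: higher precision for attention tensors,
# a smaller type for the expert weights.
quant_rules = [(r"attn_", "Q8_0"), (r"ffn_.*_exps", "IQ4_XS")]

# Illustrative tensor names in the style used by GGUF model files.
tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "output.weight",
]

# Map each tensor to a (device, quantization type) pair.
plan = {
    name: (resolve(name, placement_rules, "GPU"),
           resolve(name, quant_rules, "Q4_K"))
    for name in tensors
}

for name, (device, qtype) in plan.items():
    print(f"{name:28s} -> {device:3s} {qtype}")
```

In this sketch the rules are ordered, so putting more specific patterns first lets a small rule set describe a whole model. For the options ik_llama.cpp actually accepts, see the corresponding pull requests.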

Performance improvements

  • May 11 2025: 🚀 Slightly faster flash attention for DeepSeek models on CUDA, along with extending compatibility to Turing or newer GPUs. See PR 408
  • May 7 2025: 🚀 Faster token generation (TG) for DeepSeek models with GPU or hybrid GPU/CPU inference. See PR 386 for details. Caveat: Ampere or newer Nvidia GPU required
  • May 4 2025: 🚀 Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR 370
  • April 17 2025: 🚀 Better CPU Flash Attention token generation performance, see PR 332
  • April 3 2025: 🚀 Much faster MoE implementation on Metal, see PR 307
  • March 25 2025: 🚀 Better MoE performance on CUDA
  • March 23 2025: 🚀 Better batched processing speed for DeepSeek models
  • March 18 2025: Reduce compute buffer size
  • March 10 2025: 🚀 Better TG performance for MoE models on CUDA
  • Feb 23 2025: 🚀 Fused FFN ops for faster MoE inference
  • Feb 20 2025: 🚀 Fast GEMM/GEMV for IQ1_S

Flash-MLA

  • March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models see
  • March 17 2025: 🚀 FlashMLA-2 performance improvements
  • March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
  • March 9 2025: 🚀 FlashMLA on CUDA
  • March 8 2025: 🚀 Faster FlashMLA CPU implementation
  • March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention
  • Feb 27 2025: MLA without transposed cache
  • Feb 13 2025: Allow Q8_0 quantized cache with MLA
  • Feb 11 2025: 🚀 Flash Attention support for DeepSeek models
  • Feb 9 2025: 🚀 MLA for DeepSeek models

Resources

There is no single point of reference describing all new ik_llama.cpp features. Pull requests usually contain detailed information, so browsing the PRs is often the best way to learn about new features and how to use them. In addition:

  • The Wiki page has performance comparisons to mainline llama.cpp
  • This guide is a good place to start if you came here because of DeepSeek models
  • This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
  • This discussion describes the new quantization types available in ik_llama.cpp

Contributing

Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.

License

MIT