ik_llama.cpp: llama.cpp fork with better CPU performance

License: MIT

TL;DR

This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.

Latest News

Model Support

LLaMA-3-Nemotron PR 377, Qwen3 PR 355, GLM-4 PR 344, Command-A PR 341, bitnet-b1.58-2B-4T PR 337, LLaMA-4 PR 321, Gemma3 PR 276, DeepSeek-V3 PR 176

Quantization improvements

IQ1_M PR 327, IQ2_XS PR 312, Q2_K, Q4_K, Q5_K, Q4_1, Q5_1 PR 302, Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL PR 295
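
These types are produced with the usual llama.cpp quantization workflow. A minimal sketch, assuming the repo builds a llama-quantize tool with mainline-style arguments (file paths are illustrative):

    # Re-quantize an f16 GGUF to IQ4_XS, one of the improved types above.
    # Tool name and argument order follow mainline llama.cpp conventions;
    # adjust paths to your setup.
    ./build/bin/llama-quantize Model-F16.gguf Model-IQ4_XS.gguf IQ4_XS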

Features

  • May 12 2025: Users can now control whether, and which, operations on tensors held in RAM are offloaded to the GPU. See PR 405
  • May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The lower prompt processing performance resulting from using llama.cpp-style MLA GGUFs was recovered in PR 409.
  • April 21 2025: ik_llama.cpp builds and runs successfully on Android (using Termux), see PR 336
  • March 7 2025: Custom quantization mixes using regular expressions
  • March 1 2025: Smart Expert Reduction for faster DeepSeek inference
  • Feb 25 2025: Tensor overrides for better control over where model weights are stored (GPU or CPU); usage sketch after this list
  • Feb 23 2025: sweep-bench - better performance benchmarking; usage sketch after this list
  • Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
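
The tensor-override and sweep-bench entries are easiest to grasp from the command line. A minimal sketch, assuming an -ot (--override-tensor) regex=buffer syntax and a llama-sweep-bench binary as described in the items above; the model path and tensor-name pattern are illustrative:

    # Hybrid GPU/CPU inference via tensor overrides: -ngl 100 offloads all
    # layers to the GPU, then the -ot rule keeps tensors whose names match
    # "exps" (the large MoE expert weights) in system RAM.
    ./build/bin/llama-server -m DeepSeek-V3-IQ4_XS.gguf \
        -ngl 100 -ot "exps=CPU" -c 16384

    # sweep-bench: measure prompt processing and token generation speed at
    # increasing context depths rather than at a single fixed point.
    ./build/bin/llama-sweep-bench -m DeepSeek-V3-IQ4_XS.gguf -c 8192 -ngl 100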

Performance improvements

  • May 11 2025: 🚀 Slightly faster Flash Attention for DeepSeek models on CUDA, along with extended compatibility to Turing or newer GPUs. See PR 408
  • May 7 2025: 🚀 Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. See PR 386 for details. Caveat: Ampere or newer Nvidia GPU required
  • May 4 2025: 🚀 Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR 370
  • April 17 2025: 🚀 Better CPU Flash Attention token generation performance, see PR 332
  • April 3 2025: 🚀 Much faster MoE implementation on Metal, see PR 307
  • March 25 2025: 🚀 Better MoE performance on CUDA
  • March 23 2025: 🚀 Better batched processing speed for DeepSeek models
  • March 18 2025: Reduce compute buffer size
  • March 10 2025: 🚀 Better TG performance for MoE models on CUDA
  • Feb 23 2025: 🚀 Fused FFN ops for faster MoE inference; usage sketch after this list
  • Feb 20 2025: 🚀 Fast GEMM/GEMV for IQ1_S
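
Several of these speedups are run-time opt-ins rather than automatic. A minimal sketch, assuming the -fa (Flash Attention) and -fmoe (fused MoE/FFN ops) switches named above; the model and prompt are illustrative:

    # Hypothetical MoE run with Flash Attention and fused FFN ops enabled.
    # -fmoe is assumed from the fused-FFN item above; everything else is a
    # standard llama.cpp-style invocation.
    ./build/bin/llama-cli -m Mixtral-8x7B-IQ4_XS.gguf \
        -fa -fmoe -n 128 -p "Explain MLA in one paragraph."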

Flash-MLA

  • March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models; combined usage sketch after this list
  • March 17 2025: 🚀 FlashMLA-2 performance improvements
  • March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
  • March 9 2025: 🚀 FlashMLA on CUDA
  • March 8 2025: 🚀 Faster FlashMLA CPU implementation
  • March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention
  • Feb 27 2025: MLA without transposed cache
  • Feb 13 2025: Allow Q8_0 quantized cache with MLA
  • Feb 11 2025: 🚀 Flash Attention support for DeepSeek models
  • Feb 9 2025: 🚀 MLA for DeepSeek models
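
Combining the items above, a minimal sketch of a DeepSeek launch with MLA and Flash Attention. The -mla level selector, the -amb compute-buffer cap, and the Q8_0 KV cache via -ctk are assumptions drawn from the corresponding entries; model path and context size are illustrative:

    # Hypothetical DeepSeek run: -mla 3 selects the FlashMLA-3 variant,
    # -fa enables Flash Attention, -amb 512 caps the attention compute
    # buffer, and -ctk q8_0 stores the KV cache in Q8_0 (allowed with MLA
    # per the Feb 13 / March 12 items). Flag names and values are
    # assumptions based on this feature list.
    ./build/bin/llama-server -m DeepSeek-R1-IQ4_XS.gguf \
        -mla 3 -fa -amb 512 -ctk q8_0 -c 32768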

Resources

There is no single point of reference describing all new ik_llama.cpp features. Pull requests often contain detailed information, so browsing them is the best way to learn about new features and how to use them. In addition:

  • The Wiki page has performance comparisons to mainline llama.cpp
  • This guide is a good place to start if you came here because of DeepSeek models
  • This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
  • This discussion describes the new quantization types available in ik_llama.cpp

Contributing

Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.

License

MIT
