mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-11 16:40:16 +00:00

Kawrakow 98d1626469 Update README.md (#352)

* Update README.md

* Edits

* Updates

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-04-30 15:11:29 +02:00

ik_llama.cpp: llama.cpp fork with better CPU performance

TL;DR

This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance. It adds new SOTA quantization types, first-class Bitnet support, and row-interleaved quant packing, and it improves DeepSeek performance via MLA, FlashMLA, fused MoE operations, and tensor overrides for hybrid GPU/CPU inference, among other changes.

Latest News

April 29 2025: Qwen3 support added
April 26 2025: GLM-4 support added
April 26 2025: Command-A support added
April 22 2025: Support for the latest Microsoft Bitnet model added
April 21 2025: ik_llama.cpp builds and runs successfully on Android (using termux)
April 17 2025: Better CPU Flash Attention token generation performance
April 13 2025: IQ1_M quantization improvements
April 10 2025: LLaMA-4 support added
April 7 2025: IQ2_XS quantization improvements
April 3 2025: Much faster MoE implementation on Metal
April 1 2025: Quantization improvements for Q2_K, Q4_K, Q5_K, Q4_1, Q5_1
March 28 2025: Quantization improvements for Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL
March 25 2025: Better MoE performance on CUDA
March 23 2025: Better batched processing speed for DeepSeek models
March 22 2025: Gemma3 support added
March 21 2025: FlashMLA-3: fastest CPU-only inference for DeepSeek models
March 18 2025: Reduced compute buffer size
March 17 2025: FlashMLA-2 performance improvements
March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
March 10 2025: Better TG performance for MoE models on CUDA
March 9 2025: FlashMLA on CUDA
March 8 2025: Faster FlashMLA CPU implementation
March 7 2025: Custom quantization mixes using regular expressions
March 5 2025: FlashMLA on CUDA
March 3 2025: Introducing FlashMLA - MLA with Flash Attention
March 1 2025: Smart Expert Reduction for faster DeepSeek inference
Feb 27 2025: MLA without transposed cache
Feb 25 2025: Tensor overrides for better control over where model weights are stored (GPU or CPU)
Feb 23 2025: Fused FFN ops for faster MoE inference
Feb 23 2025: sweep-bench - better performance benchmarking
Feb 20 2025: Fast GEMM/GEMV for IQ1_S
Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
Feb 13 2025: Allow Q8_0 quantized cache with MLA
Feb 11 2025: Flash Attention support for DeepSeek models
Feb 9 2025: MLA for DeepSeek models
Jan 23 2025: DeepSeek-V3 support added
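
The Feb 25 and March 7 entries above both describe selecting tensors by name pattern, in one case to decide where weights are stored and in the other to pick a quantization type. The sketch below only illustrates that general idea (an ordered list of regular-expression rules, first match wins); it is not ik_llama.cpp's implementation and does not reproduce its command-line syntax. The function `place_tensor`, the `OVERRIDES` table, and the tensor names are hypothetical and chosen for illustration only.

```python
import re

# Hypothetical rule table: (regular expression, placement).
# Rules are checked in order and the first match wins.
OVERRIDES = [
    (r"ffn_.*_exps", "CPU"),  # assumed naming: routed-expert tensors stay in system RAM
    (r".*", "GPU"),           # everything else is placed on the GPU
]


def place_tensor(name, rules=OVERRIDES, default="GPU"):
    """Return the placement of the first rule whose regex matches `name`."""
    for pattern, placement in rules:
        if re.search(pattern, name):
            return placement
    return default  # used only when no rule matches


if __name__ == "__main__":
    # Illustrative tensor names, not taken from any specific model file.
    for name in ["blk.3.ffn_up_exps.weight", "blk.3.attn_q.weight"]:
        print(name, "->", place_tensor(name))
```

Ordering the rules from most to least specific keeps the table short: a single catch-all at the end covers every tensor that no earlier pattern claims. For the real option names and pattern syntax, the corresponding pull requests remain the reference.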

Resources

There is no single point of reference describing all new ik_llama.cpp features. Pull requests usually contain detailed information, so browsing the PRs is often the best way to learn about new features and how to use them. In addition:

The Wiki page has performance comparisons to mainline llama.cpp
This guide is a good place to start if you came here because of DeepSeek models
This discussion is about running DeepSeek-V3/R1 on a 16 x 3090 setup
This discussion describes the new quantization types available in ik_llama.cpp

Contributing

Contributions in the form of pull requests, issue submissions (bug reports, feature requests), or general discussions are welcome.

License

MIT

Languages

C++ 58.3%

C 14.8%

Cuda 13.1%

Python 5.1%

Metal 2.7%

Other 6%