ik_llama.cpp: llama.cpp fork with better CPU performance

License: MIT

TL;DR

This repository is a fork of llama.cpp with additional state-of-the-art (SOTA) quantization types and better CPU and hybrid GPU/CPU performance.

Latest News

  • April 26 2025: GLM-4 support added
  • April 26 2025: Command-A support added
  • April 22 2025: Support for the latest Microsoft Bitnet model added
  • April 17 2025: Better CPU Flash Attention token generation performance
  • April 13 2025: IQ1_M quantization improvements
  • April 10 2025: LLaMA-4 support added
  • April 7 2025: IQ2_XS quantization improvements
  • April 3 2025: Much faster MoE implementation on Metal
  • April 1 2025: Quantization improvements for Q2_K, Q4_K, Q5_K, Q4_1, Q5_1
  • March 28 2025: Quantization improvements for Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL
  • March 25 2025: Better MoE performance on CUDA
  • March 23 2025: Better batched processing speed for DeepSeek models
  • March 22 2025: Gemma3 support added
  • March 21 2025: FlashMLA-3: fastest CPU-only inference for DeepSeek models
  • March 18 2025: Reduced compute buffer size
  • March 17 2025: FlashMLA-2 performance improvements
  • March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
  • March 10 2025: Better TG performance for MoE models on CUDA
  • March 9 2025: FlashMLA on CUDA
  • March 8 2025: Faster FlashMLA CPU implementation
  • March 7 2025: Custom quantization mixes using regular expressions (second example after this list)
  • March 5 2025: FlashMLA on CUDA
  • March 3 2025: Introducing FlashMLA - MLA with Flash Attention
  • March 1 2025: Smart Expert Reduction for faster DeepSeek inference
  • Feb 27 2025: MLA without transposed cache
  • Feb 25 2025: Tensor overrides for finer control over where model weights are stored (GPU or CPU); see the first example after this list
  • Feb 23 2025: Fused FFN ops for faster MoE inference
  • Feb 23 2025: sweep-bench: better performance benchmarking (third example after this list)
  • Feb 20 2025: Fast GEMM/GEMV for IQ1_S
  • Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
  • Feb 13 2025: Allow Q8_0 quantized cache with MLA
  • Feb 11 2025: Flash Attention support for DeepSeek models
  • Feb 9 2025: MLA for DeepSeek models
  • Jan 23 2025: DeepSeek-V3 support added
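
A quick sketch tying together the tensor-override, Flash Attention, FlashMLA, and fused-MoE items above: a hybrid GPU/CPU run that offloads everything except the routed-expert tensors. The flag spellings (-ot, -fa, -mla, -fmoe) follow the feature names in this list, and the model path is a placeholder; verify both against ./llama-server --help in your build.

    # Minimal sketch (flags assumed from the news items above, not verified):
    # offload all layers to the GPU, but keep the MoE expert FFN tensors in
    # system RAM via a tensor-override regex, and enable Flash Attention,
    # FlashMLA-3, and fused MoE FFN ops.
    ./llama-server -m ./model.gguf -ngl 99 \
        -ot "\.ffn_.*_exps\.=CPU" \
        -fa -mla 3 -fmoe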
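
The custom quantization mixes (March 7) can be sketched similarly. The --custom-q flag name and its regex=type syntax are assumptions here; check ./llama-quantize --help for the exact spelling.

    # Hypothetical sketch: quantize the bulk of the model to IQ4_XS while
    # pinning the output tensor to Q6_K via a regex=type override.
    ./llama-quantize --custom-q "output\.weight=q6_k" \
        model-f16.gguf model-iq4_xs.gguf iq4_xs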
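
And sweep-bench (Feb 23) can be run roughly like this; the binary name follows the news item, and the model path, context size, and flags are placeholders.

    # Hypothetical sketch: measure prompt-processing and token-generation
    # speed as the KV cache fills, sweeping the context up to 8192 tokens.
    ./llama-sweep-bench -m ./model.gguf -c 8192 -fa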

Contributing

Contributions in the form of pull requests or issue submissions (bug reports, feature requests) are welcome.

License

MIT
