mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-03-01 01:24:08 +00:00)

Commit: Update with new ones

Changed file: README.md
This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp).
LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276), DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176)
### Quantization
#### Quantization additions
`IQ5_KS_R4` [PR 426](https://github.com/ikawrakow/ik_llama.cpp/pull/426), `IQ5_KS` [PR 422](https://github.com/ikawrakow/ik_llama.cpp/pull/422)
#### Quantization improvements
`IQ1_M` [PR 327](https://github.com/ikawrakow/ik_llama.cpp/pull/327), `IQ2_XS` [PR 312](https://github.com/ikawrakow/ik_llama.cpp/pull/312), `Q2_K, Q4_K, Q5_K, Q4_1, Q5_1` [PR 302](https://github.com/ikawrakow/ik_llama.cpp/pull/302), `Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL` [PR 295](https://github.com/ikawrakow/ik_llama.cpp/pull/295)
### Quantization performance, support improvements, and fixes
* MMQ implementation for `IQ4_KS_R4` and `IQ5_KS_R4` #493
* CUDA implementation for `IQ1_S_R4` #492 and `IQ1_M_R4` #494
* Faster CPU prompt processing for Trellis quants and MoE models #488
* Trellis quants: faster CPU prompt processing #482
* Additional implementations for the Trellis quants: Metal #475, NEON #471, CPU #441
* Minor (~2%) `IQ2_KS` TG performance improvement on CUDA #468
* CUDA GEMM and GEMV for `IQ4_KS_R4` and `IQ5_KS_R4` #462
* CUDA implementation for `IQ2_K_R4`, `IQ3_K_R4`, `IQ4_K_R4`, `IQ5_K_R4` #461
* Faster `IQ3_KT` and `IQ4_KT` #453
* Legacy quants conversion schemes in `convert_hf_to_gguf.py` #449, `Q6_0` in #483
* Zen4: faster PP for `IQ2_KS, IQ4_KS, IQ5_KS` #428
* CUDA: quantized GEMMs for `IQ4_K, IQ5_K, IQ6_K` [PR 417](https://github.com/ikawrakow/ik_llama.cpp/pull/417) and `IQ2_KS, IQ2_K, IQ3_K` [PR 418](https://github.com/ikawrakow/ik_llama.cpp/pull/418)
### Features
* June 8 2025: Webui updated (the legacy webui remains available when `--path ./examples/server/public_legacy` is passed) #481
* June 8 2025: RPC improvements #480
* June 7 2025: Added a server endpoint that lists all saved prompt caches #502
* June 6 2025: Prompt cache saving and restoring is now MLA aware #497
* June 3 2025: New samplers: XTC #486 and top-n σ #489
* May 22 2025: Refactored `iqk_mul_mat.cpp`, significantly speeding up compilation #435
* May 17 2025: Option to enable or disable the CPU FA kernels #429
* May 12 2025: User can now control if/which operations with tensors held in RAM are offloaded to the GPU. See [PR 405](https://github.com/ikawrakow/ik_llama.cpp/pull/405)
* May 12 2025: Compatibility issues with mainline `llama.cpp` GGUFs for DeepSeek models with MLA enabled were resolved in [PR 394](https://github.com/ikawrakow/ik_llama.cpp/pull/394). The lower prompt processing performance resulting from using `llama.cpp`-style MLA GGUFs was recovered in [PR 409](https://github.com/ikawrakow/ik_llama.cpp/pull/409)
* April 21 2025: ik_llama.cpp builds and runs successfully on Android (using Termux), see [PR 336](https://github.com/ikawrakow/ik_llama.cpp/pull/336)
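As a usage sketch for the webui item above (only the `--path ./examples/server/public_legacy` flag is taken from the changelog; the server binary name `llama-server`, the model path, and the port are illustrative assumptions):

```shell
# Serve the updated webui (the default after #481)
./llama-server -m model.gguf --port 8080

# Opt back into the legacy webui by pointing --path at the old static assets
./llama-server -m model.gguf --port 8080 --path ./examples/server/public_legacy
```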
* Feb 25 2025: Tensor overrides for better control where model weights are stored (GPU or CPU)
### Performance improvements
* May 13 2025: Better CPU FA performance for DeepSeek-Lite. [PR 410](https://github.com/ikawrakow/ik_llama.cpp/pull/410)
* May 11 2025: Slightly faster flash attention for DeepSeek models on CUDA, along with extended compatibility to Turing or newer GPUs. [PR 408](https://github.com/ikawrakow/ik_llama.cpp/pull/408)
* May 7 2025: Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. [PR 386](https://github.com/ikawrakow/ik_llama.cpp/pull/386). Caveat: Ampere or newer Nvidia GPU required
* May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks, see [PR 370](https://github.com/ikawrakow/ik_llama.cpp/pull/370)
* April 17 2025: Better CPU Flash Attention token generation performance. [PR 332](https://github.com/ikawrakow/ik_llama.cpp/pull/332)
* April 3 2025: Much faster MoE implementation on Metal. [PR 307](https://github.com/ikawrakow/ik_llama.cpp/pull/307)
* March 25 2025: Better MoE performance on CUDA
* March 23 2025: Better batched processing speed for DeepSeek models
* March 18 2025: Reduced compute buffer size
* March 10 2025: Better TG performance for MoE models on CUDA
* Feb 23 2025: Fused FFN ops for faster MoE inference
* Feb 20 2025: Fast GEMM/GEMV for `IQ1_S`
### Flash-MLA
* March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models
* March 17 2025: 🚀 FlashMLA-2 performance improvements
* March 12 2025: Allow `Q8_0` KV cache with FlashMLA-2 on CUDA
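The Flash-MLA items above are selected at launch time. A hypothetical invocation for a DeepSeek model; the flag spellings (`-mla`, `-fa`, `-ctk`) and their values follow this fork's CLI conventions but are assumptions here, not taken from the PRs, and the binary/model names are illustrative:

```shell
# Hypothetical: FlashMLA-2 with flash attention and a Q8_0 K cache
# (-mla selects the MLA implementation, -fa enables flash attention,
#  -ctk sets the K-cache quantization type)
./llama-server -m deepseek.gguf -fa -mla 2 -ctk q8_0
```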
* Feb 11 2025: 🚀 Flash Attention support for DeepSeek models
* Feb 9 2025: 🚀 MLA for DeepSeek models
### Fixes
* Fix bug in MMVQ kernel #446
* Fix AVX2 implementation of `IQ4_K, IQ4_KS, IQ5_K, IQ6_K` #427
* Fix standard attention on the CPU #421
* Fix imatrix calculation for MLA models #411
* Fix new CUDA FA on Turing #413
* Fix SER. CPU: [PR 415](https://github.com/ikawrakow/ik_llama.cpp/pull/415), CUDA: [PR 416](https://github.com/ikawrakow/ik_llama.cpp/pull/416)
## Resources
There is no single point of reference describing all new `ik_llama.cpp` features. Pull requests often contain detailed information, so browsing them is usually the best way to learn about new features and how to use them.