From 7cb6a76cd0ae54909cdbffa95f163c077827dfc5 Mon Sep 17 00:00:00 2001
From: Kawrakow
Date: Sun, 4 May 2025 11:49:29 +0300
Subject: [PATCH] Update README.md

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index a04c1130..17d19645 100644
--- a/README.md
+++ b/README.md
@@ -6,8 +6,13 @@
 This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp) with better CPU and hybrid GPU/CPU performance,
 new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.
 
+>[!IMPORTANT]
+>The new GGUFs for DeepSeek-V3/R1/Lite do not work in this repository. This is due to the backwards-incompatible change in mainline `llama.cpp` that [added MLA support](https://github.com/ggml-org/llama.cpp/pull/12801)
+>2.5 months after MLA became available here; the MLA implementation in this repository works with the original DeepSeek GGUFs. Please use an original GGUF or, if you don't have one, convert the HF safetensors using the Python conversion script in this repository.
+
 ## Latest News
 
+* May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR #370
 * April 29 2025: Qwen3 support added
 * April 26 2025: GLM-4 support added
 * April 26 2025: Command-A support added
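
For the conversion step mentioned in the note above, a minimal sketch of a typical invocation follows. It assumes the repository's conversion script follows upstream `llama.cpp` conventions; the script name `convert_hf_to_gguf.py`, the local model path, and the flags shown are illustrative assumptions, not taken from this patch.

```bash
# Hypothetical invocation of the repository's Python conversion script.
# Script name and flags are assumed from upstream llama.cpp conventions;
# check the repository root and the script's --help output for the actual interface.

# 1. Fetch the original HF safetensors (paths are illustrative).
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir ./DeepSeek-V3

# 2. Convert the safetensors into a GGUF that this repository can load.
python convert_hf_to_gguf.py ./DeepSeek-V3 \
    --outfile ./deepseek-v3.gguf \
    --outtype bf16
```

The resulting GGUF can then be quantized and run with the usual tools in this repository; the key point from the note is to start from the original HF safetensors rather than from GGUFs produced after mainline's MLA change.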