diff --git a/doc/en/Kimi-K2.5.md b/doc/en/Kimi-K2.5.md
index e75c22d..179092e 100644
--- a/doc/en/Kimi-K2.5.md
+++ b/doc/en/Kimi-K2.5.md
@@ -4,31 +4,36 @@ This tutorial demonstrates how to run Kimi-K2.5 model inference using SGLang int
 
 ## Table of Contents
 
-- [Hardware Requirements](#hardware-requirements)
-- [Prerequisites](#prerequisites)
-- [Step 1: Download Model Weights](#step-1-download-model-weights)
-- [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
-- [Step 3: Send Inference Requests](#step-3-send-inference-requests)
+- [Running Kimi-K2.5 with SGLang and KT-Kernel](#running-kimi-k25-with-sglang-and-kt-kernel)
+  - [Table of Contents](#table-of-contents)
+  - [Hardware Requirements](#hardware-requirements)
+  - [Prerequisites](#prerequisites)
+  - [Step 1: Download Model Weights](#step-1-download-model-weights)
+  - [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
+    - [Launch Command (4x RTX 4090 Example)](#launch-command-4x-rtx-4090-example)
+  - [Step 3: Send Inference Requests](#step-3-send-inference-requests)
+    - [Basic Chat Completion Request](#basic-chat-completion-request)
+    - [Example Response](#example-response)
 
 ## Hardware Requirements
 
 **Minimum Configuration:**
 
-- **GPU**: NVIDIA RTX 2x4090 48GB (or equivalent with at least total 48GB VRAM available)
+- **GPU**: 2x NVIDIA RTX 4090 (24GB each), or equivalent with at least **48GB total** VRAM available
 - **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
 - **RAM**: At least 600GB system memory
 - **Storage**: ~600GB for model weights (native INT4 weight, same weight folder for CPU and GPU)
 
 ## Prerequisites
 
+**Update (2026-01-30): Both kimi_k2.5 branches have now been merged into main, so there is no need to check out those branches anymore. The EPLB feature is also supported.**
 Before starting, ensure you have:
 
 1. **KT-Kernel installed**:
-   Note: Latest KTransformers' EPLB feature for Kimi-K2.5 will be supported soon.
+   ~~Note: Latest KTransformers' EPLB feature for Kimi-K2.5 will be supported soon.~~
 ```
 git clone https://github.com/kvcache-ai/ktransformers.git
-git checkout kimi_k2.5
 git submodule update --init --recursive
 cd kt-kernel && ./install.sh
 ```
 
@@ -39,7 +44,6 @@ Note: Currently, please clone our custom SGLang repository:
 ```
 git clone https://github.com/kvcache-ai/sglang.git
-git checkout kimi_k2.5
 cd sglang && pip install -e "python[all]"
 // maybe need to reinstall cudnn according to the issue when launching SGLang
 pip install nvidia-cudnn-cu12==9.16.0.29
 ```
@@ -93,7 +97,8 @@ python -m sglang.launch_server \
     --disable-shared-experts-fusion \
     --chunked-prefill-size 32658 \
     --max-total-tokens 50000 \
-    --attention-backend flashinfer
+    --attention-backend flashinfer \
+    --kt-enable-dynamic-expert-update
 ```
 
 It takes about 2~3 minutes to start the server.
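
The document's Step 3 ("Basic Chat Completion Request") targets SGLang's OpenAI-compatible `/v1/chat/completions` endpoint. Below is a minimal sketch of such a request, assuming the server from the launch command above is reachable on SGLang's default port 30000; the model name and message content are illustrative placeholders, not values taken from the document:

```python
import json
from urllib import request

# Assumption: SGLang exposes an OpenAI-compatible API, by default on
# port 30000. Adjust host/port to match your launch flags.
URL = "http://localhost:30000/v1/chat/completions"

# Illustrative request body; "Kimi-K2.5" is a placeholder model name.
payload = {
    "model": "Kimi-K2.5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128,
    "temperature": 0.6,
}

def send_chat(url: str = URL) -> dict:
    """POST the chat payload and return the parsed JSON completion."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the server running, `send_chat()["choices"][0]["message"]["content"]` should hold the generated reply, corresponding to the document's "Example Response" section.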