Kimi k2.5 doc (#1812)

* [doc]: add Kimi-K2.5 deploy&sft guide

Jiaqi Liao authored on 2026-01-27 13:33:25 +08:00, committed by GitHub
parent 1da075a3fa
commit 2f6f7f1921
3 changed files with 377 additions and 0 deletions

doc/en/Kimi-K2.5.md (new file, 154 lines)

@@ -0,0 +1,154 @@
# Running Kimi-K2.5 with SGLang and KT-Kernel
This tutorial demonstrates how to run Kimi-K2.5 model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. This setup enables efficient deployment of large MoE models by offloading experts to CPU.
## Table of Contents
- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
- [Step 3: Send Inference Requests](#step-3-send-inference-requests)
## Hardware Requirements
**Minimum Configuration:**
- **GPU**: 2x NVIDIA RTX 4090 24GB (or equivalent with at least 48GB total VRAM available)
- **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
- **RAM**: At least 600GB system memory
- **Storage**: ~600GB for model weights (native INT4 weight, same weight folder for CPU and GPU)
## Prerequisites
Before starting, ensure you have:
1. **KT-Kernel installed**:
Note: Support for KTransformers' latest EPLB feature with Kimi-K2.5 is coming soon.
```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
2. **SGLang installed** - Follow [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
Note: For now, please clone our custom SGLang repository:
```bash
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
git checkout kimi_k2.5
pip install -e "python[all]"
# If SGLang fails to launch with a cuDNN error, reinstall cuDNN:
pip install nvidia-cudnn-cu12==9.16.0.29
```
3. **CUDA toolkit** - Compatible with your GPU (CUDA 12.8+ recommended)
4. **Hugging Face CLI** - For downloading models:
```bash
pip install huggingface-hub
```
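Before downloading the model, you can optionally sanity-check the host against the requirements above. This is a minimal verification sketch (not part of the original guide) that uses only standard Linux tools:
```bash
# Confirm the CPU exposes AVX512F (required by KT-Kernel's CPU path)
grep -o 'avx512f' /proc/cpuinfo | sort -u

# Confirm the CUDA toolkit version (12.8+ recommended)
nvcc --version

# Confirm visible GPUs and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Confirm total system memory (~600GB+ recommended)
free -g
```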
## Step 1: Download Model Weights
```bash
# Create a directory for models
mkdir -p /path/to/models
cd /path/to/models
# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
--local-dir /path/to/kimi-k2.5
```
**Note:** Replace `/path/to/models` with your actual storage path throughout this tutorial.
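To confirm the download completed, a quick size and listing check can help (a minimal sketch; the ~600GB figure comes from the storage requirement above):
```bash
# The weight directory should be roughly 600GB and contain config files plus safetensors shards
du -sh /path/to/kimi-k2.5
ls /path/to/kimi-k2.5 | head
```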
## Step 2: Launch SGLang Server
Start the SGLang server with KT-Kernel integration for CPU-GPU heterogeneous inference.
### Launch Command (4x RTX 4090 Example)
```bash
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 31245 \
--model /path/to/kimi-k2.5 \
--kt-weight-path /path/to/kimi-k2.5 \
--kt-cpuinfer 96 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 30 \
--kt-method RAWINT4 \
--kt-gpu-prefill-token-threshold 400 \
--trust-remote-code \
--mem-fraction-static 0.94 \
--served-model-name Kimi-K2.5 \
--enable-mixed-chunk \
--tensor-parallel-size 4 \
--enable-p2p-check \
--disable-shared-experts-fusion \
--chunked-prefill-size 32658 \
--max-total-tokens 50000 \
--attention-backend flashinfer
```
The server takes about 2~3 minutes to start.
See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter tuning guidelines.
## Step 3: Send Inference Requests
Once the server is running, you can send inference requests using the OpenAI-compatible API.
### Basic Chat Completion Request
```bash
curl -s http://localhost:31245/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Kimi-K2.5",
"stream": false,
"messages": [
{"role": "user", "content": "hi, who are you?"}
]
}'
```
### Example Response
```json
{
"id": "2a4e83f8a79b4b57b103b0f298fbaa7d",
"object": "chat.completion",
"created": 1769333912,
"model": "Kimi-K2.5",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " The user is asking \"hi, who are you?\" which is a simple greeting and identity question. I need to respond appropriately by introducing myself clearly and concisely.\n\nI am Kimi, a large language model trained by Moonshot AI. I should state my name, my nature (AI assistant), and my developer (Moonshot AI). I should keep it friendly and helpful.\n\nKey points to include:\n- Greet them back (\"hi\" or \"hello\")\n- State my name: Kimi\n- State what I am: an AI assistant/language model\n- Mention my developer: Moonshot AI\n- Briefly describe my purpose: to help answer questions, provide information, and assist with various tasks\n- Keep it concise but informative\n- Use a friendly, professional tone\n\nI should avoid overly technical jargon while being accurate. The response should be welcoming and set the stage for further interaction.\n\nPossible response:\n\"Hi! I'm Kimi, an AI assistant created by Moonshot AI. I'm designed to help answer questions, provide information, and assist with a wide range of tasks. How can I help you today?\"\n\nThis covers all the necessary points and invites the user to continue the conversation. </think> Hi! I'm Kimi, an AI assistant created by Moonshot AI. I'm designed to help answer questions, provide information, and assist with a wide range of tasks. How can I help you today?",
"reasoning_content": null,
"tool_calls": null
},
"logprobs": null,
"finish_reason": "stop",
"matched_stop": 163586
}
],
"usage": {
"prompt_tokens": 32,
"total_tokens": 317,
"completion_tokens": 285,
"prompt_tokens_details": null,
"reasoning_tokens": 0
},
"metadata": {
"weight_version": "default"
}
}
```
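### Streaming Request (Optional)
For interactive use, you can also request a streamed response. This is a minimal sketch assuming the same server and standard OpenAI-compatible streaming behavior; with `"stream": true` the server returns the reply as Server-Sent Events chunks:
```bash
# -N disables curl's output buffering so chunks are printed as they arrive
curl -N -s http://localhost:31245/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.5",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a haiku about mixture-of-experts models."}
    ]
  }'
```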


@@ -0,0 +1,222 @@
# Kimi-K2.5 LoRA SFT Tutorial
This tutorial demonstrates how to perform **LoRA Supervised Fine-Tuning (SFT)** on **Kimi-K2.5** using **LlamaFactory** with **KTransformers** as the backend, and then serve the fine-tuned model using **SGLang**.
The workflow is:
```txt
KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving
```
## Table of Contents
- [Hardware Requirements](#hardware-requirements)
- [Step 0: Environment Setup](#step-0-environment-setup)
- [Step 1: Prepare Model Weights (BF16 for SFT)](#step-1-prepare-model-weights-bf16-for-sft)
- [Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)](#step-2-prepare-yaml-for-lora-sft-ktransformers-backend)
- [Step 3: Run LoRA SFT](#step-3-run-lora-sft)
- [Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)](#step-4-post-sft-quick-verification-with-llamafactory-optional)
- [Step 5: SGLang Serving with LoRA (Recommended Delivery Path)](#step-5-sglang-serving-with-lora-recommended-delivery-path)
## Hardware Requirements
### Training (LoRA SFT)
- **LlamaFactory + KTransformers**
- **GPU**: 4x NVIDIA RTX 4090 24GB (or equivalent with at least 48GB total VRAM available)
- **CPU**: x86 CPU with AMX support
- **RAM**: At least 2TB system memory
- Swap can be used if CPU memory is insufficient
### Inference (LoRA Adapter + Original Model)
- **SGLang + KTransformers**
- **GPU**: 2x NVIDIA RTX 4090 24GB (or equivalent with at least 48GB total VRAM available)
- **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
- **RAM**: At least 600GB system memory
- **Storage**: ~600GB for model weights (native INT4 weight, same weight dir for CPU and GPU)
## Step 0: Environment Setup
We recommend using **two separate conda environments**:
| Environment | Purpose |
| ----------- | --------------------------------------------------- |
| `kt-kernel` | Inference & serving (KTransformers + SGLang) |
| `kt-sft` | Training (LlamaFactory + KTransformers SFT backend) |
### 0.1 Inference Environment: `kt-kernel`
```bash
conda create -n kt-kernel python=3.11
conda activate kt-kernel
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
### 0.2 Install SGLang (Inference / Serving)
**Recommended for Kimi-K2.5:**
```bash
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
git checkout kimi_k2.5
pip install -e "python[all]"
```
### 0.3 Training Environment: `kt-sft`
```bash
conda create -n kt-sft python=3.11
conda activate kt-sft
git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
```
### 0.4 Install KTransformers SFT Dependencies
```bash
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime
# Install matching wheels (recommended), from https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
```
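As a quick check that the training environment is usable, try importing the installed packages. This is a minimal sketch (not part of the original instructions), assuming only the `ktransformers` and `flash_attn` wheels installed above:
```bash
conda activate kt-sft
# Both imports should succeed if the wheels match your Python/CUDA setup
python -c "import ktransformers; print('ktransformers OK')"
python -c "import flash_attn; print('flash_attn OK')"
```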
## Step 1: Prepare Model Weights (BF16 for SFT)
### 1.1 Download INT4 Weights
KTransformers **requires BF16 weights for SFT**, while the released Kimi-K2.5 checkpoint is INT4, so download the INT4 weights first and convert them in the next step.
```bash
# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
--local-dir /path/to/kimi-k2.5
```
### 1.2 Convert INT4 → BF16
The Kimi-K2.5 base model is released in **INT4** format; convert it to **BF16** before SFT.
## Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)
### 2.1 Training YAML (LoRA SFT)
Example file:
`examples/train_lora/kimik2_lora_sft_kt.yaml`
Required fields:
```yaml
stage: sft
finetuning_type: lora
bf16: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192
```
Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual.
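For orientation, below is a minimal sketch of what a complete training YAML could look like, written as a shell heredoc. The KT-specific fields are exactly those listed above; the remaining field names follow common LlamaFactory LoRA SFT configs, and the model path, dataset, template, rule file, and output directory are placeholders to replace with your own values (you can then pass this file to `llamafactory-cli train` in place of the stock example):
```bash
# Write an example LoRA SFT config; all paths and the dataset/template names are placeholders
cat > examples/train_lora/my_kimik2_lora_sft_kt.yaml << 'EOF'
model_name_or_path: /path/to/kimi-k2.5-bf16   # BF16 weights from Step 1
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
bf16: true

# KTransformers backend settings (required fields from Section 2.1)
use_kt: true
kt_optimize_rule: /path/to/rule.yaml
cpu_infer: 32
chunk_size: 8192

# Usual LlamaFactory training fields; adjust as needed
dataset: identity
template: kimi_k2          # placeholder; use the template appropriate for Kimi-K2.5
cutoff_len: 2048
output_dir: /path/to/llamafactory/output_dir
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 3.0
EOF
```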
### 2.2 Inference YAML (LlamaFactory Verification)
Key requirements:
- `adapter_name_or_path`: LoRA output directory
- `infer_backend: ktransformers`
- **Same `use_kt` and `kt_optimize_rule` as training**
This YAML is used only for **quick verification**, not production serving.
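As with training, a minimal sketch of such a verification YAML is shown below as a heredoc. Field names beyond the requirements listed above follow common LlamaFactory inference configs and are assumptions; all paths are placeholders:
```bash
# Write an example verification config; paths and the template name are placeholders
cat > examples/inference/my_kimik2_lora_sft_kt.yaml << 'EOF'
model_name_or_path: /path/to/kimi-k2.5-bf16
adapter_name_or_path: /path/to/llamafactory/output_dir   # LoRA output directory
template: kimi_k2                                         # placeholder; match the training template
finetuning_type: lora
infer_backend: ktransformers

# Must match the training run
use_kt: true
kt_optimize_rule: /path/to/rule.yaml
cpu_infer: 32
chunk_size: 8192
EOF
```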
## Step 3: Run LoRA SFT
```bash
conda activate kt-sft
cd LlamaFactory
USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
```
After training, the LoRA adapter is saved to `output_dir`.
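If training finished successfully, the adapter directory should contain the usual PEFT artifacts (a quick, assumption-based check; exact file names can vary by PEFT version):
```bash
# Expect files such as adapter_config.json and adapter_model.safetensors
ls -lh /path/to/llamafactory/output_dir
```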
## Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)
Before production deployment, we recommend a **lightweight sanity check**.
```bash
conda activate kt-sft
cd LlamaFactory
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
```
Purpose:
- Validate LoRA correctness
- Ensure reproducibility
- Not for throughput benchmarking
## Step 5: SGLang Serving with LoRA (Recommended Delivery Path)
This is the **recommended delivery path** for serving the fine-tuned model.
### 5.1 Convert LoRA for SGLang
```bash
python ktransformers/kt-kernel/scripts/convert_lora.py \
--base_path /path/to/kimi-base-model \
--lora_path /path/to/llamafactory/output_dir \
--output_path /path/to/lora_converted
```
### 5.2 (Optional) Convert CPU Weights to INT8
To reduce CPU memory usage:
```bash
python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
--base_path /path/to/kimi-base-model \
--output_dir /path/to/kimi-base-model-int8
```
This produces:
```text
/path/to/kimi-base-model-int8/int8
```
### 5.3 Launch SGLang Server with LoRA
```bash
conda activate kt-kernel
python -m sglang.launch_server \
--enable-lora \
--lora-paths lora1=/path/to/lora_converted \
--lora-backend triton \
--model-path /path/to/kimi-base-model \
--tp 1 \
--trust-remote-code \
--context-length 4096 \
--kt-weight-path /path/to/kimi-base-model-int8/int8 \
--mem-fraction-static 0.9
```
Notes:
- `--kt-weight-path` points to CPU INT8 weights
- Adjust `tp`, `context-length`, and memory parameters per machine
- RAWINT4 inference paths can follow **Kimi-K2.5-Native** directly
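Once the server is up, you can exercise the adapter with a test request. The sketch below uses SGLang's native `/generate` endpoint and selects the adapter by the name registered via `--lora-paths` (`lora1`); it assumes the default port 30000, since the launch command above does not set `--port`:
```bash
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "hi, who are you?",
    "lora_path": "lora1",
    "sampling_params": {"max_new_tokens": 128, "temperature": 0.7}
  }'
```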