# Running Kimi-K2.5 with SGLang and KT-Kernel

This tutorial demonstrates how to run Kimi-K2.5 model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. This setup enables efficient deployment of large MoE models by offloading experts to CPU.

## Table of Contents

- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
- [Step 3: Send Inference Requests](#step-3-send-inference-requests)

## Hardware Requirements

**Minimum Configuration:**

- **GPU**: 2x NVIDIA RTX 4090 24GB (or equivalent, with at least 48GB total VRAM available)
- **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
- **RAM**: At least 600GB system memory
- **Storage**: ~600GB for model weights (native INT4 weights; the same weight folder serves both CPU and GPU)

## Prerequisites

Before starting, ensure you have:

1. **KT-Kernel installed**:

   Note: The latest KTransformers EPLB feature for Kimi-K2.5 will be supported soon.

   ```bash
   git clone https://github.com/kvcache-ai/ktransformers.git
   cd ktransformers
   git checkout kimi_k2.5
   git submodule update --init --recursive
   cd kt-kernel && ./install.sh
   ```

2. **SGLang installed** - Follow [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)

   Note: Currently, please clone our custom SGLang repository:

   ```bash
   git clone https://github.com/kvcache-ai/sglang.git
   cd sglang
   git checkout kimi_k2.5
   pip install -e "python[all]"
   # If SGLang fails to launch with a cuDNN error, reinstall cuDNN:
   pip install nvidia-cudnn-cu12==9.16.0.29
   ```

3. **CUDA toolkit** - Compatible with your GPU (CUDA 12.8+ recommended)
4. **Hugging Face CLI** - For downloading models:

   ```bash
   pip install huggingface-hub
   ```

## Step 1: Download Model Weights

```bash
# Create a directory for models
mkdir -p /path/to/models
cd /path/to/models

# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5
```

**Note:** Replace `/path/to/models` with your actual storage path throughout this tutorial.

## Step 2: Launch SGLang Server

Start the SGLang server with KT-Kernel integration for CPU-GPU heterogeneous inference.

### Launch Command (4x RTX 4090 Example)

```bash
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 31245 \
  --model /path/to/kimi-k2.5 \
  --kt-weight-path /path/to/kimi-k2.5 \
  --kt-cpuinfer 96 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 30 \
  --kt-method RAWINT4 \
  --kt-gpu-prefill-token-threshold 400 \
  --trust-remote-code \
  --mem-fraction-static 0.94 \
  --served-model-name Kimi-K2.5 \
  --enable-mixed-chunk \
  --tensor-parallel-size 4 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 32658 \
  --max-total-tokens 50000 \
  --attention-backend flashinfer
```

The server takes about 2-3 minutes to start.

See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter tuning guidelines.
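
Because startup takes a few minutes, scripts that drive the server benefit from a readiness check. Below is a minimal Python sketch (not part of KT-Kernel) that polls the server's `/health` endpoint, assuming the port 31245 used above and an SGLang build that exposes `/health`:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 300.0) -> bool:
    """Poll the server's /health endpoint until it answers, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # server still loading weights; retry shortly
    return False

# Example (with the server from this step):
# if wait_for_server("http://localhost:31245"):
#     print("server is ready")
```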

## Step 3: Send Inference Requests

Once the server is running, you can send inference requests using the OpenAI-compatible API.

### Basic Chat Completion Request

```bash
curl -s http://localhost:31245/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.5",
    "stream": false,
    "messages": [
      {"role": "user", "content": "hi, who are you?"}
    ]
  }'
```
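
The same request can be sent from Python with only the standard library. This is a minimal sketch against the OpenAI-compatible endpoint, assuming the server from Step 2 is listening on port 31245:

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "Kimi-K2.5", stream: bool = False) -> dict:
    """Mirror of the curl request body above."""
    return {
        "model": model,
        "stream": stream,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, prompt: str) -> str:
    """POST to the OpenAI-compatible endpoint and return the assistant's reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server from Step 2 to be running):
# print(chat("http://localhost:31245", "hi, who are you?"))
```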

### Example Response

```json
{
  "id": "2a4e83f8a79b4b57b103b0f298fbaa7d",
  "object": "chat.completion",
  "created": 1769333912,
  "model": "Kimi-K2.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " The user is asking \"hi, who are you?\" which is a simple greeting and identity question. I need to respond appropriately by introducing myself clearly and concisely.\n\nI am Kimi, a large language model trained by Moonshot AI. I should state my name, my nature (AI assistant), and my developer (Moonshot AI). I should keep it friendly and helpful.\n\nKey points to include:\n- Greet them back (\"hi\" or \"hello\")\n- State my name: Kimi\n- State what I am: an AI assistant/language model\n- Mention my developer: Moonshot AI\n- Briefly describe my purpose: to help answer questions, provide information, and assist with various tasks\n- Keep it concise but informative\n- Use a friendly, professional tone\n\nI should avoid overly technical jargon while being accurate. The response should be welcoming and set the stage for further interaction.\n\nPossible response:\n\"Hi! I'm Kimi, an AI assistant created by Moonshot AI. I'm designed to help answer questions, provide information, and assist with a wide range of tasks. How can I help you today?\"\n\nThis covers all the necessary points and invites the user to continue the conversation. </think> Hi! I'm Kimi, an AI assistant created by Moonshot AI. I'm designed to help answer questions, provide information, and assist with a wide range of tasks. How can I help you today?",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 163586
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 317,
    "completion_tokens": 285,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "metadata": {
    "weight_version": "default"
  }
}
```
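
Note that in the response above, the model's chain of thought is embedded in `message.content` ahead of a `</think>` marker, while `reasoning_content` is `null`. If your client should show only the final answer, a small (hypothetical) helper can split the two:

```python
def split_reasoning(content: str, marker: str = "</think>") -> tuple[str, str]:
    """Split a chat completion's content into (reasoning, answer).

    If the marker is absent, the whole string is treated as the answer.
    """
    if marker in content:
        reasoning, answer = content.split(marker, 1)
        return reasoning.strip(), answer.strip()
    return "", content.strip()

# Example with a shortened version of the content field above:
sample = "The user greets me... I should reply warmly. </think> Hi! I'm Kimi."
reasoning, answer = split_reasoning(sample)
# answer == "Hi! I'm Kimi."
```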

# Kimi-K2.5 LoRA SFT Tutorial

This tutorial demonstrates how to perform **LoRA Supervised Fine-Tuning (SFT)** on **Kimi-K2.5** using **LlamaFactory** with **KTransformers** as the backend, and then serve the fine-tuned model using **SGLang**.

The workflow is:

```txt
KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving
```

## Table of Contents

- [Hardware Requirements](#hardware-requirements)
- [Step 0: Environment Setup](#step-0-environment-setup)
- [Step 1: Prepare Model Weights (BF16 for SFT)](#step-1-prepare-model-weights-bf16-for-sft)
- [Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)](#step-2-prepare-yaml-for-lora-sft-ktransformers-backend)
- [Step 3: Run LoRA SFT](#step-3-run-lora-sft)
- [Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)](#step-4-post-sft-quick-verification-with-llamafactory-optional)
- [Step 5: SGLang Serving with LoRA (Recommended Delivery Path)](#step-5-sglang-serving-with-lora-recommended-delivery-path)

## Hardware Requirements

### Training (LoRA SFT)

- **LlamaFactory + KTransformers**
- **GPU**: 4 x NVIDIA RTX 4090 24GB (or equivalent, with at least 48GB total VRAM available)
- **CPU**: x86 CPU with AMX support
- **RAM**: At least 2TB system memory
  - Swap can be used if CPU memory is insufficient

### Inference (LoRA Adapter + Original Model)

- **SGLang + KTransformers**
- **GPU**: 2 x NVIDIA RTX 4090 24GB (or equivalent, with at least 48GB total VRAM available)
- **CPU**: x86 CPU with AVX512F support (e.g., Intel Sapphire Rapids)
- **RAM**: At least 600GB system memory
- **Storage**: ~600GB for model weights (native INT4 weights; the same weight directory serves both CPU and GPU)

## Step 0: Environment Setup

We recommend using two separate conda environments:

| Environment | Purpose                                             |
| ----------- | --------------------------------------------------- |
| `kt-kernel` | Inference & serving (KTransformers + SGLang)        |
| `kt-sft`    | Training (LlamaFactory + KTransformers SFT backend) |

### 0.1 Inference Environment: `kt-kernel`

```bash
conda create -n kt-kernel python=3.11
conda activate kt-kernel

git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```

### 0.2 Install SGLang (Inference / Serving)

**Recommended for Kimi-K2.5:**

```bash
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
git checkout kimi_k2.5
pip install -e "python[all]"
```

### 0.3 Training Environment: `kt-sft`

```bash
conda create -n kt-sft python=3.11
conda activate kt-sft

git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
```

### 0.4 Install KTransformers SFT Dependencies

```bash
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# Install matching wheels (recommended), from https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
```

## Step 1: Prepare Model Weights (BF16 for SFT)

### 1.1 Download INT4 Weights

KTransformers **requires BF16 weights for SFT**.

```bash
# Download Kimi-K2.5 (RAW-INT4 for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5
```

### 1.2 Convert INT4 → BF16

The Kimi-K2.5 base model is distributed in **INT4** format; convert it to **BF16** before running SFT.

## Step 2: Prepare YAML for LoRA SFT (KTransformers Backend)

### 2.1 Training YAML (LoRA SFT)

Example file: `examples/train_lora/kimik2_lora_sft_kt.yaml`

Required fields:

```yaml
stage: sft
finetuning_type: lora
bf16: true

use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192
```

Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual.
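
For orientation, a complete training YAML might look like the sketch below. Everything outside the required block is a hypothetical placeholder using common LlamaFactory field names (`model_name_or_path`, `dataset`, `output_dir`, etc.); substitute your own paths, dataset, template, and hyperparameters:

```yaml
# Required KTransformers fields (from above)
stage: sft
finetuning_type: lora
bf16: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

# Hypothetical example values; adjust to your setup
model_name_or_path: /path/to/kimi-k2.5-bf16
template: <template>
dataset: <your_dataset>
cutoff_len: 2048
output_dir: saves/kimi-k2.5-lora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lora_rank: 8
```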

### 2.2 Inference YAML (LlamaFactory Verification)

Key requirements:

- `adapter_name_or_path`: LoRA output directory
- `infer_backend: ktransformers`
- **Same `use_kt` and `kt_optimize_rule` as training**

This YAML is used only for **quick verification**, not production serving.

## Step 3: Run LoRA SFT

```bash
conda activate kt-sft
cd LlamaFactory

USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
```

After training, the LoRA adapter is saved to `output_dir`.

## Step 4: Post-SFT Quick Verification with LlamaFactory (Optional)

Before production deployment, we recommend a **lightweight sanity check**:

```bash
conda activate kt-sft
cd LlamaFactory

llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
```

Purpose:

- Validate LoRA correctness
- Ensure reproducibility
- Not for throughput benchmarking

## Step 5: SGLang Serving with LoRA (Recommended Delivery Path)

This is the recommended path for serving the fine-tuned model.

### 5.1 Convert LoRA for SGLang

```bash
python ktransformers/kt-kernel/scripts/convert_lora.py \
  --base_path /path/to/kimi-base-model \
  --lora_path /path/to/llamafactory/output_dir \
  --output_path /path/to/lora_converted
```

### 5.2 (Optional) Convert CPU Weights to INT8

To reduce CPU memory usage:

```bash
python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
  --base_path /path/to/kimi-base-model \
  --output_dir /path/to/kimi-base-model-int8
```

This produces:

```text
/path/to/kimi-base-model-int8/int8
```

### 5.3 Launch SGLang Server with LoRA

```bash
conda activate kt-kernel

python -m sglang.launch_server \
  --enable-lora \
  --lora-paths lora1=/path/to/lora_converted \
  --lora-backend triton \
  --model-path /path/to/kimi-base-model \
  --tp 1 \
  --trust-remote-code \
  --context-length 4096 \
  --kt-weight-path /path/to/kimi-base-model-int8/int8 \
  --mem-fraction-static 0.9
```

Notes:

- `--kt-weight-path` points to the CPU INT8 weights
- Adjust `tp`, `context-length`, and memory parameters per machine
- RAWINT4 inference paths can follow **Kimi-K2.5-Native** directly
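
To exercise the adapter once the server is up, the request mirrors Step 3 of the inference tutorial but must select the adapter. Below is a minimal Python sketch: `lora1` is the adapter name registered via `--lora-paths` above, and how the OpenAI-compatible route selects an adapter (the `model` field vs. a separate argument) can differ between SGLang versions, so check the docs for your build:

```python
import json
import urllib.request

def build_lora_chat_payload(prompt: str, adapter: str = "lora1") -> dict:
    """Request body targeting the LoRA adapter registered at launch."""
    return {
        "model": adapter,  # assumes the adapter is addressed via the model field
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, prompt: str) -> dict:
    """POST the payload to the OpenAI-compatible endpoint (server must be running)."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_lora_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (against the server launched in 5.3; SGLang's default port is 30000):
# reply = chat("http://localhost:30000", "Describe your fine-tuning.")
# print(reply["choices"][0]["message"]["content"])
```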