# CPU-GPU Expert Scheduling Tutorial

This tutorial demonstrates how to use the CPU-GPU expert scheduling feature in KTransformers with SGLang. This feature introduces a flexible GPU expert mask system that allows intelligent placement of MoE experts across CPU and GPU, optimizing inference performance based on workload patterns.

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Launch Server with Expert Scheduling](#step-2-launch-server-with-expert-scheduling)
  - [Basic Usage](#basic-usage)
  - [Expert Placement Strategies](#expert-placement-strategies)
  - [Key Parameters](#key-parameters)
- [Step 3: Send Inference Requests](#step-3-send-inference-requests)
  - [Option A: Interactive Chat with KT CLI](#option-a-interactive-chat-with-kt-cli)
  - [Option B: OpenAI-Compatible API](#option-b-openai-compatible-api)
- [Performance](#performance)
- [Troubleshooting](#troubleshooting)
- [Additional Resources](#additional-resources)

## Hardware Requirements

**Minimum Configuration:**

- **GPU**: NVIDIA RTX 4090 24 GB (or equivalent with at least 24 GB of VRAM available)
- **CPU**: x86 CPU with AVX-512 support (e.g., Intel Sapphire Rapids, AMD EPYC)
- **RAM**: At least 256 GB of system memory
- **Storage**: Sufficient space for model weights

**Tested Configuration:**

- **GPU**: 4 x NVIDIA GeForce RTX 4090 (24 GB)
- **CPU**: Intel Xeon Gold 6454S
- **RAM**: 512 GB DDR5
- **OS**: Linux (Ubuntu 20.04+ recommended)

## Prerequisites

Before starting, ensure you have:

1. **SGLang installed**

   Install the kvcache-ai fork of SGLang using one of the following options:

   ```bash
   # Option A: One-click install (from ktransformers root)
   ./install.sh

   # Option B: pip install
   pip install sglang-kt
   ```

2. **KTransformers installed**

   ```bash
   git clone https://github.com/kvcache-ai/ktransformers.git
   cd ktransformers/kt-kernel
   bash ./install.sh
   ```

   After installation, verify the CLI is working:

   ```bash
   kt version
   ```

3. **CUDA toolkit** - CUDA 12.0+ recommended

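   If you are unsure which toolkit version is installed, you can check it with the standard NVIDIA tools (these are generic NVIDIA commands, not specific to KTransformers):

   ```bash
   # Print the CUDA compiler (toolkit) version
   nvcc --version
   # Print the driver version and the highest CUDA version the driver supports
   nvidia-smi
   ```
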
4. **Hugging Face CLI** - For downloading models:

   ```bash
   pip install -U huggingface-hub
   ```

## Step 1: Download Model Weights

Download your preferred MoE model weights. This feature supports various MoE models including:

* **Qwen3-Next-80B-A3B-Instruct-FP8**

  ```bash
  huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --local-dir /path/to/qwen3-next-80b
  ```

## Step 2: Launch Server with Expert Scheduling

### Basic Usage

The simplest way to start the server with expert scheduling:

```bash
python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy uniform
```

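As a rough illustration of the arithmetic: `--kt-num-gpu-experts` is a per-layer count, so on a hypothetical model with 48 MoE layers, `--kt-num-gpu-experts 8` keeps 8 × 48 = 384 experts resident on GPU in total, while the remaining experts are served from CPU (see [Key Parameters](#key-parameters) below).
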
### Expert Placement Strategies

The system provides four expert placement strategies:

| Strategy | Description | Use Case |
|----------|-------------|----------|
| `uniform` | Distributes GPU experts evenly across all MoE layers | Default, no prior statistics needed |
| `frequency` | Places most frequently activated experts on GPU | Best performance when activation statistics are available |
| `front-loading` | Fills GPU experts from the first layer onwards | Testing or specific workload patterns |
| `random` | Randomly selects experts with fixed seed (42) | Baseline comparison |

**Using Frequency Strategy (Recommended for best performance):**

```bash
python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy frequency \
    --init-expert-location /path/to/activation_stats.pt
```

**Using Dynamic Expert Update:**

```bash
python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy frequency \
    --init-expert-location /path/to/activation_stats.pt \
    --kt-enable-dynamic-expert-update \
    --kt-gpu-prefill-token-threshold 512
```

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--kt-num-gpu-experts` | Number of GPU experts per MoE layer. Internally multiplied by the number of MoE layers to get the total GPU experts. Ignored if `--kt-gpu-experts-ratio` is set. |
| `--kt-gpu-experts-ratio` | Ratio of total experts to place on GPU (0.0-1.0). If set, overrides `--kt-num-gpu-experts`. Example: 0.1 means 10% of all experts across all layers will be on GPU. |
| `--kt-expert-placement-strategy` | Expert placement strategy: `frequency`, `uniform`, `front-loading`, or `random`. Default: `uniform`. |
| `--init-expert-location` | Path to activation statistics file (`.pt`) for `frequency` strategy. |
| `--kt-enable-dynamic-expert-update` | Enable dynamic expert update during inference. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for triggering dynamic expert redistribution during prefill. |
| `--record-kt-gpu-expert-distribution` | Enable recording of GPU expert distribution for analysis. |
| `--expert-distribution-recorder-mode` | Recording mode: `stat` (default), `stat_approx`, `per_pass`, or `per_token`. |

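If you prefer to think in terms of a fraction of all experts rather than a per-layer count, `--kt-gpu-experts-ratio` can be used instead of `--kt-num-gpu-experts`. A minimal sketch, reusing the placeholder model path from above with an illustrative 20% ratio:

```bash
# Place roughly 20% of all experts (across all MoE layers) on GPU;
# --kt-num-gpu-experts is ignored when the ratio is set.
python -m sglang.launch_server \
    --model /path/to/model \
    --kt-gpu-experts-ratio 0.2 \
    --kt-expert-placement-strategy uniform
```
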
## Step 3: Send Inference Requests

Once the server is running (default: `http://localhost:30000`), you can interact with the model in several ways:

### Option A: Interactive Chat with KT CLI

The easiest way to chat with the model:

```bash
kt chat
```

This opens an interactive terminal chat session. Type your messages and press Enter to send. Use `Ctrl+C` to exit.

### Option B: OpenAI-Compatible API

The server exposes an OpenAI-compatible API at `http://localhost:30000/v1`.

**curl example (streaming):**

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

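The `"model"` field should match the model id the server is serving. One way to look it up is the model list endpoint that OpenAI-compatible servers conventionally expose (assuming this build follows that convention):

```bash
# List the model id(s) served by the running server
curl http://localhost:30000/v1/models
```
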
## Performance

### Throughput (tokens/s)

The following benchmarks were measured on Qwen3-Next-80B-A3B-Instruct-FP8 with 4 x RTX 4090, an Intel Xeon Gold 6454S, tensor parallel size 4, using the ShareGPT dataset:

| GPU Expert Ratio | random | uniform | front-loading | frequency | dynamic-expert-update |
|------------------|--------|---------|---------------|-----------|----------------------|
| 0% | 53.01 | 52.96 | 54.18 | 52.72 | 53.37 |
| 10% | 56.63 | 56.57 | 57.18 | 58.60 | 70.22 |
| 20% | 58.75 | 60.28 | 58.82 | 61.92 | 74.73 |
| 30% | 62.86 | 62.08 | 63.87 | 66.50 | 75.55 |
| 40% | 66.81 | 66.82 | 67.45 | 72.78 | 80.98 |
| 50% | 70.38 | 65.25 | 73.65 | 76.19 | 81.17 |
| 60% | 71.33 | 72.80 | 77.95 | 82.33 | 82.30 |
| 70% | 74.40 | 76.17 | 81.59 | 89.37 | 88.70 |
| 80% | 79.71 | 79.20 | 89.20 | 100.67 | 92.31 |
| 90% | 88.82 | 81.06 | 98.14 | 107.15 | 95.04 |
| 100% | 112.61 | 112.32 | 111.82 | 114.26 | 112.99 |

The `frequency` and `dynamic-expert-update` strategies show significant performance improvements over the baseline strategies, especially at lower GPU expert ratios.

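For reference, here is a rough sketch of how serving throughput against a running server can be measured with SGLang's bundled `bench_serving` script. The flags and request count are illustrative, and this is not necessarily the exact setup used to produce the numbers above:

```bash
# Benchmark the server started in Step 2 (listening on port 30000) with
# ShareGPT-style requests. Depending on your SGLang version, the ShareGPT
# dataset file may need to be downloaded separately and passed via --dataset-path.
python -m sglang.bench_serving \
    --backend sglang \
    --host 127.0.0.1 \
    --port 30000 \
    --dataset-name sharegpt \
    --num-prompts 200
```
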
## Troubleshooting

### OOM (Out of Memory) Issues

If you encounter OOM, adjust these parameters when launching the server:

| Parameter | VRAM Impact |
|-----------|-------------|
| `--kt-num-gpu-experts` / `--kt-gpu-experts-ratio` | Reduces expert weight VRAM usage |
| `--chunked-prefill-size` | Reduces prefill extra VRAM allocation |
| `--max-total-tokens` | Reduces KV cache VRAM usage |

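As a starting point, the three knobs can be combined in a single launch command. The values below are illustrative placeholders rather than tuned recommendations:

```bash
# Lower-VRAM launch: fewer GPU-resident experts, smaller prefill chunks,
# and a smaller KV cache budget (adjust all values to your hardware).
python -m sglang.launch_server \
    --model /path/to/model \
    --kt-gpu-experts-ratio 0.1 \
    --kt-expert-placement-strategy uniform \
    --chunked-prefill-size 2048 \
    --max-total-tokens 8192
```
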
### Dynamic Expert Update Not Triggering

Ensure all conditions are met:

1. `--kt-enable-dynamic-expert-update` is enabled
2. `--kt-gpu-prefill-token-threshold` is set
3. Prefill length >= threshold value

### Statistics Recording

To save expert distribution statistics to a custom path, set the environment variable:

```bash
export SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/path/to/output
```

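Recording itself is enabled with the flags listed under [Key Parameters](#key-parameters). A minimal sketch combining them with the output directory override (the mode and paths are illustrative):

```bash
# Write expert distribution statistics to a custom directory while serving
export SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/path/to/output

python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy uniform \
    --record-kt-gpu-expert-distribution \
    --expert-distribution-recorder-mode stat
```
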
## Additional Resources

- [KT-Kernel Documentation](../../../kt-kernel/README.md)
- [SGLang GitHub](https://github.com/sgl-project/sglang)
- [KTransformers GitHub](https://github.com/kvcache-ai/ktransformers)