
CPU-GPU Expert Scheduling Tutorial

This tutorial demonstrates how to use the CPU-GPU expert scheduling feature in KTransformers with SGLang. This feature introduces a flexible GPU expert mask system that allows intelligent placement of MoE experts across CPU and GPU, optimizing inference performance based on workload patterns.


Hardware Requirements

Minimum Configuration:

  • GPU: NVIDIA RTX 4090 (24 GB) or an equivalent GPU with at least 24 GB of VRAM
  • CPU: x86 CPU with AVX512 support (e.g., Intel Sapphire Rapids, AMD EPYC)
  • RAM: At least 256 GB of system memory
  • Storage: Sufficient space for model weights

Tested Configuration:

  • GPU: 4 x NVIDIA GeForce RTX 4090 (24 GB)
  • CPU: Intel Xeon Gold 6454S
  • RAM: 512 GB DDR5
  • OS: Linux (Ubuntu 20.04+ recommended)

Prerequisites

Before starting, ensure you have:

  1. SGLang installed

    Install the kvcache-ai fork of SGLang (one of):

    # Option A: One-click install (from ktransformers root)
    ./install.sh
    
    # Option B: pip install
    pip install sglang-kt
    
  2. KTransformers installed

    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers/kt-kernel
    bash ./install.sh
    

    After installation, verify the CLI is working:

    kt version
    
  3. CUDA toolkit - CUDA 12.0+ recommended

  4. Hugging Face CLI - For downloading models:

    pip install -U huggingface-hub
    

Step 1: Download Model Weights

Download your preferred MoE model weights. This feature supports various MoE models including:

  • Qwen3-Next-80B-A3B-Instruct-FP8

    huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --local-dir /path/to/qwen3-next-80b
    

Step 2: Launch Server with Expert Scheduling

Basic Usage

The simplest way to start the server with expert scheduling:

python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy uniform

Expert Placement Strategies

The system provides four expert placement strategies:

| Strategy | Description | Use Case |
|---|---|---|
| `uniform` | Distributes GPU experts evenly across all MoE layers | Default; no prior statistics needed |
| `frequency` | Places the most frequently activated experts on GPU | Best performance when activation statistics are available |
| `front-loading` | Fills GPU experts from the first layer onwards | Testing or specific workload patterns |
| `random` | Randomly selects experts with a fixed seed (42) | Baseline comparison |
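To make the difference between the strategies concrete, here is a minimal sketch of how `uniform` and `front-loading` placement could assign a GPU-expert budget across layers. This is an illustration only, not the KTransformers implementation; the function names and layer/expert counts are hypothetical:

```python
def uniform_placement(num_layers, experts_per_layer, gpu_experts_per_layer):
    """uniform: every MoE layer gets the same number of GPU experts."""
    k = min(gpu_experts_per_layer, experts_per_layer)
    return {layer: list(range(k)) for layer in range(num_layers)}

def front_loading_placement(num_layers, experts_per_layer, total_gpu_experts):
    """front-loading: fill layers with GPU experts from the first layer onwards."""
    placement, remaining = {}, total_gpu_experts
    for layer in range(num_layers):
        take = min(experts_per_layer, remaining)
        placement[layer] = list(range(take))
        remaining -= take
    return placement
```

With a budget of 10 GPU experts over 4 layers of 8 experts each, `front-loading` saturates layer 0 and puts the remainder in layer 1, while `uniform` would spread the same budget evenly.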

Using Frequency Strategy (Recommended for best performance):

python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy frequency \
    --init-expert-location /path/to/activation_stats.pt

Using Dynamic Expert Update:

python -m sglang.launch_server \
    --model /path/to/model \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy frequency \
    --init-expert-location /path/to/activation_stats.pt \
    --kt-enable-dynamic-expert-update \
    --kt-gpu-prefill-token-threshold 512

Key Parameters

| Parameter | Description |
|---|---|
| `--kt-num-gpu-experts` | Number of GPU experts per MoE layer. Internally multiplied by the number of MoE layers to get the total GPU expert count. Ignored if `--kt-gpu-experts-ratio` is set. |
| `--kt-gpu-experts-ratio` | Ratio of total experts to place on GPU (0.0-1.0). If set, overrides `--kt-num-gpu-experts`. Example: `0.1` means 10% of all experts across all layers will be on GPU. |
| `--kt-expert-placement-strategy` | Expert placement strategy: `frequency`, `uniform`, `front-loading`, or `random`. Default: `uniform`. |
| `--init-expert-location` | Path to an activation statistics file (`.pt`) for the `frequency` strategy. |
| `--kt-enable-dynamic-expert-update` | Enable dynamic expert update during inference. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for triggering dynamic expert redistribution during prefill. |
| `--record-kt-gpu-expert-distribution` | Enable recording of GPU expert distribution for analysis. |
| `--expert-distribution-recorder-mode` | Recording mode: `stat` (default), `stat_approx`, `per_pass`, or `per_token`. |
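The precedence between `--kt-num-gpu-experts` and `--kt-gpu-experts-ratio` can be summarized with a small helper. This is an illustrative sketch of the documented behavior, not KTransformers code; the layer and expert counts below are made-up examples:

```python
def total_gpu_experts(num_moe_layers, experts_per_layer,
                      num_gpu_experts=8, gpu_experts_ratio=None):
    """Total number of experts kept on GPU, per the documented flag precedence."""
    # --kt-gpu-experts-ratio, if set, overrides --kt-num-gpu-experts
    if gpu_experts_ratio is not None:
        return int(num_moe_layers * experts_per_layer * gpu_experts_ratio)
    # --kt-num-gpu-experts is a per-layer count, multiplied by the MoE layer count
    return num_moe_layers * num_gpu_experts

# e.g. a hypothetical model with 48 MoE layers of 128 experts each:
print(total_gpu_experts(48, 128, num_gpu_experts=8))        # per-layer count: 48 * 8 = 384
print(total_gpu_experts(48, 128, gpu_experts_ratio=0.1))    # ratio: 10% of 6144 = 614
```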

Step 3: Send Inference Requests

Once the server is running (default: http://localhost:30000), you can interact with the model in several ways:

Option A: Interactive Chat with KT CLI

The easiest way to chat with the model:

kt chat

This opens an interactive terminal chat session. Type your messages and press Enter to send. Use Ctrl+C to exit.

Option B: OpenAI-Compatible API

The server exposes an OpenAI-compatible API at http://localhost:30000/v1.

curl example (streaming):

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
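The same request can be issued from Python. The sketch below builds the JSON body for the OpenAI-compatible endpoint; `build_chat_request` is a hypothetical helper, and the commented-out `requests` call assumes the server from Step 2 is running at the default address:

```python
def build_chat_request(prompt, model="model-name", stream=False):
    """Build the JSON body for the /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

# To send it against a running server:
#   import requests
#   resp = requests.post("http://localhost:30000/v1/chat/completions",
#                        json=build_chat_request("Hello!"))
#   print(resp.json()["choices"][0]["message"]["content"])
```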

Performance

Throughput (tokens/s)

The following benchmarks were measured on Qwen3-Next-80B-A3B-Instruct-FP8 with 4 x RTX 4090 GPUs and an Intel Xeon Gold 6454S, tensor parallel size 4, using the ShareGPT dataset:

| GPU Expert Ratio | random | uniform | front-loading | frequency | dynamic-expert-update |
|---|---|---|---|---|---|
| 0% | 53.01 | 52.96 | 54.18 | 52.72 | 53.37 |
| 10% | 56.63 | 56.57 | 57.18 | 58.60 | 70.22 |
| 20% | 58.75 | 60.28 | 58.82 | 61.92 | 74.73 |
| 30% | 62.86 | 62.08 | 63.87 | 66.50 | 75.55 |
| 40% | 66.81 | 66.82 | 67.45 | 72.78 | 80.98 |
| 50% | 70.38 | 65.25 | 73.65 | 76.19 | 81.17 |
| 60% | 71.33 | 72.80 | 77.95 | 82.33 | 82.30 |
| 70% | 74.40 | 76.17 | 81.59 | 89.37 | 88.70 |
| 80% | 79.71 | 79.20 | 89.20 | 100.67 | 92.31 |
| 90% | 88.82 | 81.06 | 98.14 | 107.15 | 95.04 |
| 100% | 112.61 | 112.32 | 111.82 | 114.26 | 112.99 |

The frequency and dynamic-expert-update strategies show significant throughput improvements over the random and uniform baselines, especially at lower GPU expert ratios.
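To put numbers on that claim, the relative gain at a given ratio can be read straight off the table. For example, at a 10% GPU expert ratio:

```python
# Throughput (tokens/s) at 10% GPU expert ratio, from the benchmark table above
uniform = 56.57
frequency = 58.60
dynamic = 70.22

freq_gain = (frequency / uniform - 1) * 100   # ~3.6% over uniform
dyn_gain = (dynamic / uniform - 1) * 100      # ~24.1% over uniform
print(f"frequency: +{freq_gain:.1f}%, dynamic-expert-update: +{dyn_gain:.1f}%")
```

At 100%, all strategies converge (every expert is on GPU, so placement no longer matters), which is why the gap closes in the last row.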

Troubleshooting

OOM (Out of Memory) Issues

If you encounter OOM, adjust these parameters when launching the server:

| Parameter | VRAM Impact |
|---|---|
| `--kt-num-gpu-experts` / `--kt-gpu-experts-ratio` | Reduces expert weight VRAM usage |
| `--chunked-prefill-size` | Reduces prefill extra VRAM allocation |
| `--max-total-tokens` | Reduces KV cache VRAM usage |

Dynamic Expert Update Not Triggering

Ensure all conditions are met:

  1. --kt-enable-dynamic-expert-update is enabled
  2. --kt-gpu-prefill-token-threshold is set
  3. Prefill length >= threshold value
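The three conditions above amount to a simple predicate, sketched below for clarity (illustrative only; the function and argument names are hypothetical, not KTransformers internals):

```python
def should_update_experts(dynamic_update_enabled, token_threshold, prefill_length):
    """Dynamic expert redistribution fires only when all three conditions hold."""
    return (
        dynamic_update_enabled                 # --kt-enable-dynamic-expert-update
        and token_threshold is not None        # --kt-gpu-prefill-token-threshold set
        and prefill_length >= token_threshold  # prefill is long enough
    )
```

If any condition fails, the server simply keeps the placement chosen at startup.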

Statistics Recording

To save expert distribution statistics to a custom path, set the environment variable:

export SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/path/to/output

Additional Resources