mirror of https://github.com/kvcache-ai/ktransformers.git synced 2026-04-20 22:39:17 +00:00

Jianwei Dong 15c624dcae Fix/sglang kt detection (#1875 )

2026-03-04 16:54:48 +08:00

KTransformers+SGLang Inference Deployment

Note: This guide covers quantized deployment. For native Kimi K2 Thinking deployment, refer to the native deployment guide.

Installation

Step 1: Install SGLang

Install the kvcache-ai fork of SGLang using one of the following options:

# Option A: One-click install (from ktransformers root)
./install.sh

# Option B: pip install
pip install sglang-kt

Important: Use sglang-kt (kvcache-ai fork), not the official sglang package. Run pip uninstall sglang first if you have the official version installed.
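
The check described above can be sketched in Python. This is a minimal illustration, not part of KTransformers: it assumes the fork registers under the distribution name `sglang-kt` and the official package under `sglang`, and the helper names `installed_version` and `classify_sglang_install` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def classify_sglang_install(fork_version, upstream_version):
    """Classify the environment from the two version lookups (hypothetical helper)."""
    if fork_version and upstream_version:
        return "conflict"        # both present: uninstall the official package first
    if fork_version:
        return "ok"              # only the kvcache-ai fork is installed
    if upstream_version:
        return "upstream-only"   # official sglang only: replace it with sglang-kt
    return "missing"             # neither package is installed


if __name__ == "__main__":
    fork = installed_version("sglang-kt")    # assumed distribution name of the fork
    upstream = installed_version("sglang")   # assumed distribution name of the official package
    print(f"sglang-kt={fork} sglang={upstream} -> {classify_sglang_install(fork, upstream)}")
```

A result of `conflict` or `upstream-only` corresponds to the case where `pip uninstall sglang` should be run before installing `sglang-kt`.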

Step 2: Install KTransformers CPU Kernels

The KTransformers CPU kernels (kt-kernel) provide AMX-optimized computation for hybrid inference. For detailed installation instructions and troubleshooting, refer to the official kt-kernel installation guide.

Download Model

Download the official Kimi K2 Thinking weights to use as the GPU weights:

Hugging Face: https://huggingface.co/moonshotai/Kimi-K2-Thinking
ModelScope: https://modelscope.cn/models/moonshotai/Kimi-K2-Thinking

Download the AMX INT4 quantized weights from https://huggingface.co/KVCache-ai/Kimi-K2-Thinking-CPU-weight as CPU weights.
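
Both Hugging Face repositories can also be fetched programmatically. The sketch below assumes the third-party `huggingface_hub` package is installed and uses its `snapshot_download` function; the directory layout and the helper names `plan_downloads` and `download_all` are illustrative choices, not something the project prescribes.

```python
from pathlib import Path

GPU_REPO = "moonshotai/Kimi-K2-Thinking"             # official weights (GPU side)
CPU_REPO = "KVCache-ai/Kimi-K2-Thinking-CPU-weight"  # AMX INT4 weights (CPU side)


def plan_downloads(root):
    """Map each repo id to a local directory named after the repo (assumed layout)."""
    root = Path(root)
    return {repo: root / repo.split("/")[-1] for repo in (GPU_REPO, CPU_REPO)}


def download_all(root):
    """Download both repositories; requires network access and huggingface_hub."""
    from huggingface_hub import snapshot_download  # imported lazily: third-party dependency

    for repo_id, local_dir in plan_downloads(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


if __name__ == "__main__":
    download_all("path/to")  # placeholder root, matching the paths used in the launch command
```

The two resulting directories are the values to pass to `--model` and `--kt-weight-path` respectively.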

How to start

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 60000 \
  --model path/to/Kimi-K2-Thinking/ \
  --kt-weight-path path/to/Kimi-K2-Thinking-CPU-weight/ \
  --kt-cpuinfer 56 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --kt-method AMXINT4 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 37000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion

Tips:

--kt-cpuinfer: the recommended value is the number of physical CPU cores minus the number of GPUs (8 in this setup).

--kt-num-gpu-experts: the number of experts kept on the GPUs. Adjust it according to your available GPU memory and the KV cache space you expect to need.
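
The `--kt-cpuinfer` rule above is simple arithmetic. As a small sketch (the function name is hypothetical, and the 64-core host is an assumed size chosen so that the result matches the 56 used in the launch command):

```python
def recommended_kt_cpuinfer(physical_cores, num_gpus):
    """Physical CPU cores minus the number of GPUs, per the tip above."""
    if num_gpus < 0 or physical_cores <= num_gpus:
        raise ValueError("need more physical cores than GPUs")
    return physical_cores - num_gpus


# Assumed host: 64 physical cores and 8 GPUs.
print(recommended_kt_cpuinfer(64, 8))
```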

Test

When benchmarking, start the server with the additional flags --disable-radix-cache and --disable-chunked-prefix-cache.

Bench prefill

python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 60000 \
  --num-prompts 37 \
  --random-input-len 1024 \
  --random-output-len 1 \
  --random-range-ratio 1.0 \
  --dataset-name random

Bench decode

python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 60000 \
  --num-prompts 37 \
  --random-input-len 10 \
  --random-output-len 512 \
  --random-range-ratio 1.0 \
  --dataset-name random
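
As a sanity check on the two commands above, the token totals each run should report can be worked out in advance. The sketch assumes that `--random-range-ratio 1.0` makes every request use exactly the requested input and output lengths; the helper name is hypothetical.

```python
def expected_token_totals(num_prompts, input_len, output_len):
    """Total input/output tokens when every request has fixed lengths."""
    return {
        "input": num_prompts * input_len,
        "output": num_prompts * output_len,
    }


prefill = expected_token_totals(num_prompts=37, input_len=1024, output_len=1)
decode = expected_token_totals(num_prompts=37, input_len=10, output_len=512)
print(prefill)
print(decode)
```

These values agree with the "Total input tokens" and "Total generated tokens" lines in the Performance section.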

Performance

System Configuration:

GPUs: 8× NVIDIA L20
CPU: Intel(R) Xeon(R) Gold 6454S

Bench prefill

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     37
Benchmark duration (s):                  65.58
Total input tokens:                      37888
Total input text tokens:                 37888
Total input vision tokens:               0
Total generated tokens:                  37
Total generated tokens (retokenized):    37
Request throughput (req/s):              0.56
Input token throughput (tok/s):          577.74
Output token throughput (tok/s):         0.56
Total token throughput (tok/s):          578.30
Concurrency:                             23.31
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   41316.50
Median E2E Latency (ms):                 41500.35
---------------Time to First Token----------------
Mean TTFT (ms):                          41316.48
Median TTFT (ms):                        41500.35
P99 TTFT (ms):                           65336.31
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P95 ITL (ms):                            0.00
P99 ITL (ms):                            0.00
Max ITL (ms):                            0.00
==================================================

Bench decode

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     37
Benchmark duration (s):                  412.66
Total input tokens:                      370
Total input text tokens:                 370
Total input vision tokens:               0
Total generated tokens:                  18944
Total generated tokens (retokenized):    18618
Request throughput (req/s):              0.09
Input token throughput (tok/s):          0.90
Output token throughput (tok/s):         45.91
Total token throughput (tok/s):          46.80
Concurrency:                             37.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   412620.35
Median E2E Latency (ms):                 412640.56
---------------Time to First Token----------------
Mean TTFT (ms):                          3551.87
Median TTFT (ms):                        3633.59
P99 TTFT (ms):                           3637.37
---------------Inter-Token Latency----------------
Mean ITL (ms):                           800.53
Median ITL (ms):                         797.89
P95 ITL (ms):                            840.06
P99 ITL (ms):                            864.96
Max ITL (ms):                            3044.56
==================================================
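
The headline throughput figures follow directly from the token totals and durations reported above. A short sketch that recomputes them from the published numbers (the variable names are illustrative):

```python
# Values copied from the two benchmark reports above.
prefill_input_tokens = 37888
prefill_duration_s = 65.58
decode_output_tokens = 18944
decode_duration_s = 412.66
decode_mean_itl_ms = 800.53
concurrent_requests = 37

# Aggregate throughput = tokens processed / wall-clock benchmark duration.
prefill_tok_per_s = prefill_input_tokens / prefill_duration_s
decode_tok_per_s = decode_output_tokens / decode_duration_s

# Per-request decode speed from the mean inter-token latency, and the
# aggregate rate that implies when all 37 requests decode concurrently.
per_request_tok_per_s = 1000.0 / decode_mean_itl_ms
itl_implied_tok_per_s = per_request_tok_per_s * concurrent_requests

print(f"prefill: {prefill_tok_per_s:.2f} tok/s")
print(f"decode:  {decode_tok_per_s:.2f} tok/s")
print(f"per-request decode: {per_request_tok_per_s:.2f} tok/s")
print(f"ITL-implied aggregate decode: {itl_implied_tok_per_s:.2f} tok/s")
```

The ITL-implied aggregate is slightly higher than the reported output throughput because the benchmark duration also includes the time to first token.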
