Merge pull request #1562 from kvcache-ai/kimi-k2-thinking

Kimi k2 thinking
2026-04-19 22:09:10 +00:00 · 2025-11-06 18:17:46 +08:00
parent 1ab570b5ca 86229c852d
commit f3c4dbe181
2 changed files with 124 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

 <h2 id="Updates">🔥 Updates</h2>

+* **Nov 6, 2025**: Support Kimi-K2-Thinking inference ([Tutorial](./doc/en/Kimi-K2-Thinking.md)) and fine-tune ([Tutorial](./doc/en/SFT_Installation_Guide_KimiK2.md))
 * **Nov 4, 2025**: KTransformers Fine-Tuning × LLaMA-Factory Integration. ([Tutorial](./doc/en/KTransformers-Fine-Tuning_User-Guide.md))
 * **Oct 27, 2025**: Support Ascend NPU. ([Tutorial](./doc/zh/DeepseekR1_V3_tutorial_zh_for_Ascend_NPU.md))
 * **Oct 10, 2025**: Integrating into SGLang. ([Roadmap](https://github.com/sgl-project/sglang/issues/11425))
--- a/doc/en/Kimi-K2-Thinking.md
+++ b/doc/en/Kimi-K2-Thinking.md
@@ -0,0 +1,123 @@
+# KTransformers+SGLang Inference Deployment
+
+## Installation
+
+Step 1: Install SGLang
+
+Follow the [official SGLang installation](https://docs.sglang.ai/get_started/install.html) guide to install SGLang:
+```
+pip install "sglang[all]"
+```
+
+Step 2: Install KTransformers CPU Kernels
+
+The KTransformers CPU kernels (kt-kernel) provide AMX-optimized computation for hybrid inference, for detailed installation instructions and troubleshooting, refer to the official [kt-kernel installation guide](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).
+
+## Download Model
+
+Download the official KIMI weights as GPU weights.
+
+* huggingface: https://huggingface.co/moonshotai/Kimi-K2-Thinking
+* modelscope: https://modelscope.cn/models/moonshotai/Kimi-K2-Thinking
+
+Download the AMX INT4 quantized weights provided by Approaching AI from https://modelscope.cn/models/ApproachingAI2024/Kimi-K2-Thinking-CPU-weight as CPU weights.
+
+## How to start
+```
+python -m sglang.launch_server   --host 0.0.0.0   --port 60000   --model path/to/Kimi-K2-Thinking/   --kt-amx-weight-path path/to/Kimi-K2-Instruct-CPU-weight/   --kt-cpuinfer 56   --kt-threadpool-count 2   --kt-num-gpu-experts 200   --kt-amx-method AMXINT4   --attention-backend triton   --trust-remote-code   --mem-fraction-static 0.98   --chunked-prefill-size 4096   --max-running-requests 37   --max-total-tokens 37000   --enable-mixed-chunk   --tensor-parallel-size 8   --enable-p2p-check   --disable-shared-experts-fusion
+```
+tips:
+
+`--kt-cpuinfer`: is recommended to be set to (number of physical CPU cores - 8 (number of GPUs)).
+
+`--kt-num-gpu-experts`: refers to the number of experts retained on GPUs, which should be adjusted according to your available GPU memory and expected KV cache space.
+
+## Test
+
+When testing, you need to add `--disable-radix-cache` and `--disable-chunked-prefix-cache` when starting the server.
+
+### bench prefill
+```
+python -m sglang.bench_serving   --backend sglang   --host 127.0.0.1   --port 60000   --num-prompts 37 --random-input-len 1024 --random-output-len 1 --random-range-ratio 1.0 --dataset-name random
+```
+
+### bench decode
+```
+python -m sglang.bench_serving   --backend sglang   --host 127.0.0.1   --port 60000   --num-prompts 37 --random-input-len 10 --random-output-len 512 --random-range-ratio 1.0 --dataset-name random
+```
+
+## Performance
+
+### System Configuration:
+
+- GPUs: 8× NVIDIA L20
+- CPU: Intel(R) Xeon(R) Gold 6454S
+
+### Bench prefill
+```
+============ Serving Benchmark Result ============
+Backend:                                 sglang
+Traffic request rate:                    inf
+Max request concurrency:                 not set
+Successful requests:                     37
+Benchmark duration (s):                  65.58
+Total input tokens:                      37888
+Total input text tokens:                 37888
+Total input vision tokens:               0
+Total generated tokens:                  37
+Total generated tokens (retokenized):    37
+Request throughput (req/s):              0.56
+Input token throughput (tok/s):          577.74
+Output token throughput (tok/s):         0.56
+Total token throughput (tok/s):          578.30
+Concurrency:                             23.31
+----------------End-to-End Latency----------------
+Mean E2E Latency (ms):                   41316.50
+Median E2E Latency (ms):                 41500.35
+---------------Time to First Token----------------
+Mean TTFT (ms):                          41316.48
+Median TTFT (ms):                        41500.35
+P99 TTFT (ms):                           65336.31
+---------------Inter-Token Latency----------------
+Mean ITL (ms):                           0.00
+Median ITL (ms):                         0.00
+P95 ITL (ms):                            0.00
+P99 ITL (ms):                            0.00
+Max ITL (ms):                            0.00
+==================================================
+```
+
+### Bench decode
+
+```
+============ Serving Benchmark Result ============
+Backend:                                 sglang
+Traffic request rate:                    inf
+Max request concurrency:                 not set
+Successful requests:                     37
+Benchmark duration (s):                  412.66
+Total input tokens:                      370
+Total input text tokens:                 370
+Total input vision tokens:               0
+Total generated tokens:                  18944
+Total generated tokens (retokenized):    18618
+Request throughput (req/s):              0.09
+Input token throughput (tok/s):          0.90
+Output token throughput (tok/s):         45.91
+Total token throughput (tok/s):          46.80
+Concurrency:                             37.00
+----------------End-to-End Latency----------------
+Mean E2E Latency (ms):                   412620.35
+Median E2E Latency (ms):                 412640.56
+---------------Time to First Token----------------
+Mean TTFT (ms):                          3551.87
+Median TTFT (ms):                        3633.59
+P99 TTFT (ms):                           3637.37
+---------------Inter-Token Latency----------------
+Mean ITL (ms):                           800.53
+Median ITL (ms):                         797.89
+P95 ITL (ms):                            840.06
+P99 ITL (ms):                            864.96
+Max ITL (ms):                            3044.56
+==================================================
+```