mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
107 lines
3.3 KiB
Plaintext
---
title: "Kimi-K2.5 Usage"
metatags:
  description: "Deploy Kimi-K2.5 with SGLang: 1T-parameter multimodal MoE model, 256K context, MLA attention, MoonViT vision encoder, thinking and instant modes, tool calling support."
---

[Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is Moonshot AI's open-source, natively multimodal, agentic mixture-of-experts (MoE) model. It has 1T total parameters (32B active), a 256K context window, MLA attention, and a MoonViT vision encoder, and it supports both thinking and instant modes.

In SGLang, Kimi-K2.5 relies on the `kimi_k2` reasoning parser and the `kimi_k2` tool-call parser so that thinking content and tool calls are parsed correctly.

<Note>
Kimi-K2.5 support is available on SGLang main and will ship in the next release. Until then, install from the latest main or use a nightly image.
</Note>

Official deployment guide: [Kimi-K2.5 deployment guide](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/docs/deploy_guidance)

## Install (Latest Main)

```bash Command
uv pip install "sglang @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
# For CUDA 12:
uv pip install "nvidia-cudnn-cu12==9.16.0.29"
# For CUDA 13:
uv pip install "nvidia-cudnn-cu13==9.16.0.29"
```

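After installing, it can help to confirm which SGLang build the environment actually resolves, since Kimi-K2.5 support currently requires a build from main or a nightly. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper name and not part of SGLang.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the resolved version, or a notice if the package is missing.
    version = installed_version("sglang")
    print(f"sglang: {version or 'not installed'}")
```

The version string alone does not prove that Kimi-K2.5 support is present; it only shows which distribution the interpreter picked up.
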
## Launch Kimi-K2.5 with SGLang

The following example launches the server on a single node with tensor parallelism 8 (`--tp 8`) on H200 GPUs.

```bash Command
python3 -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2
```

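Loading a model of this size can take a while, so scripts that talk to the server usually need to wait until it answers. The sketch below polls the OpenAI-compatible `/v1/models` route until it returns HTTP 200. It assumes the server listens on `http://localhost:30000` (the address used by the requests on this page) and that `/v1/models` is a suitable readiness signal; `models_endpoint_ok` and `wait_until_ready` are hypothetical helper names, and the timeout values are arbitrary.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:30000",  # assumed server address
    timeout_s: float = 1800.0,                 # arbitrary upper bound
    interval_s: float = 10.0,                  # arbitrary polling interval
    probe: Callable[[str], bool] = models_endpoint_ok,
) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Passing the probe as a parameter keeps the polling loop testable without a running server.
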
### Parser Requirements

- `--tool-call-parser kimi_k2`: Required for tool calling.
- `--reasoning-parser kimi_k2`: Required to parse thinking content; thinking mode is enabled by default.

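To illustrate what the tool-call parser enables, here is a hedged sketch of a tool-calling request body and of reading the parsed calls back. It assumes the server follows the OpenAI function-calling schema (`tools` in the request, `tool_calls` with JSON-encoded `arguments` in the response). The `get_weather` tool and both helper functions are hypothetical and exist only for this example.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str) -> Dict[str, Any]:
    """Build a chat completion request body that offers one tool to the model."""
    return {
        "model": "moonshotai/Kimi-K2.5",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "max_tokens": 512,
    }


def extract_tool_calls(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the tool calls of the first choice with their arguments decoded."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append(
            {
                "name": function["name"],
                # In the OpenAI schema, arguments arrive as a JSON string.
                "arguments": json.loads(function["arguments"]),
            }
        )
    return calls
```

The body returned by `build_tool_request` can be sent to `/v1/chat/completions` in the same way as the requests in the next section.
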
## Test the Deployment

Thinking mode is enabled by default. To disable thinking (instant mode), set `chat_template_kwargs.thinking` to `false` in the request. With the OpenAI Python client, pass it through `extra_body`.

```bash Command
# Thinking mode (default)
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain mixture-of-experts in one sentence."}
    ],
    "max_tokens": 256
  }'
```

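With the reasoning parser enabled, the thinking text and the final answer are expected to arrive in separate fields of the response message. The sketch below assumes the thinking text is returned as `reasoning_content` next to the usual `content` field; `split_reasoning` is a hypothetical helper, and the sample response in the usage comment is illustrative rather than real model output.

```python
from typing import Any, Dict, Optional, Tuple


def split_reasoning(response: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (thinking_text, final_answer) from the first choice of a response."""
    message = response["choices"][0]["message"]
    # `reasoning_content` is assumed to be filled by the reasoning parser;
    # it is missing or None when the model produced no thinking text.
    return message.get("reasoning_content"), message.get("content")


# Usage with an illustrative (not real) response body:
# thinking, answer = split_reasoning(
#     {"choices": [{"message": {"reasoning_content": "...", "content": "..."}}]}
# )
```

If the `max_tokens` budget is used up while the model is still thinking, the final answer may come back empty, so a larger budget can be worth trying in thinking mode.
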
```bash Command
# Instant mode (thinking disabled)
# Note: in a raw JSON request, chat_template_kwargs is a top-level field;
# extra_body is how the OpenAI Python client injects such fields.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [
      {"role": "user", "content": "Give one sentence on MoE models."}
    ],
    "max_tokens": 128,
    "chat_template_kwargs": {"thinking": false}
  }'
```

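The same two requests can be issued from Python without extra dependencies. This is a sketch under the assumption that the server reads `chat_template_kwargs` from the top level of the JSON body, which is where the OpenAI Python client places keys passed through `extra_body`. `build_chat_request` and `post_chat` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict


def build_chat_request(
    prompt: str, thinking: bool = True, max_tokens: int = 256
) -> Dict[str, Any]:
    """Build a chat completion body; thinking stays on unless disabled."""
    payload: Dict[str, Any] = {
        "model": "moonshotai/Kimi-K2.5",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if not thinking:
        # Instant mode: turn thinking off through the chat template.
        payload["chat_template_kwargs"] = {"thinking": False}
    return payload


def post_chat(
    payload: Dict[str, Any], base_url: str = "http://localhost:30000"
) -> Dict[str, Any]:
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server):
# reply = post_chat(build_chat_request("Give one sentence on MoE models.", thinking=False))
# print(reply["choices"][0]["message"]["content"])
```

Leaving the key out entirely in the default case keeps the request identical to the thinking-mode example.
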
## Multimodal Inputs (Image/Video)

Kimi-K2.5 is multimodal. Image inputs are supported through the OpenAI-compatible vision API: add an `image_url` content part to the user message. For more details, see `openai_api_vision.ipynb`.

```bash Command
# Image input (SGLang)
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
            }
          }
        ]
      }
    ],
    "max_tokens": 256
  }'
```

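For images that are not reachable by URL, OpenAI-compatible vision APIs commonly accept a base64 `data:` URL in the same `image_url` field. The sketch below builds such a message; it assumes the server accepts data URLs, which this page does not confirm, and `image_data_url` and `build_image_message` are hypothetical helper names.

```python
import base64
from typing import Any, Dict


def image_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(text: str, image_url: str) -> Dict[str, Any]:
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Usage with a local file (the path is a placeholder):
# with open("example_image.png", "rb") as f:
#     message = build_image_message("Describe this image.", image_data_url(f.read()))
```

The resulting message goes into the `messages` list of the request body exactly like the URL-based example.
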
<Note>
Video chat is experimental and is currently supported only through the official Moonshot API.
</Note>