mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
77 lines
2.4 KiB
Plaintext
---
title: "Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang"
metatags:
    description: "Deploy GLM-4.5/4.6/4.7 models with SGLang: FP8 inference, EAGLE speculative decoding, function calling support. Optimized for H100/H200 GPUs."
---

## Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang

To serve GLM-4.5 / GLM-4.6 FP8 models on 8xH100/H200 GPUs:

```bash Command
python3 -m sglang.launch_server --model zai-org/GLM-4.6-FP8 --tp 8
```

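Once the server is up, a quick smoke test can confirm that it answers chat requests. The sketch below uses only the Python standard library and assumes the server listens on `127.0.0.1:30000` and exposes the OpenAI-compatible `/v1/chat/completions` route. The helper names `build_chat_request` and `send` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: the server runs locally on port 30000.
BASE_URL = "http://127.0.0.1:30000/v1"


def build_chat_request(model, prompt, max_tokens=64):
    """Build (but do not send) a chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


req = build_chat_request("zai-org/GLM-4.6-FP8", "Say hello in one word.")
print(req.full_url)
# With the server running:
# print(send(req)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before any network traffic occurs.
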
### EAGLE Speculative Decoding

**Description**: SGLang supports GLM-4.5 / GLM-4.6 models with [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding).

**Usage**:
Add the arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:

```bash Command
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.6-FP8 \
  --tp-size 8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.9 \
  --served-model-name glm-4.6-fp8 \
  --enable-custom-logit-processor
```

<Tip>
To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by overlapping the draft and verification stages.
</Tip>

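When the server is started from a script, the flags and the environment variable can be assembled in one place. The following is a minimal sketch, not an official launcher: `build_launch` is a hypothetical helper, the flag values are copied from the command on this page, and nothing is executed until the result is passed to `subprocess.Popen`.

```python
import os


def build_launch(model_path, tool_call_parser="glm45", spec_v2=False):
    """Return (argv, env) for an EAGLE-enabled launch; nothing runs here."""
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", "8",
        # Per this page: "glm45" for GLM-4.5 / GLM-4.6, "glm47" for GLM-4.7.
        "--tool-call-parser", tool_call_parser,
        "--reasoning-parser", "glm45",
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ]
    env = dict(os.environ)
    if spec_v2:
        # Opt in to the experimental overlap scheduler.
        env["SGLANG_ENABLE_SPEC_V2"] = "1"
    return argv, env


argv, env = build_launch("zai-org/GLM-4.6-FP8", spec_v2=True)
print(" ".join(argv))
# To actually start the server:
# import subprocess; subprocess.Popen(argv, env=env)
```

Copying `os.environ` rather than mutating it keeps the experimental setting scoped to the launched process.
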
### Thinking Budget for GLM-4.5 / GLM-4.6
**Note**: For GLM-4.7, set `--tool-call-parser` to `glm47`; for GLM-4.5 and GLM-4.6, set it to `glm45`.

In SGLang, a thinking budget can be implemented with a `CustomLogitProcessor`.

Launch the server with the `--enable-custom-logit-processor` flag.

Sample request:

```python Example
import openai
from rich.pretty import pprint
from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor

# Connect to the local SGLang server through its OpenAI-compatible API.
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    model="zai-org/GLM-4.6",
    messages=[
        {
            "role": "user",
            "content": "Question: Is Paris the capital of France?",
        }
    ],
    max_tokens=1024,
    extra_body={
        # Serialized logit processor that enforces the thinking budget.
        "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {
            "thinking_budget": 512,
        },
    },
)
pprint(response)
```
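
To make the idea of a thinking budget concrete, the toy function below shows the general logit-masking technique in pure Python. It is a simplified illustration and not SGLang's implementation: the token IDs are hypothetical placeholders, and the real processor works on batched tensors with IDs taken from the model's tokenizer.

```python
import math

# Hypothetical token IDs, for illustration only.
THINK_START_ID = 1
THINK_END_ID = 2


def apply_thinking_budget(logits, token_ids, thinking_budget):
    """Force THINK_END_ID once the thinking budget is spent.

    logits: list of floats, one score per vocabulary entry.
    token_ids: prompt plus generated token IDs so far.
    """
    # Outside the thinking phase: leave the logits unchanged.
    if THINK_START_ID not in token_ids or THINK_END_ID in token_ids:
        return list(logits)

    # Count the tokens produced since thinking started.
    start = token_ids.index(THINK_START_ID)
    used = len(token_ids) - start - 1
    if used < thinking_budget:
        return list(logits)

    # Budget exhausted: mask everything except the end-of-thinking token.
    forced = [-math.inf] * len(logits)
    forced[THINK_END_ID] = 0.0
    return forced


logits = [0.1, 0.2, 0.3, 0.4, 0.5]
# Budget not reached: logits pass through untouched.
print(apply_thinking_budget(logits, [1, 3, 4], thinking_budget=5))
# Budget spent: only THINK_END_ID remains selectable.
print(apply_thinking_budget(logits, [1, 3, 4, 3], thinking_budget=3))
```

Because every other entry is set to negative infinity, sampling can only pick the end-of-thinking token at that step, after which the model moves on to its final answer.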