mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
118 lines
5.8 KiB
Plaintext
118 lines
5.8 KiB
Plaintext
---
|
|
title: "Llama4 Usage"
|
|
metatags:
|
|
description: "Deploy Llama 4 Scout (109B) and Maverick (400B) with SGLang: up to 10M context, hybrid KV cache, vision support. Optimized for H100/H200 GPUs."
|
|
---
|
|
[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD) is Meta's latest generation of open-source LLM model with industry-leading performance.
|
|
|
|
SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).
|
|
|
|
Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).
|
|
|
|
## Launch Llama 4 with SGLang
|
|
|
|
To serve Llama 4 models on 8xH100/H200 GPUs:
|
|
|
|
```bash Command
|
|
python3 -m sglang.launch_server \
|
|
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
|
|
--tp 8 \
|
|
--context-length 1000000
|
|
```
|
|
|
|
### Configuration Tips
|
|
|
|
- **OOM Mitigation**: Adjust `--context-length` to avoid a GPU out-of-memory issue. For the Scout model, we recommend setting this value up to 1M on 8\*H100 and up to 2.5M on 8\*H200. For the Maverick model, we don't need to set context length on 8\*H200. When hybrid kv cache is enabled, `--context-length` can be set up to 5M on 8\*H100 and up to 10M on 8\*H200 for the Scout model.
|
|
|
|
- **Attention Backend Auto-Selection**: SGLang automatically selects the optimal attention backend for Llama 4 based on your hardware. You typically don't need to specify `--attention-backend` manually:
|
|
- **Blackwell GPUs (B200/GB200)**: `trtllm_mha`
|
|
- **Hopper GPUs (H100/H200)**: `fa3`
|
|
- **AMD GPUs**: `aiter`
|
|
- **Intel XPU**: `intel_xpu`
|
|
- **Other platforms**: `triton` (fallback)
|
|
|
|
To override the auto-selection, explicitly specify `--attention-backend` with one of the supported backends: `fa3`, `aiter`, `triton`, `trtllm_mha`, or `intel_xpu`.
|
|
|
|
- **Chat Template**: Add `--chat-template llama-4` for chat completion tasks.
|
|
- **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
|
|
- **Enable Hybrid-KVCache**: Set `--swa-full-tokens-ratio` to adjust the ratio of SWA layer (for Llama4, it's local attention layer) KV tokens / full layer KV tokens. (default: 0.8, range: 0-1)
|
|
|
|
|
|
### EAGLE Speculative Decoding
|
|
**Description**: SGLang has supported Llama 4 Maverick (400B) with [EAGLE speculative decoding](../advanced_features/speculative_decoding#EAGLE-Decoding).
|
|
|
|
**Usage**:
|
|
Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
|
|
```text Output
|
|
python3 -m sglang.launch_server \
|
|
--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
|
|
--speculative-algorithm EAGLE3 \
|
|
--speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
|
|
--speculative-num-steps 3 \
|
|
--speculative-eagle-topk 1 \
|
|
--speculative-num-draft-tokens 4 \
|
|
--trust-remote-code \
|
|
--tp 8 \
|
|
--context-length 1000000
|
|
```
|
|
|
|
- **Note** The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* can only recognize conversations in chat mode.
|
|
|
|
## Benchmarking Results
|
|
|
|
### Accuracy Test with `lm_eval`
|
|
|
|
The accuracy on SGLang for both Llama4 Scout and Llama4 Maverick can match the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).
|
|
|
|
Benchmark results on MMLU Pro dataset with 8*H100:
|
|
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
|
|
<colgroup>
|
|
<col style={{width: "34%"}} />
|
|
<col style={{width: "33%"}} />
|
|
<col style={{width: "33%"}} />
|
|
</colgroup>
|
|
<thead>
|
|
<tr style={{borderBottom: "2px solid #d55816"}}>
|
|
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}></th>
|
|
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Llama-4-Scout-17B-16E-Instruct</th>
|
|
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-4-Maverick-17B-128E-Instruct</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody>
|
|
<tr>
|
|
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Official Benchmark</td>
|
|
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>74.3</td>
|
|
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>80.5</td>
|
|
</tr>
|
|
<tr>
|
|
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>SGLang</td>
|
|
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>75.2</td>
|
|
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>80.7</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
Commands:
|
|
|
|
```bash Command
|
|
# Llama-4-Scout-17B-16E-Instruct model
|
|
python -m sglang.launch_server \
|
|
--model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
|
|
--port 30000 \
|
|
--tp 8 \
|
|
--mem-fraction-static 0.8 \
|
|
--context-length 65536
|
|
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
|
|
|
|
# Llama-4-Maverick-17B-128E-Instruct
|
|
python -m sglang.launch_server \
|
|
--model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
|
|
--port 30000 \
|
|
--tp 8 \
|
|
--mem-fraction-static 0.8 \
|
|
--context-length 65536
|
|
lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
|
|
```
|
|
|
|
Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092).
|