mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-07-01 04:08:10 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
367 lines
12 KiB
Plaintext
---
title: CLI reference
sidebarTitle: CLI
description: Run one-off generation tasks and launch the HTTP server from the command line.
---

The `sglang` CLI provides two main subcommands for diffusion inference:

- **`sglang generate`** -- run a one-off generation without a persistent server
- **`sglang serve`** -- launch the OpenAI-compatible HTTP server

## Prerequisites

You need a working SGLang Diffusion installation with the `sglang` CLI available on your `$PATH`. See the [installation guide](../installation) for setup instructions.

## Generate

Run a one-off generation task without launching a persistent server. Pass both server arguments and sampling parameters after the `generate` subcommand:

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)

sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```

You can also enable Cache-DiT acceleration via an environment variable:

```bash
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```

<Note>
HTTP server-related arguments are ignored in `generate` mode. The process shuts down automatically once generation completes.
</Note>

## Serve

Launch the SGLang Diffusion HTTP server and interact through the OpenAI-compatible API.

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

sglang serve "${SERVER_ARGS[@]}"
```

- `--model-path` -- which model to load (e.g. `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`)
- `--port` -- HTTP port to listen on (default: `30010`)

For full API usage including image/video generation and LoRA management, see the [OpenAI API documentation](./openai-api).

---

## Supported arguments

### Server arguments

<Accordion title="Server arguments reference">

| Argument | Description |
|:--|:--|
| `--model-path MODEL_PATH` | Path to the model or HuggingFace model ID |
| `--lora-path LORA_PATH` | Path to a LoRA adapter (local or HuggingFace ID). If omitted, LoRA is not applied |
| `--lora-nickname NAME` | Nickname for the LoRA adapter (default: `default`) |
| `--num-gpus NUM` | Number of GPUs to use |
| `--tp-size SIZE` | Tensor parallelism size (encoder only; keep at most 1 when text encoder offload is enabled) |
| `--sp-degree SIZE` | Sequence parallelism size (typically should match the number of GPUs) |
| `--ulysses-degree SIZE` | DeepSpeed-Ulysses-style SP degree in USP |
| `--ring-degree SIZE` | Ring attention-style SP degree in USP |
| `--attention-backend BACKEND` | Attention backend. Native pipelines: `fa`, `torch_sdpa`, `sage_attn`, etc. Diffusers pipelines: `flash`, `_flash_3_hub`, `sage`, `xformers` |
| `--attention-backend-config CONFIG` | Config for the attention backend. Accepts a JSON string, a JSON/YAML file path, or `key=value` pairs |
| `--cache-dit-config PATH` | Path to a Cache-DiT YAML/JSON config (diffusers backend only) |
| `--dit-precision DTYPE` | Precision for the DiT model (`fp32`, `fp16`, `bf16`) |
| `--text-encoder-cpu-offload` | Offload text encoders to CPU |
| `--pin-cpu-memory` | Pin CPU memory for faster transfers |

</Accordion>
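
The parallelism flags interact: both examples on this page pair `--num-gpus 4` with `--ulysses-degree=2` and `--ring-degree=2`, so the two degrees multiply to the GPU count. A minimal sketch of a pre-flight check, assuming that product rule holds for your setup (it is inferred from the examples, not enforced or documented by the tables above); `check_usp_degrees` is a hypothetical helper, not part of the `sglang` CLI:

```bash
# Hypothetical pre-flight check, not part of the sglang CLI.
# Assumption: ulysses degree * ring degree should equal the GPU count.
# Usage: check_usp_degrees <num_gpus> <ulysses_degree> <ring_degree>
check_usp_degrees() {
  num_gpus="$1"
  ulysses="$2"
  ring="$3"
  if [ $(( ulysses * ring )) -ne "$num_gpus" ]; then
    echo "ulysses ($ulysses) x ring ($ring) != num_gpus ($num_gpus)" >&2
    return 1
  fi
}

# Example: check_usp_degrees 4 2 2 && sglang serve "${SERVER_ARGS[@]}"
```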

### Sampling parameters

<Accordion title="Generation parameters">

| Argument | Description |
|:--|:--|
| `--prompt PROMPT` | Text description for the image or video to generate |
| `--negative-prompt PROMPT` | Negative prompt to guide generation away from certain concepts |
| `--num-inference-steps STEPS` | Number of denoising steps |
| `--seed SEED` | Random seed for reproducible generation |

</Accordion>

<Accordion title="Image/video configuration">

| Argument | Description |
|:--|:--|
| `--height HEIGHT` | Height of the generated output |
| `--width WIDTH` | Width of the generated output |
| `--num-frames NUM` | Number of frames to generate (video only) |
| `--fps FPS` | Frames per second for the saved output (video only) |

</Accordion>

<Accordion title="Output options">

| Argument | Description |
|:--|:--|
| `--save-output` | Save the image or video to disk |
| `--output-path PATH` | Directory to save the generated output |
| `--output-file-name NAME` | File name for the saved output |
| `--return-frames` | Return the raw frames instead of saving |

</Accordion>

### Frame interpolation (video only)

Frame interpolation is a post-processing step that synthesizes new frames between each pair of consecutive generated frames, producing smoother motion without re-running the diffusion model.

The `--frame-interpolation-exp` flag controls how many rounds of interpolation to apply: each round inserts one new frame into every gap between adjacent frames, so the output frame count follows the formula:

$$
\text{output frames} = (N - 1) \times 2^{\text{exp}} + 1
$$

For example, 5 original frames with `exp=1` -> 4 gaps x 1 new frame + 5 originals = **9 frames**; with `exp=2` -> **17 frames**.

| Argument | Description |
|:--|:--|
| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use |
| `--frame-interpolation-exp EXP` | Interpolation exponent -- `1` = 2x temporal resolution, `2` = 4x, etc. (default: `1`) |
| `--frame-interpolation-scale SCALE` | RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`) |
| `--frame-interpolation-model-path PATH` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) |

**Example** -- generate a 5-frame video and interpolate to 9 frames ($(5 - 1) \times 2^1 + 1 = 9$):

```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```
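
To plan `--num-frames` and `--frame-interpolation-exp` together, the frame-count formula above can be evaluated with shell arithmetic. A minimal sketch; `interpolated_frames` is a hypothetical helper name, not part of the `sglang` CLI:

```bash
# Hypothetical helper: output frame count after interpolation.
# Usage: interpolated_frames <num_frames> <exp>
interpolated_frames() {
  n="$1"
  exp="$2"
  # (N - 1) * 2^exp + 1, where (1 << exp) computes 2^exp
  echo $(( (n - 1) * (1 << exp) + 1 ))
}

interpolated_frames 5 1   # prints 9
interpolated_frames 5 2   # prints 17
```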

---

## Configuration files

Instead of passing every parameter on the command line, you can use a JSON or YAML config file. Command-line arguments take precedence over config values.

```bash
sglang generate --config config.json
```

<Tabs>
<Tab title="JSON">
```json config.json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_gpus": 2,
  "sp_size": 2,
  "tp_size": 1,
  "num_frames": 45,
  "height": 720,
  "width": 1280,
  "num_inference_steps": 6,
  "seed": 1024,
  "fps": 24,
  "precision": "bf16",
  "vae_precision": "fp16",
  "vae_tiling": true,
  "vae_sp": true,
  "vae_config": {
    "load_encoder": false,
    "load_decoder": true,
    "tile_sample_min_height": 256,
    "tile_sample_min_width": 256
  },
  "text_encoder_precisions": ["fp16", "fp16"],
  "mask_strategy_file_path": null,
  "enable_torch_compile": false
}
```
</Tab>
<Tab title="YAML">
```yaml config.yaml
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
```
</Tab>
</Tabs>

To see all available options:

```bash
sglang generate --help
```
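
Because command-line arguments take precedence over config values, one config file can act as a set of defaults while individual flags vary per run. A minimal sketch, assuming a config file like the `config.json` shown earlier in this section; `generate_with_seed` is a hypothetical wrapper, not part of the CLI:

```bash
# Hypothetical wrapper: reuse one config file while varying the seed.
# The --seed flag on the command line overrides the "seed" key in the file.
# Usage: generate_with_seed <config_file> <seed>
generate_with_seed() {
  config_file="$1"
  seed="$2"
  sglang generate --config "$config_file" --seed "$seed"
}

# Example: generate_with_seed config.json 42
```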

---

## Component path overrides

You can override any pipeline component (e.g. `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path with `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`.

### Example: FLUX.2-dev with Tiny AutoEncoder

Replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:

```bash
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder
```

You can also use a local path:

```bash
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```
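
A local override has to point at a complete component folder, one containing `config.json` and safetensors files. A minimal sketch of a check you could run before launching; `is_component_dir` is a hypothetical helper and `./my-vae` is a placeholder path, neither is part of the `sglang` CLI:

```bash
# Hypothetical check: does DIR look like a complete component folder?
# Returns 0 only if config.json and at least one *.safetensors file exist.
is_component_dir() {
  dir="$1"
  [ -f "$dir/config.json" ] || return 1
  for f in "$dir"/*.safetensors; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    [ -f "$f" ] && return 0
  done
  return 1
}

# Example: is_component_dir ./my-vae && echo "component folder looks complete"
```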

<Warning>
The component key must match the one in the model's `model_index.json` (e.g. `vae`).
The path must be either a HuggingFace repo ID or point to a complete component folder containing `config.json` and safetensors files.
</Warning>

---

## Diffusers backend

SGLang Diffusion supports a diffusers backend that runs any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for models without native SGLang implementations or models with custom pipeline classes.

### Backend arguments

| Argument | Values | Description |
|:--|:--|:--|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fallback to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile) |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice) |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE |

### Example: running Ovis-Image-7B

[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.

```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```

### Extra diffusers arguments

For pipeline-specific parameters not exposed via CLI, use `diffusers_kwargs` in a config file:

```json config.json
{
  "model_path": "AIDC-AI/Ovis-Image-7B",
  "backend": "diffusers",
  "prompt": "A beautiful landscape",
  "diffusers_kwargs": {
    "cross_attention_kwargs": {"scale": 0.5}
  }
}
```

```bash
sglang generate --config config.json
```

### Cache-DiT acceleration

On the diffusers backend, you can enable Cache-DiT acceleration by loading a custom cache config from a YAML or JSON file with `--cache-dit-config`. See the [Cache-DiT documentation](../cache-dit) for details.

---

## Cloud storage support

The server supports automatically uploading generated artifacts to S3-compatible cloud storage (AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).

The workflow is: **Generate -> Upload -> Delete local file**. The API response returns the public URL of the uploaded object.

1. **Install boto3**

```bash
pip install boto3
```

2. **Set environment variables**

```bash
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key

# Optional: custom endpoint for MinIO/OSS/COS
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```

3. **Launch the server**

```bash
sglang serve --model-path MODEL_PATH
```
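
Before launching, it can help to confirm the storage variables are actually set in the current shell. A minimal sketch, assuming the four non-optional variables from step 2 are all required (the steps above do not state this explicitly); `check_storage_env` is a hypothetical helper, not part of SGLang:

```bash
# Hypothetical pre-flight check, not part of SGLang.
# Returns non-zero and names each variable that is unset or empty.
check_storage_env() {
  missing=0
  for name in SGLANG_CLOUD_STORAGE_TYPE SGLANG_S3_BUCKET_NAME \
              SGLANG_S3_ACCESS_KEY_ID SGLANG_S3_SECRET_ACCESS_KEY; do
    # Indirect lookup of the variable named by $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_storage_env && sglang serve --model-path MODEL_PATH
```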

See the [environment variables reference](../environment-variables) for all storage-related variables.