---
title: CLI reference
sidebarTitle: CLI
description: Run one-off generation tasks and launch the HTTP server from the command line.
---

The `sglang` CLI provides two main subcommands for diffusion inference:

- **`sglang generate`** -- run a one-off generation without a persistent server
- **`sglang serve`** -- launch the OpenAI-compatible HTTP server

## Prerequisites

A working SGLang Diffusion installation with the `sglang` CLI available in your `$PATH`. See the [installation guide](../installation) for setup instructions.

## Generate

Run a one-off generation task without launching a persistent server. Pass both server arguments and sampling parameters after the `generate` subcommand:

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)

sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```

You can also enable Cache-DiT acceleration via an environment variable:

```bash
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
```

HTTP server-related arguments are ignored in `generate` mode. The process shuts down automatically once generation completes.

## Serve

Launch the SGLang Diffusion HTTP server and interact through the OpenAI-compatible API.

```bash
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

sglang serve "${SERVER_ARGS[@]}"
```

- `--model-path` -- which model to load (e.g. `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`)
- `--port` -- HTTP port to listen on (default: `30010`)

For full API usage including image/video generation and LoRA management, see the [OpenAI API documentation](./openai-api).

---

## Supported arguments

### Server arguments

| Argument | Description |
|:--|:--|
| `--model-path MODEL_PATH` | Path to the model or HuggingFace model ID |
| `--lora-path LORA_PATH` | Path to a LoRA adapter (local or HuggingFace ID). If omitted, LoRA is not applied |
| `--lora-nickname NAME` | Nickname for the LoRA adapter (default: `default`) |
| `--num-gpus NUM` | Number of GPUs to use |
| `--tp-size SIZE` | Tensor parallelism size (encoder only; keep at most 1 when text encoder offload is enabled) |
| `--sp-degree SIZE` | Sequence parallelism size (typically should match the number of GPUs) |
| `--ulysses-degree SIZE` | DeepSpeed-Ulysses-style SP degree in USP |
| `--ring-degree SIZE` | Ring attention-style SP degree in USP |
| `--attention-backend BACKEND` | Attention backend. Native pipelines: `fa`, `torch_sdpa`, `sage_attn`, etc. Diffusers pipelines: `flash`, `_flash_3_hub`, `sage`, `xformers` |
| `--attention-backend-config CONFIG` | Config for the attention backend. Accepts a JSON string, a JSON/YAML file path, or `key=value` pairs |
| `--cache-dit-config PATH` | Path to a Cache-DiT YAML/JSON config (diffusers backend only) |
| `--dit-precision DTYPE` | Precision for the DiT model (`fp32`, `fp16`, `bf16`) |
| `--text-encoder-cpu-offload` | Offload text encoders to CPU |
| `--pin-cpu-memory` | Pin CPU memory for faster transfers |

### Sampling parameters

| Argument | Description |
|:--|:--|
| `--prompt PROMPT` | Text description for the image or video to generate |
| `--negative-prompt PROMPT` | Negative prompt to guide generation away from certain concepts |
| `--num-inference-steps STEPS` | Number of denoising steps |
| `--seed SEED` | Random seed for reproducible generation |

| Argument | Description |
|:--|:--|
| `--height HEIGHT` | Height of the generated output |
| `--width WIDTH` | Width of the generated output |
| `--num-frames NUM` | Number of frames to generate (video only) |
| `--fps FPS` | Frames per second for the saved output (video only) |

| Argument | Description |
|:--|:--|
| `--save-output` | Save the image or video to disk |
| `--output-path PATH` | Directory to save the generated output |
| `--output-file-name NAME` | File name for the saved output |
| `--return-frames` | Return the raw frames instead of saving |

### Frame interpolation (video only)

Frame interpolation is a post-processing step that synthesizes new frames between each pair of consecutive generated frames, producing smoother motion without re-running the diffusion model.

The `--frame-interpolation-exp` flag controls how many rounds of interpolation to apply: each round inserts one new frame into every gap between adjacent frames, so the output frame count follows the formula:

$$
\text{output frames} = (N - 1) \times 2^{\text{exp}} + 1
$$

For example, 5 original frames with `exp=1` -> 4 gaps x 1 new frame + 5 originals = **9 frames**; with `exp=2` -> **17 frames**.

| Argument | Description |
|:--|:--|
| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use |
| `--frame-interpolation-exp EXP` | Interpolation exponent -- `1` = 2x temporal resolution, `2` = 4x, etc. (default: `1`) |
| `--frame-interpolation-scale SCALE` | RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`) |
| `--frame-interpolation-model-path PATH` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) |

**Example** -- generate a 5-frame video and interpolate to 9 frames ($(5 - 1) \times 2^1 + 1 = 9$):

```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```

---

## Configuration files

Instead of passing every parameter on the command line, you can use a JSON or YAML config file.
Command-line arguments take precedence over config values.

```bash
sglang generate --config config.json
```

```json config.json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_gpus": 2,
  "sp_size": 2,
  "tp_size": 1,
  "num_frames": 45,
  "height": 720,
  "width": 1280,
  "num_inference_steps": 6,
  "seed": 1024,
  "fps": 24,
  "precision": "bf16",
  "vae_precision": "fp16",
  "vae_tiling": true,
  "vae_sp": true,
  "vae_config": {
    "load_encoder": false,
    "load_decoder": true,
    "tile_sample_min_height": 256,
    "tile_sample_min_width": 256
  },
  "text_encoder_precisions": ["fp16", "fp16"],
  "mask_strategy_file_path": null,
  "enable_torch_compile": false
}
```

```yaml config.yaml
model_path: "FastVideo/FastHunyuan-diffusers"
prompt: "A beautiful woman in a red dress walking down a street"
output_path: "outputs/"
num_gpus: 2
sp_size: 2
tp_size: 1
num_frames: 45
height: 720
width: 1280
num_inference_steps: 6
seed: 1024
fps: 24
precision: "bf16"
vae_precision: "fp16"
vae_tiling: true
vae_sp: true
vae_config:
  load_encoder: false
  load_decoder: true
  tile_sample_min_height: 256
  tile_sample_min_width: 256
text_encoder_precisions:
  - "fp16"
  - "fp16"
mask_strategy_file_path: null
enable_torch_compile: false
```

To see all available options:

```bash
sglang generate --help
```

---

## Component path overrides

You can override any pipeline component (e.g. `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path with `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`.
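To see which component keys (such as `vae`) a model exposes, you can list the top-level keys of its `model_index.json`. The helper below is a rough sketch, not part of the `sglang` CLI. The name `list_components` is hypothetical. It assumes the file is pretty-printed with two-space indentation, as files on the HuggingFace Hub typically are, and that metadata keys start with an underscore (for example `_class_name`).

```bash
# Hypothetical helper (not part of sglang): print the component keys found in a
# pretty-printed model_index.json, one per line.
# Assumptions: top-level keys are indented by exactly two spaces, and keys that
# start with "_" are metadata rather than pipeline components.
list_components() {
  grep -oE '^  "[^_"][^"]*"[[:space:]]*:' "$1" \
    | sed -E 's/^  "([^"]*)".*/\1/'
}

# Usage: list_components /path/to/model/model_index.json
```

If the file is minified or indented differently, this text-based approach will miss keys; a real JSON parser is the more robust choice in that case.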
### Example: FLUX.2-dev with Tiny AutoEncoder

Replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:

```bash
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder
```

You can also use a local path:

```bash
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```

The component key must match the one in the model's `model_index.json` (e.g. `vae`). The path must be either a HuggingFace repo ID or point to a complete component folder containing `config.json` and safetensors files.

---

## Diffusers backend

SGLang Diffusion supports a diffusers backend that runs any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for models without native SGLang implementations or models with custom pipeline classes.

### Backend arguments

| Argument | Values | Description |
|:--|:--|:--|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fall back to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile) |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice) |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE |

### Example: running Ovis-Image-7B

[Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.
```bash
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png
```

### Extra diffusers arguments

For pipeline-specific parameters not exposed via CLI, use `diffusers_kwargs` in a config file:

```json config.json
{
  "model_path": "AIDC-AI/Ovis-Image-7B",
  "backend": "diffusers",
  "prompt": "A beautiful landscape",
  "diffusers_kwargs": {
    "cross_attention_kwargs": {"scale": 0.5}
  }
}
```

```bash
sglang generate --config config.json
```

### Cache-DiT acceleration

Users on the diffusers backend can leverage Cache-DiT acceleration by loading custom cache configs from a YAML file. See the [Cache-DiT documentation](../cache-dit) for details.

---

## Cloud storage support

The server supports automatically uploading generated artifacts to S3-compatible cloud storage (AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).

The workflow is: **Generate -> Upload -> Delete local file**. The API response returns the public URL of the uploaded object.

1. **Install boto3**

   ```bash
   pip install boto3
   ```

2. **Set environment variables**

   ```bash
   export SGLANG_CLOUD_STORAGE_TYPE=s3
   export SGLANG_S3_BUCKET_NAME=my-bucket
   export SGLANG_S3_ACCESS_KEY_ID=your-access-key
   export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
   # Optional: custom endpoint for MinIO/OSS/COS
   export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
   ```

3. **Launch the server**

   ```bash
   sglang serve --model-path MODEL_PATH
   ```

See the [environment variables reference](../environment-variables) for all storage-related variables.
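Because the server reads these settings from the environment, a small pre-flight check before launching can catch a variable that was set but never exported. The function below is only a sketch, not part of the `sglang` CLI. The name `check_storage_env` is hypothetical, and treating the four variables not marked optional above as required is an assumption.

```bash
# Hypothetical pre-flight check (not part of sglang): verify that the storage
# variables from the setup steps are exported and non-empty.
# Assumption: the four variables not marked "Optional" are all required.
check_storage_env() {
  local missing=0 name
  for name in SGLANG_CLOUD_STORAGE_TYPE SGLANG_S3_BUCKET_NAME \
              SGLANG_S3_ACCESS_KEY_ID SGLANG_S3_SECRET_ACCESS_KEY; do
    # printenv only sees exported variables, which is what a child process inherits
    if [ -z "$(printenv "$name")" ]; then
      echo "missing or not exported: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is `check_storage_env && sglang serve --model-path MODEL_PATH`, so the server starts only when all four variables are present.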