# Cache-DiT

SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss.

## Overview

**Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop:

- **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences
- **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions
- **SCM (Step Computation Masking)**: Step-level caching control for additional speedup

## Basic Usage

Enable Cache-DiT by setting the `SGLANG_CACHE_DIT_ENABLED` environment variable when running `sglang generate` or `sglang serve`:

```bash
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"
```

## Diffusers Backend

Cache-DiT supports loading acceleration configs from a custom YAML file. For
diffusers pipelines (`diffusers` backend), pass the YAML/JSON path via `--cache-dit-config`. This
flow requires cache-dit >= 1.2.0 (`cache_dit.load_configs`).

### Single GPU inference

Define a `cache.yaml` file that contains:

- DBCache + TaylorSeer

```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
```

Then apply the config with:

```bash
sglang generate \
  --backend diffusers \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config cache.yaml \
  --prompt "A beautiful sunset over the mountains"
```

- DBCache + TaylorSeer + SCM (Step Computation Mask)

```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  # num_inference_steps must be set for SCM. SCM automatically generates
  # the steps computation mask based on num_inference_steps.
  # Reference: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#scm-steps-computation-masking
  num_inference_steps: 28
  steps_computation_mask: fast
```

- DBCache + TaylorSeer + SCM (Step Computation Mask) + Cache CFG

```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
  num_inference_steps: 28
  steps_computation_mask: fast
  enable_separate_cfg: true # e.g., Qwen-Image, Wan, Chroma, Ovis-Image, etc.
```

### Distributed inference

- 1D Parallelism

Define a parallelism-only config file `parallel.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: auto
  attention_backend: native
```

`ulysses_size: auto` means that cache-dit detects the `world_size` and uses it as the `ulysses_size`. Otherwise, set it manually to a specific integer, e.g., 4.

Then apply the distributed config, adding `--num-gpus N` to specify the number of GPUs for distributed inference:

```bash
sglang generate \
  --backend diffusers \
  --num-gpus 4 \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config parallel.yaml \
  --prompt "A futuristic cityscape at sunset"
```

- 2D Parallelism

You can also define a 2D parallelism config file `parallel_2d.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: auto
  tp_size: 2
  attention_backend: native
```

Here `tp_size: 2` enables tensor parallelism with size 2, and `ulysses_size: auto` means that cache-dit uses `world_size // tp_size` as the `ulysses_size`.

- 3D Parallelism

You can also define a 3D parallelism config file `parallel_3d.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: 2
  ring_size: 2
  tp_size: 2
  attention_backend: native
```

Here `ulysses_size: 2`, `ring_size: 2`, and `tp_size: 2` enable Ulysses parallelism with size 2, ring parallelism with size 2, and tensor parallelism with size 2.

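The size arithmetic described for the 1D, 2D, and 3D cases can be sketched as follows. This is a minimal illustration, not Cache-DiT's code: `resolve_ulysses_size` is a hypothetical helper, and both the treatment of `ring_size` under `auto` and the rule that the product of all sizes must equal the number of GPUs are assumptions.

```python
def resolve_ulysses_size(world_size: int, ulysses_size="auto",
                         ring_size: int = 1, tp_size: int = 1) -> int:
    """Resolve `ulysses_size: auto` and check that the layout fills all GPUs."""
    # GPUs consumed by the non-Ulysses dimensions (assumption: ring and TP multiply).
    other = ring_size * tp_size
    if ulysses_size == "auto":
        if world_size % other != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by ring_size * tp_size={other}"
            )
        return world_size // other
    # Assumption: explicit sizes must multiply to exactly the number of GPUs.
    if ulysses_size * other != world_size:
        raise ValueError(
            f"ulysses_size * ring_size * tp_size = {ulysses_size * other}, "
            f"expected world_size={world_size}"
        )
    return ulysses_size


print(resolve_ulysses_size(4))                              # 1D, auto on 4 GPUs -> 4
print(resolve_ulysses_size(4, tp_size=2))                   # 2D, auto with tp_size 2 -> 2
print(resolve_ulysses_size(8, 2, ring_size=2, tp_size=2))   # 3D, explicit sizes -> 2
```

Under these assumptions, the 3D example above would need `--num-gpus 8`.
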
- Ulysses Anything Attention

To enable Ulysses Anything Attention, define a parallelism config file `parallel_uaa.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_anything: true
```

- Ulysses FP8 Communication

For devices without NVLink support, you can enable Ulysses FP8 Communication to further reduce communication overhead. Define a parallelism config file `parallel_fp8.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_float8: true
```

- Async Ulysses CP

You can also enable async Ulysses CP (context parallelism) to overlap communication and computation. Define a parallelism config file `parallel_async.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  ulysses_async: true # Currently supported only for FLUX.1, Qwen-Image, Ovis-Image, and Z-Image.
```

Here `ulysses_async: true` enables async Ulysses CP.

- TE-P and VAE-P

You can also parallelize extra modules, such as the text encoder (TE-P) and the VAE (VAE-P), by listing them in the yaml config. For example, define a parallelism config file `parallel_extra.yaml` that contains:

```yaml
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
```

### Hybrid Cache and Parallelism

Define a hybrid cache and parallel acceleration config file `hybrid.yaml` that contains:

```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
```

Then apply the hybrid cache and parallel acceleration config:

```bash
sglang generate \
  --backend diffusers \
  --num-gpus 4 \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config hybrid.yaml \
  --prompt "A beautiful sunset over the mountains"
```

### Attention Backend

If you only want to select the attention backend without any other optimization configs, define a yaml file `attention.yaml` that contains only:

```yaml
attention_backend: "flash" # '_flash_3' for Hopper
```

### Quantization

You can also specify a quantization config in the yaml file, which requires `torchao>=0.16.0`. For example, define a yaml file `quantize.yaml` that contains:

```yaml
quantize_config: # quantization configuration for transformer modules
  # float8 (DQ), float8_weight_only, float8_blockwise, int8 (DQ), int8_weight_only, etc.
  quant_type: "float8"
  # layers to exclude from quantization (transformer). Layers whose names contain any of the
  # keywords in the exclude_layers list will be excluded from quantization. This is useful
  # for sensitive layers that are not robust to quantization, e.g., embedding layers.
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false # whether to print verbose logs during quantization
```

Then apply the quantization config. When using quantization, also enable `torch.compile` for better performance. For example:

```bash
sglang generate \
  --backend diffusers \
  --model-path Qwen/Qwen-Image \
  --warmup \
  --cache-dit-config quantize.yaml \
  --enable-torch-compile \
  --dit-cpu-offload false \
  --text-encoder-cpu-offload false \
  --prompt "A beautiful sunset over the mountains"
```

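The keyword matching described for `exclude_layers` amounts to a substring test on layer names. The sketch below illustrates that rule only; `is_excluded` is a hypothetical helper and the layer names are made up for the example, not taken from any specific model.

```python
def is_excluded(layer_name: str, exclude_layers: list[str]) -> bool:
    """Return True if the layer name contains any of the exclusion keywords."""
    return any(keyword in layer_name for keyword in exclude_layers)


exclude_layers = ["embedder", "embed"]

# Hypothetical layer names, for illustration only.
layer_names = [
    "pos_embed.proj",
    "time_text_embedder.linear_1",
    "transformer_blocks.0.attn.to_q",
]

# Only layers that match no keyword remain candidates for quantization.
to_quantize = [name for name in layer_names if not is_excluded(name, exclude_layers)]
print(to_quantize)
```

Because `"embed"` is a substring of `"embedder"`, the second keyword already covers the first in this sketch; listing both simply makes the intent explicit.
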
### Combined Configs: Cache + Parallelism + Quantization

You can also combine all of the above configs in a single yaml file `combined.yaml` that contains:

```yaml
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  attention_backend: native
  extra_parallel_modules: ["text_encoder", "vae"]
quantize_config:
  quant_type: "float8"
  exclude_layers:
    - "embedder"
    - "embed"
  verbose: false
```

Then apply the combined cache, parallelism, and quantization config. When using quantization, also enable `torch.compile` for better performance.

## Advanced Configuration

### DBCache Parameters

DBCache controls block-level caching behavior:

| Parameter | Env Variable              | Default | Description                              |
|-----------|---------------------------|---------|------------------------------------------|
| Fn        | `SGLANG_CACHE_DIT_FN`     | 1       | Number of first blocks to always compute |
| Bn        | `SGLANG_CACHE_DIT_BN`     | 0       | Number of last blocks to always compute  |
| W         | `SGLANG_CACHE_DIT_WARMUP` | 4       | Warmup steps before caching starts       |
| R         | `SGLANG_CACHE_DIT_RDT`    | 0.24    | Residual difference threshold            |
| MC        | `SGLANG_CACHE_DIT_MC`     | 3       | Maximum continuous cached steps          |

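How W, R, and MC interact can be illustrated with a simplified decision rule. The sketch below is an illustration under stated assumptions, not Cache-DiT's implementation: it assumes the output of the first `Fn` blocks is compared with the previous step's output using a relative L1 difference, and it uses plain Python lists in place of tensors. `relative_l1_diff` and `can_use_cache` are hypothetical names.

```python
from typing import Optional, Sequence


def relative_l1_diff(current: Sequence[float], previous: Sequence[float]) -> float:
    """Sum of absolute changes divided by the sum of previous magnitudes."""
    numerator = sum(abs(c - p) for c, p in zip(current, previous))
    denominator = sum(abs(p) for p in previous)
    return numerator / denominator if denominator > 0 else float("inf")


def can_use_cache(
    step: int,
    current: Sequence[float],
    previous: Optional[Sequence[float]],
    continuous_cached: int,
    warmup: int = 4,          # W
    rdt: float = 0.24,        # R
    max_continuous: int = 3,  # MC
) -> bool:
    """Decide whether the remaining blocks may reuse cached results."""
    if step < warmup or previous is None:
        return False  # always compute during warmup
    if continuous_cached >= max_continuous:
        return False  # force a full computation after MC cached steps in a row
    return relative_l1_diff(current, previous) < rdt


prev_out = [1.0, 2.0, 3.0]  # toy first-Fn-block output at the previous step
curr_out = [1.1, 2.0, 3.1]  # toy first-Fn-block output at the current step
print(can_use_cache(step=5, current=curr_out, previous=prev_out, continuous_cached=0))  # True
print(can_use_cache(step=2, current=curr_out, previous=prev_out, continuous_cached=0))  # False (warmup)
```

In this sketch a lower `rdt` makes the cache harder to hit, which is why lowering R trades speed for quality.
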
### TaylorSeer Configuration

TaylorSeer improves caching accuracy using Taylor expansion:

| Parameter | Env Variable                  | Default | Description                     |
|-----------|-------------------------------|---------|---------------------------------|
| Enable    | `SGLANG_CACHE_DIT_TAYLORSEER` | false   | Enable TaylorSeer calibrator    |
| Order     | `SGLANG_CACHE_DIT_TS_ORDER`   | 1       | Taylor expansion order (1 or 2) |

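The idea behind the expansion order can be sketched with a textbook finite-difference Taylor extrapolation. This is an illustration only: it works on scalars instead of feature tensors, assumes equally spaced steps, and is not necessarily the exact scheme TaylorSeer uses. `taylor_predict` is a hypothetical helper.

```python
def taylor_predict(history: list[float], steps_ahead: int = 1, order: int = 1) -> float:
    """Extrapolate the next value from equally spaced past values.

    Derivatives are approximated with backward finite differences.
    """
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 past values")
    # First-order term: f(t + k) ~ f(t) + f'(t) * k
    first_diff = history[-1] - history[-2]
    prediction = history[-1] + first_diff * steps_ahead
    if order == 2:
        # Second-order term: + 0.5 * f''(t) * k^2
        second_diff = history[-1] - 2 * history[-2] + history[-3]
        prediction += 0.5 * second_diff * steps_ahead ** 2
    return prediction


print(taylor_predict([1.0, 2.0, 3.0]))           # linear trend -> 4.0
print(taylor_predict([1.0, 4.0, 9.0], order=1))  # -> 14.0
print(taylor_predict([1.0, 4.0, 9.0], order=2))  # -> 15.0
```

On the curved sequence, order 2 lands closer to the true next value (16) than order 1, at the cost of keeping one more past value.
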
### Combined Configuration Example

DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters
simultaneously:

```bash
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A curious raccoon in a forest"
```

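To see how these variables map onto the parameter tables, the sketch below reads them into a dictionary using the documented defaults. It is not SGLang's actual parsing code: `read_cache_dit_env` is a hypothetical helper, and both the default of `false` for `SGLANG_CACHE_DIT_ENABLED` and the set of accepted truthy strings are assumptions.

```python
import os
from typing import Mapping, Optional


def read_cache_dit_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect Cache-DiT settings from environment variables, with table defaults."""
    env = os.environ if env is None else env

    def flag(name: str, default: str) -> bool:
        # Assumption: "1", "true", and "yes" (case-insensitive) count as enabled.
        return env.get(name, default).strip().lower() in ("1", "true", "yes")

    return {
        "enabled": flag("SGLANG_CACHE_DIT_ENABLED", "false"),  # default is an assumption
        "fn": int(env.get("SGLANG_CACHE_DIT_FN", "1")),
        "bn": int(env.get("SGLANG_CACHE_DIT_BN", "0")),
        "warmup": int(env.get("SGLANG_CACHE_DIT_WARMUP", "4")),
        "rdt": float(env.get("SGLANG_CACHE_DIT_RDT", "0.24")),
        "mc": int(env.get("SGLANG_CACHE_DIT_MC", "3")),
        "taylorseer": flag("SGLANG_CACHE_DIT_TAYLORSEER", "false"),
        "ts_order": int(env.get("SGLANG_CACHE_DIT_TS_ORDER", "1")),
    }


# The same values as the shell example above, passed as a plain mapping.
example_env = {
    "SGLANG_CACHE_DIT_ENABLED": "true",
    "SGLANG_CACHE_DIT_FN": "2",
    "SGLANG_CACHE_DIT_BN": "1",
    "SGLANG_CACHE_DIT_RDT": "0.4",
    "SGLANG_CACHE_DIT_MC": "4",
    "SGLANG_CACHE_DIT_TAYLORSEER": "true",
    "SGLANG_CACHE_DIT_TS_ORDER": "2",
}
settings = read_cache_dit_env(example_env)
print(settings)
```

Passing a mapping instead of reading `os.environ` directly keeps the sketch easy to test without touching the real environment.
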
### SCM (Step Computation Masking)

SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and
which to serve from cached results.

**SCM Presets**

SCM is configured with presets:

| Preset   | Compute Ratio | Speed    | Quality    |
|----------|---------------|----------|------------|
| `none`   | 100%          | Baseline | Best       |
| `slow`   | ~75%          | ~1.3x    | High       |
| `medium` | ~50%          | ~2x      | Good       |
| `fast`   | ~35%          | ~3x      | Acceptable |
| `ultra`  | ~25%          | ~4x      | Lower      |

**Usage**

```bash
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"
```

**Custom SCM Bins**

For fine-grained control over which steps to compute vs cache:

```bash
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"
```

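One way to read the two bin lists is as alternating run lengths: compute 8 steps, cache 1, compute 3, cache 2, and so on. The sketch below expands them into a per-step mask under that reading. `expand_scm_bins` is a hypothetical helper, not part of SGLang or Cache-DiT, and the interleaving order (starting with a compute bin) is an assumption.

```python
def expand_scm_bins(compute_bins: str, cache_bins: str) -> list[int]:
    """Expand comma-separated bin sizes into a per-step mask (1 = compute, 0 = cache).

    Assumes the bins alternate, starting with a compute bin.
    """
    compute = [int(x) for x in compute_bins.split(",")]
    cache = [int(x) for x in cache_bins.split(",")]
    mask: list[int] = []
    for i, n_compute in enumerate(compute):
        mask.extend([1] * n_compute)
        if i < len(cache):
            mask.extend([0] * cache[i])
    return mask


mask = expand_scm_bins("8,3,3,2,2", "1,2,2,2,3")
print(len(mask))  # steps covered by the bins -> 28
print(sum(mask))  # fully computed steps -> 18
```

With the values above, the bins cover 28 steps, which matches the `num_inference_steps: 28` used in the YAML examples.
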
**SCM Policy**

| Policy    | Env Variable                          | Description                                 |
|-----------|---------------------------------------|---------------------------------------------|
| `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) |
| `static`  | `SGLANG_CACHE_DIT_SCM_POLICY=static`  | Fixed caching pattern                       |

## Environment Variables

All Cache-DiT parameters can be configured via environment variables.
See [Environment Variables](../../environment_variables.md) for the complete list.

## Supported Models

Cache-DiT in SGLang Diffusion supports almost all models that SGLang Diffusion itself supports:

| Model Family | Example Models              |
|--------------|-----------------------------|
| Wan          | Wan2.1, Wan2.2              |
| Flux         | FLUX.1-dev, FLUX.2-dev      |
| Z-Image      | Z-Image-Turbo               |
| Qwen         | Qwen-Image, Qwen-Image-Edit |
| Hunyuan      | HunyuanVideo                |

## Performance Tips

1. **Start with defaults**: The default parameters work well for most models
2. **Use TaylorSeer**: It typically improves both speed and quality
3. **Tune the R threshold**: Lower values give better quality, higher values give more speed
4. **SCM for extra speed**: Use the `medium` preset for a good speed/quality balance
5. **Warmup matters**: More warmup steps lead to more stable caching decisions

## Limitations

- **SGLang-native pipelines**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically
  disabled when `world_size > 1`.
- **SCM minimum steps**: SCM requires >= 8 inference steps to be effective
- **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported

## Troubleshooting

### SCM disabled for low step count

For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache
acceleration still works.

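The fallback described above can be pictured as a small guard. This is a sketch of the documented behavior only, not SGLang's code; `effective_scm_preset` and `MIN_SCM_STEPS` are hypothetical names.

```python
MIN_SCM_STEPS = 8  # minimum step count stated in the Limitations section


def effective_scm_preset(preset: str, num_inference_steps: int) -> str:
    """Return the preset that would take effect for the given step count."""
    if num_inference_steps < MIN_SCM_STEPS:
        return "none"  # SCM is disabled; DBCache can still be used
    return preset


print(effective_scm_preset("medium", 28))  # medium
print(effective_scm_preset("medium", 4))   # none
```
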
## References

- [Cache-DiT](https://github.com/vipshop/cache-dit)
- [SGLang Diffusion](../index.md)