mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
---
title: "Performance Optimization"
description: "Optimize SGLang diffusion performance with caching, kernels, and profiling."
---

SGLang-Diffusion provides several strategies for accelerating inference. This page summarizes the available performance tuning options and links to the detailed guide for each.

## Overview

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
<colgroup>
<col style={{width: "22%"}} />
<col style={{width: "18%"}} />
<col style={{width: "60%"}} />
</colgroup>
<thead>
<tr style={{borderBottom: "2px solid #d55816"}}>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Optimization</th>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Type</th>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Cache-DiT</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Caching</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Block-level caching with DBCache, TaylorSeer, and SCM</td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>TeaCache</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Caching</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Timestep-level caching using L1 similarity</td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Attention Backends</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Kernel</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Optimized attention implementations (FlashAttention, SageAttention, etc.)</td>
</tr>
<tr>
<td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Profiling</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Diagnostics</td>
<td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>PyTorch Profiler and Nsight Systems guidance</td>
</tr>
</tbody>
</table>

## Caching Strategies

SGLang-Diffusion supports two complementary caching approaches:

### Cache-DiT

[Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with advanced strategies (DBCache, TaylorSeer, and SCM). It can deliver a speedup of up to **1.69x**.

**Quick Start:**
```bash
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
  --prompt "A beautiful sunset over the mountains"
```

**Key Features:**
- **DBCache**: Dynamic block-level caching based on residual differences
- **TaylorSeer**: Taylor expansion-based calibration for optimized caching
- **SCM**: Step-level computation masking for additional speedup

See [Cache-DiT documentation](./cache-dit) for detailed configuration.

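The residual-difference idea behind block-level caching can be illustrated with a small, dependency-free sketch. This is not the Cache-DiT API: the class `BlockCacheSketch`, the split into `probe_blocks` and `tail_blocks`, and the relative-L1 criterion are assumptions chosen for illustration. The sketch always runs a few leading blocks, compares their residual with the one recorded at the last fully computed step, and reuses the cached residual of the remaining blocks when the change is small.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def relative_l1(current: Vector, reference: Vector) -> float:
    """L1 distance between two vectors, relative to the reference's L1 norm."""
    diff = sum(abs(c - r) for c, r in zip(current, reference))
    norm = sum(abs(r) for r in reference)
    return diff / norm if norm > 0 else float("inf")


class BlockCacheSketch:
    """Hypothetical illustration of residual-based block caching."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.ref_probe_residual: Optional[Vector] = None
        self.cached_tail_residual: Optional[Vector] = None

    def step(
        self,
        hidden: Vector,
        probe_blocks: Callable[[Vector], Vector],
        tail_blocks: Callable[[Vector], Vector],
    ) -> Tuple[Vector, bool]:
        """Run one denoising step; return (output, tail_was_skipped)."""
        # The leading blocks always run and act as a cheap probe.
        probed = probe_blocks(hidden)
        probe_residual = [p - h for p, h in zip(probed, hidden)]

        if (
            self.ref_probe_residual is not None
            and self.cached_tail_residual is not None
            and relative_l1(probe_residual, self.ref_probe_residual) < self.threshold
        ):
            # Probe barely changed: reuse the cached residual of the remaining blocks.
            return [p + r for p, r in zip(probed, self.cached_tail_residual)], True

        # Probe changed too much: run the remaining blocks and refresh the cache.
        out = tail_blocks(probed)
        self.ref_probe_residual = probe_residual
        self.cached_tail_residual = [o - p for o, p in zip(out, probed)]
        return out, False
```

A lower `threshold` skips fewer steps and stays closer to the uncached output, while a higher one trades accuracy for speed. The actual thresholds and block counts used by Cache-DiT are described in its own documentation.
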
### TeaCache

TeaCache (Timestep Embedding Aware Cache) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough that a step's computation can be skipped and a cached result reused instead.

**Quick Overview:**
- Tracks the L1 distance between modulated inputs across timesteps
- When the accumulated distance is below a threshold, reuses the cached residual
- Supports classifier-free guidance (CFG) with separate positive/negative caches

**Supported Models:** Wan (wan2.1, wan2.2), Hunyuan (HunyuanVideo), Z-Image

See [TeaCache documentation](./tea-cache) for detailed configuration.

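The skip decision described in the overview bullets can be sketched in a few lines of plain Python. The class name `TeaCacheSketch`, its method signature, and the relative-L1 normalization are assumptions made for illustration, not SGLang's implementation. Any rescaling of the distance that a real implementation may apply is omitted.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def relative_l1(current: Vector, previous: Vector) -> float:
    """L1 distance between two vectors, relative to the previous one's L1 norm."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    norm = sum(abs(p) for p in previous)
    return diff / norm if norm > 0 else float("inf")


class TeaCacheSketch:
    """Hypothetical illustration of timestep-level caching."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev_input: Optional[Vector] = None
        self.cached_residual: Optional[Vector] = None

    def step(
        self,
        modulated_input: Vector,
        hidden: Vector,
        run_blocks: Callable[[Vector], Vector],
    ) -> Tuple[Vector, bool]:
        """Run one denoising step; return (output, step_was_skipped)."""
        skip = False
        if self.prev_input is not None and self.cached_residual is not None:
            # Accumulate the change in the modulated input since the last step.
            self.accumulated += relative_l1(modulated_input, self.prev_input)
            skip = self.accumulated < self.threshold
        self.prev_input = list(modulated_input)

        if skip:
            # Inputs are still close: reuse the residual from the last full step.
            return [h + r for h, r in zip(hidden, self.cached_residual)], True

        # Too much drift (or first step): compute fully and reset the accumulator.
        out = run_blocks(hidden)
        self.cached_residual = [o - h for o, h in zip(out, hidden)]
        self.accumulated = 0.0
        return out, False
```

Under this sketch, separate positive and negative CFG caches would simply be two independent `TeaCacheSketch` instances, one per guidance branch.
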
## Attention Backends

Attention backends differ in performance depending on your hardware and model:

- **FlashAttention**: Typically the fastest option on NVIDIA GPUs with fp16/bf16
- **SageAttention**: Alternative optimized implementation
- **xformers**: Memory-efficient attention
- **SDPA**: PyTorch native scaled dot-product attention

See [Attention backends](./attention-backends) for platform support and configuration options.

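For orientation, the operation these backends accelerate is scaled dot-product attention, `softmax(Q K^T / sqrt(d)) V`. The pure-Python reference below is a readability aid only (single head, no masking, no batching). The function names are illustrative and it makes no claim about how any backend is implemented. Optimized backends compute the same result, in some cases approximately, with fused kernels and a lower memory footprint.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference softmax(Q K^T / sqrt(d)) V for a single head, no masking."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out: Matrix = []
    for q_row in q:
        # Similarity of this query with every key, scaled by 1/sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out
```
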
## Profiling

To diagnose performance bottlenecks, SGLang-Diffusion supports the following profiling tools:

- **PyTorch Profiler**: Built-in Python profiling
- **Nsight Systems**: GPU kernel-level analysis

See [Profiling guide](./profiling) for detailed instructions.

## References

- [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
- [TeaCache Paper](https://arxiv.org/abs/2411.14324)