mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 04:08:10 +00:00

Files

yuefeng Wu a20d12ae96 [diffusion][doc]: add ring sp performance benchmark page (#20998 )

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

2026-03-30 20:26:05 +03:00

2.1 KiB

Raw Blame History

Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)

This page reports Ring-SP performance for Wan2.2-TI2V-5B-Diffusers using:

Parallel config: sp=2, ulysses=1, ring=2 (short: u1r2)
Baseline config: sp=1, ulysses=1, ring=1 (short: u1r1)

Benchmark Setup

Model: Wan2.2-TI2V-5B-Diffusers
GPU: 48G RTX40 series * 2

Online Serving

Ring SP (`u1r2`)

sglang serve \
  --model-type diffusion \
  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
  --port 8898

Baseline (`u1r1`)

sglang serve \
  --model-type diffusion \
  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
  --port 8898

Benchmarks

Benchmark Disclaimer

These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.

Stage Time Breakdown

Stage / Metric	`u1r2` (s)	`u1r1` baseline (s)	Speedup
InputValidation	0.1060	0.1029	0.97x
TextEncoding	1.3965	2.2261	1.59x
LatentPreparation	0.0002	0.0002	1.00x
TimestepPreparation	0.0003	0.0004	1.33x
Denoising	52.6358	71.6785	1.36x
Decoding	7.6708	13.4314	1.75x
Total	63.74	90.63	1.42x

Memory Usage

Memory Metric	`u1r2` (GB)	`u1r1` baseline (GB)	Delta
Peak GPU Memory	20.07	27.40	-7.33
Peak Allocated	13.35	20.40	-7.05
Memory Overhead	6.72	7.00	-0.28
Overhead Ratio	33.5%	25.6%	+7.9pp

Summary

End-to-end latency improves from 90.63s to 63.74s (1.42x).
Main gains come from Denoising (1.36x) and Decoding (1.75x).
Absolute memory usage drops noticeably on Ring-SP (Peak GPU Memory -7.33GB, Peak Allocated -7.05GB).
Overhead ratio rises (+7.9pp), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.

2.1 KiB Raw Blame History