mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-07-01 04:08:10 +00:00
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2.1 KiB
2.1 KiB
Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)
This page reports Ring-SP performance for Wan2.2-TI2V-5B-Diffusers using:
- Parallel config:
sp=2, ulysses=1, ring=2(short:u1r2) - Baseline config:
sp=1, ulysses=1, ring=1(short:u1r1)
Benchmark Setup
- Model:
Wan2.2-TI2V-5B-Diffusers - GPU:
48G RTX40 series * 2
Online Serving
Ring SP (u1r2)
sglang serve \
--model-type diffusion \
--model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
--port 8898
Baseline (u1r1)
sglang serve \
--model-type diffusion \
--model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
--port 8898
Benchmarks
Benchmark Disclaimer
These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.
Stage Time Breakdown
| Stage / Metric | u1r2 (s) |
u1r1 baseline (s) |
Speedup |
|---|---|---|---|
| InputValidation | 0.1060 | 0.1029 | 0.97x |
| TextEncoding | 1.3965 | 2.2261 | 1.59x |
| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
| Denoising | 52.6358 | 71.6785 | 1.36x |
| Decoding | 7.6708 | 13.4314 | 1.75x |
| Total | 63.74 | 90.63 | 1.42x |
Memory Usage
| Memory Metric | u1r2 (GB) |
u1r1 baseline (GB) |
Delta |
|---|---|---|---|
| Peak GPU Memory | 20.07 | 27.40 | -7.33 |
| Peak Allocated | 13.35 | 20.40 | -7.05 |
| Memory Overhead | 6.72 | 7.00 | -0.28 |
| Overhead Ratio | 33.5% | 25.6% | +7.9pp |
Summary
- End-to-end latency improves from
90.63sto63.74s(1.42x). - Main gains come from
Denoising(1.36x) andDecoding(1.75x). - Absolute memory usage drops noticeably on Ring-SP (
Peak GPU Memory -7.33GB,Peak Allocated -7.05GB). - Overhead ratio rises (
+7.9pp), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.