sglang/docs/diffusion/api/post_processing.md
2026-03-09 02:06:40 +08:00

# Post-Processing
SGLang diffusion supports optional post-processing steps that run after
generation to improve temporal smoothness (frame interpolation) or spatial
resolution (upscaling). These steps are independent of the diffusion model and
can be combined in a single run.
When both are enabled, **frame interpolation runs first** (increasing the frame
count), then **upscaling runs on every frame** (increasing the spatial
resolution).
---
## Frame Interpolation (video only)
Frame interpolation synthesizes new frames between each pair of consecutive
generated frames, producing smoother motion without re-running the diffusion
model.
The `--frame-interpolation-exp` flag controls how many rounds of interpolation
to apply: each round inserts one new frame into every gap between adjacent
frames, so the output frame count follows the formula:
> **(N - 1) × 2^exp + 1**
>
> where N is the number of generated frames. For example, 5 original frames with `exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames;
> with `exp=2` → (5 - 1) × 4 + 1 = **17** frames.
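
The frame-count formula can be checked with a few lines of Python. This is a minimal sketch for illustration; `interpolated_frame_count` is a hypothetical helper and not part of the SGLang API.

```python
def interpolated_frame_count(num_frames: int, exp: int = 1) -> int:
    """Return the frame count after `exp` rounds of interpolation.

    Each round inserts one new frame into every gap, so the number of
    gaps doubles per round: (N - 1) * 2**exp gaps, plus the final frame.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if exp < 0:
        raise ValueError("exp must be non-negative")
    return (num_frames - 1) * 2**exp + 1


# Apply two rounds one at a time to confirm the closed form.
frames = 5
for _ in range(2):
    frames = frames + (frames - 1)  # one new frame per existing gap

print(frames)                          # 17
print(interpolated_frame_count(5, 1))  # 9
print(interpolated_frame_count(5, 2))  # 17
```

Because the count grows exponentially in `exp`, it can be worth computing the expected output length before launching a long run.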
### CLI Arguments
| Argument | Description |
|----------|-------------|
| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use. |
| `--frame-interpolation-exp {EXP}` | Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`) |
| `--frame-interpolation-scale {SCALE}` | RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`) |
| `--frame-interpolation-model-path {PATH}` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) |
### Supported Models
Frame interpolation uses the [RIFE](https://github.com/hzwer/Practical-RIFE)
(Real-Time Intermediate Flow Estimation) architecture. Only **RIFE 4.22.lite**
(`IFNet` with 4-scale `IFBlock` backbone) is supported. The network topology is
hard-coded, so custom weights provided via `--frame-interpolation-model-path`
must be a `flownet.pkl` checkpoint that is compatible with this architecture.
Other RIFE versions (e.g., older `v4.x` variants with different block counts)
or entirely different frame interpolation methods (FILM, AMT, etc.) are **not
supported**.
| Weight | HuggingFace Repo | Description |
|--------|------------------|-------------|
| RIFE 4.22.lite *(default)* | [`elfgum/RIFE-4.22.lite`](https://huggingface.co/elfgum/RIFE-4.22.lite) | Lightweight model, downloaded automatically on first use |
### Example
Generate a 5-frame video and interpolate to 9 frames ((5 - 1) × 2¹ + 1 = 9):
```bash
sglang generate \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--prompt "A dog running through a park" \
--num-frames 5 \
--enable-frame-interpolation \
--frame-interpolation-exp 1 \
--save-output
```
---
## Upscaling (image and video)
Upscaling increases the spatial resolution of generated images or video frames
using [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN). The model weights
are downloaded automatically on first use and cached for subsequent runs.
### CLI Arguments
| Argument | Description |
|----------|-------------|
| `--enable-upscaling` | Enable post-generation upscaling using Real-ESRGAN. |
| `--upscaling-scale {SCALE}` | Desired upscaling factor (default: `4`). The 4× model is used internally; if a different scale is requested, a bicubic resize is applied after the network output. |
| `--upscaling-model-path {PATH}` | Local `.pth` file, HuggingFace repo ID, or `repo_id:filename` for Real-ESRGAN weights (default: `ai-forever/Real-ESRGAN` with `RealESRGAN_x4.pth`, downloaded automatically). Use the `repo_id:filename` format to specify a custom weight file from a HuggingFace repo (e.g. `my-org/my-esrgan:weights.pth`). |
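
The size arithmetic behind `--upscaling-scale` can be sketched as follows. This is only an illustration of the behavior described in the table: `upscaled_size` is a hypothetical helper, and the rounding of non-integer results is an assumption, since the exact rounding SGLang uses is not stated here.

```python
def upscaled_size(width: int, height: int, scale: float = 4.0,
                  native_scale: int = 4) -> dict:
    """Sketch of the output sizes for a requested upscaling factor.

    The network always produces a `native_scale` (4x) result. If the
    requested factor differs, a resize to the requested size follows.
    Assumption: the final size is round(dimension * scale).
    """
    network = (width * native_scale, height * native_scale)
    final = (round(width * scale), round(height * scale))
    return {
        "network_output": network,
        "final_output": final,
        "needs_resize": final != network,
    }


# Default 4x: the network output is already the final size.
print(upscaled_size(1024, 1024))
# Requested 2x: the 4x network output is resized down afterwards.
print(upscaled_size(1024, 1024, scale=2))
```

Note that the network cost is the same for any requested factor, because the 4× model always runs first.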
### Supported Models
Upscaling supports two Real-ESRGAN network architectures. The correct
architecture is **auto-detected** from the checkpoint keys, so you only need to
point `--upscaling-model-path` at a valid `.pth` file:
| Architecture | Example Weights | Description |
|--------------|-----------------|-------------|
| **RRDBNet** | `RealESRGAN_x4plus.pth` | Heavier model with higher quality; best for photos |
| **SRVGGNetCompact** | `RealESRGAN_x4.pth` *(default)*, `realesr-animevideov3.pth`, `realesr-general-x4v3.pth` | Lightweight model; faster inference, good for video |
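
One way such key-based detection could work is sketched below. This is not SGLang's actual implementation: `guess_esrgan_arch` is a hypothetical helper, and the key names it looks for are assumptions based on the layer naming in the upstream Real-ESRGAN reference code.

```python
def guess_esrgan_arch(state_dict_keys) -> str:
    """Guess the network architecture from checkpoint key names.

    Takes any iterable of parameter names, e.g. the keys of a state
    dict after unwrapping a top-level "params" or "params_ema" entry.
    """
    keys = set(state_dict_keys)
    # Assumption: RRDBNet has a dedicated first conv layer and residual
    # dense blocks whose names contain "rdb1", "rdb2", ...
    if "conv_first.weight" in keys or any(".rdb1." in k for k in keys):
        return "RRDBNet"
    # Assumption: SRVGGNetCompact keeps every layer in one flat "body" list.
    if any(k.startswith("body.") for k in keys):
        return "SRVGGNetCompact"
    raise ValueError("unrecognized checkpoint layout")


print(guess_esrgan_arch(["conv_first.weight", "body.0.rdb1.conv1.weight"]))
print(guess_esrgan_arch(["body.0.weight", "body.0.bias", "body.1.weight"]))
```

The RRDBNet check comes first because both layouts contain keys starting with `body.`.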
The default weight file is
[`ai-forever/Real-ESRGAN`](https://huggingface.co/ai-forever/Real-ESRGAN) with
`RealESRGAN_x4.pth` (SRVGGNetCompact, 4× native scale).
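
The three accepted forms of `--upscaling-model-path` could be told apart roughly as below. This is a sketch under stated assumptions, not SGLang's parser: `parse_upscaling_model_path` is a hypothetical helper, a value ending in `.pth` with no colon is assumed to be a local file, and a bare repo ID is assumed to fall back to the default file name.

```python
DEFAULT_REPO = "ai-forever/Real-ESRGAN"
DEFAULT_FILE = "RealESRGAN_x4.pth"


def parse_upscaling_model_path(spec=None):
    """Return (kind, repo_or_path, filename) for a model path value.

    Caveat: Windows paths with a drive letter (e.g. "C:\\w.pth") contain
    a colon and would need extra handling; this sketch ignores them.
    """
    if not spec:
        return ("hub", DEFAULT_REPO, DEFAULT_FILE)
    if ":" in spec:
        # "repo_id:filename" selects a specific file inside a Hub repo.
        repo_id, filename = spec.split(":", 1)
        return ("hub", repo_id, filename)
    if spec.endswith(".pth"):
        # Assumption: a plain .pth path refers to a local file.
        return ("local", spec, None)
    # Assumption: a bare repo ID uses the default weight file name.
    return ("hub", spec, DEFAULT_FILE)


print(parse_upscaling_model_path())
print(parse_upscaling_model_path("my-org/my-esrgan:weights.pth"))
print(parse_upscaling_model_path("/models/RealESRGAN_x4plus.pth"))
```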
Other super-resolution models (e.g., SwinIR, HAT, BSRGAN) are **not supported**
— only Real-ESRGAN checkpoints using the two architectures above are
compatible.
### Examples
Generate a 1024×1024 image and upscale to 4096×4096:
```bash
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--prompt "A cat sitting on a windowsill" \
--output-size 1024x1024 \
--enable-upscaling \
--save-output
```
Generate a video and upscale each frame by 4×:
```bash
sglang generate \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--prompt "A curious raccoon" \
--enable-upscaling \
--upscaling-scale 4 \
--save-output
```
---
## Combining Frame Interpolation and Upscaling
Frame interpolation and upscaling can be combined in a single run.
Interpolation is applied first (increasing the frame count), then upscaling is
applied to every frame (increasing the spatial resolution).
Example — generate 5 frames, interpolate to 9 frames, and upscale each frame
by 4×:
```bash
sglang generate \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--prompt "A curious raccoon" \
--num-frames 5 \
--enable-frame-interpolation \
--frame-interpolation-exp 1 \
--enable-upscaling \
--upscaling-scale 4 \
--save-output
```
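
The effect of the combined pipeline on the output shape can be estimated ahead of time. The sketch below is illustrative only: `post_processed_shape` is a hypothetical helper, the 832×480 input resolution is an assumed example value, and the rounding of the scaled dimensions is an assumption.

```python
def post_processed_shape(num_frames: int, width: int, height: int,
                         exp: int = 1, scale: float = 4.0):
    """Return (frames, width, height) after interpolation, then upscaling."""
    # Step 1: interpolation increases the frame count.
    frames = (num_frames - 1) * 2**exp + 1
    # Step 2: upscaling is applied to every frame, including the new ones.
    return frames, round(width * scale), round(height * scale)


frames, out_w, out_h = post_processed_shape(5, 832, 480, exp=1, scale=4)
print(frames, out_w, out_h)  # 9 3328 1920
```

Because interpolation runs first, the upscaler processes all 9 frames in this example rather than the 5 generated ones, which is worth keeping in mind when estimating run time.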