sglang/docs/diffusion/api/post_processing.md

# Post-Processing

SGLang diffusion supports optional post-processing steps that run after
generation to improve temporal smoothness (frame interpolation) or spatial
resolution (upscaling). These steps are independent of the diffusion model and
can be combined in a single run.

When both are enabled, **frame interpolation runs first** (increasing the frame
count), then **upscaling runs on every frame** (increasing the spatial
resolution).

---

## Frame Interpolation (video only)

Frame interpolation synthesizes new frames between each pair of consecutive
generated frames, producing smoother motion without re-running the diffusion
model.

The `--frame-interpolation-exp` flag controls how many rounds of interpolation
to apply: each round inserts one new frame into every gap between adjacent
frames, so the output frame count follows the formula:

> **(N − 1) × 2^exp + 1**
>
> e.g. 5 original frames with `exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames;
> with `exp=2` → **17** frames.

### CLI Arguments

| Argument | Description |
|----------|-------------|
| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use. |
| `--frame-interpolation-exp {EXP}` | Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`) |
| `--frame-interpolation-scale {SCALE}` | RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`) |
| `--frame-interpolation-model-path {PATH}` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) |

### Supported Models

Frame interpolation uses the [RIFE](https://github.com/hzwer/Practical-RIFE)
(Real-Time Intermediate Flow Estimation) architecture. Only **RIFE 4.22.lite**
(`IFNet` with 4-scale `IFBlock` backbone) is supported. The network topology is
hard-coded, so custom weights provided via `--frame-interpolation-model-path`
must be a `flownet.pkl` checkpoint that is compatible with this architecture.

Other RIFE versions (e.g., older `v4.x` variants with different block counts)
or entirely different frame interpolation methods (FILM, AMT, etc.) are **not
supported**.

| Weight | HuggingFace Repo | Description |
|--------|------------------|-------------|
| RIFE 4.22.lite *(default)* | [`elfgum/RIFE-4.22.lite`](https://huggingface.co/elfgum/RIFE-4.22.lite) | Lightweight model, downloaded automatically on first use |

### Example

Generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9):

```bash
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```

---

## Upscaling (image and video)

Upscaling increases the spatial resolution of generated images or video frames
using [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN). The model weights
are downloaded automatically on first use and cached for subsequent runs.

### CLI Arguments

| Argument | Description |
|----------|-------------|
| `--enable-upscaling` | Enable post-generation upscaling using Real-ESRGAN. |
| `--upscaling-scale {SCALE}` | Desired upscaling factor (default: `4`). The 4× model is used internally; if a different scale is requested, a bicubic resize is applied after the network output. |
| `--upscaling-model-path {PATH}` | Local `.pth` file, HuggingFace repo ID, or `repo_id:filename` for Real-ESRGAN weights (default: `ai-forever/Real-ESRGAN` with `RealESRGAN_x4.pth`, downloaded automatically). Use the `repo_id:filename` format to specify a custom weight file from a HuggingFace repo (e.g. `my-org/my-esrgan:weights.pth`). |

### Supported Models

Upscaling supports two Real-ESRGAN network architectures. The correct
architecture is **auto-detected** from the checkpoint keys, so you only need to
point `--upscaling-model-path` at a valid `.pth` file:

| Architecture | Example Weights | Description |
|--------------|-----------------|-------------|
| **RRDBNet** | `RealESRGAN_x4plus.pth` | Heavier model with higher quality; best for photos |
| **SRVGGNetCompact** | `RealESRGAN_x4.pth` *(default)*, `realesr-animevideov3.pth`, `realesr-general-x4v3.pth` | Lightweight model; faster inference, good for video |

The default weight file is
[`ai-forever/Real-ESRGAN`](https://huggingface.co/ai-forever/Real-ESRGAN) with
`RealESRGAN_x4.pth` (SRVGGNetCompact, 4× native scale).

Other super-resolution models (e.g., SwinIR, HAT, BSRGAN) are **not supported**
— only Real-ESRGAN checkpoints using the two architectures above are
compatible.

### Examples

Generate a 1024×1024 image and upscale to 4096×4096:

```bash
sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --prompt "A cat sitting on a windowsill" \
  --output-size 1024x1024 \
  --enable-upscaling \
  --save-output
```

Generate a video and upscale each frame by 4×:

```bash
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --enable-upscaling \
  --upscaling-scale 4 \
  --save-output
```

---

## Combining Frame Interpolation and Upscaling

Frame interpolation and upscaling can be combined in a single run.
Interpolation is applied first (increasing the frame count), then upscaling is
applied to every frame (increasing the spatial resolution).

Example — generate 5 frames, interpolate to 9 frames, and upscale each frame
by 4×:

```bash
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A curious raccoon" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --enable-upscaling \
  --upscaling-scale 4 \
  --save-output
```