Quantization

SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate.

Quick Reference

Use these paths:

  • --model-path: the base or original model
  • --transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
  • --transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID

Recommended example:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "a curious pikachu"

For quantized transformers-style transformer component folders:

sglang generate \
  --model-path /path/to/base-model \
  --transformer-path /path/to/quantized-transformer \
  --prompt "A Logo With Bold Large Text: SGL Diffusion"

NOTE: Some model-specific integrations also accept a quantized repo or local directory directly as --model-path, but that is a compatibility path. If a repo contains multiple candidate checkpoints, pass --transformer-weights-path explicitly.
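As a rough illustration of how the three path options map onto a command line, the sketch below assembles an `sglang generate` argument list in Python. The helper name `build_generate_argv` is hypothetical and not part of SGLang; only flags mentioned above are used. The source does not say whether `--transformer-path` and `--transformer-weights-path` may be combined, so the sketch simply forwards whichever is given.

```python
import shlex


def build_generate_argv(model_path, prompt, transformer_path=None,
                        transformer_weights_path=None):
    """Assemble an `sglang generate` argument list (illustrative helper, not an SGLang API)."""
    argv = ["sglang", "generate", "--model-path", model_path]
    # Quantized component directory that ships its own config.json.
    if transformer_path is not None:
        argv += ["--transformer-path", transformer_path]
    # Single safetensors file, sharded directory, local path, or Hugging Face repo ID.
    if transformer_weights_path is not None:
        argv += ["--transformer-weights-path", transformer_weights_path]
    argv += ["--prompt", prompt]
    return argv


argv = build_generate_argv(
    "black-forest-labs/FLUX.2-dev",
    "a curious pikachu",
    transformer_weights_path="black-forest-labs/FLUX.2-dev-NVFP4",
)
# Print a shell-quoted command line that can be copied into a terminal.
print(shlex.join(argv))
```

Printing the command instead of running it keeps the sketch free of assumptions about the local SGLang installation.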

Quant Families

Here, quant_family denotes a group of checkpoints that share a checkpoint format, CLI usage, and loader behavior. It is not just the numeric precision or a kernel backend.

| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
| --- | --- | --- | --- | --- | --- |
| fp8 | Quantized transformer component folder, or safetensors with quantization_config metadata | --transformer-path or --transformer-weights-path | All | None | Component-folder and single-file flows are both supported |
| nvfp4-modelopt | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | --transformer-weights-path | FLUX.2 | comfy-kitchen (optional, on Blackwell) | On Blackwell, comfy-kitchen can provide the best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| nunchaku-svdq | Pre-quantized Nunchaku transformer weights, usually named svdq-{int4\|fp4}_r{rank}-... | --transformer-weights-path | Model-specific support such as Qwen-Image, FLUX, and Z-Image | nunchaku | SGLang can infer precision and rank from the filename and supports both int4 and nvfp4 |

NVFP4

Usage Examples

Recommended usage keeps the base model and quantized transformer override separate:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

SGLang also supports passing the NVFP4 repo or local directory directly as --model-path:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

Notes

  • --transformer-weights-path remains the canonical flag for NVFP4 transformer checkpoints.
  • Direct --model-path loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
  • If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
  • For local directories, SGLang first looks for *-mixed.safetensors, then falls back to loading from the directory.
  • On Blackwell, comfy-kitchen can provide the best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
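The precedence and local-directory rules in these notes can be sketched in Python as follows. This is a minimal illustration, not SGLang's loader: the function name `resolve_nvfp4_source` is hypothetical, and picking the first match in sorted order when several `*-mixed.safetensors` files exist is an assumption the source does not cover.

```python
from pathlib import Path


def resolve_nvfp4_source(model_path, transformer_weights_path=None):
    """Pick the NVFP4 weights source (hypothetical helper mirroring the notes above)."""
    # An explicit --transformer-weights-path takes precedence over --model-path.
    if transformer_weights_path is not None:
        return transformer_weights_path
    candidate = Path(model_path)
    if candidate.is_dir():
        # For local directories, look for a *-mixed.safetensors file first.
        # Assumption: if several match, take the first in sorted order.
        mixed = sorted(candidate.glob("*-mixed.safetensors"))
        if mixed:
            return str(mixed[0])
    # Otherwise fall back to the directory (or repo ID) exactly as given.
    return model_path
```

A repo ID such as `black-forest-labs/FLUX.2-dev-NVFP4` is returned unchanged unless a local directory with that relative path happens to exist; how the real loader resolves repo contents is outside this sketch.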

Nunchaku (SVDQuant)

Install

Install the runtime dependency first:

pip install nunchaku

For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.

File Naming and Auto-Detection

For Nunchaku checkpoints, --model-path should still point to the original base model, while --transformer-weights-path points to the quantized transformer weights.

If the basename of --transformer-weights-path contains the pattern svdq-(int4|fp4)_r{rank}, SGLang will automatically:

  • enable SVDQuant
  • infer --quantization-precision
  • infer --quantization-rank

Examples:

| checkpoint name fragment | inferred precision | inferred rank | notes |
| --- | --- | --- | --- |
| svdq-int4_r32 | int4 | 32 | Standard INT4 checkpoint |
| svdq-int4_r128 | int4 | 128 | Higher-quality INT4 checkpoint |
| svdq-fp4_r32 | nvfp4 | 32 | fp4 in the filename maps to CLI value nvfp4 |
| svdq-fp4_r128 | nvfp4 | 128 | Higher-quality NVFP4 checkpoint |

Common filenames:

| filename | precision | rank | typical use |
| --- | --- | --- | --- |
| svdq-int4_r32-qwen-image.safetensors | int4 | 32 | Balanced default |
| svdq-int4_r128-qwen-image.safetensors | int4 | 128 | Quality-focused |
| svdq-fp4_r32-qwen-image.safetensors | nvfp4 | 32 | RTX 50-series / NVFP4 path |
| svdq-fp4_r128-qwen-image.safetensors | nvfp4 | 128 | Quality-focused NVFP4 |
| svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors | int4 | 32 | Lightning 4-step |
| svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors | int4 | 128 | Lightning 8-step |

If your checkpoint name does not follow this convention, pass --enable-svdquant, --quantization-precision, and --quantization-rank explicitly.
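The naming convention above can be approximated with a small regular expression. The following Python sketch is illustrative only: `infer_svdq_settings` and the exact regex are assumptions, not SGLang's actual detection code. It returns the CLI precision and rank when the basename matches, and `None` when the settings would have to be passed explicitly.

```python
import re
from pathlib import Path

# Regex written from the documented pattern svdq-(int4|fp4)_r{rank}; an illustrative guess.
SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")
# "fp4" in filenames corresponds to the CLI value "nvfp4".
PRECISION_TO_CLI = {"int4": "int4", "fp4": "nvfp4"}


def infer_svdq_settings(weights_path):
    """Return (precision, rank) inferred from the basename, or None if it does not match."""
    # Only the basename is inspected, not the parent directories.
    match = SVDQ_PATTERN.search(Path(weights_path).name)
    if match is None:
        return None
    return PRECISION_TO_CLI[match.group(1)], int(match.group(2))


print(infer_svdq_settings("/path/to/svdq-fp4_r128-qwen-image.safetensors"))    # ('nvfp4', 128)
print(infer_svdq_settings("/path/to/custom_nunchaku_checkpoint.safetensors"))  # None
```

A `None` result corresponds to the case where --enable-svdquant, --quantization-precision, and --quantization-rank need to be supplied by hand.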

Usage Examples

Recommended auto-detected flow:

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
  --prompt "change the raccoon to a cute cat" \
  --attention-backend torch_sdpa \
  --save-output

Manual override when the filename does not encode the quant settings:

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
  --enable-svdquant \
  --quantization-precision int4 \
  --quantization-rank 128 \
  --prompt "a beautiful sunset" \
  --attention-backend torch_sdpa \
  --save-output

Notes

  • --transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
  • Auto-detection only happens when the checkpoint basename matches svdq-(int4|fp4)_r{rank}.
  • The CLI values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
  • Lightning checkpoints usually expect matching --num-inference-steps, such as 4 or 8.
  • Current runtime validation allows Nunchaku only on NVIDIA CUDA GPUs with Ampere (SM8x) or SM12x compute capability. Hopper (SM90) is currently rejected.
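The GPU restriction in the last note amounts to a check on the major compute capability. The sketch below is a hedged illustration of that rule, not SGLang's validation code; the function name `nunchaku_gpu_supported` is hypothetical.

```python
def nunchaku_gpu_supported(major, minor):
    """Return True for SM8x or SM12x compute capabilities (illustrative check only)."""
    # minor is accepted for completeness; per the note above only the major version matters.
    return major in (8, 12)


for capability in [(8, 0), (8, 6), (9, 0), (12, 0)]:
    status = "supported" if nunchaku_gpu_supported(*capability) else "rejected"
    print(f"SM{capability[0]}{capability[1]}: {status}")
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed to this function to check a setup before installing Nunchaku.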