mirror of https://github.com/kvcache-ai/sglang.git synced 2026-06-30 19:57:52 +00:00

Mick 6cc5717e8a [diffusion] doc: update quantization.md (#21356)

2026-03-25 14:48:38 +08:00

Quantization

SGLang-Diffusion supports quantized transformer checkpoints. In most cases, pass the base model and the quantized transformer override as separate arguments.

Quick Reference

Use these path flags:

--model-path: the base or original model
--transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
--transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID

Recommended example:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "a curious pikachu"

For quantized transformers-style transformer component folders:

sglang generate \
  --model-path /path/to/base-model \
  --transformer-path /path/to/quantized-transformer \
  --prompt "A Logo With Bold Large Text: SGL Diffusion"

NOTE: Some model-specific integrations also accept a quantized repo or local directory directly as --model-path, but that is a compatibility path. If a repo contains multiple candidate checkpoints, pass --transformer-weights-path explicitly.

Quant Families

Here, quant_family means a checkpoint and loading family with shared CLI usage and loader behavior. It is not just the numeric precision or a kernel backend.

quant_family	checkpoint form	canonical CLI	supported models	extra dependency	platform / notes
`fp8`	Quantized transformer component folder, or safetensors with `quantization_config` metadata	`--transformer-path` or `--transformer-weights-path`	All	None	Component-folder and single-file flows are both supported
`nvfp4-modelopt`	NVFP4 safetensors file, sharded directory, or repo providing transformer weights	`--transformer-weights-path`	FLUX.2	`comfy-kitchen` optional on Blackwell	Blackwell can use the `comfy-kitchen` best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path
`nunchaku-svdq`	Pre-quantized Nunchaku transformer weights, usually named `svdq-{int4|fp4}_r{rank}-...`	`--transformer-weights-path`	Model-specific support such as Qwen-Image, FLUX, and Z-Image	`nunchaku`	SGLang can infer precision and rank from the filename and supports both `int4` and `nvfp4`
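
As a rough illustration of how the table maps a quant family to its canonical flag, the sketch below assembles a `sglang generate` argument list. The `CANONICAL_FLAGS` dictionary and the `build_command` helper are hypothetical names transcribed from the table for this example. They are not part of SGLang. Only the flags themselves come from this page.

```python
# Hypothetical lookup transcribed from the table above.
# This is an illustration, not SGLang's internal data structure.
CANONICAL_FLAGS = {
    "fp8": ("--transformer-path", "--transformer-weights-path"),
    "nvfp4-modelopt": ("--transformer-weights-path",),
    "nunchaku-svdq": ("--transformer-weights-path",),
}


def build_command(quant_family, model_path, checkpoint, prompt, component_dir=False):
    """Assemble an argument list for `sglang generate` (hypothetical helper).

    component_dir=True means `checkpoint` is a transformer component folder
    with its own config.json, which is passed via --transformer-path.
    """
    flag = "--transformer-path" if component_dir else "--transformer-weights-path"
    if flag not in CANONICAL_FLAGS[quant_family]:
        raise ValueError(f"{flag} is not a canonical flag for {quant_family}")
    return [
        "sglang", "generate",
        "--model-path", model_path,
        flag, checkpoint,
        "--prompt", prompt,
    ]
```

Calling `build_command("nvfp4-modelopt", "black-forest-labs/FLUX.2-dev", "black-forest-labs/FLUX.2-dev-NVFP4", "a curious pikachu")` yields the same arguments as the recommended example in Quick Reference.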

NVFP4

Usage Examples

Recommended usage keeps the base model and quantized transformer override separate:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev \
  --transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

SGLang also supports passing the NVFP4 repo or local directory directly as --model-path:

sglang generate \
  --model-path black-forest-labs/FLUX.2-dev-NVFP4 \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

Notes

--transformer-weights-path remains the canonical flag for NVFP4 transformer checkpoints.
Direct --model-path loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
For local directories, SGLang first looks for *-mixed.safetensors, then falls back to loading from the directory.
On Blackwell, comfy-kitchen can provide the best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
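
The precedence and lookup order described in these notes can be sketched as follows. This is a minimal illustration, not SGLang's loader: both function names are hypothetical, and picking the first match in sorted order when several `*-mixed.safetensors` files exist is an assumption made only to keep the example deterministic.

```python
from pathlib import Path


def pick_transformer_source(model_path, transformer_weights_path=None):
    """An explicit --transformer-weights-path wins over the --model-path flow."""
    if transformer_weights_path is not None:
        return transformer_weights_path
    return model_path


def resolve_nvfp4_weights(local_dir):
    """Follow the stated order for local directories (hypothetical helper).

    1. Prefer a file matching *-mixed.safetensors.
    2. Otherwise fall back to the directory itself.
    """
    directory = Path(local_dir)
    # Assumption: with several matches, take the first in sorted order.
    mixed = sorted(directory.glob("*-mixed.safetensors"))
    if mixed:
        return mixed[0]
    return directory
```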

Nunchaku (SVDQuant)

Install

Install the runtime dependency first:

pip install nunchaku

For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.

File Naming and Auto-Detection

For Nunchaku checkpoints, --model-path should still point to the original base model, while --transformer-weights-path points to the quantized transformer weights.

If the basename of --transformer-weights-path contains the pattern svdq-(int4|fp4)_r{rank}, SGLang will automatically:

enable SVDQuant
infer --quantization-precision
infer --quantization-rank

Examples:

checkpoint name fragment	inferred precision	inferred rank	notes
`svdq-int4_r32`	`int4`	`32`	Standard INT4 checkpoint
`svdq-int4_r128`	`int4`	`128`	Higher-quality INT4 checkpoint
`svdq-fp4_r32`	`nvfp4`	`32`	`fp4` in the filename maps to CLI value `nvfp4`
`svdq-fp4_r128`	`nvfp4`	`128`	Higher-quality NVFP4 checkpoint

Common filenames:

filename	precision	rank	typical use
`svdq-int4_r32-qwen-image.safetensors`	`int4`	`32`	Balanced default
`svdq-int4_r128-qwen-image.safetensors`	`int4`	`128`	Quality-focused
`svdq-fp4_r32-qwen-image.safetensors`	`nvfp4`	`32`	RTX 50-series / NVFP4 path
`svdq-fp4_r128-qwen-image.safetensors`	`nvfp4`	`128`	Quality-focused NVFP4
`svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors`	`int4`	`32`	Lightning 4-step
`svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors`	`int4`	`128`	Lightning 8-step

If your checkpoint name does not follow this convention, pass --enable-svdquant, --quantization-precision, and --quantization-rank explicitly.
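
To check ahead of time whether a checkpoint name will be auto-detected, the naming rule above can be approximated with a regular expression. This is a sketch of the documented convention, not SGLang's actual parser. The function name `infer_svdq_settings` is hypothetical, and returning `None` for a non-matching name stands in for the case where the explicit flags are required.

```python
import os
import re

# Pattern described above: svdq-(int4|fp4)_r{rank}
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")

# `fp4` in a filename corresponds to the CLI value `nvfp4`.
_FILENAME_TO_CLI_PRECISION = {"int4": "int4", "fp4": "nvfp4"}


def infer_svdq_settings(transformer_weights_path):
    """Return (precision, rank) inferred from the basename, or None.

    Only the basename is inspected, so a matching parent directory
    name does not trigger detection.
    """
    basename = os.path.basename(transformer_weights_path.rstrip("/"))
    match = _SVDQ_PATTERN.search(basename)
    if match is None:
        return None
    return _FILENAME_TO_CLI_PRECISION[match.group(1)], int(match.group(2))
```

A `None` result means the name does not follow the convention, so `--enable-svdquant`, `--quantization-precision`, and `--quantization-rank` need to be passed explicitly.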

Usage Examples

Recommended auto-detected flow:

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
  --prompt "change the raccoon to a cute cat" \
  --attention-backend torch_sdpa \
  --save-output

Manual override when the filename does not encode the quant settings:

sglang generate \
  --model-path Qwen/Qwen-Image \
  --transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
  --enable-svdquant \
  --quantization-precision int4 \
  --quantization-rank 128 \
  --prompt "a beautiful sunset" \
  --attention-backend torch_sdpa \
  --save-output

Notes

--transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
Auto-detection only happens when the checkpoint basename matches svdq-(int4|fp4)_r{rank}.
The CLI values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
Lightning checkpoints usually expect matching --num-inference-steps, such as 4 or 8.
Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) or SM12x GPUs. Hopper (SM90) is currently rejected.
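
The last two notes can be turned into quick pre-flight checks. The sketch below only restates the rules on this page. Both helper names are hypothetical, and reading a step count from an `-{N}steps` filename suffix is an assumption based on the Lightning filenames listed earlier, not a documented SGLang feature.

```python
import re

# Compute-capability major versions the notes describe as accepted:
# SM8x and SM12x. Everything else, including Hopper (SM90), is rejected.
ALLOWED_SM_MAJORS = (8, 12)


def nunchaku_gpu_allowed(capability):
    """capability is a (major, minor) tuple such as (8, 6) or (9, 0)."""
    major, _minor = capability
    return major in ALLOWED_SM_MAJORS


def lightning_steps_hint(checkpoint_name):
    """Return the step count encoded as `-{N}steps` in the name, or None."""
    match = re.search(r"-(\d+)steps", checkpoint_name)
    return int(match.group(1)) if match else None
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple in the shape `nunchaku_gpu_allowed` expects.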
