Quantization
SGLang-Diffusion supports quantized transformer checkpoints. In most cases, keep the base model and the quantized transformer override separate.
Quick Reference
Use these paths:
- --model-path: the base or original model
- --transformer-path: a quantized transformers-style transformer component directory that already contains its own config.json
- --transformer-weights-path: quantized transformer weights provided as a single safetensors file, a sharded safetensors directory, a local path, or a Hugging Face repo ID
Recommended example:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "a curious pikachu"
For quantized transformers-style transformer component folders:
sglang generate \
--model-path /path/to/base-model \
--transformer-path /path/to/quantized-transformer \
--prompt "A Logo With Bold Large Text: SGL Diffusion"
NOTE: Some model-specific integrations also accept a quantized repo or local
directory directly as --model-path, but that is a compatibility path. If a
repo contains multiple candidate checkpoints, pass
--transformer-weights-path explicitly.
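The flag choice described in this section can be sketched as a small helper. This is an illustration only, not SGLang's loader: the name pick_transformer_flag is hypothetical, and the only rule it encodes is the one stated above, that a component directory carries its own config.json.

```python
from pathlib import Path


def pick_transformer_flag(path: str) -> str:
    """Hypothetical helper: suggest which CLI flag fits a quantized transformer path.

    Assumption: a local directory that contains config.json is treated as a
    transformers-style component folder. Anything else (a single safetensors
    file, a sharded directory, or a Hugging Face repo ID) is treated as weights.
    """
    p = Path(path)
    if p.is_dir() and (p / "config.json").is_file():
        return "--transformer-path"
    return "--transformer-weights-path"
```

A repo ID such as black-forest-labs/FLUX.2-dev-NVFP4 does not exist as a local directory, so this sketch maps it to --transformer-weights-path, which matches the recommended example.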
Quant Families
Here, quant_family means a checkpoint and loading family with shared CLI
usage and loader behavior. It is not just the numeric precision or a kernel
backend.
| quant_family | checkpoint form | canonical CLI | supported models | extra dependency | platform / notes |
|---|---|---|---|---|---|
| fp8 | Quantized transformer component folder, or safetensors with quantization_config metadata | --transformer-path or --transformer-weights-path | ALL | None | Component-folder and single-file flows are both supported |
| nvfp4-modelopt | NVFP4 safetensors file, sharded directory, or repo providing transformer weights | --transformer-weights-path | FLUX.2 | comfy-kitchen optional on Blackwell | Blackwell can use a best-performance kit when available; otherwise SGLang falls back to the generic ModelOpt FP4 path |
| nunchaku-svdq | Pre-quantized Nunchaku transformer weights, usually named svdq-{int4\|fp4}_r{rank}-... | --transformer-weights-path | Model-specific support such as Qwen-Image, FLUX, and Z-Image | nunchaku | SGLang can infer precision and rank from the filename and supports both int4 and nvfp4 |
NVFP4
Usage Examples
Recommended usage keeps the base model and quantized transformer override separate:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev \
--transformer-weights-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
SGLang also supports passing the NVFP4 repo or local directory directly as
--model-path:
sglang generate \
--model-path black-forest-labs/FLUX.2-dev-NVFP4 \
--prompt "A Logo With Bold Large Text: SGL Diffusion" \
--save-output
Notes
- --transformer-weights-path is still the canonical CLI for NVFP4 transformer checkpoints.
- Direct --model-path loading is a compatibility path for FLUX.2 NVFP4-style repos or local directories.
- If --transformer-weights-path is provided explicitly, it takes precedence over the compatibility --model-path flow.
- For local directories, SGLang first looks for *-mixed.safetensors, then falls back to loading from the directory.
- On Blackwell, comfy-kitchen can provide the best-performance path when available; otherwise SGLang falls back to the generic ModelOpt FP4 path.
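The local-directory lookup order in the notes above can be sketched as follows. This is a minimal illustration, not SGLang's actual loader: resolve_nvfp4_weights is a hypothetical name, and picking the first match in sorted order when several *-mixed.safetensors files exist is an assumption made here for determinism.

```python
from pathlib import Path


def resolve_nvfp4_weights(local_dir: str) -> Path:
    """Hypothetical helper mirroring the documented lookup order."""
    root = Path(local_dir)
    # Step 1: prefer a *-mixed.safetensors file when one is present.
    # Assumption: with several matches, take the first in sorted order.
    mixed = sorted(root.glob("*-mixed.safetensors"))
    if mixed:
        return mixed[0]
    # Step 2: otherwise fall back to loading from the directory itself.
    return root
```

If a repo or directory holds several candidate checkpoints, passing --transformer-weights-path explicitly avoids relying on any such tie-breaking.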
Nunchaku (SVDQuant)
Install
Install the runtime dependency first:
pip install nunchaku
For platform-specific installation methods and troubleshooting, see the Nunchaku installation guide.
File Naming and Auto-Detection
For Nunchaku checkpoints, --model-path should still point to the original
base model, while --transformer-weights-path points to the quantized
transformer weights.
If the basename of --transformer-weights-path contains the pattern
svdq-(int4|fp4)_r{rank}, SGLang will automatically:
- enable SVDQuant
- infer --quantization-precision
- infer --quantization-rank
Examples:
| checkpoint name fragment | inferred precision | inferred rank | notes |
|---|---|---|---|
| svdq-int4_r32 | int4 | 32 | Standard INT4 checkpoint |
| svdq-int4_r128 | int4 | 128 | Higher-quality INT4 checkpoint |
| svdq-fp4_r32 | nvfp4 | 32 | fp4 in the filename maps to CLI value nvfp4 |
| svdq-fp4_r128 | nvfp4 | 128 | Higher-quality NVFP4 checkpoint |
Common filenames:
| filename | precision | rank | typical use |
|---|---|---|---|
| svdq-int4_r32-qwen-image.safetensors | int4 | 32 | Balanced default |
| svdq-int4_r128-qwen-image.safetensors | int4 | 128 | Quality-focused |
| svdq-fp4_r32-qwen-image.safetensors | nvfp4 | 32 | RTX 50-series / NVFP4 path |
| svdq-fp4_r128-qwen-image.safetensors | nvfp4 | 128 | Quality-focused NVFP4 |
| svdq-int4_r32-qwen-image-lightningv1.0-4steps.safetensors | int4 | 32 | Lightning 4-step |
| svdq-int4_r128-qwen-image-lightningv1.1-8steps.safetensors | int4 | 128 | Lightning 8-step |
If your checkpoint name does not follow this convention, pass
--enable-svdquant, --quantization-precision, and --quantization-rank
explicitly.
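The filename convention above can be checked ahead of time with a short script. The sketch below is not SGLang's implementation: infer_svdq_settings is a hypothetical name, and the regular expression is a direct transcription of the svdq-(int4|fp4)_r{rank} pattern, assuming the rank is written as decimal digits.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Transcription of the documented pattern; the rank is assumed to be decimal digits.
_SVDQ_PATTERN = re.compile(r"svdq-(int4|fp4)_r(\d+)")

# In filenames the NVFP4 variant is written "fp4"; the CLI value is "nvfp4".
_PRECISION_TO_CLI = {"int4": "int4", "fp4": "nvfp4"}


def infer_svdq_settings(weights_path: str) -> Optional[Tuple[str, int]]:
    """Hypothetical helper: return (precision, rank) or None if the name does not match."""
    basename = Path(weights_path).name  # only the basename is inspected
    match = _SVDQ_PATTERN.search(basename)
    if match is None:
        return None  # pass the three flags explicitly in this case
    return _PRECISION_TO_CLI[match.group(1)], int(match.group(2))
```

For example, infer_svdq_settings("/path/to/svdq-fp4_r32-qwen-image.safetensors") returns ("nvfp4", 32), while a name such as custom_nunchaku_checkpoint.safetensors returns None.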
Usage Examples
Recommended auto-detected flow:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/svdq-int4_r32-qwen-image.safetensors \
--prompt "change the raccoon to a cute cat" \
--attention-backend torch_sdpa \
--save-output
Manual override when the filename does not encode the quant settings:
sglang generate \
--model-path Qwen/Qwen-Image \
--transformer-weights-path /path/to/custom_nunchaku_checkpoint.safetensors \
--enable-svdquant \
--quantization-precision int4 \
--quantization-rank 128 \
--prompt "a beautiful sunset" \
--attention-backend torch_sdpa \
--save-output
Notes
- --transformer-weights-path is the canonical flag for Nunchaku checkpoints. Older config names such as quantized_model_path are treated as compatibility aliases.
- Auto-detection only happens when the checkpoint basename matches svdq-(int4|fp4)_r{rank}.
- The CLI values are int4 and nvfp4. In filenames, the NVFP4 variant is written as fp4.
- Lightning checkpoints usually expect matching --num-inference-steps, such as 4 or 8.
- Current runtime validation only allows Nunchaku on NVIDIA CUDA Ampere (SM8x) or SM12x GPUs. Hopper (SM90) is currently rejected.
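The GPU restriction in the last note can be expressed as a simple pre-flight check. This is a sketch under stated assumptions, not the validation code SGLang runs: nunchaku_arch_allowed is a hypothetical name, and it reads "SM8x or SM12x" as "compute capability major version 8 or 12".

```python
def nunchaku_arch_allowed(major: int, minor: int) -> bool:
    """Hypothetical check: True for SM8x and SM12x, False otherwise.

    Assumption: the documented rule depends only on the major version,
    so the minor version is accepted but not used.
    """
    return major in (8, 12)
```

On a machine with PyTorch installed, the (major, minor) pair could come from torch.cuda.get_device_capability(); a Hopper GPU reports (9, 0) and is rejected by this check.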