sglang/docs/platforms/ascend_npu_quantization.md
Артем Савкин 424a380077 [NPU] NPU quantization refactoring & more quantization formats support (#14504)
Co-authored-by: TamirBaydasov <mr.jeijy@gmail.com>
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
Co-authored-by: Савкин Артем <savkinartem@MacBook-Air-Viktoria.local>
Co-authored-by: Edward Shogulin <edward.shogulin@gmail.com>
2026-01-15 04:25:15 +08:00

Quantization on Ascend

To load an already quantized model, simply load the model weights and config. If the model was quantized offline, there is no need to pass the --quantization argument when starting the engine: the quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
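As a rough illustration of the lookup described above, the sketch below checks a local model directory for the two files and reports which one carries quantization information. It is not SGLang's loader: the function name `find_quant_config`, the order in which the files are checked, and the use of the Hugging Face style `quantization_config.quant_method` key in config.json are assumptions made for this example, and the schema of quant_model_description.json is deliberately not interpreted.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def find_quant_config(model_dir: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (file name, quant method) for the first quantization description found.

    Hypothetical helper for illustration only; the real engine may use a
    different lookup order and different keys.
    """
    root = Path(model_dir)

    # Assumption: the dedicated description file takes precedence when present.
    # Its schema is not parsed here, so the method is reported as None.
    description = root / "quant_model_description.json"
    if description.is_file():
        return description.name, None

    # Assumption: config.json follows the Hugging Face convention of a
    # "quantization_config" object with a "quant_method" field.
    config = root / "config.json"
    if config.is_file():
        data = json.loads(config.read_text(encoding="utf-8"))
        quant = data.get("quantization_config")
        if isinstance(quant, dict):
            return config.name, quant.get("quant_method")

    # No quantization description found: the checkpoint is treated as unquantized.
    return None
```

If this helper returns a non-None result for a checkpoint, that is the situation in which the text above says the --quantization argument can be left out at startup.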

ModelSlim support on Ascend:

  • W4A4 dynamic linear
  • W8A8 static linear
  • W8A8 dynamic linear
  • W4A8 dynamic MOE
  • W8A8 dynamic MOE
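The labels in the list above follow the WxAy convention, where x is the weight bit width and y is the activation bit width (W4A8 means 4-bit weights with 8-bit activations). A minimal sketch of reading such a label, assuming a hypothetical helper named `parse_scheme` that is not part of SGLang:

```python
import re
from typing import NamedTuple


class QuantScheme(NamedTuple):
    weight_bits: int
    activation_bits: int


def parse_scheme(label: str) -> QuantScheme:
    """Extract weight and activation bit widths from a label such as 'W4A8 dynamic MOE'."""
    # The label must start with W<digits>A<digits>; anything after that is ignored.
    match = re.match(r"W(\d+)A(\d+)", label.strip())
    if match is None:
        raise ValueError(f"not a WxAy label: {label!r}")
    return QuantScheme(int(match.group(1)), int(match.group(2)))


print(parse_scheme("W4A8 dynamic MOE"))
# prints QuantScheme(weight_bits=4, activation_bits=8)
```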

AWQ support on Ascend:

  • W4A16 linear
  • W8A16 linear (needs testing)
  • W4A16 MOE (needs testing)

Compressed-tensors (LLM Compressor) support on Ascend: