# Weight Quantization Tools

KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). Both tools work together to enable heterogeneous expert placement:

- **CPU Weights (`convert_cpu_weights.py`)**: Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- **GPU Weights (`convert_gpu_weights.py`)**: Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts

---

## CPU Weight Quantization

Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These quantized weights are used for "cold" experts (less frequently accessed) that run on CPU in hybrid inference scenarios.

### Quantization Methods

- **INT4**: 4-bit quantization for maximum memory efficiency
- **INT8**: 8-bit quantization for better accuracy

### Supported Input Formats

- **FP8**: 8-bit floating point with automatic dequantization
- **FP16**: 16-bit floating point
- **BF16**: BFloat16 format

> **⚠️ Precision Warning:** Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original **BF16** model as the source for INT4/INT8 quantization.

## Basic Usage

### Quantize BF16 model to INT4

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/bf16/model \
    --input-type bf16 \
    --output /path/to/output \
    --quant-method int4
```

### Quantize FP16 model to INT8

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/fp16/model \
    --input-type fp16 \
    --output /path/to/output \
    --quant-method int8
```

### Quantize FP8 model to INT4

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/fp8/model \
    --input-type fp8 \
    --output /path/to/output \
    --quant-method int4
```

## Output Format

By default, the converted weights are saved in SafeTensors format with a NUMA-aware layout:

```
output_dir/
├── model-00001-of-00050.safetensors
├── model-00002-of-00050.safetensors
├── ...
├── config.json
└── tokenizer files...
```

Each expert's weights are split across NUMA nodes for optimal memory access (you can verify this as shown below):

- `blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.weight`: Quantized weights
- `blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.scale`: Quantization scales
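
A minimal sketch for checking the layout after conversion, assuming the `safetensors` Python package is installed and taking the shard name from the listing above:

```bash
# Print the NUMA-sharded expert tensors in the first output shard.
python - <<'EOF'
from safetensors import safe_open

with safe_open("output_dir/model-00001-of-00050.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        if ".numa." in name:
            print(name)  # e.g. blk.3.ffn_gate_exps.7.numa.0.weight / .scale
EOF
```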
## Advanced Options

### Low Memory Mode

For systems with insufficient memory to complete full-model quantization, use the `--no-merge-safetensor` flag to keep the weights in a per-layer folder structure instead of merging them into SafeTensors files:

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/model \
    --input-type bf16 \
    --output /path/to/output \
    --quant-method int4 \
    --no-merge-safetensor
```

This will save quantized weights in the following folder structure:

```
output_dir/
├── _layer_0/
│   ├── _numa_0/
│   │   ├── INT4_down_0_*.kt
│   │   ├── INT4_gate_0_*.kt
│   │   └── INT4_up_0_*.kt
│   └── _numa_1/
│       └── ...
├── _layer_1/
│   └── ...
└── ...
```

**When to use `--no-merge-safetensor`:**

- Machine runs out of memory during the merge step
- Need to process very large models on memory-constrained systems
- Want to preserve intermediate layer-wise quantized weights
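
After such a run, you can check how far it progressed (for example, before using `--resume-layer` below). A minimal sketch, assuming the output path from the example above:

```bash
# List the highest-numbered layer folders already written (version sort
# keeps _layer_10 after _layer_9), then peek inside one NUMA shard.
ls -d /path/to/output/_layer_* | sort -V | tail -n 3
ls /path/to/output/_layer_0/_numa_0/   # expect INT4_{down,gate,up}_*.kt files
```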

### Resume Layer

If a memory-constrained system cannot complete quantization even with low memory mode (`--no-merge-safetensor`) enabled, restart the script with the `--resume-layer` argument to specify the layer from which to continue the conversion. In the example below, we skip layers 0-11 and resume conversion starting with layer 12.

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/model \
    --input-type bf16 \
    --output /path/to/output \
    --quant-method int4 \
    --no-merge-safetensor \
    --resume-layer 12
```

## Examples

### Example 1: Quantize DeepSeek-V3.1 (FP8 → INT4)

```bash
python scripts/convert_cpu_weights.py \
    --input-path /mnt/data/models/DeepSeek-V3.1 \
    --input-type fp8 \
    --output /mnt/data/models/DeepSeek-V3.1-INT4 \
    --quant-method int4 \
    --cpuinfer-threads 60 \
    --threadpool-count 2
```

### Example 2: Quantize Qwen3-Next-80B (BF16 → INT4, Low Memory)

```bash
python scripts/convert_cpu_weights.py \
    --input-path /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
    --input-type bf16 \
    --output /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-INT4 \
    --quant-method int4 \
    --cpuinfer-threads 60 \
    --threadpool-count 2 \
    --no-merge-safetensor
```

---

## GPU Weight Quantization

### Prerequisites

GPU weight quantization requires additional dependencies. Install them before proceeding:

```bash
pip install accelerate transformers llmcompressor datasets
```

**Required packages:**

- `accelerate`: For distributed model loading and device mapping
- `transformers`: For model and tokenizer loading
- `llmcompressor`: For quantization (supports GPTQ and RTN methods)
- `datasets`: For calibration data loading (GPTQ only)

**Documentation:** This tool is based on llmcompressor. For more details, see the [llmcompressor quantization guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress/#select-a-quantization-method-and-scheme).

### Overview

Apply weight quantization to model weights for GPU-resident "hot" experts (frequently accessed) in CPU-GPU hybrid inference. This tool works together with `convert_cpu_weights.py` to enable heterogeneous expert placement:

- **GPU-resident experts** ("hot" experts) use GPTQ/RTN quantization (this tool) for efficient GPU memory usage
- **CPU-resident experts** ("cold" experts) use AMX-optimized INT4/INT8 quantization (`convert_cpu_weights.py`)
- **Attention layers, gates, and shared experts** remain in higher precision

This approach maximizes throughput and resource utilization by intelligently distributing experts across CPUs and GPUs.

### Quantization Methods

#### 1. GPTQ (Calibration-based, Default)

**Pros:**

- Higher accuracy through calibration-based quantization
- Recommended for production deployments

**Cons:**

- Requires a calibration dataset
- Slower quantization process
- Higher memory requirements (needs the Hessian matrix)

#### 2. RTN (Round-To-Nearest)

**Pros:**

- Fast quantization (no calibration needed)
- Lower memory requirements
- Good for quick testing and prototyping

**Cons:**

- Slightly lower accuracy compared to GPTQ
- No calibration optimization

### Quantization Types

- **W4A16**: 4-bit weights, 16-bit activations (INT4)
- **W8A16**: 8-bit weights, 16-bit activations (INT8)

### Basic Usage

#### GPTQ Quantization (Recommended for Production)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16
```

#### RTN Quantization (Fast, for Testing)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method RTN \
    --quant_type W4A16
```

### Memory Requirements

Understanding memory requirements is crucial for successful quantization. The requirements differ significantly between the RTN and GPTQ methods.

#### RTN Memory Requirements

Beyond the model weights themselves, RTN needs extra memory only for the quantization parameters (scales/zero-points):

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters |

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**

- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~22.4 GB (1 layer)

#### GPTQ Memory Requirements

GPTQ requires additional memory for Hessian matrices during calibration:

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters × 2 |

The Hessian matrix is approximately the same size as the layer weights and is used to improve accuracy recovery.

**Example: DeepSeek-R1-0528-BF16 (684B parameters)** (a quick sanity check of these figures follows below)

- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~44.8 GB (1 layer × 2 for the Hessian matrix)
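
As a rough back-of-envelope check (not part of the conversion tools), the estimates follow directly from the parameter count, the 2-byte BF16 width, and the per-layer figure from the RTN example:

```bash
# Back-of-envelope memory estimate for DeepSeek-R1-0528-BF16.
PARAMS_B=684          # total parameters, in billions
BYTES_PER_PARAM=2     # BF16 source weights
LAYER_GB="22.4"       # single-layer footprint from the RTN example above

echo "DRAM (RTN and GPTQ): $((PARAMS_B * BYTES_PER_PARAM)) GB"           # 1368 GB
echo "VRAM (RTN):  ${LAYER_GB} GB"                                       # one layer resident
awk -v l="$LAYER_GB" 'BEGIN { printf "VRAM (GPTQ): %.1f GB\n", 2 * l }'  # layer + Hessian
```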

#### Method Comparison

| Method | Speed | VRAM | Accuracy | Use Case |
|--------|-------|------|----------|----------|
| **RTN** | Fast | Low (~22 GB) | Good | Testing, prototyping |
| **GPTQ** | Slow | High (~45 GB) | Better | Production deployment |

### Advanced Options

#### Calibration Configuration (GPTQ Only)

For GPTQ quantization, you can control the calibration process for better quantization quality:

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 512 \
    --max_sequence_length 2048 \
    --dataset HuggingFaceH4/ultrachat_200k \
    --dataset_split train_sft
```

**Options (GPTQ only):**

- `--num_calibration_samples`: Number of samples for calibration (default: 512)
- `--max_sequence_length`: Maximum sequence length (default: 2048)
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use
- `--dampening_frac`: Dampening fraction to reduce quantization noise (default: 0.1)

#### Memory Management

Use `--max_gpu_memory` to limit GPU memory usage and offload remaining layers to CPU:

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB"
```

**Recommended settings for GPTQ:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB | 40-50 GiB | Reserve ~40% for Hessian |

**Recommended settings for RTN:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB | 18-20 GiB | No Hessian needed |
| 48 GiB | 40-45 GiB | No Hessian needed |
| 80 GiB | 70-75 GiB | No Hessian needed |

**Options:**

- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., `40GiB`)
- `--max_cpu_memory`: Maximum CPU memory (default: `1000GiB` when `--max_gpu_memory` is set)

**Important:** llmcompressor does not support disk offloading, so ensure your machine has enough combined GPU + CPU memory to load the entire model. If you still encounter OOM, try the following (a combined example follows the list):

1. Use RTN instead of GPTQ (requires less memory)
2. Reduce `--num_calibration_samples` (GPTQ only, e.g., 256)
3. Reduce `--max_sequence_length` (GPTQ only, e.g., 1024)
4. Use `--force_cpu` to run entirely on CPU (slower but avoids GPU OOM)
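
For example, a minimal sketch combining mitigations 2 and 3, using only the flags documented above; add `--force_cpu` only if GPU OOM persists:

```bash
# GPTQ with a reduced calibration budget to lower peak memory usage.
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 256 \
    --max_sequence_length 1024
```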

### Examples

#### Example 1: GPTQ Quantization for Production (Qwen3-Next-80B, W4A16)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
    --output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 512 \
    --max_sequence_length 2048 \
    --max_gpu_memory "40GiB" \
    --trust_remote_code
```

#### Example 2: RTN Quantization for Fast Testing (DeepSeek-R1, W4A16)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
    --output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
    --quant_method RTN \
    --quant_type W4A16 \
    --max_gpu_memory "70GiB" \
    --trust_remote_code
```

#### Example 3: GPTQ with Custom Calibration Dataset (GLM-4.5-Air, W8A16)

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/GLM-4.5-Air \
    --output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
    --quant_method GPTQ \
    --quant_type W8A16 \
    --dataset "tatsu-lab/alpaca" \
    --dataset_split "train" \
    --num_calibration_samples 256 \
    --max_gpu_memory "40GiB" \
    --trust_remote_code
```