# Weight Quantization Tools

KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). Both tools work together to enable heterogeneous expert placement:

- **CPU Weights (`convert_cpu_weights.py`)**: Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- **GPU Weights (`convert_gpu_weights.py`)**: Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts

---

## CPU Weight Quantization

Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These quantized weights are used for "cold" experts (less frequently accessed) that run on CPU in hybrid inference scenarios.

### Quantization Methods

- **INT4**: 4-bit quantization for maximum memory efficiency
- **INT8**: 8-bit quantization for better accuracy

### Supported Input Formats

- **FP8**: 8-bit floating point with automatic dequantization
- **FP16**: 16-bit floating point
- **BF16**: BFloat16 format

> **⚠️ Precision Warning:** Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original **BF16** model as the source for INT4/INT8 quantization.

## Basic Usage

### Quantize BF16 model to INT4

```bash
python scripts/convert_cpu_weights.py \
  --input-path /path/to/bf16/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4
```

### Quantize FP16 model to INT8

```bash
python scripts/convert_cpu_weights.py \
  --input-path /path/to/fp16/model \
  --input-type fp16 \
  --output /path/to/output \
  --quant-method int8
```

### Quantize FP8 model to INT4

```bash
python scripts/convert_cpu_weights.py \
  --input-path /path/to/fp8/model \
  --input-type fp8 \
  --output /path/to/output \
  --quant-method int4
```

## Output Format

By default, the converted weights are saved in SafeTensors format with NUMA-aware layout:

```
output_dir/
├── model-00001-of-00050.safetensors
├── model-00002-of-00050.safetensors
├── ...
├── config.json
└── tokenizer files...
```

Each expert's weights are split across NUMA nodes for optimal memory access:
- `blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.weight`: Quantized weights
- `blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.scale`: Quantization scales

## Advanced Options

### Low Memory Mode

For systems with insufficient memory to complete full model quantization, use the `--no-merge-safetensor` flag to keep weights in layer folder structure without merging into safetensor files:

```bash
python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4 \
  --no-merge-safetensor
```

This will save quantized weights in the following folder structure:

```
output_dir/
├── _layer_0/
│   ├── _numa_0/
│   │   ├── INT4_down_0_*.kt
│   │   ├── INT4_gate_0_*.kt
│   │   └── INT4_up_0_*.kt
│   └── _numa_1/
│       └── ...
├── _layer_1/
│   └── ...
└── ...
```

**When to use `--no-merge-safetensor`:**
- Machine runs out of memory during the merge step
- Need to process very large models on memory-constrained systems
- Want to preserve intermediate layer-wise quantized weights

### Resume Layer

For memory-constrained systems that are unable to complete quantization despite enabling low memory mode with `--no-merge-safetensor`, restart the script with the `--resume-layer` arg to specify the layer from which to continue the conversion process. In the example below, we skip layers 0-11 and resume conversion starting with layer 12.

```bash
python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4 \
  --no-merge-safetensor
  --resume-layer 12
```

## Examples

### Example 1: Quantize DeepSeek-V3.1 (FP8 → INT4)

```bash
python scripts/convert_cpu_weights.py \
  --input-path /mnt/data/models/DeepSeek-V3.1 \
  --input-type fp8 \
  --output /mnt/data/models/DeepSeek-V3.1-INT4 \
  --quant-method int4 \
  --cpuinfer-threads 60 \
  --threadpool-count 2
```

### Example 2: Quantize Qwen3-Next-80B (BF16 → INT4, Low Memory)

```bash
python scripts/convert_cpu_weights.py \
  --input-path /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
  --input-type bf16 \
  --output /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-INT4 \
  --quant-method int4 \
  --cpuinfer-threads 60 \
  --threadpool-count 2 \
  --no-merge-safetensor
```

---

## GPU Weight Quantization

### Prerequisites

GPU weight quantization requires additional dependencies. Install them before proceeding:

```bash
pip install accelerate transformers llmcompressor datasets
```

**Required packages:**
- `accelerate`: For distributed model loading and device mapping
- `transformers`: For model and tokenizer loading
- `llmcompressor`: For quantization (supports GPTQ and RTN methods)
- `datasets`: For calibration data loading (GPTQ only)

**Documentation:** This tool is based on llmcompressor. For more details, see [llmcompressor quantization guide](https://docs.vllm.ai/projects/llm-compressor/en/latest/getting-started/compress/#select-a-quantization-method-and-scheme).

### Overview

Apply weight quantization to model weights for GPU-resident "hot" experts (frequently accessed) in CPU-GPU hybrid inference. This tool works together with `convert_cpu_weights.py` to enable heterogeneous expert placement:

- **GPU-resident experts** ("hot" experts) use GPTQ/RTN quantization (this tool) for efficient GPU memory usage
- **CPU-resident experts** ("cold" experts) use AMX-optimized INT4/INT8 quantization (convert_cpu_weights.py)
- **Attention layers, gates, and shared experts** remain in higher precision

This approach maximizes throughput and resource utilization by intelligently distributing experts across CPUs and GPUs.

### Quantization Methods

#### 1. GPTQ (Calibration-based, Default)
**Pros:**
- Higher accuracy through calibration-based quantization
- Recommended for production deployments

**Cons:**
- Requires calibration dataset
- Slower quantization process
- Higher memory requirements (needs Hessian matrix)

#### 2. RTN (Round-To-Nearest)
**Pros:**
- Fast quantization (no calibration needed)
- Lower memory requirements
- Good for quick testing and prototyping

**Cons:**
- Slightly lower accuracy compared to GPTQ
- No calibration optimization

### Quantization Types

- **W4A16**: 4-bit weights, 16-bit activations (INT4)
- **W8A16**: 8-bit weights, 16-bit activations (INT8)

### Basic Usage

#### GPTQ Quantization (Recommended for Production)
```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method GPTQ \
  --quant_type W4A16
```

#### RTN Quantization (Fast, for Testing)
```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method RTN \
  --quant_type W4A16
```

### Memory Requirements

Understanding memory requirements is crucial for successful quantization. The requirements differ significantly between RTN and GPTQ methods.

#### RTN Memory Requirements

RTN only requires memory for quantization parameters (scales/zero-points):

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters |

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~22.4 GB (1 layer)

#### GPTQ Memory Requirements

GPTQ requires additional memory for Hessian matrices during calibration:

| Component | Requirement |
|-----------|-------------|
| **DRAM (CPU Memory)** | ≥ Total model parameters |
| **VRAM (GPU Memory)** | ≥ Single layer parameters × 2 |

The Hessian matrix is approximately the same size as the layer weights and is used to increase accuracy recovery.

**Example: DeepSeek-R1-0528-BF16 (684B parameters)**
- DRAM: ~1368 GB (684B params × 2 bytes)
- VRAM: ~44.8 GB (1 layer × 2 for Hessian matrix)

#### Method Comparison

| Method | Speed | VRAM | Accuracy | Use Case |
|--------|-------|------|----------|----------|
| **RTN** | Fast | Low (~22GB) | Good | Testing, prototyping |
| **GPTQ** | Slow | High (~45GB) | Better | Production deployment |

### Advanced Options

#### Calibration Configuration (GPTQ Only)

For GPTQ quantization, control the calibration process for better quantization quality:

```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method GPTQ \
  --quant_type W4A16 \
  --num_calibration_samples 512 \
  --max_sequence_length 2048 \
  --dataset HuggingFaceH4/ultrachat_200k \
  --dataset_split train_sft
```

**Options (GPTQ only):**
- `--num_calibration_samples`: Number of samples for calibration (default: 512)
- `--max_sequence_length`: Maximum sequence length (default: 2048)
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use
- `--dampening_frac`: Dampening fraction to reduce quantization noise (default: 0.1)

#### Memory Management

Use `--max_gpu_memory` to limit GPU memory usage and offload remaining layers to CPU:

```bash
python scripts/convert_gpu_weights.py \
  --model_id /path/to/model \
  --output_dir /path/to/output \
  --quant_method GPTQ \
  --quant_type W4A16 \
  --max_gpu_memory "40GiB"
```

**Recommended settings for GPTQ:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB   | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB   | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB   | 40-50 GiB | Reserve ~40% for Hessian |

**Recommended settings for RTN:**

| GPU VRAM | Suggested `--max_gpu_memory` | Notes |
|----------|------------------------------|-------|
| 24 GiB   | 18-20 GiB | No Hessian needed |
| 48 GiB   | 40-45 GiB | No Hessian needed |
| 80 GiB   | 70-75 GiB | No Hessian needed |

**Options:**
- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., '40GiB')
- `--max_cpu_memory`: Maximum CPU memory (default: 1000GiB when `--max_gpu_memory` is set)

**Important:** llmcompressor does not support disk offloading. Ensure your machine has enough GPU + CPU memory to load the entire model. If you still encounter OOM:
1. Use RTN instead of GPTQ (requires less memory)
2. Reduce `--num_calibration_samples` (GPTQ only, e.g., 256)
3. Reduce `--max_sequence_length` (GPTQ only, e.g., 1024)
4. Use `--force_cpu` to run entirely on CPU (slower but avoids GPU OOM)

### Examples

#### Example 1: GPTQ Quantization for Production (Qwen3-Next-80B, W4A16)

```bash
python scripts/convert_gpu_weights.py \
  --model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
  --output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
  --quant_method GPTQ \
  --quant_type W4A16 \
  --num_calibration_samples 512 \
  --max_sequence_length 2048 \
  --max_gpu_memory "40GiB" \
  --trust_remote_code
```

#### Example 2: RTN Quantization for Fast Testing (DeepSeek-R1, W4A16)

```bash
python scripts/convert_gpu_weights.py \
  --model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
  --output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
  --quant_method RTN \
  --quant_type W4A16 \
  --max_gpu_memory "70GiB" \
  --trust_remote_code
```

#### Example 3: GPTQ with Custom Calibration Dataset (GLM-4.5-Air, W8A16)

```bash
python scripts/convert_gpu_weights.py \
  --model_id /mnt/data/models/GLM-4.5-Air \
  --output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
  --quant_method GPTQ \
  --quant_type W8A16 \
  --dataset "tatsu-lab/alpaca" \
  --dataset_split "train" \
  --num_calibration_samples 256 \
  --max_gpu_memory "40GiB" \
  --trust_remote_code
```