mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
258 lines
7.5 KiB
Plaintext
258 lines
7.5 KiB
Plaintext
---
|
|
title: "Checkpoint Engine Integration"
|
|
metatags:
|
|
description: "SGLang checkpoint engine: distributed model weight loading, parallel multi-node setup, broadcast and P2P modes. Reduces loading time for large models."
|
|
---
|
|
The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.
|
|
|
|
## Overview
|
|
|
|
The checkpoint engine integration allows SGLang to:
|
|
- Load model weights in parallel using multiple processes
|
|
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
|
|
- Overlap weight loading with other initialization tasks like CUDA graph capture
|
|
- Support both single-node and multi-node deployments
|
|
|
|
## Installation
|
|
|
|
First, install the checkpoint engine package:
|
|
|
|
```bash Command
|
|
pip install 'checkpoint-engine[p2p]'
|
|
```
|
|
|
|
## Architecture
|
|
|
|
The system consists of two main components:
|
|
|
|
1. **SGLang Server**: Runs with `--wait-for-initial-weights` flag to wait for weights before becoming ready
|
|
2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights
|
|
|
|
The checkpoint engine uses a parameter server architecture with support for:
|
|
- **Broadcast mode**: Weights are broadcast from loading processes to inference processes
|
|
- **P2P mode**: Direct peer-to-peer weight transfer between processes
|
|
- **All mode**: Combination of both broadcast and P2P methods
|
|
|
|
## Usage Examples
|
|
|
|
### Single Node Setup
|
|
|
|
**Terminal 1 - Launch SGLang Server:**
|
|
```bash Command
|
|
python -m sglang.launch_server \
|
|
--model-path Qwen/Qwen3-8B \
|
|
--tp 8 \
|
|
--load-format dummy \
|
|
--wait-for-initial-weights
|
|
```
|
|
|
|
**Terminal 2 - Run Checkpoint Engine:**
|
|
|
|
Using sglang entrypoint:
|
|
```bash Command
|
|
python -m sglang.srt.checkpoint_engine.update \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 8
|
|
```
|
|
|
|
Using torchrun directly:
|
|
```bash Command
|
|
torchrun --nproc-per-node 8 \
|
|
examples/checkpoint_engine/update.py \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 8
|
|
```
|
|
|
|
### Multi-Node Setup (2 Nodes)
|
|
|
|
**Node 0:**
|
|
|
|
Launch SGLang server:
|
|
```bash Command
|
|
python -m sglang.launch_server \
|
|
--model-path Qwen/Qwen3-8B \
|
|
--tp 8 \
|
|
--load-format dummy \
|
|
--wait-for-initial-weights \
|
|
--host [IP]
|
|
```
|
|
|
|
Run checkpoint engine:
|
|
|
|
Using sglang entrypoint (recommended):
|
|
```bash Command
|
|
python -m sglang.srt.checkpoint_engine.update \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 8
|
|
```
|
|
|
|
Using torchrun directly:
|
|
```bash Command
|
|
torchrun --nproc-per-node 8 \
|
|
--nnodes 2 \
|
|
--node-rank 0 \
|
|
--master-addr [IP] \
|
|
--master-port 29500 \
|
|
examples/checkpoint_engine/update.py \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 8
|
|
```
|
|
|
|
**Node 1:**
|
|
|
|
Launch SGLang server:
|
|
```bash Command
|
|
python -m sglang.launch_server \
|
|
--model-path Qwen/Qwen3-8B \
|
|
--tp 8 \
|
|
--load-format dummy \
|
|
--wait-for-initial-weights \
|
|
--host [IP]
|
|
```
|
|
|
|
Run checkpoint engine:
|
|
|
|
Using sglang entrypoint (recommended):
|
|
```bash Command
|
|
python -m sglang.srt.checkpoint_engine.update \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 8
|
|
```
|
|
|
|
Using torchrun directly:
|
|
```bash Command
|
|
torchrun --nproc-per-node 8 \
|
|
--nnodes 2 \
|
|
--node-rank 1 \
|
|
--master-addr [IP] \
|
|
--master-port 29500 \
|
|
examples/checkpoint_engine/update.py \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 8
|
|
```
|
|
|
|
### Multi-Node Setup with Tensor Parallelism (TP=16)
|
|
|
|
**Node 0:**
|
|
|
|
Launch SGLang server:
|
|
```bash Command
|
|
python -m sglang.launch_server \
|
|
--model-path Qwen/Qwen3-8B \
|
|
--tp 8 \
|
|
--load-format dummy \
|
|
--wait-for-initial-weights \
|
|
--host [IP] \
|
|
--dist-init-addr [IP]:9120 \
|
|
--nnodes 2 \
|
|
--node-rank 0
|
|
```
|
|
|
|
Run checkpoint engine:
|
|
|
|
Using sglang entrypoint (recommended):
|
|
```bash Command
|
|
python -m sglang.srt.checkpoint_engine.update \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 16
|
|
```
|
|
|
|
Using torchrun directly:
|
|
```bash Command
|
|
torchrun --nproc-per-node 8 \
|
|
--nnodes 2 \
|
|
--node-rank 0 \
|
|
--master-addr [IP] \
|
|
--master-port 29500 \
|
|
examples/checkpoint_engine/update.py \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 16
|
|
```
|
|
|
|
**Node 1:**
|
|
|
|
Launch SGLang server:
|
|
```bash Command
|
|
python -m sglang.launch_server \
|
|
--model-path Qwen/Qwen3-8B \
|
|
--tp 8 \
|
|
--load-format dummy \
|
|
--wait-for-initial-weights \
|
|
--host [IP] \
|
|
--dist-init-addr [IP]:9120 \
|
|
--nnodes 2 \
|
|
--node-rank 1
|
|
```
|
|
|
|
Run checkpoint engine:
|
|
|
|
Using sglang entrypoint (recommended):
|
|
```bash Command
|
|
python -m sglang.srt.checkpoint_engine.update \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 16
|
|
```
|
|
|
|
Using torchrun directly:
|
|
```bash Command
|
|
torchrun --nproc-per-node 8 \
|
|
--nnodes 2 \
|
|
--node-rank 1 \
|
|
--master-addr [IP] \
|
|
--master-port 29500 \
|
|
examples/checkpoint_engine/update.py \
|
|
--update-method broadcast \
|
|
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
|
|
--inference-parallel-size 16
|
|
```
|
|
|
|
## Configuration Options
|
|
|
|
### SGLang Server Options
|
|
|
|
- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
|
|
- `--wait-for-initial-weights`: Wait for checkpoint engine to provide weights before becoming ready
|
|
- `--host`: Host address for multi-node setups
|
|
- `--dist-init-addr`: Distributed initialization address for tensor parallelism
|
|
|
|
### Checkpoint Engine Options
|
|
|
|
- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
|
|
- `--checkpoint-path`: Path to model checkpoint directory
|
|
- `--inference-parallel-size`: Number of inference parallel processes
|
|
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
|
|
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
|
|
- `--save-metas-file`: File to save checkpoint metadata
|
|
- `--load-metas-file`: File to load checkpoint metadata from
|
|
- `--uds`: Unix domain socket path for communication
|
|
- `--weight-version`: Version identifier for weights
|
|
|
|
## Performance Benefits
|
|
|
|
The checkpoint engine provides significant time savings in two main aspects:
|
|
|
|
1. **Multi-node Loading**: Each node only loads a portion of weights from disk, effectively increasing disk bandwidth. More participating nodes provide greater acceleration. Preliminary tests show 20-second acceleration when loading DeepSeek-R1 on H20-3e with two nodes.
|
|
|
|
2. **Single Process Optimization**: Using dummy format allows overlapping disk-to-CPU transfer with CUDA graph capture and other initialization tasks, providing additional time savings.
|
|
|
|
## Troubleshooting
|
|
|
|
- Ensure checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
|
|
- Verify network connectivity between nodes in multi-node setups
|
|
- Check that the checkpoint path contains valid model files
|
|
- Monitor logs for connection errors between SGLang server and checkpoint engine
|
|
- Use `--sleep-time` parameter to add delays if needed for debugging
|
|
|
|
## References
|
|
|
|
- [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine)
|