mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
---
title: "Loading Models from Object Storage"
metatags:
  description: "Load SGLang models directly from S3, Google Cloud Storage, Azure Blob, and S3-compatible object storage with runai_streamer."
---

SGLang can load models directly from object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage, and S3-compatible stores) without a full local download. The `runai_streamer` load format streams model weights straight from cloud storage, which reduces startup time and local storage requirements.

## Overview

When loading models from object storage, SGLang uses a two-phase approach:

1. **Metadata Download** (once, before process launch): Configuration and tokenizer files are downloaded to a local cache.
2. **Weight Streaming** (lazy, during model loading): Model weights are streamed directly from object storage as needed.

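
To make the split between the two phases concrete, the sketch below sorts a model directory listing into files fetched up front and files streamed later. It assumes only what this page states: weights are `.safetensors` files, and everything else (configuration and tokenizer files) counts as metadata. The `classify_model_file` helper and the file names are hypothetical and do not reflect SGLang's internal logic.

```bash
# Hypothetical helper: decide which phase would handle a given file.
# Assumption: weights use the .safetensors extension; all other files
# are treated as configuration/tokenizer metadata.
classify_model_file() {
  case "$1" in
    *.safetensors) echo "stream" ;;    # phase 2: streamed during model loading
    *)             echo "download" ;;  # phase 1: downloaded to the local cache
  esac
}

# Example listing (illustrative file names).
for f in config.json tokenizer.json model-00001-of-00004.safetensors; do
  printf '%s -> %s\n' "$f" "$(classify_model_file "$f")"
done
```
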
## Supported Storage Backends

1. **Amazon S3**: `s3://bucket-name/path/to/model/`
2. **Google Cloud Storage**: `gs://bucket-name/path/to/model/`
3. **Azure Blob Storage**: `az://some-azure-container/path/`
4. **S3-compatible storage**: `s3://bucket-name/path/to/model/`

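
A launch script may want to check whether a model path points at object storage before deciding how to start the server. The following is a minimal sketch based only on the URI schemes listed above; `is_object_storage_uri` is a hypothetical helper, not part of SGLang, and SGLang's own detection may differ.

```bash
# Hypothetical helper: succeed if the path uses one of the URI schemes
# listed above (s3://, gs://, az://). Not SGLang's actual detection code.
is_object_storage_uri() {
  case "$1" in
    s3://*|gs://*|az://*) return 0 ;;
    *)                    return 1 ;;
  esac
}

MODEL_PATH="s3://my-bucket/models/llama-3-8b/"
if is_object_storage_uri "$MODEL_PATH"; then
  echo "object storage path: $MODEL_PATH"
else
  echo "local or hub path: $MODEL_PATH"
fi
```

Because Amazon S3 and S3-compatible stores share the `s3://` scheme, the URI alone does not tell them apart.
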
## Quick Start

### Basic Usage

Pass an object storage URI as the model path:

```bash
# S3
python -m sglang.launch_server \
    --model-path s3://my-bucket/models/llama-3-8b/ \
    --load-format runai_streamer

# Google Cloud Storage
python -m sglang.launch_server \
    --model-path gs://my-bucket/models/llama-3-8b/ \
    --load-format runai_streamer
```

**Note**: SGLang selects `--load-format runai_streamer` automatically when the model path is an object storage URI, so you can omit the flag:

```bash
python -m sglang.launch_server \
    --model-path s3://my-bucket/models/llama-3-8b/
```

### With Tensor Parallelism

```bash
python -m sglang.launch_server \
    --model-path gs://my-bucket/models/llama-70b/ \
    --tp 4 \
    --model-loader-extra-config '{"distributed": true}'
```

## Configuration

### Load Format

The `runai_streamer` load format is designed for object storage, SSDs, and shared file systems. To select it explicitly:

```bash
python -m sglang.launch_server \
    --model-path s3://bucket/model/ \
    --load-format runai_streamer
```

### Extended Configuration Parameters

Use `--model-loader-extra-config` to pass additional configuration as a JSON string:

```bash
python -m sglang.launch_server \
    --model-path s3://bucket/model/ \
    --model-loader-extra-config '{
        "distributed": true,
        "concurrency": 8,
        "memory_limit": 2147483648
    }'
```

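`memory_limit` is expressed in bytes, so the value `2147483648` in the example above is 2 GiB. A minimal sketch of deriving it with shell arithmetic instead of hard-coding it follows; `LIMIT_GIB`, `MEMORY_LIMIT`, and `EXTRA_CONFIG` are illustrative variable names, not SGLang settings.

```bash
# Convert a buffer size in GiB to bytes (1 GiB = 1024^3 bytes).
LIMIT_GIB=2
MEMORY_LIMIT=$((LIMIT_GIB * 1024 * 1024 * 1024))

# Assemble the JSON string for --model-loader-extra-config.
EXTRA_CONFIG=$(printf '{"distributed": true, "concurrency": 8, "memory_limit": %s}' "$MEMORY_LIMIT")
echo "$EXTRA_CONFIG"
```

The resulting string can then be passed as `--model-loader-extra-config "$EXTRA_CONFIG"`.
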
#### Available Parameters

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
  <colgroup>
    <col style={{width: "22%"}} />
    <col style={{width: "16%"}} />
    <col style={{width: "44%"}} />
    <col style={{width: "18%"}} />
  </colgroup>
  <thead>
    <tr style={{borderBottom: "2px solid #d55816"}}>
      <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Parameter</th>
      <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Type</th>
      <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Description</th>
      <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>Default</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>distributed</code></td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>bool</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Enable distributed streaming for multi-GPU setups. Automatically set to <code>true</code> when the model path is an object storage URI and the device is CUDA-like.</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>Auto-detected</td>
    </tr>
    <tr>
      <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>concurrency</code></td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>int</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Number of concurrent download streams. Higher values can improve throughput for large models.</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>4</td>
    </tr>
    <tr>
      <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><code>memory_limit</code></td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>int</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.02)"}}>Memory limit (in bytes) for the streaming buffer.</td>
      <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>System-dependent</td>
    </tr>
  </tbody>
</table>

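Because every parameter has a default, a launch script only needs to include the keys it wants to override. The sketch below builds the JSON from optional shell variables and leaves out any that are unset; `build_extra_config`, `STREAM_CONCURRENCY`, and `STREAM_MEMORY_LIMIT` are hypothetical names chosen for illustration, not SGLang options.

```bash
# Hypothetical helper: emit a JSON object containing only the
# parameters whose shell variables are set and non-empty.
build_extra_config() {
  json=""
  if [ -n "${STREAM_CONCURRENCY:-}" ]; then
    json="\"concurrency\": ${STREAM_CONCURRENCY}"
  fi
  if [ -n "${STREAM_MEMORY_LIMIT:-}" ]; then
    # Add a separator only when an earlier key is already present.
    json="${json:+$json, }\"memory_limit\": ${STREAM_MEMORY_LIMIT}"
  fi
  printf '{%s}\n' "$json"
}

# Override only the number of download streams.
STREAM_CONCURRENCY=8
build_extra_config
```

The output of `build_extra_config` can be passed to `--model-loader-extra-config` via command substitution.
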
## Performance Considerations

### Distributed Streaming

For multi-GPU setups, enable distributed streaming to parallelize weight loading across processes:

```bash
python -m sglang.launch_server \
    --model-path s3://bucket/model/ \
    --tp 8 \
    --model-loader-extra-config '{"distributed": true}'
```

## Limitations

- **Supported Formats**: Only the `.safetensors` weight format (the recommended format) is currently supported.
- **Supported Devices**: Distributed streaming is supported on CUDA-like devices. On other devices, loading falls back to non-distributed streaming.

## See Also

- [Run:ai Model Streamer documentation](https://github.com/run-ai/runai-model-streamer)