mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-07-01 04:08:10 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
333 lines
8.6 KiB
Plaintext
---
title: "Quickstart"
description: "Get up and running with SGLang in minutes: install, launch a server, and send your first request."
---

## Overview

This guide walks you through getting started with SGLang:

1. **Install** SGLang
2. **Launch** an inference server
3. **Send requests** using cURL, the OpenAI Python client, Python `requests`, or the native SGLang API

By the end, you'll have a working SGLang server responding to your prompts.

---

## Prerequisites

- **Python**: 3.9 or higher
- **GPU**: NVIDIA GPU with CUDA support (sm75 and above, e.g., T4, A10, A100, L4, L40S, H100)
- **OS**: Linux (recommended)

<Note>
For other platforms, see the dedicated guides for [AMD GPUs](../hardware-platforms/amd-gpus), [Intel Xeon CPUs](../hardware-platforms/cpu-server), [Google TPUs](../hardware-platforms/tpu), [NVIDIA Jetson](../hardware-platforms/nvidia), [Ascend NPUs](../hardware-platforms/ascend-npus/SGLang-installation-with-NPUs-support), and [Intel XPU](../hardware-platforms/xpu).
</Note>
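
The Python and GPU requirements above can be checked with a short script before installing. This is a minimal sketch, not part of SGLang: `python_ok` and `compute_cap_ok` are hypothetical helper names, and the compute capability is passed in as a `(major, minor)` tuple. If PyTorch is already installed, `torch.cuda.get_device_capability()` returns a tuple of that shape; the sample values below are NVIDIA's published capabilities for those cards.

```python
import sys


def python_ok(version=sys.version_info, minimum=(3, 9)):
    """Return True if the interpreter meets the minimum (major, minor)."""
    return tuple(version[:2]) >= minimum


def compute_cap_ok(cap, minimum=(7, 5)):
    """Return True if a (major, minor) compute capability is sm75 or newer."""
    return tuple(cap) >= minimum


print("Python OK:", python_ok())

# Hypothetical sample values: T4 is 7.5, A100 is 8.0, V100 is 7.0.
for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("V100", (7, 0))]:
    print(name, "meets sm75:", compute_cap_ok(cap))
```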

---

## Installation

<Tabs>
<Tab title="Pip / uv (Recommended)">
We recommend using **uv** for faster installation:

```bash
pip install --upgrade pip
pip install uv
uv pip install sglang
```
</Tab>
<Tab title="From Source">
```bash
# Clone and install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"
```
</Tab>
<Tab title="Docker">
The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).

Replace `<secret>` with your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens):

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

For production deployments, use the smaller **runtime** variant (~40% size reduction):

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest-runtime \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
</Tab>
</Tabs>

<Tip>
If you encounter `OSError: CUDA_HOME environment variable is not set`, set it with:
```bash
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
```
</Tip>

---

## Launch a Server

Start the SGLang server with a model. Here we use `qwen/qwen2.5-0.5b-instruct` as a lightweight example:

```bash
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
```

Wait until you see `The server is fired up and ready to roll!` in the terminal output.
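
When the server is started from a script rather than watched in a terminal, the wait can be approximated by polling an HTTP endpoint. The sketch below is one possible approach, not an SGLang utility: `http_probe` and `wait_until_ready` are hypothetical names, and it assumes that a 200 response from the `/openapi.json` URL on port 30000 is a good enough signal. A 200 only shows that the HTTP server is answering, so the log line above remains the authoritative indicator.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def wait_until_ready(url, probe=http_probe, timeout=300.0, interval=2.0):
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage, assuming the launch command above is running on this machine:
#   ready = wait_until_ready("http://localhost:30000/openapi.json")
#   print("server ready" if ready else "timed out waiting for server")
```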

<Note>
Once the server is running, API documentation is available at:
- **Swagger UI**: `http://localhost:30000/docs`
- **ReDoc**: `http://localhost:30000/redoc`
- **OpenAPI Spec**: `http://localhost:30000/openapi.json`
</Note>

<Info>
The server automatically applies the chat template from the Hugging Face tokenizer. You can override it with `--chat-template` when launching.
</Info>

---

## Send Requests

SGLang is fully **OpenAI API-compatible**, so you can use the same tools and libraries you already know.

### Using cURL

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

### Using OpenAI Python Client

Install the OpenAI Python library if you haven't:

```bash
pip install openai
```

Then send a request:

```python Example
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print(response.choices[0].message.content)
```

#### Streaming

```python Example
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
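
If you need the full reply as one string rather than printed pieces, the streaming loop above can be wrapped in a small helper. This is a sketch under stated assumptions: `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks mimic only the `choices[0].delta.content` attributes that the loop reads, so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the non-empty delta.content fields of a chunk iterable."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in exposing only the attributes used above."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Hand-written chunks, not real server output.
demo = [fake_chunk("Par"), fake_chunk(None), fake_chunk("is")]
print(collect_stream(demo))  # prints "Paris"
```

With a live server, the `response` object from the streaming example could be passed to `collect_stream` in place of `demo`.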

### Using Python Requests

```python Example
import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
```
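
The example above prints the whole JSON body. To pull out just the assistant text, a helper along these lines may be useful. It is a sketch that assumes the response follows the OpenAI chat-completions shape (`choices[0].message.content`); `extract_reply` is a hypothetical name, and the sample payload is hand-written rather than captured from a server.

```python
def extract_reply(payload):
    """Return the assistant text from a chat-completions response dict."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {payload!r}")
    return choices[0]["message"]["content"]


# Hand-written sample in the OpenAI chat-completions shape, not real output.
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(extract_reply(sample))  # prints "Paris."
```

With a live server, `extract_reply(response.json())` would replace the `print(response.json())` call.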

### Using the Native `/generate` API

SGLang also provides a native `/generate` endpoint for more flexibility.

```python Example
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)

print(response.json())
```

#### Streaming with `/generate`

```python Example
import requests
import json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

# Each event carries the full text generated so far, so print only the new suffix.
prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
```
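
The suffix-tracking logic in the loop above can be factored into a generator, which makes it testable without a running server. This is a sketch, not an SGLang API: `iter_deltas` is a hypothetical name, and the sample lines are hand-written in the shape the loop above expects (`data:` lines whose JSON `text` field is cumulative, ending with `data: [DONE]`).

```python
import json


def iter_deltas(lines):
    """Yield only the newly generated text from cumulative `data:` lines."""
    prev = 0
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # blank keep-alive lines and anything else
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        text = json.loads(body)["text"]
        yield text[prev:]
        prev = len(text)


# Hand-written lines, not captured server output.
demo = [
    b'data: {"text": " Paris"}',
    b"",
    b'data: {"text": " Paris, a city"}',
    b"data: [DONE]",
]
print("".join(iter_deltas(demo)))  # prints " Paris, a city"
```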

---

## Offline Batch Inference (No Server)

SGLang also supports offline batch inference using the `Engine` class directly, with no HTTP server required.

```python Example
import sglang as sgl

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}\n")

llm.shutdown()
```

---

## Common Troubleshooting

<AccordionGroup>
<Accordion title="CUDA_HOME not set">
Set the `CUDA_HOME` environment variable to your CUDA install root:
```bash
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
```
</Accordion>
<Accordion title="FlashInfer issues on sm75+ devices">
Switch to alternative backends by adding these flags when launching the server:
```bash
--attention-backend triton --sampling-backend pytorch
```
</Accordion>
<Accordion title="Reinstalling FlashInfer">
```bash
pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer
```
</Accordion>
<Accordion title="ptxas error on B300/GB300 (sm_103a)">
```bash
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```
</Accordion>
</AccordionGroup>
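
To see where the FlashInfer fallback flags go, the sketch below assembles the launch command from this guide with those flags appended. `build_launch_command` is a hypothetical helper, not part of SGLang; it uses only the flags already shown on this page and prints the command rather than running it.

```python
import shlex


def build_launch_command(model_path, host="0.0.0.0", port=30000, extra_flags=()):
    """Return the launch command as an argv list (hypothetical helper)."""
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
    ]
    argv.extend(extra_flags)
    return argv


# Fallback backends from the FlashInfer accordion above.
fallback = ["--attention-backend", "triton", "--sampling-backend", "pytorch"]
argv = build_launch_command("qwen/qwen2.5-0.5b-instruct", extra_flags=fallback)

# Print a shell-ready command; pass `argv` to subprocess.run to execute it.
print(shlex.join(argv))
```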

---

{/*
WIP, TBD linked later
## What's Next?

<CardGroup cols={2}>
<Card title="OpenAI-Compatible APIs" href="/basic_usage/openai_api_completions">
Explore the full Chat Completions and Completions APIs, including multi-turn conversations.
</Card>
<Card title="Vision Language Models" href="/basic_usage/openai_api_vision">
Send image inputs alongside text using OpenAI-compatible vision APIs.
</Card>
<Card title="Sampling Parameters" href="/basic_usage/sampling_params">
Fine-tune generation with temperature, top-p, frequency penalty, and more.
</Card>
<Card title="Server Arguments" href="/advanced_features/server_arguments">
Customize server behavior with advanced launch arguments like tensor parallelism.
</Card>
<Card title="Structured Outputs" href="/advanced_features/structured_outputs">
Constrain model output to JSON, regex, or EBNF grammars.
</Card>
<Card title="Ollama-Compatible API" href="/basic_usage/ollama_api">
Use the familiar Ollama CLI and Python library with SGLang as the backend.
</Card>
</CardGroup>
*/}