---
title: "Quickstart"
description: "Get up and running with SGLang in minutes: install, launch a server, and send your first request."
---
## Overview
This guide walks you through the entire flow of getting started with SGLang:
1. **Install** SGLang
2. **Launch** an inference server
3. **Send requests** using cURL, the OpenAI Python client, Python `requests`, or the native SGLang API

By the end, you'll have a working SGLang server responding to your prompts.

---
## Prerequisites
- **Python**: 3.9 or higher
- **GPU**: NVIDIA GPU with CUDA support (sm75 and above, e.g., T4, A10, A100, L4, L40S, H100)
- **OS**: Linux (recommended)
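To check the sm75 requirement on a given machine, one option is to ask `nvidia-smi` for each GPU's compute capability and compare it against 7.5. The sketch below assumes a driver recent enough to support the `compute_cap` query field; the helper names are illustrative and not part of SGLang.

```python
import shutil
import subprocess

MIN_CAPABILITY = (7, 5)  # sm75, the minimum listed above


def parse_compute_cap(value: str) -> tuple[int, int]:
    """Turn a string such as '8.0' into a comparable (major, minor) tuple."""
    major, minor = value.strip().split(".")
    return int(major), int(minor)


def is_supported(capability: tuple[int, int]) -> bool:
    """Return True when the capability is sm75 or newer."""
    return capability >= MIN_CAPABILITY


def query_gpu_capabilities() -> list[tuple[int, int]]:
    """Query nvidia-smi; return an empty list if it is missing or unparsable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    try:
        return [parse_compute_cap(line) for line in result.stdout.splitlines() if line.strip()]
    except ValueError:
        # Assumption: older drivers may not report a numeric compute_cap.
        return []


if __name__ == "__main__":
    for index, cap in enumerate(query_gpu_capabilities()):
        status = "supported" if is_supported(cap) else "below sm75"
        print(f"GPU {index}: sm{cap[0]}{cap[1]} ({status})")
```

If the script prints nothing, `nvidia-smi` was not found or did not return a usable value, and the check is inconclusive rather than failed.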
For other platforms, see the dedicated guides for [AMD GPUs](../hardware-platforms/amd-gpus), [Intel Xeon CPUs](../hardware-platforms/cpu-server), [Google TPUs](../hardware-platforms/tpu), [NVIDIA Jetson](../hardware-platforms/nvidia), [Ascend NPUs](../hardware-platforms/ascend-npus/SGLang-installation-with-NPUs-support), and [Intel XPU](../hardware-platforms/xpu).

---
## Installation
We recommend using **uv** for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install sglang
```
Alternatively, install from source:
```bash
# Clone and install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"
```
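After either install path, a quick way to confirm that the package landed in the active environment is to read its version from the installed metadata. This is a minimal sketch using only the standard library; `installed_version` is an illustrative helper, and it assumes the distribution is registered under the name `sglang`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("sglang")
    print(f"sglang {found}" if found else "sglang is not installed in this environment")
```

A "not installed" result usually means the install ran against a different Python environment than the one executing the script.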
The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
Set `HF_TOKEN` in the command below to your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens):
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
For production deployments, use the smaller **runtime** variant (~40% size reduction):
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=" \
    --ipc=host \
    lmsysorg/sglang:latest-runtime \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
If you encounter `OSError: CUDA_HOME environment variable is not set`, set it with:
```bash
export CUDA_HOME=/usr/local/cuda-
```
---
## Launch a Server
Start the SGLang server with a model. Here we use `qwen/qwen2.5-0.5b-instruct` as a lightweight example:
```bash
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
```
Wait until you see `The server is fired up and ready to roll!` in the terminal output.
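Instead of watching the log, a script can poll the server until it answers. The sketch below assumes the OpenAI-compatible `/v1/models` route is served on the same port as the chat endpoint used later in this guide; `wait_for_server` and its parameters are illustrative and not an SGLang function.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(
    base_url: str = "http://localhost:30000",
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll the server until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Assumption: /v1/models is served alongside /v1/chat/completions.
            with urllib.request.urlopen(
                f"{base_url}/v1/models", timeout=request_timeout_s
            ) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not accepting connections yet; try again shortly.
        time.sleep(poll_s)
    return False
```

Calling `wait_for_server()` before the first request returns `True` once the server responds and `False` if the timeout passes first, which is convenient in test scripts that start the server in the background.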
Once the server is running, API documentation is available at:
- **Swagger UI**: `http://localhost:30000/docs`
- **ReDoc**: `http://localhost:30000/redoc`
- **OpenAPI Spec**: `http://localhost:30000/openapi.json`
The server automatically applies the chat template from the Hugging Face tokenizer. You can override it with `--chat-template` when launching.

---
## Send Requests
SGLang is fully **OpenAI API-compatible**, so you can use the same tools and libraries you already know.
### Using cURL
```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
### Using OpenAI Python Client
Install the OpenAI Python library if you haven't:
```bash
pip install openai
```
Then send a request:
```python Example
import openai

# The client requires an api_key value; "None" is a placeholder here.
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
#### Streaming
```python Example
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Each chunk carries an incremental delta; print the pieces as they arrive.
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
### Using Python Requests
```python Example
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
```
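The example above prints the whole JSON body. Because the endpoint follows the OpenAI chat completion schema, the reply text sits under `choices[0].message.content`, the same path the OpenAI client example reads. A minimal sketch of pulling it out of the parsed body follows; `extract_reply` is an illustrative helper and the sample payload is hand-written, not real server output.

```python
def extract_reply(payload: dict) -> str:
    """Return the assistant's text from an OpenAI-style chat completion body."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {payload}")
    return choices[0]["message"]["content"]


# Hand-written sample in the OpenAI chat completion shape (not real server output).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample))  # prints: Paris.
```

With a live server, `extract_reply(response.json())` would replace the final `print(response.json())` call.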
### Using the Native `/generate` API
SGLang also provides a native `/generate` endpoint, which takes a raw text prompt and SGLang-specific `sampling_params` instead of chat messages.
```python Example
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```
#### Streaming with `/generate`
```python Example
import json

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0  # Number of characters already printed
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        # Print only the part of the text that has not been printed yet.
        print(output[prev:], end="", flush=True)
        prev = len(output)
```
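The loop above treats each `data:` line as a JSON event whose `text` field holds everything generated so far, which is why it tracks `prev` and prints only the new suffix. The same logic can be sketched as a small generator and exercised without a server; `iter_new_text` is an illustrative name and the sample events are made up for the demonstration.

```python
import json


def iter_new_text(lines):
    """Yield only the newly generated suffix from cumulative `data:` events."""
    prev = 0
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # Skip blank separator lines between events.
        payload = line[5:].strip()
        if payload == "[DONE]":
            break
        text = json.loads(payload)["text"]
        yield text[prev:]
        prev = len(text)


# Made-up events mimicking the stream format handled above.
sample_lines = [
    b'data: {"text": " Paris"}',
    b"",
    b'data: {"text": " Paris, a city"}',
    b"",
    b"data: [DONE]",
]
print("".join(iter_new_text(sample_lines)))  # prints: " Paris, a city" (without quotes)
```

Against a live server, `iter_new_text(response.iter_lines())` could stand in for the manual loop.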
---
## Offline Batch Inference (No Server)
SGLang also supports offline batch inference using the `Engine` class directly -- no HTTP server required.
```python Example
import sglang as sgl


def main():
    llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}\n")
    llm.shutdown()


# The Engine starts worker subprocesses, so when this runs as a script
# the entry point should sit behind a __main__ guard.
if __name__ == "__main__":
    main()
```
---
## Common Troubleshooting
If you encounter `OSError: CUDA_HOME environment variable is not set`, set `CUDA_HOME` to your CUDA install root:
```bash
export CUDA_HOME=/usr/local/cuda-
```
Switch to the Triton attention backend and the PyTorch sampling backend by adding these flags when launching the server:
```bash
--attention-backend triton --sampling-backend pytorch
```
To force-reinstall FlashInfer and clear its cache:
```bash
pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer
```
To point Triton at the `ptxas` binary under `/usr/local/cuda`:
```bash
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```
---
{/*
WIP, TBD linked later
## What's Next?
Explore the full Chat Completions and Completions APIs, including multi-turn conversations.
Send image inputs alongside text using OpenAI-compatible vision APIs.
Fine-tune generation with temperature, top-p, frequency penalty, and more.
Customize server behavior with advanced launch arguments like tensor parallelism.
Constrain model output to JSON, regex, or EBNF grammars.
Use the familiar Ollama CLI and Python library with SGLang as the backend.
*/}