---
title: "Quickstart"
description: "Get up and running with SGLang in minutes: install, launch a server, and send your first request."
---

## Overview

This guide walks you through the entire flow of getting started with SGLang:

1. **Install** SGLang
2. **Launch** an inference server
3. **Send requests** using cURL, OpenAI Python client, Python `requests`, or the native SGLang API

By the end, you'll have a working SGLang server responding to your prompts.

---

## Prerequisites

- **Python**: 3.9 or higher
- **GPU**: NVIDIA GPU with CUDA support (sm75 and above, e.g., T4, A10, A100, L4, L40S, H100)
- **OS**: Linux (recommended)

For other platforms, see the dedicated guides for [AMD GPUs](../hardware-platforms/amd-gpus), [Intel Xeon CPUs](../hardware-platforms/cpu-server), [Google TPUs](../hardware-platforms/tpu), [NVIDIA Jetson](../hardware-platforms/nvidia), [Ascend NPUs](../hardware-platforms/ascend-npus/SGLang-installation-with-NPUs-support), and [Intel XPU](../hardware-platforms/xpu).

---

## Installation

We recommend using **uv** for faster installation:

```bash
pip install --upgrade pip
pip install uv
uv pip install sglang
```

```bash
# Clone and install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"
```

The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
Set the empty `HF_TOKEN=` value in the command below to your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens):

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

For production deployments, use the smaller **runtime** variant (~40% size reduction):

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=" \
    --ipc=host \
    lmsysorg/sglang:latest-runtime \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

If you encounter `OSError: CUDA_HOME environment variable is not set`, point `CUDA_HOME` at your CUDA install root, completing the path with the CUDA version installed on your machine:

```bash
export CUDA_HOME=/usr/local/cuda-
```

---

## Launch a Server

Start the SGLang server with a model. Here we use `qwen/qwen2.5-0.5b-instruct` as a lightweight example:

```bash
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
```

Wait until you see `The server is fired up and ready to roll!` in the terminal output.

Once the server is running, API documentation is available at:

- **Swagger UI**: `http://localhost:30000/docs`
- **ReDoc**: `http://localhost:30000/redoc`
- **OpenAPI Spec**: `http://localhost:30000/openapi.json`

The server automatically applies the chat template from the Hugging Face tokenizer. You can override it with `--chat-template` when launching.

---

## Send Requests

SGLang is fully **OpenAI API-compatible**, so you can use the same tools and libraries you already know.
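
Before sending the first request from a script, it can help to confirm that the server is actually answering rather than watching the terminal for the ready message. The following is a minimal sketch, not part of SGLang: `wait_for_server` is a hypothetical helper, and it assumes the server launched above is listening on `localhost:30000` and serves the OpenAPI spec URL listed in the previous section.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    Hypothetical helper for illustration; returns True once the URL responds.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # The server is not accepting connections yet; retry after a pause.
            pass
        time.sleep(interval_s)
    return False


# Usage (assumes the server from "Launch a Server" on port 30000):
#   wait_for_server("http://localhost:30000/openapi.json")
```

Polling an HTTP URL instead of parsing log output keeps the check independent of how the server was started (local process or Docker).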

### Using cURL

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

### Using OpenAI Python Client

Install the OpenAI Python library if you haven't:

```bash
pip install openai
```

Then send a request:

```python Example
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```

#### Streaming

```python Example
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Using Python Requests

```python Example
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
```

### Using the Native `/generate` API

SGLang also provides a native `/generate` endpoint for more flexibility.

```python Example
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

#### Streaming with `/generate`

```python Example
import json

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

# Each event carries the full text generated so far, so print only the new suffix.
prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
```

---

## Offline Batch Inference (No Server)

SGLang also supports offline batch inference using the `Engine` class directly, with no HTTP server required.

```python Example
import sglang as sgl

# Guard the entry point when running this as a script: the engine starts
# worker subprocesses, which re-import the main module.
if __name__ == "__main__":
    llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}\n")

    llm.shutdown()
```

---

## Common Troubleshooting

If you see `OSError: CUDA_HOME environment variable is not set`, set the `CUDA_HOME` environment variable to your CUDA install root:

```bash
export CUDA_HOME=/usr/local/cuda-
```

To switch to alternative attention and sampling backends, add these flags when launching the server:

```bash
--attention-backend triton --sampling-backend pytorch
```

To reinstall FlashInfer and clear its cache:

```bash
pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer
```

To point Triton at the `ptxas` binary from your CUDA install:

```bash
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```

---

{/* WIP, TBD linked later

## What's Next?
Explore the full Chat Completions and Completions APIs, including multi-turn conversations. Send image inputs alongside text using OpenAI-compatible vision APIs. Fine-tune generation with temperature, top-p, frequency penalty, and more. Customize server behavior with advanced launch arguments like tensor parallelism. Constrain model output to JSON, regex, or EBNF grammars. Use the familiar Ollama CLI and Python library with SGLang as the backend. */}
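
The environment variables from the Common Troubleshooting section can be checked in one pass before launching the server. This is a rough sketch under stated assumptions: `check_cuda_env` is a hypothetical helper, not an SGLang API, and it only verifies that `CUDA_HOME` and (if set) `TRITON_PTXAS_PATH` point at existing paths. It does not validate the CUDA installation itself.

```python
import os
from typing import Mapping


def check_cuda_env(env: Mapping[str, str]) -> list[str]:
    """Return warnings about the environment variables used in troubleshooting.

    Hypothetical helper for illustration only.
    """
    warnings = []

    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        warnings.append("CUDA_HOME is not set")
    elif not os.path.isdir(cuda_home):
        warnings.append(f"CUDA_HOME points to a missing directory: {cuda_home}")

    # TRITON_PTXAS_PATH is optional; only validate it when present.
    ptxas = env.get("TRITON_PTXAS_PATH")
    if ptxas and not os.path.isfile(ptxas):
        warnings.append(f"TRITON_PTXAS_PATH does not point to a file: {ptxas}")

    return warnings


# Report on the current process environment.
for line in check_cuda_env(os.environ) or ["environment looks consistent"]:
    print(line)
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dictionary, without modifying the real process environment.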