mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
257 lines
11 KiB
Plaintext
257 lines
11 KiB
Plaintext
---
|
|
title: Installation
|
|
description: Install SGLang with pip/uv, source, Docker, Kubernetes, and cloud deployment options.
|
|
keywords:
|
|
- installation
|
|
- sglang
|
|
- pip
|
|
- docker
|
|
---
|
|
|
|
You can install SGLang using one of the methods below.
|
|
This page primarily applies to common NVIDIA GPU platforms.
|
|
For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../hardware-platforms/amd-gpus), [Intel Xeon CPUs](../hardware-platforms/cpu-server), [Google TPU](../hardware-platforms/tpu), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-11-03-gpt-oss-on-nvidia-dgx-spark/), [NVIDIA Jetson](../hardware-platforms/nvidia), [Ascend NPUs](../hardware-platforms/ascend-npus/SGLang-installation-with-NPUs-support), and [Intel XPU](../hardware-platforms/xpu).
|
|
|
|
<a id="install-methods"></a>
|
|
## Install methods
|
|
|
|
<Tabs>
|
|
<Tab title="Pip or uv">
|
|
It is recommended to use <Tooltip tip="A fast Python package manager.">uv</Tooltip> for faster installation:
|
|
|
|
```bash
|
|
pip install --upgrade pip
|
|
pip install uv
|
|
uv pip install "sglang"
|
|
```
|
|
|
|
### Quick fixes to common problems
|
|
|
|
<AccordionGroup>
|
|
<Accordion title="Wrong torch version">
|
|
In some cases (for example, GB200), the command above might install a wrong torch version (for example, the CPU version) due to dependency resolution. Reinstall the correct [PyTorch](https://pytorch.org/get-started/locally/) with the following:
|
|
|
|
```bash
|
|
uv pip install "torch" "torchvision" --extra-index-url https://download.pytorch.org/whl/cu129 --force-reinstall
|
|
```
|
|
</Accordion>
|
|
|
|
<Accordion title="CUDA 13 without Docker">
|
|
If you do not have Docker access, install the matching `sgl_kernel` wheel from [the sgl-project whl releases](https://github.com/sgl-project/whl/releases) after installing SGLang. Replace `X.Y.Z` with the `sgl_kernel` version required by your SGLang (you can find this by running `uv pip show sgl_kernel`).
|
|
|
|
**x86_64**
|
|
|
|
```bash
|
|
uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sgl_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_x86_64.whl"
|
|
```
|
|
|
|
**aarch64**
|
|
|
|
```bash
|
|
uv pip install "https://github.com/sgl-project/whl/releases/download/vX.Y.Z/sgl_kernel-X.Y.Z+cu130-cp310-abi3-manylinux2014_aarch64.whl"
|
|
```
|
|
</Accordion>
|
|
|
|
<Accordion title="CUDA_HOME not set">
|
|
Choose one of the following solutions:
|
|
|
|
1. Set `CUDA_HOME` to your CUDA install root:
|
|
|
|
```bash
|
|
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
|
|
```
|
|
|
|
2. Install FlashInfer first following the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
|
|
</Accordion>
|
|
</AccordionGroup>
|
|
</Tab>
|
|
|
|
<Tab title="From source">
|
|
```bash
|
|
git clone https://github.com/sgl-project/sglang.git
|
|
cd sglang
|
|
pip install --upgrade pip
|
|
pip install -e "python"
|
|
```
|
|
|
|
### Quick fixes to common problems
|
|
|
|
<AccordionGroup>
|
|
<Accordion title="Development setup">
|
|
If you want to develop SGLang, try the dev docker image. Refer to [setup docker container](../developer_guide/development_guide_using_docker#setup-docker-container). The docker image is `lmsysorg/sglang:dev`.
|
|
</Accordion>
|
|
</AccordionGroup>
|
|
</Tab>
|
|
|
|
<Tab title="Docker">
|
|
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
|
|
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
|
|
|
|
**Standard image**
|
|
|
|
```bash
|
|
docker run --gpus all \
|
|
--shm-size 32g \
|
|
-p 30000:30000 \
|
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
|
--env "HF_TOKEN=<secret>" \
|
|
--ipc=host \
|
|
lmsysorg/sglang:latest \
|
|
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
|
|
```
|
|
|
|
**Runtime image for production**
|
|
|
|
```bash
|
|
docker run --gpus all \
|
|
--shm-size 32g \
|
|
-p 30000:30000 \
|
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
|
--env "HF_TOKEN=<secret>" \
|
|
--ipc=host \
|
|
lmsysorg/sglang:latest-runtime \
|
|
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
|
|
```
|
|
|
|
You can also find the nightly docker images [here](https://hub.docker.com/r/lmsysorg/sglang/tags?name=nightly).
|
|
|
|
<Note>
|
|
On B300/GB300 (SM103) or CUDA 13 environment, use the nightly image at `lmsysorg/sglang:dev-cu13` or stable image at `lmsysorg/sglang:latest-cu130-runtime`. Do not re-install the project as editable inside the docker image, since it will override the version of libraries specified by the cu13 docker image.
|
|
</Note>
|
|
</Tab>
|
|
|
|
<Tab title="Kubernetes">
|
|
Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).
|
|
|
|
<Tabs>
|
|
<Tab title="Single node serving">
|
|
For models that fit into GPUs on one node, create the deployment and service with llama-31-8b as example.
|
|
|
|
```bash
|
|
kubectl apply -f docker/k8s-sglang-service.yaml
|
|
```
|
|
</Tab>
|
|
|
|
<Tab title="Multi-node serving">
|
|
For larger models (for example, `DeepSeek-R1`), modify the model path and arguments, then create the statefulset and service.
|
|
|
|
```bash
|
|
kubectl apply -f docker/k8s-sglang-distributed-sts.yaml
|
|
```
|
|
</Tab>
|
|
</Tabs>
|
|
</Tab>
|
|
|
|
<Tab title="Docker Compose">
|
|
<Note>
|
|
This method is recommended if you plan to serve it as a service. A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).
|
|
</Note>
|
|
|
|
1. Copy the [compose.yml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
|
|
2. Start the service:
|
|
|
|
```bash
|
|
docker compose up -d
|
|
```
|
|
</Tab>
|
|
|
|
<Tab title="SkyPilot">
|
|
To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
|
|
|
|
1. Install SkyPilot and set up Kubernetes cluster or cloud access. See [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
|
|
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
|
|
|
|
**SkyPilot YAML: `sglang.yaml`**
|
|
|
|
```yaml Config
|
|
# sglang.yaml
|
|
envs:
|
|
HF_TOKEN: null
|
|
|
|
resources:
|
|
image_id: docker:lmsysorg/sglang:latest
|
|
accelerators: A100
|
|
ports: 30000
|
|
|
|
run: |
|
|
conda deactivate
|
|
python3 -m sglang.launch_server \
|
|
--model-path meta-llama/Llama-3.1-8B-Instruct \
|
|
--host 0.0.0.0 \
|
|
--port 30000
|
|
```
|
|
|
|
```bash
|
|
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
|
|
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
|
|
|
|
# Get the HTTP API endpoint
|
|
sky status --endpoint 30000 sglang
|
|
```
|
|
|
|
3. To scale with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
|
|
</Tab>
|
|
|
|
<Tab title="AWS SageMaker">
|
|
To deploy on SGLang on AWS SageMaker, check out [AWS SageMaker Inference](https://aws.amazon.com/sagemaker/ai/deploy).
|
|
|
|
Amazon Web Services provide supports for SGLang containers along with routine security patching. For available SGLang containers, check out [AWS SGLang DLCs](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sglang-containers).
|
|
|
|
To host a model with your own container, follow the following steps:
|
|
|
|
1. Build a docker container with [sagemaker.Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/sagemaker.Dockerfile) alongside the [serve](https://github.com/sgl-project/sglang/blob/main/docker/serve) script, then push it to AWS ECR.
|
|
|
|
**Dockerfile build script: `build-and-push.sh`**
|
|
|
|
```bash
|
|
#!/bin/bash
|
|
AWS_ACCOUNT="<YOUR_AWS_ACCOUNT>"
|
|
AWS_REGION="<YOUR_AWS_REGION>"
|
|
REPOSITORY_NAME="<YOUR_REPOSITORY_NAME>"
|
|
IMAGE_TAG="<YOUR_IMAGE_TAG>"
|
|
|
|
ECR_REGISTRY="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com"
|
|
IMAGE_URI="${ECR_REGISTRY}/${REPOSITORY_NAME}:${IMAGE_TAG}"
|
|
|
|
echo "Starting build and push process..."
|
|
|
|
# Login to ECR
|
|
echo "Logging into ECR..."
|
|
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REGISTRY}
|
|
|
|
# Build the image
|
|
echo "Building Docker image..."
|
|
docker build -t ${IMAGE_URI} -f sagemaker.Dockerfile .
|
|
|
|
echo "Pushing ${IMAGE_URI}"
|
|
docker push ${IMAGE_URI}
|
|
|
|
echo "Build and push completed successfully!"
|
|
```
|
|
|
|
2. Deploy a model for serving on AWS Sagemaker. Refer to [deploy_and_serve_endpoint.py](https://github.com/sgl-project/sglang/blob/main/examples/sagemaker/deploy_and_serve_endpoint.py). For more information, check out [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk).
|
|
|
|
**Default command**
|
|
|
|
The model server on SageMaker runs: `python3 -m sglang.launch_server --model-path opt/ml/model --host 0.0.0.0 --port 8080`.
|
|
|
|
**Custom arguments**
|
|
|
|
The [serve](https://github.com/sgl-project/sglang/blob/main/docker/serve) script exposes all options in `python3 -m sglang.launch_server --help` through environment variables prefixed with `SM_SGLANG_`.
|
|
|
|
**Environment variable mapping**
|
|
|
|
The serve script converts variables with prefix `SM_SGLANG_` from `SM_SGLANG_INPUT_ARGUMENT` into `--input-argument` for the `python3 -m sglang.launch_server` CLI.
|
|
|
|
**Example**
|
|
|
|
To run [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) with reasoning parser, add `SM_SGLANG_MODEL_PATH=Qwen/Qwen3-0.6B` and `SM_SGLANG_REASONING_PARSER=qwen3`.
|
|
</Tab>
|
|
</Tabs>
|
|
|
|
## Common notes
|
|
|
|
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (for example, T4, A10, A100, L4, L40S, H100), switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
|
|
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
|
|
- When encountering `ptxas fatal : Value 'sm_103a' is not defined for option 'gpu-name'` on B300/GB300, fix it with `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas`.
|