mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 19:57:52 +00:00
Co-authored-by: AdityaVKochar <adityavardhankochar@gmail.com> Co-authored-by: mintlify[bot] <109931778+mintlify[bot]@users.noreply.github.com> Co-authored-by: adhyan-jain <adhyanjain2006@gmail.com> Co-authored-by: Adhyan Jain <71976554+adhyan-jain@users.noreply.github.com> Co-authored-by: Maitri-shah29 <maitrirajivshah@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Maitri Shah <shah29maitri@gmail.com> Co-authored-by: Aditya Vardhan Kochar <80113212+AdityaVKochar@users.noreply.github.com> Co-authored-by: Rishit Shivam <164783543+pokymono@users.noreply.github.com> Co-authored-by: Rishitshivam <164783543+Rishitshivam@users.noreply.github.com> Co-authored-by: IshhanKheria <ishhankheria06@gmail.com> Co-authored-by: Ishita Joshi <ishitata.joshi@gmail.com> Co-authored-by: Richard Chen <104477092+Richardczl98@users.noreply.github.com> Co-authored-by: longGGGGGG <553746008@qq.com> Co-authored-by: Richard <richardchen@radixark.ai> Co-authored-by: Nakul Sinha <nakul.new4socials@gmail.com> Co-authored-by: Divyam Agrawal <ludicrouslytrue@gmail.com> Co-authored-by: Richardczl98 <Zhenlinc@stanford.edu> Co-authored-by: Krishang Zinzuwadia <krishangzinzuwadia@gmail.com> Co-authored-by: nimeshas <nimesha.s106@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jignas Paturu <86356085+JignasP@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
104 lines
3.0 KiB
Plaintext
---
title: "Multi-Node Deployment"
metatags:
  description: "SGLang multi-node: Llama 405B on 2 nodes, DeepSeek V3/R1, SLURM cluster deployment examples."
---
## Llama 3.1 405B

**Run 405B (fp16) on Two Nodes**

```bash Command
# Replace 172.16.4.52:20000 with the IP address and port of your first node (node rank 0).

# Run on the first node (rank 0)
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
  --tp 16 \
  --dist-init-addr 172.16.4.52:20000 \
  --nnodes 2 \
  --node-rank 0

# Run on the second node (rank 1)
python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
  --tp 16 \
  --dist-init-addr 172.16.4.52:20000 \
  --nnodes 2 \
  --node-rank 1
```

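The two launch commands differ only in `--node-rank`, and `--tp 16` presumably corresponds to 2 nodes with 8 GPUs each (the per-node GPU count is an assumption, taken from the `--gres=gpu:8` setting in the SLURM example on this page). A minimal sketch that prints the command for each rank; the helper name `sglang_launch_cmd` and the variables are hypothetical and not part of SGLang:

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) the launch command for one node.
# GPUS_PER_NODE=8 is an assumption; adjust it to your hardware.
MODEL="meta-llama/Meta-Llama-3.1-405B-Instruct"
DIST_INIT_ADDR="172.16.4.52:20000"   # IP address and port of the first node
NNODES=2
GPUS_PER_NODE=8

sglang_launch_cmd() {
  local node_rank="$1"
  local tp=$((NNODES * GPUS_PER_NODE))   # tensor-parallel size spans all GPUs
  printf 'python3 -m sglang.launch_server --model-path %s --tp %d --dist-init-addr %s --nnodes %d --node-rank %d\n' \
    "$MODEL" "$tp" "$DIST_INIT_ADDR" "$NNODES" "$node_rank"
}

# Print one command per node; run each line on the matching machine.
for rank in $(seq 0 $((NNODES - 1))); do
  sglang_launch_cmd "$rank"
done
```
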
Note that Llama 3.1 405B (fp8) can also be launched on a single node.

```bash Command
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
```

## DeepSeek V3/R1

Please refer to the [DeepSeek V3/R1 multi-node examples](../../basic_usage/deepseek_v3#running-examples-on-multi-node).

## Multi-Node Inference on SLURM

This example shows how to serve an SGLang server across multiple nodes with SLURM. Submit the following job to the SLURM cluster.

```bash Command
#!/bin/bash -l

#SBATCH -o SLURM_Logs/%x_%j_master.out
#SBATCH -e SLURM_Logs/%x_%j_master.err
#SBATCH -D ./
#SBATCH -J Llama-405B-Online-Inference-TP16-SGL

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1  # Ensure 1 task per node
#SBATCH --cpus-per-task=18
#SBATCH --mem=224GB
#SBATCH --partition="lmsys.org"
#SBATCH --gres=gpu:8
#SBATCH --time=12:00:00

echo "[INFO] Activating environment on node $SLURM_PROCID"
if ! source ENV_FOLDER/bin/activate; then
    echo "[ERROR] Failed to activate environment" >&2
    exit 1
fi

# Define parameters
model=MODEL_PATH
tp_size=16

echo "[INFO] Running inference"
echo "[INFO] Model: $model"
echo "[INFO] TP Size: $tp_size"

# Set the distributed initialization address (passed to --dist-init-addr)
# using the hostname of the head node
HEAD_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)
NCCL_INIT_ADDR="${HEAD_NODE}:8000"
echo "[INFO] NCCL_INIT_ADDR: $NCCL_INIT_ADDR"

# Launch the model server on each node using SLURM.
# The command is wrapped in a single-quoted `bash -c` so that $SLURM_NODEID is
# expanded inside each task; written directly in the batch script it would be
# expanded once, on the head node, giving every task the same rank.
# %n in the log file names is srun's per-node ID pattern.
export model tp_size NCCL_INIT_ADDR
srun --ntasks=2 --nodes=2 --output="SLURM_Logs/%x_%j_node%n.out" \
    --error="SLURM_Logs/%x_%j_node%n.err" \
    bash -c 'exec python3 -m sglang.launch_server \
        --model-path "$model" \
        --grammar-backend "xgrammar" \
        --tp "$tp_size" \
        --dist-init-addr "$NCCL_INIT_ADDR" \
        --nnodes 2 \
        --node-rank "$SLURM_NODEID"' &

# Wait for the SGLang HTTP server on the head node to accept connections
# on port 30000 (the default server port)
while ! nc -z "$HEAD_NODE" 30000; do
    sleep 1
    echo "[INFO] Waiting for $HEAD_NODE:30000 to accept connections"
done

echo "[INFO] $HEAD_NODE:30000 is ready to accept connections"

# Keep the script running until the SLURM job times out
wait
```

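The readiness loop in the job script polls indefinitely if the server never comes up. One possible variation is to bound the wait with a timeout. The sketch below is an illustration under stated assumptions, not part of the original example: `wait_until` is a hypothetical helper that retries an arbitrary probe command about once per second until it succeeds or the timeout expires.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of SGLang or SLURM).
# Usage: wait_until <timeout_seconds> <probe command...>
# Returns 0 as soon as the probe succeeds, 1 if the timeout expires first.
wait_until() {
  local timeout="$1"
  shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# In the job script, the polling loop could then become (1800 s is an arbitrary choice):
#   if ! wait_until 1800 nc -z "$HEAD_NODE" 30000; then
#     echo "[ERROR] $HEAD_NODE:30000 did not become ready in time" >&2
#     exit 1
#   fi
```
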
Once the server is ready, you can test it by sending requests as described in the [OpenAI-compatible API documentation](../../basic_usage/openai_api_completions).

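As a quick smoke test, a request can be sent to the head node's OpenAI-compatible completions endpoint. The sketch below assumes the server listens on port 30000, as in the SLURM script, and that `HEAD_NODE` and `MODEL_PATH` are placeholders you replace; `completion_payload` is a hypothetical helper that only builds the JSON body.

```bash
#!/usr/bin/env bash
# Placeholders (assumptions): replace with your head node and served model.
HEAD_NODE="${HEAD_NODE:-localhost}"
MODEL="${MODEL:-MODEL_PATH}"

# Hypothetical helper: prints a JSON body for the /v1/completions endpoint.
# It does not escape quotes or backslashes, so keep the arguments simple.
completion_payload() {
  local model="$1" prompt="$2" max_tokens="$3"
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' \
    "$model" "$prompt" "$max_tokens"
}

# Show the request body that would be sent.
completion_payload "$MODEL" "The capital of France is" 16

# To actually send it once the server is up:
#   curl -s "http://${HEAD_NODE}:30000/v1/completions" \
#     -H "Content-Type: application/json" \
#     -d "$(completion_payload "$MODEL" "The capital of France is" 16)"
```
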
Thanks to [aflah02](https://github.com/aflah02) for providing the example, which is based on his [blog post](https://aflah02.substack.com/p/multi-node-llm-inference-with-sglang).