---
title: "CPU Servers"
---

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
SGLang is enabled and optimized on CPUs equipped with Intel® AMX® instructions,
that is, 4th generation or newer Intel® Xeon® Scalable Processors.

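Before installing, it can help to confirm that the kernel reports AMX support. The sketch below is an illustration rather than part of SGLang: the helper names `amx_flags` and `has_amx` are made up for this example, and treating the presence of `amx_tile` as "AMX available" is an assumption, not a check that SGLang itself performs. On Linux the relevant feature flags appear in `/proc/cpuinfo` as `amx_tile`, `amx_bf16`, and `amx_int8`.

```bash
# Hypothetical helpers for illustration; not provided by SGLang.

# Print the AMX feature flags found in a cpuinfo-style file, one per line.
# Defaults to /proc/cpuinfo; a path argument allows checking a saved copy.
amx_flags() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep '^amx' | sort -u
}

# Succeed only if the base tile flag is present (assumed proxy for AMX).
has_amx() {
  amx_flags "$1" | grep -qx 'amx_tile'
}
```

Running `amx_flags` with no argument lists the flags on the current machine, and `has_amx && echo "AMX available"` gives a quick yes/no answer.
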
## Optimized Model List

Many popular LLMs are optimized to run efficiently on CPU,
including notable open-source models such as the Llama series, the Qwen series,
and DeepSeek models such as DeepSeek-R1 and DeepSeek-V3.1-Terminus.

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
<colgroup>
<col style={{width: "22%"}} />
<col style={{width: "26%"}} />
<col style={{width: "30%"}} />
<col style={{width: "22%"}} />
</colgroup>
<thead>
<tr style={{borderBottom: "2px solid #d55816"}}>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Name</th>
<th style={{textAlign: "center", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>BF16</th>
<th style={{textAlign: "center", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>W8A8_INT8</th>
<th style={{textAlign: "center", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>FP8</th>
</tr>
</thead>
<tbody>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek-R1</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8">DeepSeek-R1-Channel-INT8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek-R1</a></td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek-V3.1-Terminus</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8">DeepSeek-V3.1-Terminus-Channel-int8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus">DeepSeek-V3.1-Terminus</a></td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.2-3B</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct">Llama-3.2-3B-Instruct</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8">Llama-3.2-3B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.1-8B</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8">Llama-3.1-8B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>QwQ-32B</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8">QwQ-32B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek-Distilled-Llama</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8">DeepSeek-R1-Distill-Llama-70B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Qwen3-235B</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.02)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8">Qwen3-235B-A22B-FP8</a></td>
</tr>
</tbody>
</table>

> **Note:** The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.

## Installation

<Tabs>
<Tab title="Docker (Recommended)">
Docker is the recommended way to set up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile) is provided to facilitate the installation.

> **Note:** Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

<CodeGroup>
```bash Clone, Build and Run
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the Docker image
docker build -t sglang-cpu:latest -f xeon.Dockerfile .

# Start a Docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:latest /bin/bash
```
</CodeGroup>
</Tab>

<Tab title="From Source">
If you prefer to install SGLang in a bare-metal environment, follow the steps below.

First install the required packages and libraries if they are not already present on your system.
You can refer to the Ubuntu-based installation commands in
[the Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile#L11) for guidance.

1. **Install uv and create a virtual environment**
<CodeGroup>
```bash Create Virtual Environment
# Taking '/opt' as the example uv env folder, feel free to change it as needed
cd /opt
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv --python 3.12
source .venv/bin/activate
```
</CodeGroup>

2. **Create a config file for torch package indexes**
Create the `uv.toml` config file:

<CodeGroup>
```bash Open Config File
vim .venv/uv.toml
```
</CodeGroup>

Press `a` to enter insert mode in `vim`, then paste the following content:

<CodeGroup>
```toml
[[index]]
name = "torch"
url = "https://download.pytorch.org/whl/cpu"

[[index]]
name = "torchvision"
url = "https://download.pytorch.org/whl/cpu"

[[index]]
name = "torchaudio"
url = "https://download.pytorch.org/whl/cpu"

[[index]]
name = "triton"
url = "https://download.pytorch.org/whl/cpu"
```
</CodeGroup>

Save the file (press `Esc`, then type `:x` and hit `Enter`), then set it as the default `uv` config:

<CodeGroup>
```bash Set Config Path
export UV_CONFIG_FILE=/opt/.venv/uv.toml
```
</CodeGroup>

3. **Clone SGLang and build packages**
<CodeGroup>
```bash Build SGLang
# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Use dedicated toml file
cd python
cp pyproject_cpu.toml pyproject.toml

# Install SGLang dependent libs, and build SGLang main package
uv pip install --upgrade pip setuptools
uv pip install .

# Build the CPU backend kernels
cd ../sgl-kernel
cp pyproject_cpu.toml pyproject.toml
uv pip install .
```
</CodeGroup>

4. **Set required environment variables**
<CodeGroup>
```bash Set Environment Variables
export SGLANG_USE_CPU_ENGINE=1

# Set 'LD_LIBRARY_PATH' and 'LD_PRELOAD' to ensure the libs can be loaded by sglang processes
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
export LD_PRELOAD=${LD_PRELOAD}:/opt/.venv/lib/libiomp5.so:${LD_LIBRARY_PATH}/libtcmalloc.so.4:${LD_LIBRARY_PATH}/libtbbmalloc.so.2
```
</CodeGroup>

> **Note:** The environment variable `SGLANG_USE_CPU_ENGINE=1` is required to enable the SGLang service with the CPU engine.

> **Note:** If you encounter compilation errors while building `sgl-kernel`, check your `gcc` and `g++` versions and upgrade them if they are outdated. `gcc-13` and `g++-13` are recommended, as they have been verified in the official Docker container.

> **Note:** The system library path is typically one of the following directories: `~/.local/lib/`, `/usr/local/lib/`, `/usr/local/lib64/`, `/usr/lib/`, `/usr/lib64/`, and `/usr/lib/x86_64-linux-gnu/`. The example commands above use `/usr/lib/x86_64-linux-gnu`. Adjust the path according to your server configuration.

To avoid setting these variables every time you open a new terminal, add the following to your `~/.bashrc` file:

<CodeGroup>
```bash Persist in ~/.bashrc
# Use an absolute path, since ~/.bashrc is not evaluated from /opt
source /opt/.venv/bin/activate
export SGLANG_USE_CPU_ENGINE=1
export LD_LIBRARY_PATH=<YOUR-SYSTEM-LIBRARY-FOLDER>
export LD_PRELOAD=<YOUR-LIBS-PATHS>
```
</CodeGroup>

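If an interactive editor is inconvenient (for example in a provisioning script), the `uv.toml` from step 2 can also be written non-interactively. The following is only a sketch: `write_uv_config` is a hypothetical helper name, and it simply emits the same four `[[index]]` entries shown in step 2.

```bash
# Hypothetical helper for illustration; writes the step 2 config file.
# Usage: write_uv_config /opt/.venv
write_uv_config() {
  local venv_dir="$1" pkg
  mkdir -p "$venv_dir" || return 1
  : > "$venv_dir/uv.toml"   # start from an empty file
  for pkg in torch torchvision torchaudio triton; do
    printf '[[index]]\nname = "%s"\nurl = "https://download.pytorch.org/whl/cpu"\n\n' \
      "$pkg" >> "$venv_dir/uv.toml"
  done
}
```

After `write_uv_config /opt/.venv`, set `UV_CONFIG_FILE=/opt/.venv/uv.toml` as in step 2.
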
</Tab>
</Tabs>

## Launch of the Serving Engine

Example command to launch SGLang serving:

<CodeGroup>
```bash Launch Server
python -m sglang.launch_server \
    --model <MODEL_ID_OR_PATH> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```
</CodeGroup>

> **Note:** For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.

> **Note:** The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6). On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC). You can get the SNC count using `lscpu`. If the specified number of TP ranks `n` is smaller than the total SNC count, the system automatically uses the first `n` SNCs; `n` cannot exceed the total SNC count.
>
> To specify the cores to be used, set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. For example, to use the first 40 cores of each SNC on a Xeon® 6980P server (which has 43-43-42 cores on the 3 SNCs of a socket):

<CodeGroup>
```bash Set Thread Binding
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
</CodeGroup>

> Be aware that with `SGLANG_CPU_OMP_THREADS_BIND` set, the available memory of each rank may not be determined in advance. You may need to set `--max-total-tokens` to avoid out-of-memory errors.

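The binding string above follows directly from the per-SNC core counts, so it can be derived rather than typed by hand. The sketch below assumes that core IDs are numbered contiguously, SNC after SNC, which is what the 6980P example implies; `build_omp_bind` is a hypothetical helper, not an SGLang tool.

```bash
# Hypothetical helper for illustration.
# Usage: build_omp_bind CORES_PER_RANK SNC_SIZE...
# Prints one core range per SNC, joined with '|', taking the first
# CORES_PER_RANK cores of each SNC.
build_omp_bind() {
  local take="$1" start=0 out="" size
  shift
  for size in "$@"; do
    if [ "$take" -gt "$size" ]; then
      echo "cannot take $take cores from an SNC with $size cores" >&2
      return 1
    fi
    out="${out:+$out|}${start}-$((start + take - 1))"
    start=$((start + size))   # next SNC starts after all cores of this one
  done
  printf '%s\n' "$out"
}

# Two sockets with the 43-43-42 layout described above, 40 cores per SNC.
build_omp_bind 40 43 43 42 43 43 42
# prints 0-39|43-82|86-125|128-167|171-210|214-253
```

If `lscpu` shows a different core numbering on your server, adjust the ranges by hand instead of relying on this calculation.
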
> **Note:** To optimize decoding with `torch.compile`, add the flag `--enable-torch-compile`. To specify the maximum batch size, set `--torch-compile-max-bs`. For example, `--enable-torch-compile --torch-compile-max-bs 4` uses `torch.compile` with a maximum batch size of 4. The maximum applicable batch size is 16.

> **Note:** A warmup step is automatically triggered when the service is started. The server is ready when you see the log `The server is fired up and ready to roll!`.

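In scripts, the readiness message can be waited on instead of watched by eye. This is a minimal sketch under the assumption that the server output has been redirected to a log file (for example with `> server.log 2>&1 &`); `wait_for_ready` is a hypothetical helper name.

```bash
# Hypothetical helper for illustration.
# Usage: wait_for_ready LOG_FILE [TIMEOUT_SECONDS]
# Polls the log once per second until the readiness message appears.
wait_for_ready() {
  local log_file="$1" timeout="${2:-600}" waited=0
  until grep -qF 'The server is fired up and ready to roll!' "$log_file" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A typical use would be `wait_for_ready server.log && python -m sglang.bench_serving ...`, so that benchmarking starts only after the warmup has finished.
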
## Benchmarking with Requests

You can benchmark performance with the `bench_serving` script.
Run it in another terminal while the server is running, for example:

<CodeGroup>
```bash Run Benchmark
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```
</CodeGroup>

Detailed parameter descriptions are available via the command:

<CodeGroup>
```bash Benchmark Help
python -m sglang.bench_serving -h
```
</CodeGroup>

Additionally, requests can be formatted using
[the OpenAI Completions API](../basic_usage/openai_api_completions)
and sent via the command line (e.g., using `curl`) or through your own scripts.

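As a rough illustration of such a request, the sketch below builds a Completions payload and posts it with `curl`. Several things here are assumptions: the server is taken to listen on `localhost:30000` (the port published in the Docker example), the helper names are hypothetical, and the payload builder does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```bash
# Hypothetical helpers for illustration; adjust SGLANG_URL to your server.
SGLANG_URL="${SGLANG_URL:-http://localhost:30000}"

# Usage: build_completion_payload MODEL PROMPT MAX_TOKENS
# No JSON escaping is performed on MODEL or PROMPT.
build_completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$1" "$2" "$3"
}

# Usage: send_completion MODEL PROMPT MAX_TOKENS
send_completion() {
  curl -sS "${SGLANG_URL}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_completion_payload "$@")"
}
```

For example, `send_completion "<MODEL_ID_OR_PATH>" "The capital of France is" 32` sends a single short completion request.
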
## Example Usage Commands

Large language models range from fewer than 1 billion to several hundred billion parameters.
Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors
with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer,
or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common
4th generation or newer Intel® Xeon® processors, or on a single socket of the flagship 6th Gen Intel® Xeon® processors.

### Example: Running DeepSeek-V3.1-Terminus

<CodeGroup>
```bash W8A8_INT8
python -m sglang.launch_server \
    --model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 4 \
    --tp 6
```

```bash FP8
python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-V3.1-Terminus \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 4 \
    --tp 6
```
</CodeGroup>

> **Note:** Please set `--torch-compile-max-bs` to the maximum desired batch size for your deployment, which can be up to 16. The value `4` in the examples is illustrative.

### Example: Running Llama-3.2-3B

<CodeGroup>
```bash BF16
python -m sglang.launch_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 16 \
    --tp 2
```

```bash W8A8_INT8
# Model ID matches the W8A8_INT8 link in the Optimized Model List table
python -m sglang.launch_server \
    --model RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 16 \
    --tp 2
```
</CodeGroup>

> **Note:** The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup. For instance, use `--tp 3` to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.

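When choosing a `--tp` value, it helps to know how many sub-NUMA clusters the machine exposes. The sketch below reads the `NUMA node(s)` line from `lscpu` output; it assumes that each SNC shows up as one NUMA node, and `numa_node_count` is a hypothetical helper name. The leading-whitespace match is there because some `lscpu` versions indent this line.

```bash
# Hypothetical helper for illustration.
# Reads lscpu-style text on stdin and prints the "NUMA node(s)" count.
numa_node_count() {
  awk -F: '/^[ \t]*NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}
```

Running `lscpu | numa_node_count` prints the count, which serves as an upper bound for `--tp` under the assumption above.
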
Once the server has been launched, you can test it with the `bench_serving` command, or write
your own commands or scripts following [the benchmarking example](#benchmarking-with-requests).