---
title: "CPU Servers"
---
This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
SGLang is enabled and optimized on CPUs that support Intel® AMX instructions,
that is, 4th generation or newer Intel® Xeon® Scalable processors.
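
Before installing, it can help to confirm that the target machine actually reports AMX support. The following is a minimal sketch, assuming a Linux host where the kernel lists CPU feature flags in `/proc/cpuinfo`; the helper name `amx_flags` is hypothetical and not part of SGLang.

```bash
# List the distinct AMX-related CPU flags found in a cpuinfo-style file.
# $1 is the file to read (defaults to /proc/cpuinfo on Linux).
amx_flags() {
    grep -o -w 'amx_[a-z0-9]*' "${1:-/proc/cpuinfo}" 2>/dev/null | sort -u
}

if [ -n "$(amx_flags)" ]; then
    echo "AMX flags found: $(amx_flags | tr '\n' ' ')"
else
    echo "No AMX flags found; this CPU may not be supported."
fi
```

On AMX-capable processors the list typically includes `amx_tile`, `amx_bf16`, and `amx_int8`.
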
## Optimized Model List
A number of popular LLMs are optimized to run efficiently on CPU,
including notable open-source models such as the Llama series, the Qwen series,
and DeepSeek models such as DeepSeek-R1 and DeepSeek-V3.1-Terminus.
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}>
<colgroup>
<col style={{width: "22%"}} />
<col style={{width: "26%"}} />
<col style={{width: "30%"}} />
<col style={{width: "22%"}} />
</colgroup>
<thead>
<tr style={{borderBottom: "2px solid #d55816"}}>
<th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Name</th>
<th style={{textAlign: "center", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>BF16</th>
<th style={{textAlign: "center", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>W8A8_INT8</th>
<th style={{textAlign: "center", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>FP8</th>
</tr>
</thead>
<tbody>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek-R1</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8">DeepSeek-R1-Channel-INT8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek-R1</a></td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek-V3.1-Terminus</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8">DeepSeek-V3.1-Terminus-Channel-int8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus">DeepSeek-V3.1-Terminus</a></td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.2-3B</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct">Llama-3.2-3B-Instruct</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8">Llama-3.2-3B-Instruct-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.1-8B</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct">Llama-3.1-8B-Instruct</a></td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8">Llama-3.1-8B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>QwQ-32B</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8">QwQ-32B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>DeepSeek-Distilled-Llama</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.02)"}}><a href="https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8">DeepSeek-R1-Distill-Llama-70B-quantized.w8a8</a></td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
</tr>
<tr>
<td style={{padding: "9px 12px", whiteSpace: "nowrap", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Qwen3-235B</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.05)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", color: "gray", backgroundColor: "rgba(255,255,255,0.02)"}}>—</td>
<td style={{padding: "9px 12px", textAlign: "center", backgroundColor: "rgba(255,255,255,0.05)"}}><a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8">Qwen3-235B-A22B-FP8</a></td>
</tr>
</tbody>
</table>
> **Note:** The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation
<Tabs>
<Tab title="Docker (Recommended)">
It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile) is provided to facilitate the installation.
> **Note:** Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
<CodeGroup>
```bash Clone, Build and Run
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker
# Build the docker image
docker build -t sglang-cpu:latest -f xeon.Dockerfile .
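# Start a container. With --network=host the container shares the host's network stack,
# so the server port is reachable without a -p mapping (Docker discards -p in host mode).
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:latest /bin/bash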
```
</CodeGroup>
</Tab>
<Tab title="From Source">
If you prefer to install SGLang directly on a bare-metal host, follow the steps below.
First install the required packages and libraries if they are not already present on your system.
The Ubuntu-based installation commands in
[the Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile#L11) can serve as a reference.
1. **Install uv and create a virtual environment**
<CodeGroup>
```bash Create Virtual Environment
# Taking '/opt' as the example uv env folder, feel free to change it as needed
cd /opt
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv --python 3.12
source .venv/bin/activate
```
</CodeGroup>
2. **Create a config file for torch package indexes**
Create the `uv.toml` config file:
<CodeGroup>
```bash Open Config File
vim .venv/uv.toml
```
</CodeGroup>
Press `a` to enter insert mode in `vim`, then paste the following content:
<CodeGroup>
```toml
[[index]]
name = "torch"
url = "https://download.pytorch.org/whl/cpu"
[[index]]
name = "torchvision"
url = "https://download.pytorch.org/whl/cpu"
[[index]]
name = "torchaudio"
url = "https://download.pytorch.org/whl/cpu"
[[index]]
name = "triton"
url = "https://download.pytorch.org/whl/cpu"
```
</CodeGroup>
Save the file (press `Esc`, then type `:x` and hit `Enter`), then set it as the default `uv` config:
<CodeGroup>
```bash Set Config Path
export UV_CONFIG_FILE=/opt/.venv/uv.toml
```
</CodeGroup>
3. **Clone SGLang and build packages**
<CodeGroup>
```bash Build SGLang
# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>
# Use dedicated toml file
cd python
cp pyproject_cpu.toml pyproject.toml
# Install SGLang dependent libs, and build SGLang main package
uv pip install --upgrade pip setuptools
uv pip install .
# Build the CPU backend kernels
cd ../sgl-kernel
cp pyproject_cpu.toml pyproject.toml
uv pip install .
```
</CodeGroup>
4. **Set required environment variables**
<CodeGroup>
```bash Set Environment Variables
export SGLANG_USE_CPU_ENGINE=1
# Set 'LD_LIBRARY_PATH' and 'LD_PRELOAD' to ensure the libs can be loaded by sglang processes
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
export LD_PRELOAD=${LD_PRELOAD}:/opt/.venv/lib/libiomp5.so:${LD_LIBRARY_PATH}/libtcmalloc.so.4:${LD_LIBRARY_PATH}/libtbbmalloc.so.2
```
</CodeGroup>
> **Note:** The environment variable `SGLANG_USE_CPU_ENGINE=1` is required to enable the SGLang service with the CPU engine.
> **Note:** If you encounter code compilation issues during the `sgl-kernel` building process, please check your `gcc` and `g++` versions and upgrade them if they are outdated. It is recommended to use `gcc-13` and `g++-13` as they have been verified in the official Docker container.
> **Note:** The system library path is typically located in one of the following directories: `~/.local/lib/`, `/usr/local/lib/`, `/usr/local/lib64/`, `/usr/lib/`, `/usr/lib64/`, and `/usr/lib/x86_64-linux-gnu/`. In the above example commands, `/usr/lib/x86_64-linux-gnu` is used. Please adjust the path according to your server configuration.
It is recommended to add the following to your `~/.bashrc` file to avoid setting these variables every time you open a new terminal:
<CodeGroup>
```bash Persist in ~/.bashrc
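# Use the absolute path so activation works from any directory ('/opt' is the example folder used above)
source /opt/.venv/bin/activate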
export SGLANG_USE_CPU_ENGINE=1
export LD_LIBRARY_PATH=<YOUR-SYSTEM-LIBRARY-FOLDER>
export LD_PRELOAD=<YOUR-LIBS-PATHS>
```
</CodeGroup>
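
To fill in the `LD_LIBRARY_PATH` and `LD_PRELOAD` placeholders, you first need to know where the three preloaded libraries live on your system. The sketch below searches the common library folders listed in the note above, plus the example `/opt/.venv/lib` folder; the helper name `find_lib` is hypothetical, and the directory list is an assumption that you may need to extend.

```bash
# Print the first existing path for a library file name.
# Usage: find_lib <file-name> [dir ...]
find_lib() {
    name="$1"
    shift
    # Fall back to the common library folders when no directory is given.
    if [ "$#" -eq 0 ]; then
        set -- "$HOME/.local/lib" /usr/local/lib /usr/local/lib64 \
            /usr/lib /usr/lib64 /usr/lib/x86_64-linux-gnu /opt/.venv/lib
    fi
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Report where each library used in LD_PRELOAD was found.
for lib in libiomp5.so libtcmalloc.so.4 libtbbmalloc.so.2; do
    find_lib "$lib" || echo "not found: $lib" >&2
done
```

Any library reported as not found either needs to be installed or lives in a folder that is not in the search list.
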
</Tab>
</Tabs>
## Launch of the Serving Engine
Example command to launch SGLang serving:
<CodeGroup>
```bash Launch Server
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--tp 6
```
</CodeGroup>
> **Note:** For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
> **Note:** The flag `--tp 6` applies tensor parallelism across 6 ranks (TP6). On a CPU platform, each TP rank runs on one sub-NUMA cluster (SNC). You can get the SNC count using `lscpu`. If the specified number of TP ranks `n` is smaller than the total SNC count, the system automatically uses the first `n` SNCs; `n` cannot exceed the total SNC count.
>
> To specify the cores to be used, set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. For example, to use the first 40 cores of each SNC on a Xeon® 6980P server (which has 43-43-42 cores on the 3 SNCs of a socket):
<CodeGroup>
```bash Set Thread Binding
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
</CodeGroup>
> Note that when `SGLANG_CPU_OMP_THREADS_BIND` is set, the amount of memory available to each rank may not be determined in advance. You may need to set `--max-total-tokens` to avoid out-of-memory errors.
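
Writing the binding string by hand is error-prone, so it can be generated from the per-SNC core counts. This is a minimal sketch; the helper name `omp_threads_bind` is hypothetical, and it assumes that core IDs are numbered contiguously, SNC after SNC, which you should confirm with `lscpu` on your server.

```bash
# Build a binding string that uses the first N cores of every SNC.
# Usage: omp_threads_bind <cores-per-rank> <snc-size> [<snc-size> ...]
omp_threads_bind() {
    use="$1"
    shift
    start=0
    bind=""
    for size in "$@"; do
        if [ "$use" -gt "$size" ]; then
            echo "requested $use cores but an SNC has only $size" >&2
            return 1
        fi
        # Append "first-last" for this SNC, separated by '|' from earlier ranges.
        bind="${bind:+$bind|}${start}-$((start + use - 1))"
        start=$((start + size))
    done
    printf '%s\n' "$bind"
}

# Dual-socket Xeon 6980P (43-43-42 cores per socket), first 40 cores of each SNC.
omp_threads_bind 40 43 43 42 43 43 42
```

For these inputs the function prints the same string as the example above, which can then be exported as `SGLANG_CPU_OMP_THREADS_BIND`.
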
> **Note:** For optimizing decoding with `torch.compile`, add the flag `--enable-torch-compile`. To specify the maximum batch size, set `--torch-compile-max-bs`. For example, `--enable-torch-compile --torch-compile-max-bs 4` uses `torch.compile` with a maximum batch size of 4. The maximum applicable batch size is 16.
> **Note:** A warmup step is automatically triggered when the service is started. The server is ready when you see the log `The server is fired up and ready to roll!`.
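
In scripts, you can poll the server over HTTP instead of watching the log. The following sketch assumes the server listens on the default port 30000 and answers on a `/health` endpoint; the helper name `wait_for_server` is hypothetical, and both the port and the endpoint path should be checked against your SGLang version.

```bash
# Poll a URL until it returns an HTTP success status or the timeout expires.
# Usage: wait_for_server <url> [timeout-seconds]
wait_for_server() {
    url="$1"
    timeout="${2:-600}"
    waited=0
    until curl --silent --fail --output /dev/null "$url"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "server not ready after ${timeout}s: $url" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "server is ready: $url"
}

# Example call (assumed default port and endpoint):
# wait_for_server http://localhost:30000/health 600
```

Loading large models on CPU can take a long time, so a generous timeout is usually appropriate.
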
## Benchmarking with Requests
You can benchmark serving performance with the `bench_serving` script.
With the server running, run the benchmark in another terminal. For example:
<CodeGroup>
```bash Run Benchmark
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```
</CodeGroup>
Detailed parameter descriptions are available via the command:
<CodeGroup>
```bash Benchmark Help
python -m sglang.bench_serving -h
```
</CodeGroup>
Additionally, requests can be formatted using
[the OpenAI Completions API](../basic_usage/openai_api_completions)
and sent via the command line (e.g., using `curl`) or through your own scripts.
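
As an illustration of the `curl` route, the sketch below sends a single request to the OpenAI-compatible `/v1/completions` endpoint. It assumes the server is reachable at `http://localhost:30000`; the helper names `completion_payload` and `send_completion` are hypothetical, and the payload builder does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```bash
# Build the JSON request body for a short, deterministic completion.
# Usage: completion_payload <model> <prompt>
completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": 32, "temperature": 0}\n' "$1" "$2"
}

# Send the request to a running server and print the JSON response.
# Usage: send_completion <model> <prompt> [base-url]
send_completion() {
    base_url="${3:-http://localhost:30000}"
    curl --silent --fail "$base_url/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2")"
}

# Example call against a locally running server:
# send_completion meta-llama/Llama-3.2-3B-Instruct "The capital of France is"
```
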
## Example Usage Commands
Large Language Models can range from fewer than 1 billion to several hundred billion parameters.
Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors
with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer,
or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common
4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.
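
Because each TP rank runs on one sub-NUMA cluster, it is useful to check how many NUMA nodes the operating system exposes before choosing `--tp`. The sketch below parses the `NUMA node(s)` line of `lscpu`; the helper name `count_numa_nodes` is hypothetical, and it assumes that SNC is enabled in the BIOS so that every cluster appears as its own NUMA node.

```bash
# Read lscpu output on stdin and print the value of the "NUMA node(s)" line.
count_numa_nodes() {
    awk -F: '/NUMA node\(s\):/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Print the NUMA node count of this machine when lscpu is available.
if command -v lscpu >/dev/null 2>&1; then
    echo "NUMA nodes visible to the OS: $(lscpu | count_numa_nodes)"
fi
```

The reported count is the upper bound for the `--tp` value on that server.
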
### Example: Running DeepSeek-V3.1-Terminus
<CodeGroup>
```bash W8A8_INT8
python -m sglang.launch_server \
--model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--enable-torch-compile \
--torch-compile-max-bs 4 \
--tp 6
```
```bash FP8
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-V3.1-Terminus \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--enable-torch-compile \
--torch-compile-max-bs 4 \
--tp 6
```
</CodeGroup>
> **Note:** Please set `--torch-compile-max-bs` to the maximum desired batch size for your deployment, which can be up to 16. The value `4` in the examples is illustrative.
### Example: Running Llama-3.2-3B
<CodeGroup>
```bash BF16
python -m sglang.launch_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--enable-torch-compile \
--torch-compile-max-bs 16 \
--tp 2
```
```bash W8A8_INT8
python -m sglang.launch_server \
--model RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--enable-torch-compile \
--torch-compile-max-bs 16 \
--tp 2
```
</CodeGroup>
> **Note:** The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup. For instance, use `--tp 3` to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.
Once the server has been launched, you can test it using the `bench_serving` command or create
your own commands or scripts following [the benchmarking example](#benchmarking-with-requests).