mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-07-01 04:08:10 +00:00
336 lines
13 KiB
Markdown
336 lines
13 KiB
Markdown
# CPU Servers
|
|
|
|
The document addresses how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
|
|
SGLang is enabled and optimized on the CPUs equipped with Intel® AMX® Instructions,
|
|
which are 4th generation or newer Intel® Xeon® Scalable Processors.
|
|
|
|
## Optimized Model List
|
|
|
|
A list of popular LLMs are optimized and run efficiently on CPU,
|
|
including the most notable open-source models like Llama series, Qwen series,
|
|
and DeepSeek series like DeepSeek-R1 and DeepSeek-V3.1-Terminus.
|
|
|
|
| Model Name | BF16 | W8A8_INT8 | FP8 |
|
|
|:---:|:---:|:---:|:---:|
|
|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
|
|
| DeepSeek-V3.1-Terminus | | [IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8](https://huggingface.co/IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8) | [deepseek-ai/DeepSeek-V3.1-Terminus](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus) |
|
|
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
|
|
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
|
|
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
|
|
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
|
|
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |
|
|
|
|
**Note:** The model identifiers listed in the table above
|
|
have been verified on 6th Gen Intel® Xeon® P-core platforms.
|
|
|
|
## Installation
|
|
|
|
### Install Using Docker
|
|
|
|
It is recommended to use Docker for setting up the SGLang environment.
|
|
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile) is provided to facilitate the installation.
|
|
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
|
|
|
|
```bash
|
|
# Clone the SGLang repository
|
|
git clone https://github.com/sgl-project/sglang.git
|
|
cd sglang/docker
|
|
|
|
# Build the docker image
|
|
docker build -t sglang-cpu:latest -f xeon.Dockerfile .
|
|
|
|
# Initiate a docker container
|
|
docker run \
|
|
-it \
|
|
--privileged \
|
|
--ipc=host \
|
|
--network=host \
|
|
-v /dev/shm:/dev/shm \
|
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
|
-p 30000:30000 \
|
|
-e "HF_TOKEN=<secret>" \
|
|
sglang-cpu:latest /bin/bash
|
|
```
|
|
|
|
### Install From Source
|
|
|
|
If you prefer to install SGLang in a bare metal environment,
|
|
the setup process is as follows:
|
|
|
|
Please install the required packages and libraries beforehand if
|
|
they are not already present on your system.
|
|
You can refer to the Ubuntu-based installation commands in
|
|
[the Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile#L11)
|
|
for guidance.
|
|
|
|
1. Install `uv` package manager, then create and activate a virtual environment:
|
|
|
|
```bash
|
|
# Taking '/opt' as the example uv env folder, feel free to change it as needed
|
|
cd /opt
|
|
curl -LsSf https://astral.sh/uv/install.sh | sh
|
|
source $HOME/.local/bin/env
|
|
uv venv --python 3.12
|
|
source .venv/bin/activate
|
|
```
|
|
|
|
2. Create a config file to direct the installation channel
|
|
(a.k.a. index-url) of `torch` related packages:
|
|
|
|
```bash
|
|
vim .venv/uv.toml
|
|
```
|
|
|
|
Press 'a' to enter insert mode of `vim`, paste the following content into the created file
|
|
|
|
```file
|
|
[[index]]
|
|
name = "torch"
|
|
url = "https://download.pytorch.org/whl/cpu"
|
|
|
|
[[index]]
|
|
name = "torchvision"
|
|
url = "https://download.pytorch.org/whl/cpu"
|
|
|
|
[[index]]
|
|
name = "torchaudio"
|
|
url = "https://download.pytorch.org/whl/cpu"
|
|
|
|
[[index]]
|
|
name = "triton"
|
|
url = "https://download.pytorch.org/whl/cpu"
|
|
|
|
```
|
|
|
|
Save the file (in `vim`, press 'esc' to exit insert mode, then ':x+Enter'),
|
|
and set it as the default `uv` config.
|
|
|
|
```bash
|
|
export UV_CONFIG_FILE=/opt/.venv/uv.toml
|
|
```
|
|
|
|
3. Clone the `sglang` source code and build the packages
|
|
|
|
```bash
|
|
# Clone the SGLang code
|
|
git clone https://github.com/sgl-project/sglang.git
|
|
cd sglang
|
|
git checkout <YOUR-DESIRED-VERSION>
|
|
|
|
# Use dedicated toml file
|
|
cd python
|
|
cp pyproject_cpu.toml pyproject.toml
|
|
# Install SGLang dependent libs, and build SGLang main package
|
|
uv pip install --upgrade pip setuptools
|
|
uv pip install .
|
|
|
|
# Build the CPU backend kernels
|
|
cd ../sgl-kernel
|
|
cp pyproject_cpu.toml pyproject.toml
|
|
uv pip install .
|
|
```
|
|
|
|
4. Set the required environment variables
|
|
|
|
```bash
|
|
export SGLANG_USE_CPU_ENGINE=1
|
|
|
|
# Set 'LD_LIBRARY_PATH' and 'LD_PRELOAD' to ensure the libs can be loaded by sglang processes
|
|
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
|
|
export LD_PRELOAD=${LD_PRELOAD}:/opt/.venv/lib/libiomp5.so:${LD_LIBRARY_PATH}/libtcmalloc.so.4:${LD_LIBRARY_PATH}/libtbbmalloc.so.2
|
|
```
|
|
|
|
Notes:
|
|
|
|
- Note that the environment variable `SGLANG_USE_CPU_ENGINE=1`
|
|
is required to enable the SGLang service with the CPU engine.
|
|
|
|
- If you encounter code compilation issues during the `sgl-kernel` building process,
|
|
please check your `gcc` and `g++` versions and upgrade them if they are outdated.
|
|
It is recommended to use `gcc-13` and `g++-13` as they have been verified
|
|
in the official Docker container.
|
|
|
|
- The system library path is typically located in one of the following directories:
|
|
`~/.local/lib/`, `/usr/local/lib/`, `/usr/local/lib64/`, `/usr/lib/`, `/usr/lib64/`
|
|
and `/usr/lib/x86_64-linux-gnu/`. In the above example commands, `/usr/lib/x86_64-linux-gnu`
|
|
is used. Please adjust the path according to your server configuration.
|
|
|
|
- It is recommended to add the following to your `~/.bashrc` file to
|
|
avoid setting these variables every time you open a new terminal:
|
|
|
|
```bash
|
|
source .venv/bin/activate
|
|
export SGLANG_USE_CPU_ENGINE=1
|
|
export LD_LIBRARY_PATH=<YOUR-SYSTEM-LIBRARY-FOLDER>
|
|
export LD_PRELOAD=<YOUR-LIBS-PATHS>
|
|
```
|
|
|
|
## Launch of the Serving Engine
|
|
|
|
Example command to launch SGLang serving:
|
|
|
|
```bash
|
|
python -m sglang.launch_server \
|
|
--model <MODEL_ID_OR_PATH> \
|
|
--trust-remote-code \
|
|
--disable-overlap-schedule \
|
|
--device cpu \
|
|
--host 0.0.0.0 \
|
|
--tp 6
|
|
```
|
|
|
|
Notes:
|
|
|
|
1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
|
|
|
|
2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6).
|
|
The number of TP specified is how many TP ranks will be used during the execution.
|
|
On a CPU platform, a TP rank means a sub-NUMA cluster (SNC).
|
|
Usually we can get the SNC information (How many available) from the Operating System with e.g. `lscpu` command.
|
|
|
|
If the specified TP rank number differs from the total SNC count,
|
|
the system will automatically utilize the first `n` SNCs.
|
|
Note that `n` cannot exceed the total SNC number, doing so will result in an error.
|
|
|
|
`SGLANG_CPU_OMP_THREADS_BIND` allows explicit control of CPU cores for each tensor parallel (TP) rank.
|
|
|
|
**example 1**: Run SGLang service with TP=6, using the first 40 cores of each SNC on a Xeon® 6980P server,
|
|
which has 43-43-42 cores on the 3 SNCs of a socket, we should set:
|
|
|
|
```bash
|
|
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
|
|
```
|
|
This configuration is equivalent to:
|
|
- rank 0: `numactl -C 0-39 -m 0`
|
|
- rank 1: `numactl -C 43-82 -m 1`
|
|
- rank 2: `numactl -C 86-125 -m 2`
|
|
- rank 3: `numactl -C 128-167 -m 3`
|
|
- rank 4: `numactl -C 171-210 -m 4`
|
|
- rank 5: `numactl -C 214-253 -m 5`
|
|
|
|
|
|
**example 2**: Run SGLang service with TP=2, using 96 cores cross 3 SNCs on a Xeon® 6972P server,
|
|
which has 32-32-32 cores on the 3 SNCs in a socket, we should set:
|
|
```bash
|
|
export SGLANG_CPU_OMP_THREADS_BIND="0-95|96-191"
|
|
```
|
|
This configuration is equivalent to:
|
|
- rank 0: `numactl -C 0-95 -m 0-2`
|
|
- rank 1: `numactl -C 96-191 -m 3-5`
|
|
|
|
Please beware that with SGLANG_CPU_OMP_THREADS_BIND set,
|
|
the available memory amounts of the ranks may not be determined in prior.
|
|
You may need to set proper `--max-total-tokens` to avoid the out-of-memory error.
|
|
|
|
3. For optimizing decoding with torch.compile, please add the flag `--enable-torch-compile`.
|
|
To specify the maximum batch size when using `torch.compile`, set the flag `--torch-compile-max-bs`.
|
|
For example, `--enable-torch-compile --torch-compile-max-bs 4` means using `torch.compile`
|
|
and setting the maximum batch size to 4.
|
|
|
|
4. A warmup step is automatically triggered when the service is started.
|
|
The server is ready when you see the log `The server is fired up and ready to roll!`.
|
|
|
|
## Benchmarking with Requests
|
|
|
|
You can benchmark the performance via the `bench_serving` script.
|
|
Run the command in another terminal. An example command would be:
|
|
|
|
```bash
|
|
python -m sglang.bench_serving \
|
|
--dataset-name random \
|
|
--random-input-len 1024 \
|
|
--random-output-len 1024 \
|
|
--num-prompts 1 \
|
|
--request-rate inf \
|
|
--random-range-ratio 1.0
|
|
```
|
|
|
|
Detailed parameter descriptions are available via the command:
|
|
|
|
```bash
|
|
python -m sglang.bench_serving -h
|
|
```
|
|
|
|
Additionally, requests can be formatted using
|
|
[the OpenAI Completions API](https://docs.sglang.io/basic_usage/openai_api_completions.html)
|
|
and sent via the command line (e.g., using `curl`) or through your own scripts.
|
|
|
|
## Example Usage Commands
|
|
|
|
Large Language Models can range from fewer than 1 billion to several hundred billion parameters.
|
|
Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors
|
|
with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer,
|
|
or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common
|
|
4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.
|
|
|
|
### Example: Running DeepSeek-V3.1-Terminus
|
|
|
|
An example command to launch service of W8A8_INT8 DeepSeek-V3.1-Terminus on a Xeon® 6980P server:
|
|
|
|
```bash
|
|
python -m sglang.launch_server \
|
|
--model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
|
|
--trust-remote-code \
|
|
--disable-overlap-schedule \
|
|
--device cpu \
|
|
--quantization w8a8_int8 \
|
|
--host 0.0.0.0 \
|
|
--enable-torch-compile \
|
|
--torch-compile-max-bs 4 \
|
|
--tp 6
|
|
```
|
|
|
|
Similarly, an example command to launch service of FP8 DeepSeek-V3.1-Terminus would be:
|
|
|
|
```bash
|
|
python -m sglang.launch_server \
|
|
--model deepseek-ai/DeepSeek-V3.1-Terminus \
|
|
--trust-remote-code \
|
|
--disable-overlap-schedule \
|
|
--device cpu \
|
|
--host 0.0.0.0 \
|
|
--enable-torch-compile \
|
|
--torch-compile-max-bs 4 \
|
|
--tp 6
|
|
```
|
|
|
|
Note: Please set `--torch-compile-max-bs` to the maximum desired batch size for your deployment,
|
|
which can be up to 16. The value `4` in the examples is illustrative.
|
|
|
|
### Example: Running Llama-3.2-3B
|
|
|
|
An example command to launch service of Llama-3.2-3B with BF16 precision:
|
|
|
|
```bash
|
|
python -m sglang.launch_server \
|
|
--model meta-llama/Llama-3.2-3B-Instruct \
|
|
--trust-remote-code \
|
|
--disable-overlap-schedule \
|
|
--device cpu \
|
|
--host 0.0.0.0 \
|
|
--enable-torch-compile \
|
|
--torch-compile-max-bs 16 \
|
|
--tp 2
|
|
```
|
|
|
|
The example command to launch service of W8A8_INT8 version of Llama-3.2-3B:
|
|
|
|
```bash
|
|
python -m sglang.launch_server \
|
|
--model RedHatAI/Llama-3.2-3B-quantized.w8a8 \
|
|
--trust-remote-code \
|
|
--disable-overlap-schedule \
|
|
--device cpu \
|
|
--quantization w8a8_int8 \
|
|
--host 0.0.0.0 \
|
|
--enable-torch-compile \
|
|
--torch-compile-max-bs 16 \
|
|
--tp 2
|
|
```
|
|
|
|
Note: The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup.
|
|
For instance, use `--tp 3` to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.
|
|
|
|
Once the server have been launched, you can test it using the `bench_serving` command or create
|
|
your own commands or scripts following [the benchmarking example](#benchmarking-with-requests).
|