Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN

xwy-amd8
2026-02-26 07:52:14 +00:00
1007 changed files with 82825 additions and 30557 deletions


@@ -0,0 +1,559 @@
---
name: add-jit-kernel
description: Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module
---
# Tutorial: Adding a New JIT Kernel to SGLang
This tutorial walks through adding a simple element-wise scale operation as a JIT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
## Goal
Add a new operation that scales each element of a tensor by a scalar factor:
- Input: tensor `x` (CUDA) and scalar `factor` (float, encoded as an integer ratio and passed as C++ template arguments)
- Output: `x * factor` (element-wise), allocated internally unless a pre-allocated `out` tensor is supplied
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
## When to use JIT vs AOT (`sgl-kernel`)
- **JIT (`jit_kernel`)**: lightweight, few dependencies, rapid iteration, compiled on first use
- **AOT (`sgl-kernel`)**: depends on CUTLASS / FlashInfer / DeepGEMM, needs pre-built wheel
---
## Common Abstractions in `python/sglang/jit_kernel/include/sgl_kernel/`
**Always prefer these abstractions over raw CUDA primitives.** They provide safety, readability, and consistency with the rest of the codebase.
### `utils.h` — Host-side utilities
```cpp
#include <sgl_kernel/utils.h>
```
- **`host::RuntimeCheck(cond, args...)`** — Assert a condition at runtime; throws `PanicError` with file/line info on failure. Prefer this over bare `assert`.
- **`host::Panic(args...)`** — Unconditionally throw a `PanicError` with a descriptive message.
- **`host::div_ceil(a, b)`** — Integer ceiling division `(a + b - 1) / b`.
- **`host::irange(n)`** / **`host::irange(start, end)`** — Range views for cleaner loops.
- **`host::pointer::offset(ptr, offsets...)`** — Byte-safe pointer arithmetic on `void*`. Use this instead of raw casts.
### `utils.cuh` — Device-side utilities + `LaunchKernel`
```cpp
#include <sgl_kernel/utils.cuh>
```
- **Type aliases**: `fp16_t`, `bf16_t`, `fp32_t`, `fp8_e4m3_t`, `fp8_e5m2_t` and their packed variants `fp16x2_t`, `bf16x2_t`, `fp32x2_t`, etc.
- **`SGL_DEVICE`** — Expands to `__forceinline__ __device__`. Use on all device functions.
- **`device::kWarpThreads`** — Constant `32`.
- **`device::load_as<T>(ptr, offset)`** / **`device::store_as<T>(ptr, val, offset)`** — Type-safe loads/stores from `void*`.
- **`device::pointer::offset(ptr, offsets...)`** — Pointer arithmetic on device.
- **`host::LaunchKernel(grid, block, device_or_stream [, smem])`** — RAII kernel launcher that:
  - Resolves the CUDA stream from a `DLDevice` via TVM-FFI automatically.
  - Checks the CUDA error with file/line info after launch via `operator()(kernel, args...)`.
  - Supports `.enable_pdl(bool)` for PDL (Programmatic Dependent Launch, SM90+).
- **`host::RuntimeDeviceCheck(cudaError_t)`** — Check a CUDA error; throw on failure.
### `tensor.h` — Tensor validation (`TensorMatcher`, Symbolic types)
```cpp
#include <sgl_kernel/tensor.h>
```
This is the **primary validation API** for all kernel launchers. Use it to validate every `tvm::ffi::TensorView` argument.
- **`host::SymbolicSize{"name"}`** — A named symbolic dimension. Call `.set_value(n)` to pin it, `.unwrap()` to extract after verification.
- **`host::SymbolicDType`** — Symbolic dtype. Use `.set_options<Ts...>()` to restrict allowed types.
- **`host::SymbolicDevice`** — Symbolic device. Use `.set_options<kDLCUDA>()` to restrict to CUDA.
- **`host::TensorMatcher({dims...})`** — Fluent builder for tensor validation:
  - `.with_dtype<T>()` — require a specific C++ type (e.g. `fp16_t`)
  - `.with_dtype<T1, T2, ...>()` — allow a set of types
  - `.with_device<kDLCUDA>(device_sym)` — require CUDA, bind device to symbol
  - `.with_strides({strides...})` — validate strides (omit to require contiguous)
  - `.verify(tensor_view)` — execute the check; throws `PanicError` with full context on failure; **chainable** (`verify(a).verify(b)` to check multiple tensors with the same shape)
**Typical pattern:**
```cpp
auto N = SymbolicSize{"num_elements"};
auto device = SymbolicDevice{};
device.set_options<kDLCUDA>();
TensorMatcher({N})  //
    .with_dtype<fp16_t>()
    .with_device(device)
    .verify(dst)
    .verify(src);  // same shape, dtype, device as dst
const size_t n = N.unwrap();
const DLDevice dev = device.unwrap();
```
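The binding behaviour of a symbolic dimension can be pictured with a small Python analogy. This is only a sketch of the semantics described above, assuming the first `verify` binds the symbol and later calls must agree with it; the class and function below are hypothetical stand-ins, not the C++ API.

```python
class SymbolicSize:
    """Hypothetical Python stand-in for a named symbolic dimension."""

    def __init__(self, name):
        self.name = name
        self.value = None  # unbound until the first tensor is verified

    def match(self, actual):
        if self.value is None:
            self.value = actual  # the first tensor pins the size
        elif self.value != actual:
            raise ValueError(f"{self.name}: expected {self.value}, got {actual}")

    def unwrap(self):
        if self.value is None:
            raise ValueError(f"{self.name} was never bound")
        return self.value


def verify(pattern, shape):
    """Check one shape against a list of symbolic dimensions."""
    if len(pattern) != len(shape):
        raise ValueError(f"rank mismatch: got {len(shape)}, want {len(pattern)}")
    for sym, actual in zip(pattern, shape):
        sym.match(actual)


N = SymbolicSize("num_elements")
verify([N], (1024,))  # dst binds N = 1024
verify([N], (1024,))  # src agrees, so the check passes
print(N.unwrap())     # 1024

try:
    verify([N], (512,))  # a tensor with a different size is rejected
except ValueError as err:
    print(err)  # num_elements: expected 1024, got 512
```

Sharing one symbol across several tensors is what lets a single chained call enforce that they all have the same shape.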
### `type.cuh` — `dtype_trait<T>` and `packed_t<T>`
```cpp
#include <sgl_kernel/type.cuh>
```
- **`dtype_trait<T>`** — Static trait struct for each scalar type. Provides:
  - `dtype_trait<T>::from(value)` — convert from another type (e.g. `fp32_t` → `fp16_t`)
  - `dtype_trait<T>::abs/sqrt/rsqrt/max/min(x)` — type-dispatched math (for `fp32_t`)
- **`packed_t<T>`** — Two-element packed alias: `packed_t<fp16_t>` = `fp16x2_t`, `packed_t<bf16_t>` = `bf16x2_t`, `packed_t<fp32_t>` = `fp32x2_t`. Use for vectorized loads/stores.
- **`device::cast<To, From>(value)`** — Type-safe cast using `dtype_trait`, e.g. `cast<fp32x2_t, fp16x2_t>(v)`.
### `vec.cuh` — Vectorized memory access (`AlignedVector`)
```cpp
#include <sgl_kernel/vec.cuh>
```
- **`device::AlignedVector<T, N>`** — Aligned storage for N elements of type T. N must be a power of two, `sizeof(T)*N <= 32`. Enables 128-bit vector loads/stores for bandwidth efficiency.
  - `.load(ptr, offset)` — vectorized load from `ptr[offset]`
  - `.store(ptr, offset)` — vectorized store to `ptr[offset]`
  - `.fill(value)` — fill all lanes
  - `operator[](i)` — element access
### `tile.cuh` — `tile::Memory` (strided memory access pattern)
```cpp
#include <sgl_kernel/tile.cuh>
```
- **`device::tile::Memory<T>::cta(blockDim.x)`** — Creates a tile accessor where each thread handles `tid = threadIdx.x` with stride `blockDim.x`. Common for loops over a 1D array.
  - **`.load(ptr, offset)`** — loads `ptr[tid + offset * blockDim.x]`
  - **`.store(ptr, val, offset)`** — stores to `ptr[tid + offset * blockDim.x]`
  - **`.in_bound(n, offset)`** — boundary check
### `math.cuh` — Device math (`device::math::`)
```cpp
#include <sgl_kernel/math.cuh>
```
- `device::math::max/min/abs/sqrt/rsqrt<T>(a, b)` — type-dispatched math via `dtype_trait`
- `device::math::exp/sin/cos(float)` — fast float math wrappers
### `warp.cuh` — Warp-level primitives
```cpp
#include <sgl_kernel/warp.cuh>
```
- `device::warp::reduce_sum<T>(value)` — warp-level sum reduction via `__shfl_xor_sync`
- `device::warp::reduce_max<T>(value)` — warp-level max reduction
### `cta.cuh` — CTA-level primitives
```cpp
#include <sgl_kernel/cta.cuh>
```
- `device::cta::reduce_max<T>(value, smem, min_value)` — CTA-wide max using shared memory + warp reduction. Caller is responsible for a `__syncthreads()` after if the result in `smem[0]` is needed.
### `atomic.cuh` — Atomic operations
```cpp
#include <sgl_kernel/atomic.cuh>
```
- `device::atomic::max(float* addr, float value)` — float atomic max (handles negative values correctly via bit tricks).
### `runtime.cuh` — Occupancy and device info
```cpp
#include <sgl_kernel/runtime.cuh>
```
- `host::runtime::get_blocks_per_sm(kernel, block_dim)` — max active blocks per SM (occupancy)
- `host::runtime::get_sm_count(device_id)` — number of SMs on the device
- `host::runtime::get_cc_major(device_id)` — compute capability major version
**Persistent kernel pattern** (cap blocks to SM count × occupancy):
```cpp
static const uint32_t max_occ = runtime::get_blocks_per_sm(kernel, kBlockSize);
static const uint32_t num_sm = runtime::get_sm_count(device.unwrap().device_id);
const auto num_blocks = std::min(num_sm * max_occ, div_ceil(n, kBlockSize));
LaunchKernel(num_blocks, kBlockSize, device.unwrap())(kernel, params);
```
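The grid-size arithmetic in the persistent pattern can be checked with a few lines of Python. This is a sketch only: the SM count and occupancy values are made-up numbers for a hypothetical device, not measurements.

```python
def div_ceil(a: int, b: int) -> int:
    """Integer ceiling division, mirroring host::div_ceil."""
    return (a + b - 1) // b


def persistent_grid(n: int, block_size: int, num_sm: int, max_occ: int) -> int:
    """Blocks to launch: enough to cover n, never more than can be resident."""
    return min(num_sm * max_occ, div_ceil(n, block_size))


# Hypothetical device: 108 SMs, 8 resident blocks per SM for this kernel.
small = persistent_grid(n=10_000, block_size=256, num_sm=108, max_occ=8)
large = persistent_grid(n=10_000_000, block_size=256, num_sm=108, max_occ=8)
print(small)  # 40: the problem is smaller than the machine
print(large)  # 864: capped, so each block must loop over several tiles
```

When the cap applies, the kernel has to iterate with a grid-stride loop, because there are fewer threads than elements.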
---
## Step 0 (optional): Generate a `.clangd` config for better IDE support
```bash
python -m sglang.jit_kernel
```
---
## Step 1: Implement the CUDA kernel in `jit_kernel/csrc/`
Create `python/sglang/jit_kernel/csrc/elementwise/scale.cuh`.
The implementation fully uses the project abstractions described above:
```cpp
#include <sgl_kernel/tensor.h>   // TensorMatcher, SymbolicSize, SymbolicDevice
#include <sgl_kernel/type.cuh>   // dtype_trait, fp16_t, bf16_t, fp32_t
#include <sgl_kernel/utils.h>    // RuntimeCheck, div_ceil
#include <sgl_kernel/utils.cuh>  // LaunchKernel, SGL_DEVICE
#include <sgl_kernel/vec.cuh>    // AlignedVector

#include <dlpack/dlpack.h>
#include <tvm/ffi/container/tensor.h>

#include <algorithm>  // std::max
#include <cstdint>    // int32_t, uint32_t

namespace {

// ----------------------------------------------------------------
// Kernel: element-wise scale using vectorized 128-bit loads/stores
// T = fp16_t | bf16_t | fp32_t
// kVecN = number of elements per vector load (e.g. 8 for fp16)
// kFactor = scale factor encoded as kFactorNumer / kFactorDenom
// ----------------------------------------------------------------
template <typename T, int kVecN, int32_t kFactorNumer, int32_t kFactorDenom>
__global__ void scale_kernel(T* __restrict__ dst,
                             const T* __restrict__ src,
                             uint32_t n_vecs,
                             uint32_t n_remainder,
                             uint32_t n_total) {
  constexpr float kFactor = static_cast<float>(kFactorNumer)
                          / static_cast<float>(kFactorDenom);
  using vec_t = device::AlignedVector<T, kVecN>;

  // --- vectorised body ---
  const uint32_t vec_stride = blockDim.x * gridDim.x;
  for (uint32_t vi = blockIdx.x * blockDim.x + threadIdx.x;
       vi < n_vecs;
       vi += vec_stride) {
    vec_t v;
    v.load(src, vi);
#pragma unroll
    for (int i = 0; i < kVecN; ++i) {
      v[i] = static_cast<T>(static_cast<float>(v[i]) * kFactor);
    }
    v.store(dst, vi);
  }

  // --- scalar tail ---
  const uint32_t base = n_vecs * kVecN;
  const uint32_t scalar_stride = blockDim.x * gridDim.x;
  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n_remainder;
       i += scalar_stride) {
    dst[base + i] = static_cast<T>(static_cast<float>(src[base + i]) * kFactor);
  }
}

// ----------------------------------------------------------------
// Launcher: validates tensors, selects vector width, launches kernel
// ----------------------------------------------------------------
template <typename T, int32_t kFactorNumer, int32_t kFactorDenom>
void scale(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) {
  using namespace host;

  // 1. Validate input tensors with TensorMatcher
  SymbolicSize N = {"num_elements"};
  SymbolicDevice device_;
  device_.set_options<kDLCUDA>();
  TensorMatcher({N})  //
      .with_dtype<T>()
      .with_device(device_)
      .verify(dst)
      .verify(src);  // same shape / dtype / device as dst

  const uint32_t n = static_cast<uint32_t>(N.unwrap());
  const DLDevice device = device_.unwrap();
  RuntimeCheck(n > 0, "scale: num_elements must be > 0, got ", n);

  // 2. Choose vector width for 128-bit loads (16 bytes)
  //    fp16/bf16: 8 elements × 2 bytes = 16 bytes
  //    fp32:      4 elements × 4 bytes = 16 bytes
  constexpr int kVecN = 16 / sizeof(T);
  const uint32_t n_vecs = n / kVecN;
  const uint32_t n_remainder = n % kVecN;

  // 3. Launch
  constexpr uint32_t kBlockSize = 256;
  const uint32_t grid = div_ceil(std::max(n_vecs, n_remainder), kBlockSize);
  LaunchKernel(grid, kBlockSize, device)(
      scale_kernel<T, kVecN, kFactorNumer, kFactorDenom>,
      static_cast<T*>(dst.data_ptr()),
      static_cast<const T*>(src.data_ptr()),
      n_vecs,
      n_remainder,
      n);
}

}  // namespace
```
**Key points:**
- Include headers from `sgl_kernel/`, **not** raw CUDA headers, for anything already covered
- Use `TensorMatcher` for all tensor validation; never manually check shape/dtype/device
- Use `AlignedVector` for vectorised 128-bit loads/stores — significant bandwidth win
- Use `LaunchKernel` — it resolves the stream and checks errors automatically
- Use `RuntimeCheck` for runtime assertions with useful error messages
- `fp16_t` / `bf16_t` / `fp32_t` are the project's type aliases (from `utils.cuh`)
- `device::cast<To, From>` or `dtype_trait<T>::from(val)` for cross-type conversions
- `device::math::` functions for device math instead of bare `__` intrinsics
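To see why the vectorised body plus the scalar tail touch every element exactly once, the two grid-stride loops can be emulated on the CPU. This is a pure-Python sketch of the indexing only; `simulate_scale` is a hypothetical helper and does not model memory alignment or dtype conversion.

```python
def simulate_scale(src, factor, vec_n, block_size, grid):
    """Emulate scale_kernel's two grid-stride loops on a Python list."""
    n = len(src)
    n_vecs, n_remainder = divmod(n, vec_n)
    base = n_vecs * vec_n            # first index handled by the scalar tail
    stride = block_size * grid       # total number of threads in the launch
    dst = [None] * n
    writes = [0] * n                 # how many times each element is written

    for tid in range(stride):        # one iteration per (block, thread) pair
        # Vectorised body: thread handles vectors tid, tid + stride, ...
        for vi in range(tid, n_vecs, stride):
            for lane in range(vec_n):
                i = vi * vec_n + lane
                dst[i] = src[i] * factor
                writes[i] += 1
        # Scalar tail: the last n_remainder elements.
        for r in range(tid, n_remainder, stride):
            dst[base + r] = src[base + r] * factor
            writes[base + r] += 1
    return dst, writes


src = [float(i) for i in range(4097)]  # 512 full fp16 vectors + 1 tail element
dst, writes = simulate_scale(src, 2.0, vec_n=8, block_size=256, grid=2)
print(all(w == 1 for w in writes))     # True
print(dst[-1])                         # 8192.0
```

Shrinking `grid` to 1 still covers everything, because each thread then simply loops over more vectors.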
---
## Step 2: Add the Python wrapper in `jit_kernel/`
Create `python/sglang/jit_kernel/scale.py`:
```python
from __future__ import annotations

from typing import TYPE_CHECKING

import torch

from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args

if TYPE_CHECKING:
    from tvm_ffi.module import Module


@cache_once
def _jit_scale_module(dtype: torch.dtype, factor_numer: int, factor_denom: int) -> Module:
    """Compile and cache the JIT scale module for a given dtype and factor."""
    args = make_cpp_args(dtype, factor_numer, factor_denom)
    return load_jit(
        "scale",
        *args,
        cuda_files=["elementwise/scale.cuh"],
        cuda_wrappers=[("scale", f"scale<{args}>")],
    )


def scale(src: torch.Tensor, factor: float, out: torch.Tensor | None = None) -> torch.Tensor:
    """
    Element-wise scale: dst = src * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    src : CUDA tensor (FP16 / BF16 / FP32)
    factor : scale factor
    out : optional pre-allocated output tensor (same shape/dtype as src)

    Returns
    -------
    Scaled tensor (dst = src * factor).
    """
    assert src.is_cuda, "src must be a CUDA tensor"
    assert src.dtype in (torch.float16, torch.bfloat16, torch.float32), (
        f"Unsupported dtype {src.dtype}. Supported: float16, bfloat16, float32"
    )
    if out is None:
        out = torch.empty_like(src)
    else:
        assert out.shape == src.shape, "out shape must match src"
        assert out.dtype == src.dtype, "out dtype must match src"

    # Encode factor as integer ratio; denom=1000 gives 3 decimal places of precision
    factor_denom = 1000
    factor_numer = round(factor * factor_denom)

    module = _jit_scale_module(src.dtype, factor_numer, factor_denom)
    module.scale(out, src)
    return out
```
**Key points:**
- Use `cache_once`, **not** `functools.lru_cache` (incompatible with `torch.compile`)
- `load_jit` first arg(s) form the unique build marker; same marker = same cached binary
- `cuda_wrappers`: a list of `(export_name, kernel_symbol)` pairs; `export_name` is the name called from Python
- `make_cpp_args(dtype, ...)` converts `torch.dtype` to C++ type alias:
| `torch.dtype` | C++ type |
|--------------------|------------|
| `torch.float16` | `fp16_t` |
| `torch.bfloat16` | `bf16_t` |
| `torch.float32` | `fp32_t` |
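The integer-ratio encoding of `factor` used by the wrapper has two consequences worth seeing in isolation: factors are rounded to three decimal places, and every distinct `(numer, denom)` pair is a separate template instantiation. The sketch below reproduces only the encoding arithmetic; `encode_factor` and `decode_factor` are hypothetical helper names, not part of `jit_kernel`.

```python
from __future__ import annotations


def encode_factor(factor: float, denom: int = 1000) -> tuple[int, int]:
    """Encode a float as the integer ratio passed as template arguments."""
    return round(factor * denom), denom


def decode_factor(numer: int, denom: int) -> float:
    """Value the kernel actually multiplies by (kFactorNumer / kFactorDenom)."""
    return numer / denom


print(encode_factor(0.5))                     # (500, 1000): exactly representable
print(decode_factor(*encode_factor(0.1234)))  # 0.123: the fourth decimal is lost

# Two nearby factors collapse to the same compiled module ...
print(encode_factor(0.1231) == encode_factor(0.1234))  # True
# ... while factors that differ at the third decimal each trigger a new build.
print(encode_factor(0.123) == encode_factor(0.124))    # False
```

A kernel whose factor changes on every call would therefore be a poor fit for this scheme; passing the factor as a runtime argument would avoid the recompiles at the cost of a slightly different launcher signature.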
---
## Step 3 (optional): Tune JIT build flags
```python
return load_jit(
    "scale",
    *args,
    cuda_files=["elementwise/scale.cuh"],
    cuda_wrappers=[("scale", f"scale<{args}>")],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
)
```
If your kernel requires SM90+, raise a clear Python error before calling `load_jit`:
```python
if torch.cuda.get_device_capability()[0] < 9:
    raise RuntimeError("This kernel requires SM90 (Hopper) or later")
```
---
## Step 4: Write tests (required)
Create `python/sglang/jit_kernel/tests/test_scale.py`:
```python
import pytest
import torch

from sglang.jit_kernel.scale import scale


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [1, 127, 128, 1024, 4097])  # cover tail remainder
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, 3.0])
def test_scale_correctness(dtype, size, factor):
    src = torch.randn(size, dtype=dtype, device="cuda")
    out = scale(src, factor)
    expected = src * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
def test_scale_out_param(dtype):
    src = torch.randn(1024, dtype=dtype, device="cuda")
    out = torch.empty_like(src)
    result = scale(src, 2.0, out=out)
    assert result is out
    torch.testing.assert_close(out, src * 2.0, rtol=1e-2, atol=1e-2)


def test_scale_cpu_error():
    src = torch.randn(128, dtype=torch.float16)  # CPU tensor
    with pytest.raises(AssertionError, match="CUDA"):
        scale(src, 2.0)


def test_scale_unsupported_dtype():
    src = torch.randint(0, 10, (128,), dtype=torch.int32, device="cuda")
    with pytest.raises(AssertionError, match="Unsupported dtype"):
        scale(src, 2.0)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```
Run:
```bash
pytest python/sglang/jit_kernel/tests/test_scale.py -q
```
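The test sizes are chosen so that both code paths in the kernel are exercised. A quick way to confirm that is to compute the `(n_vecs, n_remainder)` split for each size and dtype, as in this sketch (plain Python, with the 16-byte vector width taken from the launcher above):

```python
VEC_BYTES = 16  # one 128-bit access
ITEMSIZE = {"float16": 2, "bfloat16": 2, "float32": 4}
SIZES = [1, 127, 128, 1024, 4097]


def split(n, itemsize):
    """Return (n_vecs, n_remainder) the way the launcher computes them."""
    vec_n = VEC_BYTES // itemsize
    return divmod(n, vec_n)


for dtype, itemsize in ITEMSIZE.items():
    for n in SIZES:
        n_vecs, rem = split(n, itemsize)
        paths = []
        if n_vecs:
            paths.append("vector body")
        if rem:
            paths.append("scalar tail")
        print(f"{dtype:9s} n={n:5d} n_vecs={n_vecs:4d} rem={rem} -> {' + '.join(paths)}")
```

Size 1 runs the tail alone, 128 and 1024 run the vector body alone, and 127 and 4097 run both, so a regression in either loop should show up in at least one parametrisation.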
---
## Step 5: Add a benchmark (required)
Create `python/sglang/jit_kernel/benchmark/bench_scale.py`:
```python
import itertools

import torch
import triton
import triton.testing

from sglang.jit_kernel.benchmark.utils import (
    DEFAULT_DEVICE,
    DEFAULT_DTYPE,
    get_benchmark_range,
    run_benchmark,
)
from sglang.jit_kernel.scale import scale as jit_scale

SIZE_LIST = get_benchmark_range(
    full_range=[2**n for n in range(10, 20)],  # 1K … 512K elements
    ci_range=[4096, 65536],
)
configs = list(itertools.product(SIZE_LIST))


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["jit", "torch"],
        line_names=["SGL JIT Kernel", "PyTorch"],
        styles=[("blue", "-"), ("red", "--")],
        ylabel="us",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(size: int, provider: str):
    src = torch.randn(size, dtype=DEFAULT_DTYPE, device=DEFAULT_DEVICE)
    factor = 2.0
    if provider == "jit":
        fn = lambda: jit_scale(src, factor)
    else:
        fn = lambda: src * factor
    return run_benchmark(fn)


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
Run:
```bash
python python/sglang/jit_kernel/benchmark/bench_scale.py
```
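For a memory-bound op such as `scale`, a reported time is easier to judge once converted to effective bandwidth. The helper below is a hypothetical sketch, assuming each element is read once and written once; the 5 µs timing is an invented figure for illustration, not a measurement.

```python
def effective_bandwidth_gbps(num_elements: int, itemsize: int, time_us: float) -> float:
    """Bytes read plus bytes written, divided by elapsed time, in GB/s."""
    bytes_moved = 2 * num_elements * itemsize  # one read + one write per element
    return bytes_moved / (time_us * 1e-6) / 1e9


# 512K fp16 elements (2 bytes each) in a hypothetical 5 microseconds.
bw = effective_bandwidth_gbps(num_elements=2**19, itemsize=2, time_us=5.0)
print(f"{bw:.1f} GB/s")  # 419.4 GB/s
```

Comparing this figure against the device's peak memory bandwidth gives a rough sense of how much headroom the kernel has left.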
---
## Troubleshooting
- **JIT compilation fails**: ensure the `.cuh` file is under `python/sglang/jit_kernel/csrc/`; reduce template argument combinations
- **CUDA crash / illegal memory access**: `CUDA_LAUNCH_BLOCKING=1`; `compute-sanitizer --tool memcheck python ...`
- **Unstable benchmark results**: `run_benchmark` uses CUDA-graph-based timing by default
---
## References
- `docs/developer_guide/development_jit_kernel_guide.md`
- `python/sglang/jit_kernel/utils.py`: `cache_once`, `load_jit`, `make_cpp_args`
- `python/sglang/jit_kernel/include/sgl_kernel/tensor.h`: `TensorMatcher`, `SymbolicSize/DType/Device`
- `python/sglang/jit_kernel/include/sgl_kernel/utils.cuh`: type aliases, `LaunchKernel`, `SGL_DEVICE`
- `python/sglang/jit_kernel/include/sgl_kernel/vec.cuh`: `AlignedVector`
- `python/sglang/jit_kernel/include/sgl_kernel/tile.cuh`: `tile::Memory`
- `python/sglang/jit_kernel/include/sgl_kernel/type.cuh`: `dtype_trait`, `packed_t`, `device::cast`
- `python/sglang/jit_kernel/include/sgl_kernel/math.cuh`: `device::math::`
- `python/sglang/jit_kernel/include/sgl_kernel/warp.cuh`: `warp::reduce_sum/max`
- `python/sglang/jit_kernel/include/sgl_kernel/cta.cuh`: `cta::reduce_max`
- `python/sglang/jit_kernel/include/sgl_kernel/atomic.cuh`: `atomic::max`
- `python/sglang/jit_kernel/include/sgl_kernel/runtime.cuh`: occupancy / SM count helpers
- `python/sglang/jit_kernel/csrc/add_constant.cuh`: minimal runnable reference
- `python/sglang/jit_kernel/csrc/elementwise/rmsnorm.cuh`: real example using `TensorMatcher` + `LaunchKernel` + `tile::Memory`
- `python/sglang/jit_kernel/csrc/elementwise/qknorm.cuh`: real example using `runtime::get_blocks_per_sm` + persistent kernel pattern
- `python/sglang/jit_kernel/benchmark/utils.py`: benchmark helpers
## Summary of Files Created
```
python/sglang/jit_kernel/csrc/elementwise/scale.cuh # NEW: CUDA kernel
python/sglang/jit_kernel/scale.py # NEW: Python wrapper
python/sglang/jit_kernel/tests/test_scale.py # NEW: Tests
python/sglang/jit_kernel/benchmark/bench_scale.py # NEW: Benchmark
```


@@ -0,0 +1,358 @@
---
name: add-sgl-kernel
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
---
# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)
This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
## Goal
Add a new operation that scales each element of a tensor by a scalar factor:
- Input: tensor `x` (CUDA) and scalar `factor` (float)
- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`)
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
- Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
## Two rules of thumb (must follow)
1. **Heavyweight kernels go to `sgl-kernel`.** If it depends on CUTLASS / FlashInfer / DeepGEMM (or similarly heavy stacks), implement it in `sgl-kernel/`.
2. **Lightweight kernels go to `python/sglang/jit_kernel`.** If it is small, has few dependencies, and benefits from rapid iteration, implement it as a JIT kernel instead.
In addition, every new kernel must ship with:
- **Tests** (pytest)
- **A benchmark script** (triton.testing)
---
## Repository integration map
You will typically touch these files/areas:
- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
- Tests: `sgl-kernel/tests/test_scale.py`
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
---
## Step 1: Implement the kernel in `csrc/`
Pick the right subdirectory:
- `csrc/elementwise/` — for element-wise ops (our example)
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories
Create `sgl-kernel/csrc/elementwise/scale.cu`:
```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                             const T* __restrict__ input,
                             float factor,
                             int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  if (n == 0) {
    return;  // nothing to do; a zero-block launch is an invalid configuration
  }
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // Set the device guard first so the current stream is queried on input's device.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```
**Key points:**
- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
- Add device error checking after every kernel launch
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
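The launch geometry used here is the simplest one: one thread per element, with a bounds check for the last, partially filled block. The sketch below emulates that indexing in plain Python; `launch_geometry` and `simulate` are hypothetical names used only for illustration.

```python
def launch_geometry(n, threads=256):
    """Blocks needed so that blocks * threads >= n."""
    blocks = (n + threads - 1) // threads
    return blocks, threads


def simulate(n, threads=256):
    """Return the indices written and the number of threads that idle."""
    blocks, threads = launch_geometry(n, threads)
    touched = []
    for block in range(blocks):
        for thread in range(threads):
            idx = block * threads + thread
            if idx < n:              # same guard as `if (idx < n)` in the kernel
                touched.append(idx)
    idle = blocks * threads - n
    return touched, idle


touched, idle = simulate(1000)
print(launch_geometry(1000))          # (4, 256)
print(touched == list(range(1000)))   # True: every element written exactly once
print(idle)                           # 24 threads fail the bounds check
print(launch_geometry(0))             # (0, 256): an empty tensor needs no launch
```

The last line is the case to watch in the C++ launcher: a grid of zero blocks is rejected by CUDA as an invalid configuration, so empty inputs are best handled before the launch.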
---
## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`
Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:
```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```
---
## Step 3: Register the op in `csrc/common_extension.cc`
Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:
```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```
**Key points:**
- `Tensor!` means in-place / mutable output argument
- The schema is important for `torch.compile` and for consistent call signatures
- In an op schema, `float` corresponds to C++ `double`, which is why `scale` takes `double factor` and narrows it to `float` at the kernel launch; use shims if other argument types need conversion
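The narrowing from `double` to `float` is usually harmless for a scale factor, but it is not always exact. The following sketch uses only the standard library to round-trip a Python float (a 64-bit double) through 32-bit storage; `to_float32` is a hypothetical helper name.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(to_float32(0.5) == 0.5)       # True: 0.5 is exactly representable
print(to_float32(0.1) == 0.1)       # False: 0.1 picks up a small rounding error
print(abs(to_float32(0.1) - 0.1))   # on the order of 1e-9
```

An error of that size is far below the test tolerances used for FP16 and BF16, and still well inside the FP32 tolerance.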
---
## Step 4: Add the new source file to `CMakeLists.txt`
Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:
```cmake
csrc/elementwise/scale.cu
```
**Key points:**
- Keep the list **alphabetically sorted** (the file explicitly requires this)
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
---
## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`
In `sgl-kernel/python/sgl_kernel/__init__.py`, add:
```python
import torch


def scale(out: torch.Tensor, input: torch.Tensor, factor: float) -> None:
    """
    Element-wise scale: out = input * factor (in-place into out).

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    out : pre-allocated CUDA output tensor (same shape/dtype as input)
    input : CUDA input tensor
    factor : scale factor (float)
    """
    # Look the op up through torch.ops at call time; it is registered
    # when the compiled sgl_kernel extension is loaded.
    torch.ops.sgl_kernel.scale(out, input, factor)
```
Or export it from the existing module organisation — follow the pattern already used by similar ops in `__init__.py`.
---
## Step 6: Write tests (required)
Create `sgl-kernel/tests/test_scale.py`:
```python
import pytest
import torch

import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    sgl_kernel.scale(out, input, factor)
    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(out, input, 2.0)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(out, input, 2.0)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```
Run:
```bash
pytest sgl-kernel/tests/test_scale.py -q
```
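The tolerances in the correctness test can be related to the precision of each dtype. As a rough sketch, the spacing between 1.0 and the next representable value (machine epsilon) follows from the number of explicit mantissa bits in each format; the comparison below uses only that arithmetic and the `rtol` values from the test.

```python
# Explicit mantissa (fraction) bits of each floating-point format.
MANTISSA_BITS = {"float16": 10, "bfloat16": 7, "float32": 23}
# Relative tolerances used in test_scale_correctness.
RTOL = {"float16": 1e-2, "bfloat16": 1e-2, "float32": 1e-5}


def machine_epsilon(dtype: str) -> float:
    """Gap between 1.0 and the next larger representable value."""
    return 2.0 ** -MANTISSA_BITS[dtype]


for dtype in MANTISSA_BITS:
    eps = machine_epsilon(dtype)
    print(f"{dtype:9s} eps={eps:.3e} rtol={RTOL[dtype]:.0e} rtol/eps={RTOL[dtype] / eps:.1f}")
```

BF16 is the tight case: its epsilon is about 7.8e-3, so `rtol=1e-2` leaves only a small margin above a single rounding step, while FP16 and FP32 have far more slack.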
---
## Step 7: Add a benchmark (required)
Create `sgl-kernel/benchmark/bench_scale.py`:
```python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
factors = [2.0]
configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0
    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(out, input, factor)
    else:
        fn = lambda: torch_scale(input, factor)
    # Quantiles come back in the order requested: median, 20th, 80th percentile.
    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    # perf_report unpacks (value, low, high); these are times, so low = min.
    return 1000 * ms, 1000 * min_ms, 1000 * max_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
Run:
```bash
python sgl-kernel/benchmark/bench_scale.py
```
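The three numbers returned by the benchmark function are a median and a 20th/80th percentile band, converted from milliseconds to microseconds. The sketch below shows one way such a summary can be derived from raw timings. The interpolation rule and the sample timings are assumptions for illustration; the benchmarking library may compute its quantiles differently.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile of a sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_us(times_ms):
    """Return (median, 20th percentile, 80th percentile) in microseconds."""
    vals = sorted(times_ms)
    median, low, high = (percentile(vals, q) for q in (0.5, 0.2, 0.8))
    return 1000 * median, 1000 * low, 1000 * high


# Invented per-iteration timings in milliseconds, including one outlier.
SAMPLES_MS = [0.010, 0.010, 0.010, 0.011, 0.011, 0.011,
              0.011, 0.012, 0.012, 0.013, 0.030]
median_us, low_us, high_us = summarize_us(SAMPLES_MS)
print(f"median={median_us:.1f}us band=[{low_us:.1f}, {high_us:.1f}]us")
```

Using a median with a percentile band keeps a single slow iteration, such as the 0.030 ms sample here, from distorting the reported figure.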
---
## Step 8: Build and validate
Build:
```bash
cd sgl-kernel
make build -j16
```
If you need to limit host resource usage:
```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```
Validate:
```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```
---
## Troubleshooting
- **Async CUDA errors**: `CUDA_LAUNCH_BLOCKING=1`
- **Memory errors**: `compute-sanitizer --tool memcheck python ...`
- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
---
## References
- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h`: `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu`: reference for the FP16/BF16/FP32 dispatch pattern
## Summary of Files Created/Modified
```
sgl-kernel/csrc/elementwise/scale.cu # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/__init__.py # MODIFIED: export Python API
sgl-kernel/tests/test_scale.py # NEW: tests
sgl-kernel/benchmark/bench_scale.py # NEW: benchmark
```


@@ -55,6 +55,13 @@
"reason": "top contributor",
"can_rerun_stage": true
},
"Chen-0210": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
"cooldown_interval_minutes": 60,
"reason": "custom override",
"can_rerun_stage": true
},
"ClawSeven": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
@@ -121,7 +128,7 @@
"HandH1998": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
-"cooldown_interval_minutes": 60,
+"cooldown_interval_minutes": 0,
"reason": "custom override",
"can_rerun_stage": true
},
@@ -202,6 +209,13 @@
"reason": "custom override",
"can_rerun_stage": true
},
"Ratish1": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
"cooldown_interval_minutes": 0,
"reason": "custom override",
"can_rerun_stage": true
},
"ShangmingCai": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
@@ -720,6 +734,20 @@
"reason": "custom override",
"can_rerun_stage": true
},
"mmangkad": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
"cooldown_interval_minutes": 0,
"reason": "custom override",
"can_rerun_stage": true
},
"narutolhy": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
"cooldown_interval_minutes": 0,
"reason": "custom override",
"can_rerun_stage": true
},
"netanel-haber": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
@@ -846,6 +874,13 @@
"reason": "top contributor",
"can_rerun_stage": true
},
"sglang-npu-bot": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,
"cooldown_interval_minutes": 0,
"reason": "custom override",
"can_rerun_stage": true
},
"shaharmor98": {
"can_tag_run_ci_label": true,
"can_rerun_failed_ci": true,

.github/CODEOWNERS

@@ -1,18 +1,19 @@
.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou
/docker @Fridge003 @ispobock @HaiShaw @ishandhanani
.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou @bingxche
/docker @Fridge003 @ispobock @HaiShaw @ishandhanani @yctseng0211
/docker/npu.Dockerfile @ping1jing2 @iforgetmyname
/python/pyproject.toml @merrymercy @Fridge003 @ispobock
/python/sglang/jit_kernel @DarkSharpness @BBuf
/python/sglang/jit_kernel @DarkSharpness @BBuf @celve @HydraQYH @yuan-luo
/python/sglang/jit_kernel/diffusion @yingluosanqian @BBuf @mickqian
/python/sglang/multimodal_gen @mickqian @yhyang201
/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian
/python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf @yingluosanqian
/python/sglang/multimodal_gen @mickqian @yhyang201 @ping1jing2
/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
/python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
/python/sglang/srt/batch_invariant_ops @Fridge003 @hebiao064
/python/sglang/srt/constrained @hnyls2002 @DarkSharpness
/python/sglang/srt/compilation @hebiao064
/python/sglang/srt/disaggregation @ByronHsu @hnyls2002 @ShangmingCai
/python/sglang/srt/disaggregation/ascend @ping1jing2 @iforgetmyname
/python/sglang/srt/distributed @yizhang2077 @merrymercy @ch-wan
/python/sglang/srt/dllm @ClawSeven @btw616
/python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237 @merrymercy @JustinTong0323
/python/sglang/srt/entrypoints/grpc_server.py @CatherineSue @slin1237
/python/sglang/srt/eplb @fzyzcjy @ch-wan
@@ -21,11 +22,13 @@
/python/sglang/srt/hardware_backend/npu @ping1jing2 @iforgetmyname
/python/sglang/srt/hardware_backend/npu/quantization @OrangeRedeng @TamirBaydasov @iforgetmyname
/python/sglang/srt/layers @merrymercy @Ying1123 @Fridge003 @ispobock @HaiShaw @ch-wan @BBuf @Edwardf0t1
/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064
/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064 @HaiShaw
/python/sglang/srt/layers/attention/fla @yizhang2077 @hebiao064
/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py @yizhang2077 @hebiao064 @hanming-lu
/python/sglang/srt/layers/attention/mamba @yizhang2077 @hebiao064
/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ
/python/sglang/srt/layers/attention/nsa @1am9trash @hubertlu-tw @kkHuang-amd @HaiShaw @Fridge003 @hlu1 @rainj-me
/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ @HaiShaw
/python/sglang/srt/layers/quantization/quark @kkHuang-amd @yichiche @hubertlu-tw @1am9trash @BowenBao
/python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang
/python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
/python/sglang/srt/managers/scheduler_pp_mixin.py @ShangmingCai @XucSh
@@ -34,6 +37,7 @@
/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py @hebiao064
/python/sglang/srt/models/deepseek_v2.py @fzyzcjy @zhyncs @ispobock @ch-wan @merrymercy @Fridge003
/python/sglang/srt/multimodal @mickqian @JustinTong0323 @yhyang201 @yuan-luo
/python/sglang/srt/observability @merrymercy @fzyzcjy @sufeng-buaa
/python/sglang/srt/speculative @Ying1123 @merrymercy @hnyls2002
/sgl-kernel @zhyncs @ispobock @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw
/sgl-model-gateway @slin1237 @CatherineSue


@@ -0,0 +1,27 @@
name: Upload CUDA Coredumps
description: Upload CUDA coredump files as artifacts and clean up the directory.
inputs:
artifact-suffix:
description: Suffix appended to the artifact name (e.g. matrix partition id)
required: false
default: ""
retention-days:
description: Number of days to retain the artifact
required: false
default: "7"
runs:
using: composite
steps:
- name: Upload CUDA coredumps
uses: actions/upload-artifact@v4
with:
name: cuda-coredumps-${{ github.job }}${{ inputs.artifact-suffix && format('-{0}', inputs.artifact-suffix) }}
path: ${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}/
retention-days: ${{ inputs.retention-days }}
if-no-files-found: ignore
- name: Cleanup CUDA coredumps
shell: bash
run: rm -rf "${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}"
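
The `name:` expression in the upload step above appends a `-<suffix>` segment only when `artifact-suffix` is non-empty. A minimal Python sketch of that naming rule follows; the function name `coredump_artifact_name` is hypothetical and not part of the commit, and the job names are only sample inputs.

```python
def coredump_artifact_name(job: str, suffix: str = "") -> str:
    """Return 'cuda-coredumps-<job>', plus '-<suffix>' when suffix is non-empty.

    Illustrative helper only: it mimics the expression
    inputs.artifact-suffix && format('-{0}', inputs.artifact-suffix).
    """
    name = f"cuda-coredumps-{job}"
    if suffix:
        # An empty suffix short-circuits, so no trailing dash is added.
        name += f"-{suffix}"
    return name


if __name__ == "__main__":
    # Matrix job that passes a partition id as the suffix.
    print(coredump_artifact_name("nightly-test-general-8-gpu-h200", "2"))
    # Non-matrix job that relies on the default empty suffix.
    print(coredump_artifact_name("nightly-test-general-4-gpu-h100"))
```

Keeping the suffix optional lets matrix jobs upload one artifact per partition without name collisions, while single jobs keep a plain name.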


@@ -26,25 +26,9 @@ jobs:
run: SKIP=no-commit-to-branch pre-commit run --all-files --show-diff-on-failure
- name: Run sgl-kernel clang-format checks
uses: DoozyX/clang-format-lint-action@v0.18.1
uses: DoozyX/clang-format-lint-action@v0.20
with:
source: sgl-kernel
extensions: h,c,cpp,hpp,cu,cuh,cc
clangFormatVersion: 18
clangFormatVersion: 20
style: file
- name: Check proto files are in sync
run: |
if ! diff -q python/sglang/srt/grpc/sglang_scheduler.proto sgl-model-gateway/src/proto/sglang_scheduler.proto; then
echo "❌ ERROR: Proto files are out of sync!"
echo ""
echo "The following files must be kept identical:"
echo " - python/sglang/srt/grpc/sglang_scheduler.proto"
echo " - sgl-model-gateway/src/proto/sglang_scheduler.proto"
echo ""
echo "Please ensure both files have the same content."
echo ""
echo "Differences:"
diff python/sglang/srt/grpc/sglang_scheduler.proto sgl-model-gateway/src/proto/sglang_scheduler.proto || true
exit 1
fi


@@ -1,4 +1,4 @@
name: List Active PR Runs
name: List Active Runs
on:
workflow_dispatch:
@@ -15,13 +15,13 @@ permissions:
pull-requests: read
jobs:
list-active-pr-runs:
list-active-runs:
runs-on: ubuntu-latest
steps:
- name: Install GitHub CLI
run: sudo apt-get install -y gh jq
- name: List active PR runs grouped by PR
- name: List active runs grouped by PR
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
@@ -31,7 +31,7 @@ jobs:
set -euo pipefail
echo "========================================="
echo "🔍 Active PR Workflow Runs Report"
echo "🔍 Active Workflow Runs Report"
echo "========================================="
echo ""
@@ -54,7 +54,7 @@ jobs:
--workflow "$workflow_file" \
--json databaseId,status,event,headBranch,createdAt,updatedAt,headSha,number,attempt \
--limit 500 \
| jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress") | select(.event=="pull_request")')
| jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")')
if [ -z "$active_runs" ]; then
continue
@@ -64,6 +64,7 @@ jobs:
echo "$active_runs" | while read -r run; do
run_id=$(echo "$run" | jq -r '.databaseId')
run_status=$(echo "$run" | jq -r '.status')
run_event=$(echo "$run" | jq -r '.event')
created_at=$(echo "$run" | jq -r '.createdAt')
head_sha=$(echo "$run" | jq -r '.headSha')
run_number=$(echo "$run" | jq -r '.number')
@@ -83,12 +84,12 @@ jobs:
continue
fi
# Find PR number
# Find PR number (may be empty for non-PR runs)
pr_number=$(gh api "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \
--jq '.[0].number // empty' 2>/dev/null || true)
if [ -z "$pr_number" ]; then
continue
pr_number="NO_PR"
fi
# Get jobs for this run (with pagination to avoid missing jobs)
@@ -106,25 +107,25 @@ jobs:
queue_time=$((current_time - created_time))
queue_minutes=$((queue_time / 60))
# Store data in temporary file
echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt" >> "$pr_data_file"
# Store data in temporary file (unified format with event and branch)
echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt|$run_event|$head_branch" >> "$pr_data_file"
done
done
echo ""
echo "========================================="
echo "📊 Active PRs Summary"
echo "📊 Active Runs Summary"
echo "========================================="
echo ""
if [ ! -s "$pr_data_file" ]; then
echo "✅ No active PR runs found"
echo "✅ No active runs found"
rm -f "$pr_data_file"
exit 0
fi
# Get unique PR numbers
pr_numbers=$(cat "$pr_data_file" | cut -d'|' -f1 | sort -u)
# Get unique PR numbers (exclude NO_PR entries)
pr_numbers=$(cut -d'|' -f1 < "$pr_data_file" | grep -v '^NO_PR$' | sort -u || true)
# Separate high priority and normal PRs
high_priority_prs=()
@@ -240,11 +241,74 @@ jobs:
echo ""
done
# --- Non-PR Runs Section ---
non_pr_runs=$(grep '^NO_PR|' "$pr_data_file" 2>/dev/null || true)
non_pr_running=0
non_pr_queued=0
if [ -n "$non_pr_runs" ]; then
echo "========================================="
echo "📦 Non-PR Runs (manual / scheduled / other)"
echo "========================================="
echo ""
echo "$non_pr_runs" | while read -r line; do
workflow=$(echo "$line" | cut -d'|' -f2)
run_id=$(echo "$line" | cut -d'|' -f3)
status=$(echo "$line" | cut -d'|' -f4)
running=$(echo "$line" | cut -d'|' -f5)
queued=$(echo "$line" | cut -d'|' -f6)
runners=$(echo "$line" | cut -d'|' -f7)
queue_min=$(echo "$line" | cut -d'|' -f8)
created=$(echo "$line" | cut -d'|' -f9)
attempt=$(echo "$line" | cut -d'|' -f11)
event=$(echo "$line" | cut -d'|' -f12)
branch=$(echo "$line" | cut -d'|' -f13)
run_url="https://github.com/$REPO/actions/runs/$run_id"
retry_count=$((attempt - 1))
retry_indicator=""
if [ "$retry_count" -gt 0 ]; then
retry_indicator=" 🔄 Retry #$retry_count"
fi
echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator"
echo " Event: $event"
echo " Branch: $branch"
echo " Status: $status"
echo " 🟢 Running jobs: $running"
echo " 🟡 Queued jobs: $queued"
if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then
echo " 🖥️ Runners: $runners"
fi
if [ "$queue_min" -gt 0 ]; then
echo " ⏱️ Queue time: ${queue_min} minutes"
fi
echo " 🔗 Run URL: $run_url"
echo ""
done
non_pr_running=$(echo "$non_pr_runs" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}')
non_pr_queued=$(echo "$non_pr_runs" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}')
non_pr_count=$(echo "$non_pr_runs" | wc -l | tr -d ' ')
total_running=$((total_running + non_pr_running))
total_queued=$((total_queued + non_pr_queued))
echo " 📊 Non-PR Total: $non_pr_running running, $non_pr_queued queued"
echo ""
fi
# Overall summary
echo "========================================="
echo "📈 Overall Summary"
echo "========================================="
echo "Total PRs with active runs: $pr_count"
echo "Total non-PR active runs: ${non_pr_count:-0}"
echo "Total running jobs: $total_running"
echo "Total queued jobs: $total_queued"
echo "========================================="
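
The script above now writes 13 pipe-delimited fields per run and uses the `NO_PR` sentinel for runs without an open PR. The sketch below restates that record layout and the non-PR totals in Python, assuming no field contains a `|` character. The names `RunRecord`, `parse_record`, and `summarize_non_pr` are hypothetical, and the sample lines are invented for illustration.

```python
from typing import Iterable, NamedTuple, Tuple


class RunRecord(NamedTuple):
    # Field order follows the echo line that appends to "$pr_data_file".
    pr_number: str
    workflow_file: str
    run_id: str
    run_status: str
    running_jobs: int
    queued_jobs: int
    runners: str
    queue_minutes: int
    created_at: str
    head_sha: str
    run_attempt: int
    run_event: str
    head_branch: str


def parse_record(line: str) -> RunRecord:
    """Split one data-file line into its 13 fields (assumes no '|' inside a field)."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields, got {len(parts)}")
    return RunRecord(
        parts[0], parts[1], parts[2], parts[3],
        int(parts[4]), int(parts[5]), parts[6], int(parts[7]),
        parts[8], parts[9], int(parts[10]), parts[11], parts[12],
    )


def summarize_non_pr(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (run count, running jobs, queued jobs) for NO_PR records only."""
    records = [r for r in map(parse_record, lines) if r.pr_number == "NO_PR"]
    return (
        len(records),
        sum(r.running_jobs for r in records),
        sum(r.queued_jobs for r in records),
    )


if __name__ == "__main__":
    # Invented sample data: one scheduled run and one PR run.
    sample = [
        "NO_PR|nightly-test.yml|101|in_progress|2|1|runner-a|5|2026-02-26T00:00:00Z|abc123|1|schedule|main",
        "1234|pr-test.yml|102|queued|0|3||0|2026-02-26T00:10:00Z|def456|2|pull_request|feature-x",
    ]
    print(summarize_non_pr(sample))
```

The bash version gets the same totals with `cut` and `awk`; the typed record mainly makes the field positions (11 for attempt, 12 for event, 13 for branch) easier to audit.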

File diff suppressed because it is too large.


@@ -32,9 +32,13 @@ on:
- 'nightly-8-gpu-deepseek-v32'
- 'nightly-8-gpu-deepseek-v32-mtp'
- 'nightly-8-gpu-kimi-k25'
- 'nightly-8-gpu-qwen3-235b'
- 'nightly-8-gpu-glm5'
# MI35x jobs
- 'nightly-test-1-gpu-mi35x'
- 'nightly-8-gpu-mi35x-qwen3-235b-mxfp4'
- 'nightly-8-gpu-mi35x-kimi-k25'
- 'nightly-8-gpu-mi35x-glm5'
- 'nightly-accuracy-8-gpu-mi35x'
- 'nightly-8-gpu-mi35x-grok1-int4'
- 'nightly-8-gpu-mi35x-grok2'
@@ -83,11 +87,11 @@ jobs:
run: bash scripts/ci/amd/amd_ci_install_dependency.sh
- name: Nightly Unit Test (1-GPU)
timeout-minutes: 60
timeout-minutes: 90
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 600 --continue-on-error || TEST_EXIT_CODE=$?
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 900 --continue-on-error || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
@@ -464,36 +468,6 @@ jobs:
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
# 8-GPU Kimi-K2 (Accuracy + Speed)
nightly-8-gpu-kimi-k2:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-kimi-k2')
runs-on: linux-mi325-gpu-8
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Setup docker
run: |
touch github_summary.md
bash scripts/ci/amd/amd_ci_start_container.sh
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh
- name: Accuracy Test (8-GPU Kimi-K2)
timeout-minutes: 120
run: |
> github_summary.md # Clear summary file
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-kimi-k2 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
# 8-GPU Kimi-K2.5 (Accuracy)
nightly-8-gpu-kimi-k25:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-kimi-k25')
@@ -524,6 +498,67 @@ jobs:
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
nightly-8-gpu-qwen3-235b:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-qwen3-235b')
runs-on: linux-mi325-gpu-8
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Setup docker
run: |
touch github_summary.md
bash scripts/ci/amd/amd_ci_start_container.sh
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh
- name: Accuracy Test + Performance Test (8-GPU Qwen3)
timeout-minutes: 120
run: |
> github_summary.md # Clear summary file
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-8-gpu-qwen3-235b --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
nightly-8-gpu-glm5:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-glm5')
runs-on: linux-mi325-gpu-8
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Setup docker
run: |
touch github_summary.md
bash scripts/ci/amd/amd_ci_start_container.sh
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh
# GLM-5 requires latest transformers for glm_moe_dsa architecture
bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git
- name: Accuracy Test (8-GPU GLM-5 NSA)
timeout-minutes: 120
run: |
> github_summary.md # Clear summary file
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-glm5 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
# ============================================== MI35x Tests ==============================================
# MI35x 1-GPU tests - platform-agnostic tests that may work on CDNA4 (gfx950)
nightly-test-1-gpu-mi35x:
@@ -549,11 +584,11 @@ jobs:
bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
- name: Nightly Test MI35x (1-GPU)
timeout-minutes: 60
timeout-minutes: 90
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 600 --continue-on-error || TEST_EXIT_CODE=$?
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 900 --continue-on-error || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
@@ -857,6 +892,73 @@ jobs:
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
# MI35x 8-GPU Qwen3-235B-MXFP4 (Accuracy + Performance)
nightly-8-gpu-mi35x-qwen3-235b-mxfp4:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-qwen3-235b-mxfp4')
runs-on: linux-mi35x-gpu-8
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Setup docker
run: |
touch github_summary.md
bash scripts/ci/amd/amd_ci_start_container.sh
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh
# Install tabulate for run_suite.py (missing in MI35x container)
bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
- name: Accuracy Test + Performance Test MI35x (8-GPU Qwen3-235B-MXFP4)
timeout-minutes: 120
run: |
> github_summary.md # Clear summary file
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-8-gpu-mi35x-qwen3-235b-mxfp4 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
nightly-8-gpu-mi35x-glm5:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-glm5')
runs-on: linux-mi35x-gpu-8
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Setup docker
run: |
touch github_summary.md
bash scripts/ci/amd/amd_ci_start_container.sh
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh
# Install tabulate for run_suite.py (missing in MI35x container)
bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
# GLM-5 requires latest transformers for glm_moe_dsa architecture
bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git
- name: Accuracy Test MI35x (8-GPU GLM-5 NSA)
timeout-minutes: 180
run: |
> github_summary.md # Clear summary file
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-glm5 --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$?
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
exit ${TEST_EXIT_CODE:-0}
# MI35x 8-GPU DeepSeek-V3.2 Performance Test (MTP)
nightly-perf-8-gpu-mi35x-deepseek-v32-mtp:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-perf-8-gpu-mi35x-deepseek-v32-mtp')
@@ -909,6 +1011,8 @@ jobs:
- nightly-8-gpu-deepseek-v32
- nightly-8-gpu-deepseek-v32-mtp
- nightly-8-gpu-kimi-k25
- nightly-8-gpu-qwen3-235b
- nightly-8-gpu-glm5
# MI35x jobs
- nightly-test-1-gpu-mi35x
- nightly-accuracy-8-gpu-mi35x
@@ -918,6 +1022,8 @@ jobs:
- nightly-accuracy-8-gpu-mi35x-deepseek-v32
- nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp
- nightly-8-gpu-mi35x-kimi-k25
- nightly-8-gpu-mi35x-qwen3-235b-mxfp4
- nightly-8-gpu-mi35x-glm5
# MI35x perf jobs excluded from check - perf failures don't block CI
# - nightly-perf-8-gpu-mi35x-deepseek-v32-basic
# - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp
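
Every AMD nightly job in this workflow repeats the same `if:` gate: the repository or event check must pass, and `job_filter` must be empty, `all`, or the job's own name. A small Python sketch of that condition follows. The function name `should_run` is hypothetical, and treating the filter as a plain string is an assumption.

```python
def should_run(job: str, job_filter: str, repository: str, event_name: str) -> bool:
    """Approximate the per-job `if:` gate used by the AMD nightly jobs.

    Illustrative only: GitHub evaluates the real expression with its own
    coercion rules, which this sketch does not model.
    """
    repo_ok = repository == "sgl-project/sglang" or event_name == "pull_request"
    filter_ok = job_filter in ("", "all", job)
    return repo_ok and filter_ok


if __name__ == "__main__":
    # A scheduled run on the main repository with no filter runs the job.
    print(should_run("nightly-8-gpu-glm5", "", "sgl-project/sglang", "schedule"))
    # A manual run filtered to a different job skips it.
    print(should_run("nightly-8-gpu-glm5", "nightly-8-gpu-qwen3-235b",
                     "sgl-project/sglang", "workflow_dispatch"))
```

Because the job name is repeated inside each job's own condition, a new job such as `nightly-8-gpu-glm5` also has to be added to the `job_filter` options list and to the final check list, as this diff does.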


@@ -44,6 +44,7 @@ concurrency:
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
HF_HUB_DOWNLOAD_TIMEOUT: 300
HF_HUB_ETAG_TIMEOUT: 300
@@ -68,6 +69,9 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-1-gpu --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# General tests - 4 GPU H100
nightly-test-general-4-gpu-h100:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-4-gpu-h100')
@@ -88,6 +92,9 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-4-gpu --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# General tests - 8 GPU H200
nightly-test-general-8-gpu-h200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h200')
@@ -120,6 +127,25 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=18000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
- name: Publish traces to storage repo
if: always()
continue-on-error: true
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
run: |
TRACE_ARGS=""
for dir in test/performance_profiles_*/; do
[ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
done
if [ -n "$TRACE_ARGS" ]; then
python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
find test/performance_profiles_*/ -name '*.json.gz' -delete
else
echo "No trace directories found, skipping publish"
fi
- name: Run test
timeout-minutes: 30
env:
@@ -148,6 +174,11 @@ jobs:
retention-days: 5
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
# General tests - 8 GPU H20
nightly-test-general-8-gpu-h20:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h20')
@@ -172,6 +203,9 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-h20 --nightly --continue-on-error
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# General tests - 8 GPU B200
nightly-test-general-8-gpu-b200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-b200')
@@ -201,6 +235,25 @@ jobs:
cd test
IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=12000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
- name: Publish traces to storage repo
if: always()
continue-on-error: true
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
run: |
TRACE_ARGS=""
for dir in test/performance_profiles_*/; do
[ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
done
if [ -n "$TRACE_ARGS" ]; then
python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
find test/performance_profiles_*/ -name '*.json.gz' -delete
else
echo "No trace directories found, skipping publish"
fi
- name: Collect performance metrics
if: always()
run: |
@@ -221,6 +274,11 @@ jobs:
retention-days: 5
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
# Text model accuracy tests
nightly-test-text-accuracy-2-gpu-runner:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-accuracy-2-gpu-runner')
@@ -241,6 +299,9 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-eval-text-2-gpu --nightly --continue-on-error --timeout-per-file 4500
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Text model performance tests
nightly-test-text-perf-2-gpu-runner:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-perf-2-gpu-runner')
@@ -264,7 +325,7 @@ jobs:
run: |
cd test
rm -rf performance_profiles_text_models/
python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error
python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error --timeout-per-file 3600
- name: Publish traces to storage repo
env:
@@ -274,6 +335,9 @@ jobs:
run: |
python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_text_models
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# VLM accuracy tests
nightly-test-vlm-accuracy-2-gpu-runner:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-accuracy-2-gpu-runner')
@@ -294,6 +358,9 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-eval-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 9000
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# VLM performance tests
nightly-test-vlm-perf-2-gpu-runner:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-perf-2-gpu-runner')
@@ -317,7 +384,7 @@ jobs:
run: |
cd test
rm -rf performance_profiles_vlms/
python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error
python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 3600
- name: Publish traces to storage repo
env:
@@ -327,6 +394,9 @@ jobs:
run: |
python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_vlms
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# diffusion performance tests
nightly-test-multimodal-server-1-gpu:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-1-gpu')
@@ -351,6 +421,7 @@ jobs:
env:
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
GITHUB_RUN_ID: ${{ github.run_id }}
GPU_CONFIG: "1-gpu-runner"
timeout-minutes: 60
run: |
@@ -360,6 +431,28 @@ jobs:
--partition-id ${{ matrix.part }} \
--total-partitions 2
- name: Collect diffusion performance metrics
if: always()
run: |
python3 scripts/ci/save_diffusion_metrics.py \
--gpu-config 1-gpu-runner \
--run-id ${{ github.run_id }} \
--output python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json \
--results-json python/diffusion-results.json
- name: Upload diffusion metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: diffusion-metrics-1gpu-partition-${{ matrix.part }}
path: python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json
retention-days: 90
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
nightly-test-multimodal-server-2-gpu:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-2-gpu')
@@ -384,6 +477,7 @@ jobs:
env:
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
GITHUB_RUN_ID: ${{ github.run_id }}
GPU_CONFIG: "2-gpu-runner"
timeout-minutes: 60
run: |
@@ -393,6 +487,29 @@ jobs:
--partition-id ${{ matrix.part }} \
--total-partitions 2
- name: Collect diffusion performance metrics
if: always()
run: |
python3 scripts/ci/save_diffusion_metrics.py \
--gpu-config 2-gpu-runner \
--run-id ${{ github.run_id }} \
--output python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json \
--results-json python/diffusion-results.json
- name: Upload diffusion metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: diffusion-metrics-2gpu-partition-${{ matrix.part }}
path: python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json
retention-days: 90
if-no-files-found: ignore
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
# B200 Performance tests - 4 GPU
nightly-test-perf-4-gpu-b200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-4-gpu-b200')
@@ -413,6 +530,9 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-4-gpu-b200 --nightly --continue-on-error --timeout-per-file 12000
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Specialized B200 tests - 8 GPU, for specific backends and configs
nightly-test-specialized-8-gpu-b200:
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-8-gpu-b200')
@@ -437,12 +557,17 @@ jobs:
cd test
python3 run_suite.py --hw cuda --suite nightly-8-gpu-b200 --nightly --continue-on-error --timeout-per-file 2400
# Consolidate performance metrics from all 8-GPU jobs
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
# Consolidate performance metrics from all jobs
consolidate-metrics:
if: github.repository == 'sgl-project/sglang' && always()
needs:
- nightly-test-general-8-gpu-h200
- nightly-test-general-8-gpu-b200
- nightly-test-multimodal-server-1-gpu
- nightly-test-multimodal-server-2-gpu
runs-on: ubuntu-latest
steps:
- name: Checkout code
@@ -453,7 +578,7 @@ jobs:
- name: Download all partition metrics
uses: actions/download-artifact@v4
with:
pattern: metrics-*
pattern: "*metrics-*"
path: metrics/
merge-multiple: true
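
The pattern change from `metrics-*` to `"*metrics-*"` matters because the new diffusion artifacts are named `diffusion-metrics-...`, which a pattern anchored at `metrics-` would not pick up. The quotes are needed because an unquoted YAML scalar starting with `*` is read as an alias. The sketch below approximates the matching with Python's `fnmatch`; the download action uses its own glob implementation, so this is an illustration rather than the exact matcher. The `metrics-h200-partition-0` name is a hypothetical stand-in for an existing metrics artifact.

```python
from fnmatch import fnmatchcase
from typing import List

ARTIFACTS = [
    "metrics-h200-partition-0",            # hypothetical existing metrics artifact
    "diffusion-metrics-1gpu-partition-0",  # name used by the new upload step
    "diffusion-metrics-2gpu-partition-1",  # name used by the new upload step
    "cuda-coredumps-some-job",             # unrelated artifact, should never match
]


def matching(pattern: str, names: List[str]) -> List[str]:
    """Return the names that match a shell-style glob, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    print(matching("metrics-*", ARTIFACTS))   # old pattern
    print(matching("*metrics-*", ARTIFACTS))  # new pattern
```

With the old pattern only the first name matches; with the new one the two diffusion artifacts are included as well, which fits the new `needs:` entries on the multimodal jobs.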

.github/workflows/patch-docker-dev.yml

@@ -0,0 +1,115 @@
name: Patch Docker Image
on:
workflow_dispatch:
inputs:
pr_numbers:
description: "Comma-separated PR numbers to apply (e.g. 18962,19010)"
required: false
default: ""
image_tag:
description: "Base image tag to patch (e.g. dev-x86, dev-x86-cu13)"
required: true
concurrency:
group: patch-docker-${{ inputs.image_tag }}
cancel-in-progress: true
jobs:
patch:
if: github.repository == 'sgl-project/sglang'
runs-on: x64-docker-build-node
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Pull base image and extract commit
run: |
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
docker pull "${IMAGE}"
if BASE_SHA=$(docker run --rm "${IMAGE}" git -C /sgl-workspace/sglang rev-parse HEAD 2>/dev/null); then
echo "Image built from commit: ${BASE_SHA}"
else
BASE_SHA=""
echo "::warning::Image has no .git directory — cannot extract base commit"
fi
echo "BASE_SHA=${BASE_SHA}" >> "$GITHUB_ENV"
- name: Generate patches
run: |
git config --global --add safe.directory "$GITHUB_WORKSPACE"
git fetch origin main
mkdir -p /tmp/patch-ctx
if [ -n "${{ inputs.pr_numbers }}" ]; then
IFS=',' read -ra PRS <<< "${{ inputs.pr_numbers }}"
for pr in "${PRS[@]}"; do
pr=$(echo "${pr}" | xargs)
echo "Fetching PR #${pr}"
git fetch origin "pull/${pr}/head:pr-${pr}"
MERGE_BASE=$(git merge-base origin/main "pr-${pr}")
echo " PR #${pr}: merge-base=${MERGE_BASE}"
git diff "${MERGE_BASE}..pr-${pr}" > "/tmp/patch-ctx/${pr}.patch"
echo " PR #${pr}: $(wc -l < /tmp/patch-ctx/${pr}.patch) lines"
done
elif [ -n "${BASE_SHA}" ]; then
echo "Generating diff: image ${BASE_SHA} → latest main"
git fetch origin "${BASE_SHA}"
git diff "${BASE_SHA}..origin/main" > /tmp/patch-ctx/main.patch
echo " main: $(wc -l < /tmp/patch-ctx/main.patch) lines"
else
echo "::error::No PR numbers specified and image has no .git — cannot generate diff against main"
exit 1
fi
TOTAL=$(cat /tmp/patch-ctx/*.patch | wc -l)
if [ "${TOTAL}" -eq 0 ]; then
echo "::warning::All patches are empty — image is already up to date"
echo "SKIP_BUILD=true" >> "$GITHUB_ENV"
fi
- name: Build patched image
if: env.SKIP_BUILD != 'true'
run: |
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
cat <<'DOCKERFILE' > /tmp/patch-ctx/Dockerfile
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
COPY *.patch /tmp/patches/
RUN cd /sgl-workspace/sglang \
&& for p in /tmp/patches/*.patch; do \
if [ ! -s "${p}" ]; then \
echo "Skipping ${p} (empty)"; \
else \
echo "Applying ${p}..." \
&& patch -p1 --fuzz=2 --no-backup-if-mismatch -f < "${p}" \
|| { echo "ERROR: Failed to apply ${p}"; exit 1; }; \
fi; \
done \
&& rm -rf /tmp/patches
DOCKERFILE
docker build \
--no-cache \
--build-arg BASE_IMAGE="${IMAGE}" \
-t "${IMAGE}" \
/tmp/patch-ctx/
- name: Push patched image
if: env.SKIP_BUILD != 'true'
run: |
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
docker push "${IMAGE}"
echo "### Patched \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY"
echo "- **Base commit:** \`${BASE_SHA:-unknown (no .git)}\`" >> "$GITHUB_STEP_SUMMARY"
echo "- **Source:** ${{ inputs.pr_numbers && format('PRs: {0}', inputs.pr_numbers) || 'latest main' }}" >> "$GITHUB_STEP_SUMMARY"
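
The "Generate patches" step above splits `pr_numbers` on commas, trims each entry, and fetches `pull/<n>/head` into a local `pr-<n>` branch. A minimal Python sketch of that parsing follows. The function name `pr_refspecs` is hypothetical, and skipping empty entries is an assumption of this sketch that the bash loop does not make.

```python
from typing import List


def pr_refspecs(pr_numbers: str) -> List[str]:
    """Turn '18962, 19010' into the refspecs passed to `git fetch origin`.

    Illustrative only: empty entries are skipped here, whereas the workflow's
    bash loop would attempt to fetch them.
    """
    refspecs = []
    for raw in pr_numbers.split(","):
        pr = raw.strip()  # plays the role of `xargs` trimming in the workflow
        if not pr:
            continue
        refspecs.append(f"pull/{pr}/head:pr-{pr}")
    return refspecs


if __name__ == "__main__":
    print(pr_refspecs("18962, 19010"))
    # An empty input selects the other branch of the workflow (diff against main).
    print(pr_refspecs(""))
```

Each fetched branch is then diffed against its merge base with `origin/main`, so a patch contains only that PR's own changes rather than everything main has gained since.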


@@ -0,0 +1,837 @@
name: PR Test ROCm 7.2 (AMD)
# Dynamic run-name for /rerun-stage commands to enable URL lookup
# Format: "[stage-name] sha" for fork PRs, "[stage-name]" for non-fork, default for normal runs
run-name: ${{ inputs.target_stage && (inputs.pr_head_sha && format('[{0}] {1}', inputs.target_stage, inputs.pr_head_sha) || format('[{0}]', inputs.target_stage)) || '' }}
on:
# Run the ROCm 7.2 PR tests once a day at 02:00 UTC to avoid overwhelming the CI system
schedule:
- cron: '0 2 * * *'
# push:
# branches: [ main ]
# paths:
# - "python/**"
# - "scripts/ci/**"
# - "test/**"
# - "sgl-kernel/**"
# - ".github/workflows/pr-test-amd-rocm720.yml"
# - "docker/rocm720.Dockerfile"
# pull_request:
# branches: [ main ]
# paths:
# - "python/**"
# - "scripts/ci/**"
# - "test/**"
# - "sgl-kernel/**"
# - ".github/workflows/pr-test-amd-rocm720.yml"
# - "docker/rocm720.Dockerfile"
workflow_dispatch:
inputs:
target_stage:
description: "Specific stage to run (optional, for quick testing)"
required: false
type: string
default: ""
pr_head_sha:
description: "PR head SHA to checkout (for /rerun-stage on fork PRs)"
required: false
type: string
default: ""
workflow_call:
inputs:
ref:
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
required: false
type: string
default: ''
run_all_tests:
description: "Run all tests (for releasing or testing purpose)"
required: false
type: boolean
default: false
concurrency:
# Include pr_head_sha in group for /rerun-stage dispatches to avoid collisions with main branch runs
group: pr-test-amd-rocm720-${{ inputs.pr_head_sha || inputs.ref || github.ref }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
jobs:
call-gate:
uses: ./.github/workflows/pr-gate.yml
secrets: inherit
check-changes:
needs: [call-gate]
runs-on: ubuntu-latest
outputs:
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
sgl_kernel: ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }}
jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }}
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Determine run mode
id: run-mode
run: |
# Run all tests when the run_all_tests input is true (set by workflow_call callers)
# Note: github.event_name is inherited from the caller, so it cannot be used to detect workflow_call
if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
echo "run_all_tests=true" >> $GITHUB_OUTPUT
echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
else
echo "run_all_tests=false" >> $GITHUB_OUTPUT
echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
fi
- name: Detect file changes
id: filter
uses: dorny/paths-filter@v3
if: steps.run-mode.outputs.run_all_tests != 'true'
with:
filters: |
main_package:
- "python/sglang/!(multimodal_gen)/**"
- "python/pyproject_rocm.toml"
- "python/pyproject_other.toml"
- "scripts/ci/amd/*"
- "scripts/ci/utils/*"
- "test/**"
- ".github/workflows/pr-test-amd-rocm720.yml"
sgl_kernel:
- "sgl-kernel/**"
- ".github/workflows/pr-test-amd-rocm720.yml"
jit_kernel:
- "python/sglang/jit_kernel/**"
- ".github/workflows/pr-test-amd-rocm720.yml"
multimodal_gen:
- "python/sglang/multimodal_gen/**"
- "python/sglang/cli/**"
- "python/pyproject_rocm.toml"
- "python/pyproject_other.toml"
# =============================================== sgl-kernel ====================================================
sgl-kernel-unit-test-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'sgl-kernel-unit-test-amd') ||
(
!inputs.target_stage &&
needs.check-changes.outputs.sgl_kernel == 'true'
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 14
run: |
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_align.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_softmax.py
docker exec -w /sglang-checkout/sgl-kernel/tests/speculative ci_sglang python3 -m pytest test_eagle_utils.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_apply_token_bitmask_inplace.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_activation.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_topk.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_kvcacheio.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_sigmoid.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_torch_defaults_reset.py
sgl-kernel-unit-test-2-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'sgl-kernel-unit-test-2-gpu-amd') ||
(
!inputs.target_stage &&
needs.check-changes.outputs.sgl_kernel == 'true'
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-2]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 20
run: |
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_deterministic_custom_allreduce.py
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_nccl_allreduce_determinism.py
# =============================================== primary ====================================================
stage-a-test-1-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-a-test-1-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 10
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-amd --continue-on-error
jit-kernel-unit-test-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'jit-kernel-unit-test-amd') ||
(
!inputs.target_stage &&
needs.check-changes.outputs.jit_kernel == 'true'
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run JIT kernel unit tests
timeout-minutes: 10
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout" python3 -m pytest -q python/sglang/jit_kernel/tests/test_store_cache.py
stage-b-test-small-1-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-b-test-small-1-gpu-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-1]
part: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 30
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 13 --timeout-per-file 1800 --continue-on-error
stage-b-test-small-1-gpu-amd-mi35x:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-b-test-small-1-gpu-amd-mi35x') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi35x-gpu-1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 30
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-amd-mi35x --continue-on-error
stage-b-test-large-1-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-b-test-large-1-gpu-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-1]
part: [0, 1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 30
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-1-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1800 --continue-on-error
stage-b-test-large-2-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-b-test-large-2-gpu-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-2]
part: [0, 1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 30
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-2-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1800 --continue-on-error
multimodal-gen-test-1-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'multimodal-gen-test-1-gpu-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT
matrix:
runner: [linux-mi325-gpu-1]
part: [0, 1] # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Download artifacts
if: needs.check-changes.outputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build diffusion
docker exec ci_sglang pip install amdsmi
- name: Setup kernel caches
run: |
# Use the persistent /sgl-data directory (mounted from /home/runner/sgl-data)
# This directory persists across container restarts on the self-hosted runner
docker exec ci_sglang mkdir -p /sgl-data/aiter-kernels /sgl-data/miopen-cache /sgl-data/hf-cache/hub
# Clear pre-built AITER kernels from Docker image to avoid segfaults
# The image may have stale/incompatible kernels at /sgl-workspace/aiter/aiter/jit/
echo "Clearing pre-built AITER kernels from Docker image..."
docker exec ci_sglang rm -rf /sgl-workspace/aiter/aiter/jit/*.so 2>/dev/null || true
docker exec ci_sglang rm -rf /sgl-data/aiter-kernels/*.so 2>/dev/null || true
echo "AITER kernels cleared - will be rebuilt on first use"
# Create persistent cache marker if /sgl-data is a real mount (not ephemeral)
# This tells the test cleanup code to NOT delete downloaded models
if docker exec ci_sglang test -d /sgl-data && docker exec ci_sglang mountpoint -q /sgl-data 2>/dev/null; then
docker exec ci_sglang touch /sgl-data/hf-cache/.persistent_cache
echo "Created .persistent_cache marker - HF cache will persist"
else
echo "WARNING: /sgl-data is not a mount point - models will be cleaned up after each test"
fi
# Check MIOpen cache (VAE convolution kernels)
miopen_files=$(docker exec ci_sglang find /sgl-data/miopen-cache -name "*.udb" 2>/dev/null | wc -l || echo "0")
echo "Found ${miopen_files} MIOpen cache files"
- name: Diagnose HF cache and system resources
run: |
echo "=== System Memory Status ==="
free -h
echo ""
echo "=== Disk Space ==="
df -h /home/runner/sgl-data 2>/dev/null || df -h
echo ""
echo "=== HF Cache Directory Structure ==="
docker exec ci_sglang ls -la /sgl-data/hf-cache/ 2>/dev/null || echo "HF cache dir not found"
docker exec ci_sglang ls -la /sgl-data/hf-cache/hub/ 2>/dev/null || echo "HF hub cache not found"
echo ""
echo "=== Checking for cached diffusion models (1-GPU tests) ==="
# Models used in 1-GPU tests: Wan2.1-T2V-1.3B, HunyuanVideo, Qwen-Image, FLUX.1, FLUX.2
for model in "Wan-AI--Wan2.1-T2V-1.3B-Diffusers" "tencent--HunyuanVideo" "Qwen--Qwen-Image" "black-forest-labs--FLUX.1-dev" "black-forest-labs--FLUX.2-dev"; do
cache_path="/sgl-data/hf-cache/hub/models--${model}"
if docker exec ci_sglang test -d "$cache_path"; then
size=$(docker exec ci_sglang du -sh "$cache_path" 2>/dev/null | cut -f1)
echo "✓ CACHED: $model ($size)"
else
echo "✗ NOT CACHED: $model"
fi
done
echo ""
echo "=== GPU Memory Status ==="
docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available"
- name: Run diffusion server tests (1-GPU)
timeout-minutes: 60
run: |
# AMD CI: All 1-GPU tests except FLUX.2 (FLUX.1 covers same code path)
# Tests: T2V, T2I, I2V, LoRA
#
# HF download env vars:
# - HF_HUB_ENABLE_HF_TRANSFER=1: Use faster hf_transfer for downloads (if available)
# - HF_HUB_DISABLE_SYMLINKS_WARNING=1: Suppress symlink warnings
docker exec \
-e SGLANG_E2E_TOLERANCE=0.3 \
-e SGLANG_STAGE_TIME_TOLERANCE=0.2 \
-e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \
-e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \
-e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \
-e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \
-e AITER_JIT_DIR=/sgl-data/aiter-kernels \
-e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
-e HF_HUB_DISABLE_SYMLINKS_WARNING=1 \
-w /sglang-checkout/python \
ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \
--suite 1-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2 \
-k "not flux_2"
# Post-test diagnostics
echo "=== Post-test System Memory Status ==="
free -h
multimodal-gen-test-2-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'multimodal-gen-test-2-gpu-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT
matrix:
runner: [linux-mi325-gpu-2]
part: [0, 1] # 2 partitions: 9 tests ÷ 2 = ~4-5 tests each
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Download artifacts
if: needs.check-changes.outputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build diffusion
docker exec ci_sglang pip install amdsmi
- name: Setup kernel caches
run: |
# Use the persistent /sgl-data directory (mounted from /home/runner/sgl-data)
docker exec ci_sglang mkdir -p /sgl-data/aiter-kernels /sgl-data/miopen-cache /sgl-data/hf-cache/hub
# Clear pre-built AITER kernels from Docker image to avoid segfaults
# The image may have stale/incompatible kernels at /sgl-workspace/aiter/aiter/jit/
echo "Clearing pre-built AITER kernels from Docker image..."
docker exec ci_sglang rm -rf /sgl-workspace/aiter/aiter/jit/*.so 2>/dev/null || true
docker exec ci_sglang rm -rf /sgl-data/aiter-kernels/*.so 2>/dev/null || true
echo "AITER kernels cleared - will be rebuilt on first use"
# Create persistent cache marker if /sgl-data is a real mount (not ephemeral)
# This tells the test cleanup code to NOT delete downloaded models
if docker exec ci_sglang test -d /sgl-data && docker exec ci_sglang mountpoint -q /sgl-data 2>/dev/null; then
docker exec ci_sglang touch /sgl-data/hf-cache/.persistent_cache
echo "Created .persistent_cache marker - HF cache will persist"
else
echo "WARNING: /sgl-data is not a mount point - models will be cleaned up after each test"
fi
# Check MIOpen cache (VAE convolution kernels)
miopen_files=$(docker exec ci_sglang find /sgl-data/miopen-cache -name "*.udb" 2>/dev/null | wc -l || echo "0")
echo "Found ${miopen_files} MIOpen cache files"
- name: Diagnose HF cache and system resources
run: |
echo "=== System Memory Status ==="
free -h
echo ""
echo "=== Disk Space ==="
df -h /home/runner/sgl-data 2>/dev/null || df -h
echo ""
echo "=== HF Cache Directory Structure ==="
docker exec ci_sglang ls -la /sgl-data/hf-cache/ 2>/dev/null || echo "HF cache dir not found"
docker exec ci_sglang ls -la /sgl-data/hf-cache/hub/ 2>/dev/null || echo "HF hub cache not found"
echo ""
echo "=== Checking for cached diffusion models (2-GPU tests) ==="
# Models used in 2-GPU tests: Wan2.2-T2V-A14B, Wan2.1-T2V-14B, Qwen-Image, FLUX.1
for model in "Wan-AI--Wan2.2-T2V-A14B-Diffusers" "Wan-AI--Wan2.1-T2V-14B-Diffusers" "Qwen--Qwen-Image" "black-forest-labs--FLUX.1-dev"; do
cache_path="/sgl-data/hf-cache/hub/models--${model}"
if docker exec ci_sglang test -d "$cache_path"; then
size=$(docker exec ci_sglang du -sh "$cache_path" 2>/dev/null | cut -f1)
echo "✓ CACHED: $model ($size)"
else
echo "✗ NOT CACHED: $model"
fi
done
echo ""
echo "=== GPU Memory Status ==="
docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available"
- name: Run diffusion server tests (2-GPU)
timeout-minutes: 80
run: |
# AMD CI: All 2-GPU tests including LoRA
# Tests: T2V, T2I, I2V, LoRA
#
# HF download env vars:
# - HF_HUB_ENABLE_HF_TRANSFER=1: Use faster hf_transfer for downloads (if available)
# - HF_HUB_DISABLE_SYMLINKS_WARNING=1: Suppress symlink warnings
docker exec \
-e SGLANG_E2E_TOLERANCE=0.3 \
-e SGLANG_STAGE_TIME_TOLERANCE=0.2 \
-e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \
-e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \
-e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \
-e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \
-e AITER_JIT_DIR=/sgl-data/aiter-kernels \
-e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
-e HF_HUB_DISABLE_SYMLINKS_WARNING=1 \
-w /sglang-checkout/python \
ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \
--suite 2-gpu \
--partition-id ${{ matrix.part }} \
--total-partitions 2
# Post-test diagnostics
echo "=== Post-test System Memory Status ==="
free -h
stage-c-test-large-8-gpu-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-c-test-large-8-gpu-amd') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
env:
RUNNER_LABELS: linux-mi325-gpu-8
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-8]
part: [0, 1, 2]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Test RCCL multi-GPU communication
timeout-minutes: 5
run: |
echo "Testing RCCL multi-GPU communication with debug info..."
docker exec ci_sglang bash -c "cd /sglang-checkout && NCCL_DEBUG=INFO RCCL_DEBUG=INFO torchrun --nproc_per_node=8 scripts/ci/amd/test_rccl_multi_gpu.py"
- name: Run test
timeout-minutes: 60
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 3600 --continue-on-error
stage-c-test-large-8-gpu-amd-mi35x:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'stage-c-test-large-8-gpu-amd-mi35x') ||
(
!inputs.target_stage &&
(!failure() && !cancelled()) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi35x-gpu-8]
part: [0]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
- name: Run test
timeout-minutes: 60
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd-mi35x --auto-partition-id ${{ matrix.part }} --auto-partition-size 1 --timeout-per-file 3600 --continue-on-error
pr-test-amd-finish:
needs:
[
call-gate,
check-changes,
sgl-kernel-unit-test-amd,
sgl-kernel-unit-test-2-gpu-amd,
multimodal-gen-test-1-gpu-amd,
multimodal-gen-test-2-gpu-amd,
stage-a-test-1-amd,
jit-kernel-unit-test-amd,
stage-b-test-small-1-gpu-amd,
stage-b-test-small-1-gpu-amd-mi35x,
stage-b-test-large-1-gpu-amd,
stage-b-test-large-2-gpu-amd,
stage-c-test-large-8-gpu-amd,
stage-c-test-large-8-gpu-amd-mi35x,
]
if: always()
runs-on: ubuntu-latest
steps:
- name: Check all dependent job statuses
run: |
# Convert the 'needs' context to a JSON string
json_needs='${{ toJson(needs) }}'
# Get a list of all job names from the JSON keys
job_names=$(echo "$json_needs" | jq -r 'keys_unsorted[]')
for job in $job_names; do
# For each job, extract its result
result=$(echo "$json_needs" | jq -r --arg j "$job" '.[$j].result')
# Print the job name and its result
echo "$job: $result"
# Check for failure or cancellation and exit if found
if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then
echo "The above jobs failed."
exit 1
fi
done
# If the loop completes, no job failed or was cancelled (skipped jobs are allowed)
echo "All jobs completed successfully"
exit 0
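
The status loop above treats only `failure` and `cancelled` as fatal, so jobs reported as `skipped` do not fail the finish job. A minimal sketch of that rule without `jq`, assuming a hypothetical function `check_results` that reads whitespace-separated `job result` pairs from standard input (the workflow instead extracts them from `toJson(needs)`):

```shell
# Hypothetical helper; the input format ("job result" per line) is an
# assumption for illustration, not what the workflow receives.
check_results() {
  while read -r job result; do
    echo "$job: $result"
    case "$result" in
      failure|cancelled)
        # Stop at the first job that failed or was cancelled.
        echo "Job $job did not succeed."
        return 1
        ;;
    esac
  done
  # Reached only when no job failed or was cancelled; skipped jobs pass.
  echo "No job failed or was cancelled"
}
```

For example, `printf 'a success\nb skipped\n' | check_results` exits with status 0, while any `failure` or `cancelled` entry makes it exit with status 1.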


@@ -62,6 +62,7 @@ jobs:
outputs:
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
sgl_kernel: ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }}
jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }}
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }}
steps:
- name: Checkout code
@@ -99,6 +100,9 @@ jobs:
sgl_kernel:
- "sgl-kernel/**"
- ".github/workflows/pr-test-amd.yml"
jit_kernel:
- "python/sglang/jit_kernel/**"
- ".github/workflows/pr-test-amd.yml"
multimodal_gen:
- "python/sglang/multimodal_gen/**"
- "python/sglang/cli/**"
@@ -235,6 +239,45 @@ jobs:
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-amd
jit-kernel-unit-test-amd:
needs: [check-changes]
if: |
always() &&
(
(inputs.target_stage == 'jit-kernel-unit-test-amd') ||
(
!inputs.target_stage &&
needs.check-changes.outputs.jit_kernel == 'true'
)
)
strategy:
fail-fast: false
matrix:
runner: [linux-mi325-gpu-1]
runs-on: ${{matrix.runner}}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
- name: Ensure VRAM is clear
run: bash scripts/ensure_vram_clear.sh rocm
- name: Start CI container
run: bash scripts/ci/amd/amd_ci_start_container.sh
env:
GITHUB_WORKSPACE: ${{ github.workspace }}
- name: Install dependencies
run: |
bash scripts/ci/amd/amd_ci_install_dependency.sh
- name: Run JIT kernel unit tests
timeout-minutes: 10
run: |
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout" python3 -m pytest -q python/sglang/jit_kernel/tests/test_store_cache.py
stage-b-test-small-1-gpu-amd:
needs: [check-changes, stage-a-test-1-amd]
if: |
@@ -484,7 +527,7 @@ jobs:
docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available"
- name: Run diffusion server tests (1-GPU)
timeout-minutes: 60
timeout-minutes: 70
run: |
# AMD CI: All 1-GPU tests except FLUX.2 (FLUX.1 covers same code path)
# Tests: T2V, T2I, I2V, LoRA
@@ -845,6 +888,7 @@ jobs:
multimodal-gen-test-2-gpu-amd,
stage-a-test-1-amd,
jit-kernel-unit-test-amd,
stage-b-test-small-1-gpu-amd,
stage-b-test-small-1-gpu-amd-mi35x,
stage-b-test-large-1-gpu-amd,


@@ -28,9 +28,9 @@ jobs:
check-changes:
runs-on: ubuntu-latest
outputs:
changes_exist: ${{ steps.filter.outputs.main_package || steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests}}
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }}
changes_exist: ${{ steps.filter.outputs.main_package == 'true' || steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true'}}
main_package: ${{ steps.filter.outputs.main_package == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
steps:
- name: Checkout code
uses: actions/checkout@v4


@@ -52,20 +52,24 @@ on:
default: false
concurrency:
# Concurrency group structure: pr-test-{branch}-{pr_sha}-{stage}
# Concurrency group structure: pr-test-{event}-{branch}-{pr_sha}-{stage}
# - event_name prevents scheduled runs from colliding with fork PRs whose branch is named 'main'
# (without it, both resolve the branch segment to 'main' and block each other)
# - github.head_ref (pull_request) or github.ref_name (workflow_dispatch) normalizes to branch name
# - pr_head_sha isolates /rerun-stage from main branch runs
# - target_stage allows parallel stage dispatches to run independently
# Because event_name is part of the group, pull_request and workflow_dispatch runs on the same branch fall into separate groups
group: pr-test-${{ github.head_ref || github.ref_name || 'default' }}-${{ inputs.pr_head_sha || 'current' }}-${{ inputs.target_stage || inputs.ref || 'all' }}
group: pr-test-${{ github.event_name }}-${{ github.head_ref || github.ref_name || 'default' }}-${{ inputs.pr_head_sha || 'current' }}-${{ inputs.target_stage || inputs.ref || 'all' }}
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
env:
SGLANG_IS_IN_CI: true
SGLANG_CUDA_COREDUMP: "1"
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
permissions:
actions: write
contents: read
pull-requests: read
jobs:
# =============================================== check changes ====================================================
@@ -128,6 +132,7 @@ jobs:
- ".github/workflows/pr-test.yml"
multimodal_gen:
- "python/sglang/multimodal_gen/**"
- "python/sglang/jit_kernel/**"
- "python/sglang/cli/**"
- "python/pyproject.toml"
- ".github/workflows/pr-test.yml"
@@ -198,6 +203,8 @@ jobs:
- name: Set max-parallel based on run type
id: set-parallel
env:
GH_TOKEN: ${{ github.token }}
run: |
# Scheduled runs and high-priority PRs get full parallelism
if [[ "${{ github.event_name }}" == "schedule" ]]; then
@@ -206,6 +213,27 @@ jobs:
elif [[ "${{ github.event_name }}" == "pull_request" && "${{ contains(github.event.pull_request.labels.*.name, 'high priority') }}" == "true" ]]; then
echo "max_parallel=14" >> $GITHUB_OUTPUT
echo "High priority PR detected, setting max_parallel to 14"
elif [[ -n "${{ inputs.target_stage }}" ]]; then
# /rerun-stage (workflow_dispatch): query PR labels via GitHub API
# Try SHA lookup first (fork PRs), fallback to branch name (non-fork PRs)
LABELS=""
PR_HEAD_SHA="${{ inputs.pr_head_sha }}"
if [[ -n "$PR_HEAD_SHA" ]]; then
LABELS=$(gh api "repos/${{ github.repository }}/commits/${PR_HEAD_SHA}/pulls" \
--jq '.[0].labels[].name' 2>/dev/null || true)
fi
if [[ -z "$LABELS" ]]; then
LABELS=$(gh pr list --head "${{ github.ref_name }}" --repo "${{ github.repository }}" \
--json labels --jq '.[0].labels[].name' 2>/dev/null || true)
fi
echo "PR labels: ${LABELS:-"(none)"}"
if echo "$LABELS" | grep -Fxq "high priority"; then
echo "max_parallel=14" >> $GITHUB_OUTPUT
echo "High priority PR detected via API (/rerun-stage), setting max_parallel to 14"
else
echo "max_parallel=3" >> $GITHUB_OUTPUT
echo "Using default max_parallel of 3 (/rerun-stage, no high priority label)"
fi
else
echo "max_parallel=3" >> $GITHUB_OUTPUT
echo "Using default max_parallel of 3"
@@ -848,6 +876,9 @@ jobs:
# temporarily put backend-independent cpu tests here
python3 run_suite.py --hw cpu --suite default $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
stage-a-cpu-only:
needs: [check-changes, call-gate]
if: |
@@ -950,6 +981,11 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-b-test-small-1-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
# Runs on H100 (80GB, SM90) - tests that don't pass on 5090 (FA3, FP8, high VRAM, etc.)
stage-b-test-large-1-gpu:
needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels]
@@ -1001,6 +1037,11 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-b-test-large-1-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 14 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
stage-b-test-large-2-gpu:
needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels]
if: |
@@ -1053,6 +1094,11 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-b-test-large-2-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.partition }}
stage-b-test-4-gpu-b200:
needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels]
if: |
@@ -1106,6 +1152,9 @@ jobs:
run: |
IS_BLACKWELL=1 python3 -m pytest -q python/sglang/jit_kernel/tests/test_flash_attention_4.py
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
multimodal-gen-test-1-gpu:
needs: [check-changes, call-gate, sgl-kernel-build-wheels]
if: |
@@ -1158,6 +1207,10 @@ jobs:
--total-partitions 2 \
$CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
multimodal-gen-test-2-gpu:
needs: [check-changes, call-gate, sgl-kernel-build-wheels]
@@ -1212,6 +1265,11 @@ jobs:
--total-partitions 2 \
$CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
stage-c-test-4-gpu-h100:
needs: [check-changes, call-gate, wait-for-stage-b]
if: |
@@ -1261,6 +1319,11 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-h100 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
stage-c-test-8-gpu-h200:
needs: [check-changes, call-gate, wait-for-stage-b]
if: |
@@ -1300,14 +1363,22 @@ jobs:
run: |
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
# - name: Warmup Weights and JIT Compilation
# timeout-minutes: 20
# run: |
# # An example command for testing the warmup. TODO: make this more general and move them to python scripts.
# python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --trust-remote-code
- name: Warmup DeepGEMM JIT Compilation
timeout-minutes: 25
run: |
python3 scripts/ci/cuda/warmup_deep_gemm.py \
deepseek-ai/DeepSeek-V3-0324:8 \
deepseek-ai/DeepSeek-V3.2-Exp:8
- name: Warmup Server CUDA Graphs
timeout-minutes: 25
run: |
python3 scripts/ci/cuda/warmup_server.py \
deepseek-ai/DeepSeek-V3-0324:8 \
inclusionAI/Ring-2.5-1T:8
- name: Run test
timeout-minutes: 20
timeout-minutes: 30
run: |
cd test
CONTINUE_ON_ERROR_FLAG=""
@@ -1316,6 +1387,11 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-c-test-8-gpu-h200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
stage-c-test-8-gpu-h20:
needs: [check-changes, call-gate, wait-for-stage-b]
if: |
@@ -1366,6 +1442,11 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-c-test-8-gpu-h20 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
artifact-suffix: ${{ matrix.part }}
stage-c-test-deepep-4-gpu:
needs: [check-changes, call-gate, wait-for-stage-b]
if: |
@@ -1411,6 +1492,9 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-c-test-deepep-4-gpu $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
stage-c-test-deepep-8-gpu-h200:
needs: [check-changes, call-gate, wait-for-stage-b]
if: |
@@ -1446,6 +1530,19 @@ jobs:
run: |
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_deepep.sh
- name: Warmup DeepGEMM JIT Compilation
timeout-minutes: 25
run: |
python3 scripts/ci/cuda/warmup_deep_gemm.py \
deepseek-ai/DeepSeek-V3-0324:8 \
deepseek-ai/DeepSeek-V3.2-Exp:8
- name: Warmup Server CUDA Graphs
timeout-minutes: 25
run: |
python3 scripts/ci/cuda/warmup_server.py \
deepseek-ai/DeepSeek-V3-0324:8
- name: Run test
timeout-minutes: 45
run: |
@@ -1456,6 +1553,9 @@ jobs:
fi
python3 run_suite.py --hw cuda --suite stage-c-test-deepep-8-gpu-h200 $CONTINUE_ON_ERROR_FLAG
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
stage-c-test-4-gpu-b200:
needs: [check-changes, call-gate, wait-for-stage-b]
if: |
@@ -1506,52 +1606,62 @@ jobs:
fi
IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-b200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG
stage-c-test-4-gpu-gb200:
needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels-arm]
if: |
always() &&
(
(inputs.target_stage == 'stage-c-test-4-gpu-gb200') ||
(
!inputs.target_stage &&
((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) &&
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
)
)
runs-on: 4-gpu-gb200
timeout-minutes: 240
env:
RUNNER_LABELS: 4-gpu-gb200
strategy:
fail-fast: false
steps:
- name: Checkout code
uses: actions/checkout@v4
- uses: ./.github/actions/upload-cuda-coredumps
if: always()
with:
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
artifact-suffix: ${{ matrix.part }}
- name: Download artifacts
if: needs.check-changes.outputs.sgl_kernel == 'true'
uses: actions/download-artifact@v4
with:
path: sgl-kernel/dist/
merge-multiple: true
pattern: wheel-python3.10-cuda12.9-aarch64
- name: Install dependencies
timeout-minutes: 20
run: |
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 GRACE_BLACKWELL=1 bash scripts/ci/cuda/ci_install_deepep.sh
- name: Run test
timeout-minutes: 45
run: |
cd test
CONTINUE_ON_ERROR_FLAG=""
if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then
CONTINUE_ON_ERROR_FLAG="--continue-on-error"
fi
python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-gb200 --timeout-per-file 3600 $CONTINUE_ON_ERROR_FLAG
# NOTE: GB200 stage temporarily disabled — no company-owned GB200 runner available yet.
# Re-enable when a 4-gpu-gb200 runner is provisioned.
# stage-c-test-4-gpu-gb200:
# needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels-arm]
# if: |
# always() &&
# (
# (inputs.target_stage == 'stage-c-test-4-gpu-gb200') ||
# (
# !inputs.target_stage &&
# ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) &&
# ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
# )
# )
# runs-on: 4-gpu-gb200
# timeout-minutes: 240
# env:
# RUNNER_LABELS: 4-gpu-gb200
# strategy:
# fail-fast: false
# steps:
# - name: Checkout code
# uses: actions/checkout@v4
# with:
# ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
#
# - name: Download artifacts
# if: needs.check-changes.outputs.sgl_kernel == 'true'
# uses: actions/download-artifact@v4
# with:
# path: sgl-kernel/dist/
# merge-multiple: true
# pattern: wheel-python3.10-cuda12.9-aarch64
#
# - name: Install dependencies
# timeout-minutes: 20
# run: |
# CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 GRACE_BLACKWELL=1 bash scripts/ci/cuda/ci_install_deepep.sh
#
# - name: Run test
# timeout-minutes: 45
# run: |
# cd test
# CONTINUE_ON_ERROR_FLAG=""
# if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then
# CONTINUE_ON_ERROR_FLAG="--continue-on-error"
# fi
# python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-gb200 --timeout-per-file 3600 $CONTINUE_ON_ERROR_FLAG
#
# - uses: ./.github/actions/upload-cuda-coredumps
# if: always()
pr-test-finish:
needs:
@@ -1585,7 +1695,7 @@ jobs:
stage-c-test-deepep-4-gpu,
stage-c-test-deepep-8-gpu-h200,
stage-c-test-4-gpu-b200,
stage-c-test-4-gpu-gb200,
# stage-c-test-4-gpu-gb200, # Temporarily disabled — no GB200 runner
]
if: always()
runs-on: ubuntu-latest


@@ -16,6 +16,7 @@ on:
permissions:
actions: write
contents: write
pull-requests: read
jobs:
cut-release-branch:


@@ -2,7 +2,7 @@ name: Release Docker Images Nightly (AMD)
on:
workflow_dispatch:
schedule:
- cron: '0 13 * * *'
- cron: '0 12 * * *'
concurrency:
# A PR number if a pull request and otherwise the commit hash. This cancels
@@ -78,7 +78,7 @@ jobs:
tag=v${version}-${rocm_tag}
docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
# Temporarily disable docker cache seeding until performant storage is in place


@@ -0,0 +1,82 @@
name: Release Docker Images ROCm 7.2.0 Nightly Preview (AMD)
on:
workflow_dispatch:
schedule:
- cron: '0 12 * * *'
concurrency:
# A PR number if a pull request and otherwise the commit hash. This cancels
# queued and in-progress runs for the same PR (presubmit) or commit
# (postsubmit). The workflow name is prepended to avoid conflicts between
# different workflows.
group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
cancel-in-progress: True
jobs:
publish:
if: github.repository == 'sgl-project/sglang'
runs-on: amd-docker-scale
environment: 'prod'
strategy:
fail-fast: false
matrix:
gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720']
build_type: ['all']
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0 # Required for git describe to find tags
- name: "Set Date"
run: |
echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
- name: Get version from latest tag
id: version
run: |
# Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
if [ -z "$VERSION" ]; then
echo "::error::Could not determine version from git tags"
exit 1
fi
# Get short commit hash of current HEAD
COMMIT_HASH=$(git rev-parse --short HEAD)
# Compose pretend version for setuptools_scm: e.g., 0.5.8.post1.dev20260211+g1a2b3c4
PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}"
echo "version=${VERSION}" >> $GITHUB_OUTPUT
echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT
echo "Detected version: ${VERSION}"
echo "Pretend version for pip: ${PRETEND_VERSION}"
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
- name: Build and Push
run: |
version=${{ steps.version.outputs.version }}
pretend_version=${{ steps.version.outputs.pretend_version }}
echo "Version: ${version}"
echo "Pretend version: ${pretend_version}"
if [ "${{ matrix.gpu_arch }}" = "gfx942-rocm720" ]; then
rocm_tag="rocm720-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950-rocm720" ]; then
rocm_tag="rocm720-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
tag=v${version}-${rocm_tag}
docker build . -f docker/rocm720.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
docker push rocm/sgl-dev:${tag}-${{ env.DATE }}


@@ -16,6 +16,7 @@ jobs:
environment: 'prod'
strategy:
matrix:
rocm_version: ['rocm700', 'rocm720']
gpu_arch: ['gfx942', 'gfx950']
build_type: ['all']
steps:
@@ -55,17 +56,36 @@ jobs:
version=${{ steps.version.outputs.version }}
echo "Version: ${version}"
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm700-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
rocm_tag="rocm700-mi35x"
dockerfile=""
gpu_arch_suffix=""
if [ "${{ matrix.rocm_version }}" = "rocm700" ]; then
dockerfile="docker/rocm.Dockerfile"
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm700-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
rocm_tag="rocm700-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
elif [ "${{ matrix.rocm_version }}" = "rocm720" ]; then
gpu_arch_suffix="-${{ matrix.rocm_version }}"
dockerfile="docker/rocm720.Dockerfile"
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
rocm_tag="rocm720-mi30x"
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
rocm_tag="rocm720-mi35x"
else
echo "Unsupported gfx arch"
exit 1
fi
else
echo "Unsupported gfx arch"
echo "Unsupported rocm version"
exit 1
fi
tag=v${version}-${rocm_tag}
# rocm.Dockerfile expects SGL_BRANCH with 'v' prefix for git tag checkout
docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t lmsysorg/sglang:${tag} --no-cache
docker build . -f ${dockerfile} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }}${gpu_arch_suffix} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t lmsysorg/sglang:${tag} --no-cache
docker push lmsysorg/sglang:${tag}
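
The added branching above picks a Dockerfile, a `GPU_ARCH` build argument, and an image tag from the `rocm_version` and `gpu_arch` matrix values. A minimal Python sketch of that mapping follows; the function and table names are hypothetical and only illustrate the shell logic, they are not part of the workflow.

```python
# Illustrative sketch only: mirrors the shell branching in the workflow step.
# ROCM_BUILDS, GPU_FAMILY and resolve_rocm_build are hypothetical names.
ROCM_BUILDS = {
    "rocm700": {"dockerfile": "docker/rocm.Dockerfile", "arch_suffix": ""},
    "rocm720": {"dockerfile": "docker/rocm720.Dockerfile", "arch_suffix": "-rocm720"},
}
GPU_FAMILY = {"gfx942": "mi30x", "gfx950": "mi35x"}


def resolve_rocm_build(rocm_version, gpu_arch):
    """Return the Dockerfile, GPU_ARCH build arg and image tag suffix."""
    # The shell script checks the ROCm version first, then the gfx arch.
    if rocm_version not in ROCM_BUILDS:
        raise ValueError("Unsupported rocm version")
    if gpu_arch not in GPU_FAMILY:
        raise ValueError("Unsupported gfx arch")
    build = ROCM_BUILDS[rocm_version]
    return {
        "dockerfile": build["dockerfile"],
        # rocm720 appends "-rocm720" to GPU_ARCH; rocm700 passes it unchanged.
        "gpu_arch_arg": gpu_arch + build["arch_suffix"],
        "rocm_tag": f"{rocm_version}-{GPU_FAMILY[gpu_arch]}",
    }


if __name__ == "__main__":
    print(resolve_rocm_build("rocm720", "gfx950"))
```

A table-driven form like this keeps the two error messages distinct, which is what the changed `echo "Unsupported rocm version"` line in the diff is after.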


@@ -10,9 +10,9 @@ on:
description: "Version to build (without v prefix, e.g., 0.5.8)"
required: true
flashinfer_version:
description: "FlashInfer version (default: 0.6.1)"
description: "FlashInfer version (default: 0.6.3)"
required: false
default: "0.6.1"
default: "0.6.3"
jobs:
publish-x86:


@@ -1,122 +0,0 @@
name: Build and Push CUDA 13 Docker Images
# release this manually via workflow_dispatch for now
on:
workflow_dispatch:
schedule:
- cron: "0 0 * * *"
jobs:
build-dev:
if: ${{ github.repository == 'sgl-project/sglang' }}
runs-on: ${{ matrix.runner }}
strategy:
matrix:
include:
- runner: x64-docker-build-node
platform: linux/amd64
build_type: all
grace_blackwell: 0
tag: dev-x86-cu13
version: 13.0.1
- runner: arm-docker-build-node
platform: linux/arm64
build_type: all
grace_blackwell: 1
tag: dev-arm64-cu13
version: 13.0.1
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
- name: Free disk space
uses: jlumbroso/free-disk-space@main
with:
tool-cache: true
docker-images: true
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build and Push Dev Image
run: |
docker buildx build \
--platform ${{ matrix.platform }} \
--push \
--target framework \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.version }} \
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
--build-arg USE_LATEST_SGLANG=1 \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
-t lmsysorg/sglang:${{ matrix.tag }} \
--no-cache \
.
create-manifests:
runs-on: ubuntu-22.04
needs: [build-dev]
if: ${{ github.repository == 'sgl-project/sglang' }}
strategy:
matrix:
variant:
- tag: dev-cu13
x86_tag: dev-x86-cu13
arm64_tag: dev-arm64-cu13
steps:
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- run: |
docker buildx imagetools create \
-t lmsysorg/sglang:${{ matrix.variant.tag }} \
-t lmsysorg/sglang:nightly-${{ matrix.variant.tag }}-$(date +%Y%m%d)-${GITHUB_SHA:0:8} \
lmsysorg/sglang:${{ matrix.variant.x86_tag }} \
lmsysorg/sglang:${{ matrix.variant.arm64_tag }}
- name: Cleanup Old Nightly Builds
run: |
# Get JWT token for Docker Hub API
TOKEN=$(curl -s -H "Content-Type: application/json" -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' https://hub.docker.com/v2/users/login/ | jq -r .token)
# Get all tags for the repository
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
# Extract tags that match our pattern and sort by last_updated timestamp (most recent first)
TAGS=$(echo "$TAGS_RESPONSE" | jq -r '.results[] | select(.name | startswith("nightly-${{ matrix.variant.tag }}-")) | "\(.last_updated)|\(.name)"' | sort -r | cut -d'|' -f2)
# Count total tags and keep only the 14 most recent
TAG_COUNT=$(echo "$TAGS" | wc -l)
if [ "$TAG_COUNT" -gt 14 ]; then
echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent"
TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15)
echo "Tags to delete: $TAGS_TO_DELETE"
# Delete old tags
for tag in $TAGS_TO_DELETE; do
echo "Deleting tag: $tag"
curl -X DELETE \
-H "Authorization: JWT $TOKEN" \
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/"
done
else
echo "Only $TAG_COUNT nightly builds found, no cleanup needed"
fi


@@ -1,116 +0,0 @@
name: Build PR Development Docker Images
on:
workflow_dispatch:
inputs:
pr_number:
description: 'PR number to build from'
required: true
type: string
pr_branch:
description: 'PR branch name to build from (e.g., my-feature-branch or refs/pull/123/head)'
required: true
type: string
concurrency:
group: release-docker-dev-pr-${{ github.event.inputs.pr_number }}
cancel-in-progress: true
jobs:
build-dev:
if: ${{ github.repository == 'sgl-project/sglang' }}
environment: "prod"
runs-on: ${{ matrix.runner }}
strategy:
matrix:
include:
- runner: x64-docker-build-node
platform: linux/amd64
build_type: all
grace_blackwell: 0
arch_tag: x86
version: 12.9.1
- runner: arm-docker-build-node
platform: linux/arm64
build_type: all
grace_blackwell: 1
arch_tag: arm64
version: 12.9.1
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_branch }}
- name: Free disk space
uses: jlumbroso/free-disk-space@main
with:
tool-cache: true
docker-images: true
android: true
dotnet: true
haskell: true
large-packages: true
swap-storage: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build and Push Dev Image
run: |
tag=dev-${{ matrix.arch_tag }}-pr-${{ inputs.pr_number }}
docker buildx build \
--platform ${{ matrix.platform }} \
--push \
-f docker/Dockerfile \
--target framework \
--build-arg CUDA_VERSION=${{ matrix.version }} \
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
--build-arg BRANCH_TYPE=local \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
-t lmsysorg/sglang:${tag} \
--no-cache \
.
create-manifests:
runs-on: ubuntu-22.04
needs: [build-dev]
if: ${{ github.repository == 'sgl-project/sglang' }}
environment: "prod"
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to Docker Hub
uses: docker/login-action@v2
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Create multi-arch manifest
run: |
# Create PR dev manifest
docker buildx imagetools create \
-t lmsysorg/sglang:dev-pr-${{ inputs.pr_number }} \
lmsysorg/sglang:dev-x86-pr-${{ inputs.pr_number }} \
lmsysorg/sglang:dev-arm64-pr-${{ inputs.pr_number }}
echo "✓ Built Docker image: lmsysorg/sglang:dev-pr-${{ inputs.pr_number }}"
echo ""
echo "Usage:"
echo " docker pull lmsysorg/sglang:dev-pr-${{ inputs.pr_number }}"


@@ -2,9 +2,22 @@ name: Build and Push Development Docker Images
on:
workflow_dispatch:
inputs:
pr_number:
description: "PR number to build from (leave empty to use current branch)"
required: false
default: ""
tag:
description: "Custom tag suffix (overrides pr_number in tag). E.g. 'my-test' → dev-x86-my-test, dev-cu13-my-test, etc."
required: false
default: ""
schedule:
- cron: "0 0 * * *"
concurrency:
group: release-docker-dev-${{ inputs.tag || inputs.pr_number || 'nightly' }}
cancel-in-progress: true
jobs:
build-dev:
if: ${{ github.repository == 'sgl-project/sglang' }}
@@ -16,20 +29,34 @@ jobs:
platform: linux/amd64
build_type: all
grace_blackwell: 0
tag: dev-x86
arch_tag: x86
version: 12.9.1
- runner: arm-docker-build-node
platform: linux/arm64
build_type: all
grace_blackwell: 1
tag: dev-arm64
arch_tag: arm64
version: 12.9.1
- runner: x64-docker-build-node
platform: linux/amd64
build_type: all
grace_blackwell: 0
arch_tag: x86-cu13
version: 13.0.1
- runner: arm-docker-build-node
platform: linux/arm64
build_type: all
grace_blackwell: 1
arch_tag: arm64-cu13
version: 13.0.1
steps:
- name: Delete huge unnecessary tools folder
run: rm -rf /opt/hostedtoolcache
- name: Checkout repository
uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || github.ref }}
- name: Free disk space
uses: jlumbroso/free-disk-space@main
@@ -42,6 +69,12 @@ jobs:
large-packages: true
swap-storage: true
- name: Prune Docker to reclaim disk space
run: |
docker buildx prune --filter "until=72h" -f
docker system prune -af --filter "until=72h"
docker volume prune -af
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
@@ -53,18 +86,37 @@ jobs:
- name: Build and Push Dev Image
run: |
# Tag suffix: custom tag > pr number > none
SUFFIX=""
if [ -n "${{ inputs.tag }}" ]; then
SUFFIX="-${{ inputs.tag }}"
elif [ -n "${{ inputs.pr_number }}" ]; then
SUFFIX="-pr-${{ inputs.pr_number }}"
fi
TAG="dev-${{ matrix.arch_tag }}${SUFFIX}"
# Nightly (schedule) installs latest release; manual dispatch builds from checked-out source
if [ "${{ github.event_name }}" = "schedule" ]; then
SOURCE_ARG="--build-arg USE_LATEST_SGLANG=1"
else
SOURCE_ARG="--build-arg BRANCH_TYPE=local"
fi
echo "Building lmsysorg/sglang:${TAG}"
docker buildx build \
--platform ${{ matrix.platform }} \
--push \
-f docker/Dockerfile \
--target framework \
-f docker/Dockerfile \
--build-arg CUDA_VERSION=${{ matrix.version }} \
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
--build-arg USE_LATEST_SGLANG=1 \
${SOURCE_ARG} \
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
-t lmsysorg/sglang:${{ matrix.tag }} \
-t lmsysorg/sglang:${TAG} \
--no-cache \
.
@@ -75,9 +127,12 @@ jobs:
strategy:
matrix:
variant:
- tag: dev
x86_tag: dev-x86
arm64_tag: dev-arm64
- base: dev
x86: x86
arm64: arm64
- base: dev-cu13
x86: x86-cu13
arm64: arm64-cu13
steps:
- uses: docker/setup-buildx-action@v3
@@ -85,37 +140,56 @@ jobs:
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- run: |
SHORT_SHA="${{ github.sha }}"
- name: Create multi-arch manifest
run: |
SUFFIX=""
if [ -n "${{ inputs.tag }}" ]; then
SUFFIX="-${{ inputs.tag }}"
elif [ -n "${{ inputs.pr_number }}" ]; then
SUFFIX="-pr-${{ inputs.pr_number }}"
fi
TAG="${{ matrix.variant.base }}${SUFFIX}"
X86_TAG="dev-${{ matrix.variant.x86 }}${SUFFIX}"
ARM64_TAG="dev-${{ matrix.variant.arm64 }}${SUFFIX}"
# For nightly (no suffix), also stamp a dated tag
EXTRA_TAG=""
if [ -z "${SUFFIX}" ]; then
SHORT_SHA="${{ github.sha }}"
EXTRA_TAG="-t lmsysorg/sglang:nightly-${TAG}-$(date +%Y%m%d)-${SHORT_SHA:0:8}"
fi
docker buildx imagetools create \
-t lmsysorg/sglang:${{ matrix.variant.tag }} \
-t lmsysorg/sglang:nightly-${{ matrix.variant.tag }}-$(date +%Y%m%d)-${SHORT_SHA:0:8} \
lmsysorg/sglang:${{ matrix.variant.x86_tag }} \
lmsysorg/sglang:${{ matrix.variant.arm64_tag }}
-t lmsysorg/sglang:${TAG} \
${EXTRA_TAG} \
lmsysorg/sglang:${X86_TAG} \
lmsysorg/sglang:${ARM64_TAG}
echo "✓ Published lmsysorg/sglang:${TAG}"
- name: Cleanup Old Nightly Builds
if: ${{ !inputs.tag && !inputs.pr_number }}
run: |
# Get JWT token for Docker Hub API
TOKEN=$(curl -s -H "Content-Type: application/json" -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' https://hub.docker.com/v2/users/login/ | jq -r .token)
TOKEN=$(curl -s -H "Content-Type: application/json" \
-X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' \
https://hub.docker.com/v2/users/login/ | jq -r .token)
# Get all tags for the repository
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" \
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
# Extract tags that match our pattern and sort by last_updated timestamp (most recent first)
TAGS=$(echo "$TAGS_RESPONSE" | jq -r '.results[] | select(.name | startswith("nightly-${{ matrix.variant.tag }}-")) | "\(.last_updated)|\(.name)"' | sort -r | cut -d'|' -f2)
TAGS=$(echo "$TAGS_RESPONSE" | jq -r \
'.results[] | select(.name | test("^nightly-${{ matrix.variant.base }}-[0-9]")) | "\(.last_updated)|\(.name)"' \
| sort -r | cut -d'|' -f2)
# Count total tags and keep only the 14 most recent
TAG_COUNT=$(echo "$TAGS" | wc -l)
if [ "$TAG_COUNT" -gt 14 ]; then
echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent"
TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15)
echo "Tags to delete: $TAGS_TO_DELETE"
# Delete old tags
for tag in $TAGS_TO_DELETE; do
echo "Deleting tag: $tag"
curl -X DELETE \
-H "Authorization: JWT $TOKEN" \
curl -X DELETE -H "Authorization: JWT $TOKEN" \
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/"
done
else
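
Two pieces of the merged dev-image workflow above are easy to misread in shell: the tag suffix precedence (custom tag, then PR number, then none) and the selection of nightly tags to delete. The sketch below restates both in Python under stated assumptions: the function names are hypothetical, and `results` is assumed to be a list of dicts with `name` and `last_updated` keys, as the `jq` filter in the diff implies.

```python
import re


def tag_suffix(tag="", pr_number=""):
    """Custom tag wins over PR number; nightly builds get no suffix."""
    if tag:
        return f"-{tag}"
    if pr_number:
        return f"-pr-{pr_number}"
    return ""


def dev_image_tag(arch_tag, tag="", pr_number=""):
    """Per-architecture tag, e.g. dev-x86-cu13-pr-123."""
    return f"dev-{arch_tag}{tag_suffix(tag, pr_number)}"


def nightly_tags_to_delete(results, base, keep=14):
    """Return nightly tag names beyond the `keep` most recently updated."""
    # Requiring a digit after the base keeps base "dev" from matching
    # "nightly-dev-cu13-..." tags, which a plain prefix test would match.
    pattern = re.compile(rf"^nightly-{re.escape(base)}-[0-9]")
    matching = [
        (entry["last_updated"], entry["name"])
        for entry in results
        if pattern.search(entry["name"])
    ]
    matching.sort(reverse=True)  # most recent first, like `sort -r`
    return [name for _, name in matching[keep:]]


if __name__ == "__main__":
    print(dev_image_tag("x86", tag="my-test"))
```

The regex change matters because both the `dev` and `dev-cu13` variants now run the same cleanup step against one shared tag listing.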


@@ -1,5 +1,11 @@
name: Release Docker Images Nightly (NPU)
on:
pull_request:
branches:
- 'main'
paths:
- '.github/workflows/release-docker-npu-nightly.yml'
- 'docker/npu.Dockerfile'
workflow_dispatch:
schedule:
- cron: "0 0 * * *"
@@ -74,6 +80,6 @@ jobs:
push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
provenance: false
build-args: |
SGLANG_KERNEL_NPU_TAG=2026.01.28
SGLANG_KERNEL_NPU_TAG=2026.02.01.post2
CANN_VERSION=${{ matrix.cann_version }}
DEVICE_TYPE=${{ matrix.device_type }}


@@ -87,7 +87,7 @@ jobs:
push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
provenance: false
build-args: |
SGLANG_KERNEL_NPU_TAG=2026.01.28
SGLANG_KERNEL_NPU_TAG=2026.02.01.post2
CANN_VERSION=${{ matrix.cann_version }}
DEVICE_TYPE=${{ matrix.device_type }}
SGLANG_TAG=${{ steps.version.outputs.version }}


@@ -1,6 +1,8 @@
name: Release Documentation
on:
release:
types: [published]
push:
branches:
- main
@@ -25,6 +27,11 @@ jobs:
- name: Checkout code
uses: actions/checkout@v4
- name: Fetch full git history for release index
if: github.event_name == 'release'
run: |
git fetch --prune --unshallow || git fetch --prune --depth=0
- name: Install dependencies
run: |
bash scripts/ci/cuda/ci_install_dependency.sh
@@ -53,10 +60,23 @@ jobs:
make markdown
python3 wrap_run_llm.py
if [[ "${{ github.event_name }}" == "release" ]]; then
python3 release_lookup/generate_index.py --output release_lookup/release_index.json
# Copy release lookup tool for official docs on published releases.
mkdir -p _build/html/release_lookup
cp release_lookup/index.html _build/html/release_lookup/
cp release_lookup/release_index.json _build/html/release_lookup/
fi
cd _build/html
git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io --depth 1
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
if [[ "${{ github.event_name }}" == "release" ]]; then
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
else
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -path "../sgl-project.github.io/release_lookup*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
fi
cp -r * ../sgl-project.github.io
cp ../../README.md ../sgl-project.github.io/README.md
cd ../sgl-project.github.io


@@ -54,28 +54,26 @@ jobs:
cd python
cp ../README.md ../LICENSE .
# Parse git describe output to detect exact tag builds (distance=0)
# Parse git describe output to get latest tag
# Use same command as pyproject.toml to ensure version consistency
DESC=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1 | xargs git describe --tags --long 2>/dev/null || echo 'v0.0.0-0-g0000000')
DIST=$(echo "$DESC" | cut -d- -f2)
TAG=$(echo "$DESC" | cut -d- -f1)
HASH="g$(git rev-parse --short HEAD)"
BUILD_DATE=$(date -u +%Y%m%d)
# If building at exact tag (distance=0), force dev0 version for unique wheel names
if [ "$DIST" = "0" ]; then
TAG=$(echo "$DESC" | cut -d- -f1)
HASH="g$(git rev-parse --short HEAD)"
# Increment patch version for nightlies (e.g., v0.5.8 -> 0.5.9)
VERSION=${TAG#v} # Remove 'v' prefix
MAJOR=$(echo "$VERSION" | cut -d. -f1)
MINOR=$(echo "$VERSION" | cut -d. -f2)
PATCH=$(echo "$VERSION" | cut -d. -f3)
NEXT_PATCH=$((PATCH + 1))
NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}"
# Increment patch version for nightlies (e.g., v0.5.8 -> 0.5.9.dev0)
VERSION=${TAG#v} # Remove 'v' prefix
MAJOR=$(echo "$VERSION" | cut -d. -f1)
MINOR=$(echo "$VERSION" | cut -d. -f2)
PATCH=$(echo "$VERSION" | cut -d. -f3)
NEXT_PATCH=$((PATCH + 1))
NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}"
FORCE_VERSION="${NEXT_VERSION}.dev0+${HASH}"
echo "Building at exact tag $TAG, forcing nightly version to: $FORCE_VERSION"
export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION"
fi
# Use date-based dev number for correct chronological sorting
# e.g., 0.5.9.dev20260215+g4cf4f0859 > 0.5.9.dev20260214+g45a4697d4
FORCE_VERSION="${NEXT_VERSION}.dev${BUILD_DATE}+${HASH}"
echo "Forcing nightly version to: $FORCE_VERSION"
export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION"
# Build wheel
python3 -m build --wheel
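
The new nightly versioning in this hunk always bumps the patch number of the latest tag and uses the UTC build date as the `.dev` number. A minimal Python sketch of the same string construction is shown here; `nightly_version` is a hypothetical helper, and `short_hash` is assumed to be the output of `git rev-parse --short HEAD` without the leading `g`.

```python
def nightly_version(latest_tag, build_date, short_hash):
    """Build a nightly version string such as 0.5.9.dev20260215+g4cf4f0859."""
    # Drop the leading "v" (the shell uses ${TAG#v}).
    version = latest_tag[1:] if latest_tag.startswith("v") else latest_tag
    major, minor, patch = version.split(".")[:3]
    next_version = f"{major}.{minor}.{int(patch) + 1}"
    # A date-based dev number sorts chronologically, unlike a constant dev0.
    return f"{next_version}.dev{build_date}+g{short_hash}"


if __name__ == "__main__":
    print(nightly_version("v0.5.8", "20260215", "4cf4f0859"))
```

Because the dev number is the build date, a wheel built on 20260215 compares newer than one built on 20260214 regardless of the commit hash in the local version segment.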


@@ -4,11 +4,7 @@ on:
workflow_dispatch:
inputs:
pr_number:
description: 'PR number to build wheel for'
required: true
type: string
pr_branch:
description: 'PR branch name to build from (e.g., my-feature-branch or refs/pull/123/head)'
description: 'PR number to build wheel for (works with both internal and fork PRs)'
required: true
type: string
@@ -27,7 +23,7 @@ jobs:
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_branch }}
ref: refs/pull/${{ inputs.pr_number }}/head
fetch-depth: 0 # Need full history for version generation
- name: Set up Python
@@ -38,13 +34,14 @@ jobs:
- name: Generate PR wheel version
id: gen_version
run: |
# Get base version from setuptools_scm
cd python
pip install setuptools-scm
FULL_VERSION=$(python -c "from setuptools_scm import get_version; print(get_version(root='..'))")
# Strip any existing .dev or + suffix to get clean base version
BASE_VERSION=$(echo "$FULL_VERSION" | sed 's/\.dev.*//;s/+.*//')
cd ..
# Get base version from the latest v*.*.* git tag directly
# Note: We cannot use setuptools_scm here because the [tool.setuptools_scm]
# config (with custom git_describe_command) lives in python/pyproject.toml,
# not at the repo root. Without that config, setuptools_scm falls back to
# default git describe which finds gateway-* tags instead of v*.*.* release tags.
LATEST_TAG=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1)
BASE_VERSION=${LATEST_TAG#v}
echo "Latest release tag: ${LATEST_TAG}"
# Get commit info
COMMIT_HASH=$(git rev-parse --short HEAD)
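
As a rough illustration of why the tag list is filtered, the sketch below picks the highest release tag from a mixed list that also contains `gateway-*` tags. It is a simplified stand-in for `git tag --list --sort=-version:refname 'v*.*.*' | head -1`; the helper name is hypothetical, and the regex is stricter than the git glob (it assumes purely numeric `vMAJOR.MINOR.PATCH` tags).

```python
import re

# Assumption: release tags look exactly like v<major>.<minor>.<patch>.
RELEASE_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def latest_release_tag(tags):
    """Return the highest vMAJOR.MINOR.PATCH tag, ignoring any other tags."""
    releases = []
    for tag in tags:
        match = RELEASE_TAG.match(tag)
        if match:
            # Compare numerically so that v0.5.10 sorts above v0.5.8.
            releases.append((tuple(int(part) for part in match.groups()), tag))
    if not releases:
        return None
    return max(releases)[1]


tags = ["gateway-v0.3.0", "v0.5.8", "v0.5.10", "v0.4.9"]
base_version = latest_release_tag(tags).removeprefix("v")
print(base_version)
```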

View File

@@ -11,6 +11,10 @@ on:
tag_name:
type: string
required: false
pr_number:
description: "PR number to build from (e.g. 12345)"
type: string
required: false
concurrency:
group: release-sglang-kernels-${{ github.ref }}
@@ -34,6 +38,7 @@ jobs:
- uses: actions/checkout@v4
with:
submodules: "recursive"
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
@@ -46,7 +51,8 @@ jobs:
chmod +x ./build.sh
./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
env:
USE_CCACHE: 0
BUILD_JOBS: 64
NVCC_THREADS: 8
- name: Upload to PyPI
working-directory: sgl-kernel
@@ -65,6 +71,8 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4
@@ -127,6 +135,7 @@ jobs:
- uses: actions/checkout@v4
with:
submodules: "recursive"
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
@@ -139,7 +148,8 @@ jobs:
chmod +x ./build.sh
./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
env:
USE_CCACHE: 0
BUILD_JOBS: 64
NVCC_THREADS: 8
- name: Upload artifacts
uses: actions/upload-artifact@v4
@@ -152,6 +162,8 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4
@@ -207,6 +219,7 @@ jobs:
- uses: actions/checkout@v4
with:
submodules: "recursive"
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
@@ -231,6 +244,8 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
- name: Download artifacts
uses: actions/download-artifact@v4

30
.github/workflows/retag-docker.yml vendored Normal file
View File

@@ -0,0 +1,30 @@
name: Retag Docker Image

on:
  workflow_dispatch:
    inputs:
      source_tag:
        description: "Existing image tag (e.g., v0.4.7-cu129-amd64)"
        required: true
      target_tag:
        description: "New tag to apply (e.g., latest)"
        required: true

jobs:
  retag:
    if: github.repository == 'sgl-project/sglang'
    runs-on: ubuntu-22.04
    environment: "prod"
    steps:
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Retag image
        run: |
          echo "Retagging lmsysorg/sglang:${{ inputs.source_tag }} -> lmsysorg/sglang:${{ inputs.target_tag }}"
          docker buildx imagetools create \
            -t lmsysorg/sglang:${{ inputs.target_tag }} \
            lmsysorg/sglang:${{ inputs.source_tag }}

5
.gitignore vendored
View File

@@ -245,7 +245,6 @@ sgl-model-gateway/tests/fixtures/golden/
lmms-eval
**/.claude/
**/.serena/
ctags/
outputs/
@@ -262,10 +261,6 @@ outputs/
# setuptools-scm generated version file
python/sglang/_version.py
# Generated protobuf files (regenerate during wheel build or with compile_proto.py)
python/sglang/srt/grpc/*_pb2.py
python/sglang/srt/grpc/*_pb2_grpc.py
python/sglang/srt/grpc/*_pb2.pyi
# MUSA section
# Generated source files by torchada

View File

@@ -3,7 +3,7 @@ exclude: ^(python/sglang/multimodal_gen/csrc|python/sglang/jit_kernel/flash_atte
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
rev: v6.0.0
hooks:
- id: check-symlinks
- id: destroyed-symlinks
@@ -21,12 +21,12 @@ repos:
- id: debug-statements
- id: no-commit-to-branch
- repo: https://github.com/PyCQA/isort
rev: 5.13.2
rev: 7.0.0
hooks:
- id: isort
exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.11.7
rev: v0.15.1
hooks:
- id: ruff
args:
@@ -43,7 +43,7 @@ repos:
python/sglang/srt/grpc/.*_pb2_grpc\.pyi$|
)$
- repo: https://github.com/psf/black
rev: 24.10.0
rev: 26.1.0
hooks:
- id: black-jupyter
exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
@@ -53,13 +53,13 @@ repos:
- id: codespell
args: ['--config', '.codespellrc']
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v18.1.8
rev: v20.1.7
hooks:
- id: clang-format
types_or: [c++, cuda]
args: [--style=file, --verbose]
- repo: https://github.com/kynan/nbstripout
rev: 0.8.1
rev: 0.9.0
hooks:
- id: nbstripout
args:

View File

@@ -187,10 +187,8 @@ def run_grid(bs, model, method, tp_size, dtype: str):
configs = union_of_list_of_dicts(prune_configs_1, prune_configs_2)
print(
f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \
{len(prune_configs_2)=} | {len(configs)=}"
)
print(f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \
{len(prune_configs_2)=} | {len(configs)=}")
best_config = None
best_time_us = 1e20

166
benchmark/asr/README.md Normal file
View File

@@ -0,0 +1,166 @@
# ASR Benchmark
This benchmark evaluates the performance and accuracy (Word Error Rate - WER) of Automatic Speech Recognition (ASR) models served via SGLang.
## Supported Models
- `openai/whisper-large-v3`
- `openai/whisper-large-v3-turbo`
## Setup
Install the required dependencies:
```bash
apt install ffmpeg
pip install librosa soundfile datasets evaluate jiwer transformers openai torchcodec torch
```
## Running the Benchmark
### 1. Start SGLang Server
Launch the SGLang server with a Whisper model:
```bash
python -m sglang.launch_server --model-path openai/whisper-large-v3 --port 30000
```
### 2. Run the Benchmark Script
Basic usage (using chat completions API):
```bash
python bench_sglang.py --base-url http://localhost:30000 --model openai/whisper-large-v3 --n-examples 10
```
Using the OpenAI-compatible transcription API:
```bash
python bench_sglang.py \
--base-url http://localhost:30000 \
--model openai/whisper-large-v3 \
--api-type transcription \
--language English \
--n-examples 10
```
Run with streaming and show real-time output:
```bash
python bench_sglang.py \
--base-url http://localhost:30000 \
--model openai/whisper-large-v3 \
--api-type transcription \
--stream \
--show-predictions \
--concurrency 1
```
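
In streaming mode the benchmark reads the response as server-sent events. The sketch below shows roughly how such lines can be reassembled into the final text. It assumes each `data:` line carries a JSON chunk with `choices[0].delta.content`, which is the shape `bench_sglang.py` handles; the helper name `collect_stream_text` is hypothetical.

```python
import json


def collect_stream_text(lines):
    """Join the content deltas found in an iterable of SSE lines."""
    pieces = []
    for line in lines:
        # Only "data: ..." lines carry payloads; "[DONE]" marks the end.
        if not line.startswith("data: ") or line.startswith("data: [DONE]"):
            continue
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip malformed chunks
        choices = chunk.get("choices") or []
        if choices:
            content = choices[0].get("delta", {}).get("content", "")
            if content:
                pieces.append(content)
    return "".join(pieces)


sample = [
    'data: {"choices": [{"delta": {"content": "hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```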
Run with higher concurrency and save results:
```bash
python bench_sglang.py \
--base-url http://localhost:30000 \
--model openai/whisper-large-v3 \
--concurrency 8 \
--n-examples 100 \
--output results.json \
--show-predictions
```
## Arguments
| Argument | Description | Default |
|----------|-------------|---------|
| `--base-url` | SGLang server URL | `http://localhost:30000` |
| `--model` | Model name on the server | `openai/whisper-large-v3` |
| `--dataset` | HuggingFace dataset for evaluation | `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` |
| `--split` | Dataset split to use | `validation` |
| `--concurrency` | Number of concurrent requests | `4` |
| `--n-examples` | Number of examples to process (`-1` for all) | `-1` |
| `--output` | Path to save results as JSON | `None` |
| `--show-predictions` | Display sample predictions | `False` |
| `--print-n` | Number of samples to display | `5` |
| `--api-type` | API to use: `chat` (chat completions) or `transcription` (audio transcriptions) | `chat` |
| `--language` | Language for transcription API (e.g., `English`, `en`) | `None` |
| `--stream` | Enable streaming mode for transcription API | `False` |
## Metrics
The benchmark outputs:
| Metric | Description |
|--------|-------------|
| **Total Requests** | Number of successful ASR requests processed |
| **WER** | Word Error Rate (lower is better), computed using the `evaluate` library |
| **Average Latency** | Mean time per request (seconds) |
| **Median Latency** | 50th percentile latency (seconds) |
| **95th Latency** | 95th percentile latency (seconds) |
| **Throughput** | Requests processed per second |
| **Token Throughput** | Output tokens per second |
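
To make the WER figure concrete, here is a small pure-Python sketch that computes a corpus-level word error rate with a word-level edit distance. It is a simplified stand-in for illustration only, not the `evaluate` implementation the benchmark calls, and the function name is hypothetical.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER in percent: word edits divided by reference words."""
    total_edits = 0
    total_words = 0
    for ref, pred in zip(references, predictions):
        r, p = ref.split(), pred.split()
        # Dynamic-programming edit distance over words, one row at a time.
        prev = list(range(len(p) + 1))
        for i, ref_word in enumerate(r, start=1):
            curr = [i]
            for j, pred_word in enumerate(p, start=1):
                cost = 0 if ref_word == pred_word else 1
                curr.append(
                    min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
                )
            prev = curr
        total_edits += prev[-1]
        total_words += len(r)
    # Assumes at least one non-empty reference.
    return 100.0 * total_edits / total_words


print(word_error_rate(["a b c d"], ["a x c"]))
```

In the call above, one substitution and one deletion against four reference words give a WER of 50%.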
## Example Output
```bash
python bench_sglang.py --api-type transcription --concurrency 128 --model openai/whisper-large-v3 --show-predictions
Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered...
Using API type: transcription
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Performing warmup...
Processing 511 samples...
------------------------------
Results for openai/whisper-large-v3:
Total Requests: 511
WER: 12.7690
Average Latency: 1.3602s
Median Latency: 1.2090s
95th Latency: 2.9986s
Throughput: 19.02 req/s
Token Throughput: 354.19 tok/s
Total Test Time: 26.8726s
------------------------------
==================== Sample Predictions ====================
Sample 1:
REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement
PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement
----------------------------------------
Sample 2:
REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
----------------------------------------
Sample 3:
REF: we talked about 4.7 gigawatts
PRED: we talked about 4.7 gigawatts
----------------------------------------
Sample 4:
REF: and you know depending on that working capital build we will we will see what that yields
PRED: and depending on that working capital build we will see what that yields what
----------------------------------------
Sample 5:
REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70%
PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70%
----------------------------------------
============================================================
```
## Notes
- Audio samples of 30 seconds or longer are automatically filtered out (Whisper works on at most 30 seconds of audio per input)
- The benchmark performs a warmup request before measuring performance
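- Predictions and references are normalized with the tokenizer's `normalize` method when it is available (otherwise they are lowercased and stripped) before WER is computed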
- When using `--stream` with `--show-predictions`, use `--concurrency 1` for clean sequential output
- The `--language` option accepts both full names (e.g., `English`) and ISO 639-1 codes (e.g., `en`)
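
The duration filter mentioned in the notes can be sketched as follows. The benchmark itself uses `librosa.get_duration`; this stand-in assumes mono audio, where the duration is simply the sample count divided by the sampling rate, and the helper names are hypothetical.

```python
MAX_DURATION_MS = 30_000  # same threshold as the benchmark's filter


def duration_ms(num_samples: int, sampling_rate: int) -> float:
    """Duration of a mono clip in milliseconds."""
    return 1000.0 * num_samples / sampling_rate


def keep_sample(num_samples: int, sampling_rate: int) -> bool:
    """Keep only clips strictly shorter than 30 seconds."""
    return duration_ms(num_samples, sampling_rate) < MAX_DURATION_MS


# A 29-second clip is kept; a clip of exactly 30 seconds is dropped.
print(keep_sample(16000 * 29, 16000), keep_sample(16000 * 30, 16000))
```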
## Troubleshooting
**Server connection refused**
- Ensure the SGLang server is running and accessible at the specified `--base-url`
- Check that the port is not blocked by a firewall
**Out of memory errors**
- Reduce `--concurrency` to lower GPU memory usage
- Use a smaller Whisper model variant

View File

@@ -0,0 +1,404 @@
import argparse
import asyncio
import base64
import io
import json
import time
from statistics import mean, median
import httpx
import librosa
import numpy as np
import soundfile
from datasets import load_dataset
from evaluate import load
from openai import AsyncOpenAI, OpenAI
from transformers import AutoTokenizer
def to_bytes(y, sr):
    buffer = io.BytesIO()
    soundfile.write(buffer, y, sr, format="WAV")
    buffer.seek(0)
    return buffer
async def run_asr_chat(client, model_name, y, sr):
    """Use chat completions API with audio_url for ASR."""
    with to_bytes(y, sr) as f:
        audio_bytes = f.read()
    audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")
    start_time = time.perf_counter()
    response = await client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio_url",
                        "audio_url": {"url": f"data:audio/wav;base64,{audio_base64}"},
                    }
                ],
            }
        ],
        temperature=0.0,
    )
    end_time = time.perf_counter()
    asr_text = response.choices[0].message.content
    latency = end_time - start_time
    return latency, asr_text
def run_asr_transcription_sync(client, model_name, y, sr, language=None):
    """Use audio transcriptions API for ASR (sync version)."""
    audio_buffer = to_bytes(y, sr)
    audio_buffer.name = "audio.wav"  # OpenAI client needs a name attribute
    start_time = time.perf_counter()
    kwargs = {
        "model": model_name,
        "file": audio_buffer,
    }
    if language:
        kwargs["language"] = language
    transcription = client.audio.transcriptions.create(**kwargs)
    end_time = time.perf_counter()
    latency = end_time - start_time
    return latency, transcription.text
def run_asr_transcription_stream_sync(
    base_url, model_name, y, sr, language=None, show_stream=False
):
    """Use audio transcriptions API with streaming for ASR."""
    audio_buffer = to_bytes(y, sr)
    audio_bytes = audio_buffer.read()
    data = {
        "model": model_name,
        "response_format": "json",
        "stream": "true",
    }
    if language:
        data["language"] = language
    start_time = time.perf_counter()
    text_chunks = []
    if show_stream:
        print("[STREAM] ", end="", flush=True)
    with httpx.stream(
        "POST",
        f"{base_url}/v1/audio/transcriptions",
        data=data,
        files={"file": ("audio.wav", audio_bytes, "audio/wav")},
        timeout=60.0,
    ) as response:
        for line in response.iter_lines():
            if line.startswith("data: ") and not line.startswith("data: [DONE]"):
                try:
                    chunk = json.loads(line[6:])
                    if "choices" in chunk and chunk["choices"]:
                        delta = chunk["choices"][0].get("delta", {})
                        content = delta.get("content", "")
                        if content:
                            text_chunks.append(content)
                            if show_stream:
                                print(content, end="", flush=True)
                except json.JSONDecodeError:
                    pass
    if show_stream:
        print()  # newline after stream
    end_time = time.perf_counter()
    latency = end_time - start_time
    return latency, "".join(text_chunks)
async def run_asr_transcription(
    client,
    model_name,
    y,
    sr,
    language=None,
    stream=False,
    base_url=None,
    show_stream=False,
):
    """Async wrapper for transcription API (runs sync call in executor)."""
    # We are inside a coroutine, so the running loop is always available.
    loop = asyncio.get_running_loop()
    if stream:
        return await loop.run_in_executor(
            None,
            run_asr_transcription_stream_sync,
            base_url,
            model_name,
            y,
            sr,
            language,
            show_stream,
        )
    return await loop.run_in_executor(
        None, run_asr_transcription_sync, client, model_name, y, sr, language
    )
async def bound_asr(
    sem,
    client,
    model_name,
    tokenizer,
    audio,
    reference,
    api_type="chat",
    language=None,
    stream=False,
    base_url=None,
    show_stream=False,
):
    async with sem:
        try:
            if api_type == "transcription":
                latency, text = await run_asr_transcription(
                    client,
                    model_name,
                    *audio,
                    language=language,
                    stream=stream,
                    base_url=base_url,
                    show_stream=show_stream,
                )
            else:
                latency, text = await run_asr_chat(client, model_name, *audio)
            # Calculate tokens for throughput metrics
            num_output_tokens = len(tokenizer(text, add_special_tokens=False).input_ids)
            # Normalize for WER evaluation
            # Whisper tokenizer has a normalize method
            if hasattr(tokenizer, "normalize"):
                out = tokenizer.normalize(text)
                ref = tokenizer.normalize(reference)
            else:
                out = text.lower().strip()
                ref = reference.lower().strip()
            return latency, num_output_tokens, out, ref
        except Exception as e:
            print(f"Error during ASR: {e}")
            return None
async def process_dataset(
    model_name,
    client,
    data,
    concurrent_request,
    api_type="chat",
    language=None,
    stream=False,
    base_url=None,
    show_predictions=False,
):
    sem = asyncio.Semaphore(concurrent_request)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Warmup
    print("Performing warmup...")
    audio_warmup, sr_warmup = (
        data[0]["audio"]["array"],
        data[0]["audio"]["sampling_rate"],
    )
    await bound_asr(
        sem,
        client,
        model_name,
        tokenizer,
        (audio_warmup, sr_warmup),
        "",
        api_type=api_type,
        language=language,
        stream=stream,
        base_url=base_url,
        show_stream=False,  # Don't show stream during warmup
    )
    tasks = []
    print(f"Processing {len(data)} samples...")
    for sample in data:
        audio, sr = sample["audio"]["array"], sample["audio"]["sampling_rate"]
        tasks.append(
            asyncio.create_task(
                bound_asr(
                    sem,
                    client,
                    model_name,
                    tokenizer,
                    (audio, sr),
                    sample["text"],
                    api_type=api_type,
                    language=language,
                    stream=stream,
                    base_url=base_url,
                    show_stream=show_predictions and stream,
                )
            )
        )
    results = await asyncio.gather(*tasks)
    return [r for r in results if r is not None]
def run_evaluation(args):
    # Use sync client for transcription API, async for chat API
    if args.api_type == "transcription":
        client = OpenAI(base_url=f"{args.base_url}/v1", api_key="None")
    else:
        client = AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="None")
    print(f"Loading dataset: {args.dataset}...")
    print(f"Using API type: {args.api_type}" + (" (streaming)" if args.stream else ""))
    dataset = load_dataset(args.dataset, split=args.split)

    # Filter by duration if needed (Whisper max is 30s)
    def add_duration(sample):
        y, sr = sample["audio"]["array"], sample["audio"]["sampling_rate"]
        sample["duration_ms"] = librosa.get_duration(y=y, sr=sr) * 1000
        return sample

    if "duration_ms" not in dataset.column_names:
        dataset = dataset.map(add_duration)
    dataset = dataset.filter(lambda x: x["duration_ms"] < 30000)
    if args.n_examples > 0:
        dataset = dataset.select(range(min(args.n_examples, len(dataset))))
    start = time.perf_counter()
    results = asyncio.run(
        process_dataset(
            args.model,
            client,
            dataset,
            args.concurrency,
            api_type=args.api_type,
            language=args.language,
            stream=args.stream,
            base_url=args.base_url,
            show_predictions=args.show_predictions,
        )
    )
    total_test_time = time.perf_counter() - start
    if not results:
        print("No successful results to evaluate.")
        return
    # Metrics
    latencies = [res[0] for res in results]
    total_tokens = sum([res[1] for res in results])
    predictions = [res[2] for res in results]
    references = [res[3] for res in results]
    wer_metric = load("wer")
    wer_score = 100 * wer_metric.compute(references=references, predictions=predictions)
    print("-" * 30)
    print(f"Results for {args.model}:")
    print(f"Total Requests: {len(results)}")
    print(f"WER: {wer_score:.4f}")
    print(f"Average Latency: {mean(latencies):.4f}s")
    print(f"Median Latency: {median(latencies):.4f}s")
    print(f"95th Latency: {np.percentile(latencies, 95):.4f}s")
    print(f"Throughput: {len(results) / total_test_time:.2f} req/s")
    print(f"Token Throughput: {total_tokens / total_test_time:.2f} tok/s")
    print(f"Total Test Time: {total_test_time:.4f}s")
    print("-" * 30)
    if args.output:
        # json is already imported at module level
        with open(args.output, "w") as f:
            json.dump(
                {
                    "model": args.model,
                    "dataset": args.dataset,
                    "wer": wer_score,
                    "avg_latency": mean(latencies),
                    "throughput": len(results) / total_test_time,
                    "token_throughput": total_tokens / total_test_time,
                },
                f,
                indent=2,
            )
    if args.show_predictions:
        print("\n" + "=" * 20 + " Sample Predictions " + "=" * 20)
        num_to_show = min(args.print_n, len(results))
        for i in range(num_to_show):
            print(f"Sample {i+1}:")
            print(f" REF: {references[i]}")
            print(f" PRED: {predictions[i]}")
            print("-" * 40)
        print("=" * 60)
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark SGLang ASR performance.")
    parser.add_argument(
        "--base-url", default="http://localhost:30000", help="SGLang server base URL"
    )
    parser.add_argument(
        "--model", default="openai/whisper-large-v3", help="Model name on the server"
    )
    parser.add_argument(
        "--dataset",
        default="D4nt3/esb-datasets-earnings22-validation-tiny-filtered",
        help="HF dataset repo",
    )
    parser.add_argument("--split", default="validation", help="Dataset split")
    parser.add_argument(
        "--concurrency", type=int, default=4, help="Number of concurrent requests"
    )
    parser.add_argument(
        "--n-examples",
        "-n",
        type=int,
        default=-1,
        help="Number of examples to test (-1 for all)",
    )
    parser.add_argument("--output", help="Path to save results in JSON")
    parser.add_argument(
        "--show-predictions",
        action="store_true",
        help="Print sample predictions and references",
    )
    parser.add_argument(
        "--print-n", type=int, default=5, help="Number of sample predictions to print"
    )
    parser.add_argument(
        "--api-type",
        choices=["chat", "transcription"],
        default="chat",
        help="API type to use: 'chat' for chat completions with audio_url, 'transcription' for audio.transcriptions API",
    )
    parser.add_argument(
        "--language",
        default=None,
        help="Language for transcription API (e.g., 'English' or 'en')",
    )
    parser.add_argument(
        "--stream",
        action="store_true",
        help="Use streaming mode for transcription API",
    )
    args = parser.parse_args()
    run_evaluation(args)

View File

@@ -4,7 +4,7 @@ The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVI
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.io/basic_usage/deepseek.html).
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek V3/V3.1/R1 Model Optimizations in SGLang](https://docs.sglang.io/basic_usage/deepseek_v3.html#optimizations).
## Installation & Launch
@@ -271,7 +271,7 @@ Then we can benchmark the accuracy and latency by accessing the first node's exp
```bash
# bench accuracy
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host 10.0.0.1 --port 30000
# bench latency
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128

View File

@@ -7,7 +7,9 @@ import torch
from sglang.srt.layers.attention.fla.layernorm_gated import (
_layer_norm_fwd as layer_norm_fwd,
)
from sglang.srt.layers.attention.fla.layernorm_gated import rms_norm_ref
from sglang.srt.layers.attention.fla.layernorm_gated import (
rms_norm_ref,
)
def benchmark_layer_norm_fwd(

View File

@@ -48,6 +48,18 @@ def main(args):
# Select backend
set_default_backend(select_sglang_backend(args))
# Load tokenizer if enable_thinking is set
tokenizer = None
if args.enable_thinking:
from transformers import AutoTokenizer
assert (
args.tokenizer_path is not None
), "--tokenizer-path is required when --enable-thinking is set"
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer_path, trust_remote_code=True
)
# Read data
if args.platinum:
print("Loading GSM8K Platinum dataset from HuggingFace...")
@@ -70,7 +82,16 @@ def main(args):
questions = []
labels = []
for i in range(len(lines[:num_questions])):
questions.append(get_one_example(lines, i, False))
raw_question = few_shot_examples + get_one_example(lines, i, False)
if tokenizer is not None:
messages = [{"role": "user", "content": raw_question}]
raw_question = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
questions.append(raw_question)
labels.append(get_answer_value(lines[i]["answer"]))
assert all(l != INVALID for l in labels)
arguments = [{"question": q} for q in questions]
@@ -83,9 +104,11 @@ def main(args):
@sgl.function
def few_shot_gsm8k(s, question):
s += few_shot_examples + question
s += question
s += sgl.gen(
"answer", max_tokens=512, stop=["Question", "Assistant:", "<|separator|>"]
"answer",
max_tokens=args.max_new_tokens,
stop=["Question", "Assistant:", "<|separator|>"],
)
#####################################
@@ -96,7 +119,8 @@ def main(args):
tic = time.perf_counter()
states = few_shot_gsm8k.run_batch(
arguments,
temperature=0,
temperature=args.temperature,
top_p=args.top_p,
num_threads=args.parallel,
progress_bar=True,
)
@@ -152,6 +176,20 @@ if __name__ == "__main__":
parser.add_argument("--num-shots", type=int, default=5)
parser.add_argument("--data-path", type=str, default="test.jsonl")
parser.add_argument("--num-questions", type=int, default=200)
parser.add_argument("--max-new-tokens", type=int, default=512)
parser.add_argument("--temperature", type=float, default=0.0)
parser.add_argument("--top-p", type=float, default=1.0)
parser.add_argument(
"--enable-thinking",
action="store_true",
help="Enable thinking mode by wrapping prompts with chat template",
)
parser.add_argument(
"--tokenizer-path",
type=str,
default=None,
help="Path to tokenizer (required when --enable-thinking is set)",
)
parser.add_argument(
"--platinum",
action="store_true",

View File

@@ -12,7 +12,7 @@ from bench_multiturn import (
)
from tqdm.asyncio import tqdm
from sglang.bench_serving import get_tokenizer
from sglang.benchmark.utils import get_tokenizer
class ContextWorkloadGenerator(WorkloadGenerator):

View File

@@ -12,12 +12,9 @@ from functools import wraps
import aiohttp
from sglang.bench_serving import (
RequestFuncOutput,
get_tokenizer,
remove_prefix,
sample_random_requests,
)
from sglang.bench_serving import RequestFuncOutput
from sglang.benchmark.datasets.random import sample_random_requests
from sglang.benchmark.utils import get_tokenizer, remove_prefix
# Set up logger
logger = logging.getLogger(__name__)

View File

@@ -11,7 +11,8 @@ import numpy as np
import requests
from tqdm.asyncio import tqdm
from sglang.bench_serving import get_tokenizer, sample_random_requests
from sglang.benchmark.datasets.random import sample_random_requests
from sglang.benchmark.utils import get_tokenizer
from sglang.test.kits.cache_hit_kit import async_request_sglang_generate, gen_payload

View File

@@ -32,7 +32,7 @@ from data_processing import MsgContent, SampleOutput, get_dataset
from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase
from sglang.bench_serving import get_tokenizer, remove_prefix, set_ulimit
from sglang.benchmark.utils import get_tokenizer, remove_prefix, set_ulimit
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=20 * 60 * 60)

View File

@@ -11,13 +11,13 @@ from nextqa import NExTQALoader
from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase
from sglang.bench_serving import (
from sglang.benchmark.datasets.common import (
SHAREGPT_FILENAME,
SHAREGPT_REPO_ID,
download_and_cache_hf_file,
gen_prompt,
get_gen_prefix_cache_path,
)
from sglang.benchmark.datasets.generated_shared_prefix import get_gen_prefix_cache_path
from sglang.benchmark.utils import download_and_cache_hf_file
from sglang.lang.chat_template import get_chat_template, get_chat_template_by_model_path
from sglang.srt.entrypoints.openai.protocol import ChatCompletionMessageContentPart
from sglang.utils import encode_video_base64
@@ -442,7 +442,15 @@ def sample_generated_shared_prefix_requests(
disable_shuffle: bool = False,
) -> SampleOutput:
"""Generate benchmark requests with shared system prompts using random tokens and caching."""
cache_path = get_gen_prefix_cache_path(args, tokenizer)
cache_path = get_gen_prefix_cache_path(
args.seed,
num_groups,
prompts_per_group,
system_prompt_len,
question_len,
output_len,
tokenizer,
)
# Try to load from cache first
if cache_path.exists():

View File

@@ -0,0 +1,536 @@
"""
Benchmark fused allreduce+rmsnorm on AMD with correctness checks.
This script targets the same fused op used by SGLang:
`tensor_model_parallel_fused_allreduce_rmsnorm`.
It reports:
- eager mode latency (prefill-like)
- graph mode latency (decode-like)
- fused availability (whether fused path returns non-None)
- correctness (fused output matches split allreduce + rmsnorm reference)
Usage example:
torchrun --nproc_per_node=8 \
benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py \
--dtype bfloat16 \
--prefill-shapes 2048x8192,8192x8192 \
--decode-shapes 1x8192,4x8192,16x8192 \
--warmup 10 --iters 30 --repeats 5
"""
import argparse
import csv
import os
import statistics
from typing import Dict, List, Optional, Sequence, Tuple
import torch
import torch.distributed as dist
import torch.nn.functional as F
from sglang.srt.distributed.communication_op import (
    tensor_model_parallel_all_reduce,
    tensor_model_parallel_fused_allreduce_rmsnorm,
)
from sglang.srt.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
    graph_capture,
    init_distributed_environment,
    initialize_model_parallel,
    set_custom_all_reduce,
)

Shape = Tuple[int, int]
def parse_shapes(raw: str) -> List[Shape]:
    shapes: List[Shape] = []
    for item in [x.strip() for x in raw.split(",") if x.strip()]:
        if "x" not in item:
            raise ValueError(f"Invalid shape '{item}', expected MxN format.")
        m_str, n_str = item.split("x", 1)
        m = int(m_str)
        n = int(n_str)
        if m <= 0 or n <= 0:
            raise ValueError(f"Invalid shape '{item}', both dims must be positive.")
        shapes.append((m, n))
    if not shapes:
        raise ValueError("Empty shape list is not allowed.")
    return shapes
def dtype_from_name(name: str) -> torch.dtype:
    mapping = {
        "float16": torch.float16,
        "fp16": torch.float16,
        "bfloat16": torch.bfloat16,
        "bf16": torch.bfloat16,
    }
    if name not in mapping:
        raise ValueError(f"Unsupported dtype: {name}")
    return mapping[name]
def check_close(
    a: torch.Tensor, b: torch.Tensor, dtype: torch.dtype
) -> Tuple[bool, str]:
    if dtype == torch.bfloat16:
        rtol, atol = 2e-2, 1.25e-1
    else:
        rtol, atol = 1e-2, 2e-2
    try:
        torch.testing.assert_close(a, b, rtol=rtol, atol=atol)
        return True, "PASS"
    except AssertionError:
        max_diff = torch.max(torch.abs(a - b)).item()
        mean_diff = torch.mean(torch.abs(a - b)).item()
        return False, f"FAIL(max={max_diff:.6f},mean={mean_diff:.6f})"
def _measure_us(
    fn,
    warmup: int,
    iters: int,
    repeats: int,
    device: torch.device,
) -> Tuple[float, Dict[str, float]]:
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    samples_us: List[float] = []
    for _ in range(max(1, repeats)):
        _barrier(device)
        torch.cuda.synchronize()
        start_event.record()
        for _ in range(iters):
            fn()
        end_event.record()
        end_event.synchronize()
        samples_us.append(start_event.elapsed_time(end_event) * 1000.0 / iters)
    sorted_samples = sorted(samples_us)
    p50 = float(statistics.median(sorted_samples))
    p95 = float(sorted_samples[int((len(sorted_samples) - 1) * 0.95)])
    return p50, {
        "p50_us": p50,
        "p95_us": p95,
        "min_us": float(sorted_samples[0]),
        "max_us": float(sorted_samples[-1]),
    }
def _barrier(device: torch.device):
    try:
        dist.barrier(device_ids=[device.index])
    except TypeError:
        dist.barrier()
def _mean_across_ranks(value: float, device: torch.device) -> float:
    t = torch.tensor([value], dtype=torch.float64, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    t /= dist.get_world_size()
    return float(t.item())
def _all_true_across_ranks(value: bool, device: torch.device) -> bool:
    t = torch.tensor([1 if value else 0], dtype=torch.int32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return bool(int(t.item()))
def _make_inputs(
    shape: Shape,
    dtype: torch.dtype,
    seed: int,
    residual_mode: str,
    rank: int,
    device: torch.device,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    m, n = shape
    torch.manual_seed(seed + rank * 17)
    x = torch.randn((m, n), dtype=torch.float32, device=device).to(dtype)
    if residual_mode == "self":
        residual = x.clone()
    elif residual_mode == "random":
        residual = torch.randn((m, n), dtype=torch.float32, device=device).to(dtype)
    elif residual_mode == "zero":
        residual = torch.zeros((m, n), dtype=dtype, device=device)
    else:
        raise ValueError(f"Unknown residual_mode: {residual_mode}")
    weight = torch.randn((n,), dtype=torch.float32, device=device).to(dtype)
    return x, residual, weight


def _split_reference(
    x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor, eps: float
) -> Tuple[torch.Tensor, torch.Tensor]:
    ar_out = tensor_model_parallel_all_reduce(x.clone())
    residual_out = ar_out + residual
    out = F.rms_norm(
        input=residual_out,
        normalized_shape=(residual_out.shape[-1],),
        weight=weight,
        eps=eps,
    )
    return out, residual_out


def bench_eager(
    x: torch.Tensor,
    residual: torch.Tensor,
    weight: torch.Tensor,
    eps: float,
    warmup: int,
    iters: int,
    repeats: int,
) -> Dict[str, object]:
    split_fn = lambda: _split_reference(x, residual, weight, eps)
    split_us, split_stats = _measure_us(split_fn, warmup, iters, repeats, x.device)
    fused_probe = tensor_model_parallel_fused_allreduce_rmsnorm(
        x.clone(), residual.clone(), weight, eps
    )
    fused_available = fused_probe is not None
    fused_us: Optional[float] = None
    fused_stats: Optional[Dict[str, float]] = None
    if fused_available:
        fused_fn = lambda: tensor_model_parallel_fused_allreduce_rmsnorm(
            x, residual, weight, eps
        )
        fused_us, fused_stats = _measure_us(fused_fn, warmup, iters, repeats, x.device)
    ref_out, ref_residual = _split_reference(x, residual, weight, eps)
    if fused_available:
        fused_out, fused_residual = tensor_model_parallel_fused_allreduce_rmsnorm(
            x.clone(), residual.clone(), weight, eps
        )
        out_ok, out_detail = check_close(fused_out, ref_out, x.dtype)
        res_ok, res_detail = check_close(fused_residual, ref_residual, x.dtype)
        correctness_ok = out_ok and res_ok
        correctness_detail = f"out={out_detail}, residual={res_detail}"
    else:
        correctness_ok = True
        correctness_detail = "SKIP(fused_unavailable)"
    return {
        "split_us": split_us,
        "split_stats": split_stats,
        "fused_available": fused_available,
        "fused_us": fused_us,
        "fused_stats": fused_stats,
        "correctness_ok": correctness_ok,
        "correctness_detail": correctness_detail,
    }


def bench_graph(
    x: torch.Tensor,
    residual: torch.Tensor,
    weight: torch.Tensor,
    eps: float,
    warmup: int,
    iters: int,
    repeats: int,
) -> Dict[str, object]:
    split_x = x.clone()
    split_res = residual.clone()
    split_graph_out: Optional[torch.Tensor] = None
    with graph_capture() as gc:
        split_graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(split_graph, stream=gc.stream):
            split_graph_out, _ = _split_reference(split_x, split_res, weight, eps)

    def split_replay():
        split_graph.replay()

    split_us, split_stats = _measure_us(split_replay, warmup, iters, repeats, x.device)
    fused_probe = tensor_model_parallel_fused_allreduce_rmsnorm(
        x.clone(), residual.clone(), weight, eps
    )
    fused_available = fused_probe is not None
    fused_us: Optional[float] = None
    fused_stats: Optional[Dict[str, float]] = None
    fused_graph_out: Optional[torch.Tensor] = None
    fused_graph_residual: Optional[torch.Tensor] = None
    if fused_available:
        fused_x = x.clone()
        fused_res = residual.clone()
        with graph_capture() as gc:
            fused_graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(fused_graph, stream=gc.stream):
                fused_graph_out, fused_graph_residual = (
                    tensor_model_parallel_fused_allreduce_rmsnorm(
                        fused_x, fused_res, weight, eps
                    )
                )

        def fused_replay():
            fused_graph.replay()

        fused_us, fused_stats = _measure_us(
            fused_replay, warmup, iters, repeats, x.device
        )
    ref_out, ref_residual = _split_reference(x, residual, weight, eps)
    if (
        fused_available
        and fused_graph_out is not None
        and fused_graph_residual is not None
    ):
        fused_graph.replay()
        torch.cuda.synchronize()
        out_ok, out_detail = check_close(fused_graph_out, ref_out, x.dtype)
        res_ok, res_detail = check_close(fused_graph_residual, ref_residual, x.dtype)
        correctness_ok = out_ok and res_ok
        correctness_detail = f"out={out_detail}, residual={res_detail}"
    else:
        correctness_ok = True
        correctness_detail = "SKIP(fused_unavailable)"
    return {
        "split_us": split_us,
        "split_stats": split_stats,
        "fused_available": fused_available,
        "fused_us": fused_us,
        "fused_stats": fused_stats,
        "correctness_ok": correctness_ok,
        "correctness_detail": correctness_detail,
    }


def _shape_bytes(shape: Shape, dtype: torch.dtype) -> int:
    m, n = shape
    return m * n * torch.tensor([], dtype=dtype).element_size()


def parse_args():
    parser = argparse.ArgumentParser(
        description="Benchmark fused allreduce+rmsnorm (prefill eager + decode graph)."
    )
    parser.add_argument(
        "--dtype",
        type=str,
        default="bf16",
        choices=["fp16", "bf16", "float16", "bfloat16"],
    )
    parser.add_argument("--eps", type=float, default=1e-6)
    parser.add_argument("--seed", type=int, default=1234)
    parser.add_argument(
        "--residual-mode",
        type=str,
        default="self",
        choices=["self", "random", "zero"],
        help="Use residual=x (self) to match aiter test behavior by default.",
    )
    parser.add_argument(
        "--prefill-shapes",
        type=str,
        default="2048x8192,8192x8192,16384x8192",
        help="Comma-separated MxN shapes for eager mode.",
    )
    parser.add_argument(
        "--decode-shapes",
        type=str,
        default="1x8192,2x8192,4x8192,8x8192,16x8192",
        help="Comma-separated MxN shapes for graph mode.",
    )
    parser.add_argument("--warmup", type=int, default=10)
    parser.add_argument("--iters", type=int, default=30)
    parser.add_argument("--repeats", type=int, default=5)
    parser.add_argument(
        "--mode",
        type=str,
        default="both",
        choices=["eager", "graph", "both"],
    )
    parser.add_argument(
        "--csv-out",
        type=str,
        default=None,
        help="Optional output CSV path (written on rank 0 only).",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    dtype = dtype_from_name(args.dtype)
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", str(rank)))
    torch.cuda.set_device(local_rank % torch.cuda.device_count())
    device = torch.device(f"cuda:{local_rank % torch.cuda.device_count()}")
    set_custom_all_reduce(True)
    init_distributed_environment(
        world_size=world_size,
        rank=rank,
        local_rank=local_rank,
        distributed_init_method="env://",
        backend="nccl",
    )
    initialize_model_parallel(tensor_model_parallel_size=world_size)
    prefill_shapes = parse_shapes(args.prefill_shapes)
    decode_shapes = parse_shapes(args.decode_shapes)
    if rank == 0:
        print(
            "Config: "
            f"world_size={world_size}, dtype={dtype}, residual_mode={args.residual_mode}, "
            f"warmup={args.warmup}, iters={args.iters}, repeats={args.repeats}"
        )
    run_modes: Sequence[str]
    if args.mode == "both":
        run_modes = ("eager", "graph")
    else:
        run_modes = (args.mode,)
    csv_rows: List[Dict[str, object]] = []
    for mode in run_modes:
        shapes = prefill_shapes if mode == "eager" else decode_shapes
        if rank == 0:
            phase_name = "prefill(eager)" if mode == "eager" else "decode(graph)"
            print("\n" + "=" * 120)
            print(f"Mode: {phase_name}")
            print(
                "| Shape | Input bytes/rank | Split p50 (us) | Fused p50 (us) | Speedup | Fused available | Correctness |"
            )
            print(
                "|:------|-----------------:|---------------:|---------------:|--------:|:----------------|:------------|"
            )
        for shape in shapes:
            x, residual, weight = _make_inputs(
                shape=shape,
                dtype=dtype,
                seed=args.seed,
                residual_mode=args.residual_mode,
                rank=rank,
                device=device,
            )
            if mode == "eager":
                metrics = bench_eager(
                    x=x,
                    residual=residual,
                    weight=weight,
                    eps=args.eps,
                    warmup=args.warmup,
                    iters=args.iters,
                    repeats=args.repeats,
                )
            else:
                metrics = bench_graph(
                    x=x,
                    residual=residual,
                    weight=weight,
                    eps=args.eps,
                    warmup=args.warmup,
                    iters=args.iters,
                    repeats=args.repeats,
                )
            split_us = _mean_across_ranks(float(metrics["split_us"]), device)
            fused_available = _all_true_across_ranks(
                bool(metrics["fused_available"]), device
            )
            correctness_ok = _all_true_across_ranks(
                bool(metrics["correctness_ok"]), device
            )
            fused_us: Optional[float] = None
            if fused_available and metrics["fused_us"] is not None:
                fused_us = _mean_across_ranks(float(metrics["fused_us"]), device)
            if rank == 0:
                m, n = shape
                shape_str = f"{m}x{n}"
                bytes_per_rank = _shape_bytes(shape, dtype)
                if fused_us is not None and fused_us > 0:
                    speedup = split_us / fused_us
                    speedup_str = f"{speedup:.3f}x"
                    fused_str = f"{fused_us:.1f}"
                else:
                    speedup_str = "N/A"
                    fused_str = "N/A"
                correctness_text = (
                    "PASS" if correctness_ok else str(metrics["correctness_detail"])
                )
                print(
                    f"| {shape_str} | {bytes_per_rank} | {split_us:.1f} | {fused_str} | "
                    f"{speedup_str} | {str(fused_available)} | {correctness_text} |"
                )
                csv_rows.append(
                    {
                        "mode": mode,
                        "shape": shape_str,
                        "m": m,
                        "n": n,
                        "bytes_per_rank": bytes_per_rank,
                        "split_p50_us": split_us,
                        "fused_p50_us": fused_us if fused_us is not None else "",
                        "speedup_split_over_fused": (
                            split_us / fused_us
                            if fused_us is not None and fused_us > 0
                            else ""
                        ),
                        "fused_available": fused_available,
                        "correctness_ok": correctness_ok,
                        "correctness_detail": correctness_text,
                        "dtype": str(dtype),
                        "world_size": world_size,
                        "residual_mode": args.residual_mode,
                        "warmup": args.warmup,
                        "iters": args.iters,
                        "repeats": args.repeats,
                    }
                )
    if rank == 0 and args.csv_out:
        os.makedirs(os.path.dirname(args.csv_out) or ".", exist_ok=True)
        fieldnames = [
            "mode",
            "shape",
            "m",
            "n",
            "bytes_per_rank",
            "split_p50_us",
            "fused_p50_us",
            "speedup_split_over_fused",
            "fused_available",
            "correctness_ok",
            "correctness_detail",
            "dtype",
            "world_size",
            "residual_mode",
            "warmup",
            "iters",
            "repeats",
        ]
        with open(args.csv_out, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(csv_rows)
        print(f"\nSaved CSV to: {args.csv_out}")
    _barrier(device)
    destroy_model_parallel()
    destroy_distributed_environment()


if __name__ == "__main__":
    main()
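
Review note: the latency summary computed at the end of `_measure_us` can be checked without a GPU. The sketch below reproduces only that summary step on a plain list of samples; the helper name `summarize_us` is hypothetical and not part of the benchmark script, and the sample values are made up for illustration.

```python
import statistics
from typing import Dict, List


def summarize_us(samples_us: List[float]) -> Dict[str, float]:
    """Hypothetical helper: summarize latencies like the tail of `_measure_us`."""
    if not samples_us:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_us)
    # Same index rule as the script: truncate (n - 1) * 0.95, no interpolation.
    p95_index = int((len(ordered) - 1) * 0.95)
    return {
        "p50_us": float(statistics.median(ordered)),
        "p95_us": float(ordered[p95_index]),
        "min_us": float(ordered[0]),
        "max_us": float(ordered[-1]),
    }


if __name__ == "__main__":
    # Five made-up samples, matching the script's default --repeats=5.
    print(summarize_us([12.0, 10.0, 11.0, 30.0, 13.0]))
```

With the default of five repeats the truncated index is 3, so the reported p95 is the second-largest sample rather than the maximum; a single outlier repeat therefore shows up in `max_us` but not in `p95_us`.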

View File

@@ -18,7 +18,13 @@ from sglang.srt.layers.moe.fused_moe_triton.triton_kernels_moe import (
triton_kernel_moe_forward,
)
from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig
from sglang.srt.layers.moe.topk import TopK, TopKConfig, select_experts
from sglang.srt.layers.moe.topk import (
TopK,
TopKConfig,
TopKOutputFormat,
select_experts,
)
from sglang.srt.server_args import ServerArgs, set_global_server_args_for_scheduler
def fused_moe_triton_api(
@@ -32,8 +38,8 @@ def fused_moe_triton_api(
top_k=topk,
renormalize=False,
use_grouped_topk=False,
output_format=TopKOutputFormat.TRITON_KERNEL,
)
topk_op.use_triton_kernels = True
triton_topk_output = topk_op.forward_cuda(
hidden_states=x,
router_logits=input_gating,
@@ -199,6 +205,10 @@ def main():
parser.add_argument("--trust-remote-code", action="store_true")
args = parser.parse_args()
# Initialize global server args (required by SGLang MoE kernels)
server_args = ServerArgs(model_path=args.model)
set_global_server_args_for_scheduler(server_args)
try:
if not torch.distributed.is_initialized():
torch.distributed.init_process_group(
@@ -217,8 +227,8 @@ def main():
)
initialize_model_parallel(
tensor_model_parallel_size=args.ep_size,
pipeline_model_parallel_size=args.tp_size,
tensor_model_parallel_size=1,
expert_model_parallel_size=1,
)
model_config = get_model_config(args.model, args.tp_size, args.ep_size)

View File

@@ -35,10 +35,9 @@ from sglang.bench_serving import (
_create_bench_client_session,
calculate_metrics,
get_request,
get_tokenizer,
remove_prefix,
sample_random_requests,
)
from sglang.benchmark.datasets.random import sample_random_requests
from sglang.benchmark.utils import get_tokenizer, remove_prefix
global args

View File

@@ -13,8 +13,7 @@ number = 5
def expand_tip(topic, tip, generate):
s = (
"""Please expand a tip for a topic into a detailed paragraph.
s = """Please expand a tip for a topic into a detailed paragraph.
Topic: staying healthy
Tip: Regular Exercise
@@ -28,12 +27,7 @@ Topic: writing a blog post
Tip: structure your content effectively
Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement.
Topic: """
+ topic
+ "\nTip: "
+ tip
+ "\nParagraph:"
)
Topic: """ + topic + "\nTip: " + tip + "\nParagraph:"
return generate(s, max_tokens=128, stop=["\n\n"])

View File

@@ -14,8 +14,7 @@ number = 5
@sgl.function
def expand_tip(s, topic, tip):
s += (
"""Please expand a tip for a topic into a detailed paragraph.
s += """Please expand a tip for a topic into a detailed paragraph.
Topic: staying healthy
Tip: Regular Exercise
@@ -29,12 +28,7 @@ Topic: writing a blog post
Tip: structure your content effectively
Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement.
Topic: """
+ topic
+ "\nTip: "
+ tip
+ "\nParagraph:"
)
Topic: """ + topic + "\nTip: " + tip + "\nParagraph:"
s += sgl.gen("paragraph", max_tokens=128, stop=["\n\n"], temperature=0)

View File

@@ -2,8 +2,7 @@ number = 5
async def expand_tip_async(topic, tip, generate):
s = (
"""Please expand a tip for a topic into a detailed paragraph.
s = """Please expand a tip for a topic into a detailed paragraph.
Topic: staying healthy
Tip: Regular Exercise
@@ -17,12 +16,7 @@ Topic: writing a blog post
Tip: structure your content effectively
Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement.
Topic: """
+ topic
+ "\nTip: "
+ tip
+ "\nParagraph:"
)
Topic: """ + topic + "\nTip: " + tip + "\nParagraph:"
return await generate(s, max_tokens=128, stop="\n\n")

View File

@@ -19,7 +19,7 @@ ARG PIP_DEFAULT_INDEX
ARG UBUNTU_MIRROR
ARG GITHUB_ARTIFACTORY=github.com
ARG INSTALL_FLASHINFER_JIT_CACHE=0
ARG FLASHINFER_VERSION=0.6.2
ARG FLASHINFER_VERSION=0.6.3
ARG MOONCAKE_VERSION=0.3.9
#if need other arg please add in MOONCAKE_COMPILE_ARG
ARG MOONCAKE_COMPILE_ARG="-DUSE_HTTP=ON -DUSE_MNNVL=ON -DUSE_CUDA=ON -DWITH_EP=ON"

View File

@@ -93,9 +93,9 @@ RUN git clone https://github.com/sgl-project/sglang --branch $SGLANG_TAG && \
RUN ${PIP_INSTALL} wheel==0.45.1 pybind11 pyyaml decorator scipy attrs psutil \
&& mkdir sgl-kernel-npu \
&& cd sgl-kernel-npu \
&& wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/${SGLANG_KERNEL_NPU_TAG}/sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
&& unzip sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
&& ${PIP_INSTALL} output/deep_ep*.whl output/sgl_kernel_npu*.whl \
&& wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/${SGLANG_KERNEL_NPU_TAG}/sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-torch2.8.0-py311-cann${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
&& unzip sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-torch2.8.0-py311-cann${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
&& ${PIP_INSTALL} deep_ep*.whl sgl_kernel_npu*.whl \
&& cd .. && rm -rf sgl-kernel-npu \
&& cd "$(python3 -m pip show deep-ep | awk '/^Location:/ {print $2}')" && ln -sf deep_ep/deep_ep_cpp*.so

View File

@@ -21,7 +21,7 @@ ENV BUILD_TRITON="0"
ENV BUILD_LLVM="0"
ENV BUILD_AITER_ALL="1"
ENV BUILD_MOONCAKE="1"
ENV AITER_COMMIT="v0.1.10.post2"
ENV AITER_COMMIT="v0.1.10.post3"
# ===============================
# Base image 950 and args
@@ -31,7 +31,7 @@ ENV BUILD_TRITON="0"
ENV BUILD_LLVM="0"
ENV BUILD_AITER_ALL="1"
ENV BUILD_MOONCAKE="1"
ENV AITER_COMMIT="v0.1.10.post2"
ENV AITER_COMMIT="v0.1.10.post3"
# ===============================
# Chosen arch and args
FROM ${GPU_ARCH}
@@ -70,7 +70,7 @@ ARG ENABLE_MORI=0
ARG NIC_BACKEND=none
ARG MORI_REPO="https://github.com/ROCm/mori.git"
ARG MORI_COMMIT="b0dce4beebeb1f26c784eee17d5fd9785ee9447f"
ARG MORI_COMMIT="20920706a9004018dbd87c7387f207d08d0e05af"
# AMD AINIC apt repo settings
ARG AINIC_VERSION=1.117.5
@@ -214,10 +214,10 @@ RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \
ENV CARGO_BUILD_JOBS=4
# Build and install sgl-model-gateway
RUN python3 -m pip install --no-cache-dir setuptools-rust \
RUN python3 -m pip install --no-cache-dir maturin \
&& cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \
&& /bin/bash -lc 'ulimit -n 8192 && cargo build --release' \
&& python3 -m pip install --no-cache-dir . \
&& ulimit -n 65536 && maturin build --release --features vendored-openssl --out dist \
&& python3 -m pip install --force-reinstall dist/*.whl \
&& rm -rf /root/.cache
# -----------------------
@@ -280,7 +280,7 @@ RUN /bin/bash -lc 'set -euo pipefail; \
\
# TVM Python bits need Cython + z3 before configure.
# Pin z3-solver==4.15.4.0: 4.15.4.0 has a manylinux wheel; 4.15.5.0 has no wheel and builds from source (fails: C++20 <format> needs GCC 14+, image has GCC 11).
"$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi>=0.1.6" "z3-solver==4.15.4.0"; \
"$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi @ git+https://github.com/apache/tvm-ffi.git@37d0485b2058885bf4e7a486f7d7b2174a8ac1ce" "z3-solver==4.15.4.0"; \
\
# Clone + pin TileLang (bundled TVM), then build
git clone --recursive "${TILELANG_REPO}" /opt/tilelang && \
@@ -390,10 +390,7 @@ ENV SGLANG_USE_AITER=1
ENV SGLANG_USE_ROCM700A=1
ENV NCCL_MIN_NCHANNELS=112
ENV VLLM_FP8_PADDING=1
ENV VLLM_FP8_ACT_PADDING=1
ENV VLLM_FP8_WEIGHT_PADDING=1
ENV VLLM_FP8_REDUCE_CONV=1
ENV ROCM_QUICK_REDUCE_QUANTIZATION=INT8
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1

503
docker/rocm720.Dockerfile Normal file
View File

@@ -0,0 +1,503 @@
# Usage (to build SGLang ROCm docker image):
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx942 -t v0.5.8.post1-rocm700-mi30x -f rocm.Dockerfile .
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx942-rocm720 -t v0.5.8.post1-rocm720-mi30x-preview -f rocm720.Dockerfile .
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx950 -t v0.5.8.post1-rocm700-mi35x -f rocm.Dockerfile .
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx950-rocm720 -t v0.5.8.post1-rocm720-mi35x-preview -f rocm720.Dockerfile .
# Usage (to build SGLang ROCm + Mori docker image):
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx942 --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t v0.5.8.post1-rocm700-mi30x -f rocm.Dockerfile .
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx950 --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t v0.5.8.post1-rocm700-mi35x -f rocm.Dockerfile .
# Default base images
ARG BASE_IMAGE_942="rocm/sgl-dev:rocm7-vllm-20250904"
ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
ARG BASE_IMAGE_950="rocm/sgl-dev:rocm7-vllm-20250904"
ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
# Declared before the first FROM so it can select the final stage in "FROM ${GPU_ARCH}" below
ARG GPU_ARCH=gfx950
# ===============================
# Base image 942 with rocm700 and args
FROM $BASE_IMAGE_942 AS gfx942
ENV BUILD_VLLM="0"
ENV BUILD_TRITON="0"
ENV BUILD_LLVM="0"
ENV BUILD_AITER_ALL="1"
ENV BUILD_MOONCAKE="1"
ENV AITER_COMMIT="v0.1.10.post3"
# ===============================
# Base image 942 with rocm720 and args
FROM $BASE_IMAGE_942_ROCM720 AS gfx942-rocm720
ENV BUILD_VLLM="0"
ENV BUILD_TRITON="1"
ENV BUILD_LLVM="0"
ENV BUILD_AITER_ALL="1"
ENV BUILD_MOONCAKE="1"
ENV AITER_COMMIT="v0.1.10.post3"
# ===============================
# Base image 950 and args
FROM $BASE_IMAGE_950 AS gfx950
ENV BUILD_VLLM="0"
ENV BUILD_TRITON="0"
ENV BUILD_LLVM="0"
ENV BUILD_AITER_ALL="1"
ENV BUILD_MOONCAKE="1"
ENV AITER_COMMIT="v0.1.10.post3"
# ===============================
# Base image 950 with rocm720 and args
FROM $BASE_IMAGE_950_ROCM720 AS gfx950-rocm720
ENV BUILD_VLLM="0"
ENV BUILD_TRITON="1"
ENV BUILD_LLVM="0"
ENV BUILD_AITER_ALL="1"
ENV BUILD_MOONCAKE="1"
ENV AITER_COMMIT="v0.1.10.post3"
# ===============================
# Chosen arch and args
FROM ${GPU_ARCH}
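# Redeclared after FROM: an ARG defined before FROM is not in scope inside the build stage
ARG GPU_ARCH=gfx950
# Strip the trailing "-rocm720" flavor suffix so only the bare arch (e.g. gfx942) reaches the builds
ENV GPU_ARCH_LIST=${GPU_ARCH%-*}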
ARG SGL_REPO="https://github.com/sgl-project/sglang.git"
ARG SGL_BRANCH="main"
# Version override for setuptools_scm (used in nightly builds)
ARG SETUPTOOLS_SCM_PRETEND_VERSION=""
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
ARG TRITON_COMMIT="42270451990532c67e69d753fbd026f28fcc4840"
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
ARG LLVM_REPO="https://github.com/jrbyrnes/llvm-project.git"
ARG LLVM_BRANCH="MainOpSelV2"
ARG LLVM_COMMIT="6520ace8227ffe2728148d5f3b9872a870b0a560"
ARG MOONCAKE_REPO="https://github.com/kvcache-ai/Mooncake.git"
ARG MOONCAKE_COMMIT="b6a841dc78c707ec655a563453277d969fb8f38d"
ARG TILELANG_REPO="https://github.com/tile-ai/tilelang.git"
ARG TILELANG_COMMIT="ebf4a7cb8881432165ae8760e99d209d905c704a"
ARG FHT_REPO="https://github.com/jeffdaily/fast-hadamard-transform.git"
ARG FHT_BRANCH="rocm"
ARG FHT_COMMIT="46efb7d776d38638fc39f3c803eaee3dd7016bd1"
ARG ENABLE_MORI=0
ARG NIC_BACKEND=none
ARG MORI_REPO="https://github.com/ROCm/mori.git"
ARG MORI_COMMIT="20920706a9004018dbd87c7387f207d08d0e05af"
# AMD AINIC apt repo settings
ARG AINIC_VERSION=1.117.5
ARG UBUNTU_CODENAME=jammy
USER root
# Install some basic utilities
RUN python -m pip install --upgrade pip && pip install setuptools_scm
RUN apt-get purge -y sccache; python -m pip uninstall -y sccache; rm -f "$(which sccache)"
# Install AMD SMI Python package from ROCm distribution
RUN cd /opt/rocm/share/amd_smi && python3 -m pip install --no-cache-dir .
WORKDIR /sgl-workspace
# -----------------------
# llvm
RUN if [ "$BUILD_LLVM" = "1" ]; then \
export HIP_CLANG_PATH="/sgl-workspace/llvm-project/build/bin/" \
&& git clone --single-branch ${LLVM_REPO} -b ${LLVM_BRANCH} \
&& cd llvm-project \
&& git checkout ${LLVM_COMMIT} \
&& mkdir build \
&& cd build \
&& cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_ENABLE_PROJECTS="clang;lld;" -DLLVM_ENABLE_RUNTIMES="compiler-rt" ../llvm \
&& make -j$(nproc); \
fi
# -----------------------
# AITER
# Unset setuptools_scm override so AITER gets its own version (AITER_COMMIT), not SGLang's
# (SETUPTOOLS_SCM_PRETEND_VERSION is set later for SGLang nightly builds and would otherwise
# leak into AITER's version when AITER uses setuptools_scm)
ENV SETUPTOOLS_SCM_PRETEND_VERSION=
RUN pip uninstall -y aiter \
&& pip install psutil pybind11 # Required by AITER setup.py
RUN git clone ${AITER_REPO} \
&& cd aiter \
&& git checkout ${AITER_COMMIT} \
&& git submodule update --init --recursive
# Hot patches for AITER in v0.1.10.post1
# This is for ROCm 7.2 only, because of the image rebase from vllm
# to rocm/pytorch.
RUN set -eux; \
case "${GPU_ARCH}" in \
*rocm720*) \
echo "ROCm 7.2 flavor detected from GPU_ARCH=${GPU_ARCH}"; \
cd aiter \
&& sed -i '459 s/if.*:/if False:/' aiter/ops/triton/attention/pa_mqa_logits.py; \
;; \
*) \
echo "Not rocm720 (GPU_ARCH=${GPU_ARCH}), skip patch"; \
;; \
esac
RUN cd aiter \
&& echo "[AITER] GPU_ARCH=${GPU_ARCH}" \
&& if [ "$BUILD_AITER_ALL" = "1" ] && [ "$BUILD_LLVM" = "1" ]; then \
sh -c "HIP_CLANG_PATH=/sgl-workspace/llvm-project/build/bin/ PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \
elif [ "$BUILD_AITER_ALL" = "1" ]; then \
sh -c "PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \
else \
sh -c "GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \
fi \
&& echo "export PYTHONPATH=/sgl-workspace/aiter:\${PYTHONPATH}" >> /etc/bash.bashrc
# -----------------------
# Build vLLM
ARG VLLM_REPO="https://github.com/ROCm/vllm.git"
ARG VLLM_BRANCH="9f6b92db47c3444b7a7d67451ba0c3a2d6af4c2c"
RUN if [ "$BUILD_VLLM" = "1" ]; then \
git clone ${VLLM_REPO} \
&& cd vllm \
&& git checkout ${VLLM_BRANCH} \
&& python -m pip install -r requirements/rocm.txt \
&& python setup.py clean --all \
&& python setup.py develop; \
fi
# -----------------------
# Build Mooncake
ENV PATH=$PATH:/usr/local/go/bin
RUN if [ "$BUILD_MOONCAKE" = "1" ]; then \
apt update && apt install -y zip unzip wget && \
apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core && \
apt install -y openssh-server openmpi-bin openmpi-common libopenmpi-dev && \
git clone ${MOONCAKE_REPO} && \
cd Mooncake && \
git checkout ${MOONCAKE_COMMIT} && \
git submodule update --init --recursive && \
bash dependencies.sh -y && \
rm -rf /usr/local/go && \
wget https://go.dev/dl/go1.22.2.linux-amd64.tar.gz && \
tar -C /usr/local -xzf go1.22.2.linux-amd64.tar.gz && \
rm go1.22.2.linux-amd64.tar.gz && \
mkdir -p build && \
cd build && \
cmake .. -DUSE_HIP=ON -DUSE_ETCD=ON && \
make -j "$(nproc)" && make install; \
fi
# -----------------------
# Build SGLang
ARG BUILD_TYPE=all
# Set version for setuptools_scm if provided (for nightly builds). Only pass in the SGLang
# pip install RUN so it does not affect AITER, sgl-model-gateway, TileLang, FHT, MORI, etc.
ARG SETUPTOOLS_SCM_PRETEND_VERSION
RUN pip install IPython \
&& pip install orjson \
&& pip install python-multipart \
&& pip install torchao==0.9.0 \
&& pip install pybind11
RUN pip uninstall -y sgl_kernel sglang
RUN git clone ${SGL_REPO} \
&& cd sglang \
&& echo "Using ${SGL_BRANCH} branch." \
&& git checkout ${SGL_BRANCH} \
&& cd sgl-kernel \
&& rm -f pyproject.toml \
&& mv pyproject_rocm.toml pyproject.toml \
&& AMDGPU_TARGET=$GPU_ARCH_LIST python setup_rocm.py install \
&& cd .. \
&& rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml \
&& if [ "$BUILD_TYPE" = "srt" ]; then \
export SETUPTOOLS_SCM_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION}" && python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"; \
else \
export SETUPTOOLS_SCM_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION}" && python -m pip --no-cache-dir install -e "python[all_hip]"; \
fi
RUN python -m pip cache purge
# Copy config files to support MI300X in virtualized environments (MI300X_VF). Symlinks will not be created in image build.
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
-type f -name '*MI300X*' | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
# Install Rust toolchain for sgl-model-gateway
ENV PATH="/root/.cargo/bin:${PATH}"
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \
&& rustc --version && cargo --version
ENV CARGO_BUILD_JOBS=4
# Build and install sgl-model-gateway
RUN python3 -m pip install --no-cache-dir maturin \
&& cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \
&& ulimit -n 65536 && maturin build --release --features vendored-openssl --out dist \
&& python3 -m pip install --force-reinstall dist/*.whl \
&& rm -rf /root/.cache
# -----------------------
# TileLang
ENV DEBIAN_FRONTEND=noninteractive
ENV LIBGL_ALWAYS_INDIRECT=1
RUN echo "LC_ALL=en_US.UTF-8" >> /etc/environment
RUN /bin/bash -lc 'set -euo pipefail; \
echo "[TileLang] Building TileLang for ${GPU_ARCH}"; \
# System dependencies (NO llvm-dev to avoid llvm-config-16 shadowing)
apt-get update && apt-get install -y --no-install-recommends \
build-essential git wget curl ca-certificates gnupg \
libgtest-dev libgmock-dev \
libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \
python3 python3-dev python3-setuptools python3-pip python3-apt \
gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev vim \
cmake ninja-build pkg-config libstdc++6 software-properties-common \
&& rm -rf /var/lib/apt/lists/*; \
\
# Prefer the container venv
VENV_PY="/opt/venv/bin/python"; \
VENV_PIP="/opt/venv/bin/pip"; \
if [ ! -x "$VENV_PY" ]; then VENV_PY="python3"; fi; \
if [ ! -x "$VENV_PIP" ]; then VENV_PIP="pip3"; fi; \
\
# Build GoogleTest static libs (Ubuntu package ships sources only)
cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release && \
cmake --build /tmp/build-gtest -j"$(nproc)" && \
cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/ && \
rm -rf /tmp/build-gtest; \
\
# Keep setuptools < 80 (compat with base image)
"$VENV_PIP" install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja scikit-build-core && \
"$VENV_PIP" cache purge || true; \
\
# Locate ROCm llvm-config; fallback to installing LLVM 18 if missing
LLVM_CONFIG_PATH=""; \
for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do \
if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi; \
done; \
if [ -z "$LLVM_CONFIG_PATH" ]; then \
echo "[TileLang] ROCm llvm-config not found; installing LLVM 18..."; \
curl -fsSL https://apt.llvm.org/llvm-snapshot.gpg.key | gpg --dearmor -o /etc/apt/keyrings/llvm.gpg; \
echo "deb [signed-by=/etc/apt/keyrings/llvm.gpg] http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main" > /etc/apt/sources.list.d/llvm.list; \
apt-get update; \
apt-get install -y --no-install-recommends llvm-18; \
rm -rf /var/lib/apt/lists/*; \
LLVM_CONFIG_PATH="$(command -v llvm-config-18)"; \
if [ -z "$LLVM_CONFIG_PATH" ]; then echo "ERROR: llvm-config-18 not found after install"; exit 1; fi; \
fi; \
echo "[TileLang] Using LLVM_CONFIG at: $LLVM_CONFIG_PATH"; \
export PATH="$(dirname "$LLVM_CONFIG_PATH"):/usr/local/bin:${PATH}"; \
export LLVM_CONFIG="$LLVM_CONFIG_PATH"; \
\
# Optional shim for tools that expect llvm-config-16
mkdir -p /usr/local/bin && \
printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16 && \
chmod +x /usr/local/bin/llvm-config-16; \
\
# TVM Python bits need Cython + z3 before configure.
# Pin z3-solver==4.15.4.0: 4.15.4.0 has a manylinux wheel; 4.15.5.0 has no wheel and builds from source (fails: C++20 <format> needs GCC 14+, image has GCC 11).
"$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi @ git+https://github.com/apache/tvm-ffi.git@37d0485b2058885bf4e7a486f7d7b2174a8ac1ce" "z3-solver==4.15.4.0"; \
\
# Clone + pin TileLang (bundled TVM), then build
git clone --recursive "${TILELANG_REPO}" /opt/tilelang && \
cd /opt/tilelang && \
git fetch --depth=1 origin "${TILELANG_COMMIT}" || true && \
git checkout -f "${TILELANG_COMMIT}" && \
git submodule update --init --recursive && \
export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG} -DSKBUILD_SABI_VERSION= ${CMAKE_ARGS:-}" && \
"$VENV_PIP" install -e . -v --no-build-isolation --no-deps; \
if [ -f pyproject.toml ]; then sed -i "/^[[:space:]]*\"torch/d" pyproject.toml || true; fi; \
"$VENV_PIP" cache purge || true; \
"$VENV_PY" -c "import tilelang; print(tilelang.__version__)"'
# -----------------------
# Hadamard-transform (HIP build)
RUN /bin/bash -lc 'set -euo pipefail; \
git clone --branch "${FHT_BRANCH}" "${FHT_REPO}" fast-hadamard-transform; \
cd fast-hadamard-transform; \
git checkout -f "${FHT_COMMIT}"; \
python setup.py install'
# -----------------------
# Python tools
RUN python3 -m pip install --no-cache-dir \
py-spy \
pre-commit \
tabulate
# -----------------------
# MORI (optional)
ENV PYTORCH_ROCM_ARCH=gfx942;gfx950
RUN /bin/bash -lc 'set -euo pipefail; \
if [ "${ENABLE_MORI}" != "1" ]; then \
echo "[MORI] Skipping (ENABLE_MORI=${ENABLE_MORI})"; \
exit 0; \
fi; \
echo "[MORI] Enabling MORI (NIC_BACKEND=${NIC_BACKEND})"; \
\
# Base deps for MORI build
apt-get update && apt-get install -y --no-install-recommends \
build-essential \
g++ \
jq \
libopenmpi-dev \
libpci-dev \
initramfs-tools \
&& rm -rf /var/lib/apt/lists/*; \
\
# NIC backend deps
case "${NIC_BACKEND}" in \
# default: mlx5
none) \
export USE_IONIC="OFF"; \
export USE_BNXT="OFF"; \
;; \
# AMD NIC
ainic) \
export USE_IONIC="ON"; \
export USE_BNXT="OFF"; \
apt-get update && apt-get install -y --no-install-recommends ca-certificates curl gnupg apt-transport-https && \
rm -rf /var/lib/apt/lists/* && mkdir -p /etc/apt/keyrings; \
curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/amdainic.gpg; \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/amdainic.gpg] https://repo.radeon.com/amdainic/pensando/ubuntu/${AINIC_VERSION} ${UBUNTU_CODENAME} main" \
> /etc/apt/sources.list.d/amdainic.list; \
apt-get update && apt-get install -y --no-install-recommends \
libionic-dev \
ionic-common \
; \
rm -rf /var/lib/apt/lists/*; \
;; \
# TODO: Add Broadcom bnxt packages/repos here later.
# bnxt) \
# export USE_IONIC="OFF"; \
# export USE_BNXT="ON"; \
# echo "[MORI] NIC_BACKEND=bnxt: USE_BNXT=ON. Add Broadcom bnxt packages/repos here later."; \
# ;; \
*) \
echo "ERROR: unknown NIC_BACKEND=${NIC_BACKEND}. Use one of: none, ainic"; \
exit 2; \
;; \
esac; \
\
# Build/install MORI
export MORI_GPU_ARCHS="${GPU_ARCH_LIST}"; \
echo "[MORI] MORI_GPU_ARCHS=${MORI_GPU_ARCHS} USE_IONIC=${USE_IONIC} USE_BNXT=${USE_BNXT}"; \
rm -rf /sgl-workspace/mori; \
git clone "${MORI_REPO}" /sgl-workspace/mori; \
cd /sgl-workspace/mori; \
git checkout "${MORI_COMMIT}"; \
git submodule update --init --recursive; \
python3 setup.py develop; \
python3 -c "import os, torch; print(os.path.join(os.path.dirname(torch.__file__), \"lib\"))" > /etc/ld.so.conf.d/torch.conf; \
ldconfig; \
echo "export PYTHONPATH=/sgl-workspace/mori:\${PYTHONPATH}" >> /etc/bash.bashrc; \
echo "[MORI] Done."'
# -----------------------
# Hot patch: torch-ROCm
# The artifact hardcoded the supported triton version to be 3.5.1.
# Rewrite the restriction directly.
ARG TORCH_ROCM_FILE="torch-2.9.1+rocm7.2.0.lw.git7e1940d4-cp310-cp310-linux_x86_64.whl"
RUN mkdir /tmp/whl && cd /tmp/whl \
&& export TORCH_ROCM_FILE="${TORCH_ROCM_FILE}" \
&& python - <<'PY'
import zipfile, csv, os, re
from pathlib import Path
fname = os.environ["TORCH_ROCM_FILE"]
in_whl = Path("/") / fname
out_whl = Path("/tmp")/ fname
work = Path("/tmp/whl")
# 1) Extract
with zipfile.ZipFile(in_whl, "r") as z:
    z.extractall(work)
# 2) Locate dist-info and patch METADATA (edit this logic to match your exact line)
dist_info = next(work.glob("*.dist-info"))
meta = dist_info / "METADATA"
txt = meta.read_text(encoding="utf-8")
# Example: replace one exact requirement form.
# Adjust the string to match what you actually see.
pat = r'^Requires-Dist:\s*triton==3\.5\.1[^\s]*;'
# Keep the "Requires-Dist:" field name so the rewritten line remains a valid METADATA entry.
txt2, n = re.subn(pat, r'Requires-Dist: triton>=3.5.1;', txt, flags=re.MULTILINE)
if n == 0:
    raise SystemExit("Did not find expected Requires-Dist line to replace in METADATA")
meta.write_text(txt2, encoding="utf-8")
# 3) Hacky step: blank hash/size columns in RECORD
record = dist_info / "RECORD"
rows = []
with record.open(newline="", encoding="utf-8") as f:
    for r in csv.reader(f):
        if not r:
            continue
        # keep filename, blank out hash and size
        rows.append([r[0], "", ""])
with record.open("w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
# 4) Re-zip as a wheel
with zipfile.ZipFile(out_whl, "w", compression=zipfile.ZIP_DEFLATED) as z:
    for p in work.rglob("*"):
        if p.is_file():
            z.write(p, p.relative_to(work).as_posix())
print("Wrote", out_whl)
PY
RUN python3 -m pip install --force --no-deps /tmp/${TORCH_ROCM_FILE} \
&& rm -fr /tmp/whl /tmp/${TORCH_ROCM_FILE}
# -----------------------
# Hot patch: Triton
# For ROCm 7.2, this custom build breaks pip dependency management,
# so future `pip install` will break the ROCm stack.
# A workaround for this is to reinstall the default triton
# wheel with the `rocm/pytorch` image in the root directory.
RUN if [ "$BUILD_TRITON" = "1" ]; then \
pip uninstall -y triton \
&& apt install -y cmake \
&& git clone ${TRITON_REPO} triton-custom \
&& cd triton-custom \
&& git checkout ${TRITON_COMMIT} \
&& pip install -r python/requirements.txt \
&& pip install -e .; \
fi
# -----------------------
# Performance environment variable.
# Skip CuDNN compatibility check - not applicable for ROCm (uses MIOpen instead)
ENV SGLANG_DISABLE_CUDNN_CHECK=1
ENV HIP_FORCE_DEV_KERNARG=1
ENV HSA_NO_SCRATCH_RECLAIM=1
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
ENV SGLANG_INT4_WEIGHT=0
ENV SGLANG_MOE_PADDING=1
ENV SGLANG_ROCM_DISABLE_LINEARQUANT=0
ENV SGLANG_ROCM_FUSED_DECODE_MLA=1
ENV SGLANG_SET_CPU_AFFINITY=1
ENV SGLANG_USE_AITER=1
ENV SGLANG_USE_ROCM700A=1
ENV NCCL_MIN_NCHANNELS=112
ENV ROCM_QUICK_REDUCE_QUANTIZATION=INT8
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
CMD ["/bin/bash"]

@@ -39,6 +39,23 @@ Notes:
- `page_first`: Only compatible with `kernel` I/O backend, automatically switches to `layer_first` with `direct` backend
- `page_first_direct`: Specifically designed for `direct` I/O backend with optimized memory organization
### Heterogeneous TP Support (GQA/MHA models)
HiCache storage supports cross-cluster KV reuse when different deployments use different TP sizes (for example, `tp=4` and `tp=8`) and share the same storage backend namespace.
Use `tp_lcm_size` in `--hicache-storage-backend-extra-config`:
```bash
# Example: heterogeneous TP = {4, 8}, so lcm = 8
--hicache-storage-backend-extra-config '{"tp_lcm_size": 8}'
```
Guidelines:
- Set `tp_lcm_size` to the least common multiple (LCM) of all TP sizes that will share the same HiCache storage.
- For MHA models with Mooncake and `page_head` layout, HiCache will split head shards based on `tp_lcm_size` to make keys reusable across heterogeneous TP deployments.
- If all clusters use the same TP size, this option is not needed.
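The `tp_lcm_size` guideline above can be sketched in a few lines. This is only an illustration: the helper name `hicache_extra_config` is hypothetical and not part of SGLang; the `tp_lcm_size` key and the `--hicache-storage-backend-extra-config` flag are the ones described above.

```python
import json
import math


def hicache_extra_config(tp_sizes):
    """Return the JSON string for --hicache-storage-backend-extra-config.

    Hypothetical helper: computes the least common multiple of all TP sizes
    that will share the same HiCache storage namespace.
    """
    if not tp_sizes or any(tp < 1 for tp in tp_sizes):
        raise ValueError("tp_sizes must be a non-empty list of positive integers")
    # math.lcm accepts any number of integer arguments (Python 3.9+).
    return json.dumps({"tp_lcm_size": math.lcm(*tp_sizes)})


# Heterogeneous TP = {4, 8}, as in the example above.
print(hicache_extra_config([4, 8]))
```

The printed string can be passed as the value of `--hicache-storage-backend-extra-config` on every deployment that shares the storage backend.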
### Prefetch Policies
```bash

@@ -102,7 +102,7 @@
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
]
},
{
@@ -151,18 +151,16 @@
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"server_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
" lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
" --max-loras-per-batch 2 \\\n",
" --log-level warning \\\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
]
},
{
@@ -227,8 +225,7 @@
"\n",
"# The `--target-lora-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.\n",
"# We are adding it here just to demonstrate usage.\n",
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"server_process, port = launch_server_cmd(\"\"\"\n",
" python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --cuda-graph-max-bs 2 \\\n",
@@ -236,11 +233,10 @@
" --max-lora-rank 256\n",
" --lora-target-modules all\n",
" --log-level warning\n",
" \"\"\"\n",
")\n",
" \"\"\")\n",
"\n",
"url = f\"http://127.0.0.1:{port}\"\n",
"wait_for_server(url)"
"wait_for_server(url, process=server_process)"
]
},
{
@@ -435,8 +431,7 @@
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"server_process, port = launch_server_cmd(\"\"\"\n",
" python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --cuda-graph-max-bs 8 \\\n",
@@ -448,12 +443,11 @@
" {\"lora_name\":\"lora1\",\"lora_path\":\"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"} \\\n",
" lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora\n",
" --log-level warning\n",
" \"\"\"\n",
")\n",
" \"\"\")\n",
"\n",
"\n",
"url = f\"http://127.0.0.1:{port}\"\n",
"wait_for_server(url)"
"wait_for_server(url, process=server_process)"
]
},
{
@@ -548,16 +542,14 @@
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"server_process, port = launch_server_cmd(\"\"\"\n",
" python3 -m sglang.launch_server \\\n",
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --lora-backend csgmv \\\n",
" --max-loras-per-batch 16 \\\n",
" --lora-paths lora1=path/to/lora1 lora2=path/to/lora2\n",
" \"\"\"\n",
")"
" \"\"\")"
]
},
{
@@ -594,8 +586,7 @@
"lora2 = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\"\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"server_process, port = launch_server_cmd(\"\"\"\n",
" python3 -m sglang.launch_server \\\n",
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
@@ -606,11 +597,10 @@
" --max-lora-rank 256 \\\n",
" --max-loras-per-batch 2 \\\n",
" --max-loaded-loras 4\n",
" \"\"\"\n",
")\n",
" \"\"\")\n",
"\n",
"url = f\"http://127.0.0.1:{port}\"\n",
"wait_for_server(url)"
"wait_for_server(url, process=server_process)"
]
},
{

@@ -142,6 +142,7 @@ The `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` environment variable enables the custom me
| **`SGLANG_DISAGGREGATION_THREAD_POOL_SIZE`** | Controls the total number of worker threads for KVCache transfer operations per TP rank | A dynamic value calculated as `int(0.75 * os.cpu_count()) // 8`, clamped to at least 4 and at most 12 to ensure efficiency and prevent thread race conditions |
| **`SGLANG_DISAGGREGATION_QUEUE_SIZE`** | Sets the number of parallel transfer queues. KVCache transfer requests from multiple decode instances are sharded into these queues so that they can share the threads and the transfer bandwidth at the same time. If it is set to `1`, requests are transferred one by one on a first-come, first-served (FCFS) basis | `4` |
| **`SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT`** | Timeout (seconds) for receiving destination KV indices during request initialization | `300` |
| **`SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL`** | Interval (seconds) between cleanups of bootstrap entries | `120` |
If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600` (10 minutes) to relax the timeout condition.
Please be aware that this setting will cause prefill instances to take a longer time to clean up the affected memory resources when a running decode node loses connection.
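As a rough illustration of the default for `SGLANG_DISAGGREGATION_THREAD_POOL_SIZE` described in the table above, the following sketch evaluates the formula for a few CPU counts. The function name is hypothetical, and reading the bounds as an inclusive clamp to the range 4 to 12 is an assumption based on the table's wording.

```python
import os


def default_transfer_thread_pool_size(cpu_count=None):
    """Sketch of the documented default: int(0.75 * cpus) // 8, clamped to [4, 12].

    Hypothetical helper for illustration; not an SGLang API.
    """
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    return min(max(4, int(0.75 * cpu_count) // 8), 12)


for cpus in (8, 64, 256):
    print(cpus, default_transfer_thread_pool_size(cpus))
```

Under these assumptions, small hosts land on the lower bound and very large hosts on the upper bound, so setting the variable explicitly only matters when a value outside that range is wanted.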

@@ -70,7 +70,7 @@
" \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
]
},
{

@@ -153,6 +153,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--device` | The device to use ('cuda', 'xpu', 'hpu', 'npu', 'cpu'). Defaults to auto-detection if not specified. | `None` | Type: str |
| `--tensor-parallel-size`<br>`--tp-size` | The tensor parallelism size. | `1` | Type: int |
| `--pipeline-parallel-size`<br>`--pp-size` | The pipeline parallelism size. | `1` | Type: int |
| `--attention-context-parallel-size`<br>`--attn-cp-size` | The attention context parallelism size. | `1` | Type: int |
| `--moe-data-parallel-size`<br>`--moe-dp-size` | The MoE data parallelism size. | `1` | Type: int |
| `--pp-max-micro-batch-size` | The maximum micro batch size in pipeline parallelism. | `None` | Type: int |
| `--pp-async-batch-depth` | The async batch depth of pipeline parallelism. | `0` | Type: int |
| `--stream-interval` | The interval (or buffer size) for streaming in terms of the token length. A smaller value makes streaming smoother, while a larger value makes the throughput higher | `1` | Type: int |
@@ -264,9 +266,9 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--sampling-backend` | Choose the kernels for sampling layers. | `None` | `flashinfer`, `pytorch`, `ascend` |
| `--grammar-backend` | Choose the backend for grammar-guided decoding. | `None` | `xgrammar`, `outlines`, `llguidance`, `none` |
| `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `fa4`, `triton_attn`, `ascend_attn`, `aiter_attn` |
| `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter` |
| `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` |
| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (optimal for Blackwell and low-latency), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `cutlass`, `triton`, `aiter` |
| `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter`, `trtllm` |
| `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`, `trtllm` |
| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (optimal for Blackwell and low-latency), 'flashinfer_deepgemm' (Hopper SM90 only; uses swapAB optimization for small M dimensions in decoding), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter` |
| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. **NOTE**: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. | `flashinfer_cutlass` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` |
| `--disable-flashinfer-autotune` | Flashinfer autotune is enabled by default. Set this flag to disable the autotune. | `False` | bool flag (set to enable) |
@@ -309,10 +311,11 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--expert-parallel-size`<br>`--ep-size`<br>`--ep` | The expert parallelism size. | `1` | Type: int |
| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | `none`, `deepep`, `mooncake`, `ascend_fuseep`|
| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | `none`, `deepep`, `mooncake`, `mori`, `ascend_fuseep`|
| `--moe-runner-backend` | Choose the runner backend for MoE. | `auto` | `auto`, `deep_gemm`, `triton`, `triton_kernel`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_mxfp4`, `flashinfer_cutedsl`, `cutlass` |
| `--flashinfer-mxfp4-moe-precision` | Choose the computation precision of flashinfer mxfp4 moe | `default` | `default`, `bf16` |
| `--enable-flashinfer-allreduce-fusion` | Enable FlashInfer allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) |
| `--enable-aiter-allreduce-fusion` | Enable aiter allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) |
| `--deepep-mode` | Select the mode when DeepEP MoE is enabled: `normal`, `low_latency`, or `auto`. The default is `auto`, which uses `low_latency` for decode batches and `normal` for prefill batches. | `auto` | `normal`, `low_latency`, `auto` |
| `--ep-num-redundant-experts` | Allocate this number of redundant experts in expert parallel. | `0` | Type: int |
| `--ep-dispatch-algorithm` | The algorithm to choose ranks for redundant experts in expert parallel. | `None` | Type: str |
@@ -334,7 +337,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--max-mamba-cache-size` | The maximum size of the mamba cache. | `None` | Type: int |
| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16` |
| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16`, `float16` |
| `--mamba-full-memory-ratio` | The ratio of mamba state memory to full kv cache memory. | `0.9` | Type: float |
| `--mamba-scheduler-strategy` | The strategy to use for mamba scheduler. `auto` currently defaults to `no_buffer`. 1. `no_buffer` does not support overlap scheduler due to not allocating extra mamba state buffers. Branching point caching support is feasible but not implemented. 2. `extra_buffer` supports overlap schedule by allocating extra mamba state buffers to track mamba state for caching (mamba state usage per running req becomes `2x` for non-spec; `1+(1/(2+speculative_num_draft_tokens))x` for spec dec (e.g. 1.16x if speculative_num_draft_tokens==4)). 2a. `extra_buffer` is strictly better for non-KV-cache-bound cases; for KV-cache-bound cases, the tradeoff depends on whether enabling overlap outweighs reduced max running requests. 2b. mamba caching at radix cache branching point is strictly better than non-branch but requires kernel support (currently only FLA backend), currently only extra_buffer supports branching. | `auto` | `auto`, `no_buffer`, `extra_buffer` |
| `--mamba-track-interval` | The interval (in tokens) to track the mamba state during decode. Only used when `--mamba-scheduler-strategy` is `extra_buffer`. Must be divisible by page_size if set, and must be >= speculative_num_draft_tokens when using speculative decoding. | `256` | Type: int |
@@ -373,6 +376,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--kt-max-deferred-experts-per-token` | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | `None` | Type: int |
## Diffusion LLM
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--dllm-algorithm` | The diffusion LLM algorithm, such as LowConfidence. | `None` | Type: str |
@@ -492,7 +496,6 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| `--disaggregation-prefill-pp` | Prefill pp size. Defaults to 1 if not set. This is only set on the decode server. | `1` | Type: int |
| `--disaggregation-ib-device` | The InfiniBand devices for disaggregation transfer, accepts single device (e.g., --disaggregation-ib-device mlx5_0) or multiple comma-separated devices (e.g., --disaggregation-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when mooncake backend is enabled. | `None` | Type: str |
| `--disaggregation-decode-enable-offload-kvcache` | Enable async KV cache offloading on decode server (PD mode). | `False` | bool flag (set to enable) |
| `--disaggregation-decode-enable-fake-auto` | Auto enable FAKE mode for decode node testing, no need to pass bootstrap_host and bootstrap_room in request. | `False` | bool flag (set to enable) |
| `--num-reserved-decode-tokens` | Number of decode tokens that will have memory reserved when adding new request to the running batch. | `512` | Type: int |
| `--disaggregation-decode-polling-interval` | The interval to poll requests in decode server. Can be set to >1 to reduce the overhead of this. | `1` | Type: int |

@@ -106,6 +106,29 @@ This path trades some I/O overhead for simplicity and flexibility. It integrates
**Python Engine API:** `engine.update_weights_from_disk(model_path, load_format=None)`
**Diffusion engine (SGLang-Diffusion):** The diffusion engine exposes the same `POST /update_weights_from_disk` endpoint with the following behavior:
- **All-or-nothing with rollback:** if any module fails to load, all previously updated modules are rolled back to the original weights by reloading from the original model path. No partial updates are left behind. If rollback itself fails, the exception propagates so the caller knows the model is in an inconsistent state.
- **Offload-aware:** when layerwise offload (`--dit-layerwise-offload`) is enabled, the diffusion offload manager replaces GPU parameters with small `torch.empty((1,))` placeholders while real weights live in consolidated pinned CPU buffers. A naive `param.data.copy_()` would fail with a shape mismatch. Instead, the updater dynamically detects active offload managers and writes new weights directly into their CPU buffers, bypassing the placeholders entirely. For any layer that happens to be prefetched on GPU at update time, the live GPU tensor is also updated so the change takes effect immediately. This requires no extra GPU memory and does not disturb the offload state.
- **DTensor-aware:** parameters distributed via `torch.distributed.tensor` (tensor parallelism) are updated through `distribute_tensor` so that each shard is correctly placed on the right device mesh.
**Request body:**
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `model_path` | The model path with the new weights. | Required | Type: str |
| `flush_cache` | Flush TeaCache state after update. | `True` | Type: bool |
| `target_modules` | List of module names to update (e.g. `["transformer"]`). If omitted, all `nn.Module` components are updated. | `None` | Type: list[str] |
**Response body:**
| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `success` | Whether the update succeeded. | - | Type: bool |
| `message` | Status / error message. | - | Type: str |
> **Note:** The diffusion engine (SGLang-Diffusion) does not currently support hot refit (updating weights while inference is in progress). The diffusion scheduler processes one request at a time and completes the entire inference before handling the next request, so weight updates and inference never run concurrently.
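A minimal sketch of building the request described above, using only the Python standard library. The endpoint path and the `model_path`, `flush_cache`, and `target_modules` fields come from the tables above; the helper name, the host and port `http://localhost:30000`, and the weights path are placeholders assumed for illustration.

```python
import json
import urllib.request


def build_update_weights_request(base_url, model_path, flush_cache=True, target_modules=None):
    """Build (but do not send) a POST /update_weights_from_disk request.

    Hypothetical helper for illustration; not an SGLang API.
    """
    payload = {"model_path": model_path, "flush_cache": flush_cache}
    if target_modules is not None:
        # Omitting target_modules means all nn.Module components are updated.
        payload["target_modules"] = list(target_modules)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/update_weights_from_disk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_weights_request(
    "http://localhost:30000",  # assumed server address
    "/path/to/new/weights",  # placeholder model path
    target_modules=["transformer"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request with `urllib.request.urlopen(req)` and decoding the JSON body should yield the `success` and `message` fields listed in the response table. Because the update is all-or-nothing, checking `success` is enough to know whether the new weights are live.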
### Update Weights from Tensor
**When to use:**

@@ -1,608 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Speculative Decoding\n",
"\n",
"SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.\n",
"\n",
"## Summary\n",
"\n",
"### Jump to sections\n",
"\n",
"- [EAGLE Decoding](#eagle-decoding)\n",
" - [EAGLE-2 decoding](#eagle-2-decoding)\n",
" - [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile)\n",
" - [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling)\n",
" - [EAGLE-3 Decoding](#eagle-3-decoding)\n",
"- [Multi Token Prediction](#multi-token-prediction)\n",
"- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model)\n",
"- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler)\n",
"- [Ngram Speculative Decoding](#ngram-speculative-decoding)\n",
"\n",
"### Quick guidance\n",
"\n",
"- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`.\n",
"- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`.\n",
"- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`.\n",
"- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section).\n",
"- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`).\n",
"- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only).\n",
"- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`).\n",
"\n",
"### Method comparison (mini table)\n",
"\n",
"| Method | Draft source | Separate draft model? | How to enable | Notes / constraints |\n",
"|---|---|---:|---|---|\n",
"| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` |\n",
"| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Further kernel-level optimizations |\n",
"| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab |\n",
"| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the benchmark above |\n",
"| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models |\n",
"| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does **not** support `--enable-dp-attention` |\n",
"| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` |\n",
"| NGRAM | Ngram cache from previous tokens | No | `--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill |\n",
"\n",
"### Performance Highlights\n",
"\n",
"Please see below for the huge improvements on throughput for LLaMA-Instruct 3.1 8B tested on MT bench that can be achieved via EAGLE3 decoding.\n",
"For further details please see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).\n",
"\n",
"| Method | Throughput (tokens/s) |\n",
"|--------|----------------|\n",
"| SGLang (w/o speculative, 1x H100) | 158.34 tokens/s |\n",
"| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |\n",
"| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## EAGLE Decoding\n",
"\n",
"To enable EAGLE speculative decoding the following parameters are relevant:\n",
"* `speculative_draft_model_path`: Draft model path/weights. **Typically required** for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted (SGLang may auto-handle/auto-fill it).\n",
"* `speculative_num_steps`: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.\n",
"* `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity, will lead to higher acceptance rate, but more lead to higher memory/compute consumption. Default is 4.\n",
"* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation but will lead to higher GPU memory usage. Default is 8.\n",
"\n",
"These parameters are the same for EAGLE-2 and EAGLE-3.\n",
"\n",
"You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).\n",
"\n",
"In the documentation below, we set `--cuda-graph-max-bs` to be a small value for faster engine startup. For your own workloads, please tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, `--mem-fraction-static` for the best performance. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EAGLE-2 decoding\n",
"\n",
"You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"import openai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \\\n",
" --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \\\n",
" --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-2-7b-chat-hf\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EAGLE-2 Decoding with `torch.compile`\n",
"\n",
"You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \\\n",
" --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \\\n",
" --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction-static 0.6 \\\n",
" --enable-torch-compile --torch-compile-max-bs 2 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-2-7b-chat-hf\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling\n",
"\n",
"By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead, accelerating the pipeline without quality degradation. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856).\n",
"\n",
"In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens used by FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).\n",
"\n",
"Thanks for the contribution from [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \\\n",
" --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \\\n",
" --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \\\n",
" --mem-fraction-static 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EAGLE-3 Decoding\n",
"\n",
"You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \\\n",
" --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \\\n",
" --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction-static 0.6 \\\n",
" --cuda-graph-max-bs 2 --dtype float16 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi Token Prediction\n",
"\n",
"We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang by using speculative decoding. We use the `XiaomiMiMo/MiMo-7B-RL` model as an example here (for DeepSeek MTP usage, refer to the [deepseek doc](../basic_usage/deepseek.md#multi-token-prediction))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
" python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \\\n",
" --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \\\n",
" --mem-fraction-static 0.5 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = f\"http://localhost:{port}/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"XiaomiMiMo/MiMo-7B-RL\",\n",
" \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.json())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standalone Speculative Decoding (Small Draft Model)\n",
"\n",
"Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`.\n",
"\n",
"Relevant parameters:\n",
"- `--speculative-draft-model-path`: Draft model weights (smaller than the target model).\n",
"- `--speculative-num-steps`: Draft depth (how many steps the draft model runs autoregressively).\n",
"- `--speculative-eagle-topk`: Branching factor (token candidates per step).\n",
"- `--speculative-num-draft-tokens`: Verification capacity.\n",
"\n",
"Note:\n",
"- Standalone speculative decoding currently **does not support** `--enable-dp-attention`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm STANDALONE \\\n",
" --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \\\n",
" --speculative-num-steps 4 --speculative-eagle-topk 2 --speculative-num-draft-tokens 7 \\\n",
" --cuda-graph-max-bs 8 --mem-fraction-static 0.7 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"Qwen/Qwen2.5-7B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Speculative Decoding V2 (Overlap Scheduler)\n",
"\n",
"SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`).\n",
"\n",
"To enable it, set the environment variable:\n",
"- `SGLANG_ENABLE_SPEC_V2=True`\n",
"\n",
"Notes:\n",
"- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**.\n",
"- If you explicitly set `--speculative-eagle-topk > 1`, the server will error. If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama), which is not supported by SpecV2.\n",
"- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm STANDALONE \\\n",
" --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \\\n",
" --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \\\n",
" --cuda-graph-max-bs 8 --mem-fraction-static 0.7 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"Qwen/Qwen2.5-7B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ngram Speculative Decoding\n",
"\n",
"SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model.\n",
"\n",
"Enable it with:\n",
"- `--speculative-algorithm NGRAM`\n",
"\n",
"Common parameters:\n",
"- `--speculative-num-draft-tokens`: Number of draft tokens verified per step.\n",
"- `--speculative-ngram-min-match-window-size` / `--speculative-ngram-max-match-window-size`: Matching window range.\n",
"- `--speculative-ngram-min-bfs-breadth` / `--speculative-ngram-max-bfs-breadth`: BFS breadth range.\n",
"- `--speculative-ngram-branch-length`: How many recent tokens to insert into the cache.\n",
"- `--speculative-ngram-capacity`: Cache capacity.\n",
"\n",
"Notes:\n",
"- Ngram speculative decoding **only supports CUDA**.\n",
"- It currently **does not support** `--enable-dp-attention`.\n",
"- It disables the overlap scheduler and mixed chunked prefill.\n",
"- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm NGRAM \\\n",
" --speculative-num-draft-tokens 16 \\\n",
" --speculative-ngram-max-match-window-size 12 --speculative-ngram-max-bfs-breadth 10 \\\n",
" --cuda-graph-max-bs 8 --mem-fraction-static 0.8 --log-level warning\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"Qwen/Qwen2.5-7B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
")\n",
"\n",
"print_highlight(f\"Response: {response}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"The EAGLE process is as follows:\n",
"\n",
"- Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$.\n",
"- The next token is then sampled from $p_{k+2}=\\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style, branching out into multiple potential continuations (the branching factor per step is controlled by the `speculative_eagle_topk` parameter) to ensure a more coherent connection of context, and are given as input again.\n",
"- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to select only the top `speculative_num_draft_tokens` final nodes as draft tokens.\n",
"- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner.\n",
"\n",
"This enhances drafting accuracy in two ways: operating on features instead of tokens gives more regular inputs, and additionally passing the tokens from the next timestep minimizes randomness effects from sampling. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.\n",
"\n",
"\n",
"For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train)."
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -0,0 +1,592 @@
# Speculative Decoding
SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines (see [Performance Highlights](#performance-highlights) for measured throughput).
## Summary
### Jump to sections
- [EAGLE Decoding](#eagle-decoding)
- [EAGLE-2 Decoding](#eagle-2-decoding)
- [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile)
- [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling)
- [EAGLE-3 Decoding](#eagle-3-decoding)
- [Multi Token Prediction](#multi-token-prediction)
- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model)
- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler)
- [Ngram Speculative Decoding](#ngram-speculative-decoding)
- [Full Parameter Reference](#full-parameter-reference)
- [OOM Troubleshooting](#oom-troubleshooting)
- [References](#references)
### Quick guidance
- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`.
- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`.
- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`.
- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section).
- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`).
- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only).
- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`).
### Method comparison (mini table)
| Method | Draft source | Separate draft model? | How to enable | Notes / constraints |
|---|---|---:|---|---|
| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` |
| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Further kernel-level optimizations |
| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab |
| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the Performance Highlights benchmark below |
| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models |
| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does **not** support `--enable-dp-attention` |
| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` |
| NGRAM | Ngram cache from previous tokens | No | `--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill |
### Performance Highlights
The table below shows the throughput gains that speculative decoding achieves for LLaMA-Instruct 3.1 8B on MT-Bench: about 1.54x over the non-speculative baseline with EAGLE-2 and about 2.36x with EAGLE-3.
For further details, please see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).
| Method | Throughput (tokens/s) |
|--------|----------------|
| SGLang (w/o speculative, 1x H100) | 158.34 tokens/s |
| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |
| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |
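
The relative speedups implied by the table can be checked with a few lines of Python. The figures are copied from the table; the dictionary keys and variable names are only illustrative.

```python
# Throughput figures (tokens/s) copied from the table above.
throughput = {
    "baseline": 158.34,
    "EAGLE-2": 244.10,
    "EAGLE-3": 373.25,
}

# Speedup of each configuration relative to decoding without speculation.
speedup = {name: tps / throughput["baseline"] for name, tps in throughput.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```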
---
## EAGLE Decoding
The following parameters are relevant when enabling EAGLE speculative decoding:
| Parameter | Description | Default |
|---|---|---|
| `--speculative-draft-model-path` | Draft model path/weights. **Typically required** for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted. | `None` |
| `--speculative-num-steps` | Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. | Auto (`5` for Llama/Grok; `3` for many other models) |
| `--speculative-eagle-topk` | Branching factor per step. Improves candidate diversity and acceptance rate, but increases memory/compute consumption. | Auto (`4` for Llama/Grok; `1` for many other models) |
| `--speculative-num-draft-tokens` | Maximum parallel verification capacity. Allows deeper tree evaluation but increases GPU memory usage. | Auto (`8` for Llama/Grok; `4` for many other models). If `topk=1`, it is adjusted to `num_steps + 1`. |
| `--speculative-accept-threshold-single` | Acceptance threshold for single-token verification. Lower values accept more aggressively. | `1.0` |
| `--speculative-accept-threshold-acc` | Accumulated acceptance threshold across steps. | `1.0` |
| `--speculative-attention-mode` | Attention mode for speculative operations (`prefill` or `decode`), affecting both target verification and draft extension. | `"prefill"` |
| `--speculative-draft-attention-backend` | Override attention backend for the draft model. | `None` (same as target) |
| `--speculative-draft-model-quantization` | Quantization method for the draft model. Use `"unquant"` to force no quantization even when the target model is quantized. | Same as target model |
| `--speculative-draft-model-revision` | Specific revision/commit of the draft model to load. | `None` (auto-set to `"main"` when `--speculative-draft-model-path` is set and revision is omitted) |
| `--speculative-draft-load-format` | Load format for the draft model weights. | `None` |
These parameters are mostly the same for EAGLE-2 and EAGLE-3. `--speculative-token-map` is ignored for EAGLE-3 models.
For `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens`: leave all three unset to use auto-tuning, or set all three explicitly when tuning.
You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).
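
The two rules stated above (set all three tuning parameters or none, and `num_draft_tokens` becoming `num_steps + 1` when `topk` is 1) can be sketched as a small pre-launch check. `resolve_spec_params` is a hypothetical helper written for this illustration, not part of SGLang, and it mirrors only what this section states rather than the server's actual validation logic.

```python
from typing import Optional, Tuple


def resolve_spec_params(
    num_steps: Optional[int],
    topk: Optional[int],
    num_draft_tokens: Optional[int],
) -> Optional[Tuple[int, int, int]]:
    """Hypothetical pre-launch check; not SGLang's own validation code."""
    values = (num_steps, topk, num_draft_tokens)
    if all(v is None for v in values):
        return None  # nothing set: leave the choice to the server's auto-tuning
    if any(v is None for v in values):
        raise ValueError("set all three speculative parameters, or none of them")
    if topk == 1:
        # As described in the table above, topk = 1 adjusts the
        # draft-token count to num_steps + 1.
        num_draft_tokens = num_steps + 1
    return num_steps, topk, num_draft_tokens


print(resolve_spec_params(3, 4, 16))  # kept as given
print(resolve_spec_params(4, 1, 7))   # draft-token count adjusted to 5
```

Running a check like this before launching makes a partially specified configuration fail early instead of mixing hand-picked and auto-tuned values.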
### EAGLE-2 Decoding
You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model.
**Launch the server:**
```bash
python3 -m sglang.launch_server \
--model meta-llama/Llama-2-7b-chat-hf \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
---
### EAGLE-2 Decoding with `torch.compile`
You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:
```bash
python3 -m sglang.launch_server \
--model meta-llama/Llama-2-7b-chat-hf \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.7 \
--enable-torch-compile \
--torch-compile-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
---
### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling
By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead, accelerating the pipeline without quality degradation. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856).
In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens used by FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).
Thanks for the contribution from [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx).
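
The idea behind the token map can be illustrated with a toy, pure-Python sketch: the draft step scores only the `lm_head` rows listed in the map and returns a full-vocabulary id. The function name, the tiny weight matrix, and the greedy selection are all assumptions made for this illustration; they do not reproduce SGLang's or FR-Spec's actual code.

```python
from typing import Sequence


def draft_argmax_with_token_map(
    hidden: Sequence[float],
    lm_head: Sequence[Sequence[float]],  # one row of weights per vocabulary id
    token_map: Sequence[int],            # high-frequency ids kept for drafting
) -> int:
    """Greedy draft token using only the lm_head rows listed in token_map."""
    best_id, best_logit = -1, float("-inf")
    # len(token_map) dot products instead of len(lm_head).
    for token_id in token_map:
        logit = sum(w * h for w, h in zip(lm_head[token_id], hidden))
        if logit > best_logit:
            best_id, best_logit = token_id, logit
    return best_id  # already an id in the full vocabulary


# Toy 6-token vocabulary with a 2-dimensional hidden state.
lm_head = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.3], [2.0, 2.0], [0.5, 0.5], [0.2, 0.8]]
token_map = [1, 4, 5]  # pretend these are the most frequent tokens

print(draft_argmax_with_token_map([1.0, 1.0], lm_head, token_map))
```

In this toy setup token 3 has the highest logit over the full vocabulary but is not in the map, so the draft proposes token 1 instead. Because draft tokens are still verified by the target model, a miss like this would lower the acceptance rate rather than change the final output.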
```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-algorithm EAGLE \
--speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--dtype float16 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
---
### EAGLE-3 Decoding
You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model.
```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
--speculative-num-steps 3 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--dtype float16 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Multi Token Prediction
We support [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang by using speculative decoding. We use `XiaomiMiMo/MiMo-7B-RL` as an example here (for DeepSeek MTP usage, refer to the [deepseek_v32 doc](../basic_usage/deepseek_v32.md#multi-token-prediction)).
```bash
python3 -m sglang.launch_server \
--model XiaomiMiMo/MiMo-7B-RL \
--host 0.0.0.0 \
--trust-remote-code \
--speculative-algorithm EAGLE \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import requests
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "XiaomiMiMo/MiMo-7B-RL",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print(response.json())
```
---
## Standalone Speculative Decoding (Small Draft Model)
Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`.
Relevant parameters:
| Parameter | Description | Default |
|---|---|---|
| `--speculative-draft-model-path` | Draft model weights (smaller than the target model). | `None` |
| `--speculative-num-steps` | Draft depth (how many steps the draft model runs autoregressively). | `3` (auto default for STANDALONE) |
| `--speculative-eagle-topk` | Branching factor (token candidates per step). | `1` (auto default for STANDALONE) |
| `--speculative-num-draft-tokens` | Verification capacity. | `4` (auto default for STANDALONE) |
| `--speculative-draft-model-quantization` | Quantization for the draft model. Use `"unquant"` to disable quantization on the draft even when the target is quantized. | Same as target |
> **Note:** Standalone speculative decoding currently **does not support** `--enable-dp-attention`.
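
To make the draft-then-verify idea concrete, here is a toy sketch of one round of token-level speculative decoding with a single chain (`topk = 1`) and greedy verification. The "models" are plain Python functions invented for this example, and the loop checks positions one by one for readability; a real engine would score all drafted positions in a single target forward pass. None of the names below come from SGLang.

```python
from typing import Callable, List

# Toy stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    target: NextToken, draft: NextToken, context: List[int], num_steps: int
) -> List[int]:
    """One draft-then-verify round; returns the tokens committed this round."""
    # 1) The draft model runs autoregressively for num_steps tokens.
    drafted: List[int] = []
    for _ in range(num_steps):
        drafted.append(draft(context + drafted))

    # 2) The target model checks each drafted position in order.
    accepted: List[int] = []
    for token in drafted:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    accepted.append(target(context + accepted))  # all accepted: one bonus token
    return accepted


def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy target: counts upward modulo 10


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong after a 3


print(speculative_step(target, draft, [1], num_steps=4))
```

With these toy models the draft proposes `[2, 3, 0, 1]`; the first two tokens are accepted and the third is replaced by the target's own choice, so the round commits `[2, 3, 4]`, exactly what the target alone would have generated.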
```bash
python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-7B-Instruct \
--speculative-algorithm STANDALONE \
--speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
--speculative-num-steps 4 \
--speculative-eagle-topk 2 \
--speculative-num-draft-tokens 7 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Speculative Decoding V2 (Overlap Scheduler)
SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`).
To enable it, set the environment variable:
- `SGLANG_ENABLE_SPEC_V2=True`
Notes:
- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**.
- If you explicitly set `--speculative-eagle-topk > 1`, the server will error.
- If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama). This is incompatible with SpecV2 and may not always trigger an immediate config error, so set `--speculative-eagle-topk 1` explicitly.
- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`.
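
If you launch the server from Python, the notes above can be folded into a small helper that always sets the environment variable and passes `--speculative-eagle-topk 1` explicitly. `build_spec_v2_launch` is a hypothetical function written for this illustration (it is not an SGLang API); it uses only the flags and the environment variable named in this section.

```python
import os
from typing import Dict, List, Tuple


def build_spec_v2_launch(
    model: str,
    draft_model: str,
    algorithm: str = "STANDALONE",
    num_steps: int = 4,
) -> Tuple[Dict[str, str], List[str]]:
    """Hypothetical helper: returns (env, argv) for a SpecV2 server launch."""
    if algorithm not in {"EAGLE", "EAGLE3", "STANDALONE"}:
        raise ValueError(f"this sketch only covers EAGLE, EAGLE3, STANDALONE; got {algorithm}")
    env = dict(os.environ, SGLANG_ENABLE_SPEC_V2="True")
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model", model,
        "--speculative-algorithm", algorithm,
        "--speculative-draft-model-path", draft_model,
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", "1",  # always explicit, never left to auto-tuning
        "--speculative-num-draft-tokens", str(num_steps + 1),
    ]
    return env, argv


env, argv = build_spec_v2_launch("Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen2.5-1.5B-Instruct")
print(" ".join(argv))
```

The returned pair can be handed to `subprocess.Popen(argv, env=env)`. Building the argument list in one place makes it hard to forget the explicit `topk` setting.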
```bash
SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-7B-Instruct \
--speculative-algorithm STANDALONE \
--speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
--speculative-num-steps 4 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 5 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
---
## Ngram Speculative Decoding
SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model.
Enable it with:
- `--speculative-algorithm NGRAM`
### Ngram-specific parameters
| Parameter | Description | Default |
|---|---|---|
| `--speculative-num-draft-tokens` | Number of draft tokens verified per step. If omitted, defaults to `--speculative-ngram-max-match-window-size`. | `12` (with default ngram settings) |
| `--speculative-ngram-min-match-window-size` | Minimum matching window size. | `1` |
| `--speculative-ngram-max-match-window-size` | Maximum matching window size. | `12` |
| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth. | `1` |
| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth. | `10` |
| `--speculative-ngram-match-type` | Match type: `"BFS"` or `"PROB"`. | `"BFS"` |
| `--speculative-ngram-branch-length` | How many recent tokens to insert into the cache. | `18` |
| `--speculative-ngram-capacity` | Cache capacity (number of entries). | `10,000,000` |
Notes:
- Ngram speculative decoding **only supports CUDA**.
- It currently **does not support** `--enable-dp-attention`.
- It disables the overlap scheduler and mixed chunked prefill.
- If `--speculative-ngram-max-bfs-breadth > 1` (thus `speculative_eagle_topk > 1`) and `page_size > 1`, use `--attention-backend flashinfer`; otherwise the server will error.
- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification.
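
The retrieval idea can be sketched with a toy n-gram cache: index what followed each n-gram in the tokens seen so far, then match the longest suffix of the current sequence and copy its continuation as the draft. This is a simplified, single-branch illustration under assumed names; it borrows the window-size and draft-token parameters by name only and does not model SGLang's BFS breadth, match types, or cache capacity.

```python
from typing import Dict, List, Tuple


def build_ngram_cache(tokens: List[int], max_window: int) -> Dict[Tuple[int, ...], int]:
    """Map each n-gram (n <= max_window) to the index right after its latest occurrence."""
    cache: Dict[Tuple[int, ...], int] = {}
    for n in range(1, max_window + 1):
        # Stop one short of the end so every stored n-gram has a continuation.
        for start in range(len(tokens) - n):
            cache[tuple(tokens[start:start + n])] = start + n
    return cache


def propose_draft(
    tokens: List[int],
    cache: Dict[Tuple[int, ...], int],
    min_window: int,
    max_window: int,
    num_draft_tokens: int,
) -> List[int]:
    """Match the longest cached suffix and copy the tokens that followed it."""
    for n in range(min(max_window, len(tokens)), min_window - 1, -1):
        pos = cache.get(tuple(tokens[-n:]))
        if pos is not None:
            return tokens[pos:pos + num_draft_tokens]
    return []  # no match: nothing to verify this step


history = [5, 6, 7, 8, 9, 5, 6]
cache = build_ngram_cache(history, max_window=3)
print(propose_draft(history, cache, min_window=1, max_window=3, num_draft_tokens=3))
```

Here the sequence ends in `5, 6`, which appeared earlier followed by `7, 8, 9`, so those three tokens are proposed as the draft. The target model would then verify them as in the other methods.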
```bash
python3 -m sglang.launch_server \
--model Qwen/Qwen2.5-7B-Instruct \
--speculative-algorithm NGRAM \
--speculative-num-draft-tokens 16 \
--speculative-ngram-max-match-window-size 12 \
--speculative-ngram-max-bfs-breadth 10 \
--mem-fraction-static 0.7 \
--cuda-graph-max-bs 8 \
--log-level warning
```
**Send a request:**
```python
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "user", "content": "List 3 countries and their capitals."},
],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
```
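The retrieval idea behind ngram drafting can be illustrated with a toy sketch. This is not SGLang's implementation (which uses its own cache structure with BFS/PROB matching); the helper names `build_ngram_cache` and `propose_draft` are hypothetical, and the cache here is a plain dictionary that maps a context window to the tokens that followed it.

```python
from collections import defaultdict


def build_ngram_cache(tokens, max_window):
    """Map every window of 1..max_window tokens to the tokens that followed it."""
    cache = defaultdict(list)
    for n in range(1, max_window + 1):
        for i in range(len(tokens) - n):
            cache[tuple(tokens[i : i + n])].append(tokens[i + n])
    return cache


def propose_draft(tokens, cache, min_window, max_window, num_draft):
    """Greedily extend the sequence using the longest matching window first."""
    context = list(tokens)
    draft = []
    while len(draft) < num_draft:
        next_token = None
        # Try the longest window first, then fall back to shorter ones.
        for n in range(min(max_window, len(context)), min_window - 1, -1):
            followers = cache.get(tuple(context[-n:]))
            if followers:
                next_token = followers[-1]  # most recently seen continuation
                break
        if next_token is None:
            break  # no match: stop drafting early
        draft.append(next_token)
        context.append(next_token)
    return draft


history = [1, 2, 3, 4, 1, 2, 3]
cache = build_ngram_cache(history, max_window=3)
print(propose_draft(history, cache, min_window=1, max_window=3, num_draft=4))
```

The drafted tokens are only proposals; in the real system the target model verifies them and discards any that it would not have generated.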
---
## Full Parameter Reference
Below is a comprehensive list of all speculative decoding parameters available in SGLang:
### Core parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-algorithm` | `str` | `None` | Algorithm to use: `EAGLE`, `EAGLE3`, `STANDALONE`, `NGRAM`, `NEXTN` (alias of `EAGLE`) |
| `--speculative-draft-model-path` | `str` | `None` | Path to the draft model weights |
| `--speculative-draft-model-revision` | `str` | `None` | Specific revision/commit of the draft model (`"main"` is auto-used when draft path is set and revision is omitted) |
| `--speculative-draft-load-format` | `str` | `None` | Load format for draft model weights |
| `--speculative-num-steps` | `int` | `None` (auto-chosen when omitted) | Autoregressive drafting depth |
| `--speculative-eagle-topk` | `int` | `None` (auto-chosen when omitted) | Branching factor per drafting step |
| `--speculative-num-draft-tokens` | `int` | `None` (auto-chosen when omitted) | Maximum number of draft tokens for verification |
| `--speculative-accept-threshold-single` | `float` | `1.0` | Single-token acceptance threshold |
| `--speculative-accept-threshold-acc` | `float` | `1.0` | Accumulated acceptance threshold |
| `--speculative-token-map` | `str` | `None` | Path to FR-Spec high-frequency token map |
| `--speculative-attention-mode` | `str` | `"prefill"` | Attention mode for speculative operations (`"prefill"` or `"decode"`) |
| `--speculative-draft-attention-backend` | `str` | `None` | Override attention backend for the draft model |
| `--speculative-moe-runner-backend` | `str` | `None` | MoE runner backend for the draft model |
| `--speculative-moe-a2a-backend` | `str` | `None` | MoE all-to-all backend for the draft model |
| `--speculative-draft-model-quantization` | `str` | Same as target | Quantization for the draft model (`"unquant"` to disable) |
### Ngram-specific parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--speculative-ngram-min-match-window-size` | `int` | `1` | Minimum ngram matching window |
| `--speculative-ngram-max-match-window-size` | `int` | `12` | Maximum ngram matching window |
| `--speculative-ngram-min-bfs-breadth` | `int` | `1` | Minimum BFS breadth |
| `--speculative-ngram-max-bfs-breadth` | `int` | `10` | Maximum BFS breadth |
| `--speculative-ngram-match-type` | `str` | `"BFS"` | Match type: `"BFS"` or `"PROB"` |
| `--speculative-ngram-branch-length` | `int` | `18` | Recent tokens to insert into cache |
| `--speculative-ngram-capacity` | `int` | `10,000,000` | Cache capacity |
### Environment variables
| Variable | Default | Description |
|---|---|---|
| `SGLANG_ENABLE_SPEC_V2` | `False` | Enable Speculative Decoding V2 (overlap scheduler) |
| `SGLANG_NGRAM_FORCE_GREEDY_VERIFY` | `False` | Force greedy verification for ngram decoding |
### Other related flags
| Parameter | Description |
|---|---|
| `--enable-multi-layer-eagle` | Enable multi-layer EAGLE (auto-enabled for MiMoV2 and Step3p5 models) |
| `--enable-torch-compile` | Enable `torch.compile` for kernel-level optimizations |
| `--torch-compile-max-bs` | Maximum batch size for `torch.compile` |
---
## OOM Troubleshooting
> [!WARNING]
> **Out of Memory (OOM)?** Speculative decoding may increase GPU memory usage because the draft tree, CUDA graphs, and verification-related buffers consume additional VRAM. If you encounter OOM errors, try the following adjustments.
### Step 1: Reduce draft tree size (most effective)
These three parameters directly control how much memory the draft tree consumes:
```bash
# Before (aggressive, high memory)
--speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64
# After (conservative, lower memory)
--speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16
```
- **`--speculative-num-draft-tokens`**: This is the single most impactful parameter. Reducing from 64 → 16 can cut draft-related memory by ~75%. Start here.
- **`--speculative-eagle-topk`**: Reducing from 8 → 4 or even 2 halves the branching factor.
- **`--speculative-num-steps`**: Reducing from 5 → 3 shortens the draft depth.
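As a quick check of the numbers above, the sketch below computes the relative reduction for each parameter change. It assumes, as a simplification, that draft-related memory scales roughly linearly with the parameter being reduced; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction saved when a quantity shrinks from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# --speculative-num-draft-tokens: 64 -> 16
print(f"{relative_reduction(64, 16):.0%}")  # 75%
# --speculative-eagle-topk: 8 -> 4
print(f"{relative_reduction(8, 4):.0%}")  # 50%
```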
### Step 2: Lower static memory fraction
```bash
# Give more room for dynamic allocations (CUDA graphs, draft model, etc.)
--mem-fraction-static 0.5 # when omitted, this value is auto-computed
```
### Step 3: Reduce CUDA graph batch size
```bash
# Fewer CUDA graph captures = less memory reserved
--cuda-graph-max-bs 4 # or even 2 for tight memory situations
```
### Step 4: Limit concurrent requests
```bash
# Fewer concurrent requests lowers in-flight load and can reduce OOM risk
--max-running-requests 4
```
### Step 5: Use quantization
```bash
# Quantize the target model (if supported by your checkpoint/hardware)
--quantization fp8
# Or quantize only the draft model (keep target at full precision)
--speculative-draft-model-quantization fp8
```
### Step 6: Use a smaller dtype
```bash
--dtype float16 # instead of bfloat16/float32 (when supported)
```
### Step 7: Use FR-Spec to reduce lm_head memory (EAGLE-2 / STANDALONE)
```bash
--speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt
```
> Note: For EAGLE-3, `--speculative-token-map` is ignored because EAGLE-3 models already provide built-in hot-token handling.
### Quick OOM recovery recipe
If you're hitting OOM and just want something that works, start with this minimal configuration and scale up:
```bash
python3 -m sglang.launch_server \
--model <your-model> \
--speculative-algorithm EAGLE \
--speculative-draft-model-path <your-draft-model> \
--speculative-num-steps 2 \
--speculative-eagle-topk 2 \
--speculative-num-draft-tokens 8 \
--cuda-graph-max-bs 2 \
--mem-fraction-static 0.5 \
--max-running-requests 4 \
--dtype float16 \
--log-level warning
```
Then gradually increase `--speculative-num-draft-tokens`, `--speculative-eagle-topk`, and `--cuda-graph-max-bs` until you find the sweet spot for your GPU.
> [!TIP]
> **Memory budget rule of thumb**: during automatic `--mem-fraction-static` estimation, STANDALONE reserves about 6 GB and EAGLE/EAGLE3 reserves about 2 GB as additional headroom. Plan your `--mem-fraction-static` accordingly.
---
## References
The EAGLE process is as follows:
- Within EAGLE, the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$.
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style—branching out multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter—to ensure a more coherent connection of context, and are given as input again.
- In SGLang's EAGLE-2 implementation, the draft tree is expanded for the configured steps and then reranked to select the top `speculative_num_draft_tokens` final nodes as draft tokens.
- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner.
This enhances drafting accuracy by operating on features instead of tokens for more regular inputs and by additionally passing tokens from the next timestep to reduce sampling randomness. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train).
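The expand-then-rerank step described above can be sketched with a toy example. This is a simplified illustration under stated assumptions, not SGLang's implementation: `next_token_probs` is a hypothetical stand-in for the draft model, scores are cumulative log-probabilities, and details such as how the root token is counted are ignored.

```python
import heapq
import math


def expand_and_rerank(next_token_probs, root, num_steps, topk, num_draft_tokens):
    """Grow a draft tree for num_steps, then keep the best-scoring nodes.

    next_token_probs(path) returns {token: prob} with prob > 0 (hypothetical
    stand-in for the draft model).
    """
    frontier = [(0.0, (root,))]  # (cumulative log-prob, path from root)
    candidates = []
    for _ in range(num_steps):
        children = []
        for score, path in frontier:
            probs = next_token_probs(path)
            # Branch into the topk most likely next tokens.
            for token, p in heapq.nlargest(topk, probs.items(), key=lambda kv: kv[1]):
                children.append((score + math.log(p), path + (token,)))
        candidates.extend(children)
        # Only the topk best children are expanded in the next step.
        frontier = heapq.nlargest(topk, children, key=lambda c: c[0])
    # Rerank all nodes seen so far and keep the final draft set.
    return heapq.nlargest(num_draft_tokens, candidates, key=lambda c: c[0])


def toy_model(path):
    # Same distribution at every node, for illustration only.
    return {"a": 0.6, "b": 0.3, "c": 0.1}


for score, path in expand_and_rerank(toy_model, "r", num_steps=2, topk=2, num_draft_tokens=3):
    print(path, round(math.exp(score), 2))
```

In this toy run, the shallow node `("r", "b")` outranks deeper but less likely nodes, which shows why reranking over the whole tree differs from simply keeping the last level.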

View File

@@ -54,7 +54,7 @@
" \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")"
]
},
@@ -740,7 +740,6 @@
"import json\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"prompts = [\n",
" \"Give me the information of the capital of China in the JSON format.\",\n",
" \"Give me the information of the capital of France in the JSON format.\",\n",

View File

@@ -50,7 +50,7 @@
" \"python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")"
]
},
@@ -642,7 +642,6 @@
"import json\n",
"from pydantic import BaseModel, Field\n",
"\n",
"\n",
"prompts = [\n",
" \"Give me the information of the capital of China in the JSON format.\",\n",
" \"Give me the information of the capital of France in the JSON format.\",\n",

View File

@@ -60,7 +60,7 @@
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\" # qwen25\n",
")\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
]
},
{
@@ -550,7 +550,9 @@
"server_process_tool_choice, port_tool_choice = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\"\n",
")\n",
"wait_for_server(f\"http://localhost:{port_tool_choice}\")\n",
"wait_for_server(\n",
" f\"http://localhost:{port_tool_choice}\", process=server_process_tool_choice\n",
")\n",
"\n",
"# Initialize client for tool choice examples\n",
"client_tool_choice = OpenAI(\n",
@@ -695,7 +697,7 @@
"server_process, port = launch_server_cmd(\n",
" \" python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1 --log-level warning\" # llama-3.2-1b-instruct\n",
")\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
"\n",
"tools = [\n",
" {\n",

View File

@@ -117,7 +117,6 @@
"source": [
"from sglang import Engine\n",
"\n",
"\n",
"llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")"
]
},

View File

@@ -66,9 +66,13 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --n
- `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs. It requires bf16 q, kv inputs.
- `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU.
- `aiter`: Aiter kernel on AMD GPUs. Can only be used as decode kernel.
- `trtllm`: `trtllm-mla` sparse kernel from the flashinfer library. Can only run on Blackwell GPUs. It requires bf16 or fp8 QKV inputs.
- Based on performance benchmarks, the default configurations on H200 and B200 are set as follows:
- H200: `flashmla_sparse` prefill attention (short-seq prefill uses MHA via FlashAttention varlen), `fa3` decode attention, `bf16` kv cache dtype.
- B200: `flashmla_auto` prefill attention (short-seq prefill uses MHA via TRT-LLM ragged), `flashmla_kv` decode attention, `fp8_e4m3` kv cache dtype. `flashmla_auto` enables automatic selection of either `flashmla_sparse` or `flashmla_kv` kernel for prefill based on KV cache dtype, hardware, and heuristics. When FP8 KV cache is enabled and `total_kv_tokens < total_q_tokens * 512`, it uses the `flashmla_sparse` kernel; otherwise, it falls back to the `flashmla_kv` kernel. The heuristics may need to be tuned if the performance of either the `flashmla_sparse` or `flashmla_kv` kernel changes significantly.
- On the Blackwell platform, performance can improve by up to 3x-5x at the cost of a slight accuracy drop:
- B200: choosing `trtllm` for both `--nsa-prefill-backend` and `--nsa-decode-backend` makes prefill attention use MHA via TRT-LLM ragged for both short and long sequences (**accuracy impact**). When `trtllm` is combined with the `fp8_e4m3` kv cache, the kv cache dim is `576` (kv_lora_rank + qk_rope_head_dim) (**accuracy impact**), compared with `656` for the combination of `flashmla_auto` and the `fp8_e4m3` kv cache (kv_lora_rank + scale storage (kv_lora_rank // quant_block_size * 4 bytes) + rope dimension storage).
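The two kv cache dims quoted above can be reproduced with a short calculation. The concrete values used here (`kv_lora_rank=512`, `qk_rope_head_dim=64`, `quant_block_size=128`, and rope stored at 2 bytes per element) are assumptions chosen because they reproduce `576` and `656`; check them against your model config.

```python
def kv_cache_dim_trtllm(kv_lora_rank, qk_rope_head_dim):
    """kv_lora_rank + qk_rope_head_dim, all stored in fp8 (1 byte each)."""
    return kv_lora_rank + qk_rope_head_dim


def kv_cache_dim_flashmla(kv_lora_rank, qk_rope_head_dim, quant_block_size,
                          scale_bytes=4, rope_bytes_per_elem=2):
    """fp8 latent + per-block scales + rope kept at higher precision (assumed)."""
    scale_storage = kv_lora_rank // quant_block_size * scale_bytes
    rope_storage = qk_rope_head_dim * rope_bytes_per_elem
    return kv_lora_rank + scale_storage + rope_storage


print(kv_cache_dim_trtllm(512, 64))         # 576
print(kv_cache_dim_flashmla(512, 64, 128))  # 656
```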
## Multi-token Prediction
SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. Please look at [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
@@ -308,9 +312,7 @@ For context parallel in DeepSeek V3.2 model, we provide two different modes of s
### In sequence splitting
The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At attention stage, each cp rank computes the indexer results of sharded sequence, and collects the whole kv cache through all gather operator.
The communication group for context parallel reuses the one for attention tp, thus `cp_size` equals `atten_tp_size = tp_size / dp_size`.
The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At attention stage, each cp rank computes the indexer results of sharded sequence, and collects the whole kv cache through all gather operator. Set `attn_cp_size` (`--attn-cp-size`) to define the size of the communication group used for context parallel.
Note that in sequence splitting mode has the following restrictions:
- The batch size is restricted to 1 for prefill batches
@@ -323,7 +325,7 @@ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/
Example:
```bash
# In-seq splitting mode launched with EP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode in-seq-split --max-running-requests 32
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --attn-cp-size 4 --nsa-prefill-cp-mode in-seq-split --max-running-requests 32
```
### Round robin splitting (default setting)
@@ -337,7 +339,7 @@ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/
Example usage:
```bash
# Launch with FusedMoe + CP8
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split --max-running-requests 32
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --attn-cp-size 8 --nsa-prefill-cp-mode round-robin-split --max-running-requests 32
```
### Pipeline Parallel + Context Parallel (PP + CP)
@@ -361,6 +363,7 @@ python3 -m sglang.launch_server \
--tp 8 --pp-size 2 \
--dp-size 1 --moe-dense-tp-size 1 \
--enable-nsa-prefill-context-parallel \
--attn-cp-size 8 \
--nsa-prefill-cp-mode round-robin-split \
--trust-remote-code \
--disable-radix-cache \
@@ -384,6 +387,7 @@ python3 -m sglang.launch_server \
--tp 8 --pp-size 2 \
--dp-size 1 --moe-dense-tp-size 1 \
--enable-nsa-prefill-context-parallel \
--attn-cp-size 8 \
--nsa-prefill-cp-mode round-robin-split \
--trust-remote-code \
--disable-radix-cache \
@@ -411,6 +415,7 @@ python -m sglang.launch_server \
--tp 8 --pp-size 2 \
--dp-size 1 --moe-dense-tp-size 1 \
--enable-nsa-prefill-context-parallel \
--attn-cp-size 8 \
--nsa-prefill-cp-mode round-robin-split \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--trust-remote-code \
@@ -436,6 +441,7 @@ python -m sglang.launch_server \
--tp 8 --pp-size 2 \
--dp-size 1 --moe-dense-tp-size 1 \
--enable-nsa-prefill-context-parallel \
--attn-cp-size 8 \
--nsa-prefill-cp-mode round-robin-split \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--trust-remote-code \

View File

@@ -1,19 +0,0 @@
# Diffusion
SGLang supports two categories of diffusion models for different use cases. This page covers image and video generation; for diffusion LLMs, see [Diffusion LLMs](diffusion_llms.md).
## Image & Video Generation Models
For generating images and videos from text prompts, SGLang supports [many](../supported_models/image_generation/diffusion_models.md#image-generation-models) models like:
- **FLUX, Qwen-Image** - High-quality image generation
- **Wan 2.2, HunyuanVideo** - Video generation
```bash
# Example: Launch FLUX for image generation
python3 -m sglang.launch_server \
--model-path black-forest-labs/FLUX.2-klein-4B \
--host 0.0.0.0 --port 30000
```
**Full model list:** [Diffusion Models](../supported_models/image_generation/diffusion_models.md)

View File

@@ -1,14 +0,0 @@
# Diffusion Language Models (dLLMs)
These are text-generation models that use diffusion (denoising) instead of autoregressive decoding:
- **LLaDA** - Large Language Diffusion with mAsking
```bash
# Example: Launch LLaDA for text generation
python3 -m sglang.launch_server \
--model-path GSAI-ML/LLaDA-8B-Instruct \
--host 0.0.0.0 --port 30000
```
**Full model list:** [Diffusion Language Models](../supported_models/text_generation/diffusion_language_models.md)

View File

@@ -49,7 +49,7 @@
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
]
},
{
@@ -275,14 +275,12 @@
"metadata": {},
"outputs": [],
"source": [
"embedding_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"embedding_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
" --host 0.0.0.0 --is-embedding --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=embedding_process)"
]
},
{
@@ -324,14 +322,12 @@
"metadata": {},
"outputs": [],
"source": [
"reranker_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"reranker_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n",
" --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=reranker_process)"
]
},
{
@@ -392,14 +388,12 @@
"metadata": {},
"outputs": [],
"source": [
"score_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"score_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
" --host 0.0.0.0 --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=score_process)"
]
},
{
@@ -456,13 +450,11 @@
"# Note that SGLang now treats embedding models and reward models as the same type of models.\n",
"# This will be updated in the future.\n",
"\n",
"reward_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"reward_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=reward_process)"
]
},
{
@@ -526,7 +518,7 @@
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=expert_record_server_process)"
]
},
{
@@ -575,13 +567,11 @@
"metadata": {},
"outputs": [],
"source": [
"tokenizer_free_server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"tokenizer_free_server_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=tokenizer_free_server_process)"
]
},
{

View File

@@ -39,7 +39,7 @@
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},

View File

@@ -30,14 +30,12 @@
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"embedding_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"embedding_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
" --host 0.0.0.0 --is-embedding --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=embedding_process)"
]
},
{

View File

@@ -33,13 +33,11 @@
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"vision_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"vision_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=vision_process)"
]
},
{

View File

@@ -31,14 +31,12 @@
"# This is equivalent to running the following command in your terminal\n",
"# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
"server_process, port = launch_server_cmd(\"\"\"\n",
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
" --host 0.0.0.0 --log-level warning\n",
"\"\"\"\n",
")\n",
"\"\"\")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
]
},
{

View File

@@ -2,28 +2,39 @@
## Benchmark
- Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`.
Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
- Without a server (do not need to launch a server)
```bash
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
```
- With a server (please use `sglang.launch_server` to launch a server first and run the following command.)
```bash
python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
```
SGLang provides four benchmark tools that operate at different levels of the stack. The table below summarizes their key differences:
| Tool | HTTP Server | Scheduler | Use Case |
| -------------------------- | --------------------------------------------- | --------------------------------------- | -------------------------------------------------------------------------- |
| `bench_serving` | Yes (async HTTP client to a running server) | Yes (indirectly, via server) | Realistic online serving benchmarks with latency metrics (TTFT, TPOT, ITL) |
| `bench_one_batch_server` | Yes (sends HTTP requests to a running server) | Yes (indirectly, via server) | End-to-end single-batch latency including HTTP and scheduler overhead |
| `bench_offline_throughput` | No | Yes (directly uses `Engine` in-process) | Maximum throughput measurement without HTTP overhead |
| `bench_one_batch` | No | No (directly calls `ModelRunner`) | Kernel-level latency profiling of a single static batch |
- Benchmark offline processing. This script will start an offline engine and run the benchmark.
Use `bench_serving` by default unless there are specific needs.
**`bench_serving`** is an async HTTP load-testing client that sends requests at controlled rates with configurable concurrency to a running server. It measures realistic online serving metrics including time-to-first-token (TTFT), time-per-output-token (TPOT), inter-token latency (ITL), and throughput. Use `num-prompts >= 5 * max-concurrency` to measure steady-state performance. Launch a server with `sglang.launch_server` first.
```bash
python3 -m sglang.bench_serving --backend sglang --max-concurrency 16 --num-prompts 80 --random-input-len 256 --random-output-len 32 --dataset-name random
```
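For readers unfamiliar with these metrics, the sketch below shows one common way to derive them from per-token arrival timestamps. It is an illustration of the usual definitions, not the exact computation inside `bench_serving`; the function name `latency_metrics` is hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT, and ITL (seconds) from token arrival timestamps."""
    if not token_times:
        raise ValueError("at least one output token is required")
    # Time to first token: delay before the first output token arrives.
    ttft = token_times[0] - request_start
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # Time per output token: average decode time, excluding the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1) if itl else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5]))
```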
**`bench_one_batch_server`** sends a single batch as one HTTP request to a running server. Because there is only a single batch, the server never reaches a steady state, so the metrics will be biased. Launch a server with `sglang.launch_server` first.
```bash
python3 -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
```
**`bench_offline_throughput`** directly instantiates the `Engine` object in-process (no HTTP server) and submits all requests at once via `engine.generate()`. The engine's scheduler handles batching and execution. This measures maximum achievable throughput without any network overhead.
```bash
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
```
- Benchmark online serving. Please use `sglang.launch_server` to launch a server first and run the following command.
**`bench_one_batch`** is the lowest-level tool. It directly instantiates a `ModelRunner` and calls `extend()` / `decode()` on a fixed static batch, bypassing the scheduler entirely. The prefill and decode phases are run separately, making profiling easier but rendering the metrics unrealistic. Because there is no dynamic batching, it may run out of memory for batch sizes that a real server can handle (a real server chunks prefill into smaller batches). This is best suited for profiling individual kernel performance.
```bash
python3 -m sglang.bench_serving --backend sglang --num-prompt 10
python3 -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
```
## Profile with PyTorch Profiler

View File

@@ -26,7 +26,7 @@ python -m sglang.test.run_eval \
```bash
python -m sglang.test.few_shot_gsm8k \
--host http://127.0.0.1 \
--host 127.0.0.1 \
--port 30000 \
--num-questions 200 \
--num-shots 5
@@ -36,7 +36,7 @@ python -m sglang.test.few_shot_gsm8k \
```bash
python benchmark/hellaswag/bench_sglang.py \
--host http://127.0.0.1 \
--host 127.0.0.1 \
--port 30000 \
--num-questions 200 \
--num-shots 20
@@ -54,7 +54,7 @@ python -m sglang.test.run_eval \
```
```{tip}
For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-r1`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
```
**HumanEval**

View File

@@ -5,7 +5,6 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
## Prerequisites
- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
- Python 3.11+ if you plan to use the OpenAI Python SDK.
## Supported Arguments
@@ -13,7 +12,6 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
### Server Arguments
- `--model-path {MODEL_PATH}`: Path to the model or model ID
- `--vae-path {VAE_PATH}`: Path to a custom VAE model or HuggingFace model ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`). If not specified, the VAE will be loaded from the main model path.
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter. (default: `default`).
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
@@ -35,7 +33,7 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
- `--seed {SEED}`: Random seed for reproducible generation
#### Image/Video Configuration
**Image/Video Configuration**
- `--height {HEIGHT}`: Height of the generated output
- `--width {WIDTH}`: Width of the generated output
@@ -43,7 +41,7 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task
#### Output Options
**Output Options**
- `--output-path {PATH}`: Directory to save the generated video
- `--save-output`: Whether to save the image/video to disk
@@ -168,7 +166,7 @@ When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
3. Upon successful upload, the local file is deleted.
4. The API response returns the public URL of the uploaded object.
#### Configuration
**Configuration**
Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.
@@ -183,7 +181,7 @@ export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
```
See [Environment Variables Documentation](environment_variables.md) for more details.
See [Environment Variables Documentation](../environment_variables.md) for more details.
## Generate
@@ -219,6 +217,32 @@ Once the generation task has finished, the server will shut down automatically.
> [!NOTE]
> The HTTP server-related arguments are ignored in this subcommand.
## Component Path Overrides
SGLang diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for:
### Example: FLUX.2-dev with Tiny AutoEncoder
You can override **any** component by using `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`:
For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:
```bash
# With a Hugging Face repo ID
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder

# Or with a local path
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
```
**Important:**
- The component key must match the one in your model's `model_index.json` (e.g., `vae`).
- The path must:
  - either be a Hugging Face repo ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`)
  - or point to a **complete component folder**, containing `config.json` and safetensors files
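To see which component keys a model exposes, you can inspect its `model_index.json`. The sketch below lists the component keys and a candidate override flag for each. It assumes the usual diffusers layout, where component entries are `[library, class]` lists and metadata keys start with `_`. Converting underscores to hyphens in the flag name is also an assumption, so confirm the exact spelling with `sglang serve --help`.

```python
import json


def component_override_flags(model_index_text):
    """Return {component_key: candidate_flag} from model_index.json content."""
    index = json.loads(model_index_text)
    flags = {}
    for key, value in index.items():
        # Skip metadata such as "_class_name" and non-component entries.
        if key.startswith("_") or not isinstance(value, list):
            continue
        # Assumed flag spelling: --<component>-path with hyphens.
        flags[key] = f"--{key.replace('_', '-')}-path"
    return flags


# Hypothetical, trimmed-down model_index.json content for illustration.
sample = """
{
  "_class_name": "ExamplePipeline",
  "vae": ["diffusers", "AutoencoderKL"],
  "text_encoder": ["transformers", "ExampleTextModel"]
}
"""
print(component_override_flags(sample))
```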
## Diffusers Backend
SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.

View File

@@ -2,6 +2,10 @@
The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.
## Prerequisites
- Python 3.11+ if you plan to use the OpenAI Python SDK.
## Serve
Launch the server using the `sglang serve` command.
@@ -25,7 +29,7 @@ sglang serve "${SERVER_ARGS[@]}"
- **--model-path**: Path to the model or model ID.
- **--port**: HTTP port to listen on (default: `30000`).
#### Get Model Information
**Get Model Information**
**Endpoint:** `GET /models`
@@ -59,7 +63,7 @@ curl -sS -X GET "http://localhost:30010/models"
The server implements an OpenAI-compatible Images API under the `/v1/images` namespace.
#### Create an image
**Create an image**
**Endpoint:** `POST /v1/images/generations`
@@ -100,7 +104,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/generations" \
> **Note**
> The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error.
#### Edit an image
**Edit an image**
**Endpoint:** `POST /v1/images/edits`
@@ -130,7 +134,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/edits" \
-F "response_format=url"
```
#### Download image content
**Download image content**
When `response_format=url` is used with `POST /v1/images/edits`, the API returns a relative URL like `/v1/images/<IMAGE_ID>/content`.
@@ -148,7 +152,7 @@ curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace.
#### Create a video
**Create a video**
**Endpoint:** `POST /v1/videos`
@@ -178,7 +182,7 @@ curl -sS -X POST "http://localhost:30010/v1/videos" \
}'
```
#### List videos
**List videos**
**Endpoint:** `GET /v1/videos`
@@ -197,7 +201,7 @@ curl -sS -X GET "http://localhost:30010/v1/videos" \
-H "Authorization: Bearer sk-proj-1234567890"
```
#### Download video content
**Download video content**
**Endpoint:** `GET /v1/videos/{video_id}/content`
@@ -239,7 +243,7 @@ The server supports dynamic loading, merging, and unmerging of LoRA adapters.
- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one
- Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost
#### Set LoRA Adapter
**Set LoRA Adapter**
Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.
@@ -301,7 +305,7 @@ curl -X POST http://localhost:30010/v1/set_lora \
> - Multiple LoRAs applied to the same target will be merged in order
#### Merge LoRA Weights
**Merge LoRA Weights**
Manually merges the currently set LoRA weights into the base model.
@@ -323,7 +327,7 @@ curl -X POST http://localhost:30010/v1/merge_lora_weights \
```
#### Unmerge LoRA Weights
**Unmerge LoRA Weights**
Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA.
@@ -336,7 +340,7 @@ curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
-H "Content-Type: application/json"
```
#### List LoRA Adapters
**List LoRA Adapters**
Returns loaded LoRA adapters and current application status per module.
@@ -389,3 +393,26 @@ Notes:
curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'
```
5. Generate with LoRA B...
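The unmerge-then-set order used in the steps above can also be scripted. The following is a minimal sketch using only the Python standard library. It assumes the server listens on `http://localhost:30010` as in the curl examples, that `unmerge_lora_weights` accepts a POST without a body, and that `set_lora` takes the `lora_nickname` and `lora_path` fields shown above. The helper names `switch_lora_requests` and `switch_lora` are hypothetical and not part of SGLang.

```python
import json
import urllib.request

# Assumed base URL, taken from the curl examples in this document.
BASE_URL = "http://localhost:30010"


def switch_lora_requests(nickname, path, base_url=BASE_URL):
    """Build the two POST requests for a LoRA switch, in the required order.

    Hypothetical helper: unmerge the active LoRA first, then set the new one.
    """
    headers = {"Content-Type": "application/json"}
    unmerge = urllib.request.Request(
        f"{base_url}/v1/unmerge_lora_weights",
        headers=headers,
        method="POST",
    )
    payload = json.dumps({"lora_nickname": nickname, "lora_path": path})
    set_lora = urllib.request.Request(
        f"{base_url}/v1/set_lora",
        data=payload.encode("utf-8"),
        headers=headers,
        method="POST",
    )
    return [unmerge, set_lora]


def switch_lora(nickname, path, base_url=BASE_URL):
    """Send the requests one after another; raises on HTTP errors."""
    for request in switch_lora_requests(nickname, path, base_url):
        with urllib.request.urlopen(request) as response:
            response.read()
```

Calling `switch_lora("lora_b", "path/to/B")` against a running server would correspond to steps 3 and 4 of the workflow above. Building the requests separately from sending them keeps the ordering easy to inspect without a live server.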
### Adjust Output Quality
The server supports adjusting output quality and compression levels for both image and video generation through the `output-quality` and `output-compression` parameters.
#### Parameters
- **`output-quality`** (string, optional): Preset quality level that automatically sets compression. **Default is `"default"`**. Valid values:
- `"maximum"`: Highest quality (100)
- `"high"`: High quality (90)
- `"medium"`: Medium quality (55)
- `"low"`: Lower quality (35)
- `"default"`: Auto-adjust based on media type (50 for video, 75 for image)
- **`output-compression`** (integer, optional): Direct compression level override (0-100). **Default is `None`**. When provided (not `None`), takes precedence over `output-quality`.
- `0`: Lowest quality, smallest file size
- `100`: Highest quality, largest file size
#### Notes
- **Precedence**: When both `output-quality` and `output-compression` are provided, `output-compression` takes precedence
- **Format Support**: Quality settings apply to JPEG and video formats. PNG uses lossless compression and ignores these settings
- **File Size vs Quality**: Lower compression values (or "low" quality preset) produce smaller files but may show visible artifacts
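The precedence rule can be illustrated with a short sketch. This is not the server's implementation: the function name `resolve_compression` and the `media_type` argument are hypothetical, and only the preset values are taken from the parameter list above.

```python
from typing import Optional

# Preset values copied from the parameter list above.
QUALITY_PRESETS = {"maximum": 100, "high": 90, "medium": 55, "low": 35}
# "default" depends on the media type.
DEFAULT_BY_MEDIA = {"video": 50, "image": 75}


def resolve_compression(
    output_quality: str = "default",
    output_compression: Optional[int] = None,
    media_type: str = "image",
) -> int:
    """Return the effective compression level (0-100) for a request."""
    # An explicit output-compression value always wins over the preset.
    if output_compression is not None:
        if not 0 <= output_compression <= 100:
            raise ValueError("output-compression must be between 0 and 100")
        return output_compression
    if output_quality == "default":
        return DEFAULT_BY_MEDIA[media_type]
    if output_quality not in QUALITY_PRESETS:
        raise ValueError(f"unknown output-quality preset: {output_quality!r}")
    return QUALITY_PRESETS[output_quality]
```

For example, under these assumptions a request with `output-quality="low"` and `output-compression=80` resolves to `80`, while a request with neither parameter resolves to `75` for an image and `50` for a video.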

View File

@@ -1,5 +1,4 @@
## Perf baseline generation script
## Perf Baseline Generation Script
`python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`.

View File

@@ -16,26 +16,26 @@ default parameters when initializing and generating videos.
### Video Generation Models
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear AttentionSLA| Sage Sparse Linear AttentionSageSLA|
|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|
| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
| Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | ❌ | ❌ |
| Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ |
| Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ |
| HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ |
| FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ |
| Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
| Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
| Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
| Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
| TurboWan2.1 T2V 1.3B | `IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TurboWan2.1 T2V 14B | `IPostYellow/TurboWan2.1-T2V-14B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention (SLA) | Sage Sparse Linear Attention (SageSLA) | Sparse Video Gen 2 (SVG2) |
|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|:----------------------------------:|
| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ |
| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ |
| Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | ❌ | ❌ | ❌ |
| Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ |
| Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ |
| HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
| FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
| Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
| Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
| Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
| Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
| TurboWan2.1 T2V 1.3B | `IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
| TurboWan2.1 T2V 14B | `IPostYellow/TurboWan2.1-T2V-14B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
**Note**: <br>
1.Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.<br>
**Note**:
1. Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
2. SageSLA is based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`
### Image Generation Models
@@ -55,7 +55,7 @@ default parameters when initializing and generating videos.
This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline.
> Important: \
> Important:
> LoRAs that are not listed here are not necessarily incompatible.
> In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions.
> The entries below simply reflect configurations that have been manually validated by the SGLang team.

View File

@@ -2,7 +2,7 @@
This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`).
## 1. Commit Message Convention
## Commit Message Convention
We follow a structured commit message format to maintain a clean history.
@@ -21,7 +21,7 @@ We follow a structured commit message format to maintain a clean history.
- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc.
- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature").
## 2. Performance Reporting
## Performance Reporting
For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report.
@@ -45,7 +45,7 @@ For PRs that impact **latency**, **throughput**, or **memory usage**, you **shou
```
4. **Paste**: paste the table into the PR description
## 3. CI-Based Change Protection
## CI-Based Change Protection
Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that:

View File

@@ -1,11 +1,11 @@
## Caching Acceleration
These variables configure caching acceleration for Diffusion Transformer (DiT) models.
SGLang supports multiple caching strategies - see [caching documentation](cache/caching.md) for an overview.
SGLang supports multiple caching strategies - see [caching documentation](performance/cache/index.md) for an overview.
### Cache-DiT Configuration
See [cache-dit documentation](cache/cache_dit.md) for detailed configuration.
See [cache-dit documentation](performance/cache/cache_dit.md) for detailed configuration.
| Environment Variable | Default | Description |
|-------------------------------------|---------|------------------------------------------|

98
docs/diffusion/index.md Normal file
View File

@@ -0,0 +1,98 @@
# SGLang Diffusion
SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels and an efficient scheduler loop.
## Key Features
- **Broad Model Support**: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
- **Fast Inference**: Optimized kernels, efficient scheduler loop, and Cache-DiT acceleration
- **Ease of Use**: OpenAI-compatible API, CLI, and Python SDK
- **Multi-Platform**: NVIDIA GPUs (H100, H200, A100, B200, 4090), AMD GPUs (MI300X, MI325X) and Ascend NPU (A2, A3)
---
## Quick Start
### Installation
```bash
uv pip install "sglang[diffusion]" --prerelease=allow
```
See [Installation Guide](installation.md) for more installation methods and ROCm-specific instructions.
### Basic Usage
Generate an image with the CLI:
```bash
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains" \
--save-output
```
Or start a server with the OpenAI-compatible API:
```bash
sglang serve --model-path Qwen/Qwen-Image --port 30010
```
---
## Documentation
### Getting Started
- **[Installation](installation.md)** - Install SGLang Diffusion via pip, uv, Docker, or from source
- **[Compatibility Matrix](compatibility_matrix.md)** - Supported models and optimization compatibility
### Usage
- **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
- **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management
### Performance Optimization
- **[Performance Overview](performance/index.md)** - Overview of all performance optimization strategies
- **[Attention Backends](performance/attention_backends.md)** - Available attention backends (FlashAttention, SageAttention, etc.)
- **[Caching Strategies](performance/cache/)** - Cache-DiT and TeaCache acceleration
- **[Profiling](performance/profiling.md)** - Profiling techniques with PyTorch Profiler and Nsight Systems
### Reference
- **[Environment Variables](environment_variables.md)** - Configuration via environment variables
- **[Support New Models](support_new_models.md)** - Guide for adding new diffusion models
- **[Contributing](contributing.md)** - Contribution guidelines and commit message conventions
- **[CI Performance](ci_perf.md)** - Performance baseline generation script
---
## CLI Quick Reference
### Generate (one-off generation)
```bash
sglang generate --model-path <MODEL> --prompt "<PROMPT>" --save-output
```
### Serve (HTTP server)
```bash
sglang serve --model-path <MODEL> --port 30010
```
### Enable Cache-DiT acceleration
```bash
SGLANG_CACHE_DIT_ENABLED=true sglang generate --model-path <MODEL> --prompt "<PROMPT>"
```
---
## References
- [SGLang GitHub](https://github.com/sgl-project/sglang)
- [Cache-DiT](https://github.com/vipshop/cache-dit)
- [FastVideo](https://github.com/hao-ai-lab/FastVideo)
- [xDiT](https://github.com/xdit-project/xDiT)
- [Diffusers](https://github.com/huggingface/diffusers)

View File

@@ -0,0 +1,95 @@
# Install SGLang-Diffusion
You can install SGLang-Diffusion using one of the methods below.
## Standard Installation (NVIDIA GPUs)
### Method 1: With pip or uv
It is recommended to use uv for a faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[diffusion]" --prerelease=allow
```
### Method 2: From source
```bash
# Clone the repository (this checks out the default branch)
git clone https://github.com/sgl-project/sglang.git
cd sglang
# Install the Python packages
pip install --upgrade pip
pip install -e "python[diffusion]"
# Or, with uv (use this instead of the pip command above)
uv pip install -e "python[diffusion]" --prerelease=allow
```
### Method 3: Using Docker
The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang), built from the [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile).
Replace `<secret>` below with your HuggingFace Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:dev \
zsh -c '\
echo "Installing diffusion dependencies..." && \
pip install -e "python[diffusion]" && \
echo "Starting SGLang-Diffusion..." && \
sglang generate \
--model-path black-forest-labs/FLUX.1-dev \
--prompt "A logo With Bold Large text: SGL Diffusion" \
--save-output \
'
```
## Platform-Specific: ROCm (AMD GPUs)
For AMD Instinct GPUs (e.g., MI300X), you can use the ROCm-enabled Docker image:
```bash
docker run --device=/dev/kfd --device=/dev/dri --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env HF_TOKEN=<secret> \
lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \
sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output
```
For detailed ROCm system configuration and installation from source, see [AMD GPUs](../platforms/amd_gpu.md).
## Platform-Specific: MUSA (Moore Threads GPUs)
For Moore Threads GPUs (MTGPU) with the MUSA software stack:
```bash
# Clone the repository
git clone https://github.com/sgl-project/sglang.git
cd sglang
# Install the Python packages
pip install --upgrade pip
rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
pip install -e "python[all_musa]"
```
## Platform-Specific: Ascend NPU
For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend_npu.md).
Quick test:
```bash
sglang generate --model-path black-forest-labs/FLUX.1-dev \
--prompt "A logo With Bold Large text: SGL Diffusion" \
--save-output
```

View File

@@ -14,6 +14,7 @@ When using the diffusers backend, `--attention-backend` is passed through to dif
- **CUDA**: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
- **ROCm**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
- **MPS**: always uses PyTorch SDPA.
- **NPU**: always uses PyTorch SDPA.
## Backend options
@@ -29,6 +30,7 @@ For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBa
| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | `AITER` | Requires `aiter`. |
| `sparse_video_gen_2_attn` | `SPARSE_VIDEO_GEN_2_ATTN` | Requires `svg`. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
## Selection priority
@@ -47,7 +49,7 @@ Some backends require additional configuration. You can pass these parameters vi
### Supported Configuration Parameters
#### Sliding Tile Attention (`sliding_tile_attn`)
**Sliding Tile Attention (`sliding_tile_attn`)**
| Parameter | Type | Description | Default |
| :--- | :--- | :--- | :--- |
@@ -55,13 +57,13 @@ Some backends require additional configuration. You can pass these parameters vi
| `sta_mode` | `str` | Mode of STA. | `STA_inference` |
| `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` |
#### Video Sparse Attention (`video_sparse_attn`)
**Video Sparse Attention (`video_sparse_attn`)**
| Parameter | Type | Description | Default |
| :--- | :--- | :--- | :--- |
| `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | `0.0` |
#### V-MoBA (`vmoba_attn`)
**V-MoBA (`vmoba_attn`)**
| Parameter | Type | Description | Default |
| :--- | :--- | :--- | :--- |
@@ -82,16 +84,17 @@ Some backends require additional configuration. You can pass these parameters vi
## Platform support matrix
| Backend | CUDA | ROCm | MPS | Notes |
|---|---:|---:|---:|---|
| `fa` | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
| `torch_sdpa` | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| `sliding_tile_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| `sage_attn_3` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
| `video_sparse_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
| `vmoba_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | ✅ | ❌ | ❌ | Requires `aiter`. |
| Backend | CUDA | ROCm | MPS | NPU | Notes |
|---|---:|---:|---:|---:|---|
| `fa` | ✅ | ✅ | ❌ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
| `torch_sdpa` | ✅ | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| `sliding_tile_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
| `sage_attn_3` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
| `video_sparse_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
| `vmoba_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | ✅ | ❌ | ❌ | ❌ | Requires `aiter`. |
| `sparse_video_gen_2_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `svg`. |
## Usage

View File

@@ -1,9 +1,5 @@
# Cache-DiT Acceleration
> **Note**: This is one of two caching strategies available in SGLang.
> For an overview of all caching options, see [caching.md](caching.md).
> For TeaCache documentation, see [teacache.md](teacache.md).
SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss.
## Overview
@@ -136,7 +132,7 @@ sglang generate --model-path black-forest-labs/FLUX.1-dev \
SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and
which to use cached results.
#### SCM Presets
**SCM Presets**
SCM is configured with presets:
@@ -148,7 +144,7 @@ SCM is configured with presets:
| `fast` | ~35% | ~3x | Acceptable |
| `ultra` | ~25% | ~4x | Lower |
##### Usage
**Usage**
```bash
SGLANG_CACHE_DIT_ENABLED=true \
@@ -157,7 +153,7 @@ sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
```
#### Custom SCM Bins
**Custom SCM Bins**
For fine-grained control over which steps to compute vs cache:
@@ -169,7 +165,7 @@ sglang generate --model-path Qwen/Qwen-Image \
--prompt "A futuristic cityscape at sunset"
```
#### SCM Policy
**SCM Policy**
| Policy | Env Variable | Description |
|-----------|---------------------------------------|---------------------------------------------|
@@ -178,22 +174,8 @@ sglang generate --model-path Qwen/Qwen-Image \
## Environment Variables
All Cache-DiT parameters can be set via the following environment variables:
| Environment Variable | Default | Description |
|-------------------------------------|---------|------------------------------------------|
| `SGLANG_CACHE_DIT_ENABLED` | false | Enable Cache-DiT acceleration |
| `SGLANG_CACHE_DIT_FN` | 1 | First N blocks to always compute |
| `SGLANG_CACHE_DIT_BN` | 0 | Last N blocks to always compute |
| `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching |
| `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
| `SGLANG_CACHE_DIT_MC` | 3 | Max continuous cached steps |
| `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
| `SGLANG_CACHE_DIT_TS_ORDER` | 1 | TaylorSeer order (1 or 2) |
| `SGLANG_CACHE_DIT_SCM_PRESET` | none | SCM preset (none/slow/medium/fast/ultra) |
| `SGLANG_CACHE_DIT_SCM_POLICY` | dynamic | SCM caching policy |
| `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins |
| `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins |
All Cache-DiT parameters can be configured via environment variables.
See [Environment Variables](../../environment_variables.md) for the complete list.
## Supported Models
@@ -240,4 +222,4 @@ acceleration still works.
## References
- [Cache-Dit](https://github.com/vipshop/cache-dit)
- [SGLang Diffusion](../README.md)
- [SGLang Diffusion](../index.md)

View File

@@ -1,7 +1,7 @@
# TeaCache Acceleration
> **Note**: This is one of two caching strategies available in SGLang.
> For an overview of all caching options, see [caching.md](caching.md).
> For an overview of all caching options, see [caching](../index.md).
TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.

View File

@@ -0,0 +1,72 @@
# Performance Optimization
SGLang-Diffusion provides multiple performance optimization strategies to accelerate inference. This section covers all available performance tuning options.
## Overview
| Optimization | Type | Description |
|--------------|------|-------------|
| **Cache-DiT** | Caching | Block-level caching with DBCache, TaylorSeer, and SCM |
| **TeaCache** | Caching | Timestep-level caching using L1 similarity |
| **Attention Backends** | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) |
| **Profiling** | Diagnostics | PyTorch Profiler and Nsight Systems guidance |
## Caching Strategies
SGLang supports two complementary caching approaches:
### Cache-DiT
[Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with advanced strategies. It can achieve up to **1.69x speedup**.
**Quick Start:**
```bash
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
--prompt "A beautiful sunset over the mountains"
```
**Key Features:**
- **DBCache**: Dynamic block-level caching based on residual differences
- **TaylorSeer**: Taylor expansion-based calibration for optimized caching
- **SCM**: Step-level computation masking for additional speedup
See [Cache-DiT Documentation](cache/cache_dit.md) for detailed configuration.
### TeaCache
TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
**Quick Overview:**
- Tracks L1 distance between modulated inputs across timesteps
- When accumulated distance is below threshold, reuses cached residual
- Supports CFG with separate positive/negative caches
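The skip decision described in the bullets above can be sketched in a few lines. This is a simplified illustration, not SGLang's implementation: the class name `TeaCacheSketch` is hypothetical, inputs are plain lists of floats instead of tensors, and normalizing the L1 distance by the previous input is an assumption. With CFG, one such tracker per branch (positive and negative) would be kept.

```python
from typing import List, Optional, Sequence


class TeaCacheSketch:
    """Tracks accumulated L1 change and decides when to reuse the cache."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev_input: Optional[List[float]] = None

    def should_reuse(self, modulated_input: Sequence[float]) -> bool:
        """Return True if the cached residual can be reused for this step."""
        current = list(modulated_input)
        if self.prev_input is None:
            # First step: nothing is cached yet, so compute fully.
            self.prev_input = current
            return False
        # Relative L1 distance to the previous step's modulated input.
        diff = sum(abs(c - p) for c, p in zip(current, self.prev_input))
        norm = sum(abs(p) for p in self.prev_input) or 1.0
        self.accumulated += diff / norm
        self.prev_input = current
        if self.accumulated < self.threshold:
            return True
        # Too much drift: compute fully and restart the accumulation.
        self.accumulated = 0.0
        return False
```

A caller would run the full transformer forward pass whenever `should_reuse` returns `False` and store the resulting residual, and otherwise add the stored residual to the current input.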
**Supported Models:** Wan (wan2.1, wan2.2), Hunyuan (HunyuanVideo), Z-Image
See [TeaCache Documentation](cache/teacache.md) for detailed configuration.
## Attention Backends
Different attention backends offer varying performance characteristics depending on your hardware and model:
- **FlashAttention**: Fastest on NVIDIA GPUs with fp16/bf16
- **SageAttention**: Alternative optimized implementation
- **xformers**: Memory-efficient attention
- **SDPA**: PyTorch native scaled dot-product attention
See [Attention Backends](attention_backends.md) for platform support and configuration options.
## Profiling
To diagnose performance bottlenecks, SGLang-Diffusion supports profiling tools:
- **PyTorch Profiler**: Built-in Python profiling
- **Nsight Systems**: GPU kernel-level analysis
See [Profiling Guide](profiling.md) for detailed instructions.
## References
- [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
- [TeaCache Paper](https://arxiv.org/abs/2411.14324)

View File

@@ -23,7 +23,7 @@ To add support for a new diffusion model, you will primarily need to define or c
3. **`ComposedPipeline` (not a config)**: This is the central class where you define the structure of your model's generation pipeline. You will create a new class that inherits from `ComposedPipelineBase` and, within it, instantiate and chain together the necessary `PipelineStage`s in the correct order. See `ComposedPipelineBase` and `PipelineStage` base definitions:
- [`ComposedPipelineBase`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines/composed_pipeline_base.py)
- [`PipelineStage`]( https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines/stages/base.py)
- [`PipelineStage`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines/stages/base.py)
- [Central registry (models/config mapping)](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/registry.py)
4. **Modules (components referenced by the pipeline)**: Each pipeline references a set of modules that are loaded from the model repository (e.g., Diffusers `model_index.json`) and assembled via the registry/loader. Common modules include:
@@ -37,7 +37,7 @@ To add support for a new diffusion model, you will primarily need to define or c
## Available Pipeline Stages
You can build your custom `ComposedPipeline` by combining the following available stages as your will. Each stage is responsible for a specific part of the generation process.
You can build your custom `ComposedPipeline` by combining the following available stages as needed. Each stage is responsible for a specific part of the generation process.
| Stage Class | Description |
| -------------------------------- | ------------------------------------------------------------------------------------------------------- |
@@ -45,7 +45,6 @@ You can build your custom `ComposedPipeline` by combining the following availabl
| `TextEncodingStage` | Encodes text prompts into embeddings using one or more text encoders. |
| `ImageEncodingStage` | Encodes input images into embeddings, often used in image-to-image tasks. |
| `ImageVAEEncodingStage` | Specifically encodes an input image into the latent space using a Variational Autoencoder (VAE). |
| `ConditioningStage` | Prepares the conditioning tensors (e.g., from text or image embeddings) for the denoising loop. |
| `TimestepPreparationStage` | Prepares the scheduler's timesteps for the diffusion process. |
| `LatentPreparationStage` | Creates the initial noisy latent tensor that will be denoised. |
| `DenoisingStage` | Executes the main denoising loop, iteratively applying the model (e.g., UNet) to refine the latents. |
@@ -88,15 +87,13 @@ To illustrate the process, let's look at how `Qwen-Image-Edit` is implemented. T
_required_config_modules = ["processor", "scheduler", "text_encoder", "tokenizer", "transformer", "vae"]
def create_pipeline_stages(self, server_args: ServerArgs):
"""Set up pipeline stages sequentially."""
self.add_stage(stage_name="input_validation_stage", stage=InputValidationStage())
self.add_stage(stage_name="prompt_encoding_stage_primary", stage=ImageEncodingStage(...))
self.add_stage(stage_name="image_encoding_stage_primary", stage=ImageVAEEncodingStage(...))
self.add_stage(stage_name="timestep_preparation_stage", stage=TimestepPreparationStage(...))
self.add_stage(stage_name="latent_preparation_stage", stage=LatentPreparationStage(...))
self.add_stage(stage_name="conditioning_stage", stage=ConditioningStage())
self.add_stage(stage_name="denoising_stage", stage=DenoisingStage(...))
self.add_stage(stage_name="decoding_stage", stage=DecodingStage(...))
self.add_stage(InputValidationStage())
self.add_stage(ImageEncodingStage(...))
self.add_stage(ImageVAEEncodingStage(...))
self.add_stage(TimestepPreparationStage(...))
self.add_stage(LatentPreparationStage(...))
self.add_stage(DenoisingStage(...))
self.add_stage(DecodingStage(...))
```
The pipeline is constructed by adding stages in order. `Qwen-Image-Edit` uses `ImageEncodingStage` (for prompt and image processing) and `ImageVAEEncodingStage` (for latent extraction) before standard denoising and decoding.

View File

@@ -35,8 +35,6 @@ Its core features include:
basic_usage/native_api.ipynb
basic_usage/sampling_params.md
basic_usage/popular_model_usage.rst
basic_usage/diffusion_llms.md
basic_usage/diffusion.md
.. toctree::
:maxdepth: 1
@@ -74,11 +72,30 @@ Its core features include:
:caption: Supported Models
supported_models/text_generation/index
supported_models/image_generation/index
supported_models/retrieval_ranking/index
supported_models/specialized/index
supported_models/extending/index
.. toctree::
:maxdepth: 2
:caption: SGLang Diffusion
diffusion/index
diffusion/installation
diffusion/compatibility_matrix
diffusion/api/cli
diffusion/api/openai_api
diffusion/performance/index
diffusion/performance/attention_backends
diffusion/performance/profiling
diffusion/performance/cache/index
diffusion/performance/cache/cache_dit
diffusion/performance/cache/teacache
diffusion/support_new_models
diffusion/contributing
diffusion/ci_perf
diffusion/environment_variables
.. toctree::
:maxdepth: 1
:caption: Hardware Platforms
@@ -113,6 +130,7 @@ Its core features include:
references/custom_chat_template.md
references/frontend/frontend_index.rst
references/post_training_integration.md
references/release_lookup
references/learn_more.md
.. toctree::

View File

@@ -14,21 +14,26 @@ let currentMetricType = 'throughput'; // throughput, latency, ttft, inputThrough
// Metric type definitions
const metricTypes = {
throughput: { label: 'Overall Throughput', unit: 'tokens/sec', field: 'throughput' },
outputThroughput: { label: 'Output Throughput', unit: 'tokens/sec', field: 'outputThroughput' },
inputThroughput: { label: 'Input Throughput', unit: 'tokens/sec', field: 'inputThroughput' },
latency: { label: 'Latency', unit: 'ms', field: 'latency' },
ttft: { label: 'Time to First Token', unit: 'ms', field: 'ttft' },
accLength: { label: 'Accept Length', unit: 'tokens', field: 'accLength', filterInvalid: true }
// Text/VLM metrics
throughput: { label: 'Overall Throughput', unit: 'tokens/sec', field: 'throughput', type: 'text' },
outputThroughput: { label: 'Output Throughput', unit: 'tokens/sec', field: 'outputThroughput', type: 'text' },
inputThroughput: { label: 'Input Throughput', unit: 'tokens/sec', field: 'inputThroughput', type: 'text' },
latency: { label: 'Latency', unit: 'ms', field: 'latency', type: 'text' },
ttft: { label: 'Time to First Token', unit: 'ms', field: 'ttft', type: 'text' },
accLength: { label: 'Accept Length', unit: 'tokens', field: 'accLength', filterInvalid: true, type: 'text' },
// Diffusion metrics
e2eMs: { label: 'End-to-End Time', unit: 'ms', field: 'e2e_ms', type: 'diffusion' },
avgDenoiseMs: { label: 'Avg Denoise Time', unit: 'ms', field: 'avg_denoise_ms', type: 'diffusion' },
medianDenoiseMs: { label: 'Median Denoise Time', unit: 'ms', field: 'median_denoise_ms', type: 'diffusion' }
};
// Chart.js default configuration for dark theme
Chart.defaults.color = '#8b949e';
Chart.defaults.borderColor = '#30363d';
Chart.defaults.color = '#94a3b8';
Chart.defaults.borderColor = '#1e293b';
const chartColors = [
'#58a6ff', '#3fb950', '#d29922', '#f85149', '#a371f7',
'#79c0ff', '#56d364', '#e3b341', '#ff7b72', '#bc8cff'
'#22d3ee', '#34d399', '#fbbf24', '#f87171', '#a78bfa',
'#67e8f9', '#6ee7b7', '#fcd34d', '#fca5a5', '#c4b5fd'
];
// Initialize the dashboard
@@ -53,7 +58,7 @@ async function init() {
async function loadData() {
// Try local server API first (if running server.py)
try {
const response = await fetch('/api/metrics');
const response = await fetch('/api/metrics', { headers: getAuthHeaders() });
if (response.ok) {
const data = await response.json();
if (data.length > 0 && data[0].results && data[0].results.length > 0) {
@@ -142,32 +147,51 @@ async function fetchMetricsForRun(run) {
}
}
// Helper function to detect if result is diffusion type
function isDiffusionResult(result) {
return result.test_type === 'diffusion' || (result.tests && !result.benchmarks);
}
// Populate filter dropdowns
function populateFilters() {
const gpuConfigs = new Set();
const models = new Set();
const testNames = new Set(); // For diffusion tests
const batchSizes = new Set();
const ioLengths = new Set();
allMetricsData.forEach(run => {
run.results.forEach(result => {
gpuConfigs.add(result.gpu_config);
models.add(result.model);
// Try new structure first (benchmarks_by_io_len), fall back to flat benchmarks
if (result.benchmarks_by_io_len) {
Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => {
ioLengths.add(ioKey);
ioData.benchmarks.forEach(bench => {
batchSizes.add(bench.batch_size);
// Handle diffusion results
if (isDiffusionResult(result)) {
models.add(result.test_suite || 'diffusion');
if (result.tests) {
result.tests.forEach(test => {
testNames.add(test.test_name);
});
});
} else if (result.benchmarks) {
result.benchmarks.forEach(bench => {
batchSizes.add(bench.batch_size);
if (bench.input_len && bench.output_len) {
ioLengths.add(`${bench.input_len}_${bench.output_len}`);
}
});
}
}
// Handle text/VLM results
else {
models.add(result.model);
// Try new structure first (benchmarks_by_io_len), fall back to flat benchmarks
if (result.benchmarks_by_io_len) {
Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => {
ioLengths.add(ioKey);
ioData.benchmarks.forEach(bench => {
batchSizes.add(bench.batch_size);
});
});
} else if (result.benchmarks) {
result.benchmarks.forEach(bench => {
batchSizes.add(bench.batch_size);
if (bench.input_len && bench.output_len) {
ioLengths.add(`${bench.input_len}_${bench.output_len}`);
}
});
}
}
});
});
@@ -345,7 +369,16 @@ function createMetricTabs() {
const tabsContainer = document.getElementById('metric-tabs');
tabsContainer.innerHTML = '';
Object.entries(metricTypes).forEach(([key, metric], index) => {
// Detect if current data is diffusion or text
const isDiffusion = detectCurrentDataType() === 'diffusion';
const dataType = isDiffusion ? 'diffusion' : 'text';
// Filter metrics based on data type
const relevantMetrics = Object.entries(metricTypes).filter(([key, metric]) =>
metric.type === dataType
);
relevantMetrics.forEach(([key, metric], index) => {
const tab = document.createElement('div');
tab.className = index === 0 ? 'tab active' : 'tab';
tab.textContent = metric.label;
@@ -353,6 +386,31 @@ function createMetricTabs() {
tab.onclick = () => selectMetricTab(key, tab);
tabsContainer.appendChild(tab);
});
// Set initial metric type
if (relevantMetrics.length > 0) {
currentMetricType = relevantMetrics[0][0];
}
}
function detectCurrentDataType() {
// Check if currently selected model/GPU config has diffusion data
const gpuFilter = document.getElementById('gpu-filter')?.value;
const modelFilter = currentModel;
if (!gpuFilter || !modelFilter) return 'text';
for (const run of allMetricsData) {
for (const result of run.results) {
if (result.gpu_config === gpuFilter) {
const resultModel = result.test_suite || result.model;
if (resultModel === modelFilter && isDiffusionResult(result)) {
return 'diffusion';
}
}
}
}
return 'text';
}
function selectMetricTab(metricKey, tabElement) {
@@ -374,6 +432,8 @@ function handleModelFilterChange(model) {
updateVariantFilter();
// Update IO length filter based on new model selection
updateIoLenFilter();
// Recreate metric tabs in case data type changed (text vs diffusion)
createMetricTabs();
updateCharts();
}
@@ -383,6 +443,8 @@ function handleGpuFilterChange() {
updateVariantFilter();
// Update IO length filter based on new GPU selection
updateIoLenFilter();
// Recreate metric tabs in case data type changed (text vs diffusion)
createMetricTabs();
updateCharts();
}
@@ -518,6 +580,7 @@ function prepareChartData(gpuFilter, modelFilter, variantFilter, ioLenFilter, ba
// Prepare chart data grouped by batch size - each batch size is a separate series
function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFilter, batchFilter) {
const batchDataMap = new Map(); // batch_size -> Map of variant -> data
const testDataMap = new Map(); // For diffusion: test_name -> data
allMetricsData.forEach(run => {
const runDate = new Date(run.run_date);
@@ -525,6 +588,37 @@ function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFil
run.results.forEach(result => {
// Apply filters - GPU and Model are required (no "all" option)
if (result.gpu_config !== gpuFilter) return;
// Handle diffusion results
if (isDiffusionResult(result)) {
const resultModel = result.test_suite || 'diffusion';
if (resultModel !== modelFilter) return;
if (result.tests) {
result.tests.forEach(test => {
const testName = test.test_name;
if (!testDataMap.has(testName)) {
testDataMap.set(testName, {
label: testName,
data: [],
model: resultModel,
testName: testName
});
}
testDataMap.get(testName).data.push({
x: runDate,
e2e_ms: test.e2e_ms,
avg_denoise_ms: test.avg_denoise_ms,
median_denoise_ms: test.median_denoise_ms,
runId: run.run_id
});
});
}
return;
}
// Handle text/VLM results
if (result.model !== modelFilter) return;
if (variantFilter !== 'all' && result.variant !== variantFilter) return;
@@ -622,6 +716,17 @@ function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFil
// Sort data points by date and convert to array format
const result = {};
// For diffusion data, use test names as "batch sizes"
if (testDataMap.size > 0) {
testDataMap.forEach((series, testName) => {
series.data.sort((a, b) => a.x - b.x);
result[testName] = [series]; // Each test is its own series
});
return result;
}
// For text/VLM data, use batch sizes
batchDataMap.forEach((variantMap, batchSize) => {
variantMap.forEach(series => {
series.data.sort((a, b) => a.x - b.x);
@@ -642,7 +747,16 @@ function updateMetricChart(chartDataByBatch, metricType) {
activeCharts = [];
const metric = metricTypes[metricType];
const batchSizes = Object.keys(chartDataByBatch).sort((a, b) => parseInt(a) - parseInt(b));
const isDiffusion = metric.type === 'diffusion';
// For diffusion, keys are test names; for text, keys are batch sizes
const keys = Object.keys(chartDataByBatch);
if (!isDiffusion) {
keys.sort((a, b) => parseInt(a) - parseInt(b));
} else {
keys.sort(); // Alphabetical sort for test names
}
const batchSizes = keys; // Keep variable name for compatibility
if (batchSizes.length === 0) {
container.innerHTML = '<div class="no-data">No data available for the selected filters</div>';
@@ -682,7 +796,8 @@ function updateMetricChart(chartDataByBatch, metricType) {
const title = document.createElement('div');
title.className = 'batch-chart-title';
title.textContent = `Batch Size: ${batchSize}`;
// For diffusion, show test name; for text, show batch size
title.textContent = isDiffusion ? `Test: ${batchSize}` : `Batch Size: ${batchSize}`;
chartWrapper.appendChild(title);
const chartContainer = document.createElement('div');
@@ -726,12 +841,13 @@ function getChartOptions(yAxisLabel) {
}
},
tooltip: {
backgroundColor: '#21262d',
borderColor: '#30363d',
backgroundColor: '#1a2332',
borderColor: 'rgba(148, 163, 184, 0.1)',
borderWidth: 1,
titleFont: { size: 13 },
bodyFont: { size: 12 },
padding: 12
titleFont: { size: 13, family: "'DM Sans', sans-serif" },
bodyFont: { size: 12, family: "'JetBrains Mono', monospace" },
padding: 14,
cornerRadius: 8
}
},
scales: {
@@ -744,7 +860,7 @@ function getChartOptions(yAxisLabel) {
}
},
grid: {
color: '#21262d'
color: 'rgba(148, 163, 184, 0.06)'
}
},
y: {
@@ -753,7 +869,7 @@ function getChartOptions(yAxisLabel) {
text: yAxisLabel
},
grid: {
color: '#21262d'
color: 'rgba(148, 163, 184, 0.06)'
}
}
}
@@ -832,5 +948,109 @@ function formatNumber(num) {
return num.toFixed(1);
}
// Authentication state
let authToken = sessionStorage.getItem('dashboard_auth_token') || null;
// Get auth headers for API requests
function getAuthHeaders() {
const headers = {};
if (authToken) {
headers['Authorization'] = `Bearer ${authToken}`;
}
return headers;
}
// Check if server requires authentication and show/hide login accordingly
async function checkAuthAndInit() {
const loginOverlay = document.getElementById('login-overlay');
const dashboardContainer = document.getElementById('dashboard-container');
try {
const response = await fetch('/api/auth-check');
if (response.ok) {
const data = await response.json();
if (!data.auth_required) {
// No auth required - skip login, show dashboard directly
loginOverlay.style.display = 'none';
dashboardContainer.style.display = 'block';
init();
return;
}
}
} catch (e) {
// Server not available (e.g. static hosting) - skip login
loginOverlay.style.display = 'none';
dashboardContainer.style.display = 'block';
init();
return;
}
// Auth is required - check if we have a valid token from a previous session
if (authToken) {
try {
const testResponse = await fetch('/api/metrics', {
headers: getAuthHeaders()
});
if (testResponse.ok) {
loginOverlay.style.display = 'none';
dashboardContainer.style.display = 'block';
init();
return;
}
} catch (e) {
// Token invalid or expired
}
// Clear invalid token
authToken = null;
sessionStorage.removeItem('dashboard_auth_token');
}
// Show login form
loginOverlay.style.display = 'flex';
dashboardContainer.style.display = 'none';
}
// Handle login form submission
async function handleLogin(event) {
event.preventDefault();
const username = document.getElementById('login-username').value;
const password = document.getElementById('login-password').value;
const errorEl = document.getElementById('login-error');
const loginBtn = document.getElementById('login-btn');
errorEl.textContent = '';
loginBtn.disabled = true;
loginBtn.textContent = 'Signing in...';
try {
const response = await fetch('/api/login', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ username, password })
});
const data = await response.json();
if (response.ok && data.token) {
authToken = data.token;
sessionStorage.setItem('dashboard_auth_token', authToken);
document.getElementById('login-overlay').style.display = 'none';
document.getElementById('dashboard-container').style.display = 'block';
init();
} else {
errorEl.textContent = data.error || 'Invalid username or password';
}
} catch (e) {
errorEl.textContent = 'Unable to connect to server';
} finally {
loginBtn.disabled = false;
loginBtn.textContent = 'Sign In';
}
return false;
}
// Initialize on page load
document.addEventListener('DOMContentLoaded', init);
document.addEventListener('DOMContentLoaded', checkAuthAndInit);

File diff suppressed because it is too large

View File

@@ -12,13 +12,19 @@ Usage:
python server.py --port 8080
python server.py --host 0.0.0.0 # Allow external access
python server.py --fetch-on-start
python server.py --username admin --password secret # Enable authentication
DASHBOARD_USERNAME=admin DASHBOARD_PASSWORD=secret python server.py # Via env vars
python server.py --refresh-interval 12 # Auto-refresh data every 12 hours
"""
import argparse
import hashlib
import hmac
import http.server
import io
import json
import os
import secrets
import socketserver
import threading
import time
@@ -44,6 +50,47 @@ metrics_cache = {
CACHE_TTL = 300 # 5 minutes
REQUEST_TIMEOUT = 30 # seconds
# Authentication configuration (set via CLI flags)
auth_config = {
"enabled": False,
"username": None,
"password_hash": None, # SHA-256 hash of the password
"active_tokens": {}, # token -> expiry timestamp
}
auth_lock = threading.Lock()
AUTH_TOKEN_TTL = 3600 # 1 hour
def hash_password(password):
"""Hash a password using SHA-256 for constant-time comparison."""
return hashlib.sha256(password.encode("utf-8")).hexdigest()
def create_auth_token():
"""Create a new session token."""
token = secrets.token_hex(32)
with auth_lock:
# Clean up expired tokens
now = time.time()
auth_config["active_tokens"] = {
t: exp for t, exp in auth_config["active_tokens"].items() if exp > now
}
auth_config["active_tokens"][token] = now + AUTH_TOKEN_TTL
return token
def verify_auth_token(token):
"""Verify a session token is valid and not expired."""
if not token:
return False
with auth_lock:
expiry = auth_config["active_tokens"].get(token)
if expiry and expiry > time.time():
return True
# Remove expired token
auth_config["active_tokens"].pop(token, None)
return False
def get_github_token():
"""Get GitHub token from environment or gh CLI."""
@@ -187,12 +234,47 @@ def update_cache_async():
metrics_cache["updating"] = False
def start_periodic_refresh(interval_hours):
"""Start a background thread that refreshes the cache periodically."""
interval_seconds = interval_hours * 3600
def refresh_loop():
while True:
time.sleep(interval_seconds)
print(f"Periodic refresh triggered (every {interval_hours}h)")
update_cache_async()
thread = threading.Thread(target=refresh_loop, daemon=True)
thread.start()
print(f"Periodic refresh enabled: every {interval_hours} hours")
class DashboardHandler(http.server.SimpleHTTPRequestHandler):
"""HTTP request handler for the dashboard."""
def __init__(self, *args, directory=None, **kwargs):
super().__init__(*args, directory=directory, **kwargs)
def _send_json(self, data, status=200):
"""Send a JSON response."""
self.send_response(status)
self.send_header("Content-Type", "application/json")
self.send_header("Access-Control-Allow-Origin", "*")
self.end_headers()
self.wfile.write(json.dumps(data).encode())
def _check_auth(self):
"""Check if request is authenticated. Returns True if OK, sends 401 and returns False otherwise."""
if not auth_config["enabled"]:
return True
auth_header = self.headers.get("Authorization", "")
if auth_header.startswith("Bearer "):
token = auth_header[7:]
if verify_auth_token(token):
return True
self._send_json({"error": "Unauthorized"}, status=401)
return False
def do_GET(self):
parsed = urlparse(self.path)
@@ -201,13 +283,55 @@ class DashboardHandler(http.server.SimpleHTTPRequestHandler):
self.send_error(400, "Invalid path")
return
if parsed.path == "/api/metrics":
self.handle_metrics_api(parsed)
if parsed.path == "/api/auth-check":
self.handle_auth_check()
elif parsed.path == "/api/metrics":
if self._check_auth():
self.handle_metrics_api(parsed)
elif parsed.path == "/api/refresh":
self.handle_refresh_api()
if self._check_auth():
self.handle_refresh_api()
else:
super().do_GET()
def do_POST(self):
parsed = urlparse(self.path)
if parsed.path == "/api/login":
self.handle_login()
else:
self.send_error(404, "Not Found")
def handle_auth_check(self):
"""Tell the frontend whether authentication is required."""
self._send_json({"auth_required": auth_config["enabled"]})
def handle_login(self):
"""Validate username/password and return a session token."""
content_length = int(self.headers.get("Content-Length", 0))
if content_length == 0 or content_length > 4096:
self._send_json({"error": "Invalid request"}, status=400)
return
try:
body = json.loads(self.rfile.read(content_length))
except (json.JSONDecodeError, ValueError):
self._send_json({"error": "Invalid JSON"}, status=400)
return
username = body.get("username", "")
password = body.get("password", "")
if hmac.compare_digest(
username, auth_config["username"]
) and hmac.compare_digest(
hash_password(password), auth_config["password_hash"]
):
token = create_auth_token()
self._send_json({"token": token})
else:
self._send_json({"error": "Invalid username or password"}, status=401)
def handle_metrics_api(self, parsed):
"""Handle /api/metrics endpoint."""
# Check cache with thread safety
@@ -222,21 +346,12 @@ class DashboardHandler(http.server.SimpleHTTPRequestHandler):
# Trigger background update
threading.Thread(target=update_cache_async, daemon=True).start()
self.send_response(200)
self.send_header("Content-Type", "application/json")
self.send_header("Access-Control-Allow-Origin", "*")
self.end_headers()
self.wfile.write(json.dumps(data).encode())
self._send_json(data)
def handle_refresh_api(self):
"""Handle /api/refresh endpoint."""
threading.Thread(target=update_cache_async, daemon=True).start()
self.send_response(200)
self.send_header("Content-Type", "application/json")
self.send_header("Access-Control-Allow-Origin", "*")
self.end_headers()
self.wfile.write(json.dumps({"status": "refreshing"}).encode())
self._send_json({"status": "refreshing"})
def log_message(self, format, *args):
"""Custom log format."""
@@ -254,8 +369,33 @@ def main():
parser.add_argument(
"--fetch-on-start", action="store_true", help="Fetch metrics on startup"
)
parser.add_argument(
"--refresh-interval",
type=float,
default=12,
help="Auto-refresh interval in hours (default: 12, set to 0 to disable)",
)
parser.add_argument(
"--username",
default=os.environ.get("DASHBOARD_USERNAME"),
help="Username for dashboard authentication (or set DASHBOARD_USERNAME env var)",
)
parser.add_argument(
"--password",
default=os.environ.get("DASHBOARD_PASSWORD"),
help="Password for dashboard authentication (or set DASHBOARD_PASSWORD env var)",
)
args = parser.parse_args()
# Configure authentication if both username and password are provided
if args.username and args.password:
auth_config["enabled"] = True
auth_config["username"] = args.username
auth_config["password_hash"] = hash_password(args.password)
print(f"Authentication enabled for user: {args.username}")
elif args.username or args.password:
parser.error("Both --username and --password must be provided together")
# Change to dashboard directory
dashboard_dir = Path(__file__).parent
os.chdir(dashboard_dir)
@@ -264,6 +404,9 @@ def main():
print("Fetching initial metrics data...")
update_cache_async()
if args.refresh_interval > 0:
start_periodic_refresh(args.refresh_interval)
handler = lambda *a, **kw: DashboardHandler(*a, directory=str(dashboard_dir), **kw)
with socketserver.TCPServer((args.host, args.port), handler) as httpd:

View File

@@ -126,7 +126,6 @@ export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
unset TASK_QUEUE_ENABLE
export SGLANG_NPU_USE_MLAPO=1
export SGLANG_USE_FIA_NZ=1
export ENABLE_MOE_NZ=1
# Keep max-running-requests <= max-cuda-graph-bs * dp_size; performance degrades significantly once this limit is exceeded.
python -m sglang.launch_server \

View File

@@ -0,0 +1,39 @@
# Environment Variables
SGLang supports various environment variables related to Ascend NPU that can be used to configure its runtime behavior.
This document provides a list of commonly used environment variables and aims to stay updated over time.
## Directly Used in SGLang
| Environment Variable | Description | Default Value |
|--------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| `SGLANG_NPU_USE_MLAPO` | Adopts the `MLAPO` fusion operator in attention <br/> preprocessing stage of the MLA model. | `false` |
| `SGLANG_USE_FIA_NZ` | Reshapes KV Cache for FIA NZ format.<br/> `SGLANG_USE_FIA_NZ` must be enabled with `SGLANG_NPU_USE_MLAPO` | `false` |
| `SGLANG_NPU_USE_MULTI_STREAM` | Enables dual-stream computation of shared experts <br/> and routed experts in DeepSeek models.<br/> Also enables dual-stream computation in the DeepSeek NSA Indexer. | `false` |
| `SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT` | Disables casting model weight tensors to a specific NPU <br/> ACL format. | `false` |
| `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of dispatched tokens on each rank. | `128` |
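
The table notes that `SGLANG_USE_FIA_NZ` must be enabled together with `SGLANG_NPU_USE_MLAPO`. A small pre-launch check can catch a mismatched configuration early. The sketch below is illustrative only: `env_flag`, `check_npu_env`, and the set of accepted truthy spellings are assumptions, not part of SGLang.

```python
import os
from typing import List, Mapping

# Assumption: the spellings treated as "enabled"; SGLang's own parsing may differ.
TRUTHY = {"1", "true", "yes"}


def env_flag(env: Mapping[str, str], name: str) -> bool:
    """Return True if the variable is set to a truthy spelling (default: false)."""
    return env.get(name, "false").strip().lower() in TRUTHY


def check_npu_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Hypothetical helper: list configuration problems found in the environment."""
    problems = []
    if env_flag(env, "SGLANG_USE_FIA_NZ") and not env_flag(env, "SGLANG_NPU_USE_MLAPO"):
        problems.append("SGLANG_USE_FIA_NZ requires SGLANG_NPU_USE_MLAPO")
    return problems


print(check_npu_env({"SGLANG_USE_FIA_NZ": "1"}))
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to test without touching the real environment.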
## Used in DeepEP Ascend
| Environment Variable | Description | Default Value |
|-------------------------------------------|------------------------------------------------------------------------------------------------------------------------|---------------|
| `DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS` | Enable ant-moving function in dispatch stage. Indicates <br/> the number of tokens transmitted per round on each rank. | `8192` |
| `DEEPEP_NORMAL_LONG_SEQ_ROUND` | Enable ant-moving function in dispatch stage. Indicates <br/> the number of rounds transmitted on each rank. | `1` |
| `DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ` | Enable ant-moving function in combine stage. <br/> The value `0` means disabled. | `0` |
| `MOE_ENABLE_TOPK_NEG_ONE` | Needs to be enabled when the expert ID to be processed by <br/> DEEPEP contains -1. | `0` |
| `DEEP_NORMAL_MODE_USE_INT8_QUANT` | Quantizes x to int8 and returns (tensor, scales) in dispatch operator. | `0` |
## Others
| Environment Variable | Description | Default Value |
|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| `TASK_QUEUE_ENABLE` | Controls the optimization level of the dispatch queue<br/> for task_queue operators. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/730/comref/Envvariables/docs/zh/environment_variable_reference/TASK_QUEUE_ENABLE.md) | `1` |
| `INF_NAN_MODE_ENABLE` | Controls whether the chip uses saturation mode or INF_NAN mode. [Detail](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha001/apiref/envref/envref_07_0056.html) | `1` |
| `STREAMS_PER_DEVICE` | Configures the maximum number of streams for the stream pool. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/720/comref/Envvariables/Envir_041.html) | `32` |
| `PYTORCH_NPU_ALLOC_CONF` | Controls the behavior of the cache allocator. <br/>This variable changes memory usage and may cause performance fluctuations. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html) | |
| `ASCEND_MF_STORE_URL` | The address of config store in MemFabric during PD separation, <br/>which is generally set to the IP address of the P primary node<br/> with an arbitrary port number. | |
| `ASCEND_LAUNCH_BLOCKING` | Controls whether synchronous mode is enabled during operator execution. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_006.html) | `0` |
| `HCCL_OP_EXPANSION_MODE` | Configures the expansion position for communication algorithm scheduling. [Detail](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha001/apiref/envref/envref_07_0094.html) | |
| `HCCL_BUFFSIZE` | Controls the size of the buffer area for shared data between two NPUs. <br/>The unit is MB, and the value must be greater than or equal to 1. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/60RC3/ptmoddevg/trainingmigrguide/performance_tuning_0047.html) | `200` |
| `HCCL_SOCKET_IFNAME` | Configures the name of the network card used by the Host <br/>during HCCL initialization. [Detail](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/apiref/envvar/envref_07_0075.html) | |
| `GLOO_SOCKET_IFNAME` | Configures the network interface name for GLOO communication. | |

View File

@@ -0,0 +1,194 @@
# GLM-5 examples
## Introduction
The GLM (General Language Model) series is an open-source bilingual large language model family jointly developed by the KEG Laboratory of Tsinghua University and Zhipu AI. The models perform strongly on Chinese NLP tasks thanks to their unified pre-training framework and bilingual capabilities. [GLM-5](https://huggingface.co/zai-org/GLM-5) adopts the DeepSeek-V3/V3.2 architecture, including sparse attention (DSA) and multi-token prediction (MTP). Ascend provides day-0 support for GLM-5 through the SGLang inference framework, with low-code enablement and compatibility with the mainstream distributed parallel capabilities already available in SGLang. Developers are welcome to download and try it.
## Environment Preparation
### Model Weight
- `GLM-5.0` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5.0-w4a8` (quantized version without MTP): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.
### Installation
The dependencies required for the NPU runtime environment have been integrated into a Docker image, which you can pull directly.
```{code-block} bash
#Atlas 800 A3
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-a3-glm5
#Atlas 800 A2
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-910b-glm5
#start container
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci1:/dev/davinci1 \
--device=/dev/davinci2:/dev/davinci2 \
--device=/dev/davinci3:/dev/davinci3 \
--device=/dev/davinci4:/dev/davinci4 \
--device=/dev/davinci5:/dev/davinci5 \
--device=/dev/davinci6:/dev/davinci6 \
--device=/dev/davinci7:/dev/davinci7 \
--device=/dev/davinci8:/dev/davinci8 \
--device=/dev/davinci9:/dev/davinci9 \
--device=/dev/davinci10:/dev/davinci10 \
--device=/dev/davinci11:/dev/davinci11 \
--device=/dev/davinci12:/dev/davinci12 \
--device=/dev/davinci13:/dev/davinci13 \
--device=/dev/davinci14:/dev/davinci14 \
--device=/dev/davinci15:/dev/davinci15 \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:${TAG}
```
Note: With this image, you need to update transformers to the main branch:
``` shell
# reinstall transformers
pip install git+https://github.com/huggingface/transformers.git
```
## Deployment
### Single-node Deployment
- The quantized model `glm5_w4a8` can be deployed on one Atlas 800 A3 (64G × 16).
Run the following script to start online inference.
```shell
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 16384 --max-prefill-tokens 280000 \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--served-model-name glm-5 \
--cuda-graph-bs 16 \
--quantization modelslim \
--moe-a2a-backend deepep --deepep-mode auto
```
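
Once the server is up, it can be queried over HTTP. The following sketch assumes the launch settings above (`127.0.0.1`, port `8000`, served model name `glm-5`) and uses SGLang's OpenAI-compatible `/v1/chat/completions` route; `build_chat_request`, `send_chat_request`, and the `max_tokens` value are illustrative choices, not part of the original guide.

```python
import json
import urllib.request


def build_chat_request(prompt, host="127.0.0.1", port=8000, model="glm-5"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,  # illustrative value
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello")
print(request.full_url)
# To actually query a running server: print(send_chat_request(request))
```

Because the script binds to `127.0.0.1`, the request has to be issued from the same host (or from inside the same container, since `--net=host` is used).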
### Multi-node Deployment
- `GLM-5-bf16`: requires at least 2 Atlas 800 A3 (64G × 16).
**A3 series**
Set the IP addresses of the two nodes, then run the same script on both nodes.
**node 0/1**
```shell
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_NPU_USE_MULTI_STREAM=1
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
# Run ifconfig on both nodes and find the interface whose inet addr matches the node IP. That is the public interface; set its name below.
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
P_IP=('your ip1' 'your ip2')
P_MASTER="${P_IP[0]}:your port"
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
for i in "${!P_IP[@]}";
do
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
then
echo "${P_IP[$i]}"
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 32 --nnodes 2 --node-rank $i --dist-init-addr $P_MASTER \
--chunked-prefill-size 16384 --max-prefill-tokens 131072 \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.8 \
--port 8000 \
--served-model-name glm-5 \
--cuda-graph-max-bs 16 \
--disable-radix-cache
NODE_RANK=$i
break
fi
done
```
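The loop in the script above derives `--node-rank` from the position of the local IP in `P_IP`. A minimal sketch of that lookup as a standalone function is shown here. `find_node_rank` is a hypothetical name, and unlike the script, which checks only the first two addresses reported by `hostname -I`, this version checks every local address it is given.

```shell
# Hypothetical helper: print the index of the first node IP that matches
# one of this host's addresses; return 1 if none match.
# $1: space-separated node IPs in rank order
# $2: space-separated local IPs (for example, the output of `hostname -I`)
find_node_rank() {
    _rank=0
    for _node_ip in $1; do
        for _local_ip in $2; do
            if [ "$_node_ip" = "$_local_ip" ]; then
                echo "$_rank"
                return 0
            fi
        done
        _rank=$((_rank + 1))
    done
    return 1
}

# Example (uncomment on a real node):
# node_rank=$(find_node_rank "10.0.0.1 10.0.0.2" "$(hostname -I)")
```

Returning a non-zero status when no address matches makes a misconfigured IP list fail visibly instead of silently skipping the launch.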
### Prefill-Decode Disaggregation
Not tested yet.
### Using Benchmark
Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md) for details.

View File

@@ -0,0 +1,231 @@
# Qwen3.5 examples
## Environment Preparation
### Installation
The dependencies required for the NPU runtime environment are packaged in a Docker image hosted on quay.io, which you can pull directly.
```{code-block} bash
# Atlas 800 A3
docker pull quay.io/ascend/sglang:v0.5.9-cann8.5.0-a3
# Atlas 800 A2
docker pull quay.io/ascend/sglang:v0.5.9-cann8.5.0-910b
# start container (set NAME to the container name and tag to the image tag pulled above)
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
--net=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--device=/dev/davinci0:/dev/davinci0 \
--device=/dev/davinci1:/dev/davinci1 \
--device=/dev/davinci2:/dev/davinci2 \
--device=/dev/davinci3:/dev/davinci3 \
--device=/dev/davinci4:/dev/davinci4 \
--device=/dev/davinci5:/dev/davinci5 \
--device=/dev/davinci6:/dev/davinci6 \
--device=/dev/davinci7:/dev/davinci7 \
--device=/dev/davinci8:/dev/davinci8 \
--device=/dev/davinci9:/dev/davinci9 \
--device=/dev/davinci10:/dev/davinci10 \
--device=/dev/davinci11:/dev/davinci11 \
--device=/dev/davinci12:/dev/davinci12 \
--device=/dev/davinci13:/dev/davinci13 \
--device=/dev/davinci14:/dev/davinci14 \
--device=/dev/davinci15:/dev/davinci15 \
--device=/dev/davinci_manager:/dev/davinci_manager \
--device=/dev/hisi_hdc:/dev/hisi_hdc \
--entrypoint=bash \
quay.io/ascend/sglang:${tag}
```
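The sixteen `--device=/dev/davinciN` lines above follow one pattern, so they can also be generated. The following is a sketch only; `davinci_device_flags` is a hypothetical helper, and the device count is assumed to match the number of NPUs present on the host.

```shell
# Hypothetical helper: print one --device flag per NPU for indices 0..count-1,
# separated by spaces, in the same form as the docker run command above.
davinci_device_flags() {
    _count="$1"
    _idx=0
    while [ "$_idx" -lt "$_count" ]; do
        printf '%s ' "--device=/dev/davinci${_idx}:/dev/davinci${_idx}"
        _idx=$((_idx + 1))
    done
}

# Preview the flags for a 16-NPU host:
# davinci_device_flags 16
```

When passing the result to `docker run`, leave the command substitution `$(davinci_device_flags 16)` unquoted so that the shell splits it into separate arguments.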
## Deployment
### Single-node Deployment
Run the following script to execute online inference.
#### Qwen3.5 397B
```shell
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 16 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 16 \
--quantization modelslim \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--dtype bfloat16
```
#### Qwen3.5 122B
```shell
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 8 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 16 \
--quantization modelslim \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--dtype bfloat16
```
#### Qwen3.5 35B
```shell
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 2 --nnodes 1 --node-rank 0 \
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.7 \
--port 8000 \
--cuda-graph-bs 16 \
--quantization modelslim \
--enable-multimodal \
--mm-attention-backend ascend_attn \
--dtype bfloat16
```
#### Qwen3.5 27B
```shell
# high performance cpu
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
# bind cpu
export SGLANG_SET_CPU_AFFINITY=1
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
# cann
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1000
export HCCL_OP_EXPANSION_MODE=AIV
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
python3 -m sglang.launch_server \
--model-path $MODEL_PATH \
--attention-backend ascend \
--device npu \
--tp-size 2 \
--chunked-prefill-size -1 --max-prefill-tokens 120000 \
--disable-radix-cache \
--trust-remote-code \
--host 127.0.0.1 \
--mem-fraction-static 0.8 \
--port 8000 \
--cuda-graph-bs 32 \
--enable-multimodal \
--mm-attention-backend ascend_attn
```
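The four scripts above share most of their settings and differ mainly in `--tp-size`. The sketch below collects the values used in this section into one lookup. `qwen35_tp_size` is a hypothetical helper, and the mapping only restates the commands above; it is not a general sizing rule.

```shell
# Hypothetical helper: print the --tp-size used in the scripts above
# for a given Qwen3.5 model size; return 1 for sizes not covered here.
qwen35_tp_size() {
    case "$1" in
        397B) echo 16 ;;
        122B) echo 8 ;;
        35B|27B) echo 2 ;;
        *) echo "unknown model size: $1" >&2; return 1 ;;
    esac
}

# Example: pick the value for the 122B model.
# tp_size=$(qwen35_tp_size 122B)
```

Note that the 27B script also differs in other flags (`--chunked-prefill-size -1`, `--max-prefill-tokens 120000`, `--mem-fraction-static 0.8`, `--cuda-graph-bs 32`, and no `--quantization`), so the lookup covers only the tensor-parallel size.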
### Prefill-Decode Disaggregation
Not tested yet.
### Using Benchmark
Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md) for details.

View File

@@ -12,3 +12,6 @@ Ascend NPUs
mindspore_backend.md
ascend_contribution_guide.md
ascend_npu_best_practice.md
ascend_npu_qwen3_5_examples.md
ascend_npu_glm5_examples.md
ascend_npu_environment_variables.md

View File

@@ -17,8 +17,8 @@ SGLang supports various environment variables that can be used to configure its
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | `20` |
| `SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL` | The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. 0 means disabled. | `0` |
| `SGLANG_FORWARD_UNKNOWN_TOOLS` | Forward unknown tool calls to clients instead of dropping them | `false` (drop unknown tools) |
| `SGLANG_QUEUED_TIMEOUT_MS` | Timeout (in ms) for requests in the waiting queue | `-1` |
| `SGLANG_FORWARD_TIMEOUT_MS` | Timeout (in ms) for requests in the forward batch | `-1` |
| `SGLANG_REQ_WAITING_TIMEOUT` | Timeout (in seconds) for requests waiting in the queue before being scheduled | `-1` |
| `SGLANG_REQ_RUNNING_TIMEOUT` | Timeout (in seconds) for requests running in the decode batch | `-1` |
## Performance Tuning
@@ -76,6 +76,7 @@ SGLang supports various environment variables that can be used to configure its
| --- | --- | --- |
| `SGLANG_MORI_FP8_DISP` | Use FP8 for dispatch | `"false"` |
| `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | `4096` |
| `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` | Threshold for switching between `InterNodeV1` and `InterNodeV1LL` kernel types. `InterNodeV1LL` is used if `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` is less than or equal to this threshold; otherwise, `InterNodeV1` is used. | `256` |
| `SGLANG_MORI_QP_PER_TRANSFER` | Number of RDMA Queue Pairs (QPs) used per transfer operation | `1` |
| `SGLANG_MORI_POST_BATCH_SIZE` | Number of RDMA work requests posted in a single batch to each QP | `-1` |
| `SGLANG_MORI_NUM_WORKERS` | Number of worker threads in the RDMA executor thread pool | `1` |
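
The selection rule described for `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` can be restated as a small function. This is an illustration of the rule as worded in the table, not SGLang's actual implementation; `mori_kernel_type` is a hypothetical name, and the defaults are the ones listed in the table.

```shell
# Illustration only: mirrors the rule stated in the table above.
# $1: SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK (default 4096)
# $2: SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD (default 256)
mori_kernel_type() {
    _max_tokens="${1:-4096}"
    _threshold="${2:-256}"
    if [ "$_max_tokens" -le "$_threshold" ]; then
        echo InterNodeV1LL
    else
        echo InterNodeV1
    fi
}

# With both defaults, 4096 > 256, so InterNodeV1 is selected.
# mori_kernel_type
```

The comparison is inclusive: a token limit equal to the threshold still selects `InterNodeV1LL`.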

Some files were not shown because too many files have changed in this diff