mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-29 19:28:51 +00:00
Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN
559
.claude/skills/add-jit-kernel/SKILL.md
Normal file
@@ -0,0 +1,559 @@
|
||||
---
|
||||
name: add-jit-kernel
|
||||
description: Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module
|
||||
---
|
||||
|
||||
# Tutorial: Adding a New JIT Kernel to SGLang
|
||||
|
||||
This tutorial walks through adding a simple element-wise scale operation as a JIT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
|
||||
|
||||
## Goal
|
||||
|
||||
Add a new operation that scales each element of a tensor by a scalar factor:
|
||||
|
||||
- Input: tensor `x` (CUDA) and scalar `factor` (float, encoded as an integer ratio and passed as C++ template arguments)
|
||||
- Output: `x * factor` (element-wise), allocated internally unless a pre-allocated `out` tensor is passed
|
||||
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
|
||||
|
||||
## When to use JIT vs AOT (`sgl-kernel`)
|
||||
|
||||
- **JIT (`jit_kernel`)**: lightweight, few dependencies, rapid iteration, compiled on first use
|
||||
- **AOT (`sgl-kernel`)**: depends on CUTLASS / FlashInfer / DeepGEMM, needs pre-built wheel
|
||||
|
||||
---
|
||||
|
||||
## Common Abstractions in `python/sglang/jit_kernel/include/sgl_kernel/`
|
||||
|
||||
**Always prefer these abstractions over raw CUDA primitives.** They provide safety, readability, and consistency with the rest of the codebase.
|
||||
|
||||
### `utils.h` — Host-side utilities
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/utils.h>
|
||||
```
|
||||
|
||||
- **`host::RuntimeCheck(cond, args...)`** — Assert a condition at runtime; throws `PanicError` with file/line info on failure. Prefer this over bare `assert`.
|
||||
- **`host::Panic(args...)`** — Unconditionally throw a `PanicError` with a descriptive message.
|
||||
- **`host::div_ceil(a, b)`** — Integer ceiling division `(a + b - 1) / b`.
|
||||
- **`host::irange(n)`** / **`host::irange(start, end)`** — Range views for cleaner loops.
|
||||
- **`host::pointer::offset(ptr, offsets...)`** — Byte-safe pointer arithmetic on `void*`. Use this instead of raw casts.
|
||||
|
||||
### `utils.cuh` — Device-side utilities + `LaunchKernel`
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/utils.cuh>
|
||||
```
|
||||
|
||||
- **Type aliases**: `fp16_t`, `bf16_t`, `fp32_t`, `fp8_e4m3_t`, `fp8_e5m2_t` and their packed variants `fp16x2_t`, `bf16x2_t`, `fp32x2_t`, etc.
|
||||
- **`SGL_DEVICE`** — Expands to `__forceinline__ __device__`. Use on all device functions.
|
||||
- **`device::kWarpThreads`** — Constant `32`.
|
||||
- **`device::load_as<T>(ptr, offset)`** / **`device::store_as<T>(ptr, val, offset)`** — Type-safe loads/stores from `void*`.
|
||||
- **`device::pointer::offset(ptr, offsets...)`** — Pointer arithmetic on device.
|
||||
- **`host::LaunchKernel(grid, block, device_or_stream [, smem])`** — RAII kernel launcher that:
|
||||
- Resolves the CUDA stream from a `DLDevice` via TVM-FFI automatically.
|
||||
- Checks the CUDA error with file/line info after launch via `operator()(kernel, args...)`.
|
||||
- Supports `.enable_pdl(bool)` for PDL (Programmatic Dependent Launch, SM90+).
|
||||
- **`host::RuntimeDeviceCheck(cudaError_t)`** — Check a CUDA error; throw on failure.
|
||||
|
||||
### `tensor.h` — Tensor validation (`TensorMatcher`, Symbolic types)
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/tensor.h>
|
||||
```
|
||||
|
||||
This is the **primary validation API** for all kernel launchers. Use it to validate every `tvm::ffi::TensorView` argument.
|
||||
|
||||
- **`host::SymbolicSize{"name"}`** — A named symbolic dimension. Call `.set_value(n)` to pin it, `.unwrap()` to extract after verification.
|
||||
- **`host::SymbolicDType`** — Symbolic dtype. Use `.set_options<Ts...>()` to restrict allowed types.
|
||||
- **`host::SymbolicDevice`** — Symbolic device. Use `.set_options<kDLCUDA>()` to restrict to CUDA.
|
||||
- **`host::TensorMatcher({dims...})`** — Fluent builder for tensor validation:
|
||||
- `.with_dtype<T>()` — require a specific C++ type (e.g. `fp16_t`)
|
||||
- `.with_dtype<T1, T2, ...>()` — allow a set of types
|
||||
- `.with_device<kDLCUDA>(device_sym)` — require CUDA, bind device to symbol
|
||||
- `.with_strides({strides...})` — validate strides (omit to require contiguous)
|
||||
- `.verify(tensor_view)` — execute the check; throws `PanicError` with full context on failure; **chainable** (`verify(a).verify(b)` to check multiple tensors with the same shape)
|
||||
|
||||
**Typical pattern:**
|
||||
```cpp
|
||||
auto N = SymbolicSize{"num_elements"};
|
||||
auto device = SymbolicDevice{};
|
||||
device.set_options<kDLCUDA>();
|
||||
TensorMatcher({N}) //
|
||||
.with_dtype<fp16_t>()
|
||||
.with_device(device)
|
||||
.verify(dst)
|
||||
.verify(src); // same shape, dtype, device as dst
|
||||
const size_t n = N.unwrap();
|
||||
const DLDevice dev = device.unwrap();
|
||||
```
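
To make the binding semantics of chained `verify` calls concrete, here is a minimal Python sketch: the first tensor pins each symbolic dimension, and later tensors must agree with it. The `SymbolicSize` and `Matcher` classes below are simplified, hypothetical stand-ins written for illustration only; they are not the C++ API in `tensor.h` and ignore dtype, device, and strides.

```python
class SymbolicSize:
    """Hypothetical stand-in: a named dimension that is pinned on first use."""

    def __init__(self, name):
        self.name = name
        self.value = None  # unset until the first verify() binds it

    def match(self, actual):
        if self.value is None:
            self.value = actual  # the first tensor pins the dimension
        elif self.value != actual:
            raise ValueError(
                f"size mismatch for '{self.name}': expected {self.value}, got {actual}"
            )

    def unwrap(self):
        if self.value is None:
            raise ValueError(f"'{self.name}' was never bound")
        return self.value


class Matcher:
    """Hypothetical stand-in for a shape matcher with a chainable verify()."""

    def __init__(self, dims):
        self.dims = dims

    def verify(self, shape):
        if len(shape) != len(self.dims):
            raise ValueError(f"rank mismatch: expected {len(self.dims)}, got {len(shape)}")
        for dim, actual in zip(self.dims, shape):
            dim.match(actual)
        return self  # returning self is what makes verify() chainable


N = SymbolicSize("num_elements")
Matcher([N]).verify((1024,)).verify((1024,))  # dst first, then src with the same shape
n = N.unwrap()
```

Because the symbol remembers the first shape it saw, a second tensor with a different length fails at `verify` time instead of producing an out-of-bounds access later in the kernel.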
|
||||
|
||||
### `type.cuh` — `dtype_trait<T>` and `packed_t<T>`
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/type.cuh>
|
||||
```
|
||||
|
||||
- **`dtype_trait<T>`** — Static trait struct for each scalar type. Provides:
|
||||
- `dtype_trait<T>::from(value)` — convert from another type (e.g. `fp32_t` → `fp16_t`)
|
||||
- `dtype_trait<T>::abs/sqrt/rsqrt/max/min(x)` — type-dispatched math (for `fp32_t`)
|
||||
- **`packed_t<T>`** — Two-element packed alias: `packed_t<fp16_t>` = `fp16x2_t`, `packed_t<bf16_t>` = `bf16x2_t`, `packed_t<fp32_t>` = `fp32x2_t`. Use for vectorized loads/stores.
|
||||
- **`device::cast<To, From>(value)`** — Type-safe cast using `dtype_trait`, e.g. `cast<fp32x2_t, fp16x2_t>(v)`.
|
||||
|
||||
### `vec.cuh` — Vectorized memory access (`AlignedVector`)
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/vec.cuh>
|
||||
```
|
||||
|
||||
- **`device::AlignedVector<T, N>`** — Aligned storage for N elements of type T. N must be a power of two, `sizeof(T)*N <= 32`. Enables 128-bit vector loads/stores for bandwidth efficiency.
|
||||
- `.load(ptr, offset)` — vectorized load from `ptr[offset]`
|
||||
- `.store(ptr, offset)` — vectorized store to `ptr[offset]`
|
||||
- `.fill(value)` — fill all lanes
|
||||
- `operator[](i)` — element access
|
||||
|
||||
### `tile.cuh` — `tile::Memory` (strided memory access pattern)
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/tile.cuh>
|
||||
```
|
||||
|
||||
- **`device::tile::Memory<T>::cta(blockDim.x)`** — Creates a tile accessor where each thread handles `tid = threadIdx.x` with stride `blockDim.x`. Common for loops over a 1D array.
|
||||
- **`.load(ptr, offset)`** — loads `ptr[tid + offset * blockDim.x]`
|
||||
- **`.store(ptr, val, offset)`** — stores to `ptr[tid + offset * blockDim.x]`
|
||||
- **`.in_bound(n, offset)`** — boundary check
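
The index formula `tid + offset * blockDim.x` can be replayed on the host to see which elements each thread touches. This is a small Python sketch of that mapping; `tile_indices` is a hypothetical helper, and the loop condition is only an assumption about what `in_bound(n, offset)` checks.

```python
def tile_indices(tid, block_dim, n):
    """Indices visited by thread `tid`: tid + offset * block_dim while in bounds."""
    indices = []
    offset = 0
    while tid + offset * block_dim < n:  # assumed meaning of in_bound(n, offset)
        indices.append(tid + offset * block_dim)
        offset += 1
    return indices


block_dim = 4
n = 10
per_thread = [tile_indices(tid, block_dim, n) for tid in range(block_dim)]
```

With 4 threads and 10 elements, neighbouring threads read neighbouring addresses at every step, which is the access pattern that keeps global memory loads coalesced.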
|
||||
|
||||
### `math.cuh` — Device math (`device::math::`)
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/math.cuh>
|
||||
```
|
||||
|
||||
- `device::math::max/min/abs/sqrt/rsqrt<T>(...)` — type-dispatched math via `dtype_trait` (`max`/`min` take two operands; `abs`/`sqrt`/`rsqrt` take one)
|
||||
- `device::math::exp/sin/cos(float)` — fast float math wrappers
|
||||
|
||||
### `warp.cuh` — Warp-level primitives
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/warp.cuh>
|
||||
```
|
||||
|
||||
- `device::warp::reduce_sum<T>(value)` — warp-level sum reduction via `__shfl_xor_sync`
|
||||
- `device::warp::reduce_max<T>(value)` — warp-level max reduction
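
For readers unfamiliar with XOR-shuffle reductions, the following Python sketch emulates the data movement across the 32 lanes of a warp. It assumes the usual butterfly schedule (masks 16, 8, 4, 2, 1); the exact step order used in `warp.cuh` is not shown in this tutorial, and `butterfly_reduce` is a hypothetical name.

```python
WARP_THREADS = 32  # same value as device::kWarpThreads


def butterfly_reduce(lanes, op):
    """Emulate an XOR-shuffle reduction; after 5 steps every lane holds the result."""
    assert len(lanes) == WARP_THREADS
    values = list(lanes)
    mask = WARP_THREADS // 2
    while mask > 0:
        # Each lane combines its own value with the one held by lane (lane ^ mask),
        # which is the exchange that __shfl_xor_sync performs on the GPU.
        values = [op(values[lane], values[lane ^ mask]) for lane in range(WARP_THREADS)]
        mask //= 2
    return values


lane_inputs = list(range(WARP_THREADS))  # lane i contributes the value i
sums = butterfly_reduce(lane_inputs, lambda a, b: a + b)
maxes = butterfly_reduce(lane_inputs, max)
```

Every lane ends up with the full result, so no extra broadcast step is needed after the reduction.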
|
||||
|
||||
### `cta.cuh` — CTA-level primitives
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/cta.cuh>
|
||||
```
|
||||
|
||||
- `device::cta::reduce_max<T>(value, smem, min_value)` — CTA-wide max using shared memory + warp reduction. Caller is responsible for a `__syncthreads()` after if the result in `smem[0]` is needed.
|
||||
|
||||
### `atomic.cuh` — Atomic operations
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/atomic.cuh>
|
||||
```
|
||||
|
||||
- `device::atomic::max(float* addr, float value)` — float atomic max (handles negative values correctly via bit tricks).
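
One common way to build a float atomic max from integer atomics is sketched below in Python. It is an illustration of the bit-pattern ordering such tricks rely on, not a transcription of `atomic.cuh`, whose implementation may differ; all helper names are hypothetical, and the update is shown as a plain, non-atomic step.

```python
import struct


def float_as_int(x):
    """Reinterpret a float32 bit pattern as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def float_as_uint(x):
    """Reinterpret a float32 bit pattern as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def int_as_float(i):
    return struct.unpack("<f", struct.pack("<i", i))[0]


def uint_as_float(u):
    return struct.unpack("<f", struct.pack("<I", u))[0]


def max_update(current, value):
    """One non-atomic update step: returns the new content of the memory cell."""
    if value >= 0.0:
        # Non-negative floats sort like their signed bit patterns, and any
        # negative float has the sign bit set, so it compares as a negative int.
        return int_as_float(max(float_as_int(current), float_as_int(value)))
    # For a negative value, a larger float has a smaller unsigned bit pattern,
    # and any non-negative cell content is smaller still, so an unsigned min works.
    return uint_as_float(min(float_as_uint(current), float_as_uint(value)))


cell = -3.0
for candidate in [-1.0, -2.5, 0.5, -4.0, 2.0, 1.5]:
    cell = max_update(cell, candidate)
```

On the GPU the two branches would map to an integer `atomicMax` and an unsigned `atomicMin` on the same address. Special values such as NaN and `-0.0` need separate care and are ignored in this sketch.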
|
||||
|
||||
### `runtime.cuh` — Occupancy and device info
|
||||
|
||||
```cpp
|
||||
#include <sgl_kernel/runtime.cuh>
|
||||
```
|
||||
|
||||
- `host::runtime::get_blocks_per_sm(kernel, block_dim)` — max active blocks per SM (occupancy)
|
||||
- `host::runtime::get_sm_count(device_id)` — number of SMs on the device
|
||||
- `host::runtime::get_cc_major(device_id)` — compute capability major version
|
||||
|
||||
**Persistent kernel pattern** (cap blocks to SM count × occupancy):
|
||||
```cpp
|
||||
static const uint32_t max_occ = runtime::get_blocks_per_sm(kernel, kBlockSize);
|
||||
static const uint32_t num_sm = runtime::get_sm_count(device.unwrap().device_id);
|
||||
const auto num_blocks = std::min(num_sm * max_occ, div_ceil(n, kBlockSize));
|
||||
LaunchKernel(num_blocks, kBlockSize, device.unwrap())(kernel, params);
|
||||
```
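
The grid-size arithmetic of this pattern is easy to check on the host. The Python sketch below mirrors the `std::min(...)` expression above; the SM count and occupancy figures are made-up example values, not measurements of any particular GPU.

```python
def div_ceil(a, b):
    """Integer ceiling division, same formula as host::div_ceil."""
    return (a + b - 1) // b


def persistent_grid(n, block_size, num_sm, blocks_per_sm):
    """Cap the grid at the number of blocks that can be resident at once."""
    return min(num_sm * blocks_per_sm, div_ceil(n, block_size))


# Hypothetical device: 132 SMs, 8 resident blocks per SM for this kernel.
small = persistent_grid(n=1_000, block_size=256, num_sm=132, blocks_per_sm=8)
large = persistent_grid(n=10_000_000, block_size=256, num_sm=132, blocks_per_sm=8)
```

Small inputs launch only as many blocks as they need, while large inputs are capped at `num_sm * blocks_per_sm` and rely on the kernel looping over the remaining work.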
|
||||
|
||||
---
|
||||
|
||||
## Step 0 (optional): Generate a `.clangd` config for better IDE support
|
||||
|
||||
```bash
|
||||
python -m sglang.jit_kernel
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 1: Implement the CUDA kernel in `jit_kernel/csrc/`
|
||||
|
||||
Create `python/sglang/jit_kernel/csrc/elementwise/scale.cuh`.
|
||||
|
||||
The implementation fully uses the project abstractions described above:
|
||||
|
||||
```cpp
#include <sgl_kernel/tensor.h>   // TensorMatcher, SymbolicSize, SymbolicDevice
#include <sgl_kernel/type.cuh>   // dtype_trait, packed_t, device::cast
#include <sgl_kernel/utils.h>    // RuntimeCheck, div_ceil
#include <sgl_kernel/utils.cuh>  // LaunchKernel, SGL_DEVICE, fp16_t, bf16_t, fp32_t
#include <sgl_kernel/vec.cuh>    // AlignedVector

#include <dlpack/dlpack.h>
#include <tvm/ffi/container/tensor.h>

#include <algorithm>  // std::max
#include <cstdint>    // uint32_t, int32_t

namespace {

// ----------------------------------------------------------------
// Kernel: element-wise scale using vectorized 128-bit loads/stores
// T       = fp16_t | bf16_t | fp32_t
// kVecN   = number of elements per vector load (e.g. 8 for fp16)
// kFactor = scale factor encoded as kFactorNumer / kFactorDenom
// ----------------------------------------------------------------
template <typename T, int kVecN, int32_t kFactorNumer, int32_t kFactorDenom>
__global__ void scale_kernel(T* __restrict__ dst,
                             const T* __restrict__ src,
                             uint32_t n_vecs,
                             uint32_t n_remainder,
                             uint32_t n_total) {
  constexpr float kFactor = static_cast<float>(kFactorNumer)
                          / static_cast<float>(kFactorDenom);

  using vec_t = device::AlignedVector<T, kVecN>;

  // --- vectorised body ---
  const uint32_t vec_stride = blockDim.x * gridDim.x;
  for (uint32_t vi = blockIdx.x * blockDim.x + threadIdx.x;
       vi < n_vecs;
       vi += vec_stride) {
    vec_t v;
    v.load(src, vi);
#pragma unroll
    for (int i = 0; i < kVecN; ++i) {
      v[i] = static_cast<T>(static_cast<float>(v[i]) * kFactor);
    }
    v.store(dst, vi);
  }

  // --- scalar tail ---
  const uint32_t base = n_vecs * kVecN;
  const uint32_t scalar_stride = blockDim.x * gridDim.x;
  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n_remainder;
       i += scalar_stride) {
    dst[base + i] = static_cast<T>(static_cast<float>(src[base + i]) * kFactor);
  }
}

// ----------------------------------------------------------------
// Launcher: validates tensors, selects vector width, launches kernel
// ----------------------------------------------------------------
template <typename T, int32_t kFactorNumer, int32_t kFactorDenom>
void scale(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) {
  using namespace host;

  // 1. Validate input tensors with TensorMatcher
  SymbolicSize N = {"num_elements"};
  SymbolicDevice device_;
  device_.set_options<kDLCUDA>();

  TensorMatcher({N})  //
      .with_dtype<T>()
      .with_device(device_)
      .verify(dst)
      .verify(src);  // same shape / dtype / device as dst

  const uint32_t n = static_cast<uint32_t>(N.unwrap());
  const DLDevice device = device_.unwrap();

  RuntimeCheck(n > 0, "scale: num_elements must be > 0, got ", n);

  // 2. Choose vector width for 128-bit loads (16 bytes)
  //    fp16/bf16: 8 elements × 2 bytes = 16 bytes
  //    fp32:      4 elements × 4 bytes = 16 bytes
  constexpr int kVecN = 16 / sizeof(T);
  const uint32_t n_vecs = n / kVecN;
  const uint32_t n_remainder = n % kVecN;

  // 3. Launch
  constexpr uint32_t kBlockSize = 256;
  const uint32_t grid = div_ceil(std::max(n_vecs, n_remainder), kBlockSize);

  LaunchKernel(grid, kBlockSize, device)(
      scale_kernel<T, kVecN, kFactorNumer, kFactorDenom>,
      static_cast<T*>(dst.data_ptr()),
      static_cast<const T*>(src.data_ptr()),
      n_vecs,
      n_remainder,
      n);
}

}  // namespace
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
- Include headers from `sgl_kernel/` — **not** raw CUDA headers for anything already covered
|
||||
- Use `TensorMatcher` for all tensor validation; never manually check shape/dtype/device
|
||||
- Use `AlignedVector` for vectorised 128-bit loads/stores — significant bandwidth win
|
||||
- Use `LaunchKernel` — it resolves the stream and checks errors automatically
|
||||
- Use `RuntimeCheck` for runtime assertions with useful error messages
|
||||
- `fp16_t` / `bf16_t` / `fp32_t` are the project's type aliases (from `utils.cuh`)
|
||||
- `device::cast<To, From>` or `dtype_trait<T>::from(val)` for cross-type conversions
|
||||
- `device::math::` functions for device math instead of bare `__` intrinsics
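
To check that the vectorised body and the scalar tail together touch every element exactly once, the launcher arithmetic and both grid-stride loops can be replayed on the host. The Python sketch below does that; `split_work` and `covered_indices` are hypothetical helper names, and the constants (16-byte vectors, 256 threads per block) are taken from the kernel above.

```python
def split_work(n, elem_bytes, vec_bytes=16, block_size=256):
    """Mirror the launcher: vector width, vector count, tail length, grid size."""
    vec_n = vec_bytes // elem_bytes  # kVecN: 8 for fp16/bf16, 4 for fp32
    n_vecs, n_remainder = divmod(n, vec_n)
    grid = (max(n_vecs, n_remainder) + block_size - 1) // block_size
    return vec_n, n_vecs, n_remainder, grid


def covered_indices(n, elem_bytes, block_size=256):
    """Replay both grid-stride loops and record every element index written."""
    vec_n, n_vecs, n_remainder, grid = split_work(n, elem_bytes, block_size=block_size)
    stride = grid * block_size  # blockDim.x * gridDim.x
    written = []
    for tid in range(stride):  # one iteration per launched thread
        for vi in range(tid, n_vecs, stride):  # vectorised body
            written.extend(range(vi * vec_n, (vi + 1) * vec_n))
        for i in range(tid, n_remainder, stride):  # scalar tail
            written.append(n_vecs * vec_n + i)
    return sorted(written)


vec_n, n_vecs, n_remainder, grid = split_work(4097, elem_bytes=2)
```

For 4097 FP16 elements this gives 512 full vectors, a 1-element tail, and a grid of 2 blocks. Sizes that are not a multiple of the vector width are the ones that exercise the tail loop.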
|
||||
|
||||
---
|
||||
|
||||
## Step 2: Add the Python wrapper in `jit_kernel/`
|
||||
|
||||
Create `python/sglang/jit_kernel/scale.py`:
|
||||
|
||||
```python
from __future__ import annotations

from typing import TYPE_CHECKING

import torch

from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args

if TYPE_CHECKING:
    from tvm_ffi.module import Module


@cache_once
def _jit_scale_module(dtype: torch.dtype, factor_numer: int, factor_denom: int) -> Module:
    """Compile and cache the JIT scale module for a given dtype and factor."""
    args = make_cpp_args(dtype, factor_numer, factor_denom)
    return load_jit(
        "scale",
        *args,
        cuda_files=["elementwise/scale.cuh"],
        cuda_wrappers=[("scale", f"scale<{args}>")],
    )


def scale(src: torch.Tensor, factor: float, out: torch.Tensor | None = None) -> torch.Tensor:
    """
    Element-wise scale: dst = src * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    src    : CUDA tensor (FP16 / BF16 / FP32)
    factor : scale factor
    out    : optional pre-allocated output tensor (same shape/dtype as src)

    Returns
    -------
    Scaled tensor (dst = src * factor).
    """
    assert src.is_cuda, "src must be a CUDA tensor"
    assert src.dtype in (torch.float16, torch.bfloat16, torch.float32), (
        f"Unsupported dtype {src.dtype}. Supported: float16, bfloat16, float32"
    )
    if out is None:
        out = torch.empty_like(src)
    else:
        assert out.shape == src.shape, "out shape must match src"
        assert out.dtype == src.dtype, "out dtype must match src"

    # Encode factor as integer ratio; denom=1000 gives 3 decimal places of precision
    factor_denom = 1000
    factor_numer = round(factor * factor_denom)

    module = _jit_scale_module(src.dtype, factor_numer, factor_denom)
    module.scale(out, src)
    return out
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
- Use `cache_once` — **not** `functools.lru_cache` (incompatible with `torch.compile`)
|
||||
- `load_jit` first arg(s) form the unique build marker; same marker = same cached binary
|
||||
- `cuda_wrappers`: `(export_name, kernel_symbol)` — `export_name` is called from Python
|
||||
- `make_cpp_args(dtype, ...)` converts `torch.dtype` to C++ type alias:
|
||||
|
||||
| `torch.dtype` | C++ type |
|
||||
|--------------------|------------|
|
||||
| `torch.float16` | `fp16_t` |
|
||||
| `torch.bfloat16` | `bf16_t` |
|
||||
| `torch.float32` | `fp32_t` |
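
The integer-ratio encoding used by the wrapper trades precision for the ability to pass the factor as template arguments. The Python sketch below shows the rounding involved; `encode_factor` and `decode_factor` are hypothetical helpers that mirror the two lines in `scale()`, not functions from `sglang.jit_kernel`.

```python
def encode_factor(factor, denom=1000):
    """Return (numer, denom) such that numer / denom approximates factor."""
    return round(factor * denom), denom


def decode_factor(numer, denom):
    """What the kernel reconstructs as kFactor."""
    return numer / denom


numer, denom = encode_factor(0.5)            # exact at 3 decimal places
pi_numer, pi_denom = encode_factor(3.14159)  # rounded to 3.142
pi_error = abs(decode_factor(pi_numer, pi_denom) - 3.14159)
```

Factors with more than three decimal places are silently rounded, with an absolute error of at most 0.0005. Since `_jit_scale_module` is cached per `(dtype, factor_numer, factor_denom)`, each distinct rounded factor presumably compiles its own kernel specialization, so this encoding suits a small set of fixed factors better than arbitrary runtime values.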
|
||||
|
||||
---
|
||||
|
||||
## Step 3 (optional): Tune JIT build flags
|
||||
|
||||
```python
return load_jit(
    "scale",
    *args,
    cuda_files=["elementwise/scale.cuh"],
    cuda_wrappers=[("scale", f"scale<{args}>")],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
)
```
|
||||
|
||||
If your kernel requires SM90+, raise a clear Python error before calling `load_jit`:
|
||||
|
||||
```python
if torch.cuda.get_device_capability()[0] < 9:
    raise RuntimeError("This kernel requires SM90 (Hopper) or later")
```
|
||||
|
||||
---
|
||||
|
||||
## Step 4: Write tests (required)
|
||||
|
||||
Create `python/sglang/jit_kernel/tests/test_scale.py`:
|
||||
|
||||
```python
import pytest
import torch
from sglang.jit_kernel.scale import scale


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [1, 127, 128, 1024, 4097])  # cover tail remainder
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, 3.0])
def test_scale_correctness(dtype, size, factor):
    src = torch.randn(size, dtype=dtype, device="cuda")
    out = scale(src, factor)
    expected = src * factor

    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
def test_scale_out_param(dtype):
    src = torch.randn(1024, dtype=dtype, device="cuda")
    out = torch.empty_like(src)
    result = scale(src, 2.0, out=out)
    assert result is out
    torch.testing.assert_close(out, src * 2.0, rtol=1e-2, atol=1e-2)


def test_scale_cpu_error():
    src = torch.randn(128, dtype=torch.float16)  # CPU tensor
    with pytest.raises(AssertionError, match="CUDA"):
        scale(src, 2.0)


def test_scale_unsupported_dtype():
    src = torch.randint(0, 10, (128,), dtype=torch.int32, device="cuda")
    with pytest.raises(AssertionError, match="Unsupported dtype"):
        scale(src, 2.0)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
pytest python/sglang/jit_kernel/tests/test_scale.py -q
|
||||
```
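
The relative tolerances in the tests above can be sanity-checked against the precision of each format. The Python sketch below computes machine epsilon from the number of explicit mantissa bits and compares it with the `rtol` values used in `test_scale_correctness`; the dictionary names are hypothetical and exist only for this illustration.

```python
# Explicit mantissa bits: IEEE 754 binary16, bfloat16, and IEEE 754 binary32.
MANTISSA_BITS = {"float16": 10, "bfloat16": 7, "float32": 23}

# Relative tolerances copied from test_scale_correctness.
TEST_RTOL = {"float16": 1e-2, "bfloat16": 1e-2, "float32": 1e-5}


def machine_epsilon(dtype_name):
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -MANTISSA_BITS[dtype_name]


# How many epsilons of slack each tolerance leaves.
headroom = {name: TEST_RTOL[name] / machine_epsilon(name) for name in MANTISSA_BITS}
```

BF16 has the least slack: its epsilon is `2**-7 = 0.0078125`, so `rtol=1e-2` leaves only about 1.3 epsilons. A tighter tolerance would make the BF16 cases fail on rounding alone.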
|
||||
|
||||
---
|
||||
|
||||
## Step 5: Add a benchmark (required)
|
||||
|
||||
Create `python/sglang/jit_kernel/benchmark/bench_scale.py`:
|
||||
|
||||
```python
import itertools

import torch
import triton
import triton.testing

from sglang.jit_kernel.benchmark.utils import (
    DEFAULT_DEVICE,
    DEFAULT_DTYPE,
    get_benchmark_range,
    run_benchmark,
)
from sglang.jit_kernel.scale import scale as jit_scale


SIZE_LIST = get_benchmark_range(
    full_range=[2**n for n in range(10, 20)],  # 1K … 512K elements
    ci_range=[4096, 65536],
)

configs = list(itertools.product(SIZE_LIST))


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["jit", "torch"],
        line_names=["SGL JIT Kernel", "PyTorch"],
        styles=[("blue", "-"), ("red", "--")],
        ylabel="us",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(size: int, provider: str):
    src = torch.randn(size, dtype=DEFAULT_DTYPE, device=DEFAULT_DEVICE)
    factor = 2.0

    if provider == "jit":
        fn = lambda: jit_scale(src, factor)
    else:
        fn = lambda: src * factor

    return run_benchmark(fn)


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
python python/sglang/jit_kernel/benchmark/bench_scale.py
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- **JIT compilation fails**: ensure the `.cuh` file is under `python/sglang/jit_kernel/csrc/`; reduce template argument combinations
|
||||
- **CUDA crash / illegal memory access**: rerun with `CUDA_LAUNCH_BLOCKING=1` so the failing launch is reported synchronously, or run under `compute-sanitizer --tool memcheck python ...`
|
||||
- **Unstable benchmark results**: `run_benchmark` uses CUDA-graph-based timing by default
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- `docs/developer_guide/development_jit_kernel_guide.md`
|
||||
- `python/sglang/jit_kernel/utils.py` — `cache_once`, `load_jit`, `make_cpp_args`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/tensor.h` — `TensorMatcher`, `SymbolicSize/DType/Device`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/utils.cuh` — type aliases, `LaunchKernel`, `SGL_DEVICE`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/vec.cuh` — `AlignedVector`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/tile.cuh` — `tile::Memory`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/type.cuh` — `dtype_trait`, `packed_t`, `device::cast`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/math.cuh` — `device::math::`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/warp.cuh` — `warp::reduce_sum/max`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/cta.cuh` — `cta::reduce_max`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/atomic.cuh` — `atomic::max`
|
||||
- `python/sglang/jit_kernel/include/sgl_kernel/runtime.cuh` — occupancy / SM count helpers
|
||||
- `python/sglang/jit_kernel/csrc/add_constant.cuh` — minimal runnable reference
|
||||
- `python/sglang/jit_kernel/csrc/elementwise/rmsnorm.cuh` — real example using `TensorMatcher` + `LaunchKernel` + `tile::Memory`
|
||||
- `python/sglang/jit_kernel/csrc/elementwise/qknorm.cuh` — real example using `runtime::get_blocks_per_sm` + persistent kernel pattern
|
||||
- `python/sglang/jit_kernel/benchmark/utils.py` — benchmark helpers
|
||||
|
||||
## Summary of Files Created
|
||||
|
||||
```
|
||||
python/sglang/jit_kernel/csrc/elementwise/scale.cuh # NEW: CUDA kernel
|
||||
python/sglang/jit_kernel/scale.py # NEW: Python wrapper
|
||||
python/sglang/jit_kernel/tests/test_scale.py # NEW: Tests
|
||||
python/sglang/jit_kernel/benchmark/bench_scale.py # NEW: Benchmark
|
||||
```
|
||||
358
.claude/skills/add-sgl-kernel/SKILL.md
Normal file
@@ -0,0 +1,358 @@
|
||||
---
|
||||
name: add-sgl-kernel
|
||||
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
|
||||
---
|
||||
|
||||
# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)
|
||||
|
||||
This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
|
||||
|
||||
## Goal
|
||||
|
||||
Add a new operation that scales each element of a tensor by a scalar factor:
|
||||
|
||||
- Input: tensor `x` (CUDA) and scalar `factor` (float)
|
||||
- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`)
|
||||
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
|
||||
- Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
|
||||
|
||||
## Two rules of thumb (must follow)
|
||||
|
||||
1. **Heavyweight kernels go to `sgl-kernel`.** If it depends on CUTLASS / FlashInfer / DeepGEMM (or similarly heavy stacks), implement it in `sgl-kernel/`.
|
||||
2. **Lightweight kernels go to `python/sglang/jit_kernel`.** If it is small, has few dependencies, and benefits from rapid iteration, implement it as a JIT kernel instead.
|
||||
|
||||
In addition, every new kernel must ship with:
|
||||
|
||||
- **Tests** (pytest)
|
||||
- **A benchmark script** (triton.testing)
|
||||
|
||||
---
|
||||
|
||||
## Repository integration map
|
||||
|
||||
You will typically touch these files/areas:
|
||||
|
||||
- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
|
||||
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
|
||||
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
|
||||
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
|
||||
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
|
||||
- Tests: `sgl-kernel/tests/test_scale.py`
|
||||
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
|
||||
|
||||
---
|
||||
|
||||
## Step 1: Implement the kernel in `csrc/`
|
||||
|
||||
Pick the right subdirectory:
|
||||
|
||||
- `csrc/elementwise/` — for element-wise ops (our example)
|
||||
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories
|
||||
|
||||
Create `sgl-kernel/csrc/elementwise/scale.cu`:
|
||||
|
||||
```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                             const T* __restrict__ input,
                             float factor,
                             int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  if (n == 0) return;  // nothing to do; a zero-block launch is an invalid configuration
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // Activate the input's device first, then query the current stream on that device
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
|
||||
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
|
||||
- Add device error checking after every kernel launch
|
||||
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
|
||||
|
||||
---
|
||||
|
||||
## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`
|
||||
|
||||
Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:
|
||||
|
||||
```cpp
|
||||
void scale(at::Tensor& out, const at::Tensor& input, double factor);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 3: Register the op in `csrc/common_extension.cc`
|
||||
|
||||
Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:
|
||||
|
||||
```cpp
|
||||
// From csrc/elementwise
|
||||
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
|
||||
m.impl("scale", torch::kCUDA, &scale);
|
||||
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
- `Tensor!` means in-place / mutable output argument
|
||||
- The schema is important for `torch.compile` and for consistent call signatures
|
||||
- If your underlying C++ API uses `float` but PyTorch bindings expect `double`, the implicit cast is fine for scalars; use shims if needed for other types
|
||||
|
||||
---
|
||||
|
||||
## Step 4: Add the new source file to `CMakeLists.txt`
|
||||
|
||||
Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:
|
||||
|
||||
```cmake
|
||||
csrc/elementwise/scale.cu
|
||||
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
- Keep the list **alphabetically sorted** (the file explicitly requires this)
|
||||
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
|
||||
|
||||
---
|
||||
|
||||
## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`
|
||||
|
||||
In `sgl-kernel/python/sgl_kernel/__init__.py`, add:
|
||||
|
||||
```python
import torch

# Op namespace registered by the compiled extension via TORCH_LIBRARY_FRAGMENT
_ops = torch.ops.sgl_kernel


def scale(out: torch.Tensor, input: torch.Tensor, factor: float) -> None:
    """
    Element-wise scale: out = input * factor (in-place into out).

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    out    : pre-allocated CUDA output tensor (same shape/dtype as input)
    input  : CUDA input tensor
    factor : scale factor (float)
    """
    _ops.scale(out, input, factor)
```
|
||||
|
||||
Or export it from the existing module organisation — follow the pattern already used by similar ops in `__init__.py`.
|
||||
|
||||
---
|
||||
|
||||
## Step 6: Write tests (required)
|
||||
|
||||
Create `sgl-kernel/tests/test_scale.py`:
|
||||
|
||||
```python
import pytest
import torch
import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)

    sgl_kernel.scale(out, input, factor)

    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(out, input, 2.0)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(out, input, 2.0)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
pytest sgl-kernel/tests/test_scale.py -q
|
||||
```
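
The two failure tests above assume that the launcher rejects mismatched shapes and CPU tensors with `RuntimeError` messages containing "same shape" and "CUDA". The sketch below mirrors those checks in plain Python so the expected behaviour is easy to see. `check_scale_args` is a hypothetical helper, not part of `sgl_kernel`, and its error messages are assumptions chosen to match the test patterns. It is duck-typed (it only reads `.shape`, `.dtype` and `.is_cuda`), so it runs without torch or a GPU.

```python
from types import SimpleNamespace


def check_scale_args(out, input):
    """Hypothetical mirror of the checks the tests above expect.

    Works on any object exposing .shape, .dtype and .is_cuda,
    which torch.Tensor does.
    """
    if not (input.is_cuda and out.is_cuda):
        raise RuntimeError("scale: input and out must be CUDA tensors")
    if tuple(out.shape) != tuple(input.shape):
        raise RuntimeError("scale: out and input must have the same shape")
    if out.dtype != input.dtype:
        raise RuntimeError("scale: out and input must have the same dtype")


# Stand-ins for tensors, so the example runs without torch or a GPU.
good = SimpleNamespace(shape=(128,), dtype="float16", is_cuda=True)
wrong_shape = SimpleNamespace(shape=(256,), dtype="float16", is_cuda=True)

check_scale_args(good, good)  # passes silently
try:
    check_scale_args(wrong_shape, good)
except RuntimeError as err:
    print(err)
```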

---

## Step 7: Add a benchmark (required)

Create `sgl-kernel/benchmark/bench_scale.py`:

```python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
factors = [2.0]

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(out, input, factor)
    else:
        fn = lambda: torch_scale(input, factor)

    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```

Run:

```bash
python sgl-kernel/benchmark/bench_scale.py
```
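
The benchmark asks `do_bench_cudagraph` for the 0.5, 0.2 and 0.8 quantiles in milliseconds and multiplies them by 1000 to report microseconds. As a rough illustration of what those three numbers mean, the sketch below computes the same quantiles from a list of timings using only the standard library. `summarize_ms` and the sample timings are made up for illustration, and the interpolation method used here may differ from Triton's.

```python
import statistics


def summarize_ms(samples_ms):
    """Return median, p20 and p80 of per-run timings, converted from ms to µs."""
    # n=10 yields nine cut points at 10%, 20%, ..., 90%.
    cuts = statistics.quantiles(samples_ms, n=10, method="inclusive")
    p20, p50, p80 = cuts[1], cuts[4], cuts[7]
    return {"median_us": 1000 * p50, "p20_us": 1000 * p20, "p80_us": 1000 * p80}


# Made-up timings in milliseconds.
timings_ms = [0.010 + 0.001 * i for i in range(11)]
summary = summarize_ms(timings_ms)
print({name: round(value, 3) for name, value in summary.items()})
```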

---

## Step 8: Build and validate

Build:

```bash
cd sgl-kernel
make build -j16
```

If you need to limit host resource usage:

```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```

Validate:

```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```

---

## Troubleshooting

- **Async CUDA errors**: `CUDA_LAUNCH_BLOCKING=1`
- **Memory errors**: `compute-sanitizer --tool memcheck python ...`
- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
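
The first two items come down to an environment variable and a command prefix. A small sketch of how such a debug run could be assembled from Python follows. `debug_command` is a hypothetical helper, and the default test path is simply the one used in this tutorial.

```python
import os
import shlex


def debug_command(test_path="sgl-kernel/tests/test_scale.py", memcheck=False):
    """Build (argv, env) for a debug run of the tests.

    CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so an error
    is reported at the launch that caused it.
    """
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    argv = ["python", "-m", "pytest", test_path, "-q"]
    if memcheck:
        # Prefix with compute-sanitizer to look for invalid memory accesses.
        argv = ["compute-sanitizer", "--tool", "memcheck"] + argv
    return argv, env


argv, env = debug_command(memcheck=True)
print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(argv))
# On a machine with a GPU, run it with subprocess.run(argv, env=env).
```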

---

## References

- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern

## Summary of Files Created/Modified

```
sgl-kernel/csrc/elementwise/scale.cu       # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h        # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc        # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                  # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/__init__.py   # MODIFIED: export Python API
sgl-kernel/tests/test_scale.py             # NEW: tests
sgl-kernel/benchmark/bench_scale.py        # NEW: benchmark
```
37
.github/CI_PERMISSIONS.json
vendored
@@ -55,6 +55,13 @@
|
||||
"reason": "top contributor",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"Chen-0210": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
"cooldown_interval_minutes": 60,
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"ClawSeven": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
@@ -121,7 +128,7 @@
|
||||
"HandH1998": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
"cooldown_interval_minutes": 60,
|
||||
"cooldown_interval_minutes": 0,
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
@@ -202,6 +209,13 @@
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"Ratish1": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
"cooldown_interval_minutes": 0,
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"ShangmingCai": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
@@ -720,6 +734,20 @@
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"mmangkad": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
"cooldown_interval_minutes": 0,
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"narutolhy": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
"cooldown_interval_minutes": 0,
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"netanel-haber": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
@@ -846,6 +874,13 @@
|
||||
"reason": "top contributor",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"sglang-npu-bot": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
"cooldown_interval_minutes": 0,
|
||||
"reason": "custom override",
|
||||
"can_rerun_stage": true
|
||||
},
|
||||
"shaharmor98": {
|
||||
"can_tag_run_ci_label": true,
|
||||
"can_rerun_failed_ci": true,
|
||||
|
||||
20
.github/CODEOWNERS
vendored
@@ -1,18 +1,19 @@
|
||||
.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou
|
||||
/docker @Fridge003 @ispobock @HaiShaw @ishandhanani
|
||||
.github @merrymercy @Fridge003 @ispobock @Kangyan-Zhou @bingxche
|
||||
/docker @Fridge003 @ispobock @HaiShaw @ishandhanani @yctseng0211
|
||||
/docker/npu.Dockerfile @ping1jing2 @iforgetmyname
|
||||
/python/pyproject.toml @merrymercy @Fridge003 @ispobock
|
||||
/python/sglang/jit_kernel @DarkSharpness @BBuf
|
||||
/python/sglang/jit_kernel @DarkSharpness @BBuf @celve @HydraQYH @yuan-luo
|
||||
/python/sglang/jit_kernel/diffusion @yingluosanqian @BBuf @mickqian
|
||||
/python/sglang/multimodal_gen @mickqian @yhyang201
|
||||
/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian
|
||||
/python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf @yingluosanqian
|
||||
/python/sglang/multimodal_gen @mickqian @yhyang201 @ping1jing2
|
||||
/python/sglang/multimodal_gen/runtime/layers @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
|
||||
/python/sglang/multimodal_gen/runtime/models/dits @mickqian @yhyang201 @BBuf @yingluosanqian @ping1jing2
|
||||
/python/sglang/srt/batch_invariant_ops @Fridge003 @hebiao064
|
||||
/python/sglang/srt/constrained @hnyls2002 @DarkSharpness
|
||||
/python/sglang/srt/compilation @hebiao064
|
||||
/python/sglang/srt/disaggregation @ByronHsu @hnyls2002 @ShangmingCai
|
||||
/python/sglang/srt/disaggregation/ascend @ping1jing2 @iforgetmyname
|
||||
/python/sglang/srt/distributed @yizhang2077 @merrymercy @ch-wan
|
||||
/python/sglang/srt/dllm @ClawSeven @btw616
|
||||
/python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237 @merrymercy @JustinTong0323
|
||||
/python/sglang/srt/entrypoints/grpc_server.py @CatherineSue @slin1237
|
||||
/python/sglang/srt/eplb @fzyzcjy @ch-wan
|
||||
@@ -21,11 +22,13 @@
|
||||
/python/sglang/srt/hardware_backend/npu @ping1jing2 @iforgetmyname
|
||||
/python/sglang/srt/hardware_backend/npu/quantization @OrangeRedeng @TamirBaydasov @iforgetmyname
|
||||
/python/sglang/srt/layers @merrymercy @Ying1123 @Fridge003 @ispobock @HaiShaw @ch-wan @BBuf @Edwardf0t1
|
||||
/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064
|
||||
/python/sglang/srt/layers/attention @merrymercy @Fridge003 @ispobock @Qiaolin-Yu @hebiao064 @HaiShaw
|
||||
/python/sglang/srt/layers/attention/fla @yizhang2077 @hebiao064
|
||||
/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py @yizhang2077 @hebiao064 @hanming-lu
|
||||
/python/sglang/srt/layers/attention/mamba @yizhang2077 @hebiao064
|
||||
/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ
|
||||
/python/sglang/srt/layers/attention/nsa @1am9trash @hubertlu-tw @kkHuang-amd @HaiShaw @Fridge003 @hlu1 @rainj-me
|
||||
/python/sglang/srt/layers/quantization @ch-wan @BBuf @Edwardf0t1 @FlamingoPg @AniZpZ @HaiShaw
|
||||
/python/sglang/srt/layers/quantization/quark @kkHuang-amd @yichiche @hubertlu-tw @1am9trash @BowenBao
|
||||
/python/sglang/srt/lora @Ying1123 @Fridge003 @lifuhuang
|
||||
/python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
|
||||
/python/sglang/srt/managers/scheduler_pp_mixin.py @ShangmingCai @XucSh
|
||||
@@ -34,6 +37,7 @@
|
||||
/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py @hebiao064
|
||||
/python/sglang/srt/models/deepseek_v2.py @fzyzcjy @zhyncs @ispobock @ch-wan @merrymercy @Fridge003
|
||||
/python/sglang/srt/multimodal @mickqian @JustinTong0323 @yhyang201 @yuan-luo
|
||||
/python/sglang/srt/observability @merrymercy @fzyzcjy @sufeng-buaa
|
||||
/python/sglang/srt/speculative @Ying1123 @merrymercy @hnyls2002
|
||||
/sgl-kernel @zhyncs @ispobock @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw
|
||||
/sgl-model-gateway @slin1237 @CatherineSue
|
||||
|
||||
27
.github/actions/upload-cuda-coredumps/action.yml
vendored
Normal file
@@ -0,0 +1,27 @@
name: Upload CUDA Coredumps
description: Upload CUDA coredump files as artifacts and clean up the directory.

inputs:
  artifact-suffix:
    description: Suffix appended to the artifact name (e.g. matrix partition id)
    required: false
    default: ""
  retention-days:
    description: Number of days to retain the artifact
    required: false
    default: "7"

runs:
  using: composite
  steps:
    - name: Upload CUDA coredumps
      uses: actions/upload-artifact@v4
      with:
        name: cuda-coredumps-${{ github.job }}${{ inputs.artifact-suffix && format('-{0}', inputs.artifact-suffix) }}
        path: ${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}/
        retention-days: ${{ inputs.retention-days }}
        if-no-files-found: ignore

    - name: Cleanup CUDA coredumps
      shell: bash
      run: rm -rf "${{ env.SGLANG_CUDA_COREDUMP_DIR || '/tmp/sglang_cuda_coredumps' }}"
20
.github/workflows/lint.yml
vendored
@@ -26,25 +26,9 @@ jobs:
|
||||
run: SKIP=no-commit-to-branch pre-commit run --all-files --show-diff-on-failure
|
||||
|
||||
- name: Run sgl-kernel clang-format checks
|
||||
uses: DoozyX/clang-format-lint-action@v0.18.1
|
||||
uses: DoozyX/clang-format-lint-action@v0.20
|
||||
with:
|
||||
source: sgl-kernel
|
||||
extensions: h,c,cpp,hpp,cu,cuh,cc
|
||||
clangFormatVersion: 18
|
||||
clangFormatVersion: 20
|
||||
style: file
|
||||
|
||||
- name: Check proto files are in sync
|
||||
run: |
|
||||
if ! diff -q python/sglang/srt/grpc/sglang_scheduler.proto sgl-model-gateway/src/proto/sglang_scheduler.proto; then
|
||||
echo "❌ ERROR: Proto files are out of sync!"
|
||||
echo ""
|
||||
echo "The following files must be kept identical:"
|
||||
echo " - python/sglang/srt/grpc/sglang_scheduler.proto"
|
||||
echo " - sgl-model-gateway/src/proto/sglang_scheduler.proto"
|
||||
echo ""
|
||||
echo "Please ensure both files have the same content."
|
||||
echo ""
|
||||
echo "Differences:"
|
||||
diff python/sglang/srt/grpc/sglang_scheduler.proto sgl-model-gateway/src/proto/sglang_scheduler.proto || true
|
||||
exit 1
|
||||
fi
|
||||
|
||||
90
.github/workflows/list-active-pr-runs.yml.yml
vendored
@@ -1,4 +1,4 @@
|
||||
name: List Active PR Runs
|
||||
name: List Active Runs
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
@@ -15,13 +15,13 @@ permissions:
|
||||
pull-requests: read
|
||||
|
||||
jobs:
|
||||
list-active-pr-runs:
|
||||
list-active-runs:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Install GitHub CLI
|
||||
run: sudo apt-get install -y gh jq
|
||||
|
||||
- name: List active PR runs grouped by PR
|
||||
- name: List active runs grouped by PR
|
||||
env:
|
||||
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
REPO: ${{ github.repository }}
|
||||
@@ -31,7 +31,7 @@ jobs:
|
||||
set -euo pipefail
|
||||
|
||||
echo "========================================="
|
||||
echo "🔍 Active PR Workflow Runs Report"
|
||||
echo "🔍 Active Workflow Runs Report"
|
||||
echo "========================================="
|
||||
echo ""
|
||||
|
||||
@@ -54,7 +54,7 @@ jobs:
|
||||
--workflow "$workflow_file" \
|
||||
--json databaseId,status,event,headBranch,createdAt,updatedAt,headSha,number,attempt \
|
||||
--limit 500 \
|
||||
| jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress") | select(.event=="pull_request")')
|
||||
| jq -c '.[] | select(.status=="queued" or .status=="waiting" or .status=="in_progress")')
|
||||
|
||||
if [ -z "$active_runs" ]; then
|
||||
continue
|
||||
@@ -64,6 +64,7 @@ jobs:
|
||||
echo "$active_runs" | while read -r run; do
|
||||
run_id=$(echo "$run" | jq -r '.databaseId')
|
||||
run_status=$(echo "$run" | jq -r '.status')
|
||||
run_event=$(echo "$run" | jq -r '.event')
|
||||
created_at=$(echo "$run" | jq -r '.createdAt')
|
||||
head_sha=$(echo "$run" | jq -r '.headSha')
|
||||
run_number=$(echo "$run" | jq -r '.number')
|
||||
@@ -83,12 +84,12 @@ jobs:
|
||||
continue
|
||||
fi
|
||||
|
||||
# Find PR number
|
||||
# Find PR number (may be empty for non-PR runs)
|
||||
pr_number=$(gh api "repos/$REPO/pulls?state=open&head=${head_owner}:${head_branch}" \
|
||||
--jq '.[0].number // empty' 2>/dev/null || true)
|
||||
|
||||
if [ -z "$pr_number" ]; then
|
||||
continue
|
||||
pr_number="NO_PR"
|
||||
fi
|
||||
|
||||
# Get jobs for this run (with pagination to avoid missing jobs)
|
||||
@@ -106,25 +107,25 @@ jobs:
|
||||
queue_time=$((current_time - created_time))
|
||||
queue_minutes=$((queue_time / 60))
|
||||
|
||||
# Store data in temporary file
|
||||
echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt" >> "$pr_data_file"
|
||||
# Store data in temporary file (unified format with event and branch)
|
||||
echo "$pr_number|$workflow_file|$run_id|$run_status|$running_jobs|$queued_jobs|$runners|$queue_minutes|$created_at|$head_sha|$run_attempt|$run_event|$head_branch" >> "$pr_data_file"
|
||||
done
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "========================================="
|
||||
echo "📊 Active PRs Summary"
|
||||
echo "📊 Active Runs Summary"
|
||||
echo "========================================="
|
||||
echo ""
|
||||
|
||||
if [ ! -s "$pr_data_file" ]; then
|
||||
echo "✅ No active PR runs found"
|
||||
echo "✅ No active runs found"
|
||||
rm -f "$pr_data_file"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Get unique PR numbers
|
||||
pr_numbers=$(cat "$pr_data_file" | cut -d'|' -f1 | sort -u)
|
||||
# Get unique PR numbers (exclude NO_PR entries)
|
||||
pr_numbers=$(cut -d'|' -f1 < "$pr_data_file" | grep -v '^NO_PR$' | sort -u || true)
|
||||
|
||||
# Separate high priority and normal PRs
|
||||
high_priority_prs=()
|
||||
@@ -240,11 +241,74 @@ jobs:
|
||||
echo ""
|
||||
done
|
||||
|
||||
# --- Non-PR Runs Section ---
|
||||
non_pr_runs=$(grep '^NO_PR|' "$pr_data_file" 2>/dev/null || true)
|
||||
non_pr_running=0
|
||||
non_pr_queued=0
|
||||
|
||||
if [ -n "$non_pr_runs" ]; then
|
||||
echo "========================================="
|
||||
echo "📦 Non-PR Runs (manual / scheduled / other)"
|
||||
echo "========================================="
|
||||
echo ""
|
||||
|
||||
echo "$non_pr_runs" | while read -r line; do
|
||||
workflow=$(echo "$line" | cut -d'|' -f2)
|
||||
run_id=$(echo "$line" | cut -d'|' -f3)
|
||||
status=$(echo "$line" | cut -d'|' -f4)
|
||||
running=$(echo "$line" | cut -d'|' -f5)
|
||||
queued=$(echo "$line" | cut -d'|' -f6)
|
||||
runners=$(echo "$line" | cut -d'|' -f7)
|
||||
queue_min=$(echo "$line" | cut -d'|' -f8)
|
||||
created=$(echo "$line" | cut -d'|' -f9)
|
||||
attempt=$(echo "$line" | cut -d'|' -f11)
|
||||
event=$(echo "$line" | cut -d'|' -f12)
|
||||
branch=$(echo "$line" | cut -d'|' -f13)
|
||||
|
||||
run_url="https://github.com/$REPO/actions/runs/$run_id"
|
||||
|
||||
retry_count=$((attempt - 1))
|
||||
retry_indicator=""
|
||||
if [ "$retry_count" -gt 0 ]; then
|
||||
retry_indicator=" 🔄 Retry #$retry_count"
|
||||
fi
|
||||
|
||||
echo " 📦 Workflow: $workflow (Run #$run_id)$retry_indicator"
|
||||
echo " Event: $event"
|
||||
echo " Branch: $branch"
|
||||
echo " Status: $status"
|
||||
echo " 🟢 Running jobs: $running"
|
||||
echo " 🟡 Queued jobs: $queued"
|
||||
|
||||
if [ "$running" -gt 0 ] && [ "$runners" != "" ]; then
|
||||
echo " 🖥️ Runners: $runners"
|
||||
fi
|
||||
|
||||
if [ "$queue_min" -gt 0 ]; then
|
||||
echo " ⏱️ Queue time: ${queue_min} minutes"
|
||||
fi
|
||||
|
||||
echo " 🔗 Run URL: $run_url"
|
||||
echo ""
|
||||
done
|
||||
|
||||
non_pr_running=$(echo "$non_pr_runs" | cut -d'|' -f5 | awk '{sum+=$1} END {print sum+0}')
|
||||
non_pr_queued=$(echo "$non_pr_runs" | cut -d'|' -f6 | awk '{sum+=$1} END {print sum+0}')
|
||||
non_pr_count=$(echo "$non_pr_runs" | wc -l | tr -d ' ')
|
||||
|
||||
total_running=$((total_running + non_pr_running))
|
||||
total_queued=$((total_queued + non_pr_queued))
|
||||
|
||||
echo " 📊 Non-PR Total: $non_pr_running running, $non_pr_queued queued"
|
||||
echo ""
|
||||
fi
|
||||
|
||||
# Overall summary
|
||||
echo "========================================="
|
||||
echo "📈 Overall Summary"
|
||||
echo "========================================="
|
||||
echo "Total PRs with active runs: $pr_count"
|
||||
echo "Total non-PR active runs: ${non_pr_count:-0}"
|
||||
echo "Total running jobs: $total_running"
|
||||
echo "Total queued jobs: $total_queued"
|
||||
echo "========================================="
|
||||
|
||||
1041
.github/workflows/nightly-test-amd-rocm720.yml
vendored
Normal file
File diff suppressed because it is too large
174
.github/workflows/nightly-test-amd.yml
vendored
@@ -32,9 +32,13 @@ on:
|
||||
- 'nightly-8-gpu-deepseek-v32'
|
||||
- 'nightly-8-gpu-deepseek-v32-mtp'
|
||||
- 'nightly-8-gpu-kimi-k25'
|
||||
- 'nightly-8-gpu-qwen3-235b'
|
||||
- 'nightly-8-gpu-glm5'
|
||||
# MI35x jobs
|
||||
- 'nightly-test-1-gpu-mi35x'
|
||||
- 'nightly-8-gpu-mi35x-qwen3-235b-mxfp4'
|
||||
- 'nightly-8-gpu-mi35x-kimi-k25'
|
||||
- 'nightly-8-gpu-mi35x-glm5'
|
||||
- 'nightly-accuracy-8-gpu-mi35x'
|
||||
- 'nightly-8-gpu-mi35x-grok1-int4'
|
||||
- 'nightly-8-gpu-mi35x-grok2'
|
||||
@@ -83,11 +87,11 @@ jobs:
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
|
||||
- name: Nightly Unit Test (1-GPU)
|
||||
timeout-minutes: 60
|
||||
timeout-minutes: 90
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 600 --continue-on-error || TEST_EXIT_CODE=$?
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu --nightly --timeout-per-file 900 --continue-on-error || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
@@ -464,36 +468,6 @@ jobs:
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
# 8-GPU Kimi-K2 (Accuracy + Speed)
|
||||
nightly-8-gpu-kimi-k2:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-kimi-k2')
|
||||
runs-on: linux-mi325-gpu-8
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.ref || github.ref }}
|
||||
|
||||
- name: Setup docker
|
||||
run: |
|
||||
touch github_summary.md
|
||||
bash scripts/ci/amd/amd_ci_start_container.sh
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
|
||||
- name: Accuracy Test (8-GPU Kimi-K2)
|
||||
timeout-minutes: 120
|
||||
run: |
|
||||
> github_summary.md # Clear summary file
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-kimi-k2 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
# 8-GPU Kimi-K2.5 (Accuracy)
|
||||
nightly-8-gpu-kimi-k25:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-kimi-k25')
|
||||
@@ -524,6 +498,67 @@ jobs:
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
nightly-8-gpu-qwen3-235b:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-qwen3-235b')
|
||||
runs-on: linux-mi325-gpu-8
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.ref || github.ref }}
|
||||
|
||||
- name: Setup docker
|
||||
run: |
|
||||
touch github_summary.md
|
||||
bash scripts/ci/amd/amd_ci_start_container.sh
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
|
||||
- name: Accuracy Test + Performance Test (8-GPU Qwen3)
|
||||
timeout-minutes: 120
|
||||
run: |
|
||||
> github_summary.md # Clear summary file
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-8-gpu-qwen3-235b --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
nightly-8-gpu-glm5:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-glm5')
|
||||
runs-on: linux-mi325-gpu-8
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.ref || github.ref }}
|
||||
|
||||
- name: Setup docker
|
||||
run: |
|
||||
touch github_summary.md
|
||||
bash scripts/ci/amd/amd_ci_start_container.sh
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
# GLM-5 requires latest transformers for glm_moe_dsa architecture
|
||||
bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git
|
||||
|
||||
- name: Accuracy Test (8-GPU GLM-5 NSA)
|
||||
timeout-minutes: 120
|
||||
run: |
|
||||
> github_summary.md # Clear summary file
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-accuracy-8-gpu-glm5 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
# ============================================== MI35x Tests ==============================================
|
||||
# MI35x 1-GPU tests - platform-agnostic tests that may work on CDNA4 (gfx950)
|
||||
nightly-test-1-gpu-mi35x:
|
||||
@@ -549,11 +584,11 @@ jobs:
|
||||
bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
|
||||
|
||||
- name: Nightly Test MI35x (1-GPU)
|
||||
timeout-minutes: 60
|
||||
timeout-minutes: 90
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 600 --continue-on-error || TEST_EXIT_CODE=$?
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-1-gpu-mi35x --nightly --timeout-per-file 900 --continue-on-error || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
@@ -857,6 +892,73 @@ jobs:
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
# MI35x 8-GPU Qwen3-235B-MXFP4 (Accuracy + Performance)
|
||||
nightly-8-gpu-mi35x-qwen3-235b-mxfp4:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-qwen3-235b-mxfp4')
|
||||
runs-on: linux-mi35x-gpu-8
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.ref || github.ref }}
|
||||
|
||||
- name: Setup docker
|
||||
run: |
|
||||
touch github_summary.md
|
||||
bash scripts/ci/amd/amd_ci_start_container.sh
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
# Install tabulate for run_suite.py (missing in MI35x container)
|
||||
bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
|
||||
|
||||
- name: Accuracy Test + Performance Test MI35x (8-GPU Qwen3-235B-MXFP4)
|
||||
timeout-minutes: 120
|
||||
run: |
|
||||
> github_summary.md # Clear summary file
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-8-gpu-mi35x-qwen3-235b-mxfp4 --nightly --timeout-per-file 3600 || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
nightly-8-gpu-mi35x-glm5:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-8-gpu-mi35x-glm5')
|
||||
runs-on: linux-mi35x-gpu-8
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.ref || github.ref }}
|
||||
|
||||
- name: Setup docker
|
||||
run: |
|
||||
touch github_summary.md
|
||||
bash scripts/ci/amd/amd_ci_start_container.sh
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
# Install tabulate for run_suite.py (missing in MI35x container)
|
||||
bash scripts/ci/amd/amd_ci_exec.sh pip install tabulate
|
||||
# GLM-5 requires latest transformers for glm_moe_dsa architecture
|
||||
bash scripts/ci/amd/amd_ci_exec.sh pip install git+https://github.com/huggingface/transformers.git
|
||||
|
||||
- name: Accuracy Test MI35x (8-GPU GLM-5 NSA)
|
||||
timeout-minutes: 180
|
||||
run: |
|
||||
> github_summary.md # Clear summary file
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w /sglang-checkout/test \
|
||||
-e GITHUB_STEP_SUMMARY="/sglang-checkout/github_summary.md" \
|
||||
python3 run_suite.py --hw amd --suite nightly-amd-8-gpu-mi35x-glm5 --nightly --timeout-per-file 7200 || TEST_EXIT_CODE=$?
|
||||
echo "$(<github_summary.md )" >> $GITHUB_STEP_SUMMARY || true
|
||||
exit ${TEST_EXIT_CODE:-0}
|
||||
|
||||
# MI35x 8-GPU DeepSeek-V3.2 Performance Test (MTP)
|
||||
nightly-perf-8-gpu-mi35x-deepseek-v32-mtp:
|
||||
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-perf-8-gpu-mi35x-deepseek-v32-mtp')
|
||||
@@ -909,6 +1011,8 @@ jobs:
|
||||
- nightly-8-gpu-deepseek-v32
|
||||
- nightly-8-gpu-deepseek-v32-mtp
|
||||
- nightly-8-gpu-kimi-k25
|
||||
- nightly-8-gpu-qwen3-235b
|
||||
- nightly-8-gpu-glm5
|
||||
# MI35x jobs
|
||||
- nightly-test-1-gpu-mi35x
|
||||
- nightly-accuracy-8-gpu-mi35x
|
||||
@@ -918,6 +1022,8 @@ jobs:
|
||||
- nightly-accuracy-8-gpu-mi35x-deepseek-v32
|
||||
- nightly-accuracy-8-gpu-mi35x-deepseek-v32-mtp
|
||||
- nightly-8-gpu-mi35x-kimi-k25
|
||||
- nightly-8-gpu-mi35x-qwen3-235b-mxfp4
|
||||
- nightly-8-gpu-mi35x-glm5
|
||||
# MI35x perf jobs excluded from check - perf failures don't block CI
|
||||
# - nightly-perf-8-gpu-mi35x-deepseek-v32-basic
|
||||
# - nightly-perf-8-gpu-mi35x-deepseek-v32-mtp
|
||||
|
||||
133
.github/workflows/nightly-test-nvidia.yml
vendored
@@ -44,6 +44,7 @@ concurrency:
|
||||
|
||||
env:
|
||||
SGLANG_IS_IN_CI: true
|
||||
SGLANG_CUDA_COREDUMP: "1"
|
||||
HF_HUB_DOWNLOAD_TIMEOUT: 300
|
||||
HF_HUB_ETAG_TIMEOUT: 300
|
||||
|
||||
@@ -68,6 +69,9 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-1-gpu --nightly --continue-on-error
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# General tests - 4 GPU H100
|
||||
nightly-test-general-4-gpu-h100:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-4-gpu-h100')
|
||||
@@ -88,6 +92,9 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-4-gpu --nightly --continue-on-error
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# General tests - 8 GPU H200
|
||||
nightly-test-general-8-gpu-h200:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h200')
|
||||
@@ -120,6 +127,25 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=18000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
|
||||
|
||||
- name: Publish traces to storage repo
|
||||
if: always()
|
||||
continue-on-error: true
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
|
||||
GITHUB_RUN_ID: ${{ github.run_id }}
|
||||
GITHUB_RUN_NUMBER: ${{ github.run_number }}
|
||||
run: |
|
||||
TRACE_ARGS=""
|
||||
for dir in test/performance_profiles_*/; do
|
||||
[ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
|
||||
done
|
||||
if [ -n "$TRACE_ARGS" ]; then
|
||||
python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
|
||||
find test/performance_profiles_*/ -name '*.json.gz' -delete
|
||||
else
|
||||
echo "No trace directories found, skipping publish"
|
||||
fi
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 30
|
||||
env:
|
||||
@@ -148,6 +174,11 @@ jobs:
|
||||
retention-days: 5
|
||||
if-no-files-found: ignore
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.partition }}
|
||||
|
||||
# General tests - 8 GPU H20
|
||||
nightly-test-general-8-gpu-h20:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-h20')
|
||||
@@ -172,6 +203,9 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-8-gpu-h20 --nightly --continue-on-error
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# General tests - 8 GPU B200
|
||||
nightly-test-general-8-gpu-b200:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-general-8-gpu-b200')
|
||||
@@ -201,6 +235,25 @@ jobs:
|
||||
cd test
|
||||
IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite nightly-8-gpu-common --nightly --timeout-per-file=12000 --continue-on-error --auto-partition-id=${{ matrix.partition }} --auto-partition-size=4
|
||||
|
||||
- name: Publish traces to storage repo
|
||||
if: always()
|
||||
continue-on-error: true
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI_DATA }}
|
||||
GITHUB_RUN_ID: ${{ github.run_id }}
|
||||
GITHUB_RUN_NUMBER: ${{ github.run_number }}
|
||||
run: |
|
||||
TRACE_ARGS=""
|
||||
for dir in test/performance_profiles_*/; do
|
||||
[ -d "$dir" ] && TRACE_ARGS="$TRACE_ARGS --traces-dir $dir"
|
||||
done
|
||||
if [ -n "$TRACE_ARGS" ]; then
|
||||
python3 scripts/ci/utils/publish_traces.py $TRACE_ARGS
|
||||
find test/performance_profiles_*/ -name '*.json.gz' -delete
|
||||
else
|
||||
echo "No trace directories found, skipping publish"
|
||||
fi
|
||||
|
||||
- name: Collect performance metrics
|
||||
if: always()
|
||||
run: |
|
||||
@@ -221,6 +274,11 @@ jobs:
|
||||
retention-days: 5
|
||||
if-no-files-found: ignore
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.partition }}
|
||||
|
||||
# Text model accuracy tests
|
||||
nightly-test-text-accuracy-2-gpu-runner:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-accuracy-2-gpu-runner')
|
||||
@@ -241,6 +299,9 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-eval-text-2-gpu --nightly --continue-on-error --timeout-per-file 4500
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# Text model performance tests
|
||||
nightly-test-text-perf-2-gpu-runner:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-text-perf-2-gpu-runner')
|
||||
@@ -264,7 +325,7 @@ jobs:
|
||||
run: |
|
||||
cd test
|
||||
rm -rf performance_profiles_text_models/
|
||||
-python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error
+python3 run_suite.py --hw cuda --suite nightly-perf-text-2-gpu --nightly --continue-on-error --timeout-per-file 3600
|
||||
|
||||
- name: Publish traces to storage repo
|
||||
env:
|
||||
@@ -274,6 +335,9 @@ jobs:
|
||||
run: |
|
||||
python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_text_models
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# VLM accuracy tests
|
||||
nightly-test-vlm-accuracy-2-gpu-runner:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-accuracy-2-gpu-runner')
|
||||
@@ -294,6 +358,9 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-eval-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 9000
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# VLM performance tests
|
||||
nightly-test-vlm-perf-2-gpu-runner:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-vlm-perf-2-gpu-runner')
|
||||
@@ -317,7 +384,7 @@ jobs:
|
||||
run: |
|
||||
cd test
|
||||
rm -rf performance_profiles_vlms/
|
||||
-python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error
+python3 run_suite.py --hw cuda --suite nightly-perf-vlm-2-gpu --nightly --continue-on-error --timeout-per-file 3600
|
||||
|
||||
- name: Publish traces to storage repo
|
||||
env:
|
||||
@@ -327,6 +394,9 @@ jobs:
|
||||
run: |
|
||||
python3 scripts/ci/utils/publish_traces.py --traces-dir test/performance_profiles_vlms
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# diffusion performance tests
|
||||
nightly-test-multimodal-server-1-gpu:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-1-gpu')
|
||||
@@ -351,6 +421,7 @@ jobs:
|
||||
env:
|
||||
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
|
||||
GITHUB_RUN_ID: ${{ github.run_id }}
|
||||
GPU_CONFIG: "1-gpu-runner"
|
||||
|
||||
timeout-minutes: 60
|
||||
run: |
|
||||
@@ -360,6 +431,28 @@ jobs:
|
||||
--partition-id ${{ matrix.part }} \
|
||||
--total-partitions 2
|
||||
|
||||
- name: Collect diffusion performance metrics
|
||||
if: always()
|
||||
run: |
|
||||
python3 scripts/ci/save_diffusion_metrics.py \
|
||||
--gpu-config 1-gpu-runner \
|
||||
--run-id ${{ github.run_id }} \
|
||||
--output python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json \
|
||||
--results-json python/diffusion-results.json
|
||||
|
||||
- name: Upload diffusion metrics
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: diffusion-metrics-1gpu-partition-${{ matrix.part }}
|
||||
path: python/diffusion-metrics-1gpu-partition-${{ matrix.part }}.json
|
||||
retention-days: 90
|
||||
if-no-files-found: ignore
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
nightly-test-multimodal-server-2-gpu:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-multimodal-server-2-gpu')
|
||||
@@ -384,6 +477,7 @@ jobs:
|
||||
env:
|
||||
SGLANG_DIFFUSION_SLACK_TOKEN: ${{ secrets.SGLANG_DIFFUSION_SLACK_TOKEN }}
|
||||
GITHUB_RUN_ID: ${{ github.run_id }}
|
||||
GPU_CONFIG: "2-gpu-runner"
|
||||
|
||||
timeout-minutes: 60
|
||||
run: |
|
||||
@@ -393,6 +487,29 @@ jobs:
|
||||
--partition-id ${{ matrix.part }} \
|
||||
--total-partitions 2
|
||||
|
||||
- name: Collect diffusion performance metrics
|
||||
if: always()
|
||||
run: |
|
||||
python3 scripts/ci/save_diffusion_metrics.py \
|
||||
--gpu-config 2-gpu-runner \
|
||||
--run-id ${{ github.run_id }} \
|
||||
--output python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json \
|
||||
--results-json python/diffusion-results.json
|
||||
|
||||
- name: Upload diffusion metrics
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: diffusion-metrics-2gpu-partition-${{ matrix.part }}
|
||||
path: python/diffusion-metrics-2gpu-partition-${{ matrix.part }}.json
|
||||
retention-days: 90
|
||||
if-no-files-found: ignore
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
# B200 Performance tests - 4 GPU
|
||||
nightly-test-perf-4-gpu-b200:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-4-gpu-b200')
|
||||
@@ -413,6 +530,9 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-4-gpu-b200 --nightly --continue-on-error --timeout-per-file 12000
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
# Specialized B200 tests - 8 GPU, for specific backends and configs
|
||||
nightly-test-specialized-8-gpu-b200:
|
||||
if: github.repository == 'sgl-project/sglang' && (inputs.job_filter == '' || inputs.job_filter == 'all' || inputs.job_filter == 'nightly-test-perf-8-gpu-b200')
|
||||
@@ -437,12 +557,17 @@ jobs:
|
||||
cd test
|
||||
python3 run_suite.py --hw cuda --suite nightly-8-gpu-b200 --nightly --continue-on-error --timeout-per-file 2400
|
||||
|
||||
-# Consolidate performance metrics from all 8-GPU jobs
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
+# Consolidate performance metrics from all jobs
|
||||
consolidate-metrics:
|
||||
if: github.repository == 'sgl-project/sglang' && always()
|
||||
needs:
|
||||
- nightly-test-general-8-gpu-h200
|
||||
- nightly-test-general-8-gpu-b200
|
||||
- nightly-test-multimodal-server-1-gpu
|
||||
- nightly-test-multimodal-server-2-gpu
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Checkout code
|
||||
@@ -453,7 +578,7 @@ jobs:
|
||||
- name: Download all partition metrics
|
||||
uses: actions/download-artifact@v4
|
||||
with:
|
||||
-pattern: metrics-*
+pattern: "*metrics-*"
|
||||
path: metrics/
|
||||
merge-multiple: true
|
||||
|
||||
|
||||
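The "Publish traces to storage repo" steps added above build their `--traces-dir` arguments from whichever `performance_profiles_*` directories exist. The following is a minimal sketch of that loop in isolation, not the workflow's actual code: `collect_trace_args` and the temporary demo directories are hypothetical names introduced here for illustration, while the glob and the `--traces-dir` flag are taken from the diff.

```shell
#!/usr/bin/env bash
# Sketch of the argument-building loop from the "Publish traces" step.
# collect_trace_args is a hypothetical helper; the workflow inlines this logic.
collect_trace_args() {
  local root="$1" args="" dir
  for dir in "$root"/performance_profiles_*/; do
    # When nothing matches, the glob stays literal, so the -d test filters it out.
    [ -d "$dir" ] && args="$args --traces-dir $dir"
  done
  printf '%s' "$args"
}

# Demo layout (assumed directory names, created only for this example).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/performance_profiles_text_models" \
         "$demo_root/performance_profiles_vlms"

trace_args="$(collect_trace_args "$demo_root")"
empty_args="$(collect_trace_args "$(mktemp -d)")"

if [ -n "$trace_args" ]; then
  echo "would run: publish_traces.py$trace_args"
else
  echo "No trace directories found, skipping publish"
fi
```

Because the workflow expands `$TRACE_ARGS` unquoted so that it splits into separate arguments, this approach presumably relies on the profile directory names containing no whitespace.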
115
.github/workflows/patch-docker-dev.yml
vendored
Normal file
@@ -0,0 +1,115 @@
|
||||
name: Patch Docker Image
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
pr_numbers:
|
||||
description: "Comma-separated PR numbers to apply (e.g. 18962,19010)"
|
||||
required: false
|
||||
default: ""
|
||||
image_tag:
|
||||
description: "Base image tag to patch (e.g. dev-x86, dev-x86-cu13)"
|
||||
required: true
|
||||
|
||||
concurrency:
|
||||
group: patch-docker-${{ inputs.image_tag }}
|
||||
cancel-in-progress: true
|
||||
|
||||
jobs:
|
||||
patch:
|
||||
if: github.repository == 'sgl-project/sglang'
|
||||
runs-on: x64-docker-build-node
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
fetch-depth: 0
|
||||
|
||||
- name: Login to Docker Hub
|
||||
uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
|
||||
- name: Pull base image and extract commit
|
||||
run: |
|
||||
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
|
||||
docker pull "${IMAGE}"
|
||||
if BASE_SHA=$(docker run --rm "${IMAGE}" git -C /sgl-workspace/sglang rev-parse HEAD 2>/dev/null); then
|
||||
echo "Image built from commit: ${BASE_SHA}"
|
||||
else
|
||||
BASE_SHA=""
|
||||
echo "::warning::Image has no .git directory — cannot extract base commit"
|
||||
fi
|
||||
echo "BASE_SHA=${BASE_SHA}" >> "$GITHUB_ENV"
|
||||
|
||||
- name: Generate patches
|
||||
run: |
|
||||
git config --global --add safe.directory "$GITHUB_WORKSPACE"
|
||||
git fetch origin main
|
||||
mkdir -p /tmp/patch-ctx
|
||||
|
||||
if [ -n "${{ inputs.pr_numbers }}" ]; then
|
||||
IFS=',' read -ra PRS <<< "${{ inputs.pr_numbers }}"
|
||||
for pr in "${PRS[@]}"; do
|
||||
pr=$(echo "${pr}" | xargs)
|
||||
echo "Fetching PR #${pr}"
|
||||
git fetch origin "pull/${pr}/head:pr-${pr}"
|
||||
MERGE_BASE=$(git merge-base origin/main "pr-${pr}")
|
||||
echo " PR #${pr}: merge-base=${MERGE_BASE}"
|
||||
git diff "${MERGE_BASE}..pr-${pr}" > "/tmp/patch-ctx/${pr}.patch"
|
||||
echo " PR #${pr}: $(wc -l < /tmp/patch-ctx/${pr}.patch) lines"
|
||||
done
|
||||
elif [ -n "${BASE_SHA}" ]; then
|
||||
echo "Generating diff: image ${BASE_SHA} → latest main"
|
||||
git fetch origin "${BASE_SHA}"
|
||||
git diff "${BASE_SHA}..origin/main" > /tmp/patch-ctx/main.patch
|
||||
echo " main: $(wc -l < /tmp/patch-ctx/main.patch) lines"
|
||||
else
|
||||
echo "::error::No PR numbers specified and image has no .git — cannot generate diff against main"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
TOTAL=$(cat /tmp/patch-ctx/*.patch | wc -l)
|
||||
if [ "${TOTAL}" -eq 0 ]; then
|
||||
echo "::warning::All patches are empty — image is already up to date"
|
||||
echo "SKIP_BUILD=true" >> "$GITHUB_ENV"
|
||||
fi
|
||||
|
||||
- name: Build patched image
|
||||
if: env.SKIP_BUILD != 'true'
|
||||
run: |
|
||||
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
|
||||
|
||||
cat <<'DOCKERFILE' > /tmp/patch-ctx/Dockerfile
|
||||
ARG BASE_IMAGE
|
||||
FROM ${BASE_IMAGE}
|
||||
COPY *.patch /tmp/patches/
|
||||
RUN cd /sgl-workspace/sglang \
|
||||
&& for p in /tmp/patches/*.patch; do \
|
||||
if [ ! -s "${p}" ]; then \
|
||||
echo "Skipping ${p} (empty)"; \
|
||||
else \
|
||||
echo "Applying ${p}..." \
|
||||
&& patch -p1 --fuzz=2 --no-backup-if-mismatch -f < "${p}" \
|
||||
|| { echo "ERROR: Failed to apply ${p}"; exit 1; }; \
|
||||
fi; \
|
||||
done \
|
||||
&& rm -rf /tmp/patches
|
||||
DOCKERFILE
|
||||
|
||||
docker build \
|
||||
--no-cache \
|
||||
--build-arg BASE_IMAGE="${IMAGE}" \
|
||||
-t "${IMAGE}" \
|
||||
/tmp/patch-ctx/
|
||||
|
||||
- name: Push patched image
|
||||
if: env.SKIP_BUILD != 'true'
|
||||
run: |
|
||||
IMAGE="lmsysorg/sglang:${{ inputs.image_tag }}"
|
||||
docker push "${IMAGE}"
|
||||
|
||||
echo "### Patched \`${IMAGE}\`" >> "$GITHUB_STEP_SUMMARY"
|
||||
echo "- **Base commit:** \`${BASE_SHA:-unknown (no .git)}\`" >> "$GITHUB_STEP_SUMMARY"
|
||||
echo "- **Source:** ${{ inputs.pr_numbers && format('PRs: {0}', inputs.pr_numbers) || 'latest main' }}" >> "$GITHUB_STEP_SUMMARY"
|
||||
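The "Generate patches" step above splits the comma-separated `pr_numbers` input and trims each entry before fetching `pull/<n>/head`. A minimal sketch of just that parsing, under stated assumptions: `parse_pr_numbers` is a hypothetical helper name, and skipping empty entries is an addition made here that the workflow itself does not perform.

```shell
#!/usr/bin/env bash
# Sketch of the PR-list parsing used before `git fetch origin pull/<n>/head:pr-<n>`.
# parse_pr_numbers is a hypothetical helper; the workflow inlines this loop.
parse_pr_numbers() {
  local raw="$1" pr
  local -a prs
  # Split on commas into an array, as the workflow does with IFS=',' read -ra.
  IFS=',' read -ra prs <<< "$raw"
  for pr in "${prs[@]}"; do
    # xargs with no command echoes its input, which trims surrounding whitespace.
    pr="$(echo "$pr" | xargs)"
    # Assumption for this sketch only: silently skip empty entries.
    [ -n "$pr" ] && echo "pull/${pr}/head:pr-${pr}"
  done
  return 0
}

refspecs="$(parse_pr_numbers "18962, 19010")"
echo "$refspecs"
```

Each printed line is a refspec of the form the step passes to `git fetch origin`.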
837
.github/workflows/pr-test-amd-rocm720.yml
vendored
Normal file
@@ -0,0 +1,837 @@
|
||||
name: PR Test ROCm 7.2 (AMD)
|
||||
# Dynamic run-name for /rerun-stage commands to enable URL lookup
|
||||
# Format: "[stage-name] sha" for fork PRs, "[stage-name]" for non-fork, default for normal runs
|
||||
run-name: ${{ inputs.target_stage && (inputs.pr_head_sha && format('[{0}] {1}', inputs.target_stage, inputs.pr_head_sha) || format('[{0}]', inputs.target_stage)) || '' }}
|
||||
|
||||
on:
|
||||
# run rocm 720 pr tests once a day at 2am UTC to avoid overwhelming the CI system
|
||||
schedule:
|
||||
- cron: '0 2 * * *'
|
||||
# push:
|
||||
# branches: [ main ]
|
||||
# paths:
|
||||
# - "python/**"
|
||||
# - "scripts/ci/**"
|
||||
# - "test/**"
|
||||
# - "sgl-kernel/**"
|
||||
# - ".github/workflows/pr-test-amd-rocm720.yml"
|
||||
# - "docker/rocm720.Dockerfile"
|
||||
# pull_request:
|
||||
# branches: [ main ]
|
||||
# paths:
|
||||
# - "python/**"
|
||||
# - "scripts/ci/**"
|
||||
# - "test/**"
|
||||
# - "sgl-kernel/**"
|
||||
# - ".github/workflows/pr-test-amd-rocm720.yml"
|
||||
# - "docker/rocm720.Dockerfile"
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
target_stage:
|
||||
description: "Specific stage to run (optional, for quick testing)"
|
||||
required: false
|
||||
type: string
|
||||
default: ""
|
||||
pr_head_sha:
|
||||
description: "PR head SHA to checkout (for /rerun-stage on fork PRs)"
|
||||
required: false
|
||||
type: string
|
||||
default: ""
|
||||
workflow_call:
|
||||
inputs:
|
||||
ref:
|
||||
description: 'Git ref (branch, tag, or SHA) to test. If not provided, uses the default branch.'
|
||||
required: false
|
||||
type: string
|
||||
default: ''
|
||||
run_all_tests:
|
||||
description: "Run all tests (for releasing or testing purpose)"
|
||||
required: false
|
||||
type: boolean
|
||||
default: false
|
||||
|
||||
concurrency:
|
||||
# Include pr_head_sha in group for /rerun-stage dispatches to avoid collisions with main branch runs
|
||||
group: pr-test-amd-rocm720-${{ inputs.pr_head_sha || inputs.ref || github.ref }}
|
||||
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
|
||||
|
||||
jobs:
|
||||
call-gate:
|
||||
uses: ./.github/workflows/pr-gate.yml
|
||||
secrets: inherit
|
||||
check-changes:
|
||||
needs: [call-gate]
|
||||
runs-on: ubuntu-latest
|
||||
outputs:
|
||||
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
|
||||
sgl_kernel: ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }}
|
||||
jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }}
|
||||
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Determine run mode
|
||||
id: run-mode
|
||||
run: |
|
||||
# Run all tests when the run_all_tests input is true (set by workflow_call callers)
# Note: github.event_name is inherited from the caller, so it cannot be used to detect workflow_call
|
||||
if [[ "${{ inputs.run_all_tests }}" == "true" ]]; then
|
||||
echo "run_all_tests=true" >> $GITHUB_OUTPUT
|
||||
echo "Run mode: ALL TESTS (run_all_tests=${{ inputs.run_all_tests }})"
|
||||
else
|
||||
echo "run_all_tests=false" >> $GITHUB_OUTPUT
|
||||
echo "Run mode: FILTERED (triggered by ${{ github.event_name }})"
|
||||
fi
|
||||
|
||||
- name: Detect file changes
|
||||
id: filter
|
||||
uses: dorny/paths-filter@v3
|
||||
if: steps.run-mode.outputs.run_all_tests != 'true'
|
||||
with:
|
||||
filters: |
|
||||
main_package:
|
||||
- "python/sglang/!(multimodal_gen)/**"
|
||||
- "python/pyproject_rocm.toml"
|
||||
- "python/pyproject_other.toml"
|
||||
- "scripts/ci/amd/*"
|
||||
- "scripts/ci/utils/*"
|
||||
- "test/**"
|
||||
- ".github/workflows/pr-test-amd-rocm720.yml"
|
||||
sgl_kernel:
|
||||
- "sgl-kernel/**"
|
||||
- ".github/workflows/pr-test-amd-rocm720.yml"
|
||||
jit_kernel:
|
||||
- "python/sglang/jit_kernel/**"
|
||||
- ".github/workflows/pr-test-amd-rocm720.yml"
|
||||
multimodal_gen:
|
||||
- "python/sglang/multimodal_gen/**"
|
||||
- "python/sglang/cli/**"
|
||||
- "python/pyproject_rocm.toml"
|
||||
- "python/pyproject_other.toml"
|
||||
|
||||
# =============================================== sgl-kernel ====================================================
|
||||
sgl-kernel-unit-test-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'sgl-kernel-unit-test-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
needs.check-changes.outputs.sgl_kernel == 'true'
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 14
|
||||
run: |
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_align.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_softmax.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests/speculative ci_sglang python3 -m pytest test_eagle_utils.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_apply_token_bitmask_inplace.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_activation.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_topk.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_kvcacheio.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_moe_topk_sigmoid.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_torch_defaults_reset.py
|
||||
|
||||
sgl-kernel-unit-test-2-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'sgl-kernel-unit-test-2-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
needs.check-changes.outputs.sgl_kernel == 'true'
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-2]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 20
|
||||
run: |
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_deterministic_custom_allreduce.py
|
||||
docker exec -w /sglang-checkout/sgl-kernel/tests ci_sglang python3 -m pytest test_amd_nccl_allreduce_determinism.py
|
||||
|
||||
# =============================================== primary ====================================================
|
||||
|
||||
stage-a-test-1-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-a-test-1-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 10
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-amd --continue-on-error
|
||||
|
||||
jit-kernel-unit-test-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'jit-kernel-unit-test-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
needs.check-changes.outputs.jit_kernel == 'true'
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run JIT kernel unit tests
|
||||
timeout-minutes: 10
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout" python3 -m pytest -q python/sglang/jit_kernel/tests/test_store_cache.py
|
||||
|
||||
stage-b-test-small-1-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-b-test-small-1-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
part: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 30
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 13 --timeout-per-file 1800 --continue-on-error
|
||||
|
||||
stage-b-test-small-1-gpu-amd-mi35x:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-b-test-small-1-gpu-amd-mi35x') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi35x-gpu-1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 30
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-small-1-gpu-amd-mi35x --continue-on-error
|
||||
|
||||
stage-b-test-large-1-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-b-test-large-1-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
part: [0, 1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 30
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-1-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1800 --continue-on-error
|
||||
|
||||
stage-b-test-large-2-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-b-test-large-2-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-2]
|
||||
part: [0, 1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 30
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-b-test-large-2-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 --timeout-per-file 1800 --continue-on-error
|
||||
|
||||
multimodal-gen-test-1-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'multimodal-gen-test-1-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
part: [0, 1] # 2 partitions: 11 tests ÷ 2 = ~5-6 tests each
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Download artifacts
|
||||
if: needs.check-changes.outputs.sgl_kernel == 'true'
|
||||
uses: actions/download-artifact@v4
|
||||
with:
|
||||
path: sgl-kernel/dist/
|
||||
merge-multiple: true
|
||||
pattern: wheel-python3.10-cuda12.9
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build diffusion
|
||||
docker exec ci_sglang pip install amdsmi
|
||||
|
||||
- name: Setup kernel caches
|
||||
run: |
|
||||
# Use the persistent /sgl-data directory (mounted from /home/runner/sgl-data)
|
||||
# This directory persists across container restarts on the self-hosted runner
|
||||
docker exec ci_sglang mkdir -p /sgl-data/aiter-kernels /sgl-data/miopen-cache /sgl-data/hf-cache/hub
|
||||
|
||||
# Clear pre-built AITER kernels from Docker image to avoid segfaults
|
||||
# The image may have stale/incompatible kernels at /sgl-workspace/aiter/aiter/jit/
|
||||
echo "Clearing pre-built AITER kernels from Docker image..."
|
||||
docker exec ci_sglang rm -rf /sgl-workspace/aiter/aiter/jit/*.so 2>/dev/null || true
|
||||
docker exec ci_sglang rm -rf /sgl-data/aiter-kernels/*.so 2>/dev/null || true
|
||||
echo "AITER kernels cleared - will be rebuilt on first use"
|
||||
|
||||
# Create persistent cache marker if /sgl-data is a real mount (not ephemeral)
|
||||
# This tells the test cleanup code to NOT delete downloaded models
|
||||
if docker exec ci_sglang test -d /sgl-data && docker exec ci_sglang mountpoint -q /sgl-data 2>/dev/null; then
|
||||
docker exec ci_sglang touch /sgl-data/hf-cache/.persistent_cache
|
||||
echo "Created .persistent_cache marker - HF cache will persist"
|
||||
else
|
||||
echo "WARNING: /sgl-data is not a mount point - models will be cleaned up after each test"
|
||||
fi
|
||||
|
||||
# Check MIOpen cache (VAE convolution kernels)
|
||||
miopen_files=$(docker exec ci_sglang find /sgl-data/miopen-cache -name "*.udb" 2>/dev/null | wc -l || echo "0")
|
||||
echo "Found ${miopen_files} MIOpen cache files"
|
||||
|
||||
- name: Diagnose HF cache and system resources
|
||||
run: |
|
||||
echo "=== System Memory Status ==="
|
||||
free -h
|
||||
echo ""
|
||||
echo "=== Disk Space ==="
|
||||
df -h /home/runner/sgl-data 2>/dev/null || df -h
|
||||
echo ""
|
||||
echo "=== HF Cache Directory Structure ==="
|
||||
docker exec ci_sglang ls -la /sgl-data/hf-cache/ 2>/dev/null || echo "HF cache dir not found"
|
||||
docker exec ci_sglang ls -la /sgl-data/hf-cache/hub/ 2>/dev/null || echo "HF hub cache not found"
|
||||
echo ""
|
||||
echo "=== Checking for cached diffusion models (1-GPU tests) ==="
|
||||
# Models used in 1-GPU tests: Wan2.1-T2V-1.3B, HunyuanVideo, Qwen-Image, FLUX.1, FLUX.2
|
||||
for model in "Wan-AI--Wan2.1-T2V-1.3B-Diffusers" "tencent--HunyuanVideo" "Qwen--Qwen-Image" "black-forest-labs--FLUX.1-dev" "black-forest-labs--FLUX.2-dev"; do
|
||||
cache_path="/sgl-data/hf-cache/hub/models--${model}"
|
||||
if docker exec ci_sglang test -d "$cache_path"; then
|
||||
size=$(docker exec ci_sglang du -sh "$cache_path" 2>/dev/null | cut -f1)
|
||||
echo "✓ CACHED: $model ($size)"
|
||||
else
|
||||
echo "✗ NOT CACHED: $model"
|
||||
fi
|
||||
done
|
||||
echo ""
|
||||
echo "=== GPU Memory Status ==="
|
||||
docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available"
|
||||
|
||||
- name: Run diffusion server tests (1-GPU)
|
||||
timeout-minutes: 60
|
||||
run: |
|
||||
# AMD CI: All 1-GPU tests except FLUX.2 (FLUX.1 covers same code path)
|
||||
# Tests: T2V, T2I, I2V, LoRA
|
||||
#
|
||||
# HF download env vars:
|
||||
# - HF_HUB_ENABLE_HF_TRANSFER=1: Use faster hf_transfer for downloads (if available)
|
||||
# - HF_HUB_DISABLE_SYMLINKS_WARNING=1: Suppress symlink warnings
|
||||
docker exec \
|
||||
-e SGLANG_E2E_TOLERANCE=0.3 \
|
||||
-e SGLANG_STAGE_TIME_TOLERANCE=0.2 \
|
||||
-e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \
|
||||
-e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \
|
||||
-e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \
|
||||
-e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \
|
||||
-e AITER_JIT_DIR=/sgl-data/aiter-kernels \
|
||||
-e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \
|
||||
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
|
||||
-e HF_HUB_DISABLE_SYMLINKS_WARNING=1 \
|
||||
-w /sglang-checkout/python \
|
||||
ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \
|
||||
--suite 1-gpu \
|
||||
--partition-id ${{ matrix.part }} \
|
||||
--total-partitions 2 \
|
||||
-k "not flux_2"
|
||||
|
||||
# Post-test diagnostics
|
||||
echo "=== Post-test System Memory Status ==="
|
||||
free -h
|
||||
|
||||
multimodal-gen-test-2-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'multimodal-gen-test-2-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
max-parallel: 1 # Run one at a time to avoid eviction from resource exhaustion during AITER kernel JIT
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-2]
|
||||
part: [0, 1] # 2 partitions: 9 tests ÷ 2 = ~4-5 tests each
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Download artifacts
|
||||
if: needs.check-changes.outputs.sgl_kernel == 'true'
|
||||
uses: actions/download-artifact@v4
|
||||
with:
|
||||
path: sgl-kernel/dist/
|
||||
merge-multiple: true
|
||||
pattern: wheel-python3.10-cuda12.9
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build diffusion
|
||||
docker exec ci_sglang pip install amdsmi
|
||||
|
||||
- name: Setup kernel caches
|
||||
run: |
|
||||
# Use the persistent /sgl-data directory (mounted from /home/runner/sgl-data)
|
||||
docker exec ci_sglang mkdir -p /sgl-data/aiter-kernels /sgl-data/miopen-cache /sgl-data/hf-cache/hub
|
||||
|
||||
# Clear pre-built AITER kernels from Docker image to avoid segfaults
|
||||
# The image may have stale/incompatible kernels at /sgl-workspace/aiter/aiter/jit/
|
||||
echo "Clearing pre-built AITER kernels from Docker image..."
|
||||
docker exec ci_sglang rm -rf /sgl-workspace/aiter/aiter/jit/*.so 2>/dev/null || true
|
||||
docker exec ci_sglang rm -rf /sgl-data/aiter-kernels/*.so 2>/dev/null || true
|
||||
echo "AITER kernels cleared - will be rebuilt on first use"
|
||||
|
||||
# Create persistent cache marker if /sgl-data is a real mount (not ephemeral)
|
||||
# This tells the test cleanup code to NOT delete downloaded models
|
||||
if docker exec ci_sglang test -d /sgl-data && docker exec ci_sglang mountpoint -q /sgl-data 2>/dev/null; then
|
||||
docker exec ci_sglang touch /sgl-data/hf-cache/.persistent_cache
|
||||
echo "Created .persistent_cache marker - HF cache will persist"
|
||||
else
|
||||
echo "WARNING: /sgl-data is not a mount point - models will be cleaned up after each test"
|
||||
fi
|
||||
|
||||
# Check MIOpen cache (VAE convolution kernels)
|
||||
miopen_files=$(docker exec ci_sglang find /sgl-data/miopen-cache -name "*.udb" 2>/dev/null | wc -l || echo "0")
|
||||
echo "Found ${miopen_files} MIOpen cache files"
|
||||
|
||||
- name: Diagnose HF cache and system resources
|
||||
run: |
|
||||
echo "=== System Memory Status ==="
|
||||
free -h
|
||||
echo ""
|
||||
echo "=== Disk Space ==="
|
||||
df -h /home/runner/sgl-data 2>/dev/null || df -h
|
||||
echo ""
|
||||
echo "=== HF Cache Directory Structure ==="
|
||||
docker exec ci_sglang ls -la /sgl-data/hf-cache/ 2>/dev/null || echo "HF cache dir not found"
|
||||
docker exec ci_sglang ls -la /sgl-data/hf-cache/hub/ 2>/dev/null || echo "HF hub cache not found"
|
||||
echo ""
|
||||
echo "=== Checking for cached diffusion models (2-GPU tests) ==="
|
||||
# Models used in 2-GPU tests: Wan2.2-T2V-A14B, Wan2.1-T2V-14B, Qwen-Image, FLUX.1
|
||||
for model in "Wan-AI--Wan2.2-T2V-A14B-Diffusers" "Wan-AI--Wan2.1-T2V-14B-Diffusers" "Qwen--Qwen-Image" "black-forest-labs--FLUX.1-dev"; do
|
||||
cache_path="/sgl-data/hf-cache/hub/models--${model}"
|
||||
if docker exec ci_sglang test -d "$cache_path"; then
|
||||
size=$(docker exec ci_sglang du -sh "$cache_path" 2>/dev/null | cut -f1)
|
||||
echo "✓ CACHED: $model ($size)"
|
||||
else
|
||||
echo "✗ NOT CACHED: $model"
|
||||
fi
|
||||
done
|
||||
echo ""
|
||||
echo "=== GPU Memory Status ==="
|
||||
docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available"
|
||||
|
||||
- name: Run diffusion server tests (2-GPU)
|
||||
timeout-minutes: 80
|
||||
run: |
|
||||
# AMD CI: All 2-GPU tests including LoRA
|
||||
# Tests: T2V, T2I, I2V, LoRA
|
||||
#
|
||||
# HF download env vars:
|
||||
# - HF_HUB_ENABLE_HF_TRANSFER=1: Use faster hf_transfer for downloads (if available)
|
||||
# - HF_HUB_DISABLE_SYMLINKS_WARNING=1: Suppress symlink warnings
|
||||
docker exec \
|
||||
-e SGLANG_E2E_TOLERANCE=0.3 \
|
||||
-e SGLANG_STAGE_TIME_TOLERANCE=0.2 \
|
||||
-e SGLANG_NON_DENOISE_STAGE_TIME_TOLERANCE=0.6 \
|
||||
-e SGLANG_DENOISE_STEP_TOLERANCE=0.6 \
|
||||
-e SGLANG_DENOISE_AGG_TOLERANCE=0.3 \
|
||||
-e SGLANG_TEST_NUM_INFERENCE_STEPS=5 \
|
||||
-e AITER_JIT_DIR=/sgl-data/aiter-kernels \
|
||||
-e MIOPEN_USER_DB_PATH=/sgl-data/miopen-cache \
|
||||
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
|
||||
-e HF_HUB_DISABLE_SYMLINKS_WARNING=1 \
|
||||
-w /sglang-checkout/python \
|
||||
ci_sglang python3 sglang/multimodal_gen/test/run_suite.py \
|
||||
--suite 2-gpu \
|
||||
--partition-id ${{ matrix.part }} \
|
||||
--total-partitions 2
|
||||
|
||||
# Post-test diagnostics
|
||||
echo "=== Post-test System Memory Status ==="
|
||||
free -h
|
||||
|
||||
|
||||
stage-c-test-large-8-gpu-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-c-test-large-8-gpu-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
env:
|
||||
RUNNER_LABELS: linux-mi325-gpu-8
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-8]
|
||||
part: [0, 1, 2]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Test RCCL multi-GPU communication
|
||||
timeout-minutes: 5
|
||||
run: |
|
||||
echo "Testing RCCL multi-GPU communication with debug info..."
|
||||
docker exec ci_sglang bash -c "cd /sglang-checkout && NCCL_DEBUG=INFO RCCL_DEBUG=INFO torchrun --nproc_per_node=8 scripts/ci/amd/test_rccl_multi_gpu.py"
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 60
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 3600 --continue-on-error
|
||||
|
||||
stage-c-test-large-8-gpu-amd-mi35x:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-c-test-large-8-gpu-amd-mi35x') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
(!failure() && !cancelled()) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi35x-gpu-8]
|
||||
part: [0]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh --rocm-version rocm720
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: bash scripts/ci/amd/amd_ci_install_dependency.sh --skip-aiter-build
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 60
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-c-test-large-8-gpu-amd-mi35x --auto-partition-id ${{ matrix.part }} --auto-partition-size 1 --timeout-per-file 3600 --continue-on-error
|
||||
|
||||
pr-test-amd-finish:
|
||||
needs:
|
||||
[
|
||||
call-gate,
|
||||
check-changes,
|
||||
|
||||
sgl-kernel-unit-test-amd,
|
||||
sgl-kernel-unit-test-2-gpu-amd,
|
||||
multimodal-gen-test-1-gpu-amd,
|
||||
multimodal-gen-test-2-gpu-amd,
|
||||
|
||||
stage-a-test-1-amd,
|
||||
jit-kernel-unit-test-amd,
|
||||
stage-b-test-small-1-gpu-amd,
|
||||
stage-b-test-small-1-gpu-amd-mi35x,
|
||||
stage-b-test-large-1-gpu-amd,
|
||||
stage-b-test-large-2-gpu-amd,
|
||||
stage-c-test-large-8-gpu-amd,
|
||||
stage-c-test-large-8-gpu-amd-mi35x,
|
||||
]
|
||||
if: always()
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Check all dependent job statuses
|
||||
run: |
|
||||
# Convert the 'needs' context to a JSON string
|
||||
json_needs='${{ toJson(needs) }}'
|
||||
|
||||
# Get a list of all job names from the JSON keys
|
||||
job_names=$(echo "$json_needs" | jq -r 'keys_unsorted[]')
|
||||
|
||||
for job in $job_names; do
|
||||
# For each job, extract its result
|
||||
result=$(echo "$json_needs" | jq -r --arg j "$job" '.[$j].result')
|
||||
|
||||
# Print the job name and its result
|
||||
echo "$job: $result"
|
||||
|
||||
# Check for failure or cancellation and exit if found
|
||||
if [[ "$result" == "failure" || "$result" == "cancelled" ]]; then
|
||||
echo "The above jobs failed."
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
|
||||
# If the loop completes, all jobs were successful
|
||||
echo "All jobs completed successfully"
|
||||
exit 0
|
||||
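The `Check all dependent job statuses` step above boils down to a small piece of logic: walk the `needs` results in order and stop at the first `failure` or `cancelled`. The following is a minimal Python sketch of that logic, not part of the workflow itself. The function name `first_failed_job` and the sample payload are hypothetical and used only for illustration.

```python
import json

# Results that make the finish job exit non-zero in the shell step.
FAILING_RESULTS = {"failure", "cancelled"}


def first_failed_job(needs_json: str):
    """Return the first job whose result is failure/cancelled, else None."""
    needs = json.loads(needs_json)
    # json.loads keeps key order, mirroring jq's keys_unsorted.
    for job, info in needs.items():
        if info.get("result") in FAILING_RESULTS:
            return job
    return None


# Hypothetical payload shaped like `toJson(needs)`.
example = json.dumps(
    {
        "check-changes": {"result": "success"},
        "stage-a-test-1-amd": {"result": "skipped"},
        "jit-kernel-unit-test-amd": {"result": "failure"},
    }
)
```

Note that a `skipped` result does not fail the gate under this logic, which matches the shell condition that only tests for `failure` and `cancelled`.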
46
.github/workflows/pr-test-amd.yml
vendored
@@ -62,6 +62,7 @@ jobs:
|
||||
outputs:
|
||||
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
|
||||
sgl_kernel: ${{ steps.filter.outputs.sgl_kernel || steps.run-mode.outputs.run_all_tests }}
|
||||
jit_kernel: ${{ steps.filter.outputs.jit_kernel || steps.run-mode.outputs.run_all_tests }}
|
||||
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
@@ -99,6 +100,9 @@ jobs:
|
||||
sgl_kernel:
|
||||
- "sgl-kernel/**"
|
||||
- ".github/workflows/pr-test-amd.yml"
|
||||
jit_kernel:
|
||||
- "python/sglang/jit_kernel/**"
|
||||
- ".github/workflows/pr-test-amd.yml"
|
||||
multimodal_gen:
|
||||
- "python/sglang/multimodal_gen/**"
|
||||
- "python/sglang/cli/**"
|
||||
@@ -235,6 +239,45 @@ jobs:
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout/test" python3 run_suite.py --hw amd --suite stage-a-test-1-amd
|
||||
|
||||
jit-kernel-unit-test-amd:
|
||||
needs: [check-changes]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'jit-kernel-unit-test-amd') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
needs.check-changes.outputs.jit_kernel == 'true'
|
||||
)
|
||||
)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
runner: [linux-mi325-gpu-1]
|
||||
runs-on: ${{matrix.runner}}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
|
||||
- name: Ensure VRAM is clear
|
||||
run: bash scripts/ensure_vram_clear.sh rocm
|
||||
|
||||
- name: Start CI container
|
||||
run: bash scripts/ci/amd/amd_ci_start_container.sh
|
||||
env:
|
||||
GITHUB_WORKSPACE: ${{ github.workspace }}
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_install_dependency.sh
|
||||
|
||||
- name: Run JIT kernel unit tests
|
||||
timeout-minutes: 10
|
||||
run: |
|
||||
bash scripts/ci/amd/amd_ci_exec.sh -w "/sglang-checkout" python3 -m pytest -q python/sglang/jit_kernel/tests/test_store_cache.py
|
||||
|
||||
stage-b-test-small-1-gpu-amd:
|
||||
needs: [check-changes, stage-a-test-1-amd]
|
||||
if: |
|
||||
@@ -484,7 +527,7 @@ jobs:
|
||||
docker exec ci_sglang rocm-smi --showmeminfo vram 2>/dev/null || echo "rocm-smi not available"
|
||||
|
||||
- name: Run diffusion server tests (1-GPU)
|
||||
timeout-minutes: 60
|
||||
timeout-minutes: 70
|
||||
run: |
|
||||
# AMD CI: All 1-GPU tests except FLUX.2 (FLUX.1 covers same code path)
|
||||
# Tests: T2V, T2I, I2V, LoRA
|
||||
@@ -845,6 +888,7 @@ jobs:
|
||||
multimodal-gen-test-2-gpu-amd,
|
||||
|
||||
stage-a-test-1-amd,
|
||||
jit-kernel-unit-test-amd,
|
||||
stage-b-test-small-1-gpu-amd,
|
||||
stage-b-test-small-1-gpu-amd-mi35x,
|
||||
stage-b-test-large-1-gpu-amd,
|
||||
|
||||
6
.github/workflows/pr-test-npu.yml
vendored
@@ -28,9 +28,9 @@ jobs:
|
||||
check-changes:
|
||||
runs-on: ubuntu-latest
|
||||
outputs:
|
||||
changes_exist: ${{ steps.filter.outputs.main_package || steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests}}
|
||||
main_package: ${{ steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests }}
|
||||
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen || steps.run-mode.outputs.run_all_tests }}
|
||||
changes_exist: ${{ steps.filter.outputs.main_package == 'true' || steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true'}}
|
||||
main_package: ${{ steps.filter.outputs.main_package == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
|
||||
multimodal_gen: ${{ steps.filter.outputs.multimodal_gen == 'true' || steps.run-mode.outputs.run_all_tests == 'true' }}
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
|
||||
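A likely motivation for the `== 'true'` comparisons in the hunk above: in GitHub Actions expressions, `||` returns its first truthy operand, and a non-empty string such as `'false'` is truthy. The sketch below illustrates this in Python under two assumptions: that the first three `outputs` lines are the removed version and the last three the added one (the diff markers are not visible here), and that the filter step emits the strings `'true'` / `'false'`. The helper `gha_or` is a hypothetical stand-in for the expression operator, not a real API.

```python
def gha_or(*operands):
    """Hypothetical model of the Actions `||` operator: first truthy operand wins."""
    for value in operands:
        # Falsy values in Actions expressions: false, 0, "", null.
        if value not in (None, "", 0, False):
            return value
    return operands[-1]


def old_main_package(filter_output, run_all_tests):
    # Mirrors: steps.filter.outputs.main_package || steps.run-mode.outputs.run_all_tests
    return gha_or(filter_output, run_all_tests)


def new_main_package(filter_output, run_all_tests):
    # Mirrors: ... == 'true' || ... == 'true'
    return filter_output == "true" or run_all_tests == "true"
```

With a filter output of `'false'` and `run_all_tests` of `'true'`, the older form yields the string `'false'` because it is non-empty, while the explicit comparison yields a true value.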
218
.github/workflows/pr-test.yml
vendored
@@ -52,20 +52,24 @@ on:
|
||||
default: false
|
||||
|
||||
concurrency:
|
||||
# Concurrency group structure: pr-test-{branch}-{pr_sha}-{stage}
|
||||
# Concurrency group structure: pr-test-{event}-{branch}-{pr_sha}-{stage}
|
||||
# - event_name prevents scheduled runs from colliding with fork PRs whose branch is named 'main'
|
||||
# (without it, both resolve the branch segment to 'main' and block each other)
|
||||
# - github.head_ref (pull_request) or github.ref_name (workflow_dispatch) normalizes to branch name
|
||||
# - pr_head_sha isolates /rerun-stage from main branch runs
|
||||
# - target_stage allows parallel stage dispatches to run independently
|
||||
# This ensures pull_request and workflow_dispatch on same branch cancel each other
|
||||
group: pr-test-${{ github.head_ref || github.ref_name || 'default' }}-${{ inputs.pr_head_sha || 'current' }}-${{ inputs.target_stage || inputs.ref || 'all' }}
|
||||
group: pr-test-${{ github.event_name }}-${{ github.head_ref || github.ref_name || 'default' }}-${{ inputs.pr_head_sha || 'current' }}-${{ inputs.target_stage || inputs.ref || 'all' }}
|
||||
cancel-in-progress: ${{ github.event_name != 'workflow_call' }}
|
||||
|
||||
env:
|
||||
SGLANG_IS_IN_CI: true
|
||||
SGLANG_CUDA_COREDUMP: "1"
|
||||
SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
|
||||
|
||||
permissions:
|
||||
actions: write
|
||||
contents: read
|
||||
pull-requests: read
|
||||
|
||||
jobs:
|
||||
# =============================================== check changes ====================================================
|
||||
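To see why adding `github.event_name` to the concurrency group matters, the group string can be rebuilt by hand. This is a rough Python sketch of the expression above, assuming empty context values behave like empty strings. The function `concurrency_group` and its `include_event` switch are hypothetical names used only to compare the two variants.

```python
def concurrency_group(event_name, head_ref="", ref_name="", pr_head_sha="",
                      target_stage="", ref="", include_event=True):
    """Rebuild the group key from the same fallbacks as the workflow expression."""
    branch = head_ref or ref_name or "default"
    sha = pr_head_sha or "current"
    stage = target_stage or ref or "all"
    parts = ["pr-test"]
    if include_event:
        parts.append(event_name)
    parts += [branch, sha, stage]
    return "-".join(parts)


# A scheduled run on main versus a fork PR whose head branch is also named main.
scheduled_old = concurrency_group("schedule", ref_name="main", include_event=False)
fork_pr_old = concurrency_group("pull_request", head_ref="main", include_event=False)
scheduled_new = concurrency_group("schedule", ref_name="main")
fork_pr_new = concurrency_group("pull_request", head_ref="main")
```

Without the event segment both runs resolve to the same key and can block each other; with it the keys differ.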
@@ -128,6 +132,7 @@ jobs:
|
||||
- ".github/workflows/pr-test.yml"
|
||||
multimodal_gen:
|
||||
- "python/sglang/multimodal_gen/**"
|
||||
- "python/sglang/jit_kernel/**"
|
||||
- "python/sglang/cli/**"
|
||||
- "python/pyproject.toml"
|
||||
- ".github/workflows/pr-test.yml"
|
||||
@@ -198,6 +203,8 @@ jobs:
|
||||
|
||||
- name: Set max-parallel based on run type
|
||||
id: set-parallel
|
||||
env:
|
||||
GH_TOKEN: ${{ github.token }}
|
||||
run: |
|
||||
# Scheduled runs and high-priority PRs get full parallelism
|
||||
if [[ "${{ github.event_name }}" == "schedule" ]]; then
|
||||
@@ -206,6 +213,27 @@ jobs:
|
||||
elif [[ "${{ github.event_name }}" == "pull_request" && "${{ contains(github.event.pull_request.labels.*.name, 'high priority') }}" == "true" ]]; then
|
||||
echo "max_parallel=14" >> $GITHUB_OUTPUT
|
||||
echo "High priority PR detected, setting max_parallel to 14"
|
||||
elif [[ -n "${{ inputs.target_stage }}" ]]; then
|
||||
# /rerun-stage (workflow_dispatch): query PR labels via GitHub API
|
||||
# Try SHA lookup first (fork PRs), fallback to branch name (non-fork PRs)
|
||||
LABELS=""
|
||||
PR_HEAD_SHA="${{ inputs.pr_head_sha }}"
|
||||
if [[ -n "$PR_HEAD_SHA" ]]; then
|
||||
LABELS=$(gh api "repos/${{ github.repository }}/commits/${PR_HEAD_SHA}/pulls" \
|
||||
--jq '.[0].labels[].name' 2>/dev/null || true)
|
||||
fi
|
||||
if [[ -z "$LABELS" ]]; then
|
||||
LABELS=$(gh pr list --head "${{ github.ref_name }}" --repo "${{ github.repository }}" \
|
||||
--json labels --jq '.[0].labels[].name' 2>/dev/null || true)
|
||||
fi
|
||||
echo "PR labels: ${LABELS:-"(none)"}"
|
||||
if echo "$LABELS" | grep -Fxq "high priority"; then
|
||||
echo "max_parallel=14" >> $GITHUB_OUTPUT
|
||||
echo "High priority PR detected via API (/rerun-stage), setting max_parallel to 14"
|
||||
else
|
||||
echo "max_parallel=3" >> $GITHUB_OUTPUT
|
||||
echo "Using default max_parallel of 3 (/rerun-stage, no high priority label)"
|
||||
fi
|
||||
else
|
||||
echo "max_parallel=3" >> $GITHUB_OUTPUT
|
||||
echo "Using default max_parallel of 3"
|
||||
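The `/rerun-stage` branch above decides parallelism from the PR labels using `grep -Fxq "high priority"`, which matches a whole line exactly. A minimal Python sketch of only that branch follows. The function name `max_parallel_for_rerun` is hypothetical, and the values 14 and 3 are taken from the `echo` lines in the hunk.

```python
def max_parallel_for_rerun(label_lines: str) -> int:
    """Return 14 when one line is exactly 'high priority', otherwise 3."""
    # grep -F -x: fixed string, whole-line match, case-sensitive.
    labels = label_lines.splitlines()
    return 14 if "high priority" in labels else 3
```

Because the match is on whole lines, a label that merely contains the phrase does not raise the limit, and an empty label list falls back to the default.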
@@ -848,6 +876,9 @@ jobs:
|
||||
# temporarily put backend-independent cpu tests here
|
||||
python3 run_suite.py --hw cpu --suite default $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
stage-a-cpu-only:
|
||||
needs: [check-changes, call-gate]
|
||||
if: |
|
||||
@@ -950,6 +981,11 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-b-test-small-1-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 8 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.partition }}
|
||||
|
||||
# Runs on H100 (80GB, SM90) - tests that don't pass on 5090 (FA3, FP8, high VRAM, etc.)
|
||||
stage-b-test-large-1-gpu:
|
||||
needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels]
|
||||
@@ -1001,6 +1037,11 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-b-test-large-1-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 14 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.partition }}
|
||||
|
||||
stage-b-test-large-2-gpu:
|
||||
needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels]
|
||||
if: |
|
||||
@@ -1053,6 +1094,11 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-b-test-large-2-gpu --auto-partition-id ${{ matrix.partition }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.partition }}
|
||||
|
||||
stage-b-test-4-gpu-b200:
|
||||
needs: [check-changes, call-gate, wait-for-stage-a, sgl-kernel-build-wheels]
|
||||
if: |
|
||||
@@ -1106,6 +1152,9 @@ jobs:
|
||||
run: |
|
||||
IS_BLACKWELL=1 python3 -m pytest -q python/sglang/jit_kernel/tests/test_flash_attention_4.py
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
multimodal-gen-test-1-gpu:
|
||||
needs: [check-changes, call-gate, sgl-kernel-build-wheels]
|
||||
if: |
|
||||
@@ -1158,6 +1207,10 @@ jobs:
|
||||
--total-partitions 2 \
|
||||
$CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
multimodal-gen-test-2-gpu:
|
||||
needs: [check-changes, call-gate, sgl-kernel-build-wheels]
|
||||
@@ -1212,6 +1265,11 @@ jobs:
|
||||
--total-partitions 2 \
|
||||
$CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
stage-c-test-4-gpu-h100:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b]
|
||||
if: |
|
||||
@@ -1261,6 +1319,11 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-h100 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
stage-c-test-8-gpu-h200:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b]
|
||||
if: |
|
||||
@@ -1300,14 +1363,22 @@ jobs:
|
||||
run: |
|
||||
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_dependency.sh
|
||||
|
||||
# - name: Warmup Weights and JIT Compilation
|
||||
# timeout-minutes: 20
|
||||
# run: |
|
||||
# # An example command for testing the warmup. TODO: make this more general and move them to python scripts.
|
||||
# python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --trust-remote-code
|
||||
- name: Warmup DeepGEMM JIT Compilation
|
||||
timeout-minutes: 25
|
||||
run: |
|
||||
python3 scripts/ci/cuda/warmup_deep_gemm.py \
|
||||
deepseek-ai/DeepSeek-V3-0324:8 \
|
||||
deepseek-ai/DeepSeek-V3.2-Exp:8
|
||||
|
||||
- name: Warmup Server CUDA Graphs
|
||||
timeout-minutes: 25
|
||||
run: |
|
||||
python3 scripts/ci/cuda/warmup_server.py \
|
||||
deepseek-ai/DeepSeek-V3-0324:8 \
|
||||
inclusionAI/Ring-2.5-1T:8
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 20
|
||||
timeout-minutes: 30
|
||||
run: |
|
||||
cd test
|
||||
CONTINUE_ON_ERROR_FLAG=""
|
||||
@@ -1316,6 +1387,11 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-c-test-8-gpu-h200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 4 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
stage-c-test-8-gpu-h20:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b]
|
||||
if: |
|
||||
@@ -1366,6 +1442,11 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-c-test-8-gpu-h20 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
stage-c-test-deepep-4-gpu:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b]
|
||||
if: |
|
||||
@@ -1411,6 +1492,9 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-c-test-deepep-4-gpu $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
stage-c-test-deepep-8-gpu-h200:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b]
|
||||
if: |
|
||||
@@ -1446,6 +1530,19 @@ jobs:
|
||||
run: |
|
||||
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} bash scripts/ci/cuda/ci_install_deepep.sh
|
||||
|
||||
- name: Warmup DeepGEMM JIT Compilation
|
||||
timeout-minutes: 25
|
||||
run: |
|
||||
python3 scripts/ci/cuda/warmup_deep_gemm.py \
|
||||
deepseek-ai/DeepSeek-V3-0324:8 \
|
||||
deepseek-ai/DeepSeek-V3.2-Exp:8
|
||||
|
||||
- name: Warmup Server CUDA Graphs
|
||||
timeout-minutes: 25
|
||||
run: |
|
||||
python3 scripts/ci/cuda/warmup_server.py \
|
||||
deepseek-ai/DeepSeek-V3-0324:8
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 45
|
||||
run: |
|
||||
@@ -1456,6 +1553,9 @@ jobs:
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-c-test-deepep-8-gpu-h200 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
|
||||
stage-c-test-4-gpu-b200:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b]
|
||||
if: |
|
||||
@@ -1506,52 +1606,62 @@ jobs:
|
||||
fi
|
||||
IS_BLACKWELL=1 python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-b200 --auto-partition-id ${{ matrix.part }} --auto-partition-size 3 --timeout-per-file 1800 $CONTINUE_ON_ERROR_FLAG
|
||||
|
||||
stage-c-test-4-gpu-gb200:
|
||||
needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels-arm]
|
||||
if: |
|
||||
always() &&
|
||||
(
|
||||
(inputs.target_stage == 'stage-c-test-4-gpu-gb200') ||
|
||||
(
|
||||
!inputs.target_stage &&
|
||||
((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) &&
|
||||
((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
)
|
||||
)
|
||||
runs-on: 4-gpu-gb200
|
||||
timeout-minutes: 240
|
||||
env:
|
||||
RUNNER_LABELS: 4-gpu-gb200
|
||||
strategy:
|
||||
fail-fast: false
|
||||
steps:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
- uses: ./.github/actions/upload-cuda-coredumps
|
||||
if: always()
|
||||
with:
|
||||
ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
artifact-suffix: ${{ matrix.part }}
|
||||
|
||||
- name: Download artifacts
|
||||
if: needs.check-changes.outputs.sgl_kernel == 'true'
|
||||
uses: actions/download-artifact@v4
|
||||
with:
|
||||
path: sgl-kernel/dist/
|
||||
merge-multiple: true
|
||||
pattern: wheel-python3.10-cuda12.9-aarch64
|
||||
|
||||
- name: Install dependencies
|
||||
timeout-minutes: 20
|
||||
run: |
|
||||
CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 GRACE_BLACKWELL=1 bash scripts/ci/cuda/ci_install_deepep.sh
|
||||
|
||||
- name: Run test
|
||||
timeout-minutes: 45
|
||||
run: |
|
||||
cd test
|
||||
CONTINUE_ON_ERROR_FLAG=""
|
||||
if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then
|
||||
CONTINUE_ON_ERROR_FLAG="--continue-on-error"
|
||||
fi
|
||||
python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-gb200 --timeout-per-file 3600 $CONTINUE_ON_ERROR_FLAG
|
||||
# NOTE: GB200 stage temporarily disabled — no company-owned GB200 runner available yet.
|
||||
# Re-enable when a 4-gpu-gb200 runner is provisioned.
|
||||
# stage-c-test-4-gpu-gb200:
|
||||
# needs: [check-changes, call-gate, wait-for-stage-b, sgl-kernel-build-wheels-arm]
|
||||
# if: |
|
||||
# always() &&
|
||||
# (
|
||||
# (inputs.target_stage == 'stage-c-test-4-gpu-gb200') ||
|
||||
# (
|
||||
# !inputs.target_stage &&
|
||||
# ((github.event_name == 'schedule' || inputs.test_parallel_dispatch == true) || (!failure() && !cancelled())) &&
|
||||
# ((needs.check-changes.outputs.main_package == 'true') || (needs.check-changes.outputs.sgl_kernel == 'true'))
|
||||
# )
|
||||
# )
|
||||
# runs-on: 4-gpu-gb200
|
||||
# timeout-minutes: 240
|
||||
# env:
|
||||
# RUNNER_LABELS: 4-gpu-gb200
|
||||
# strategy:
|
||||
# fail-fast: false
|
||||
# steps:
|
||||
# - name: Checkout code
|
||||
# uses: actions/checkout@v4
|
||||
# with:
|
||||
# ref: ${{ inputs.pr_head_sha || inputs.ref || github.sha }}
|
||||
#
|
||||
# - name: Download artifacts
|
||||
# if: needs.check-changes.outputs.sgl_kernel == 'true'
|
||||
# uses: actions/download-artifact@v4
|
||||
# with:
|
||||
# path: sgl-kernel/dist/
|
||||
# merge-multiple: true
|
||||
# pattern: wheel-python3.10-cuda12.9-aarch64
|
||||
#
|
||||
# - name: Install dependencies
|
||||
# timeout-minutes: 20
|
||||
# run: |
|
||||
# CUSTOM_BUILD_SGL_KERNEL=${{needs.check-changes.outputs.sgl_kernel}} IS_BLACKWELL=1 GRACE_BLACKWELL=1 bash scripts/ci/cuda/ci_install_deepep.sh
|
||||
#
|
||||
# - name: Run test
|
||||
# timeout-minutes: 45
|
||||
# run: |
|
||||
# cd test
|
||||
# CONTINUE_ON_ERROR_FLAG=""
|
||||
# if [[ "${{ needs.check-changes.outputs.continue_on_error }}" == "true" ]]; then
|
||||
# CONTINUE_ON_ERROR_FLAG="--continue-on-error"
|
||||
# fi
|
||||
# python3 run_suite.py --hw cuda --suite stage-c-test-4-gpu-gb200 --timeout-per-file 3600 $CONTINUE_ON_ERROR_FLAG
|
||||
#
|
||||
# - uses: ./.github/actions/upload-cuda-coredumps
|
||||
# if: always()
|
||||
|
||||
pr-test-finish:
|
||||
needs:
|
||||
@@ -1585,7 +1695,7 @@ jobs:
|
||||
stage-c-test-deepep-4-gpu,
|
||||
stage-c-test-deepep-8-gpu-h200,
|
||||
stage-c-test-4-gpu-b200,
|
||||
stage-c-test-4-gpu-gb200,
|
||||
# stage-c-test-4-gpu-gb200, # Temporarily disabled — no GB200 runner
|
||||
]
|
||||
if: always()
|
||||
runs-on: ubuntu-latest
|
||||
|
||||
1
.github/workflows/release-branch-cut.yml
vendored
@@ -16,6 +16,7 @@ on:
|
||||
permissions:
|
||||
actions: write
|
||||
contents: write
|
||||
pull-requests: read
|
||||
|
||||
jobs:
|
||||
cut-release-branch:
|
||||
|
||||
@@ -2,7 +2,7 @@ name: Release Docker Images Nightly (AMD)
|
||||
on:
|
||||
workflow_dispatch:
|
||||
schedule:
|
||||
- cron: '0 13 * * *'
|
||||
- cron: '0 12 * * *'
|
||||
|
||||
concurrency:
|
||||
# A PR number if a pull request and otherwise the commit hash. This cancels
|
||||
@@ -78,7 +78,7 @@ jobs:
|
||||
|
||||
tag=v${version}-${rocm_tag}
|
||||
|
||||
docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
|
||||
docker build . -f docker/rocm.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
|
||||
docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
|
||||
|
||||
# Temporarily disable docker cache seeding until performant storage is in place
|
||||
|
||||
82
.github/workflows/release-docker-amd-rocm720-nightly.yml
vendored
Normal file
@@ -0,0 +1,82 @@
|
||||
name: Release Docker Images ROCm 7.2.0 Nightly Preview (AMD)
|
||||
on:
|
||||
workflow_dispatch:
|
||||
schedule:
|
||||
- cron: '0 12 * * *'
|
||||
|
||||
concurrency:
|
||||
# A PR number if a pull request and otherwise the commit hash. This cancels
|
||||
# queued and in-progress runs for the same PR (presubmit) or commit
|
||||
# (postsubmit). The workflow name is prepended to avoid conflicts between
|
||||
# different workflows.
|
||||
group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
|
||||
cancel-in-progress: True
|
||||
|
||||
jobs:
|
||||
publish:
|
||||
if: github.repository == 'sgl-project/sglang'
|
||||
runs-on: amd-docker-scale
|
||||
environment: 'prod'
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
gpu_arch: ['gfx942-rocm720', 'gfx950-rocm720']
|
||||
build_type: ['all']
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
fetch-depth: 0 # Required for git describe to find tags
|
||||
|
||||
- name: "Set Date"
|
||||
run: |
|
||||
echo "DATE=$(date +%Y%m%d)" >> $GITHUB_ENV
|
||||
|
||||
- name: Get version from latest tag
|
||||
id: version
|
||||
run: |
|
||||
# Get the latest version tag sorted by version number (e.g., v0.5.7 -> 0.5.7)
|
||||
VERSION=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1 | sed 's/^v//')
|
||||
|
||||
if [ -z "$VERSION" ]; then
|
||||
echo "::error::Could not determine version from git tags"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Get short commit hash of current HEAD
|
||||
COMMIT_HASH=$(git rev-parse --short HEAD)
|
||||
|
||||
# Compose pretend version for setuptools_scm: e.g., 0.5.8.post1.dev20260211+g1a2b3c4
|
||||
PRETEND_VERSION="${VERSION}.dev${{ env.DATE }}+g${COMMIT_HASH}"
|
||||
|
||||
echo "version=${VERSION}" >> $GITHUB_OUTPUT
|
||||
echo "pretend_version=${PRETEND_VERSION}" >> $GITHUB_OUTPUT
|
||||
echo "Detected version: ${VERSION}"
|
||||
echo "Pretend version for pip: ${PRETEND_VERSION}"
|
||||
|
||||
- name: Login to Docker Hub
|
||||
uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_AMD_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_AMD_TOKEN }}
|
||||
|
||||
- name: Build and Push
|
||||
run: |
|
||||
version=${{ steps.version.outputs.version }}
|
||||
pretend_version=${{ steps.version.outputs.pretend_version }}
|
||||
echo "Version: ${version}"
|
||||
echo "Pretend version: ${pretend_version}"
|
||||
|
||||
if [ "${{ matrix.gpu_arch }}" = "gfx942-rocm720" ]; then
|
||||
rocm_tag="rocm720-mi30x"
|
||||
elif [ "${{ matrix.gpu_arch }}" = "gfx950-rocm720" ]; then
|
||||
rocm_tag="rocm720-mi35x"
|
||||
else
|
||||
echo "Unsupported gfx arch"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
tag=v${version}-${rocm_tag}
|
||||
|
||||
docker build . -f docker/rocm720.Dockerfile --build-arg SGL_BRANCH=${{ github.ref_name }} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic --build-arg SETUPTOOLS_SCM_PRETEND_VERSION=${pretend_version} -t rocm/sgl-dev:${tag}-${{ env.DATE }} --no-cache
|
||||
docker push rocm/sgl-dev:${tag}-${{ env.DATE }}
|
||||
32
.github/workflows/release-docker-amd.yml
vendored
@@ -16,6 +16,7 @@ jobs:
|
||||
environment: 'prod'
|
||||
strategy:
|
||||
matrix:
|
||||
rocm_version: ['rocm700', 'rocm720']
|
||||
gpu_arch: ['gfx942', 'gfx950']
|
||||
build_type: ['all']
|
||||
steps:
|
||||
@@ -55,17 +56,36 @@ jobs:
|
||||
version=${{ steps.version.outputs.version }}
|
||||
echo "Version: ${version}"
|
||||
|
||||
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
|
||||
rocm_tag="rocm700-mi30x"
|
||||
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
|
||||
rocm_tag="rocm700-mi35x"
|
||||
dockerfile=""
|
||||
gpu_arch_suffix=""
|
||||
if [ "${{ matrix.rocm_version }}" = "rocm700" ]; then
|
||||
dockerfile="docker/rocm.Dockerfile"
|
||||
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
|
||||
rocm_tag="rocm700-mi30x"
|
||||
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
|
||||
rocm_tag="rocm700-mi35x"
|
||||
else
|
||||
echo "Unsupported gfx arch"
|
||||
exit 1
|
||||
fi
|
||||
elif [ "${{ matrix.rocm_version }}" = "rocm720" ]; then
|
||||
gpu_arch_suffix="-${{ matrix.rocm_version }}"
|
||||
dockerfile="docker/rocm720.Dockerfile"
|
||||
if [ "${{ matrix.gpu_arch }}" = "gfx942" ]; then
|
||||
rocm_tag="rocm720-mi30x"
|
||||
elif [ "${{ matrix.gpu_arch }}" = "gfx950" ]; then
|
||||
rocm_tag="rocm720-mi35x"
|
||||
else
|
||||
echo "Unsupported gfx arch"
|
||||
exit 1
|
||||
fi
|
||||
else
|
||||
echo "Unsupported gfx arch"
|
||||
echo "Unsupported rocm version"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
tag=v${version}-${rocm_tag}
|
||||
|
||||
# rocm.Dockerfile expects SGL_BRANCH with 'v' prefix for git tag checkout
|
||||
docker build . -f docker/rocm.Dockerfile --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t lmsysorg/sglang:${tag} --no-cache
|
||||
docker build . -f ${dockerfile} --build-arg BUILD_TYPE=${{ matrix.build_type }} --build-arg GPU_ARCH=${{ matrix.gpu_arch }}${gpu_arch_suffix} --build-arg SGL_BRANCH=v${version} --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t lmsysorg/sglang:${tag} --no-cache
|
||||
docker push lmsysorg/sglang:${tag}
|
||||
|
||||
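The branch logic in the release-docker-amd.yml hunk above maps a `(rocm_version, gpu_arch)` matrix entry to a Dockerfile, a `GPU_ARCH` build argument, and an image tag suffix. A minimal Python sketch of that mapping follows, purely as an illustration; the function name `resolve_amd_build` and the lookup tables are hypothetical and are not part of the workflow, which does this inline in bash.

```python
# Hypothetical helper mirroring the if/elif chain in the "Build and Push" step.
DOCKERFILES = {
    "rocm700": "docker/rocm.Dockerfile",
    "rocm720": "docker/rocm720.Dockerfile",
}
ARCH_NAMES = {"gfx942": "mi30x", "gfx950": "mi35x"}


def resolve_amd_build(rocm_version: str, gpu_arch: str) -> tuple[str, str, str]:
    """Return (dockerfile, GPU_ARCH build arg, rocm_tag) for one matrix entry."""
    # The shell checks the ROCm version first, then the GPU architecture.
    if rocm_version not in DOCKERFILES:
        raise ValueError("Unsupported rocm version")
    if gpu_arch not in ARCH_NAMES:
        raise ValueError("Unsupported gfx arch")
    # Only the rocm720 branch sets gpu_arch_suffix, e.g. gfx942 -> gfx942-rocm720.
    suffix = f"-{rocm_version}" if rocm_version == "rocm720" else ""
    return (
        DOCKERFILES[rocm_version],
        f"{gpu_arch}{suffix}",
        f"{rocm_version}-{ARCH_NAMES[gpu_arch]}",
    )


if __name__ == "__main__":
    for rocm in ("rocm700", "rocm720"):
        for arch in ("gfx942", "gfx950"):
            print(rocm, arch, resolve_amd_build(rocm, arch))
```

The final image tag in the workflow is then `v${version}-${rocm_tag}`, so a `rocm720` / `gfx950` entry would be pushed as `lmsysorg/sglang:v<version>-rocm720-mi35x`.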
@@ -10,9 +10,9 @@ on:
|
||||
description: "Version to build (without v prefix, e.g., 0.5.8)"
|
||||
required: true
|
||||
flashinfer_version:
|
||||
description: "FlashInfer version (default: 0.6.1)"
|
||||
description: "FlashInfer version (default: 0.6.3)"
|
||||
required: false
|
||||
default: "0.6.1"
|
||||
default: "0.6.3"
|
||||
|
||||
jobs:
|
||||
publish-x86:
|
||||
|
||||
122
.github/workflows/release-docker-cu13.yml
vendored
@@ -1,122 +0,0 @@
|
||||
name: Build and Push CUDA 13 Docker Images
|
||||
|
||||
# release this manually via workflow_dispatch for now
|
||||
on:
|
||||
workflow_dispatch:
|
||||
schedule:
|
||||
- cron: "0 0 * * *"
|
||||
jobs:
|
||||
build-dev:
|
||||
if: ${{ github.repository == 'sgl-project/sglang' }}
|
||||
runs-on: ${{ matrix.runner }}
|
||||
strategy:
|
||||
matrix:
|
||||
include:
|
||||
- runner: x64-docker-build-node
|
||||
platform: linux/amd64
|
||||
build_type: all
|
||||
grace_blackwell: 0
|
||||
tag: dev-x86-cu13
|
||||
version: 13.0.1
|
||||
- runner: arm-docker-build-node
|
||||
platform: linux/arm64
|
||||
build_type: all
|
||||
grace_blackwell: 1
|
||||
tag: dev-arm64-cu13
|
||||
version: 13.0.1
|
||||
steps:
|
||||
- name: Delete huge unnecessary tools folder
|
||||
run: rm -rf /opt/hostedtoolcache
|
||||
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Free disk space
|
||||
uses: jlumbroso/free-disk-space@main
|
||||
with:
|
||||
tool-cache: true
|
||||
docker-images: true
|
||||
android: true
|
||||
dotnet: true
|
||||
haskell: true
|
||||
large-packages: true
|
||||
swap-storage: true
|
||||
|
||||
- name: Set up Docker Buildx
|
||||
uses: docker/setup-buildx-action@v3
|
||||
|
||||
- name: Login to Docker Hub
|
||||
uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
|
||||
- name: Build and Push Dev Image
|
||||
run: |
|
||||
docker buildx build \
|
||||
--platform ${{ matrix.platform }} \
|
||||
--push \
|
||||
--target framework \
|
||||
-f docker/Dockerfile \
|
||||
--build-arg CUDA_VERSION=${{ matrix.version }} \
|
||||
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
|
||||
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
|
||||
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
|
||||
--build-arg USE_LATEST_SGLANG=1 \
|
||||
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
|
||||
-t lmsysorg/sglang:${{ matrix.tag }} \
|
||||
--no-cache \
|
||||
.
|
||||
|
||||
create-manifests:
|
||||
runs-on: ubuntu-22.04
|
||||
needs: [build-dev]
|
||||
if: ${{ github.repository == 'sgl-project/sglang' }}
|
||||
strategy:
|
||||
matrix:
|
||||
variant:
|
||||
- tag: dev-cu13
|
||||
x86_tag: dev-x86-cu13
|
||||
arm64_tag: dev-arm64-cu13
|
||||
steps:
|
||||
- uses: docker/setup-buildx-action@v3
|
||||
|
||||
- uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
- run: |
|
||||
docker buildx imagetools create \
|
||||
-t lmsysorg/sglang:${{ matrix.variant.tag }} \
|
||||
-t lmsysorg/sglang:nightly-${{ matrix.variant.tag }}-$(date +%Y%m%d)-${GITHUB_SHA:0:8} \
|
||||
lmsysorg/sglang:${{ matrix.variant.x86_tag }} \
|
||||
lmsysorg/sglang:${{ matrix.variant.arm64_tag }}
|
||||
|
||||
- name: Cleanup Old Nightly Builds
|
||||
run: |
|
||||
# Get JWT token for Docker Hub API
|
||||
TOKEN=$(curl -s -H "Content-Type: application/json" -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' https://hub.docker.com/v2/users/login/ | jq -r .token)
|
||||
|
||||
# Get all tags for the repository
|
||||
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
|
||||
|
||||
# Extract tags that match our pattern and sort by last_updated timestamp (most recent first)
|
||||
TAGS=$(echo "$TAGS_RESPONSE" | jq -r '.results[] | select(.name | startswith("nightly-${{ matrix.variant.tag }}-")) | "\(.last_updated)|\(.name)"' | sort -r | cut -d'|' -f2)
|
||||
|
||||
# Count total tags and keep only the 14 most recent
|
||||
TAG_COUNT=$(echo "$TAGS" | wc -l)
|
||||
if [ "$TAG_COUNT" -gt 14 ]; then
|
||||
echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent"
|
||||
TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15)
|
||||
echo "Tags to delete: $TAGS_TO_DELETE"
|
||||
|
||||
# Delete old tags
|
||||
for tag in $TAGS_TO_DELETE; do
|
||||
echo "Deleting tag: $tag"
|
||||
curl -X DELETE \
|
||||
-H "Authorization: JWT $TOKEN" \
|
||||
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/"
|
||||
done
|
||||
else
|
||||
echo "Only $TAG_COUNT nightly builds found, no cleanup needed"
|
||||
fi
|
||||
116
.github/workflows/release-docker-dev-pr.yml
vendored
@@ -1,116 +0,0 @@
|
||||
name: Build PR Development Docker Images
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
pr_number:
|
||||
description: 'PR number to build from'
|
||||
required: true
|
||||
type: string
|
||||
pr_branch:
|
||||
description: 'PR branch name to build from (e.g., my-feature-branch or refs/pull/123/head)'
|
||||
required: true
|
||||
type: string
|
||||
|
||||
concurrency:
|
||||
group: release-docker-dev-pr-${{ github.event.inputs.pr_number }}
|
||||
cancel-in-progress: true
|
||||
|
||||
jobs:
|
||||
build-dev:
|
||||
if: ${{ github.repository == 'sgl-project/sglang' }}
|
||||
environment: "prod"
|
||||
runs-on: ${{ matrix.runner }}
|
||||
strategy:
|
||||
matrix:
|
||||
include:
|
||||
- runner: x64-docker-build-node
|
||||
platform: linux/amd64
|
||||
build_type: all
|
||||
grace_blackwell: 0
|
||||
arch_tag: x86
|
||||
version: 12.9.1
|
||||
- runner: arm-docker-build-node
|
||||
platform: linux/arm64
|
||||
build_type: all
|
||||
grace_blackwell: 1
|
||||
arch_tag: arm64
|
||||
version: 12.9.1
|
||||
steps:
|
||||
- name: Delete huge unnecessary tools folder
|
||||
run: rm -rf /opt/hostedtoolcache
|
||||
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_branch }}
|
||||
|
||||
- name: Free disk space
|
||||
uses: jlumbroso/free-disk-space@main
|
||||
with:
|
||||
tool-cache: true
|
||||
docker-images: true
|
||||
android: true
|
||||
dotnet: true
|
||||
haskell: true
|
||||
large-packages: true
|
||||
swap-storage: true
|
||||
|
||||
- name: Set up Docker Buildx
|
||||
uses: docker/setup-buildx-action@v3
|
||||
|
||||
- name: Login to Docker Hub
|
||||
uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
|
||||
- name: Build and Push Dev Image
|
||||
run: |
|
||||
tag=dev-${{ matrix.arch_tag }}-pr-${{ inputs.pr_number }}
|
||||
|
||||
docker buildx build \
|
||||
--platform ${{ matrix.platform }} \
|
||||
--push \
|
||||
-f docker/Dockerfile \
|
||||
--target framework \
|
||||
--build-arg CUDA_VERSION=${{ matrix.version }} \
|
||||
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
|
||||
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
|
||||
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
|
||||
--build-arg BRANCH_TYPE=local \
|
||||
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
|
||||
-t lmsysorg/sglang:${tag} \
|
||||
--no-cache \
|
||||
.
|
||||
|
||||
create-manifests:
|
||||
runs-on: ubuntu-22.04
|
||||
needs: [build-dev]
|
||||
if: ${{ github.repository == 'sgl-project/sglang' }}
|
||||
environment: "prod"
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Docker Buildx
|
||||
uses: docker/setup-buildx-action@v3
|
||||
|
||||
- name: Login to Docker Hub
|
||||
uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
|
||||
- name: Create multi-arch manifest
|
||||
run: |
|
||||
# Create PR dev manifest
|
||||
docker buildx imagetools create \
|
||||
-t lmsysorg/sglang:dev-pr-${{ inputs.pr_number }} \
|
||||
lmsysorg/sglang:dev-x86-pr-${{ inputs.pr_number }} \
|
||||
lmsysorg/sglang:dev-arm64-pr-${{ inputs.pr_number }}
|
||||
|
||||
echo "✓ Built Docker image: lmsysorg/sglang:dev-pr-${{ inputs.pr_number }}"
|
||||
echo ""
|
||||
echo "Usage:"
|
||||
echo " docker pull lmsysorg/sglang:dev-pr-${{ inputs.pr_number }}"
|
||||
126
.github/workflows/release-docker-dev.yml
vendored
@@ -2,9 +2,22 @@ name: Build and Push Development Docker Images
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
pr_number:
|
||||
description: "PR number to build from (leave empty to use current branch)"
|
||||
required: false
|
||||
default: ""
|
||||
tag:
|
||||
description: "Custom tag suffix (overrides pr_number in tag). E.g. 'my-test' → dev-x86-my-test, dev-cu13-my-test, etc."
|
||||
required: false
|
||||
default: ""
|
||||
schedule:
|
||||
- cron: "0 0 * * *"
|
||||
|
||||
concurrency:
|
||||
group: release-docker-dev-${{ inputs.tag || inputs.pr_number || 'nightly' }}
|
||||
cancel-in-progress: true
|
||||
|
||||
jobs:
|
||||
build-dev:
|
||||
if: ${{ github.repository == 'sgl-project/sglang' }}
|
||||
@@ -16,20 +29,34 @@ jobs:
|
||||
platform: linux/amd64
|
||||
build_type: all
|
||||
grace_blackwell: 0
|
||||
tag: dev-x86
|
||||
arch_tag: x86
|
||||
version: 12.9.1
|
||||
- runner: arm-docker-build-node
|
||||
platform: linux/arm64
|
||||
build_type: all
|
||||
grace_blackwell: 1
|
||||
tag: dev-arm64
|
||||
arch_tag: arm64
|
||||
version: 12.9.1
|
||||
- runner: x64-docker-build-node
|
||||
platform: linux/amd64
|
||||
build_type: all
|
||||
grace_blackwell: 0
|
||||
arch_tag: x86-cu13
|
||||
version: 13.0.1
|
||||
- runner: arm-docker-build-node
|
||||
platform: linux/arm64
|
||||
build_type: all
|
||||
grace_blackwell: 1
|
||||
arch_tag: arm64-cu13
|
||||
version: 13.0.1
|
||||
steps:
|
||||
- name: Delete huge unnecessary tools folder
|
||||
run: rm -rf /opt/hostedtoolcache
|
||||
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || github.ref }}
|
||||
|
||||
- name: Free disk space
|
||||
uses: jlumbroso/free-disk-space@main
|
||||
@@ -42,6 +69,12 @@ jobs:
|
||||
large-packages: true
|
||||
swap-storage: true
|
||||
|
||||
- name: Prune Docker to reclaim disk space
|
||||
run: |
|
||||
docker buildx prune --filter "until=72h" -f
|
||||
docker system prune -af --filter "until=72h"
|
||||
docker volume prune -af
|
||||
|
||||
- name: Set up Docker Buildx
|
||||
uses: docker/setup-buildx-action@v3
|
||||
|
||||
@@ -53,18 +86,37 @@ jobs:
|
||||
|
||||
- name: Build and Push Dev Image
|
||||
run: |
|
||||
# Tag suffix: custom tag > pr number > none
|
||||
SUFFIX=""
|
||||
if [ -n "${{ inputs.tag }}" ]; then
|
||||
SUFFIX="-${{ inputs.tag }}"
|
||||
elif [ -n "${{ inputs.pr_number }}" ]; then
|
||||
SUFFIX="-pr-${{ inputs.pr_number }}"
|
||||
fi
|
||||
|
||||
TAG="dev-${{ matrix.arch_tag }}${SUFFIX}"
|
||||
|
||||
# Nightly (schedule) installs latest release; manual dispatch builds from checked-out source
|
||||
if [ "${{ github.event_name }}" = "schedule" ]; then
|
||||
SOURCE_ARG="--build-arg USE_LATEST_SGLANG=1"
|
||||
else
|
||||
SOURCE_ARG="--build-arg BRANCH_TYPE=local"
|
||||
fi
|
||||
|
||||
echo "Building lmsysorg/sglang:${TAG}"
|
||||
|
||||
docker buildx build \
|
||||
--platform ${{ matrix.platform }} \
|
||||
--push \
|
||||
-f docker/Dockerfile \
|
||||
--target framework \
|
||||
-f docker/Dockerfile \
|
||||
--build-arg CUDA_VERSION=${{ matrix.version }} \
|
||||
--build-arg BUILD_TYPE=${{ matrix.build_type }} \
|
||||
--build-arg CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
|
||||
--build-arg GRACE_BLACKWELL=${{ matrix.grace_blackwell }} \
|
||||
--build-arg USE_LATEST_SGLANG=1 \
|
||||
${SOURCE_ARG} \
|
||||
--build-arg INSTALL_FLASHINFER_JIT_CACHE=1 \
|
||||
-t lmsysorg/sglang:${{ matrix.tag }} \
|
||||
-t lmsysorg/sglang:${TAG} \
|
||||
--no-cache \
|
||||
.
|
||||
|
||||
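The tag-suffix precedence introduced in the hunk above (custom tag, then PR number, then no suffix) can be sketched in Python as follows. This is only an illustration of the shell logic; the helper names `tag_suffix` and `dev_image_tag` are hypothetical and do not exist in the repository.

```python
def tag_suffix(tag: str = "", pr_number: str = "") -> str:
    """Suffix precedence used by the workflow: custom tag > PR number > none."""
    if tag:
        return f"-{tag}"
    if pr_number:
        return f"-pr-{pr_number}"
    return ""


def dev_image_tag(arch_tag: str, tag: str = "", pr_number: str = "") -> str:
    """Per-architecture image tag, equivalent to TAG="dev-${arch_tag}${SUFFIX}"."""
    return f"dev-{arch_tag}{tag_suffix(tag, pr_number)}"


if __name__ == "__main__":
    print(dev_image_tag("x86"))                          # scheduled nightly build
    print(dev_image_tag("arm64-cu13", pr_number="123"))  # PR build
    print(dev_image_tag("x86", tag="my-test", pr_number="123"))  # custom tag wins
```

Because the same suffix is recomputed in the manifest job, the per-architecture tags and the multi-arch manifest tag stay consistent as long as both jobs receive the same `inputs.tag` and `inputs.pr_number`.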
@@ -75,9 +127,12 @@ jobs:
|
||||
strategy:
|
||||
matrix:
|
||||
variant:
|
||||
- tag: dev
|
||||
x86_tag: dev-x86
|
||||
arm64_tag: dev-arm64
|
||||
- base: dev
|
||||
x86: x86
|
||||
arm64: arm64
|
||||
- base: dev-cu13
|
||||
x86: x86-cu13
|
||||
arm64: arm64-cu13
|
||||
steps:
|
||||
- uses: docker/setup-buildx-action@v3
|
||||
|
||||
@@ -85,37 +140,56 @@ jobs:
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
- run: |
|
||||
SHORT_SHA="${{ github.sha }}"
|
||||
|
||||
- name: Create multi-arch manifest
|
||||
run: |
|
||||
SUFFIX=""
|
||||
if [ -n "${{ inputs.tag }}" ]; then
|
||||
SUFFIX="-${{ inputs.tag }}"
|
||||
elif [ -n "${{ inputs.pr_number }}" ]; then
|
||||
SUFFIX="-pr-${{ inputs.pr_number }}"
|
||||
fi
|
||||
|
||||
TAG="${{ matrix.variant.base }}${SUFFIX}"
|
||||
X86_TAG="dev-${{ matrix.variant.x86 }}${SUFFIX}"
|
||||
ARM64_TAG="dev-${{ matrix.variant.arm64 }}${SUFFIX}"
|
||||
|
||||
# For nightly (no suffix), also stamp a dated tag
|
||||
EXTRA_TAG=""
|
||||
if [ -z "${SUFFIX}" ]; then
|
||||
SHORT_SHA="${{ github.sha }}"
|
||||
EXTRA_TAG="-t lmsysorg/sglang:nightly-${TAG}-$(date +%Y%m%d)-${SHORT_SHA:0:8}"
|
||||
fi
|
||||
|
||||
docker buildx imagetools create \
|
||||
-t lmsysorg/sglang:${{ matrix.variant.tag }} \
|
||||
-t lmsysorg/sglang:nightly-${{ matrix.variant.tag }}-$(date +%Y%m%d)-${SHORT_SHA:0:8} \
|
||||
lmsysorg/sglang:${{ matrix.variant.x86_tag }} \
|
||||
lmsysorg/sglang:${{ matrix.variant.arm64_tag }}
|
||||
-t lmsysorg/sglang:${TAG} \
|
||||
${EXTRA_TAG} \
|
||||
lmsysorg/sglang:${X86_TAG} \
|
||||
lmsysorg/sglang:${ARM64_TAG}
|
||||
|
||||
echo "✓ Published lmsysorg/sglang:${TAG}"
|
||||
|
||||
- name: Cleanup Old Nightly Builds
|
||||
if: ${{ !inputs.tag && !inputs.pr_number }}
|
||||
run: |
|
||||
# Get JWT token for Docker Hub API
|
||||
TOKEN=$(curl -s -H "Content-Type: application/json" -X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' https://hub.docker.com/v2/users/login/ | jq -r .token)
|
||||
TOKEN=$(curl -s -H "Content-Type: application/json" \
|
||||
-X POST -d '{"username": "${{ secrets.DOCKERHUB_USERNAME }}", "password": "${{ secrets.DOCKERHUB_TOKEN }}"}' \
|
||||
https://hub.docker.com/v2/users/login/ | jq -r .token)
|
||||
|
||||
# Get all tags for the repository
|
||||
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
|
||||
TAGS_RESPONSE=$(curl -s -H "Authorization: JWT $TOKEN" \
|
||||
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100")
|
||||
|
||||
# Extract tags that match our pattern and sort by last_updated timestamp (most recent first)
|
||||
TAGS=$(echo "$TAGS_RESPONSE" | jq -r '.results[] | select(.name | startswith("nightly-${{ matrix.variant.tag }}-")) | "\(.last_updated)|\(.name)"' | sort -r | cut -d'|' -f2)
|
||||
TAGS=$(echo "$TAGS_RESPONSE" | jq -r \
|
||||
'.results[] | select(.name | test("^nightly-${{ matrix.variant.base }}-[0-9]")) | "\(.last_updated)|\(.name)"' \
|
||||
| sort -r | cut -d'|' -f2)
|
||||
|
||||
# Count total tags and keep only the 14 most recent
|
||||
TAG_COUNT=$(echo "$TAGS" | wc -l)
|
||||
if [ "$TAG_COUNT" -gt 14 ]; then
|
||||
echo "Found $TAG_COUNT nightly builds, keeping only the 14 most recent"
|
||||
TAGS_TO_DELETE=$(echo "$TAGS" | tail -n +15)
|
||||
echo "Tags to delete: $TAGS_TO_DELETE"
|
||||
|
||||
# Delete old tags
|
||||
for tag in $TAGS_TO_DELETE; do
|
||||
echo "Deleting tag: $tag"
|
||||
curl -X DELETE \
|
||||
-H "Authorization: JWT $TOKEN" \
|
||||
curl -X DELETE -H "Authorization: JWT $TOKEN" \
|
||||
"https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/$tag/"
|
||||
done
|
||||
else
|
||||
|
||||
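The nightly cleanup step above keeps the 14 most recent `nightly-<base>-<date>-<sha>` tags and deletes the rest. The switch from `startswith` to an anchored regex ending in `[0-9]` appears intended to keep the `dev` variant from also matching `nightly-dev-cu13-...` tags. A minimal Python sketch of the selection, assuming `results` has the shape of the Docker Hub tags response used in the workflow (`name` and `last_updated` fields); the function name is hypothetical.

```python
import re


def nightly_tags_to_delete(results: list[dict], base: str, keep: int = 14) -> list[str]:
    """Return names of nightly tags for `base` beyond the `keep` most recent."""
    # Same pattern as the jq filter: ^nightly-<base>-[0-9]
    pattern = re.compile(rf"^nightly-{re.escape(base)}-[0-9]")
    matching = [entry for entry in results if pattern.search(entry["name"])]
    # ISO-8601 timestamps sort chronologically as strings; newest first.
    matching.sort(key=lambda entry: entry["last_updated"], reverse=True)
    return [entry["name"] for entry in matching[keep:]]


if __name__ == "__main__":
    results = [
        {"name": f"nightly-dev-202601{day:02d}-abcdef12",
         "last_updated": f"2026-01-{day:02d}T00:00:00Z"}
        for day in range(1, 17)
    ]
    # A cu13 tag must not be counted against the plain "dev" variant.
    results.append({"name": "nightly-dev-cu13-20260101-abcdef12",
                    "last_updated": "2026-01-01T00:00:00Z"})
    print(nightly_tags_to_delete(results, "dev"))
```

Note that the workflow requests a single page with `page_size=100`, so this sketch likewise only considers the tags present in one response.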
@@ -1,5 +1,11 @@
|
||||
name: Release Docker Images Nightly (NPU)
|
||||
on:
|
||||
pull_request:
|
||||
branches:
|
||||
- 'main'
|
||||
paths:
|
||||
- '.github/workflows/release-docker-npu-nightly.yml'
|
||||
- 'docker/npu.Dockerfile'
|
||||
workflow_dispatch:
|
||||
schedule:
|
||||
- cron: "0 0 * * *"
|
||||
@@ -74,6 +80,6 @@ jobs:
|
||||
push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
|
||||
provenance: false
|
||||
build-args: |
|
||||
SGLANG_KERNEL_NPU_TAG=2026.01.28
|
||||
SGLANG_KERNEL_NPU_TAG=2026.02.01.post2
|
||||
CANN_VERSION=${{ matrix.cann_version }}
|
||||
DEVICE_TYPE=${{ matrix.device_type }}
|
||||
|
||||
2
.github/workflows/release-docker-npu.yml
vendored
@@ -87,7 +87,7 @@ jobs:
|
||||
push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
|
||||
provenance: false
|
||||
build-args: |
|
||||
SGLANG_KERNEL_NPU_TAG=2026.01.28
|
||||
SGLANG_KERNEL_NPU_TAG=2026.02.01.post2
|
||||
CANN_VERSION=${{ matrix.cann_version }}
|
||||
DEVICE_TYPE=${{ matrix.device_type }}
|
||||
SGLANG_TAG=${{ steps.version.outputs.version }}
|
||||
|
||||
22
.github/workflows/release-docs.yml
vendored
@@ -1,6 +1,8 @@
|
||||
name: Release Documentation
|
||||
|
||||
on:
|
||||
release:
|
||||
types: [published]
|
||||
push:
|
||||
branches:
|
||||
- main
|
||||
@@ -25,6 +27,11 @@ jobs:
|
||||
- name: Checkout code
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Fetch full git history for release index
|
||||
if: github.event_name == 'release'
|
||||
run: |
|
||||
git fetch --prune --unshallow || git fetch --prune --depth=0
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
bash scripts/ci/cuda/ci_install_dependency.sh
|
||||
@@ -53,10 +60,23 @@ jobs:
|
||||
make markdown
|
||||
python3 wrap_run_llm.py
|
||||
|
||||
if [[ "${{ github.event_name }}" == "release" ]]; then
|
||||
python3 release_lookup/generate_index.py --output release_lookup/release_index.json
|
||||
|
||||
# Copy release lookup tool for official docs on published releases.
|
||||
mkdir -p _build/html/release_lookup
|
||||
cp release_lookup/index.html _build/html/release_lookup/
|
||||
cp release_lookup/release_index.json _build/html/release_lookup/
|
||||
fi
|
||||
|
||||
cd _build/html
|
||||
|
||||
git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io --depth 1
|
||||
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
|
||||
if [[ "${{ github.event_name }}" == "release" ]]; then
|
||||
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
|
||||
else
|
||||
find ../sgl-project.github.io/ -mindepth 1 -not -path "../sgl-project.github.io/.git*" -not -path "../sgl-project.github.io/release_lookup*" -not -name CNAME -not -name ".jekyll" -not -name ".nojekyll" -delete
|
||||
fi
|
||||
cp -r * ../sgl-project.github.io
|
||||
cp ../../README.md ../sgl-project.github.io/README.md
|
||||
cd ../sgl-project.github.io
|
||||
|
||||
34
.github/workflows/release-pypi-nightly.yml
vendored
@@ -54,28 +54,26 @@ jobs:
|
||||
cd python
|
||||
cp ../README.md ../LICENSE .
|
||||
|
||||
# Parse git describe output to detect exact tag builds (distance=0)
|
||||
# Parse git describe output to get latest tag
|
||||
# Use same command as pyproject.toml to ensure version consistency
|
||||
DESC=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1 | xargs git describe --tags --long 2>/dev/null || echo 'v0.0.0-0-g0000000')
|
||||
DIST=$(echo "$DESC" | cut -d- -f2)
|
||||
TAG=$(echo "$DESC" | cut -d- -f1)
|
||||
HASH="g$(git rev-parse --short HEAD)"
|
||||
BUILD_DATE=$(date -u +%Y%m%d)
|
||||
|
||||
# If building at exact tag (distance=0), force dev0 version for unique wheel names
|
||||
if [ "$DIST" = "0" ]; then
|
||||
TAG=$(echo "$DESC" | cut -d- -f1)
|
||||
HASH="g$(git rev-parse --short HEAD)"
|
||||
# Increment patch version for nightlies (e.g., v0.5.8 -> 0.5.9)
|
||||
VERSION=${TAG#v} # Remove 'v' prefix
|
||||
MAJOR=$(echo "$VERSION" | cut -d. -f1)
|
||||
MINOR=$(echo "$VERSION" | cut -d. -f2)
|
||||
PATCH=$(echo "$VERSION" | cut -d. -f3)
|
||||
NEXT_PATCH=$((PATCH + 1))
|
||||
NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}"
|
||||
|
||||
# Increment patch version for nightlies (e.g., v0.5.8 -> 0.5.9.dev0)
|
||||
VERSION=${TAG#v} # Remove 'v' prefix
|
||||
MAJOR=$(echo "$VERSION" | cut -d. -f1)
|
||||
MINOR=$(echo "$VERSION" | cut -d. -f2)
|
||||
PATCH=$(echo "$VERSION" | cut -d. -f3)
|
||||
NEXT_PATCH=$((PATCH + 1))
|
||||
NEXT_VERSION="${MAJOR}.${MINOR}.${NEXT_PATCH}"
|
||||
|
||||
FORCE_VERSION="${NEXT_VERSION}.dev0+${HASH}"
|
||||
echo "Building at exact tag $TAG, forcing nightly version to: $FORCE_VERSION"
|
||||
export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION"
|
||||
fi
|
||||
# Use date-based dev number for correct chronological sorting
|
||||
# e.g., 0.5.9.dev20260215+g4cf4f0859 > 0.5.9.dev20260214+g45a4697d4
|
||||
FORCE_VERSION="${NEXT_VERSION}.dev${BUILD_DATE}+${HASH}"
|
||||
echo "Forcing nightly version to: $FORCE_VERSION"
|
||||
export SETUPTOOLS_SCM_PRETEND_VERSION="$FORCE_VERSION"
|
||||
|
||||
# Build wheel
|
||||
python3 -m build --wheel
|
||||
|
||||
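The nightly version string built in the hunk above takes the latest `v*.*.*` tag, bumps the patch component, and appends a date-based dev number plus the short commit hash. A minimal Python sketch of the same arithmetic, for illustration only; `nightly_version` is a hypothetical name, and the workflow itself does this with `cut` and shell arithmetic.

```python
import re


def nightly_version(latest_tag: str, build_date: str, short_hash: str) -> str:
    """Build e.g. 0.5.9.dev20260215+g4cf4f0859 from tag v0.5.8."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", latest_tag)
    if match is None:
        raise ValueError(f"expected a vMAJOR.MINOR.PATCH tag, got {latest_tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    # Bump the patch so the nightly sorts after the latest release.
    next_version = f"{major}.{minor}.{patch + 1}"
    # build_date is YYYYMMDD (UTC in the workflow); the hash gets a 'g' prefix.
    return f"{next_version}.dev{build_date}+g{short_hash}"


if __name__ == "__main__":
    print(nightly_version("v0.5.8", "20260215", "4cf4f0859"))
```

Using the build date as the dev number means that two nightlies on consecutive days compare in chronological order, which a constant `.dev0` did not guarantee because only the local `+g<hash>` part differed.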
23
.github/workflows/release-pypi-pr.yml
vendored
@@ -4,11 +4,7 @@ on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
pr_number:
|
||||
description: 'PR number to build wheel for'
|
||||
required: true
|
||||
type: string
|
||||
pr_branch:
|
||||
description: 'PR branch name to build from (e.g., my-feature-branch or refs/pull/123/head)'
|
||||
description: 'PR number to build wheel for (works with both internal and fork PRs)'
|
||||
required: true
|
||||
type: string
|
||||
|
||||
@@ -27,7 +23,7 @@ jobs:
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_branch }}
|
||||
ref: refs/pull/${{ inputs.pr_number }}/head
|
||||
fetch-depth: 0 # Need full history for version generation
|
||||
|
||||
- name: Set up Python
|
||||
@@ -38,13 +34,14 @@ jobs:
|
||||
- name: Generate PR wheel version
|
||||
id: gen_version
|
||||
run: |
|
||||
# Get base version from setuptools_scm
|
||||
cd python
|
||||
pip install setuptools-scm
|
||||
FULL_VERSION=$(python -c "from setuptools_scm import get_version; print(get_version(root='..'))")
|
||||
# Strip any existing .dev or + suffix to get clean base version
|
||||
BASE_VERSION=$(echo "$FULL_VERSION" | sed 's/\.dev.*//;s/+.*//')
|
||||
cd ..
|
||||
# Get base version from the latest v*.*.* git tag directly
|
||||
# Note: We cannot use setuptools_scm here because the [tool.setuptools_scm]
|
||||
# config (with custom git_describe_command) lives in python/pyproject.toml,
|
||||
# not at the repo root. Without that config, setuptools_scm falls back to
|
||||
# default git describe which finds gateway-* tags instead of v*.*.* release tags.
|
||||
LATEST_TAG=$(git tag --list --sort=-version:refname 'v*.*.*' | head -1)
|
||||
BASE_VERSION=${LATEST_TAG#v}
|
||||
echo "Latest release tag: ${LATEST_TAG}"
|
||||
|
||||
# Get commit info
|
||||
COMMIT_HASH=$(git rev-parse --short HEAD)
|
||||
|
||||
19
.github/workflows/release-whl-kernel.yml
vendored
@@ -11,6 +11,10 @@ on:
|
||||
tag_name:
|
||||
type: string
|
||||
required: false
|
||||
pr_number:
|
||||
description: "PR number to build from (e.g. 12345)"
|
||||
type: string
|
||||
required: false
|
||||
|
||||
concurrency:
|
||||
group: release-sglang-kernels-${{ github.ref }}
|
||||
@@ -34,6 +38,7 @@ jobs:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
submodules: "recursive"
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
|
||||
|
||||
- name: Set up Python ${{ matrix.python-version }}
|
||||
uses: actions/setup-python@v5
|
||||
@@ -46,7 +51,8 @@ jobs:
|
||||
chmod +x ./build.sh
|
||||
./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
|
||||
env:
|
||||
USE_CCACHE: 0
|
||||
BUILD_JOBS: 64
|
||||
NVCC_THREADS: 8
|
||||
|
||||
- name: Upload to PyPI
|
||||
working-directory: sgl-kernel
|
||||
@@ -65,6 +71,8 @@ jobs:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
|
||||
|
||||
- name: Download artifacts
|
||||
uses: actions/download-artifact@v4
|
||||
@@ -127,6 +135,7 @@ jobs:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
submodules: "recursive"
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
|
||||
|
||||
- name: Set up Python ${{ matrix.python-version }}
|
||||
uses: actions/setup-python@v5
|
||||
@@ -139,7 +148,8 @@ jobs:
|
||||
chmod +x ./build.sh
|
||||
./build.sh "${{ matrix.python-version }}" "${{ matrix.cuda-version }}" ${{ matrix.arch == 'aarch64' && 'aarch64' || '' }}
|
||||
env:
|
||||
USE_CCACHE: 0
|
||||
BUILD_JOBS: 64
|
||||
NVCC_THREADS: 8
|
||||
|
||||
- name: Upload artifacts
|
||||
uses: actions/upload-artifact@v4
|
||||
@@ -152,6 +162,8 @@ jobs:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
|
||||
|
||||
- name: Download artifacts
|
||||
uses: actions/download-artifact@v4
|
||||
@@ -207,6 +219,7 @@ jobs:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
submodules: "recursive"
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
|
||||
|
||||
- name: Set up Python ${{ matrix.python-version }}
|
||||
uses: actions/setup-python@v5
|
||||
@@ -231,6 +244,8 @@ jobs:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.pr_number && format('refs/pull/{0}/head', inputs.pr_number) || '' }}
|
||||
|
||||
- name: Download artifacts
|
||||
uses: actions/download-artifact@v4
|
||||
|
||||
30
.github/workflows/retag-docker.yml
vendored
Normal file
@@ -0,0 +1,30 @@
|
||||
name: Retag Docker Image
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
source_tag:
|
||||
description: "Existing image tag (e.g., v0.4.7-cu129-amd64)"
|
||||
required: true
|
||||
target_tag:
|
||||
description: "New tag to apply (e.g., latest)"
|
||||
required: true
|
||||
|
||||
jobs:
|
||||
retag:
|
||||
if: github.repository == 'sgl-project/sglang'
|
||||
runs-on: ubuntu-22.04
|
||||
environment: "prod"
|
||||
steps:
|
||||
- name: Login to Docker Hub
|
||||
uses: docker/login-action@v2
|
||||
with:
|
||||
username: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
password: ${{ secrets.DOCKERHUB_TOKEN }}
|
||||
|
||||
- name: Retag image
|
||||
run: |
|
||||
echo "Retagging lmsysorg/sglang:${{ inputs.source_tag }} -> lmsysorg/sglang:${{ inputs.target_tag }}"
|
||||
docker buildx imagetools create \
|
||||
-t lmsysorg/sglang:${{ inputs.target_tag }} \
|
||||
lmsysorg/sglang:${{ inputs.source_tag }}
|
||||
5
.gitignore
vendored
@@ -245,7 +245,6 @@ sgl-model-gateway/tests/fixtures/golden/
|
||||
|
||||
lmms-eval
|
||||
|
||||
**/.claude/
|
||||
**/.serena/
|
||||
ctags/
|
||||
outputs/
|
||||
@@ -262,10 +261,6 @@ outputs/
|
||||
# setuptools-scm generated version file
|
||||
python/sglang/_version.py
|
||||
|
||||
# Generated protobuf files (regenerate during wheel build or with compile_proto.py)
|
||||
python/sglang/srt/grpc/*_pb2.py
|
||||
python/sglang/srt/grpc/*_pb2_grpc.py
|
||||
python/sglang/srt/grpc/*_pb2.pyi
|
||||
|
||||
# MUSA section
|
||||
# Generated source files by torchada
|
||||
|
||||
@@ -3,7 +3,7 @@ exclude: ^(python/sglang/multimodal_gen/csrc|python/sglang/jit_kernel/flash_atte
|
||||
|
||||
repos:
|
||||
- repo: https://github.com/pre-commit/pre-commit-hooks
|
||||
rev: v5.0.0
|
||||
rev: v6.0.0
|
||||
hooks:
|
||||
- id: check-symlinks
|
||||
- id: destroyed-symlinks
|
||||
@@ -21,12 +21,12 @@ repos:
|
||||
- id: debug-statements
|
||||
- id: no-commit-to-branch
|
||||
- repo: https://github.com/PyCQA/isort
|
||||
rev: 5.13.2
|
||||
rev: 7.0.0
|
||||
hooks:
|
||||
- id: isort
|
||||
exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
|
||||
- repo: https://github.com/astral-sh/ruff-pre-commit
|
||||
rev: v0.11.7
|
||||
rev: v0.15.1
|
||||
hooks:
|
||||
- id: ruff
|
||||
args:
|
||||
@@ -43,7 +43,7 @@ repos:
|
||||
python/sglang/srt/grpc/.*_pb2_grpc\.pyi$|
|
||||
)$
|
||||
- repo: https://github.com/psf/black
|
||||
rev: 24.10.0
|
||||
rev: 26.1.0
|
||||
hooks:
|
||||
- id: black-jupyter
|
||||
exclude: '^python/sglang/srt/grpc/.*_pb2\.py$|^python/sglang/srt/grpc/.*_pb2_grpc\.py$|^python/sglang/srt/grpc/.*_pb2\.pyi$|^python/sglang/srt/grpc/.*_pb2_grpc\.pyi$'
|
||||
@@ -53,13 +53,13 @@ repos:
|
||||
- id: codespell
|
||||
args: ['--config', '.codespellrc']
|
||||
- repo: https://github.com/pre-commit/mirrors-clang-format
|
||||
rev: v18.1.8
|
||||
rev: v20.1.7
|
||||
hooks:
|
||||
- id: clang-format
|
||||
types_or: [c++, cuda]
|
||||
args: [--style=file, --verbose]
|
||||
- repo: https://github.com/kynan/nbstripout
|
||||
rev: 0.8.1
|
||||
rev: 0.9.0
|
||||
hooks:
|
||||
- id: nbstripout
|
||||
args:
|
||||
|
||||
6
3rdparty/amd/tuning/benchmark_moe_rocm.py
vendored
@@ -187,10 +187,8 @@ def run_grid(bs, model, method, tp_size, dtype: str):
|
||||
|
||||
configs = union_of_list_of_dicts(prune_configs_1, prune_configs_2)
|
||||
|
||||
print(
|
||||
f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \
|
||||
{len(prune_configs_2)=} | {len(configs)=}"
|
||||
)
|
||||
print(f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \
|
||||
{len(prune_configs_2)=} | {len(configs)=}")
|
||||
|
||||
best_config = None
|
||||
best_time_us = 1e20
|
||||
|
||||
166
benchmark/asr/README.md
Normal file
@@ -0,0 +1,166 @@
# ASR Benchmark

This benchmark evaluates the performance and accuracy (Word Error Rate - WER) of Automatic Speech Recognition (ASR) models served via SGLang.

## Supported Models

- `openai/whisper-large-v3`
- `openai/whisper-large-v3-turbo`

## Setup

Install the required dependencies:

```bash
apt install ffmpeg
pip install librosa soundfile datasets evaluate jiwer transformers openai torchcodec torch
```

## Running the Benchmark

### 1. Start SGLang Server

Launch the SGLang server with a Whisper model:

```bash
python -m sglang.launch_server --model-path openai/whisper-large-v3 --port 30000
```

### 2. Run the Benchmark Script

Basic usage (using chat completions API):

```bash
python bench_sglang.py --base-url http://localhost:30000 --model openai/whisper-large-v3 --n-examples 10
```

Using the OpenAI-compatible transcription API:

```bash
python bench_sglang.py \
    --base-url http://localhost:30000 \
    --model openai/whisper-large-v3 \
    --api-type transcription \
    --language English \
    --n-examples 10
```

Run with streaming and show real-time output:

```bash
python bench_sglang.py \
    --base-url http://localhost:30000 \
    --model openai/whisper-large-v3 \
    --api-type transcription \
    --stream \
    --show-predictions \
    --concurrency 1
```

Run with higher concurrency and save results:

```bash
python bench_sglang.py \
    --base-url http://localhost:30000 \
    --model openai/whisper-large-v3 \
    --concurrency 8 \
    --n-examples 100 \
    --output results.json \
    --show-predictions
```

## Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-url` | SGLang server URL | `http://localhost:30000` |
| `--model` | Model name on the server | `openai/whisper-large-v3` |
| `--dataset` | HuggingFace dataset for evaluation | `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` |
| `--split` | Dataset split to use | `validation` |
| `--concurrency` | Number of concurrent requests | `4` |
| `--n-examples` | Number of examples to process (`-1` for all) | `-1` |
| `--output` | Path to save results as JSON | `None` |
| `--show-predictions` | Display sample predictions | `False` |
| `--print-n` | Number of samples to display | `5` |
| `--api-type` | API to use: `chat` (chat completions) or `transcription` (audio transcriptions) | `chat` |
| `--language` | Language for transcription API (e.g., `English`, `en`) | `None` |
| `--stream` | Enable streaming mode for transcription API | `False` |

## Metrics

The benchmark outputs:

| Metric | Description |
|--------|-------------|
| **Total Requests** | Number of successful ASR requests processed |
| **WER** | Word Error Rate (lower is better), computed using the `evaluate` library |
| **Average Latency** | Mean time per request (seconds) |
| **Median Latency** | 50th percentile latency (seconds) |
| **95th Latency** | 95th percentile latency (seconds) |
| **Throughput** | Requests processed per second |
| **Token Throughput** | Output tokens per second |
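
As a rough illustration of what the WER figure in the table above measures, the sketch below computes a word-level edit distance by hand. It is not how the benchmark obtains its score (the script relies on the `evaluate` library); the `word_error_rate` helper and the sample strings are hypothetical and exist only for illustration.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Hypothetical helper: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j predicted words.
    prev = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(pred_words)
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = prev[j - 1] + (ref_word != pred_word)
            deletion = prev[j] + 1       # reference word missing from the prediction
            insertion = curr[j - 1] + 1  # extra word in the prediction
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    ref = "we talked about 4.7 gigawatts"
    pred = "we talked about 4.7 gigawatt"
    # The benchmark reports WER as a percentage, so scale by 100 here as well.
    print(f"WER: {100 * word_error_rate(ref, pred):.2f}")
```

Because insertions count as errors too, WER can exceed 100% when the prediction is much longer than the reference.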

## Example Output

```bash
python bench_sglang.py --api-type transcription --concurrency 128 --model openai/whisper-large-v3 --show-predictions

Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered...
Using API type: transcription
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Performing warmup...
Processing 511 samples...
------------------------------
Results for openai/whisper-large-v3:
Total Requests: 511
WER: 12.7690
Average Latency: 1.3602s
Median Latency: 1.2090s
95th Latency: 2.9986s
Throughput: 19.02 req/s
Token Throughput: 354.19 tok/s
Total Test Time: 26.8726s
------------------------------

==================== Sample Predictions ====================
Sample 1:
  REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement
  PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement
----------------------------------------
Sample 2:
  REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
  PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
----------------------------------------
Sample 3:
  REF: we talked about 4.7 gigawatts
  PRED: we talked about 4.7 gigawatts
----------------------------------------
Sample 4:
  REF: and you know depending on that working capital build we will we will see what that yields
  PRED: and depending on that working capital build we will see what that yields what
----------------------------------------
Sample 5:
  REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70%
  PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70%
----------------------------------------
============================================================
```

## Notes

- Audio samples longer than 30 seconds are automatically filtered out (Whisper limitation)
- The benchmark performs a warmup request before measuring performance
- Results are normalized using the model's tokenizer when available
- When using `--stream` with `--show-predictions`, use `--concurrency 1` for clean sequential output
- The `--language` option accepts both full names (e.g., `English`) and ISO 639-1 codes (e.g., `en`)
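
The 30-second filter mentioned in the first note can be sketched as follows. This is a simplified illustration rather than the script's actual code path (which uses `librosa.get_duration` together with the `datasets` filtering API); `duration_ms`, `keep_short_samples`, and the toy samples are assumptions made for this example, and the audio is assumed to be mono and stored as a flat sequence.

```python
MAX_DURATION_MS = 30_000  # 30-second Whisper limit referred to in the notes above


def duration_ms(num_samples: int, sampling_rate: int) -> float:
    """Duration of a mono clip in milliseconds (hypothetical helper)."""
    return num_samples / sampling_rate * 1000


def keep_short_samples(samples):
    """Keep only clips strictly shorter than MAX_DURATION_MS."""
    return [
        s
        for s in samples
        if duration_ms(len(s["array"]), s["sampling_rate"]) < MAX_DURATION_MS
    ]


if __name__ == "__main__":
    toy_samples = [
        {"array": [0.0] * 16_000, "sampling_rate": 16_000},         # 1 second
        {"array": [0.0] * (31 * 16_000), "sampling_rate": 16_000},  # 31 seconds
    ]
    kept = keep_short_samples(toy_samples)
    print(f"kept {len(kept)} of {len(toy_samples)} samples")
```

The comparison is strict, so a clip of exactly 30 seconds is dropped as well, matching the `< 30000` check in the benchmark script.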

## Troubleshooting

**Server connection refused**
- Ensure the SGLang server is running and accessible at the specified `--base-url`
- Check that the port is not blocked by a firewall

**Out of memory errors**
- Reduce `--concurrency` to lower GPU memory usage
- Use a smaller Whisper model variant
404
benchmark/asr/bench_sglang.py
Normal file
@@ -0,0 +1,404 @@
import argparse
import asyncio
import base64
import io
import json
import time
from statistics import mean, median

import httpx
import librosa
import numpy as np
import soundfile
from datasets import load_dataset
from evaluate import load
from openai import AsyncOpenAI, OpenAI
from transformers import AutoTokenizer


def to_bytes(y, sr):
    buffer = io.BytesIO()
    soundfile.write(buffer, y, sr, format="WAV")
    buffer.seek(0)
    return buffer


async def run_asr_chat(client, model_name, y, sr):
    """Use chat completions API with audio_url for ASR."""
    with to_bytes(y, sr) as f:
        audio_bytes = f.read()
        audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

    start_time = time.perf_counter()
    response = await client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio_url",
                        "audio_url": {"url": f"data:audio/wav;base64,{audio_base64}"},
                    }
                ],
            }
        ],
        temperature=0.0,
    )
    end_time = time.perf_counter()

    asr_text = response.choices[0].message.content
    latency = end_time - start_time
    return latency, asr_text


def run_asr_transcription_sync(client, model_name, y, sr, language=None):
    """Use audio transcriptions API for ASR (sync version)."""
    audio_buffer = to_bytes(y, sr)
    audio_buffer.name = "audio.wav"  # OpenAI client needs a name attribute

    start_time = time.perf_counter()
    kwargs = {
        "model": model_name,
        "file": audio_buffer,
    }
    if language:
        kwargs["language"] = language

    transcription = client.audio.transcriptions.create(**kwargs)
    end_time = time.perf_counter()

    latency = end_time - start_time
    return latency, transcription.text


def run_asr_transcription_stream_sync(
    base_url, model_name, y, sr, language=None, show_stream=False
):
    """Use audio transcriptions API with streaming for ASR."""
    audio_buffer = to_bytes(y, sr)
    audio_bytes = audio_buffer.read()

    data = {
        "model": model_name,
        "response_format": "json",
        "stream": "true",
    }
    if language:
        data["language"] = language

    start_time = time.perf_counter()
    text_chunks = []

    if show_stream:
        print("[STREAM] ", end="", flush=True)

    with httpx.stream(
        "POST",
        f"{base_url}/v1/audio/transcriptions",
        data=data,
        files={"file": ("audio.wav", audio_bytes, "audio/wav")},
        timeout=60.0,
    ) as response:
        for line in response.iter_lines():
            if line.startswith("data: ") and not line.startswith("data: [DONE]"):
                try:
                    chunk = json.loads(line[6:])
                    if "choices" in chunk and chunk["choices"]:
                        delta = chunk["choices"][0].get("delta", {})
                        content = delta.get("content", "")
                        if content:
                            text_chunks.append(content)
                            if show_stream:
                                print(content, end="", flush=True)
                except json.JSONDecodeError:
                    pass

    if show_stream:
        print()  # newline after stream

    end_time = time.perf_counter()
    latency = end_time - start_time
    return latency, "".join(text_chunks)


async def run_asr_transcription(
    client,
    model_name,
    y,
    sr,
    language=None,
    stream=False,
    base_url=None,
    show_stream=False,
):
    """Async wrapper for transcription API (runs sync call in executor)."""
    loop = asyncio.get_event_loop()
    if stream:
        return await loop.run_in_executor(
            None,
            run_asr_transcription_stream_sync,
            base_url,
            model_name,
            y,
            sr,
            language,
            show_stream,
        )
    return await loop.run_in_executor(
        None, run_asr_transcription_sync, client, model_name, y, sr, language
    )


async def bound_asr(
    sem,
    client,
    model_name,
    tokenizer,
    audio,
    reference,
    api_type="chat",
    language=None,
    stream=False,
    base_url=None,
    show_stream=False,
):
    async with sem:
        try:
            if api_type == "transcription":
                latency, text = await run_asr_transcription(
                    client,
                    model_name,
                    *audio,
                    language=language,
                    stream=stream,
                    base_url=base_url,
                    show_stream=show_stream,
                )
            else:
                latency, text = await run_asr_chat(client, model_name, *audio)

            # Calculate tokens for throughput metrics
            num_output_tokens = len(tokenizer(text, add_special_tokens=False).input_ids)

            # Normalize for WER evaluation
            # Whisper tokenizer has a normalize method
            if hasattr(tokenizer, "normalize"):
                out = tokenizer.normalize(text)
                ref = tokenizer.normalize(reference)
            else:
                out = text.lower().strip()
                ref = reference.lower().strip()

            return latency, num_output_tokens, out, ref
        except Exception as e:
            print(f"Error during ASR: {e}")
            return None


async def process_dataset(
    model_name,
    client,
    data,
    concurrent_request,
    api_type="chat",
    language=None,
    stream=False,
    base_url=None,
    show_predictions=False,
):
    sem = asyncio.Semaphore(concurrent_request)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Warmup
    print("Performing warmup...")
    audio_warmup, sr_warmup = (
        data[0]["audio"]["array"],
        data[0]["audio"]["sampling_rate"],
    )
    await bound_asr(
        sem,
        client,
        model_name,
        tokenizer,
        (audio_warmup, sr_warmup),
        "",
        api_type=api_type,
        language=language,
        stream=stream,
        base_url=base_url,
        show_stream=False,  # Don't show stream during warmup
    )

    tasks = []
    print(f"Processing {len(data)} samples...")
    for sample in data:
        audio, sr = sample["audio"]["array"], sample["audio"]["sampling_rate"]
        tasks.append(
            asyncio.create_task(
                bound_asr(
                    sem,
                    client,
                    model_name,
                    tokenizer,
                    (audio, sr),
                    sample["text"],
                    api_type=api_type,
                    language=language,
                    stream=stream,
                    base_url=base_url,
                    show_stream=show_predictions and stream,
                )
            )
        )

    results = await asyncio.gather(*tasks)
    return [r for r in results if r is not None]


def run_evaluation(args):
    # Use sync client for transcription API, async for chat API
    if args.api_type == "transcription":
        client = OpenAI(base_url=f"{args.base_url}/v1", api_key="None")
    else:
        client = AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="None")

    print(f"Loading dataset: {args.dataset}...")
    print(f"Using API type: {args.api_type}" + (f" (streaming)" if args.stream else ""))
    dataset = load_dataset(args.dataset, split=args.split)

    # Filter by duration if needed (Whisper max is 30s)
    def add_duration(sample):
        y, sr = sample["audio"]["array"], sample["audio"]["sampling_rate"]
        sample["duration_ms"] = librosa.get_duration(y=y, sr=sr) * 1000
        return sample

    if "duration_ms" not in dataset.column_names:
        dataset = dataset.map(add_duration)

    dataset = dataset.filter(lambda x: x["duration_ms"] < 30000)

    if args.n_examples > 0:
        dataset = dataset.select(range(min(args.n_examples, len(dataset))))

    start = time.perf_counter()
    results = asyncio.run(
        process_dataset(
            args.model,
            client,
            dataset,
            args.concurrency,
            api_type=args.api_type,
            language=args.language,
            stream=args.stream,
            base_url=args.base_url,
            show_predictions=args.show_predictions,
        )
    )
    total_test_time = time.perf_counter() - start

    if not results:
        print("No successful results to evaluate.")
        return

    # Metrics
    latencies = [res[0] for res in results]
    total_tokens = sum([res[1] for res in results])
    predictions = [res[2] for res in results]
    references = [res[3] for res in results]

    wer_metric = load("wer")
    wer_score = 100 * wer_metric.compute(references=references, predictions=predictions)

    print("-" * 30)
    print(f"Results for {args.model}:")
    print(f"Total Requests: {len(results)}")
    print(f"WER: {wer_score:.4f}")
    print(f"Average Latency: {mean(latencies):.4f}s")
    print(f"Median Latency: {median(latencies):.4f}s")
    print(f"95th Latency: {np.percentile(latencies, 95):.4f}s")
    print(f"Throughput: {len(results) / total_test_time:.2f} req/s")
    print(f"Token Throughput: {total_tokens / total_test_time:.2f} tok/s")
    print(f"Total Test Time: {total_test_time:.4f}s")
    print("-" * 30)

    if args.output:
        with open(args.output, "w") as f:
            import json

            json.dump(
                {
                    "model": args.model,
                    "dataset": args.dataset,
                    "wer": wer_score,
                    "avg_latency": mean(latencies),
                    "throughput": len(results) / total_test_time,
                    "token_throughput": total_tokens / total_test_time,
                },
                f,
                indent=2,
            )

    if args.show_predictions:
        print("\n" + "=" * 20 + " Sample Predictions " + "=" * 20)
        num_to_show = min(args.print_n, len(results))
        for i in range(num_to_show):
            print(f"Sample {i+1}:")
            print(f" REF: {references[i]}")
            print(f" PRED: {predictions[i]}")
            print("-" * 40)
        print("=" * 60)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark SGLang ASR performance.")
    parser.add_argument(
        "--base-url", default="http://localhost:30000", help="SGLang server base URL"
    )
    parser.add_argument(
        "--model", default="openai/whisper-large-v3", help="Model name on the server"
    )
    parser.add_argument(
        "--dataset",
        default="D4nt3/esb-datasets-earnings22-validation-tiny-filtered",
        help="HF dataset repo",
    )
    parser.add_argument("--split", default="validation", help="Dataset split")
    parser.add_argument(
        "--concurrency", type=int, default=4, help="Number of concurrent requests"
    )
    parser.add_argument(
        "--n-examples",
        "-n",
        type=int,
        default=-1,
        help="Number of examples to test (-1 for all)",
    )
    parser.add_argument("--output", help="Path to save results in JSON")
    parser.add_argument(
        "--show-predictions",
        action="store_true",
        help="Print sample predictions and references",
    )
    parser.add_argument(
        "--print-n", type=int, default=5, help="Number of sample predictions to print"
    )
    parser.add_argument(
        "--api-type",
        choices=["chat", "transcription"],
        default="chat",
        help="API type to use: 'chat' for chat completions with audio_url, 'transcription' for audio.transcriptions API",
    )
    parser.add_argument(
        "--language",
        default=None,
        help="Language code for transcription API (e.g., 'en')",
    )
    parser.add_argument(
        "--stream",
        action="store_true",
        help="Use streaming mode for transcription API",
    )
    args = parser.parse_args()

    run_evaluation(args)
@@ -4,7 +4,7 @@ The SGLang and DeepSeek teams collaborated to get DeepSeek V3 FP8 running on NVI
|
||||
|
||||
Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
|
||||
|
||||
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.io/basic_usage/deepseek.html).
|
||||
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek V3/V3.1/R1 Model Optimizations in SGLang](https://docs.sglang.io/basic_usage/deepseek_v3.html#optimizations).
|
||||
|
||||
## Installation & Launch
|
||||
|
||||
@@ -271,7 +271,7 @@ Then we can benchmark the accuracy and latency by accessing the first node's exp
|
||||
|
||||
```bash
|
||||
# bench accuracy
|
||||
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host http://10.0.0.1 --port 30000
|
||||
python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --host 10.0.0.1 --port 30000
|
||||
|
||||
# bench latency
|
||||
python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1:30000 --batch-size 1 --input-len 128 --output-len 128
|
||||
|
||||
@@ -7,7 +7,9 @@ import torch
|
||||
from sglang.srt.layers.attention.fla.layernorm_gated import (
|
||||
_layer_norm_fwd as layer_norm_fwd,
|
||||
)
|
||||
from sglang.srt.layers.attention.fla.layernorm_gated import rms_norm_ref
|
||||
from sglang.srt.layers.attention.fla.layernorm_gated import (
|
||||
rms_norm_ref,
|
||||
)
|
||||
|
||||
|
||||
def benchmark_layer_norm_fwd(
|
||||
|
||||
@@ -48,6 +48,18 @@ def main(args):
|
||||
# Select backend
|
||||
set_default_backend(select_sglang_backend(args))
|
||||
|
||||
# Load tokenizer if enable_thinking is set
|
||||
tokenizer = None
|
||||
if args.enable_thinking:
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
assert (
|
||||
args.tokenizer_path is not None
|
||||
), "--tokenizer-path is required when --enable-thinking is set"
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
args.tokenizer_path, trust_remote_code=True
|
||||
)
|
||||
|
||||
# Read data
|
||||
if args.platinum:
|
||||
print("Loading GSM8K Platinum dataset from HuggingFace...")
|
||||
@@ -70,7 +82,16 @@ def main(args):
|
||||
questions = []
|
||||
labels = []
|
||||
for i in range(len(lines[:num_questions])):
|
||||
questions.append(get_one_example(lines, i, False))
|
||||
raw_question = few_shot_examples + get_one_example(lines, i, False)
|
||||
if tokenizer is not None:
|
||||
messages = [{"role": "user", "content": raw_question}]
|
||||
raw_question = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
tokenize=False,
|
||||
add_generation_prompt=True,
|
||||
enable_thinking=True,
|
||||
)
|
||||
questions.append(raw_question)
|
||||
labels.append(get_answer_value(lines[i]["answer"]))
|
||||
assert all(l != INVALID for l in labels)
|
||||
arguments = [{"question": q} for q in questions]
|
||||
@@ -83,9 +104,11 @@ def main(args):
|
||||
|
||||
@sgl.function
|
||||
def few_shot_gsm8k(s, question):
|
||||
s += few_shot_examples + question
|
||||
s += question
|
||||
s += sgl.gen(
|
||||
"answer", max_tokens=512, stop=["Question", "Assistant:", "<|separator|>"]
|
||||
"answer",
|
||||
max_tokens=args.max_new_tokens,
|
||||
stop=["Question", "Assistant:", "<|separator|>"],
|
||||
)
|
||||
|
||||
#####################################
|
||||
@@ -96,7 +119,8 @@ def main(args):
|
||||
tic = time.perf_counter()
|
||||
states = few_shot_gsm8k.run_batch(
|
||||
arguments,
|
||||
temperature=0,
|
||||
temperature=args.temperature,
|
||||
top_p=args.top_p,
|
||||
num_threads=args.parallel,
|
||||
progress_bar=True,
|
||||
)
|
||||
@@ -152,6 +176,20 @@ if __name__ == "__main__":
|
||||
parser.add_argument("--num-shots", type=int, default=5)
|
||||
parser.add_argument("--data-path", type=str, default="test.jsonl")
|
||||
parser.add_argument("--num-questions", type=int, default=200)
|
||||
parser.add_argument("--max-new-tokens", type=int, default=512)
|
||||
parser.add_argument("--temperature", type=float, default=0.0)
|
||||
parser.add_argument("--top-p", type=float, default=1.0)
|
||||
parser.add_argument(
|
||||
"--enable-thinking",
|
||||
action="store_true",
|
||||
help="Enable thinking mode by wrapping prompts with chat template",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--tokenizer-path",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Path to tokenizer (required when --enable-thinking is set)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--platinum",
|
||||
action="store_true",
|
||||
|
||||
@@ -12,7 +12,7 @@ from bench_multiturn import (
|
||||
)
|
||||
from tqdm.asyncio import tqdm
|
||||
|
||||
from sglang.bench_serving import get_tokenizer
|
||||
from sglang.benchmark.utils import get_tokenizer
|
||||
|
||||
|
||||
class ContextWorkloadGenerator(WorkloadGenerator):
|
||||
|
||||
@@ -12,12 +12,9 @@ from functools import wraps
|
||||
|
||||
import aiohttp
|
||||
|
||||
from sglang.bench_serving import (
|
||||
RequestFuncOutput,
|
||||
get_tokenizer,
|
||||
remove_prefix,
|
||||
sample_random_requests,
|
||||
)
|
||||
from sglang.bench_serving import RequestFuncOutput
|
||||
from sglang.benchmark.datasets.random import sample_random_requests
|
||||
from sglang.benchmark.utils import get_tokenizer, remove_prefix
|
||||
|
||||
# Set up logger
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -11,7 +11,8 @@ import numpy as np
 import requests
 from tqdm.asyncio import tqdm
 
-from sglang.bench_serving import get_tokenizer, sample_random_requests
+from sglang.benchmark.datasets.random import sample_random_requests
+from sglang.benchmark.utils import get_tokenizer
 from sglang.test.kits.cache_hit_kit import async_request_sglang_generate, gen_payload
 
 
|
||||
@@ -32,7 +32,7 @@ from data_processing import MsgContent, SampleOutput, get_dataset
 from tqdm.asyncio import tqdm
 from transformers import PreTrainedTokenizerBase
 
-from sglang.bench_serving import get_tokenizer, remove_prefix, set_ulimit
+from sglang.benchmark.utils import get_tokenizer, remove_prefix, set_ulimit
 
 AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=20 * 60 * 60)
 
|
||||
@@ -11,13 +11,13 @@ from nextqa import NExTQALoader
 from tqdm.asyncio import tqdm
 from transformers import PreTrainedTokenizerBase
 
-from sglang.bench_serving import (
+from sglang.benchmark.datasets.common import (
     SHAREGPT_FILENAME,
     SHAREGPT_REPO_ID,
-    download_and_cache_hf_file,
     gen_prompt,
-    get_gen_prefix_cache_path,
 )
+from sglang.benchmark.datasets.generated_shared_prefix import get_gen_prefix_cache_path
+from sglang.benchmark.utils import download_and_cache_hf_file
 from sglang.lang.chat_template import get_chat_template, get_chat_template_by_model_path
 from sglang.srt.entrypoints.openai.protocol import ChatCompletionMessageContentPart
 from sglang.utils import encode_video_base64
@@ -442,7 +442,15 @@ def sample_generated_shared_prefix_requests(
     disable_shuffle: bool = False,
 ) -> SampleOutput:
     """Generate benchmark requests with shared system prompts using random tokens and caching."""
-    cache_path = get_gen_prefix_cache_path(args, tokenizer)
+    cache_path = get_gen_prefix_cache_path(
+        args.seed,
+        num_groups,
+        prompts_per_group,
+        system_prompt_len,
+        question_len,
+        output_len,
+        tokenizer,
+    )
 
     # Try to load from cache first
     if cache_path.exists():
|
||||
536
benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py
Normal file
@@ -0,0 +1,536 @@
|
||||
"""
|
||||
Benchmark fused allreduce+rmsnorm on AMD with correctness checks.
|
||||
|
||||
This script targets the same fused op used by SGLang:
|
||||
`tensor_model_parallel_fused_allreduce_rmsnorm`.
|
||||
|
||||
It reports:
|
||||
- eager mode latency (prefill-like)
|
||||
- graph mode latency (decode-like)
|
||||
- fused availability (whether fused path returns non-None)
|
||||
- correctness (fused output matches split allreduce + rmsnorm reference)
|
||||
|
||||
Usage example:
|
||||
torchrun --nproc_per_node=8 \
|
||||
benchmark/kernels/all_reduce/benchmark_fused_ar_rms_amd.py \
|
||||
--dtype bfloat16 \
|
||||
--prefill-shapes 2048x8192,8192x8192 \
|
||||
--decode-shapes 1x8192,4x8192,16x8192 \
|
||||
--warmup 10 --iters 30 --repeats 5
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import os
|
||||
import statistics
|
||||
from typing import Dict, List, Optional, Sequence, Tuple
|
||||
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
import torch.nn.functional as F
|
||||
|
||||
from sglang.srt.distributed.communication_op import (
|
||||
tensor_model_parallel_all_reduce,
|
||||
tensor_model_parallel_fused_allreduce_rmsnorm,
|
||||
)
|
||||
from sglang.srt.distributed.parallel_state import (
|
||||
destroy_distributed_environment,
|
||||
destroy_model_parallel,
|
||||
graph_capture,
|
||||
init_distributed_environment,
|
||||
initialize_model_parallel,
|
||||
set_custom_all_reduce,
|
||||
)
|
||||
|
||||
Shape = Tuple[int, int]
|
||||
|
||||
|
||||
def parse_shapes(raw: str) -> List[Shape]:
|
||||
shapes: List[Shape] = []
|
||||
for item in [x.strip() for x in raw.split(",") if x.strip()]:
|
||||
if "x" not in item:
|
||||
raise ValueError(f"Invalid shape '{item}', expected MxN format.")
|
||||
m_str, n_str = item.split("x", 1)
|
||||
m = int(m_str)
|
||||
n = int(n_str)
|
||||
if m <= 0 or n <= 0:
|
||||
raise ValueError(f"Invalid shape '{item}', both dims must be positive.")
|
||||
shapes.append((m, n))
|
||||
if not shapes:
|
||||
raise ValueError("Empty shape list is not allowed.")
|
||||
return shapes
|
||||
|
||||
|
||||
def dtype_from_name(name: str) -> torch.dtype:
|
||||
mapping = {
|
||||
"float16": torch.float16,
|
||||
"fp16": torch.float16,
|
||||
"bfloat16": torch.bfloat16,
|
||||
"bf16": torch.bfloat16,
|
||||
}
|
||||
if name not in mapping:
|
||||
raise ValueError(f"Unsupported dtype: {name}")
|
||||
return mapping[name]
|
||||
|
||||
|
||||
def check_close(
|
||||
a: torch.Tensor, b: torch.Tensor, dtype: torch.dtype
|
||||
) -> Tuple[bool, str]:
|
||||
if dtype == torch.bfloat16:
|
||||
rtol, atol = 2e-2, 1.25e-1
|
||||
else:
|
||||
rtol, atol = 1e-2, 2e-2
|
||||
try:
|
||||
torch.testing.assert_close(a, b, rtol=rtol, atol=atol)
|
||||
return True, "PASS"
|
||||
except AssertionError:
|
||||
max_diff = torch.max(torch.abs(a - b)).item()
|
||||
mean_diff = torch.mean(torch.abs(a - b)).item()
|
||||
return False, f"FAIL(max={max_diff:.6f},mean={mean_diff:.6f})"
|
||||
|
||||
|
||||
def _measure_us(
|
||||
fn,
|
||||
warmup: int,
|
||||
iters: int,
|
||||
repeats: int,
|
||||
device: torch.device,
|
||||
) -> Tuple[float, Dict[str, float]]:
|
||||
for _ in range(warmup):
|
||||
fn()
|
||||
torch.cuda.synchronize()
|
||||
|
||||
start_event = torch.cuda.Event(enable_timing=True)
|
||||
end_event = torch.cuda.Event(enable_timing=True)
|
||||
samples_us: List[float] = []
|
||||
|
||||
for _ in range(max(1, repeats)):
|
||||
_barrier(device)
|
||||
torch.cuda.synchronize()
|
||||
start_event.record()
|
||||
for _ in range(iters):
|
||||
fn()
|
||||
end_event.record()
|
||||
end_event.synchronize()
|
||||
samples_us.append(start_event.elapsed_time(end_event) * 1000.0 / iters)
|
||||
|
||||
sorted_samples = sorted(samples_us)
|
||||
p50 = float(statistics.median(sorted_samples))
|
||||
p95 = float(sorted_samples[int((len(sorted_samples) - 1) * 0.95)])
|
||||
return p50, {
|
||||
"p50_us": p50,
|
||||
"p95_us": p95,
|
||||
"min_us": float(sorted_samples[0]),
|
||||
"max_us": float(sorted_samples[-1]),
|
||||
}
|
||||
|
||||
|
||||
def _barrier(device: torch.device):
|
||||
try:
|
||||
dist.barrier(device_ids=[device.index])
|
||||
except TypeError:
|
||||
dist.barrier()
|
||||
|
||||
|
||||
def _mean_across_ranks(value: float, device: torch.device) -> float:
|
||||
t = torch.tensor([value], dtype=torch.float64, device=device)
|
||||
dist.all_reduce(t, op=dist.ReduceOp.SUM)
|
||||
t /= dist.get_world_size()
|
||||
return float(t.item())
|
||||
|
||||
|
||||
def _all_true_across_ranks(value: bool, device: torch.device) -> bool:
|
||||
t = torch.tensor([1 if value else 0], dtype=torch.int32, device=device)
|
||||
dist.all_reduce(t, op=dist.ReduceOp.MIN)
|
||||
return bool(int(t.item()))
|
||||
|
||||
|
||||
def _make_inputs(
|
||||
shape: Shape,
|
||||
dtype: torch.dtype,
|
||||
seed: int,
|
||||
residual_mode: str,
|
||||
rank: int,
|
||||
device: torch.device,
|
||||
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
|
||||
m, n = shape
|
||||
torch.manual_seed(seed + rank * 17)
|
||||
x = torch.randn((m, n), dtype=torch.float32, device=device).to(dtype)
|
||||
if residual_mode == "self":
|
||||
residual = x.clone()
|
||||
elif residual_mode == "random":
|
||||
residual = torch.randn((m, n), dtype=torch.float32, device=device).to(dtype)
|
||||
elif residual_mode == "zero":
|
||||
residual = torch.zeros((m, n), dtype=dtype, device=device)
|
||||
else:
|
||||
raise ValueError(f"Unknown residual_mode: {residual_mode}")
|
||||
weight = torch.randn((n,), dtype=torch.float32, device=device).to(dtype)
|
||||
return x, residual, weight
|
||||
|
||||
|
||||
def _split_reference(
|
||||
x: torch.Tensor, residual: torch.Tensor, weight: torch.Tensor, eps: float
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
ar_out = tensor_model_parallel_all_reduce(x.clone())
|
||||
residual_out = ar_out + residual
|
||||
out = F.rms_norm(
|
||||
input=residual_out,
|
||||
normalized_shape=(residual_out.shape[-1],),
|
||||
weight=weight,
|
||||
eps=eps,
|
||||
)
|
||||
return out, residual_out
|
||||
|
||||
|
||||
def bench_eager(
|
||||
x: torch.Tensor,
|
||||
residual: torch.Tensor,
|
||||
weight: torch.Tensor,
|
||||
eps: float,
|
||||
warmup: int,
|
||||
iters: int,
|
||||
repeats: int,
|
||||
) -> Dict[str, object]:
|
||||
split_fn = lambda: _split_reference(x, residual, weight, eps)
|
||||
split_us, split_stats = _measure_us(split_fn, warmup, iters, repeats, x.device)
|
||||
|
||||
fused_probe = tensor_model_parallel_fused_allreduce_rmsnorm(
|
||||
x.clone(), residual.clone(), weight, eps
|
||||
)
|
||||
fused_available = fused_probe is not None
|
||||
|
||||
fused_us: Optional[float] = None
|
||||
fused_stats: Optional[Dict[str, float]] = None
|
||||
if fused_available:
|
||||
fused_fn = lambda: tensor_model_parallel_fused_allreduce_rmsnorm(
|
||||
x, residual, weight, eps
|
||||
)
|
||||
fused_us, fused_stats = _measure_us(fused_fn, warmup, iters, repeats, x.device)
|
||||
|
||||
ref_out, ref_residual = _split_reference(x, residual, weight, eps)
|
||||
if fused_available:
|
||||
fused_out, fused_residual = tensor_model_parallel_fused_allreduce_rmsnorm(
|
||||
x.clone(), residual.clone(), weight, eps
|
||||
)
|
||||
out_ok, out_detail = check_close(fused_out, ref_out, x.dtype)
|
||||
res_ok, res_detail = check_close(fused_residual, ref_residual, x.dtype)
|
||||
correctness_ok = out_ok and res_ok
|
||||
correctness_detail = f"out={out_detail}, residual={res_detail}"
|
||||
else:
|
||||
correctness_ok = True
|
||||
correctness_detail = "SKIP(fused_unavailable)"
|
||||
|
||||
return {
|
||||
"split_us": split_us,
|
||||
"split_stats": split_stats,
|
||||
"fused_available": fused_available,
|
||||
"fused_us": fused_us,
|
||||
"fused_stats": fused_stats,
|
||||
"correctness_ok": correctness_ok,
|
||||
"correctness_detail": correctness_detail,
|
||||
}
|
||||
|
||||
|
||||
def bench_graph(
|
||||
x: torch.Tensor,
|
||||
residual: torch.Tensor,
|
||||
weight: torch.Tensor,
|
||||
eps: float,
|
||||
warmup: int,
|
||||
iters: int,
|
||||
repeats: int,
|
||||
) -> Dict[str, object]:
|
||||
split_x = x.clone()
|
||||
split_res = residual.clone()
|
||||
split_graph_out: Optional[torch.Tensor] = None
|
||||
|
||||
with graph_capture() as gc:
|
||||
split_graph = torch.cuda.CUDAGraph()
|
||||
with torch.cuda.graph(split_graph, stream=gc.stream):
|
||||
split_graph_out, _ = _split_reference(split_x, split_res, weight, eps)
|
||||
|
||||
def split_replay():
|
||||
split_graph.replay()
|
||||
|
||||
split_us, split_stats = _measure_us(split_replay, warmup, iters, repeats, x.device)
|
||||
|
||||
fused_probe = tensor_model_parallel_fused_allreduce_rmsnorm(
|
||||
x.clone(), residual.clone(), weight, eps
|
||||
)
|
||||
fused_available = fused_probe is not None
|
||||
|
||||
fused_us: Optional[float] = None
|
||||
fused_stats: Optional[Dict[str, float]] = None
|
||||
fused_graph_out: Optional[torch.Tensor] = None
|
||||
fused_graph_residual: Optional[torch.Tensor] = None
|
||||
|
||||
if fused_available:
|
||||
fused_x = x.clone()
|
||||
fused_res = residual.clone()
|
||||
with graph_capture() as gc:
|
||||
fused_graph = torch.cuda.CUDAGraph()
|
||||
with torch.cuda.graph(fused_graph, stream=gc.stream):
|
||||
fused_graph_out, fused_graph_residual = (
|
||||
tensor_model_parallel_fused_allreduce_rmsnorm(
|
||||
fused_x, fused_res, weight, eps
|
||||
)
|
||||
)
|
||||
|
||||
def fused_replay():
|
||||
fused_graph.replay()
|
||||
|
||||
fused_us, fused_stats = _measure_us(
|
||||
fused_replay, warmup, iters, repeats, x.device
|
||||
)
|
||||
|
||||
ref_out, ref_residual = _split_reference(x, residual, weight, eps)
|
||||
if (
|
||||
fused_available
|
||||
and fused_graph_out is not None
|
||||
and fused_graph_residual is not None
|
||||
):
|
||||
fused_graph.replay()
|
||||
torch.cuda.synchronize()
|
||||
out_ok, out_detail = check_close(fused_graph_out, ref_out, x.dtype)
|
||||
res_ok, res_detail = check_close(fused_graph_residual, ref_residual, x.dtype)
|
||||
correctness_ok = out_ok and res_ok
|
||||
correctness_detail = f"out={out_detail}, residual={res_detail}"
|
||||
else:
|
||||
correctness_ok = True
|
||||
correctness_detail = "SKIP(fused_unavailable)"
|
||||
|
||||
return {
|
||||
"split_us": split_us,
|
||||
"split_stats": split_stats,
|
||||
"fused_available": fused_available,
|
||||
"fused_us": fused_us,
|
||||
"fused_stats": fused_stats,
|
||||
"correctness_ok": correctness_ok,
|
||||
"correctness_detail": correctness_detail,
|
||||
}
|
||||
|
||||
|
||||
def _shape_bytes(shape: Shape, dtype: torch.dtype) -> int:
|
||||
m, n = shape
|
||||
return m * n * torch.tensor([], dtype=dtype).element_size()
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Benchmark fused allreduce+rmsnorm (prefill eager + decode graph)."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
type=str,
|
||||
default="bf16",
|
||||
choices=["fp16", "bf16", "float16", "bfloat16"],
|
||||
)
|
||||
parser.add_argument("--eps", type=float, default=1e-6)
|
||||
parser.add_argument("--seed", type=int, default=1234)
|
||||
parser.add_argument(
|
||||
"--residual-mode",
|
||||
type=str,
|
||||
default="self",
|
||||
choices=["self", "random", "zero"],
|
||||
help="Use residual=x (self) to match aiter test behavior by default.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--prefill-shapes",
|
||||
type=str,
|
||||
default="2048x8192,8192x8192,16384x8192",
|
||||
help="Comma-separated MxN shapes for eager mode.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--decode-shapes",
|
||||
type=str,
|
||||
default="1x8192,2x8192,4x8192,8x8192,16x8192",
|
||||
help="Comma-separated MxN shapes for graph mode.",
|
||||
)
|
||||
parser.add_argument("--warmup", type=int, default=10)
|
||||
parser.add_argument("--iters", type=int, default=30)
|
||||
parser.add_argument("--repeats", type=int, default=5)
|
||||
parser.add_argument(
|
||||
"--mode",
|
||||
type=str,
|
||||
default="both",
|
||||
choices=["eager", "graph", "both"],
|
||||
)
|
||||
parser.add_argument(
|
||||
"--csv-out",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Optional output CSV path (written on rank 0 only).",
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
dtype = dtype_from_name(args.dtype)
|
||||
rank = int(os.environ.get("RANK", "0"))
|
||||
world_size = int(os.environ.get("WORLD_SIZE", "1"))
|
||||
local_rank = int(os.environ.get("LOCAL_RANK", str(rank)))
|
||||
torch.cuda.set_device(local_rank % torch.cuda.device_count())
|
||||
device = torch.device(f"cuda:{local_rank % torch.cuda.device_count()}")
|
||||
|
||||
set_custom_all_reduce(True)
|
||||
init_distributed_environment(
|
||||
world_size=world_size,
|
||||
rank=rank,
|
||||
local_rank=local_rank,
|
||||
distributed_init_method="env://",
|
||||
backend="nccl",
|
||||
)
|
||||
initialize_model_parallel(tensor_model_parallel_size=world_size)
|
||||
|
||||
prefill_shapes = parse_shapes(args.prefill_shapes)
|
||||
decode_shapes = parse_shapes(args.decode_shapes)
|
||||
|
||||
if rank == 0:
|
||||
print(
|
||||
"Config: "
|
||||
f"world_size={world_size}, dtype={dtype}, residual_mode={args.residual_mode}, "
|
||||
f"warmup={args.warmup}, iters={args.iters}, repeats={args.repeats}"
|
||||
)
|
||||
|
||||
run_modes: Sequence[str]
|
||||
if args.mode == "both":
|
||||
run_modes = ("eager", "graph")
|
||||
else:
|
||||
run_modes = (args.mode,)
|
||||
csv_rows: List[Dict[str, object]] = []
|
||||
|
||||
for mode in run_modes:
|
||||
shapes = prefill_shapes if mode == "eager" else decode_shapes
|
||||
if rank == 0:
|
||||
phase_name = "prefill(eager)" if mode == "eager" else "decode(graph)"
|
||||
print("\n" + "=" * 120)
|
||||
print(f"Mode: {phase_name}")
|
||||
print(
|
||||
"| Shape | Input bytes/rank | Split p50 (us) | Fused p50 (us) | Speedup | Fused available | Correctness |"
|
||||
)
|
||||
print(
|
||||
"|:------|-----------------:|---------------:|---------------:|--------:|:----------------|:------------|"
|
||||
)
|
||||
|
||||
for shape in shapes:
|
||||
x, residual, weight = _make_inputs(
|
||||
shape=shape,
|
||||
dtype=dtype,
|
||||
seed=args.seed,
|
||||
residual_mode=args.residual_mode,
|
||||
rank=rank,
|
||||
device=device,
|
||||
)
|
||||
|
||||
if mode == "eager":
|
||||
metrics = bench_eager(
|
||||
x=x,
|
||||
residual=residual,
|
||||
weight=weight,
|
||||
eps=args.eps,
|
||||
warmup=args.warmup,
|
||||
iters=args.iters,
|
||||
repeats=args.repeats,
|
||||
)
|
||||
else:
|
||||
metrics = bench_graph(
|
||||
x=x,
|
||||
residual=residual,
|
||||
weight=weight,
|
||||
eps=args.eps,
|
||||
warmup=args.warmup,
|
||||
iters=args.iters,
|
||||
repeats=args.repeats,
|
||||
)
|
||||
|
||||
split_us = _mean_across_ranks(float(metrics["split_us"]), device)
|
||||
fused_available = _all_true_across_ranks(
|
||||
bool(metrics["fused_available"]), device
|
||||
)
|
||||
correctness_ok = _all_true_across_ranks(
|
||||
bool(metrics["correctness_ok"]), device
|
||||
)
|
||||
|
||||
fused_us: Optional[float] = None
|
||||
if fused_available and metrics["fused_us"] is not None:
|
||||
fused_us = _mean_across_ranks(float(metrics["fused_us"]), device)
|
||||
|
||||
if rank == 0:
|
||||
m, n = shape
|
||||
shape_str = f"{m}x{n}"
|
||||
bytes_per_rank = _shape_bytes(shape, dtype)
|
||||
if fused_us is not None and fused_us > 0:
|
||||
speedup = split_us / fused_us
|
||||
speedup_str = f"{speedup:.3f}x"
|
||||
fused_str = f"{fused_us:.1f}"
|
||||
else:
|
||||
speedup_str = "N/A"
|
||||
fused_str = "N/A"
|
||||
correctness_text = (
|
||||
"PASS" if correctness_ok else str(metrics["correctness_detail"])
|
||||
)
|
||||
print(
|
||||
f"| {shape_str} | {bytes_per_rank} | {split_us:.1f} | {fused_str} | "
|
||||
f"{speedup_str} | {str(fused_available)} | {correctness_text} |"
|
||||
)
|
||||
csv_rows.append(
|
||||
{
|
||||
"mode": mode,
|
||||
"shape": shape_str,
|
||||
"m": m,
|
||||
"n": n,
|
||||
"bytes_per_rank": bytes_per_rank,
|
||||
"split_p50_us": split_us,
|
||||
"fused_p50_us": fused_us if fused_us is not None else "",
|
||||
"speedup_split_over_fused": (
|
||||
split_us / fused_us
|
||||
if fused_us is not None and fused_us > 0
|
||||
else ""
|
||||
),
|
||||
"fused_available": fused_available,
|
||||
"correctness_ok": correctness_ok,
|
||||
"correctness_detail": correctness_text,
|
||||
"dtype": str(dtype),
|
||||
"world_size": world_size,
|
||||
"residual_mode": args.residual_mode,
|
||||
"warmup": args.warmup,
|
||||
"iters": args.iters,
|
||||
"repeats": args.repeats,
|
||||
}
|
||||
)
|
||||
|
||||
if rank == 0 and args.csv_out:
|
||||
os.makedirs(os.path.dirname(args.csv_out) or ".", exist_ok=True)
|
||||
fieldnames = [
|
||||
"mode",
|
||||
"shape",
|
||||
"m",
|
||||
"n",
|
||||
"bytes_per_rank",
|
||||
"split_p50_us",
|
||||
"fused_p50_us",
|
||||
"speedup_split_over_fused",
|
||||
"fused_available",
|
||||
"correctness_ok",
|
||||
"correctness_detail",
|
||||
"dtype",
|
||||
"world_size",
|
||||
"residual_mode",
|
||||
"warmup",
|
||||
"iters",
|
||||
"repeats",
|
||||
]
|
||||
with open(args.csv_out, "w", newline="", encoding="utf-8") as f:
|
||||
writer = csv.DictWriter(f, fieldnames=fieldnames)
|
||||
writer.writeheader()
|
||||
writer.writerows(csv_rows)
|
||||
print(f"\nSaved CSV to: {args.csv_out}")
|
||||
|
||||
_barrier(device)
|
||||
destroy_model_parallel()
|
||||
destroy_distributed_environment()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
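The shape parsing and latency summary used by the benchmark above can be tried without GPUs. The following is a minimal sketch that mirrors the logic of `parse_shapes` and the percentile part of `_measure_us`; the helper name `summarize_samples` is hypothetical and does not appear in the file. It assumes the same floor-based index for p95, which means that with the default `--repeats 5` the reported p95 is the second-largest sample rather than the maximum.

```python
import statistics
from typing import Dict, List, Tuple

Shape = Tuple[int, int]


def parse_shapes(raw: str) -> List[Shape]:
    """Parse a comma-separated list of MxN shapes, e.g. "1x8192,4x8192"."""
    shapes: List[Shape] = []
    for item in [x.strip() for x in raw.split(",") if x.strip()]:
        if "x" not in item:
            raise ValueError(f"Invalid shape '{item}', expected MxN format.")
        m_str, n_str = item.split("x", 1)
        m, n = int(m_str), int(n_str)
        if m <= 0 or n <= 0:
            raise ValueError(f"Invalid shape '{item}', both dims must be positive.")
        shapes.append((m, n))
    if not shapes:
        raise ValueError("Empty shape list is not allowed.")
    return shapes


def summarize_samples(samples_us: List[float]) -> Dict[str, float]:
    """Hypothetical helper: summarize per-repeat latencies (microseconds).

    Uses the median for p50 and a floor-based index for p95, matching the
    indexing expression in the benchmark's timing function.
    """
    ordered = sorted(samples_us)
    p95_index = int((len(ordered) - 1) * 0.95)
    return {
        "p50_us": float(statistics.median(ordered)),
        "p95_us": float(ordered[p95_index]),
        "min_us": float(ordered[0]),
        "max_us": float(ordered[-1]),
    }


if __name__ == "__main__":
    print(parse_shapes("1x8192, 4x8192"))
    print(summarize_samples([10.0, 20.0, 30.0, 40.0, 50.0]))
```

With only a handful of repeats the p95 value is coarse, so raising `--repeats` is one way to get a more meaningful tail estimate.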
@@ -18,7 +18,13 @@ from sglang.srt.layers.moe.fused_moe_triton.triton_kernels_moe import (
     triton_kernel_moe_forward,
 )
 from sglang.srt.layers.moe.moe_runner import MoeRunnerConfig
-from sglang.srt.layers.moe.topk import TopK, TopKConfig, select_experts
+from sglang.srt.layers.moe.topk import (
+    TopK,
+    TopKConfig,
+    TopKOutputFormat,
+    select_experts,
+)
+from sglang.srt.server_args import ServerArgs, set_global_server_args_for_scheduler
 
 
 def fused_moe_triton_api(
@@ -32,8 +38,8 @@ def fused_moe_triton_api(
         top_k=topk,
         renormalize=False,
         use_grouped_topk=False,
+        output_format=TopKOutputFormat.TRITON_KERNEL,
     )
-    topk_op.use_triton_kernels = True
     triton_topk_output = topk_op.forward_cuda(
         hidden_states=x,
         router_logits=input_gating,
@@ -199,6 +205,10 @@ def main():
     parser.add_argument("--trust-remote-code", action="store_true")
     args = parser.parse_args()
 
+    # Initialize global server args (required by SGLang MoE kernels)
+    server_args = ServerArgs(model_path=args.model)
+    set_global_server_args_for_scheduler(server_args)
+
     try:
         if not torch.distributed.is_initialized():
             torch.distributed.init_process_group(
@@ -217,8 +227,8 @@ def main():
         )
 
         initialize_model_parallel(
-            tensor_model_parallel_size=args.ep_size,
-            pipeline_model_parallel_size=args.tp_size,
+            tensor_model_parallel_size=1,
+            expert_model_parallel_size=1,
         )
 
         model_config = get_model_config(args.model, args.tp_size, args.ep_size)
|
||||
@@ -35,10 +35,9 @@ from sglang.bench_serving import (
     _create_bench_client_session,
     calculate_metrics,
     get_request,
-    get_tokenizer,
-    remove_prefix,
-    sample_random_requests,
 )
+from sglang.benchmark.datasets.random import sample_random_requests
+from sglang.benchmark.utils import get_tokenizer, remove_prefix
 
 global args
 
|
||||
@@ -13,8 +13,7 @@ number = 5
 
 
 def expand_tip(topic, tip, generate):
-    s = (
-        """Please expand a tip for a topic into a detailed paragraph.
+    s = """Please expand a tip for a topic into a detailed paragraph.
 
 Topic: staying healthy
 Tip: Regular Exercise
@@ -28,12 +27,7 @@ Topic: writing a blog post
 Tip: structure your content effectively
 Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement.
 
-Topic: """
-        + topic
-        + "\nTip: "
-        + tip
-        + "\nParagraph:"
-    )
+Topic: """ + topic + "\nTip: " + tip + "\nParagraph:"
     return generate(s, max_tokens=128, stop=["\n\n"])
 
 
|
||||
@@ -14,8 +14,7 @@ number = 5
 
 @sgl.function
 def expand_tip(s, topic, tip):
-    s += (
-        """Please expand a tip for a topic into a detailed paragraph.
+    s += """Please expand a tip for a topic into a detailed paragraph.
 
 Topic: staying healthy
 Tip: Regular Exercise
@@ -29,12 +28,7 @@ Topic: writing a blog post
 Tip: structure your content effectively
 Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement.
 
-Topic: """
-        + topic
-        + "\nTip: "
-        + tip
-        + "\nParagraph:"
-    )
+Topic: """ + topic + "\nTip: " + tip + "\nParagraph:"
     s += sgl.gen("paragraph", max_tokens=128, stop=["\n\n"], temperature=0)
 
 
|
||||
@@ -2,8 +2,7 @@ number = 5
 
 
 async def expand_tip_async(topic, tip, generate):
-    s = (
-        """Please expand a tip for a topic into a detailed paragraph.
+    s = """Please expand a tip for a topic into a detailed paragraph.
 
 Topic: staying healthy
 Tip: Regular Exercise
@@ -17,12 +16,7 @@ Topic: writing a blog post
 Tip: structure your content effectively
 Paragraph: A well-structured post is easier to read and more enjoyable. Start with an engaging introduction that hooks the reader and clearly states the purpose of your post. Use headings and subheadings to break up the text and guide readers through your content. Bullet points and numbered lists can make information more digestible. Ensure each paragraph flows logically into the next, and conclude with a summary or call-to-action that encourages reader engagement.
 
-Topic: """
-        + topic
-        + "\nTip: "
-        + tip
-        + "\nParagraph:"
-    )
+Topic: """ + topic + "\nTip: " + tip + "\nParagraph:"
     return await generate(s, max_tokens=128, stop="\n\n")
 
 
|
||||
@@ -19,7 +19,7 @@ ARG PIP_DEFAULT_INDEX
 ARG UBUNTU_MIRROR
 ARG GITHUB_ARTIFACTORY=github.com
 ARG INSTALL_FLASHINFER_JIT_CACHE=0
-ARG FLASHINFER_VERSION=0.6.2
+ARG FLASHINFER_VERSION=0.6.3
 ARG MOONCAKE_VERSION=0.3.9
 #if need other arg please add in MOONCAKE_COMPILE_ARG
 ARG MOONCAKE_COMPILE_ARG="-DUSE_HTTP=ON -DUSE_MNNVL=ON -DUSE_CUDA=ON -DWITH_EP=ON"
|
||||
@@ -93,9 +93,9 @@ RUN git clone https://github.com/sgl-project/sglang --branch $SGLANG_TAG && \
 RUN ${PIP_INSTALL} wheel==0.45.1 pybind11 pyyaml decorator scipy attrs psutil \
     && mkdir sgl-kernel-npu \
     && cd sgl-kernel-npu \
-    && wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/${SGLANG_KERNEL_NPU_TAG}/sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
-    && unzip sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
-    && ${PIP_INSTALL} output/deep_ep*.whl output/sgl_kernel_npu*.whl \
+    && wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/${SGLANG_KERNEL_NPU_TAG}/sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-torch2.8.0-py311-cann${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
+    && unzip sgl-kernel-npu-${SGLANG_KERNEL_NPU_TAG}-torch2.8.0-py311-cann${CANN_VERSION}-${DEVICE_TYPE}-$(arch).zip \
+    && ${PIP_INSTALL} deep_ep*.whl sgl_kernel_npu*.whl \
     && cd .. && rm -rf sgl-kernel-npu \
     && cd "$(python3 -m pip show deep-ep | awk '/^Location:/ {print $2}')" && ln -sf deep_ep/deep_ep_cpp*.so
 
|
||||
@@ -21,7 +21,7 @@ ENV BUILD_TRITON="0"
 ENV BUILD_LLVM="0"
 ENV BUILD_AITER_ALL="1"
 ENV BUILD_MOONCAKE="1"
-ENV AITER_COMMIT="v0.1.10.post2"
+ENV AITER_COMMIT="v0.1.10.post3"
 
 # ===============================
 # Base image 950 and args
@@ -31,7 +31,7 @@ ENV BUILD_TRITON="0"
 ENV BUILD_LLVM="0"
 ENV BUILD_AITER_ALL="1"
 ENV BUILD_MOONCAKE="1"
-ENV AITER_COMMIT="v0.1.10.post2"
+ENV AITER_COMMIT="v0.1.10.post3"
 # ===============================
 # Chosen arch and args
 FROM ${GPU_ARCH}
@@ -70,7 +70,7 @@ ARG ENABLE_MORI=0
 ARG NIC_BACKEND=none
 
 ARG MORI_REPO="https://github.com/ROCm/mori.git"
-ARG MORI_COMMIT="b0dce4beebeb1f26c784eee17d5fd9785ee9447f"
+ARG MORI_COMMIT="20920706a9004018dbd87c7387f207d08d0e05af"
 
 # AMD AINIC apt repo settings
 ARG AINIC_VERSION=1.117.5
@@ -214,10 +214,10 @@ RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \
 ENV CARGO_BUILD_JOBS=4
 
 # Build and install sgl-model-gateway
-RUN python3 -m pip install --no-cache-dir setuptools-rust \
+RUN python3 -m pip install --no-cache-dir maturin \
     && cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \
-    && /bin/bash -lc 'ulimit -n 8192 && cargo build --release' \
-    && python3 -m pip install --no-cache-dir . \
+    && ulimit -n 65536 && maturin build --release --features vendored-openssl --out dist \
+    && python3 -m pip install --force-reinstall dist/*.whl \
     && rm -rf /root/.cache
 
 # -----------------------
@@ -280,7 +280,7 @@ RUN /bin/bash -lc 'set -euo pipefail; \
   \
   # TVM Python bits need Cython + z3 before configure.
   # Pin z3-solver==4.15.4.0: 4.15.4.0 has a manylinux wheel; 4.15.5.0 has no wheel and builds from source (fails: C++20 <format> needs GCC 14+, image has GCC 11).
-  "$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi>=0.1.6" "z3-solver==4.15.4.0"; \
+  "$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi @ git+https://github.com/apache/tvm-ffi.git@37d0485b2058885bf4e7a486f7d7b2174a8ac1ce" "z3-solver==4.15.4.0"; \
   \
   # Clone + pin TileLang (bundled TVM), then build
   git clone --recursive "${TILELANG_REPO}" /opt/tilelang && \
@@ -390,10 +390,7 @@ ENV SGLANG_USE_AITER=1
 ENV SGLANG_USE_ROCM700A=1
 
 ENV NCCL_MIN_NCHANNELS=112
-ENV VLLM_FP8_PADDING=1
-ENV VLLM_FP8_ACT_PADDING=1
-ENV VLLM_FP8_WEIGHT_PADDING=1
-ENV VLLM_FP8_REDUCE_CONV=1
+ENV ROCM_QUICK_REDUCE_QUANTIZATION=INT8
 ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
 ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
 
|
||||
503
docker/rocm720.Dockerfile
Normal file
@@ -0,0 +1,503 @@
|
||||
# Usage (to build SGLang ROCm docker image):
|
||||
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx942 -t v0.5.8.post1-rocm700-mi30x -f rocm.Dockerfile .
|
||||
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx942-rocm720 -t v0.5.8.post1-rocm720-mi30x-preview -f rocm720.Dockerfile .
|
||||
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx950 -t v0.5.8.post1-rocm700-mi35x -f rocm.Dockerfile .
|
||||
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx950-rocm720 -t v0.5.8.post1-rocm720-mi35x-preview -f rocm720.Dockerfile .
|
||||
|
||||
# Usage (to build SGLang ROCm + Mori docker image):
|
||||
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx942 --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t v0.5.8.post1-rocm700-mi30x -f rocm.Dockerfile .
|
||||
# docker build --build-arg SGL_BRANCH=v0.5.8.post1 --build-arg GPU_ARCH=gfx950 --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic -t v0.5.8.post1-rocm700-mi35x -f rocm.Dockerfile .
|
||||
|
||||
# Default base images
|
||||
ARG BASE_IMAGE_942="rocm/sgl-dev:rocm7-vllm-20250904"
|
||||
ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
|
||||
ARG BASE_IMAGE_950="rocm/sgl-dev:rocm7-vllm-20250904"
|
||||
ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
|
||||
|
||||
# This is necessary for scope purpose
|
||||
ARG GPU_ARCH=gfx950
|
||||
|
||||
# ===============================
|
||||
# Base image 942 with rocm700 and args
|
||||
FROM $BASE_IMAGE_942 AS gfx942
|
||||
ENV BUILD_VLLM="0"
|
||||
ENV BUILD_TRITON="0"
|
||||
ENV BUILD_LLVM="0"
|
||||
ENV BUILD_AITER_ALL="1"
|
||||
ENV BUILD_MOONCAKE="1"
|
||||
ENV AITER_COMMIT="v0.1.10.post3"
|
||||
|
||||
# ===============================
|
||||
# Base image 942 with rocm720 and args
|
||||
FROM $BASE_IMAGE_942_ROCM720 AS gfx942-rocm720
|
||||
ENV BUILD_VLLM="0"
|
||||
ENV BUILD_TRITON="1"
|
||||
ENV BUILD_LLVM="0"
|
||||
ENV BUILD_AITER_ALL="1"
|
||||
ENV BUILD_MOONCAKE="1"
|
||||
ENV AITER_COMMIT="v0.1.10.post3"
|
||||
|
||||
# ===============================
|
||||
# Base image 950 and args
|
||||
FROM $BASE_IMAGE_950 AS gfx950
|
||||
ENV BUILD_VLLM="0"
|
||||
ENV BUILD_TRITON="0"
|
||||
ENV BUILD_LLVM="0"
|
||||
ENV BUILD_AITER_ALL="1"
|
||||
ENV BUILD_MOONCAKE="1"
|
||||
ENV AITER_COMMIT="v0.1.10.post3"
|
||||
|
||||
# ===============================
|
||||
# Base image 950 with rocm720 and args
|
||||
FROM $BASE_IMAGE_950_ROCM720 AS gfx950-rocm720
|
||||
ENV BUILD_VLLM="0"
|
||||
ENV BUILD_TRITON="1"
|
||||
ENV BUILD_LLVM="0"
|
||||
ENV BUILD_AITER_ALL="1"
|
||||
ENV BUILD_MOONCAKE="1"
|
||||
ENV AITER_COMMIT="v0.1.10.post3"
|
||||
|
||||
# ===============================
|
||||
# Chosen arch and args
|
||||
FROM ${GPU_ARCH}
|
||||
|
||||
# Redeclared inside the final stage: an ARG declared before FROM is not visible after it
|
||||
ARG GPU_ARCH=gfx950
|
||||
ENV GPU_ARCH_LIST=${GPU_ARCH%-*}
|
||||
|
||||
ARG SGL_REPO="https://github.com/sgl-project/sglang.git"
|
||||
ARG SGL_BRANCH="main"
|
||||
|
||||
# Version override for setuptools_scm (used in nightly builds)
|
||||
ARG SETUPTOOLS_SCM_PRETEND_VERSION=""
|
||||
|
||||
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
|
||||
ARG TRITON_COMMIT="42270451990532c67e69d753fbd026f28fcc4840"
|
||||
|
||||
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
|
||||
|
||||
ARG LLVM_REPO="https://github.com/jrbyrnes/llvm-project.git"
|
||||
ARG LLVM_BRANCH="MainOpSelV2"
|
||||
ARG LLVM_COMMIT="6520ace8227ffe2728148d5f3b9872a870b0a560"
|
||||
|
||||
ARG MOONCAKE_REPO="https://github.com/kvcache-ai/Mooncake.git"
|
||||
ARG MOONCAKE_COMMIT="b6a841dc78c707ec655a563453277d969fb8f38d"
|
||||
|
||||
ARG TILELANG_REPO="https://github.com/tile-ai/tilelang.git"
|
||||
ARG TILELANG_COMMIT="ebf4a7cb8881432165ae8760e99d209d905c704a"
|
||||
|
||||
ARG FHT_REPO="https://github.com/jeffdaily/fast-hadamard-transform.git"
|
||||
ARG FHT_BRANCH="rocm"
|
||||
ARG FHT_COMMIT="46efb7d776d38638fc39f3c803eaee3dd7016bd1"
|
||||
|
||||
ARG ENABLE_MORI=0
|
||||
ARG NIC_BACKEND=none
|
||||
|
||||
ARG MORI_REPO="https://github.com/ROCm/mori.git"
|
||||
ARG MORI_COMMIT="20920706a9004018dbd87c7387f207d08d0e05af"
|
||||
|
||||
# AMD AINIC apt repo settings
|
||||
ARG AINIC_VERSION=1.117.5
|
||||
ARG UBUNTU_CODENAME=jammy
|
||||
USER root
|
||||
|
||||
# Install some basic utilities
|
||||
RUN python -m pip install --upgrade pip && pip install setuptools_scm
|
||||
RUN apt-get purge -y sccache; python -m pip uninstall -y sccache; rm -f "$(which sccache)"
|
||||
|
||||
# Install AMD SMI Python package from ROCm distribution
|
||||
RUN cd /opt/rocm/share/amd_smi && python3 -m pip install --no-cache-dir .
|
||||
|
||||
WORKDIR /sgl-workspace
|
||||
|
||||
# -----------------------
|
||||
# llvm
|
||||
RUN if [ "$BUILD_LLVM" = "1" ]; then \
|
||||
export HIP_CLANG_PATH="/sgl-workspace/llvm-project/build/bin/" && \
|
||||
git clone --single-branch ${LLVM_REPO} -b ${LLVM_BRANCH} \
|
||||
&& cd llvm-project \
|
||||
&& git checkout ${LLVM_COMMIT} \
|
||||
&& mkdir build \
|
||||
&& cd build \
|
||||
&& cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_ENABLE_PROJECTS="clang;lld;" -DLLVM_ENABLE_RUNTIMES="compiler-rt" ../llvm \
|
||||
&& make -j$(nproc); \
|
||||
fi
|
||||
|
||||
# -----------------------
|
||||
# AITER
|
||||
# Unset setuptools_scm override so AITER gets its own version (AITER_COMMIT), not SGLang's
|
||||
# (SETUPTOOLS_SCM_PRETEND_VERSION is set later for SGLang nightly builds and would otherwise
|
||||
# leak into AITER's version when AITER uses setuptools_scm)
|
||||
ENV SETUPTOOLS_SCM_PRETEND_VERSION=
|
||||
RUN pip uninstall -y aiter \
|
||||
&& pip install psutil pybind11 # Required by AITER setup.py
|
||||
RUN git clone ${AITER_REPO} \
|
||||
&& cd aiter \
|
||||
&& git checkout ${AITER_COMMIT} \
|
||||
&& git submodule update --init --recursive
|
||||
|
||||
# Hot patches for AITER in v0.1.10.post1
|
||||
# This is for ROCm 7.2 only, because of the image rebase from vllm
|
||||
# to rocm/pytorch.
|
||||
RUN set -eux; \
|
||||
case "${GPU_ARCH}" in \
|
||||
*rocm720*) \
|
||||
echo "ROCm 7.2 flavor detected from GPU_ARCH=${GPU_ARCH}"; \
|
||||
cd aiter \
|
||||
&& sed -i '459 s/if.*:/if False:/' aiter/ops/triton/attention/pa_mqa_logits.py; \
|
||||
;; \
|
||||
*) \
|
||||
echo "Not rocm720 (GPU_ARCH=${GPU_ARCH}), skip patch"; \
|
||||
;; \
|
||||
esac
|
||||
|
||||
RUN cd aiter \
|
||||
&& echo "[AITER] GPU_ARCH=${GPU_ARCH}" \
|
||||
&& if [ "$BUILD_AITER_ALL" = "1" ] && [ "$BUILD_LLVM" = "1" ]; then \
|
||||
sh -c "HIP_CLANG_PATH=/sgl-workspace/llvm-project/build/bin/ PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \
|
||||
elif [ "$BUILD_AITER_ALL" = "1" ]; then \
|
||||
sh -c "PREBUILD_KERNELS=1 GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \
|
||||
else \
|
||||
sh -c "GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \
|
||||
fi \
|
||||
&& echo "export PYTHONPATH=/sgl-workspace/aiter:\${PYTHONPATH}" >> /etc/bash.bashrc
|
||||
|
||||
# -----------------------
|
||||
# Build vLLM
|
||||
ARG VLLM_REPO="https://github.com/ROCm/vllm.git"
|
||||
ARG VLLM_BRANCH="9f6b92db47c3444b7a7d67451ba0c3a2d6af4c2c"
|
||||
RUN if [ "$BUILD_VLLM" = "1" ]; then \
|
||||
git clone ${VLLM_REPO} \
|
||||
&& cd vllm \
|
||||
&& git checkout ${VLLM_BRANCH} \
|
||||
&& python -m pip install -r requirements/rocm.txt \
|
||||
&& python setup.py clean --all \
|
||||
&& python setup.py develop; \
|
||||
fi
|
||||
|
||||
# -----------------------
|
||||
# Build Mooncake
|
||||
ENV PATH=$PATH:/usr/local/go/bin
|
||||
|
||||
RUN if [ "$BUILD_MOONCAKE" = "1" ]; then \
|
||||
apt update && apt install -y zip unzip wget && \
|
||||
apt install -y gcc make libtool autoconf librdmacm-dev rdmacm-utils infiniband-diags ibverbs-utils perftest ethtool libibverbs-dev rdma-core && \
|
||||
apt install -y openssh-server openmpi-bin openmpi-common libopenmpi-dev && \
|
||||
git clone ${MOONCAKE_REPO} && \
|
||||
cd Mooncake && \
|
||||
git checkout ${MOONCAKE_COMMIT} && \
|
||||
git submodule update --init --recursive && \
|
||||
bash dependencies.sh -y && \
|
||||
rm -rf /usr/local/go && \
|
||||
wget https://go.dev/dl/go1.22.2.linux-amd64.tar.gz && \
|
||||
tar -C /usr/local -xzf go1.22.2.linux-amd64.tar.gz && \
|
||||
rm go1.22.2.linux-amd64.tar.gz && \
|
||||
mkdir -p build && \
|
||||
cd build && \
|
||||
cmake .. -DUSE_HIP=ON -DUSE_ETCD=ON && \
|
||||
make -j "$(nproc)" && make install; \
|
||||
fi
|
||||
|
||||
# -----------------------
|
||||
# Build SGLang
|
||||
ARG BUILD_TYPE=all
|
||||
|
||||
# Set version for setuptools_scm if provided (for nightly builds). Only pass in the SGLang
|
||||
# pip install RUN so it does not affect AITER, sgl-model-gateway, TileLang, FHT, MORI, etc.
|
||||
ARG SETUPTOOLS_SCM_PRETEND_VERSION
|
||||
|
||||
RUN pip install IPython \
|
||||
&& pip install orjson \
|
||||
&& pip install python-multipart \
|
||||
&& pip install torchao==0.9.0 \
|
||||
&& pip install pybind11
|
||||
|
||||
RUN pip uninstall -y sgl_kernel sglang
|
||||
RUN git clone ${SGL_REPO} \
|
||||
&& cd sglang \
|
||||
&& echo "Using ${SGL_BRANCH} branch." \
|
||||
&& git checkout ${SGL_BRANCH} \
|
||||
&& cd sgl-kernel \
|
||||
&& rm -f pyproject.toml \
|
||||
&& mv pyproject_rocm.toml pyproject.toml \
|
||||
&& AMDGPU_TARGET=$GPU_ARCH_LIST python setup_rocm.py install \
|
||||
&& cd .. \
|
||||
&& rm -rf python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml \
|
||||
&& if [ "$BUILD_TYPE" = "srt" ]; then \
|
||||
export SETUPTOOLS_SCM_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION}" && python -m pip --no-cache-dir install -e "python[srt_hip,diffusion_hip]"; \
|
||||
else \
|
||||
export SETUPTOOLS_SCM_PRETEND_VERSION="${SETUPTOOLS_SCM_PRETEND_VERSION}" && python -m pip --no-cache-dir install -e "python[all_hip]"; \
|
||||
fi
|
||||
|
||||
RUN python -m pip cache purge
|
||||
|
||||
# Copy config files to support MI300X in virtualized environments (MI300X_VF). Symlinks will not be created in image build.
|
||||
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
|
||||
/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
|
||||
-type f -name '*MI300X*' | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
|
||||
|
||||
# Install Rust toolchain for sgl-model-gateway
|
||||
ENV PATH="/root/.cargo/bin:${PATH}"
|
||||
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y \
|
||||
&& rustc --version && cargo --version
|
||||
ENV CARGO_BUILD_JOBS=4
|
||||
|
||||
# Build and install sgl-model-gateway
|
||||
RUN python3 -m pip install --no-cache-dir maturin \
|
||||
&& cd /sgl-workspace/sglang/sgl-model-gateway/bindings/python \
|
||||
&& ulimit -n 65536 && maturin build --release --features vendored-openssl --out dist \
|
||||
&& python3 -m pip install --force-reinstall dist/*.whl \
|
||||
&& rm -rf /root/.cache
|
||||
|
||||
# -----------------------
|
||||
# TileLang
|
||||
ENV DEBIAN_FRONTEND=noninteractive
|
||||
ENV LIBGL_ALWAYS_INDIRECT=1
|
||||
RUN echo "LC_ALL=en_US.UTF-8" >> /etc/environment
|
||||
|
||||
RUN /bin/bash -lc 'set -euo pipefail; \
|
||||
echo "[TileLang] Building TileLang for ${GPU_ARCH}"; \
|
||||
# System dependencies (NO llvm-dev to avoid llvm-config-16 shadowing)
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
build-essential git wget curl ca-certificates gnupg \
|
||||
libgtest-dev libgmock-dev \
|
||||
libprotobuf-dev protobuf-compiler libgflags-dev libsqlite3-dev \
|
||||
python3 python3-dev python3-setuptools python3-pip python3-apt \
|
||||
gcc libtinfo-dev zlib1g-dev libedit-dev libxml2-dev vim \
|
||||
cmake ninja-build pkg-config libstdc++6 software-properties-common \
|
||||
&& rm -rf /var/lib/apt/lists/*; \
|
||||
\
|
||||
# Prefer the container venv
|
||||
VENV_PY="/opt/venv/bin/python"; \
|
||||
VENV_PIP="/opt/venv/bin/pip"; \
|
||||
if [ ! -x "$VENV_PY" ]; then VENV_PY="python3"; fi; \
|
||||
if [ ! -x "$VENV_PIP" ]; then VENV_PIP="pip3"; fi; \
|
||||
\
|
||||
# Build GoogleTest static libs (Ubuntu package ships sources only)
|
||||
cmake -S /usr/src/googletest -B /tmp/build-gtest -DBUILD_GTEST=ON -DBUILD_GMOCK=ON -DCMAKE_BUILD_TYPE=Release && \
|
||||
cmake --build /tmp/build-gtest -j"$(nproc)" && \
|
||||
cp -v /tmp/build-gtest/lib/*.a /usr/lib/x86_64-linux-gnu/ && \
|
||||
rm -rf /tmp/build-gtest; \
|
||||
\
|
||||
# Keep setuptools < 80 (compat with base image)
|
||||
"$VENV_PIP" install --upgrade "setuptools>=77.0.3,<80" wheel cmake ninja scikit-build-core && \
|
||||
"$VENV_PIP" cache purge || true; \
|
||||
\
|
||||
# Locate ROCm llvm-config; fallback to installing LLVM 18 if missing
|
||||
LLVM_CONFIG_PATH=""; \
|
||||
for p in /opt/rocm/llvm/bin/llvm-config /opt/rocm/llvm-*/bin/llvm-config /opt/rocm-*/llvm*/bin/llvm-config; do \
|
||||
if [ -x "$p" ]; then LLVM_CONFIG_PATH="$p"; break; fi; \
|
||||
done; \
|
||||
if [ -z "$LLVM_CONFIG_PATH" ]; then \
|
||||
echo "[TileLang] ROCm llvm-config not found; installing LLVM 18..."; \
|
||||
curl -fsSL https://apt.llvm.org/llvm-snapshot.gpg.key | gpg --dearmor -o /etc/apt/keyrings/llvm.gpg; \
|
||||
echo "deb [signed-by=/etc/apt/keyrings/llvm.gpg] http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main" > /etc/apt/sources.list.d/llvm.list; \
|
||||
apt-get update; \
|
||||
apt-get install -y --no-install-recommends llvm-18; \
|
||||
rm -rf /var/lib/apt/lists/*; \
|
||||
LLVM_CONFIG_PATH="$(command -v llvm-config-18)"; \
|
||||
if [ -z "$LLVM_CONFIG_PATH" ]; then echo "ERROR: llvm-config-18 not found after install"; exit 1; fi; \
|
||||
fi; \
|
||||
echo "[TileLang] Using LLVM_CONFIG at: $LLVM_CONFIG_PATH"; \
|
||||
export PATH="$(dirname "$LLVM_CONFIG_PATH"):/usr/local/bin:${PATH}"; \
|
||||
export LLVM_CONFIG="$LLVM_CONFIG_PATH"; \
|
||||
\
|
||||
# Optional shim for tools that expect llvm-config-16
|
||||
mkdir -p /usr/local/bin && \
|
||||
printf "#!/usr/bin/env bash\nexec \"%s\" \"\$@\"\n" "$LLVM_CONFIG_PATH" > /usr/local/bin/llvm-config-16 && \
|
||||
chmod +x /usr/local/bin/llvm-config-16; \
|
||||
\
|
||||
# TVM Python bits need Cython + z3 before configure.
|
||||
# Pin z3-solver==4.15.4.0: 4.15.4.0 has a manylinux wheel; 4.15.5.0 has no wheel and builds from source (fails: C++20 <format> needs GCC 14+, image has GCC 11).
|
||||
"$VENV_PIP" install --no-cache-dir "cython>=0.29.36,<3.0" "apache-tvm-ffi @ git+https://github.com/apache/tvm-ffi.git@37d0485b2058885bf4e7a486f7d7b2174a8ac1ce" "z3-solver==4.15.4.0"; \
|
||||
\
|
||||
# Clone + pin TileLang (bundled TVM), then build
|
||||
git clone --recursive "${TILELANG_REPO}" /opt/tilelang && \
|
||||
cd /opt/tilelang && \
|
||||
git fetch --depth=1 origin "${TILELANG_COMMIT}" || true && \
|
||||
git checkout -f "${TILELANG_COMMIT}" && \
|
||||
git submodule update --init --recursive && \
|
||||
export CMAKE_ARGS="-DUSE_CUDA=OFF -DUSE_ROCM=ON -DROCM_PATH=/opt/rocm -DLLVM_CONFIG=${LLVM_CONFIG} -DSKBUILD_SABI_VERSION= ${CMAKE_ARGS:-}" && \
|
||||
"$VENV_PIP" install -e . -v --no-build-isolation --no-deps; \
|
||||
if [ -f pyproject.toml ]; then sed -i "/^[[:space:]]*\"torch/d" pyproject.toml || true; fi; \
|
||||
"$VENV_PIP" cache purge || true; \
|
||||
"$VENV_PY" -c "import tilelang; print(tilelang.__version__)"'
|
||||
|
||||
# -----------------------
|
||||
# Hadamard-transform (HIP build)
|
||||
RUN /bin/bash -lc 'set -euo pipefail; \
|
||||
git clone --branch "${FHT_BRANCH}" "${FHT_REPO}" fast-hadamard-transform; \
|
||||
cd fast-hadamard-transform; \
|
||||
git checkout -f "${FHT_COMMIT}"; \
|
||||
python setup.py install'
|
||||
|
||||
# -----------------------
|
||||
# Python tools
|
||||
RUN python3 -m pip install --no-cache-dir \
|
||||
py-spy \
|
||||
pre-commit \
|
||||
tabulate
|
||||
|
||||
# -----------------------
|
||||
# MORI (optional)
|
||||
ENV PYTORCH_ROCM_ARCH=gfx942;gfx950
|
||||
RUN /bin/bash -lc 'set -euo pipefail; \
|
||||
if [ "${ENABLE_MORI}" != "1" ]; then \
|
||||
echo "[MORI] Skipping (ENABLE_MORI=${ENABLE_MORI})"; \
|
||||
exit 0; \
|
||||
fi; \
|
||||
echo "[MORI] Enabling MORI (NIC_BACKEND=${NIC_BACKEND})"; \
|
||||
\
|
||||
# Base deps for MORI build
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
build-essential \
|
||||
g++ \
|
||||
jq \
|
||||
libopenmpi-dev \
|
||||
libpci-dev \
|
||||
initramfs-tools \
|
||||
&& rm -rf /var/lib/apt/lists/*; \
|
||||
\
|
||||
# NIC backend deps
|
||||
case "${NIC_BACKEND}" in \
|
||||
# default: mlx5
|
||||
none) \
|
||||
export USE_IONIC="OFF"; \
|
||||
export USE_BNXT="OFF"; \
|
||||
;; \
|
||||
# AMD NIC
|
||||
ainic) \
|
||||
export USE_IONIC="ON"; \
|
||||
export USE_BNXT="OFF"; \
|
||||
apt-get update && apt-get install -y --no-install-recommends ca-certificates curl gnupg apt-transport-https && \
|
||||
rm -rf /var/lib/apt/lists/* && mkdir -p /etc/apt/keyrings; \
|
||||
curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor > /etc/apt/keyrings/amdainic.gpg; \
|
||||
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/amdainic.gpg] https://repo.radeon.com/amdainic/pensando/ubuntu/${AINIC_VERSION} ${UBUNTU_CODENAME} main" \
|
||||
> /etc/apt/sources.list.d/amdainic.list; \
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
libionic-dev \
|
||||
ionic-common \
|
||||
; \
|
||||
rm -rf /var/lib/apt/lists/*; \
|
||||
;; \
|
||||
# TODO: Add Broadcom bnxt packages/repos here later.
|
||||
# bnxt) \
|
||||
# export USE_IONIC="OFF"; \
|
||||
# export USE_BNXT="ON"; \
|
||||
# echo "[MORI] NIC_BACKEND=bnxt: USE_BNXT=ON. Add Broadcom bnxt packages/repos here later."; \
|
||||
# ;; \
|
||||
*) \
|
||||
echo "ERROR: unknown NIC_BACKEND=${NIC_BACKEND}. Use one of: none, ainic"; \
|
||||
exit 2; \
|
||||
;; \
|
||||
esac; \
|
||||
\
|
||||
# Build/install MORI
|
||||
export MORI_GPU_ARCHS="${GPU_ARCH_LIST}"; \
|
||||
echo "[MORI] MORI_GPU_ARCHS=${MORI_GPU_ARCHS} USE_IONIC=${USE_IONIC} USE_BNXT=${USE_BNXT}"; \
|
||||
rm -rf /sgl-workspace/mori; \
|
||||
git clone "${MORI_REPO}" /sgl-workspace/mori; \
|
||||
cd /sgl-workspace/mori; \
|
||||
git checkout "${MORI_COMMIT}"; \
|
||||
git submodule update --init --recursive; \
|
||||
python3 setup.py develop; \
|
||||
python3 -c "import os, torch; print(os.path.join(os.path.dirname(torch.__file__), \"lib\"))" > /etc/ld.so.conf.d/torch.conf; \
|
||||
ldconfig; \
|
||||
echo "export PYTHONPATH=/sgl-workspace/mori:\${PYTHONPATH}" >> /etc/bash.bashrc; \
|
||||
echo "[MORI] Done."'
|
||||
|
||||
# -----------------------
|
||||
# Hot patch: torch-ROCm
|
||||
# The artifact hardcoded the supported triton version to be 3.5.1.
|
||||
# Rewrite the restriction directly.
|
||||
ARG TORCH_ROCM_FILE="torch-2.9.1+rocm7.2.0.lw.git7e1940d4-cp310-cp310-linux_x86_64.whl"
|
||||
RUN mkdir /tmp/whl && cd /tmp/whl \
|
||||
&& export TORCH_ROCM_FILE="${TORCH_ROCM_FILE}" \
|
||||
&& python - <<'PY'
|
||||
import zipfile, csv, os, re
from pathlib import Path

fname = os.environ["TORCH_ROCM_FILE"]
in_whl = Path("/") / fname
out_whl = Path("/tmp") / fname
work = Path("/tmp/whl")

# 1) Extract
with zipfile.ZipFile(in_whl, "r") as z:
    z.extractall(work)

# 2) Locate dist-info and patch METADATA (edit this logic to match your exact line)
dist_info = next(work.glob("*.dist-info"))
meta = dist_info / "METADATA"
txt = meta.read_text(encoding="utf-8")

# Example: replace one exact requirement form.
# Adjust the string to match what you actually see.
# Dots are escaped so they match literally; \1 keeps the "Requires-Dist:" prefix.
pat = r'^(Requires-Dist:\s*)triton==3\.5\.1[^\s]*;'
txt2, n = re.subn(pat, r'\1triton>=3.5.1;', txt, flags=re.MULTILINE)
if n == 0:
    raise SystemExit("Did not find expected Requires-Dist line to replace in METADATA")
meta.write_text(txt2, encoding="utf-8")

# 3) Hacky step: blank hash/size columns in RECORD
record = dist_info / "RECORD"
rows = []
with record.open(newline="", encoding="utf-8") as f:
    for r in csv.reader(f):
        if not r:
            continue
        # keep filename, blank out hash and size
        rows.append([r[0], "", ""])
with record.open("w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# 4) Re-zip as a wheel
with zipfile.ZipFile(out_whl, "w", compression=zipfile.ZIP_DEFLATED) as z:
    for p in work.rglob("*"):
        if p.is_file():
            z.write(p, p.relative_to(work).as_posix())

print("Wrote", out_whl)
|
||||
PY
|
||||
|
||||
RUN python3 -m pip install --force --no-deps /tmp/${TORCH_ROCM_FILE} \
|
||||
&& rm -fr /tmp/whl /tmp/${TORCH_ROCM_FILE}
|
||||
|
||||
# -----------------------
|
||||
# Hot patch: Triton
|
||||
# For ROCm 7.2, this custom build breaks pip dependency management,
|
||||
# so future `pip install` will break the ROCm stack.
|
||||
# A workaround for this is to reinstall the default triton
|
||||
# wheel with the `rocm/pytorch` image in the root directory.
|
||||
RUN if [ "$BUILD_TRITON" = "1" ]; then \
|
||||
pip uninstall -y triton \
|
||||
&& apt install -y cmake \
|
||||
&& git clone ${TRITON_REPO} triton-custom \
|
||||
&& cd triton-custom \
|
||||
&& git checkout ${TRITON_COMMIT} \
|
||||
&& pip install -r python/requirements.txt \
|
||||
&& pip install -e .; \
|
||||
fi
|
||||
|
||||
# -----------------------
|
||||
# Performance environment variables.
|
||||
|
||||
# Skip CuDNN compatibility check - not applicable for ROCm (uses MIOpen instead)
|
||||
ENV SGLANG_DISABLE_CUDNN_CHECK=1
|
||||
ENV HIP_FORCE_DEV_KERNARG=1
|
||||
ENV HSA_NO_SCRATCH_RECLAIM=1
|
||||
ENV SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
|
||||
ENV SGLANG_INT4_WEIGHT=0
|
||||
ENV SGLANG_MOE_PADDING=1
|
||||
ENV SGLANG_ROCM_DISABLE_LINEARQUANT=0
|
||||
ENV SGLANG_ROCM_FUSED_DECODE_MLA=1
|
||||
ENV SGLANG_SET_CPU_AFFINITY=1
|
||||
ENV SGLANG_USE_AITER=1
|
||||
ENV SGLANG_USE_ROCM700A=1
|
||||
|
||||
ENV NCCL_MIN_NCHANNELS=112
|
||||
ENV ROCM_QUICK_REDUCE_QUANTIZATION=INT8
|
||||
ENV TORCHINDUCTOR_MAX_AUTOTUNE=1
|
||||
ENV TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
|
||||
|
||||
CMD ["/bin/bash"]
|
||||
@@ -39,6 +39,23 @@ Notes:
|
||||
- `page_first`: Only compatible with `kernel` I/O backend, automatically switches to `layer_first` with `direct` backend
|
||||
- `page_first_direct`: Specifically designed for `direct` I/O backend with optimized memory organization
|
||||
|
||||
### Heterogeneous TP Support (GQA/MHA models)

HiCache storage supports cross-cluster KV reuse when different deployments use different TP sizes (for example, `tp=4` and `tp=8`) and share the same storage backend namespace.

Use `tp_lcm_size` in `--hicache-storage-backend-extra-config`:

```bash
# Example: heterogeneous TP = {4, 8}, so lcm = 8
--hicache-storage-backend-extra-config '{"tp_lcm_size": 8}'
```

Guidelines:

- Set `tp_lcm_size` to the least common multiple (LCM) of all TP sizes that will share the same HiCache storage.
- For MHA models with Mooncake and `page_head` layout, HiCache will split head shards based on `tp_lcm_size` to make keys reusable across heterogeneous TP deployments.
- If all clusters use the same TP size, this option is not needed.
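
As a rough illustration of the first guideline, the value of `tp_lcm_size` can be computed from the TP sizes of the deployments that share a storage namespace. The helper name `hicache_extra_config` is hypothetical and not part of SGLang; only the `--hicache-storage-backend-extra-config` flag and the `tp_lcm_size` key come from the text above. The sketch assumes Python 3.9+ for `math.lcm`.

```python
import json
import math
import shlex


def hicache_extra_config(tp_sizes):
    """Build the CLI fragment for deployments sharing one HiCache storage namespace.

    Hypothetical helper for illustration only; it just formats the flag shown above.
    """
    # LCM of every TP size that will read or write the shared storage.
    lcm = math.lcm(*tp_sizes)
    payload = json.dumps({"tp_lcm_size": lcm})
    # shlex.quote wraps the JSON in single quotes so a shell passes it through intact.
    return "--hicache-storage-backend-extra-config " + shlex.quote(payload)


if __name__ == "__main__":
    print(hicache_extra_config([4, 8]))
```

For TP sizes `{4, 8}` this yields the same flag as the bash example above; for sizes that do not divide each other, such as `{4, 6}`, the LCM (12) is larger than either TP size.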
|
||||
|
||||
### Prefetch Policies
|
||||
|
||||
```bash
|
||||
|
||||
@@ -102,7 +102,7 @@
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -151,18 +151,16 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
|
||||
" --enable-lora \\\n",
|
||||
" --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
|
||||
" lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
|
||||
" --max-loras-per-batch 2 \\\n",
|
||||
" --log-level warning \\\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -227,8 +225,7 @@
|
||||
"\n",
|
||||
"# The `--target-lora-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.\n",
|
||||
"# We are adding it here just to demonstrate usage.\n",
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
" python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
|
||||
" --enable-lora \\\n",
|
||||
" --cuda-graph-max-bs 2 \\\n",
|
||||
@@ -236,11 +233,10 @@
|
||||
" --max-lora-rank 256\n",
|
||||
" --lora-target-modules all\n",
|
||||
" --log-level warning\n",
|
||||
" \"\"\"\n",
|
||||
")\n",
|
||||
" \"\"\")\n",
|
||||
"\n",
|
||||
"url = f\"http://127.0.0.1:{port}\"\n",
|
||||
"wait_for_server(url)"
|
||||
"wait_for_server(url, process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -435,8 +431,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
" python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
|
||||
" --enable-lora \\\n",
|
||||
" --cuda-graph-max-bs 8 \\\n",
|
||||
@@ -448,12 +443,11 @@
|
||||
" {\"lora_name\":\"lora1\",\"lora_path\":\"algoprog/fact-generation-llama-3.1-8b-instruct-lora\"} \\\n",
|
||||
" lora2=philschmid/code-llama-3-1-8b-text-to-sql-lora\n",
|
||||
" --log-level warning\n",
|
||||
" \"\"\"\n",
|
||||
")\n",
|
||||
" \"\"\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"url = f\"http://127.0.0.1:{port}\"\n",
|
||||
"wait_for_server(url)"
|
||||
"wait_for_server(url, process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -548,16 +542,14 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
" python3 -m sglang.launch_server \\\n",
|
||||
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
|
||||
" --enable-lora \\\n",
|
||||
" --lora-backend csgmv \\\n",
|
||||
" --max-loras-per-batch 16 \\\n",
|
||||
" --lora-paths lora1=path/to/lora1 lora2=path/to/lora2\n",
|
||||
" \"\"\"\n",
|
||||
")"
|
||||
" \"\"\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -594,8 +586,7 @@
|
||||
"lora2 = \"philschmid/code-llama-3-1-8b-text-to-sql-lora\"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
" python3 -m sglang.launch_server \\\n",
|
||||
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
|
||||
" --enable-lora \\\n",
|
||||
@@ -606,11 +597,10 @@
|
||||
" --max-lora-rank 256 \\\n",
|
||||
" --max-loras-per-batch 2 \\\n",
|
||||
" --max-loaded-loras 4\n",
|
||||
" \"\"\"\n",
|
||||
")\n",
|
||||
" \"\"\")\n",
|
||||
"\n",
|
||||
"url = f\"http://127.0.0.1:{port}\"\n",
|
||||
"wait_for_server(url)"
|
||||
"wait_for_server(url, process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -142,6 +142,7 @@ The `SGLANG_MOONCAKE_CUSTOM_MEM_POOL` environment variable enables the custom me
|
||||
| **`SGLANG_DISAGGREGATION_THREAD_POOL_SIZE`** | Controls the total number of worker threads for KVCache transfer operations per TP rank | A dynamic value calculated by `int(0.75 * os.cpu_count()) // 8`, kept between 4 and 12 to ensure efficiency and prevent thread race conditions |
| **`SGLANG_DISAGGREGATION_QUEUE_SIZE`** | Sets the number of parallel transfer queues. KVCache transfer requests from multiple decode instances are sharded into these queues so that they can share the threads and the transfer bandwidth at the same time. If it is set to `1`, requests are transferred one by one in first-come, first-served (FCFS) order | `4` |
| **`SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT`** | Timeout (seconds) for receiving destination KV indices during request initialization | `300` |
| **`SGLANG_DISAGGREGATION_BOOTSTRAP_ENTRY_CLEANUP_INTERVAL`** | Interval (seconds) between cleanups of bootstrap entries | `120` |
|
||||
|
||||
If a greater mean TTFT is acceptable, you can `export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600` (10 minutes) to relax the timeout condition.
|
||||
Please be aware that this setting will cause prefill instances to take a longer time to clean up the affected memory resources when a running decode node loses connection.
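
The default for `SGLANG_DISAGGREGATION_THREAD_POOL_SIZE` described in the table can be sketched as follows. This is an illustration of the documented formula and not a copy of SGLang's code: the function name `default_thread_pool_size` is hypothetical, and treating the bounds 4 and 12 as inclusive is an assumption.

```python
import os


def default_thread_pool_size(cpu_count=None):
    """Sketch of the documented default: int(0.75 * cpus) // 8, kept within [4, 12].

    Hypothetical helper; the inclusive bounds are an assumption.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    raw = int(0.75 * cpus) // 8
    return min(max(4, raw), 12)


if __name__ == "__main__":
    for cpus in (8, 64, 256):
        print(cpus, "CPUs ->", default_thread_pool_size(cpus), "threads")
```

Under these assumptions, small hosts are raised to the lower bound and very large hosts are capped at the upper bound; only machines in between get a value that scales with the CPU count.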
|
||||
|
||||
@@ -70,7 +70,7 @@
|
||||
" \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -153,6 +153,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| `--device` | The device to use ('cuda', 'xpu', 'hpu', 'npu', 'cpu'). Defaults to auto-detection if not specified. | `None` | Type: str |
|
||||
| `--tensor-parallel-size`<br>`--tp-size` | The tensor parallelism size. | `1` | Type: int |
|
||||
| `--pipeline-parallel-size`<br>`--pp-size` | The pipeline parallelism size. | `1` | Type: int |
|
||||
| `--attention-context-parallel-size`<br>`--attn-cp-size`| The attention context parallelism size. | `1` | Type: int|
|
||||
| `--moe-data-parallel-size`<br>`--moe-dp-size`| The moe data parallelism size. | `1` | Type: int|
|
||||
| `--pp-max-micro-batch-size` | The maximum micro batch size in pipeline parallelism. | `None` | Type: int |
|
||||
| `--pp-async-batch-depth` | The async batch depth of pipeline parallelism. | `0` | Type: int |
|
||||
| `--stream-interval` | The interval (or buffer size) for streaming in terms of the token length. A smaller value makes streaming smoother, while a larger value makes the throughput higher | `1` | Type: int |
|
||||
@@ -264,9 +266,9 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| `--sampling-backend` | Choose the kernels for sampling layers. | `None` | `flashinfer`, `pytorch`, `ascend` |
|
||||
| `--grammar-backend` | Choose the backend for grammar-guided decoding. | `None` | `xgrammar`, `outlines`, `llguidance`, `none` |
|
||||
| `--mm-attention-backend` | Set multimodal attention backend. | `None` | `sdpa`, `fa3`, `fa4`, `triton_attn`, `ascend_attn`, `aiter_attn` |
|
||||
| `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter` |
|
||||
| `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter` |
|
||||
| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (optimal for Blackwell and low-latency), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `cutlass`, `triton`, `aiter` |
|
||||
| `--nsa-prefill-backend` | Choose the NSA backend for the prefill stage (overrides `--attention-backend` when running DeepSeek NSA-style attention). | `flashmla_sparse` | `flashmla_sparse`, `flashmla_kv`, `flashmla_auto`, `fa3`, `tilelang`, `aiter`, `trtllm` |
|
||||
| `--nsa-decode-backend` | Choose the NSA backend for the decode stage when running DeepSeek NSA-style attention. Overrides `--attention-backend` for decoding. | `fa3` | `flashmla_sparse`, `flashmla_kv`, `fa3`, `tilelang`, `aiter`, `trtllm` |
|
||||
| `--fp8-gemm-backend` | Choose the runner backend for Blockwise FP8 GEMM operations. Options: 'auto' (default, auto-selects based on hardware), 'deep_gemm' (JIT-compiled; enabled by default on NVIDIA Hopper (SM90) and Blackwell (SM100) when DeepGEMM is installed), 'flashinfer_trtllm' (optimal for Blackwell and low-latency), 'flashinfer_deepgemm' (Hopper SM90 only; uses swapAB optimization for small M dimensions in decoding), 'cutlass' (optimal for Hopper/Blackwell GPUs and high-throughput), 'triton' (fallback, widely compatible), 'aiter' (ROCm only). **NOTE**: This replaces the deprecated environment variables SGLANG_ENABLE_FLASHINFER_FP8_GEMM and SGLANG_SUPPORT_CUTLASS_BLOCK_FP8. | `auto` | `auto`, `deep_gemm`, `flashinfer_trtllm`, `flashinfer_deepgemm`, `cutlass`, `triton`, `aiter` |
|
||||
| `--fp4-gemm-backend` | Choose the runner backend for NVFP4 GEMM operations. Options: 'flashinfer_cutlass' (default), 'auto' (auto-selects between flashinfer_cudnn/flashinfer_cutlass based on CUDA/cuDNN version), 'flashinfer_cudnn' (FlashInfer cuDNN backend, optimal on CUDA 13+ with cuDNN 9.15+), 'flashinfer_trtllm' (FlashInfer TensorRT-LLM backend, requires different weight preparation with shuffling). All backends are from FlashInfer; when FlashInfer is unavailable, sgl-kernel CUTLASS is used as an automatic fallback. **NOTE**: This replaces the deprecated environment variable SGLANG_FLASHINFER_FP4_GEMM_BACKEND. | `flashinfer_cutlass` | `auto`, `flashinfer_cudnn`, `flashinfer_cutlass`, `flashinfer_trtllm` |
|
||||
| `--disable-flashinfer-autotune` | Flashinfer autotune is enabled by default. Set this flag to disable the autotune. | `False` | bool flag (set to enable) |
|
||||
|
||||
@@ -309,10 +311,11 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| Argument | Description | Defaults | Options |
|
||||
| --- | --- | --- | --- |
|
||||
| `--expert-parallel-size`<br>`--ep-size`<br>`--ep` | The expert parallelism size. | `1` | Type: int |
|
||||
| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | `none`, `deepep`, `mooncake`, `ascend_fuseep`|
|
||||
| `--moe-a2a-backend` | Select the backend for all-to-all communication for expert parallelism. | `none` | `none`, `deepep`, `mooncake`, `mori`, `ascend_fuseep`|
|
||||
| `--moe-runner-backend` | Choose the runner backend for MoE. | `auto` | `auto`, `deep_gemm`, `triton`, `triton_kernel`, `flashinfer_trtllm`, `flashinfer_cutlass`, `flashinfer_mxfp4`, `flashinfer_cutedsl`, `cutlass` |
|
||||
| `--flashinfer-mxfp4-moe-precision` | Choose the computation precision of flashinfer mxfp4 moe | `default` | `default`, `bf16` |
|
||||
| `--enable-flashinfer-allreduce-fusion` | Enable FlashInfer allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) |
|
||||
| `--enable-aiter-allreduce-fusion` | Enable aiter allreduce fusion with Residual RMSNorm. | `False` | bool flag (set to enable) |
|
||||
| `--deepep-mode` | Select the mode when DeepEP MoE is enabled: `normal`, `low_latency`, or `auto`. The default, `auto`, uses `low_latency` for decode batches and `normal` for prefill batches. | `auto` | `normal`, `low_latency`, `auto` |
|
||||
| `--ep-num-redundant-experts` | Allocate this number of redundant experts in expert parallel. | `0` | Type: int |
|
||||
| `--ep-dispatch-algorithm` | The algorithm to choose ranks for redundant experts in expert parallel. | `None` | Type: str |
|
||||
@@ -334,7 +337,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| Argument | Description | Defaults | Options |
|
||||
| --- | --- | --- | --- |
|
||||
| `--max-mamba-cache-size` | The maximum size of the mamba cache. | `None` | Type: int |
|
||||
| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16` |
|
||||
| `--mamba-ssm-dtype` | The data type of the SSM states in mamba cache. | `float32` | `float32`, `bfloat16`, `float16` |
|
||||
| `--mamba-full-memory-ratio` | The ratio of mamba state memory to full kv cache memory. | `0.9` | Type: float |
|
||||
| `--mamba-scheduler-strategy` | The strategy to use for mamba scheduler. `auto` currently defaults to `no_buffer`. 1. `no_buffer` does not support overlap scheduler due to not allocating extra mamba state buffers. Branching point caching support is feasible but not implemented. 2. `extra_buffer` supports overlap schedule by allocating extra mamba state buffers to track mamba state for caching (mamba state usage per running req becomes `2x` for non-spec; `1+(1/(2+speculative_num_draft_tokens))x` for spec dec (e.g. 1.16x if speculative_num_draft_tokens==4)). 2a. `extra_buffer` is strictly better for non-KV-cache-bound cases; for KV-cache-bound cases, the tradeoff depends on whether enabling overlap outweighs reduced max running requests. 2b. mamba caching at radix cache branching point is strictly better than non-branch but requires kernel support (currently only FLA backend), currently only extra_buffer supports branching. | `auto` | `auto`, `no_buffer`, `extra_buffer` |
|
||||
| `--mamba-track-interval` | The interval (in tokens) to track the mamba state during decode. Only used when `--mamba-scheduler-strategy` is `extra_buffer`. Must be divisible by page_size if set, and must be >= speculative_num_draft_tokens when using speculative decoding. | `256` | Type: int |
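As a rough illustration of the `extra_buffer` memory figures quoted for `--mamba-scheduler-strategy`, the sketch below evaluates the per-request mamba state multiplier. The function name `extra_buffer_state_multiplier` is hypothetical and not part of SGLang; it only restates the formula from the table row.

```python
from typing import Optional


def extra_buffer_state_multiplier(
    speculative_num_draft_tokens: Optional[int] = None,
) -> float:
    """Mamba state usage per running request under `extra_buffer`,
    relative to `no_buffer`, following the formula in the table."""
    if speculative_num_draft_tokens is None:
        # Non-speculative decoding: usage becomes 2x.
        return 2.0
    # Speculative decoding: 1 + 1 / (2 + speculative_num_draft_tokens).
    return 1.0 + 1.0 / (2 + speculative_num_draft_tokens)


# With 4 draft tokens the multiplier is 1 + 1/6, which the table quotes as 1.16x.
print(extra_buffer_state_multiplier(4))
```

The multiplier shrinks as the number of draft tokens grows, so the memory overhead of `extra_buffer` is smaller for speculative decoding than for the non-speculative case.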
|
||||
@@ -373,6 +376,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| `--kt-max-deferred-experts-per-token` | [ktransformers parameter] Maximum number of experts deferred to CPU per token. All MoE layers except the final one use this value; the final layer always uses 0. | `None` | Type: int |
|
||||
|
||||
## Diffusion LLM
|
||||
|
||||
| Argument | Description | Defaults | Options |
|
||||
| --- | --- | --- | --- |
|
||||
| `--dllm-algorithm` | The diffusion LLM algorithm, such as LowConfidence. | `None` | Type: str |
|
||||
@@ -492,7 +496,6 @@ Please consult the documentation below and [server_args.py](https://github.com/s
|
||||
| `--disaggregation-prefill-pp` | Prefill pp size. Defaults to 1 if not set. This is only set on the decode server. | `1` | Type: int |
|
||||
| `--disaggregation-ib-device` | The InfiniBand devices for disaggregation transfer, accepts single device (e.g., --disaggregation-ib-device mlx5_0) or multiple comma-separated devices (e.g., --disaggregation-ib-device mlx5_0,mlx5_1). Default is None, which triggers automatic device detection when mooncake backend is enabled. | `None` | Type: str |
|
||||
| `--disaggregation-decode-enable-offload-kvcache` | Enable async KV cache offloading on decode server (PD mode). | `False` | bool flag (set to enable) |
|
||||
| `--disaggregation-decode-enable-fake-auto` | Auto enable FAKE mode for decode node testing, no need to pass bootstrap_host and bootstrap_room in request. | `False` | bool flag (set to enable) |
|
||||
| `--num-reserved-decode-tokens` | Number of decode tokens that will have memory reserved when adding new request to the running batch. | `512` | Type: int |
|
||||
| `--disaggregation-decode-polling-interval` | The interval to poll requests in decode server. Can be set to >1 to reduce the overhead of this. | `1` | Type: int |
|
||||
|
||||
|
||||
@@ -106,6 +106,29 @@ This path trades some I/O overhead for simplicity and flexibility. It integrates
|
||||
|
||||
**Python Engine API:** `engine.update_weights_from_disk(model_path, load_format=None)`
|
||||
|
||||
**Diffusion engine (SGLang-Diffusion):** The diffusion engine exposes the same `POST /update_weights_from_disk` endpoint with the following behavior:
|
||||
|
||||
- **All-or-nothing with rollback:** if any module fails to load, all previously updated modules are rolled back to the original weights by reloading from the original model path. No partial updates are left behind. If rollback itself fails, the exception propagates so the caller knows the model is in an inconsistent state.
|
||||
- **Offload-aware:** when layerwise offload (`--dit-layerwise-offload`) is enabled, the diffusion offload manager replaces GPU parameters with small `torch.empty((1,))` placeholders while real weights live in consolidated pinned CPU buffers. A naive `param.data.copy_()` would fail with a shape mismatch. Instead, the updater dynamically detects active offload managers and writes new weights directly into their CPU buffers, bypassing the placeholders entirely. For any layer that happens to be prefetched on GPU at update time, the live GPU tensor is also updated so the change takes effect immediately. This requires no extra GPU memory and does not disturb the offload state.
|
||||
- **DTensor-aware:** parameters distributed via `torch.distributed.tensor` (tensor parallelism) are updated through `distribute_tensor` so that each shard is correctly placed on the right device mesh.
|
||||
|
||||
**Request body:**
|
||||
|
||||
| Field | Description | Defaults | Options |
|
||||
| --- | --- | --- | --- |
|
||||
| `model_path` | The model path with the new weights. | Required | Type: str |
|
||||
| `flush_cache` | Flush TeaCache state after update. | `True` | Type: bool |
|
||||
| `target_modules` | List of module names to update (e.g. `["transformer"]`). If omitted, all `nn.Module` components are updated. | `None` | Type: list[str] |
|
||||
|
||||
**Response body:**
|
||||
|
||||
| Field | Description | Defaults | Options |
|
||||
| --- | --- | --- | --- |
|
||||
| `success` | Whether the update succeeded. | - | Type: bool |
|
||||
| `message` | Status / error message. | - | Type: str |
|
||||
|
||||
> **Note:** The diffusion engine (SGLang-Diffusion) does not currently support hot refit (updating weights while inference is in progress). The diffusion scheduler processes one request at a time and completes the entire inference before handling the next request, so weight updates and inference never run concurrently.
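The request and response fields above can be exercised with a small client. The following is a minimal sketch using only the Python standard library; the helper names, the server address `http://localhost:30000`, and the checkpoint path are assumptions for illustration, not part of SGLang.

```python
import json
import urllib.request
from typing import List, Optional, Tuple


def build_update_request(
    base_url: str,
    model_path: str,
    flush_cache: bool = True,
    target_modules: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /update_weights_from_disk."""
    payload = {"model_path": model_path, "flush_cache": flush_cache}
    if target_modules is not None:
        # Omitting this field asks the server to update all nn.Module components.
        payload["target_modules"] = target_modules
    return urllib.request.Request(
        url=f"{base_url}/update_weights_from_disk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_update_response(body: bytes) -> Tuple[bool, str]:
    """Extract the documented `success` and `message` fields from the reply."""
    reply = json.loads(body)
    return bool(reply["success"]), str(reply["message"])


# Hypothetical address and checkpoint path; adjust to your deployment.
request = build_update_request(
    "http://localhost:30000",
    "/path/to/new/checkpoint",
    target_modules=["transformer"],
)
print(request.full_url)
```

To send the request against a running server, pass it to `urllib.request.urlopen(request)` and hand the response bytes to `parse_update_response`. Because the update is all-or-nothing, a `success` value of `False` should mean the original weights were restored, unless the rollback itself failed.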
|
||||
|
||||
### Update Weights from Tensor
|
||||
|
||||
**When to use:**
|
||||
|
||||
@@ -1,608 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Speculative Decoding\n",
|
||||
"\n",
|
||||
"SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.\n",
|
||||
"\n",
|
||||
"## Summary\n",
|
||||
"\n",
|
||||
"### Jump to sections\n",
|
||||
"\n",
|
||||
"- [EAGLE Decoding](#eagle-decoding)\n",
|
||||
" - [EAGLE-2 decoding](#eagle-2-decoding)\n",
|
||||
" - [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile)\n",
|
||||
" - [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling)\n",
|
||||
" - [EAGLE-3 Decoding](#eagle-3-decoding)\n",
|
||||
"- [Multi Token Prediction](#multi-token-prediction)\n",
|
||||
"- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model)\n",
|
||||
"- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler)\n",
|
||||
"- [Ngram Speculative Decoding](#ngram-speculative-decoding)\n",
|
||||
"\n",
|
||||
"### Quick guidance\n",
|
||||
"\n",
|
||||
"- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`.\n",
|
||||
"- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`.\n",
|
||||
"- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`.\n",
|
||||
"- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section).\n",
|
||||
"- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`).\n",
|
||||
"- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only).\n",
|
||||
"- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`).\n",
|
||||
"\n",
|
||||
"### Method comparison (mini table)\n",
|
||||
"\n",
|
||||
"| Method | Draft source | Separate draft model? | How to enable | Notes / constraints |\n",
|
||||
"|---|---|---:|---|---|\n",
|
||||
"| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` |\n",
|
||||
"| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Further kernel-level optimizations |\n",
|
||||
"| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab |\n",
|
||||
"| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the benchmark above |\n",
|
||||
"| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models |\n",
|
||||
"| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does **not** support `--enable-dp-attention` |\n",
|
||||
"| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` |\n",
|
||||
"| NGRAM | Ngram cache from previous tokens | No | `--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill |\n",
|
||||
"\n",
|
||||
"### Performance Highlights\n",
|
||||
"\n",
|
||||
"Please see below for the huge improvements on throughput for LLaMA-Instruct 3.1 8B tested on MT bench that can be achieved via EAGLE3 decoding.\n",
|
||||
"For further details please see the [EAGLE3 paper](https://arxiv.org/pdf/2503.01840).\n",
|
||||
"\n",
|
||||
"| Method | Throughput (tokens/s) |\n",
|
||||
"|--------|----------------|\n",
|
||||
"| SGLang (w/o speculative, 1x H100) | 158.34 tokens/s |\n",
|
||||
"| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |\n",
|
||||
"| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## EAGLE Decoding\n",
|
||||
"\n",
|
||||
"To enable EAGLE speculative decoding the following parameters are relevant:\n",
|
||||
"* `speculative_draft_model_path`: Draft model path/weights. **Typically required** for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted (SGLang may auto-handle/auto-fill it).\n",
|
||||
"* `speculative_num_steps`: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.\n",
|
||||
"* `speculative_eagle_topk`: Branching factor per step. Improves candidate diversity, will lead to higher acceptance rate, but more lead to higher memory/compute consumption. Default is 4.\n",
|
||||
"* `speculative_num_draft_tokens`: Maximum parallel verification capacity. Allows deeper tree evaluation but will lead to higher GPU memory usage. Default is 8.\n",
|
||||
"\n",
|
||||
"These parameters are the same for EAGLE-2 and EAGLE-3.\n",
|
||||
"\n",
|
||||
"You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).\n",
|
||||
"\n",
|
||||
"In the documentation below, we set `--cuda-graph-max-bs` to be a small value for faster engine startup. For your own workloads, please tune the above parameters together with `--cuda-graph-max-bs`, `--max-running-requests`, `--mem-fraction-static` for the best performance. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### EAGLE-2 decoding\n",
|
||||
"\n",
|
||||
"You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"import openai"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \\\n",
|
||||
" --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \\\n",
|
||||
" --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"meta-llama/Llama-2-7b-chat-hf\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### EAGLE-2 Decoding with `torch.compile`\n",
|
||||
"\n",
|
||||
"You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf --speculative-algorithm EAGLE \\\n",
|
||||
" --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 \\\n",
|
||||
" --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.6 \\\n",
|
||||
" --enable-torch-compile --torch-compile-max-bs 2 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"meta-llama/Llama-2-7b-chat-hf\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling\n",
|
||||
"\n",
|
||||
"By employing a truncated high-frequency token vocabulary in the draft model, Eagle speculative decoding reduces `lm_head` computational overhead while accelerating the pipeline without quality degradation. For more details, checkout [the paper](https://arxiv.org/pdf/arXiv:2502.14856).\n",
|
||||
"\n",
|
||||
"In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency token in FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec). Or you can obtain high-frequency token by directly downloading these token from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).\n",
|
||||
"\n",
|
||||
"Thanks for the contribution from [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algorithm EAGLE \\\n",
|
||||
" --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \\\n",
|
||||
" --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \\\n",
|
||||
" --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### EAGLE-3 Decoding\n",
|
||||
"\n",
|
||||
"You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \\\n",
|
||||
" --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \\\n",
|
||||
" --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \\\n",
|
||||
" --cuda-graph-max-bs 2 --dtype float16 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Multi Token Prediction\n",
|
||||
"\n",
|
||||
"We support [MTP(Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) in SGLang by using speculative decoding. We use Xiaomi/MiMo-7B-RL model as example here (deepseek mtp usage refer to [deepseek doc](../basic_usage/deepseek.md#multi-token-prediction))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
" python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code \\\n",
|
||||
" --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \\\n",
|
||||
" --mem-fraction 0.5 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"\n",
|
||||
"url = f\"http://localhost:{port}/v1/chat/completions\"\n",
|
||||
"\n",
|
||||
"data = {\n",
|
||||
" \"model\": \"XiaomiMiMo/MiMo-7B-RL\",\n",
|
||||
" \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"response = requests.post(url, json=data)\n",
|
||||
"print_highlight(response.json())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Standalone Speculative Decoding (Small Draft Model)\n",
|
||||
"\n",
|
||||
"Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`.\n",
|
||||
"\n",
|
||||
"Relevant parameters:\n",
|
||||
"- `--speculative-draft-model-path`: Draft model weights (smaller than the target model).\n",
|
||||
"- `--speculative-num-steps`: Draft depth (how many steps the draft model runs autoregressively).\n",
|
||||
"- `--speculative-eagle-topk`: Branching factor (token candidates per step).\n",
|
||||
"- `--speculative-num-draft-tokens`: Verification capacity.\n",
|
||||
"\n",
|
||||
"Note:\n",
|
||||
"- Standalone speculative decoding currently **does not support** `--enable-dp-attention`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm STANDALONE \\\n",
|
||||
" --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \\\n",
|
||||
" --speculative-num-steps 4 --speculative-eagle-topk 2 --speculative-num-draft-tokens 7 \\\n",
|
||||
" --cuda-graph-max-bs 8 --mem-fraction-static 0.7 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"Qwen/Qwen2.5-7B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Speculative Decoding V2 (Overlap Scheduler)\n",
|
||||
"\n",
|
||||
"SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`).\n",
|
||||
"\n",
|
||||
"To enable it, set the environment variable:\n",
|
||||
"- `SGLANG_ENABLE_SPEC_V2=True`\n",
|
||||
"\n",
|
||||
"Notes:\n",
|
||||
"- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**.\n",
|
||||
"- If you explicitly set `--speculative-eagle-topk > 1`, the server will error. If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama), which is not supported by SpecV2.\n",
|
||||
"- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm STANDALONE \\\n",
|
||||
" --speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \\\n",
|
||||
" --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 5 \\\n",
|
||||
" --cuda-graph-max-bs 8 --mem-fraction-static 0.7 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"Qwen/Qwen2.5-7B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Ngram Speculative Decoding\n",
|
||||
"\n",
|
||||
"SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model.\n",
|
||||
"\n",
|
||||
"Enable it with:\n",
|
||||
"- `--speculative-algorithm NGRAM`\n",
|
||||
"\n",
|
||||
"Common parameters:\n",
|
||||
"- `--speculative-num-draft-tokens`: Number of draft tokens verified per step.\n",
|
||||
"- `--speculative-ngram-min-match-window-size` / `--speculative-ngram-max-match-window-size`: Matching window range.\n",
|
||||
"- `--speculative-ngram-min-bfs-breadth` / `--speculative-ngram-max-bfs-breadth`: BFS breadth range.\n",
|
||||
"- `--speculative-ngram-branch-length`: How many recent tokens to insert into the cache.\n",
|
||||
"- `--speculative-ngram-capacity`: Cache capacity.\n",
|
||||
"\n",
|
||||
"Notes:\n",
|
||||
"- Ngram speculative decoding **only supports CUDA**.\n",
|
||||
"- It currently **does not support** `--enable-dp-attention`.\n",
|
||||
"- It disables the overlap scheduler and mixed chunked prefill.\n",
|
||||
"- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct --speculative-algorithm NGRAM \\\n",
|
||||
" --speculative-num-draft-tokens 16 \\\n",
|
||||
" --speculative-ngram-max-match-window-size 12 --speculative-ngram-max-bfs-breadth 10 \\\n",
|
||||
" --cuda-graph-max-bs 8 --mem-fraction-static 0.8 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
|
||||
"\n",
|
||||
"response = client.chat.completions.create(\n",
|
||||
" model=\"Qwen/Qwen2.5-7B-Instruct\",\n",
|
||||
" messages=[\n",
|
||||
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
|
||||
" ],\n",
|
||||
" temperature=0,\n",
|
||||
" max_tokens=64,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print_highlight(f\"Response: {response}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"terminate_process(server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"\n",
|
||||
"The EAGLE process is as follows:\n",
|
||||
"\n",
|
||||
"- Within EAGLE the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$. \n",
|
||||
"- The next token is then sampled from $p_{k+2}=\\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style—branching out multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter—to ensure a more coherent connection of context, and are given as input again.\n",
|
||||
"- EAGLE-2 additionally uses the draft model to evaluate how probable certain branches in the draft tree are, dynamically stopping the expansion of unlikely branches. After the expansion phase, reranking is employed to select only the top `speculative_num_draft_tokens` final nodes as draft tokens.\n",
|
||||
"- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner.\n",
|
||||
"\n",
|
||||
"This enhances drafting accuracy by operating on features instead of tokens, which gives more regular inputs, and by additionally passing the tokens from the next timestep to minimize the randomness introduced by sampling. Furthermore, the dynamic adjustment of the draft tree and the selection of reranked final nodes further increase the acceptance rate of draft tokens. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"For guidance on how to train your own EAGLE model, see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train)."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
592
docs/advanced_features/speculative_decoding.md
Normal file
@@ -0,0 +1,592 @@
|
||||
# Speculative Decoding
|
||||
|
||||
SGLang provides several speculative decoding options, including EAGLE-2/EAGLE-3, MTP, classic draft-model decoding, and an NGRAM-based variant. Our implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.
|
||||
|
||||
## Summary
|
||||
|
||||
### Jump to sections
|
||||
|
||||
- [EAGLE Decoding](#eagle-decoding)
|
||||
- [EAGLE-2 Decoding](#eagle-2-decoding)
|
||||
- [EAGLE-2 Decoding with torch.compile](#eagle-2-decoding-with-torchcompile)
|
||||
- [EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling](#eagle-2-decoding-via-frequency-ranked-speculative-sampling)
|
||||
- [EAGLE-3 Decoding](#eagle-3-decoding)
|
||||
- [Multi Token Prediction](#multi-token-prediction)
|
||||
- [Standalone Speculative Decoding (Small Draft Model)](#standalone-speculative-decoding-small-draft-model)
|
||||
- [Speculative Decoding V2 (Overlap Scheduler)](#speculative-decoding-v2-overlap-scheduler)
|
||||
- [Ngram Speculative Decoding](#ngram-speculative-decoding)
|
||||
- [Full Parameter Reference](#full-parameter-reference)
|
||||
- [OOM Troubleshooting](#oom-troubleshooting)
|
||||
- [References](#references)
|
||||
|
||||
### Quick guidance
|
||||
|
||||
- **Best speed/quality (recommended)**: Use **EAGLE-3** with `--speculative-algorithm EAGLE3`.
|
||||
- **Strong default / broad compatibility**: Use **EAGLE-2** with `--speculative-algorithm EAGLE`.
|
||||
- **Lower `lm_head` overhead for EAGLE-2**: Enable **FR-Spec** with `--speculative-token-map`.
|
||||
- **Model is MTP-enabled**: Use **MTP via speculative decoding** (often with small `speculative_num_steps/topk/num_draft_tokens`, see the example section).
|
||||
- **You have a smaller draft LLM**: Use **STANDALONE** (`--speculative-algorithm STANDALONE`).
|
||||
- **No extra model available**: Use **NGRAM** (`--speculative-algorithm NGRAM`, CUDA-only).
|
||||
- **Want overlap scheduler (experimental)**: Enable **SpecV2** with `SGLANG_ENABLE_SPEC_V2=True` (requires `--speculative-eagle-topk 1`).
|
||||
|
||||
### Method comparison (mini table)
|
||||
|
||||
| Method | Draft source | Separate draft model? | How to enable | Notes / constraints |
|
||||
|---|---|---:|---|---|
|
||||
| EAGLE-2 | EAGLE draft model (feature drafting + tree) | Typically yes | `--speculative-algorithm EAGLE` + `--speculative-draft-model-path ...` | Tune `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens` |
|
||||
| EAGLE-2 + `torch.compile` | Same as EAGLE-2 | Typically yes | Add `--enable-torch-compile` (optionally `--torch-compile-max-bs`) | Further kernel-level optimizations |
|
||||
| EAGLE-2 + FR-Spec | Same as EAGLE-2 + token subset | Typically yes | Add `--speculative-token-map ...` | Reduces `lm_head` overhead with high-frequency token vocab |
|
||||
| EAGLE-3 | EAGLE3 draft model | Yes | `--speculative-algorithm EAGLE3` + `--speculative-draft-model-path ...` | Best throughput in the benchmark below (see Performance Highlights) |
|
||||
| MTP | Built-in multi-token heads (model-specific) | Often no | See **Multi Token Prediction** section | Uses speculative workflow; draft path may be auto-handled for some models |
|
||||
| STANDALONE | Smaller draft LLM (token-level) | Yes | `--speculative-algorithm STANDALONE` + `--speculative-draft-model-path ...` | Does **not** support `--enable-dp-attention` |
|
||||
| SpecV2 (experimental) | V2 workers + overlap scheduler | N/A | `SGLANG_ENABLE_SPEC_V2=True` | Only supports `--speculative-eagle-topk 1`; applies to `EAGLE`, `EAGLE3`, `STANDALONE` |
|
||||
| NGRAM | Ngram cache from previous tokens | No | `--speculative-algorithm NGRAM` | CUDA-only; no `--enable-dp-attention`; disables overlap scheduler & mixed chunked prefill |
|
||||
|
||||
### Performance Highlights
|
||||
|
||||
The table below shows the throughput improvement that EAGLE-3 decoding achieves for LLaMA-Instruct 3.1 8B on MT-Bench.
|
||||
For further details, see the [EAGLE-3 paper](https://arxiv.org/pdf/2503.01840).
|
||||
|
||||
| Method | Throughput (tokens/s) |
|
||||
|--------|----------------|
|
||||
| SGLang (w/o speculative, 1x H100) | 158.34 tokens/s |
|
||||
| SGLang + EAGLE-2 (1x H100) | 244.10 tokens/s |
|
||||
| SGLang + EAGLE-3 (1x H100) | 373.25 tokens/s |
|
||||
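The relative speedups implied by this table can be worked out directly. The following is a minimal sketch that uses only the three throughput figures reported above; the `throughput` dictionary and the `speedup` helper are illustrative names, not part of SGLang.

```python
# Throughput numbers copied from the table above (tokens/s, 1x H100).
throughput = {
    "baseline": 158.34,
    "eagle2": 244.10,
    "eagle3": 373.25,
}


def speedup(method: str, reference: str = "baseline") -> float:
    """Return how many times faster `method` is than `reference`."""
    return throughput[method] / throughput[reference]


for name in ("eagle2", "eagle3"):
    print(f"{name}: {speedup(name):.2f}x over baseline")
```

With these figures, EAGLE-2 comes out at about 1.54x and EAGLE-3 at about 2.36x the non-speculative baseline.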
|
||||
---
|
||||
|
||||
## EAGLE Decoding
|
||||
|
||||
The following parameters are relevant when enabling EAGLE speculative decoding:
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|---|---|---|
|
||||
| `--speculative-draft-model-path` | Draft model path/weights. **Typically required** for EAGLE/EAGLE3 and STANDALONE. For some MTP-enabled models, this can be omitted. | `None` |
|
||||
| `--speculative-num-steps` | Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. | Auto (`5` for Llama/Grok; `3` for many other models) |
|
||||
| `--speculative-eagle-topk` | Branching factor per step. Improves candidate diversity and acceptance rate, but increases memory/compute consumption. | Auto (`4` for Llama/Grok; `1` for many other models) |
|
||||
| `--speculative-num-draft-tokens` | Maximum parallel verification capacity. Allows deeper tree evaluation but increases GPU memory usage. | Auto (`8` for Llama/Grok; `4` for many other models). If `topk=1`, it is adjusted to `num_steps + 1`. |
|
||||
| `--speculative-accept-threshold-single` | Acceptance threshold for single-token verification. Lower values accept more aggressively. | `1.0` |
|
||||
| `--speculative-accept-threshold-acc` | Accumulated acceptance threshold across steps. | `1.0` |
|
||||
| `--speculative-attention-mode` | Attention mode for speculative operations (`prefill` or `decode`), affecting both target verification and draft extension. | `"prefill"` |
|
||||
| `--speculative-draft-attention-backend` | Override attention backend for the draft model. | `None` (same as target) |
|
||||
| `--speculative-draft-model-quantization` | Quantization method for the draft model. Use `"unquant"` to force no quantization even when the target model is quantized. | Same as target model |
|
||||
| `--speculative-draft-model-revision` | Specific revision/commit of the draft model to load. | `None` (auto-set to `"main"` when `--speculative-draft-model-path` is set and revision is omitted) |
|
||||
| `--speculative-draft-load-format` | Load format for the draft model weights. | `None` |
|
||||
|
||||
These parameters are mostly the same for EAGLE-2 and EAGLE-3. `--speculative-token-map` is ignored for EAGLE-3 models.
|
||||
For `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens`: leave all three unset to use auto-tuning, or set all three explicitly when tuning.
|
||||
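The all-or-none rule and the `topk=1` adjustment described above can be captured in a small pre-launch check. This is a sketch under stated assumptions: `resolve_draft_settings` and `to_cli_flags` are hypothetical helpers written for this guide, not SGLang's own argument handling, and only the flag names come from the table above.

```python
from typing import Dict, List, Optional


def resolve_draft_settings(
    num_steps: Optional[int] = None,
    topk: Optional[int] = None,
    num_draft_tokens: Optional[int] = None,
) -> Optional[Dict[str, int]]:
    """Hypothetical check mirroring the documented rules (not SGLang code)."""
    values = (num_steps, topk, num_draft_tokens)
    if all(v is None for v in values):
        # Nothing set: leave the choice to the server's auto-tuning.
        return None
    if any(v is None for v in values):
        raise ValueError("Set all three speculative parameters, or none of them.")
    if topk == 1:
        # Documented adjustment: with topk=1 the draft-token count is num_steps + 1.
        num_draft_tokens = num_steps + 1
    return {
        "speculative-num-steps": num_steps,
        "speculative-eagle-topk": topk,
        "speculative-num-draft-tokens": num_draft_tokens,
    }


def to_cli_flags(settings: Optional[Dict[str, int]]) -> List[str]:
    """Turn the resolved settings into launch_server command-line flags."""
    if settings is None:
        return []
    flags: List[str] = []
    for name, value in settings.items():
        flags += [f"--{name}", str(value)]
    return flags


print(to_cli_flags(resolve_draft_settings(3, 4, 16)))
print(to_cli_flags(resolve_draft_settings(4, 1, 7)))  # draft tokens become 5
```

Failing early on a partial configuration avoids mixing hand-picked values with auto-chosen ones, which is the situation the rule above is meant to prevent.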
|
||||
You can find the best combinations of these parameters with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py).
|
||||
|
||||
|
||||
### EAGLE-2 Decoding
|
||||
|
||||
You can enable EAGLE-2 decoding by setting `--speculative-algorithm EAGLE` and choosing an appropriate model.
|
||||
|
||||
**Launch the server:**
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model meta-llama/Llama-2-7b-chat-hf \
|
||||
--speculative-algorithm EAGLE \
|
||||
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
|
||||
--speculative-num-steps 3 \
|
||||
--speculative-eagle-topk 4 \
|
||||
--speculative-num-draft-tokens 16 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="meta-llama/Llama-2-7b-chat-hf",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### EAGLE-2 Decoding with `torch.compile`
|
||||
|
||||
You can also enable `torch.compile` for further optimizations and optionally set `--torch-compile-max-bs`:
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model meta-llama/Llama-2-7b-chat-hf \
|
||||
--speculative-algorithm EAGLE \
|
||||
--speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
|
||||
--speculative-num-steps 3 \
|
||||
--speculative-eagle-topk 4 \
|
||||
--speculative-num-draft-tokens 16 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--enable-torch-compile \
|
||||
--torch-compile-max-bs 8 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="meta-llama/Llama-2-7b-chat-hf",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### EAGLE-2 Decoding via Frequency-Ranked Speculative Sampling
|
||||
|
||||
By employing a truncated high-frequency token vocabulary in the draft model, EAGLE speculative decoding reduces `lm_head` computational overhead, accelerating the pipeline without quality degradation. For more details, check out [the paper](https://arxiv.org/pdf/2502.14856).
|
||||
|
||||
In our implementation, set `--speculative-token-map` to enable the optimization. You can get the high-frequency tokens for FR-Spec from [this model](https://huggingface.co/thunlp/LLaMA3-Instruct-8B-FR-Spec), or download them directly from [this repo](https://github.com/thunlp/FR-Spec/tree/main?tab=readme-ov-file#prepare-fr-spec-vocabulary-subset).
|
||||
|
||||
Thanks to [Weilin Zhao](https://github.com/Achazwl) and [Zhousx](https://github.com/Zhou-sx) for the contribution.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model meta-llama/Meta-Llama-3-8B-Instruct \
|
||||
--speculative-algorithm EAGLE \
|
||||
--speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B \
|
||||
--speculative-num-steps 3 \
|
||||
--speculative-eagle-topk 4 \
|
||||
--speculative-num-draft-tokens 16 \
|
||||
--speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--dtype float16 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### EAGLE-3 Decoding
|
||||
|
||||
You can enable EAGLE-3 decoding by setting `--speculative-algorithm EAGLE3` and choosing an appropriate model.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
|
||||
--speculative-algorithm EAGLE3 \
|
||||
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
|
||||
--speculative-num-steps 3 \
|
||||
--speculative-eagle-topk 4 \
|
||||
--speculative-num-draft-tokens 16 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--dtype float16 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Multi Token Prediction
|
||||
|
||||
SGLang supports [MTP (Multi-Token Prediction)](https://arxiv.org/pdf/2404.19737) through speculative decoding. The example below uses `XiaomiMiMo/MiMo-7B-RL` (for DeepSeek MTP usage, refer to the [deepseek_v32 doc](../basic_usage/deepseek_v32.md#multi-token-prediction)).
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model XiaomiMiMo/MiMo-7B-RL \
|
||||
--host 0.0.0.0 \
|
||||
--trust-remote-code \
|
||||
--speculative-algorithm EAGLE \
|
||||
--speculative-num-steps 1 \
|
||||
--speculative-eagle-topk 1 \
|
||||
--speculative-num-draft-tokens 2 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
url = "http://localhost:30000/v1/chat/completions"
|
||||
|
||||
data = {
|
||||
"model": "XiaomiMiMo/MiMo-7B-RL",
|
||||
"messages": [{"role": "user", "content": "What is the capital of France?"}],
|
||||
}
|
||||
|
||||
response = requests.post(url, json=data)
|
||||
print(response.json())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Standalone Speculative Decoding (Small Draft Model)
|
||||
|
||||
Besides EAGLE/MTP, SGLang also supports **token-level speculative decoding** using a smaller **draft model**. Enable it with `--speculative-algorithm STANDALONE` and provide a draft model via `--speculative-draft-model-path`.
|
||||
|
||||
Relevant parameters:
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|---|---|---|
|
||||
| `--speculative-draft-model-path` | Draft model weights (smaller than the target model). | `None` |
|
||||
| `--speculative-num-steps` | Draft depth (how many steps the draft model runs autoregressively). | `3` (auto default for STANDALONE) |
|
||||
| `--speculative-eagle-topk` | Branching factor (token candidates per step). | `1` (auto default for STANDALONE) |
|
||||
| `--speculative-num-draft-tokens` | Verification capacity. | `4` (auto default for STANDALONE) |
|
||||
| `--speculative-draft-model-quantization` | Quantization for the draft model. Use `"unquant"` to disable quantization on the draft even when the target is quantized. | Same as target |
|
||||
|
||||
> **Note:** Standalone speculative decoding currently **does not support** `--enable-dp-attention`.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model Qwen/Qwen2.5-7B-Instruct \
|
||||
--speculative-algorithm STANDALONE \
|
||||
--speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
|
||||
--speculative-num-steps 4 \
|
||||
--speculative-eagle-topk 2 \
|
||||
--speculative-num-draft-tokens 7 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-7B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Speculative Decoding V2 (Overlap Scheduler)
|
||||
|
||||
SGLang provides an **experimental Speculative Decoding V2** implementation that enables an overlap scheduler and uses V2 speculative workers (e.g. `StandaloneWorkerV2`, `EAGLEWorkerV2`).
|
||||
|
||||
To enable it, set the environment variable:
|
||||
- `SGLANG_ENABLE_SPEC_V2=True`
|
||||
|
||||
Notes:
|
||||
- SpecV2 currently only supports `--speculative-eagle-topk 1`. When SpecV2 is enabled, **set `--speculative-eagle-topk 1` explicitly**.
|
||||
- If you explicitly set `--speculative-eagle-topk` to a value greater than 1, the server raises an error.
|
||||
- If you omit `--speculative-eagle-topk`, auto-tuning may pick `topk > 1` for some models (e.g. Llama). This is incompatible with SpecV2 and may not always trigger an immediate config error, so set `--speculative-eagle-topk 1` explicitly.
|
||||
- This applies to `EAGLE`, `EAGLE3`, and `STANDALONE`.
|
||||
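The notes above can be enforced before the server is started. The following is a minimal sketch, assuming you launch the server from Python; `spec_v2_launch` and `SPEC_V2_ALGORITHMS` are hypothetical names introduced here, and only the environment variable and flags come from this section.

```python
import os
from typing import Dict, List, Sequence, Tuple

# Algorithms that this section lists as covered by SpecV2.
SPEC_V2_ALGORITHMS = {"EAGLE", "EAGLE3", "STANDALONE"}


def spec_v2_launch(
    model: str, algorithm: str, topk: int, extra_args: Sequence[str] = ()
) -> Tuple[Dict[str, str], List[str]]:
    """Hypothetical helper: build (env, argv) for a SpecV2 server launch."""
    if algorithm not in SPEC_V2_ALGORITHMS:
        raise ValueError(
            f"SpecV2 applies to {sorted(SPEC_V2_ALGORITHMS)}, got {algorithm!r}"
        )
    if topk != 1:
        raise ValueError("SpecV2 currently only supports --speculative-eagle-topk 1")
    # Copy the current environment and switch SpecV2 on for the child process only.
    env = dict(os.environ, SGLANG_ENABLE_SPEC_V2="True")
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model", model,
        "--speculative-algorithm", algorithm,
        # Always pass topk explicitly so auto-tuning cannot pick a value > 1.
        "--speculative-eagle-topk", str(topk),
        *extra_args,
    ]
    return env, argv


env, argv = spec_v2_launch("Qwen/Qwen2.5-7B-Instruct", "STANDALONE", topk=1)
print(" ".join(argv))
```

The returned pair can be handed to `subprocess.Popen(argv, env=env)`; the draft model path and any other flags go in through `extra_args`.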
|
||||
```bash
|
||||
SGLANG_ENABLE_SPEC_V2=True python3 -m sglang.launch_server \
|
||||
--model Qwen/Qwen2.5-7B-Instruct \
|
||||
--speculative-algorithm STANDALONE \
|
||||
--speculative-draft-model-path Qwen/Qwen2.5-1.5B-Instruct \
|
||||
--speculative-num-steps 4 \
|
||||
--speculative-eagle-topk 1 \
|
||||
--speculative-num-draft-tokens 5 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-7B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Ngram Speculative Decoding
|
||||
|
||||
SGLang also supports **ngram-based speculative decoding** (no separate draft model). It retrieves draft tokens from an ngram cache built from previously generated tokens, and then verifies them with the target model.
|
||||
|
||||
Enable it with:
|
||||
- `--speculative-algorithm NGRAM`
|
||||
|
||||
### Ngram-specific parameters
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|---|---|---|
|
||||
| `--speculative-num-draft-tokens` | Number of draft tokens verified per step. If omitted, defaults to `--speculative-ngram-max-match-window-size`. | `12` (with default ngram settings) |
|
||||
| `--speculative-ngram-min-match-window-size` | Minimum matching window size. | `1` |
|
||||
| `--speculative-ngram-max-match-window-size` | Maximum matching window size. | `12` |
|
||||
| `--speculative-ngram-min-bfs-breadth` | Minimum BFS breadth. | `1` |
|
||||
| `--speculative-ngram-max-bfs-breadth` | Maximum BFS breadth. | `10` |
|
||||
| `--speculative-ngram-match-type` | Match type: `"BFS"` or `"PROB"`. | `"BFS"` |
|
||||
| `--speculative-ngram-branch-length` | How many recent tokens to insert into the cache. | `18` |
|
||||
| `--speculative-ngram-capacity` | Cache capacity (number of entries). | `10,000,000` |
|
||||
|
||||
Notes:
|
||||
- Ngram speculative decoding **only supports CUDA**.
|
||||
- It currently **does not support** `--enable-dp-attention`.
|
||||
- It disables the overlap scheduler and mixed chunked prefill.
|
||||
- If `--speculative-ngram-max-bfs-breadth > 1` (thus `speculative_eagle_topk > 1`) and `page_size > 1`, use `--attention-backend flashinfer`; otherwise the server will error.
|
||||
- Optional: set `SGLANG_NGRAM_FORCE_GREEDY_VERIFY=True` to force greedy verification.
|
||||
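Two of the rules above lend themselves to a quick pre-launch check: the default for `--speculative-num-draft-tokens` and the FlashInfer requirement. This is a sketch only; `check_ngram_config` is a hypothetical helper, its defaults are copied from the parameter table, and the actual validation happens inside the server.

```python
from typing import Optional


def check_ngram_config(
    max_match_window_size: int = 12,
    max_bfs_breadth: int = 10,
    page_size: int = 1,
    attention_backend: Optional[str] = None,
    num_draft_tokens: Optional[int] = None,
) -> int:
    """Hypothetical pre-launch check; returns the effective draft-token count."""
    if max_bfs_breadth > 1 and page_size > 1 and attention_backend != "flashinfer":
        # Documented constraint: breadth > 1 with page_size > 1 needs FlashInfer.
        raise ValueError(
            "Use --attention-backend flashinfer when max BFS breadth > 1 "
            "and page_size > 1."
        )
    if num_draft_tokens is None:
        # Documented default: fall back to the maximum match window size.
        num_draft_tokens = max_match_window_size
    return num_draft_tokens


print(check_ngram_config())                     # default ngram settings
print(check_ngram_config(num_draft_tokens=16))  # explicit value, as in the example launch
```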
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model Qwen/Qwen2.5-7B-Instruct \
|
||||
--speculative-algorithm NGRAM \
|
||||
--speculative-num-draft-tokens 16 \
|
||||
--speculative-ngram-max-match-window-size 12 \
|
||||
--speculative-ngram-max-bfs-breadth 10 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--cuda-graph-max-bs 8 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
**Send a request:**
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="Qwen/Qwen2.5-7B-Instruct",
|
||||
messages=[
|
||||
{"role": "user", "content": "List 3 countries and their capitals."},
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=64,
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Full Parameter Reference
|
||||
|
||||
Below is a comprehensive list of all speculative decoding parameters available in SGLang:
|
||||
|
||||
### Core parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|---|---|---|---|
|
||||
| `--speculative-algorithm` | `str` | `None` | Algorithm to use: `EAGLE`, `EAGLE3`, `STANDALONE`, `NGRAM`, `NEXTN` (alias of `EAGLE`) |
|
||||
| `--speculative-draft-model-path` | `str` | `None` | Path to the draft model weights |
|
||||
| `--speculative-draft-model-revision` | `str` | `None` | Specific revision/commit of the draft model (`"main"` is auto-used when draft path is set and revision is omitted) |
|
||||
| `--speculative-draft-load-format` | `str` | `None` | Load format for draft model weights |
|
||||
| `--speculative-num-steps` | `int` | `None` (auto-chosen when omitted) | Autoregressive drafting depth |
|
||||
| `--speculative-eagle-topk` | `int` | `None` (auto-chosen when omitted) | Branching factor per drafting step |
|
||||
| `--speculative-num-draft-tokens` | `int` | `None` (auto-chosen when omitted) | Maximum number of draft tokens for verification |
|
||||
| `--speculative-accept-threshold-single` | `float` | `1.0` | Single-token acceptance threshold |
|
||||
| `--speculative-accept-threshold-acc` | `float` | `1.0` | Accumulated acceptance threshold |
|
||||
| `--speculative-token-map` | `str` | `None` | Path to FR-Spec high-frequency token map |
|
||||
| `--speculative-attention-mode` | `str` | `"prefill"` | Attention mode for speculative operations (`"prefill"` or `"decode"`) |
|
||||
| `--speculative-draft-attention-backend` | `str` | `None` | Override attention backend for the draft model |
|
||||
| `--speculative-moe-runner-backend` | `str` | `None` | MoE runner backend for the draft model |
|
||||
| `--speculative-moe-a2a-backend` | `str` | `None` | MoE all-to-all backend for the draft model |
|
||||
| `--speculative-draft-model-quantization` | `str` | Same as target | Quantization for the draft model (`"unquant"` to disable) |
|
||||
|
||||
### Ngram-specific parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|---|---|---|---|
|
||||
| `--speculative-ngram-min-match-window-size` | `int` | `1` | Minimum ngram matching window |
|
||||
| `--speculative-ngram-max-match-window-size` | `int` | `12` | Maximum ngram matching window |
|
||||
| `--speculative-ngram-min-bfs-breadth` | `int` | `1` | Minimum BFS breadth |
|
||||
| `--speculative-ngram-max-bfs-breadth` | `int` | `10` | Maximum BFS breadth |
|
||||
| `--speculative-ngram-match-type` | `str` | `"BFS"` | Match type: `"BFS"` or `"PROB"` |
|
||||
| `--speculative-ngram-branch-length` | `int` | `18` | Recent tokens to insert into cache |
|
||||
| `--speculative-ngram-capacity` | `int` | `10,000,000` | Cache capacity |
|
||||
|
||||
### Environment variables
|
||||
|
||||
| Variable | Default | Description |
|
||||
|---|---|---|
|
||||
| `SGLANG_ENABLE_SPEC_V2` | `False` | Enable Speculative Decoding V2 (overlap scheduler) |
|
||||
| `SGLANG_NGRAM_FORCE_GREEDY_VERIFY` | `False` | Force greedy verification for ngram decoding |
|
||||
|
||||
### Other related flags
|
||||
|
||||
| Parameter | Description |
|
||||
|---|---|
|
||||
| `--enable-multi-layer-eagle` | Enable multi-layer EAGLE (auto-enabled for MiMoV2 and Step3p5 models) |
|
||||
| `--enable-torch-compile` | Enable `torch.compile` for kernel-level optimizations |
|
||||
| `--torch-compile-max-bs` | Maximum batch size for `torch.compile` |
|
||||
|
||||
---
|
||||
|
||||
## OOM Troubleshooting
|
||||
|
||||
> [!WARNING]
|
||||
> **Out of Memory (OOM)?** Speculative decoding may increase GPU memory usage because the draft tree, CUDA graphs, and verification-related buffers consume additional VRAM. If you encounter OOM errors, try the following adjustments.
|
||||
|
||||
### Step 1: Reduce draft tree size (most effective)
|
||||
|
||||
These three parameters directly control how much memory the draft tree consumes:
|
||||
|
||||
```bash
|
||||
# Before (aggressive, high memory)
|
||||
--speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64
|
||||
|
||||
# After (conservative, lower memory)
|
||||
--speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16
|
||||
```
|
||||
|
||||
- **`--speculative-num-draft-tokens`**: This is the single most impactful parameter. Reducing from 64 → 16 can cut draft-related memory by ~75%. Start here.
|
||||
- **`--speculative-eagle-topk`**: Reducing from 8 → 4 halves the branching factor; going down to 2 quarters it.
|
||||
- **`--speculative-num-steps`**: Reducing from 5 → 3 shortens the draft depth.
|
||||
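The ~75% figure is simply the ratio of the two settings. Here is a minimal sketch of that arithmetic, assuming (as a simplification) that the relevant buffers scale linearly with each parameter; `relative_reduction` is an illustrative helper, and real savings depend on the model and backend.

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction saved when a linearly scaling quantity drops from `before` to `after`."""
    return 1 - after / before


# The "before" and "after" values from the example flags above.
changes = {
    "speculative-num-draft-tokens": (64, 16),
    "speculative-eagle-topk": (8, 4),
    "speculative-num-steps": (5, 3),
}

for flag, (before, after) in changes.items():
    print(f"--{flag}: {before} -> {after} saves {relative_reduction(before, after):.0%}")
```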
|
||||
### Step 2: Lower static memory fraction
|
||||
|
||||
```bash
|
||||
# Give more room for dynamic allocations (CUDA graphs, draft model, etc.)
|
||||
--mem-fraction-static 0.5 # when omitted, this value is auto-computed
|
||||
```
|
||||
|
||||
### Step 3: Reduce CUDA graph batch size
|
||||
|
||||
```bash
|
||||
# Fewer CUDA graph captures = less memory reserved
|
||||
--cuda-graph-max-bs 4 # or even 2 for tight memory situations
|
||||
```
|
||||
|
||||
### Step 4: Limit concurrent requests
|
||||
|
||||
```bash
|
||||
# Fewer concurrent requests lowers in-flight load and can reduce OOM risk
|
||||
--max-running-requests 4
|
||||
```
|
||||
|
||||
### Step 5: Use quantization
|
||||
|
||||
```bash
|
||||
# Quantize the target model (if supported by your checkpoint/hardware)
|
||||
--quantization fp8
|
||||
|
||||
# Or quantize only the draft model (keep target at full precision)
|
||||
--speculative-draft-model-quantization fp8
|
||||
```
|
||||
|
||||
### Step 6: Use a smaller dtype
|
||||
|
||||
```bash
|
||||
--dtype float16  # instead of float32 (when supported); bfloat16 is already 16-bit
|
||||
```
|
||||
|
||||
### Step 7: Use FR-Spec to reduce lm_head memory (EAGLE-2 / STANDALONE)
|
||||
|
||||
```bash
|
||||
--speculative-token-map thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt
|
||||
```
|
||||
> Note: For EAGLE-3, `--speculative-token-map` is ignored because EAGLE-3 models already provide built-in hot-token handling.
|
||||
|
||||
### Quick OOM recovery recipe
|
||||
|
||||
If you're hitting OOM and just want something that works, start with this minimal configuration and scale up:
|
||||
|
||||
```bash
|
||||
python3 -m sglang.launch_server \
|
||||
--model <your-model> \
|
||||
--speculative-algorithm EAGLE \
|
||||
--speculative-draft-model-path <your-draft-model> \
|
||||
--speculative-num-steps 2 \
|
||||
--speculative-eagle-topk 2 \
|
||||
--speculative-num-draft-tokens 8 \
|
||||
--cuda-graph-max-bs 2 \
|
||||
--mem-fraction-static 0.5 \
|
||||
--max-running-requests 4 \
|
||||
--dtype float16 \
|
||||
--log-level warning
|
||||
```
|
||||
|
||||
Then gradually increase `--speculative-num-draft-tokens`, `--speculative-eagle-topk`, and `--cuda-graph-max-bs` until you find the sweet spot for your GPU.
|
||||
|
||||
> [!TIP]
|
||||
> **Memory budget rule of thumb**: during automatic `--mem-fraction-static` estimation, STANDALONE reserves about 6 GB and EAGLE/EAGLE3 reserves about 2 GB as additional headroom. Plan your `--mem-fraction-static` accordingly.
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
The EAGLE process is as follows:
|
||||
|
||||
- Within EAGLE the draft model predicts the next feature vector, i.e. the last hidden state of the original LLM, using the feature sequence $(f_1, ..., f_k)$ and the token sequence $(t_2, ..., t_{k+1})$.
|
||||
- The next token is then sampled from $p_{k+2}=\text{LMHead}(f_{k+1})$. Afterwards, the two sequences are extended in a tree style—branching out multiple potential continuations, with the branching factor per step controlled by the `speculative_eagle_topk` parameter—to ensure a more coherent connection of context, and are given as input again.
|
||||
- In SGLang's EAGLE-2 implementation, the draft tree is expanded for the configured steps and then reranked to select the top `speculative_num_draft_tokens` final nodes as draft tokens.
|
||||
- EAGLE-3 removes the feature prediction objective, incorporates low and mid-layer features, and is trained in an on-policy manner.
|
||||
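The expand-then-rerank idea in the bullets above can be illustrated with a toy example. This is a deliberately simplified sketch, not SGLang's or the EAGLE authors' implementation: the draft model is replaced by a hypothetical `next_probs` callback returning a fixed distribution, scores are plain products of probabilities, and only the nodes kept at each step are reranked. The parameter names mirror `speculative_num_steps`, `speculative_eagle_topk`, and `speculative_num_draft_tokens`.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]
Node = Tuple[Path, float]  # (token path, cumulative probability)


def build_draft(
    next_probs: Callable[[Path], Dict[str, float]],
    num_steps: int,
    topk: int,
    num_draft_tokens: int,
) -> List[Node]:
    """Toy draft-tree expansion followed by reranking (illustrative only)."""
    frontier: List[Node] = [((), 1.0)]
    kept: List[Node] = []
    for _ in range(num_steps):
        # Expand every frontier node with all candidate next tokens.
        candidates = [
            (path + (token,), score * prob)
            for path, score in frontier
            for token, prob in next_probs(path).items()
        ]
        # Keep only the `topk` most probable expansions for the next step.
        frontier = heapq.nlargest(topk, candidates, key=lambda node: node[1])
        kept.extend(frontier)
    # Rerank everything kept so far and return the best `num_draft_tokens` nodes.
    return heapq.nlargest(num_draft_tokens, kept, key=lambda node: node[1])


def toy_model(path: Path) -> Dict[str, float]:
    # Stand-in for a draft model: the same distribution at every position.
    return {"a": 0.6, "b": 0.3, "c": 0.1}


for path, score in build_draft(toy_model, num_steps=2, topk=2, num_draft_tokens=3):
    print(path, round(score, 2))
```

Raising `topk` widens the tree at each step, raising `num_steps` deepens it, and `num_draft_tokens` caps how many nodes are handed to the target model for verification.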
|
||||
This enhances drafting accuracy by operating on features instead of tokens for more regular inputs and by additionally passing tokens from the next timestep to reduce sampling randomness. For more details, see the [EAGLE-2](https://arxiv.org/abs/2406.16858) and [EAGLE-3](https://arxiv.org/abs/2503.01840) papers.
|
||||
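The expand-then-rerank step described above can be illustrated with a toy sketch. It is not SGLang's implementation: the function name `toy_draft_tree` and the fixed per-rank probabilities are hypothetical, and every node is expanded at every step (a simplification) so that the reranking is easy to follow. The length of `rank_probs` plays the role of `speculative_eagle_topk`.

```python
import heapq


def toy_draft_tree(rank_probs, num_steps, num_draft_tokens):
    """Toy expand-then-rerank over a draft tree.

    rank_probs[i] is an assumed draft probability for the i-th ranked child
    of any node. A node's score is the product of the probabilities along
    its path from the root.
    """
    frontier = [((), 1.0)]  # (path of child ranks, cumulative score)
    candidates = []
    for _ in range(num_steps):
        next_frontier = []
        for path, score in frontier:
            for rank, prob in enumerate(rank_probs):
                next_frontier.append((path + (rank,), score * prob))
        candidates.extend(next_frontier)
        frontier = next_frontier
    # Rerank all drafted nodes and keep the highest-scoring ones.
    return heapq.nlargest(num_draft_tokens, candidates, key=lambda node: node[1])


selected = toy_draft_tree(rank_probs=[0.6, 0.3], num_steps=2, num_draft_tokens=3)
for path, score in selected:
    print(path, round(score, 2))
```

Because a child's score never exceeds its parent's, keeping the top-scoring nodes tends to keep whole prefixes of the tree, which is what makes the selected nodes usable as a draft.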
|
||||
For guidance on how to train your own EAGLE model, please see the [EAGLE repo](https://github.com/SafeAILab/EAGLE/tree/main?tab=readme-ov-file#train).
|
||||
@@ -54,7 +54,7 @@
|
||||
" \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --log-level warning\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")"
|
||||
]
|
||||
},
|
||||
@@ -740,7 +740,6 @@
|
||||
"import json\n",
|
||||
"from pydantic import BaseModel, Field\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"prompts = [\n",
|
||||
" \"Give me the information of the capital of China in the JSON format.\",\n",
|
||||
" \"Give me the information of the capital of France in the JSON format.\",\n",
|
||||
|
||||
@@ -50,7 +50,7 @@
|
||||
" \"python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
|
||||
"client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")"
|
||||
]
|
||||
},
|
||||
@@ -642,7 +642,6 @@
|
||||
"import json\n",
|
||||
"from pydantic import BaseModel, Field\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"prompts = [\n",
|
||||
" \"Give me the information of the capital of China in the JSON format.\",\n",
|
||||
" \"Give me the information of the capital of France in the JSON format.\",\n",
|
||||
|
||||
@@ -60,7 +60,7 @@
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\" # qwen25\n",
|
||||
")\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -550,7 +550,9 @@
|
||||
"server_process_tool_choice, port_tool_choice = launch_server_cmd(\n",
|
||||
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\"\n",
|
||||
")\n",
|
||||
"wait_for_server(f\"http://localhost:{port_tool_choice}\")\n",
|
||||
"wait_for_server(\n",
|
||||
" f\"http://localhost:{port_tool_choice}\", process=server_process_tool_choice\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Initialize client for tool choice examples\n",
|
||||
"client_tool_choice = OpenAI(\n",
|
||||
@@ -695,7 +697,7 @@
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \" python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1 --log-level warning\" # llama-3.2-1b-instruct\n",
|
||||
")\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
|
||||
"\n",
|
||||
"tools = [\n",
|
||||
" {\n",
|
||||
|
||||
@@ -117,7 +117,6 @@
|
||||
"source": [
|
||||
"from sglang import Engine\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")"
|
||||
]
|
||||
},
|
||||
|
||||
@@ -66,9 +66,13 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --n
|
||||
- `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs. It requires bf16 q, kv inputs.
|
||||
- `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU.
|
||||
- `aiter`: Aiter kernel on AMD GPUs. Can only be used as decode kernel.
|
||||
- `trtllm`: `trtllm-mla` sparse kernel from the flashinfer library. Only runs on Blackwell GPUs. It requires QKV bf16 or QKV fp8.
|
||||
- Based on performance benchmarks, the default configurations on H200 and B200 are set as follows:
|
||||
- H200: `flashmla_sparse` prefill attention (short-seq prefill uses MHA via FlashAttention varlen), `fa3` decode attention, `bf16` kv cache dtype.
|
||||
- B200: `flashmla_auto` prefill attention (short-seq prefill uses MHA via TRT-LLM ragged), `flashmla_kv` decode attention, `fp8_e4m3` kv cache dtype. `flashmla_auto` enables automatic selection of either `flashmla_sparse` or `flashmla_kv` kernel for prefill based on KV cache dtype, hardware, and heuristics. When FP8 KV cache is enabled and `total_kv_tokens < total_q_tokens * 512`, it uses the `flashmla_sparse` kernel; otherwise, it falls back to the `flashmla_kv` kernel. The heuristics may need to be tuned if the performance of either the `flashmla_sparse` or `flashmla_kv` kernel changes significantly.
|
||||
- On the Blackwell platform, performance can improve by up to 3x-5x at the cost of a slight accuracy drop:
|
||||
- B200: choosing `trtllm` for both `--nsa-prefill-backend` and `--nsa-decode-backend` makes prefill attention use MHA via TRT-LLM ragged for both short and long sequences (**accuracy impact**). When `trtllm` is combined with the `fp8_e4m3` kv cache, the kv cache dim is `576` (kv_lora_rank + qk_rope_head_dim) (**accuracy impact**), compared with `656` for the combination of `flashmla_auto` and the `fp8_e4m3` kv cache (kv_lora_rank + scale storage (kv_lora_rank // quant_block_size * 4 bytes) + rope dimension storage).
|
||||
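As a rough illustration of the `flashmla_auto` rule and the two kv cache dims quoted above, here is a minimal sketch. It reads the heuristic sentence literally (anything that is not "FP8 and below the threshold" goes to `flashmla_kv`). The function name and the constants `KV_LORA_RANK = 512`, `QK_ROPE_HEAD_DIM = 64`, `QUANT_BLOCK_SIZE = 128`, and a 2-byte rope element are assumptions chosen because they reproduce the `576` and `656` figures; they are not taken from the SGLang source.

```python
def select_prefill_kernel(kv_cache_is_fp8: bool,
                          total_kv_tokens: int,
                          total_q_tokens: int,
                          ratio: int = 512) -> str:
    """Literal reading of the documented flashmla_auto heuristic (hypothetical helper)."""
    if kv_cache_is_fp8 and total_kv_tokens < total_q_tokens * ratio:
        return "flashmla_sparse"
    return "flashmla_kv"


# Assumed model/quantization constants (not stated in the text above).
KV_LORA_RANK = 512
QK_ROPE_HEAD_DIM = 64
QUANT_BLOCK_SIZE = 128
SCALE_BYTES = 4          # "4 bytes" per quantization block, as quoted in the text
ROPE_BYTES_PER_ELEM = 2  # assumption: rope part kept in a 2-byte dtype

# trtllm + fp8_e4m3: kv_lora_rank + qk_rope_head_dim
trtllm_kv_dim = KV_LORA_RANK + QK_ROPE_HEAD_DIM

# flashmla_auto + fp8_e4m3: kv_lora_rank + scale storage + rope storage
flashmla_kv_dim = (
    KV_LORA_RANK
    + KV_LORA_RANK // QUANT_BLOCK_SIZE * SCALE_BYTES
    + QK_ROPE_HEAD_DIM * ROPE_BYTES_PER_ELEM
)

print(select_prefill_kernel(True, total_kv_tokens=4096, total_q_tokens=16))
print(trtllm_kv_dim, flashmla_kv_dim)
```

The threshold is exposed as the `ratio` parameter because the text notes that the heuristic may need retuning if either kernel's performance changes.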
|
||||
|
||||
## Multi-token Prediction
|
||||
SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. Please look at [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
|
||||
@@ -308,9 +312,7 @@ For context parallel in DeepSeek V3.2 model, we provide two different modes of s
|
||||
|
||||
### In sequence splitting
|
||||
|
||||
The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At attention stage, each cp rank computes the indexer results of sharded sequence, and collects the whole kv cache through all gather operator.
|
||||
|
||||
The communication group for context parallel reuses the one for attention tp, thus `cp_size` equals `atten_tp_size = tp_size / dp_size`.
|
||||
The first mode can be enabled by `--nsa-prefill-cp-mode in-seq-split`. This mode implements context parallel for DSA by splitting the sequence uniformly between context parallel ranks. At the attention stage, each cp rank computes the indexer results of its sharded sequence and collects the whole kv cache through an all-gather operation. Use `--attn-cp-size` to set the size of the communication group for context parallel.
|
||||
|
||||
Note that in-sequence splitting mode has the following restrictions:
|
||||
- The batch size is restricted to 1 for prefill batches
|
||||
@@ -323,7 +325,7 @@ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/
|
||||
Example:
|
||||
```bash
|
||||
# In-seq splitting mode launched with EP + DP
|
||||
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode in-seq-split --max-running-requests 32
|
||||
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --attn-cp-size 4 --nsa-prefill-cp-mode in-seq-split --max-running-requests 32
|
||||
```
|
||||
|
||||
### Round robin splitting (default setting)
|
||||
@@ -337,7 +339,7 @@ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/
|
||||
Example usage:
|
||||
```bash
|
||||
# Launch with FusedMoe + CP8
|
||||
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split --max-running-requests 32
|
||||
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --attn-cp-size 8 --nsa-prefill-cp-mode round-robin-split --max-running-requests 32
|
||||
```
|
||||
### Pipeline Parallel + Context Parallel (PP + CP)
|
||||
|
||||
@@ -361,6 +363,7 @@ python3 -m sglang.launch_server \
|
||||
--tp 8 --pp-size 2 \
|
||||
--dp-size 1 --moe-dense-tp-size 1 \
|
||||
--enable-nsa-prefill-context-parallel \
|
||||
--attn-cp-size 8 \
|
||||
--nsa-prefill-cp-mode round-robin-split \
|
||||
--trust-remote-code \
|
||||
--disable-radix-cache \
|
||||
@@ -384,6 +387,7 @@ python3 -m sglang.launch_server \
|
||||
--tp 8 --pp-size 2 \
|
||||
--dp-size 1 --moe-dense-tp-size 1 \
|
||||
--enable-nsa-prefill-context-parallel \
|
||||
--attn-cp-size 8 \
|
||||
--nsa-prefill-cp-mode round-robin-split \
|
||||
--trust-remote-code \
|
||||
--disable-radix-cache \
|
||||
@@ -411,6 +415,7 @@ python -m sglang.launch_server \
|
||||
--tp 8 --pp-size 2 \
|
||||
--dp-size 1 --moe-dense-tp-size 1 \
|
||||
--enable-nsa-prefill-context-parallel \
|
||||
--attn-cp-size 8 \
|
||||
--nsa-prefill-cp-mode round-robin-split \
|
||||
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
|
||||
--trust-remote-code \
|
||||
@@ -436,6 +441,7 @@ python -m sglang.launch_server \
|
||||
--tp 8 --pp-size 2 \
|
||||
--dp-size 1 --moe-dense-tp-size 1 \
|
||||
--enable-nsa-prefill-context-parallel \
|
||||
--attn-cp-size 8 \
|
||||
--nsa-prefill-cp-mode round-robin-split \
|
||||
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
|
||||
--trust-remote-code \
|
||||
|
||||
@@ -1,19 +0,0 @@
|
||||
# Diffusion
|
||||
|
||||
SGLang supports two categories of diffusion models for different use cases. This page covers image and video generation; for diffusion LLMs, see [Diffusion LLMs](diffusion_llms.md).
|
||||
|
||||
## Image & Video Generation Models
|
||||
|
||||
For generating images and videos from text prompts, SGLang supports [many](../supported_models/image_generation/diffusion_models.md#image-generation-models) models like:
|
||||
|
||||
- **FLUX, Qwen-Image** - High-quality image generation
|
||||
- **Wan 2.2, HunyuanVideo** - Video generation
|
||||
|
||||
```bash
|
||||
# Example: Launch FLUX for image generation
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path black-forest-labs/FLUX.2-klein-4B \
|
||||
--host 0.0.0.0 --port 30000
|
||||
```
|
||||
|
||||
**Full model list:** [Diffusion Models](../supported_models/image_generation/diffusion_models.md)
|
||||
@@ -1,14 +0,0 @@
|
||||
# Diffusion Language Models (dLLMs)
|
||||
|
||||
These are text-generation models that use diffusion (denoising) instead of autoregressive decoding:
|
||||
|
||||
- **LLaDA** - Large Language Diffusion with mAsking
|
||||
|
||||
```bash
|
||||
# Example: Launch LLaDA for text generation
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path GSAI-ML/LLaDA-8B-Instruct \
|
||||
--host 0.0.0.0 --port 30000
|
||||
```
|
||||
|
||||
**Full model list:** [Diffusion Language Models](../supported_models/text_generation/diffusion_language_models.md)
|
||||
@@ -49,7 +49,7 @@
|
||||
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -275,14 +275,12 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"embedding_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"embedding_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
|
||||
" --host 0.0.0.0 --is-embedding --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=embedding_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -324,14 +322,12 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"reranker_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"reranker_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n",
|
||||
" --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=reranker_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -392,14 +388,12 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"score_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"score_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
|
||||
" --host 0.0.0.0 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=score_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -456,13 +450,11 @@
|
||||
"# Note that SGLang now treats embedding models and reward models as the same type of models.\n",
|
||||
"# This will be updated in the future.\n",
|
||||
"\n",
|
||||
"reward_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"reward_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=reward_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -526,7 +518,7 @@
|
||||
" \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=expert_record_server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -575,13 +567,11 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tokenizer_free_server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"tokenizer_free_server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=tokenizer_free_server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -39,7 +39,7 @@
|
||||
" \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
|
||||
"print(f\"Server started on http://localhost:{port}\")"
|
||||
]
|
||||
},
|
||||
|
||||
@@ -30,14 +30,12 @@
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"embedding_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"embedding_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
|
||||
" --host 0.0.0.0 --is-embedding --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=embedding_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -33,13 +33,11 @@
|
||||
"from sglang.test.doc_patch import launch_server_cmd\n",
|
||||
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
|
||||
"\n",
|
||||
"vision_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"vision_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=vision_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -31,14 +31,12 @@
|
||||
"# This is equivalent to running the following command in your terminal\n",
|
||||
"# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n",
|
||||
"\n",
|
||||
"server_process, port = launch_server_cmd(\n",
|
||||
" \"\"\"\n",
|
||||
"server_process, port = launch_server_cmd(\"\"\"\n",
|
||||
"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
|
||||
" --host 0.0.0.0 --log-level warning\n",
|
||||
"\"\"\"\n",
|
||||
")\n",
|
||||
"\"\"\")\n",
|
||||
"\n",
|
||||
"wait_for_server(f\"http://localhost:{port}\")"
|
||||
"wait_for_server(f\"http://localhost:{port}\", process=server_process)"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -2,28 +2,39 @@
|
||||
|
||||
## Benchmark
|
||||
|
||||
- Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`.
|
||||
Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
|
||||
- Without a server (do not need to launch a server)
|
||||
```bash
|
||||
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
|
||||
```
|
||||
- With a server (please use `sglang.launch_server` to launch a server first and run the following command.)
|
||||
```bash
|
||||
python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
|
||||
```
|
||||
SGLang provides four benchmark tools that operate at different levels of the stack. The table below summarizes their key differences:
|
||||
|
||||
| Tool | HTTP Server | Scheduler | Use Case |
|
||||
| -------------------------- | --------------------------------------------- | --------------------------------------- | -------------------------------------------------------------------------- |
|
||||
| `bench_serving` | Yes (async HTTP client to a running server) | Yes (indirectly, via server) | Realistic online serving benchmarks with latency metrics (TTFT, TPOT, ITL) |
|
||||
| `bench_one_batch_server` | Yes (sends HTTP requests to a running server) | Yes (indirectly, via server) | End-to-end single-batch latency including HTTP and scheduler overhead |
|
||||
| `bench_offline_throughput` | No | Yes (directly uses `Engine` in-process) | Maximum throughput measurement without HTTP overhead |
|
||||
| `bench_one_batch` | No | No (directly calls `ModelRunner`) | Kernel-level latency profiling of a single static batch |
|
||||
|
||||
- Benchmark offline processing. This script will start an offline engine and run the benchmark.
|
||||
Use `bench_serving` by default unless there are specific needs.
|
||||
|
||||
**`bench_serving`** is an async HTTP load-testing client that sends requests at controlled rates with configurable concurrency to a running server. It measures realistic online serving metrics including time-to-first-token (TTFT), time-per-output-token (TPOT), inter-token latency (ITL), and throughput. Use `num-prompts >= 5 * max-concurrency` to measure steady-state performance. Launch a server with `sglang.launch_server` first.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.bench_serving --backend sglang --max-concurrency 16 --num-prompts 80 --random-input-len 256 --random-output-len 32 --dataset-name random
|
||||
```
|
||||
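The `num-prompts >= 5 * max-concurrency` guideline can be wrapped in a small helper that builds the command line. This is only a convenience sketch: the helper name `bench_serving_command` and its `steady_state_factor` parameter are hypothetical, while the flags are the ones shown in the command above.

```python
import shlex


def bench_serving_command(max_concurrency: int,
                          steady_state_factor: int = 5,
                          input_len: int = 256,
                          output_len: int = 32) -> str:
    """Build a bench_serving command with num-prompts = factor * max-concurrency."""
    num_prompts = steady_state_factor * max_concurrency
    args = [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--max-concurrency", str(max_concurrency),
        "--num-prompts", str(num_prompts),
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--dataset-name", "random",
    ]
    return shlex.join(args)


print(bench_serving_command(16))
```

With `max_concurrency=16` this yields the same `--num-prompts 80` used in the example command.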
|
||||
**`bench_one_batch_server`** sends a single batch as one HTTP request to a running server. Because there is only a single batch, the server never reaches a steady state, so the metrics will be biased. Launch a server with `sglang.launch_server` first.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
|
||||
```
|
||||
|
||||
**`bench_offline_throughput`** directly instantiates the `Engine` object in-process (no HTTP server) and submits all requests at once via `engine.generate()`. The engine's scheduler handles batching and execution. This measures maximum achievable throughput without any network overhead.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
|
||||
```
|
||||
|
||||
- Benchmark online serving. Please use `sglang.launch_server` to launch a server first and run the following command.
|
||||
**`bench_one_batch`** is the lowest-level tool. It directly instantiates a `ModelRunner` and calls `extend()` / `decode()` on a fixed static batch, bypassing the scheduler entirely. The prefill and decode phases are run separately, making profiling easier but rendering the metrics unrealistic. Because there is no dynamic batching, it may run out of memory for batch sizes that a real server can handle (a real server chunks prefill into smaller batches). This is best suited for profiling individual kernel performance.
|
||||
|
||||
```bash
|
||||
python3 -m sglang.bench_serving --backend sglang --num-prompt 10
|
||||
python3 -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
|
||||
```
|
||||
|
||||
## Profile with PyTorch Profiler
|
||||
|
||||
@@ -26,7 +26,7 @@ python -m sglang.test.run_eval \
|
||||
|
||||
```bash
|
||||
python -m sglang.test.few_shot_gsm8k \
|
||||
--host http://127.0.0.1 \
|
||||
--host 127.0.0.1 \
|
||||
--port 30000 \
|
||||
--num-questions 200 \
|
||||
--num-shots 5
|
||||
@@ -36,7 +36,7 @@ python -m sglang.test.few_shot_gsm8k \
|
||||
|
||||
```bash
|
||||
python benchmark/hellaswag/bench_sglang.py \
|
||||
--host http://127.0.0.1 \
|
||||
--host 127.0.0.1 \
|
||||
--port 30000 \
|
||||
--num-questions 200 \
|
||||
--num-shots 20
|
||||
@@ -54,7 +54,7 @@ python -m sglang.test.run_eval \
|
||||
```
|
||||
|
||||
```{tip}
|
||||
For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-r1`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
|
||||
For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
|
||||
```
|
||||
|
||||
**HumanEval**
|
||||
|
||||
@@ -5,7 +5,6 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
|
||||
## Prerequisites
|
||||
|
||||
- A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
|
||||
- Python 3.11+ if you plan to use the OpenAI Python SDK.
|
||||
|
||||
|
||||
## Supported Arguments
|
||||
@@ -13,7 +12,6 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
|
||||
### Server Arguments
|
||||
|
||||
- `--model-path {MODEL_PATH}`: Path to the model or model ID
|
||||
- `--vae-path {VAE_PATH}`: Path to a custom VAE model or HuggingFace model ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`). If not specified, the VAE will be loaded from the main model path.
|
||||
- `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
|
||||
- `--lora-nickname {NAME}`: Nickname for the LoRA adapter. (default: `default`).
|
||||
- `--num-gpus {NUM_GPUS}`: Number of GPUs to use
|
||||
@@ -35,7 +33,7 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
|
||||
- `--seed {SEED}`: Random seed for reproducible generation
|
||||
|
||||
|
||||
#### Image/Video Configuration
|
||||
**Image/Video Configuration**
|
||||
|
||||
- `--height {HEIGHT}`: Height of the generated output
|
||||
- `--width {WIDTH}`: Width of the generated output
|
||||
@@ -43,7 +41,7 @@ The SGLang-diffusion CLI provides a quick way to access the inference pipeline f
|
||||
- `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task
|
||||
|
||||
|
||||
#### Output Options
|
||||
**Output Options**
|
||||
|
||||
- `--output-path {PATH}`: Directory to save the generated video
|
||||
- `--save-output`: Whether to save the image/video to disk
|
||||
@@ -168,7 +166,7 @@ When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
|
||||
3. Upon successful upload, the local file is deleted.
|
||||
4. The API response returns the public URL of the uploaded object.
|
||||
|
||||
#### Configuration
|
||||
**Configuration**
|
||||
|
||||
Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.
|
||||
|
||||
@@ -183,7 +181,7 @@ export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
|
||||
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
|
||||
```
|
||||
|
||||
See [Environment Variables Documentation](environment_variables.md) for more details.
|
||||
See [Environment Variables Documentation](../environment_variables.md) for more details.
|
||||
|
||||
## Generate
|
||||
|
||||
@@ -219,6 +217,32 @@ Once the generation task has finished, the server will shut down automatically.
|
||||
> [!NOTE]
|
||||
> The HTTP server-related arguments are ignored in this subcommand.
|
||||
|
||||
## Component Path Overrides
|
||||
|
||||
SGLang diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for:
|
||||
|
||||
### Example: FLUX.2-dev with Tiny AutoEncoder
|
||||
|
||||
You can override **any** component by using `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`:
|
||||
|
||||
For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:
|
||||
|
||||
```bash
|
||||
# Option 1: pass a Hugging Face repo ID
|
||||
sglang serve \
|
||||
--model-path=black-forest-labs/FLUX.2-dev \
|
||||
--vae-path=fal/FLUX.2-Tiny-AutoEncoder
|
||||
|
||||
# Option 2: pass a local path
|
||||
sglang serve \
|
||||
--model-path=black-forest-labs/FLUX.2-dev \
|
||||
--vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
|
||||
```
|
||||
|
||||
**Important:**
|
||||
- The component key must match the one in your model's `model_index.json` (e.g., `vae`).
|
||||
- The path must:
|
||||
- either be a Hugging Face repo ID (e.g., `fal/FLUX.2-Tiny-AutoEncoder`)
|
||||
- or point to a **complete component folder**, containing `config.json` and safetensors files
|
||||
|
||||
|
||||
## Diffusers Backend
|
||||
|
||||
SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.
|
||||
@@ -2,6 +2,10 @@
|
||||
|
||||
The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Python 3.11+ if you plan to use the OpenAI Python SDK.
|
||||
|
||||
## Serve
|
||||
|
||||
Launch the server using the `sglang serve` command.
|
||||
@@ -25,7 +29,7 @@ sglang serve "${SERVER_ARGS[@]}"
|
||||
- **--model-path**: Path to the model or model ID.
|
||||
- **--port**: HTTP port to listen on (default: `30000`).
|
||||
|
||||
#### Get Model Information
|
||||
**Get Model Information**
|
||||
|
||||
**Endpoint:** `GET /models`
|
||||
|
||||
@@ -59,7 +63,7 @@ curl -sS -X GET "http://localhost:30010/models"
|
||||
|
||||
The server implements an OpenAI-compatible Images API under the `/v1/images` namespace.
|
||||
|
||||
#### Create an image
|
||||
**Create an image**
|
||||
|
||||
**Endpoint:** `POST /v1/images/generations`
|
||||
|
||||
@@ -100,7 +104,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/generations" \
|
||||
> **Note**
|
||||
> The `response_format=url` option is not supported for `POST /v1/images/generations` and will return a `400` error.
|
||||
|
||||
#### Edit an image
|
||||
**Edit an image**
|
||||
|
||||
**Endpoint:** `POST /v1/images/edits`
|
||||
|
||||
@@ -130,7 +134,7 @@ curl -sS -X POST "http://localhost:30010/v1/images/edits" \
|
||||
-F "response_format=url"
|
||||
```
|
||||
|
||||
#### Download image content
|
||||
**Download image content**
|
||||
|
||||
When `response_format=url` is used with `POST /v1/images/edits`, the API returns a relative URL like `/v1/images/<IMAGE_ID>/content`.
|
||||
|
||||
@@ -148,7 +152,7 @@ curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
|
||||
|
||||
The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace.
|
||||
|
||||
#### Create a video
|
||||
**Create a video**
|
||||
|
||||
**Endpoint:** `POST /v1/videos`
|
||||
|
||||
@@ -178,7 +182,7 @@ curl -sS -X POST "http://localhost:30010/v1/videos" \
|
||||
}'
|
||||
```
|
||||
|
||||
#### List videos
|
||||
**List videos**
|
||||
|
||||
**Endpoint:** `GET /v1/videos`
|
||||
|
||||
@@ -197,7 +201,7 @@ curl -sS -X GET "http://localhost:30010/v1/videos" \
|
||||
-H "Authorization: Bearer sk-proj-1234567890"
|
||||
```
|
||||
|
||||
#### Download video content
|
||||
**Download video content**
|
||||
|
||||
**Endpoint:** `GET /v1/videos/{video_id}/content`
|
||||
|
||||
@@ -239,7 +243,7 @@ The server supports dynamic loading, merging, and unmerging of LoRA adapters.
|
||||
- Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one
|
||||
- Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost
|
||||
|
||||
#### Set LoRA Adapter
|
||||
**Set LoRA Adapter**
|
||||
|
||||
Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.
|
||||
|
||||
@@ -301,7 +305,7 @@ curl -X POST http://localhost:30010/v1/set_lora \
|
||||
> - Multiple LoRAs applied to the same target will be merged in order
|
||||
|
||||
|
||||
#### Merge LoRA Weights
|
||||
**Merge LoRA Weights**
|
||||
|
||||
Manually merges the currently set LoRA weights into the base model.
|
||||
|
||||
@@ -323,7 +327,7 @@ curl -X POST http://localhost:30010/v1/merge_lora_weights \
|
||||
```
|
||||
|
||||
|
||||
#### Unmerge LoRA Weights
|
||||
**Unmerge LoRA Weights**
|
||||
|
||||
Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA.
|
||||
|
||||
@@ -336,7 +340,7 @@ curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
|
||||
-H "Content-Type: application/json"
|
||||
```
|
||||
|
||||
#### List LoRA Adapters
|
||||
**List LoRA Adapters**
|
||||
|
||||
Returns loaded LoRA adapters and current application status per module.
|
||||
|
||||
@@ -389,3 +393,26 @@ Notes:
|
||||
curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'
|
||||
```
|
||||
5. Generate with LoRA B...
|
||||
|
||||
### Adjust Output Quality
|
||||
|
||||
The server supports adjusting output quality and compression levels for both image and video generation through the `output-quality` and `output-compression` parameters.
|
||||
|
||||
#### Parameters
|
||||
|
||||
- **`output-quality`** (string, optional): Preset quality level that automatically sets compression. **Default is `"default"`**. Valid values:
|
||||
- `"maximum"`: Highest quality (100)
|
||||
- `"high"`: High quality (90)
|
||||
- `"medium"`: Medium quality (55)
|
||||
- `"low"`: Lower quality (35)
|
||||
- `"default"`: Auto-adjust based on media type (50 for video, 75 for image)
|
||||
|
||||
- **`output-compression`** (integer, optional): Direct compression level override (0-100). **Default is `None`**. When provided (not `None`), takes precedence over `output-quality`.
|
||||
- `0`: Lowest quality, smallest file size
|
||||
- `100`: Highest quality, largest file size
|
||||
|
||||
#### Notes
|
||||
|
||||
- **Precedence**: When both `output-quality` and `output-compression` are provided, `output-compression` takes precedence
|
||||
- **Format Support**: Quality settings apply to JPEG and video formats. PNG uses lossless compression and ignores these settings
|
||||
- **File Size vs Quality**: Lower compression values (or "low" quality preset) produce smaller files but may show visible artifacts
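
The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the server's actual code: `resolve_compression`, `QUALITY_PRESETS`, and `DEFAULT_BY_MEDIA` are hypothetical names introduced here, and the numeric values are copied from the parameter list.

```python
from typing import Optional

# Preset values copied from the parameter list above.
QUALITY_PRESETS = {"maximum": 100, "high": 90, "medium": 55, "low": 35}
# The "default" preset depends on the media type.
DEFAULT_BY_MEDIA = {"video": 50, "image": 75}


def resolve_compression(
    media_type: str,
    output_quality: str = "default",
    output_compression: Optional[int] = None,
) -> int:
    """Hypothetical helper: return the effective 0-100 level for one request."""
    # An explicit output-compression value always wins over the preset.
    if output_compression is not None:
        if not 0 <= output_compression <= 100:
            raise ValueError("output-compression must be in the range 0-100")
        return output_compression
    # Otherwise fall back to the preset, with "default" resolved per media type.
    if output_quality == "default":
        return DEFAULT_BY_MEDIA[media_type]
    return QUALITY_PRESETS[output_quality]
```

For example, `resolve_compression("video", "low", 80)` yields `80` because the explicit override takes precedence, while `resolve_compression("image")` falls back to the image default of `75`.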
|
||||
@@ -1,5 +1,4 @@
|
||||
|
||||
## Perf baseline generation script
|
||||
## Perf Baseline Generation Script
|
||||
|
||||
`python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`.
|
||||
|
||||
@@ -16,26 +16,26 @@ default parameters when initializing and generating videos.
|
||||
|
||||
### Video Generation Models
|
||||
|
||||
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention(SLA)| Sage Sparse Linear Attention(SageSLA)|
|
||||
|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|
|
||||
| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
|
||||
| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ |
|
||||
| Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ |
|
||||
| TurboWan2.1 T2V 1.3B | `IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
|
||||
| TurboWan2.1 T2V 14B | `IPostYellow/TurboWan2.1-T2V-14B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
|
||||
| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
|
||||
| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ |
|
||||
| Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention (SLA) | Sage Sparse Linear Attention (SageSLA) | Sparse Video Gen 2 (SVG2) |
|
||||
|:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|:----------------------------------:|
|
||||
| FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ |
|
||||
| FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ |
|
||||
| Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | ❌ | ❌ | ❌ |
|
||||
| Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ |
|
||||
| Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ |
|
||||
| HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
|
||||
| FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
|
||||
| Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
|
||||
| Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
|
||||
| Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
|
||||
| Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
|
||||
| TurboWan2.1 T2V 1.3B | `IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
|
||||
| TurboWan2.1 T2V 14B | `IPostYellow/TurboWan2.1-T2V-14B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
|
||||
| TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
|
||||
| TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
|
||||
|
||||
**Note**: <br>
|
||||
1.Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.<br>
|
||||
**Note**:
|
||||
1. Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
|
||||
2. SageSLA is based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`
|
||||
|
||||
### Image Generation Models
|
||||
@@ -55,7 +55,7 @@ default parameters when initializing and generating videos.
|
||||
|
||||
This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline.
|
||||
|
||||
> Important: \
|
||||
> Important:
|
||||
> LoRAs that are not listed here are not necessarily incompatible.
|
||||
> In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions.
|
||||
> The entries below simply reflect configurations that have been manually validated by the SGLang team.
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`).
|
||||
|
||||
## 1. Commit Message Convention
|
||||
## Commit Message Convention
|
||||
|
||||
We follow a structured commit message format to maintain a clean history.
|
||||
|
||||
@@ -21,7 +21,7 @@ We follow a structured commit message format to maintain a clean history.
|
||||
- **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc.
|
||||
- **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature").
|
||||
|
||||
## 2. Performance Reporting
|
||||
## Performance Reporting
|
||||
|
||||
For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report.
|
||||
|
||||
@@ -45,7 +45,7 @@ For PRs that impact **latency**, **throughput**, or **memory usage**, you **shou
|
||||
```
|
||||
4. **Paste**: paste the table into the PR description
|
||||
|
||||
## 3. CI-Based Change Protection
|
||||
## CI-Based Change Protection
|
||||
|
||||
Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that:
|
||||
|
||||
@@ -1,11 +1,11 @@
|
||||
## Caching Acceleration
|
||||
|
||||
These variables configure caching acceleration for Diffusion Transformer (DiT) models.
|
||||
SGLang supports multiple caching strategies - see [caching documentation](cache/caching.md) for an overview.
|
||||
SGLang supports multiple caching strategies - see [caching documentation](performance/cache/index.md) for an overview.
|
||||
|
||||
### Cache-DiT Configuration
|
||||
|
||||
See [cache-dit documentation](cache/cache_dit.md) for detailed configuration.
|
||||
See [cache-dit documentation](performance/cache/cache_dit.md) for detailed configuration.
|
||||
|
||||
| Environment Variable | Default | Description |
|
||||
|-------------------------------------|---------|------------------------------------------|
|
||||
98
docs/diffusion/index.md
Normal file
@@ -0,0 +1,98 @@
|
||||
# SGLang Diffusion
|
||||
|
||||
SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels and an efficient scheduler loop.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Broad Model Support**: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
|
||||
- **Fast Inference**: Optimized kernels, efficient scheduler loop, and Cache-DiT acceleration
|
||||
- **Ease of Use**: OpenAI-compatible API, CLI, and Python SDK
|
||||
- **Multi-Platform**: NVIDIA GPUs (H100, H200, A100, B200, 4090), AMD GPUs (MI300X, MI325X) and Ascend NPU (A2, A3)
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
uv pip install "sglang[diffusion]" --prerelease=allow
|
||||
```
|
||||
|
||||
See [Installation Guide](installation.md) for more installation methods and ROCm-specific instructions.
|
||||
|
||||
### Basic Usage
|
||||
|
||||
Generate an image with the CLI:
|
||||
|
||||
```bash
|
||||
sglang generate --model-path Qwen/Qwen-Image \
|
||||
--prompt "A beautiful sunset over the mountains" \
|
||||
--save-output
|
||||
```
|
||||
|
||||
Or start a server with the OpenAI-compatible API:
|
||||
|
||||
```bash
|
||||
sglang serve --model-path Qwen/Qwen-Image --port 30010
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Documentation
|
||||
|
||||
### Getting Started
|
||||
|
||||
- **[Installation](installation.md)** - Install SGLang Diffusion via pip, uv, Docker, or from source
|
||||
- **[Compatibility Matrix](compatibility_matrix.md)** - Supported models and optimization compatibility
|
||||
|
||||
### Usage
|
||||
|
||||
- **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
|
||||
- **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
- **[Performance Overview](performance/index.md)** - Overview of all performance optimization strategies
|
||||
- **[Attention Backends](performance/attention_backends.md)** - Available attention backends (FlashAttention, SageAttention, etc.)
|
||||
- **[Caching Strategies](performance/cache/index.md)** - Cache-DiT and TeaCache acceleration
|
||||
- **[Profiling](performance/profiling.md)** - Profiling techniques with PyTorch Profiler and Nsight Systems
|
||||
|
||||
### Reference
|
||||
|
||||
- **[Environment Variables](environment_variables.md)** - Configuration via environment variables
|
||||
- **[Support New Models](support_new_models.md)** - Guide for adding new diffusion models
|
||||
- **[Contributing](contributing.md)** - Contribution guidelines and commit message conventions
|
||||
- **[CI Performance](ci_perf.md)** - Performance baseline generation script
|
||||
|
||||
---
|
||||
|
||||
## CLI Quick Reference
|
||||
|
||||
### Generate (one-off generation)
|
||||
|
||||
```bash
|
||||
sglang generate --model-path <MODEL> --prompt "<PROMPT>" --save-output
|
||||
```
|
||||
|
||||
### Serve (HTTP server)
|
||||
|
||||
```bash
|
||||
sglang serve --model-path <MODEL> --port 30010
|
||||
```
|
||||
|
||||
### Enable Cache-DiT acceleration
|
||||
|
||||
```bash
|
||||
SGLANG_CACHE_DIT_ENABLED=true sglang generate --model-path <MODEL> --prompt "<PROMPT>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [SGLang GitHub](https://github.com/sgl-project/sglang)
|
||||
- [Cache-DiT](https://github.com/vipshop/cache-dit)
|
||||
- [FastVideo](https://github.com/hao-ai-lab/FastVideo)
|
||||
- [xDiT](https://github.com/xdit-project/xDiT)
|
||||
- [Diffusers](https://github.com/huggingface/diffusers)
|
||||
95
docs/diffusion/installation.md
Normal file
@@ -0,0 +1,95 @@
|
||||
# Install SGLang-Diffusion
|
||||
|
||||
You can install SGLang-Diffusion using one of the methods below.
|
||||
|
||||
## Standard Installation (NVIDIA GPUs)
|
||||
|
||||
### Method 1: With pip or uv
|
||||
|
||||
It is recommended to use uv for a faster installation:
|
||||
|
||||
```bash
|
||||
pip install --upgrade pip
|
||||
pip install uv
|
||||
uv pip install "sglang[diffusion]" --prerelease=allow
|
||||
```
|
||||
|
||||
### Method 2: From source
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone https://github.com/sgl-project/sglang.git
|
||||
cd sglang
|
||||
|
||||
# Install the Python packages
|
||||
pip install --upgrade pip
|
||||
pip install -e "python[diffusion]"
|
||||
|
||||
# With uv
|
||||
uv pip install -e "python[diffusion]" --prerelease=allow
|
||||
```
|
||||
|
||||
### Method 3: Using Docker
|
||||
|
||||
The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang), built from the [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile).
|
||||
Replace `<secret>` below with your HuggingFace Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
|
||||
|
||||
```bash
|
||||
docker run --gpus all \
|
||||
--shm-size 32g \
|
||||
-p 30000:30000 \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HF_TOKEN=<secret>" \
|
||||
--ipc=host \
|
||||
lmsysorg/sglang:dev \
|
||||
zsh -c '\
|
||||
echo "Installing diffusion dependencies..." && \
|
||||
pip install -e "python[diffusion]" && \
|
||||
echo "Starting SGLang-Diffusion..." && \
|
||||
sglang generate \
|
||||
--model-path black-forest-labs/FLUX.1-dev \
|
||||
--prompt "A logo With Bold Large text: SGL Diffusion" \
|
||||
--save-output \
|
||||
'
|
||||
```
|
||||
|
||||
## Platform-Specific: ROCm (AMD GPUs)
|
||||
|
||||
For AMD Instinct GPUs (e.g., MI300X), you can use the ROCm-enabled Docker image:
|
||||
|
||||
```bash
|
||||
docker run --device=/dev/kfd --device=/dev/dri --ipc=host \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env HF_TOKEN=<secret> \
|
||||
lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \
|
||||
sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output
|
||||
```
|
||||
|
||||
For detailed ROCm system configuration and installation from source, see [AMD GPUs](../platforms/amd_gpu.md).
|
||||
|
||||
## Platform-Specific: MUSA (Moore Threads GPUs)
|
||||
|
||||
For Moore Threads GPUs (MTGPU) with the MUSA software stack:
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone https://github.com/sgl-project/sglang.git
|
||||
cd sglang
|
||||
|
||||
# Install the Python packages
|
||||
pip install --upgrade pip
|
||||
rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
|
||||
pip install -e "python[all_musa]"
|
||||
```
|
||||
|
||||
## Platform-Specific: Ascend NPU
|
||||
|
||||
For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend_npu.md).
|
||||
|
||||
Quick test:
|
||||
|
||||
```bash
|
||||
sglang generate --model-path black-forest-labs/FLUX.1-dev \
|
||||
--prompt "A logo With Bold Large text: SGL Diffusion" \
|
||||
--save-output
|
||||
```
|
||||
@@ -14,6 +14,7 @@ When using the diffusers backend, `--attention-backend` is passed through to dif
|
||||
- **CUDA**: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
|
||||
- **ROCm**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
|
||||
- **MPS**: always uses PyTorch SDPA.
|
||||
- **NPU**: always uses PyTorch SDPA.
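
As a rough illustration, the per-platform defaults listed above amount to a small decision function. The sketch below is not the actual selection code: `default_attention_backend` and its `flash_attn_supported` flag are hypothetical, and only the backend names `fa` and `torch_sdpa` are taken from this page.

```python
def default_attention_backend(platform: str, flash_attn_supported: bool = False) -> str:
    """Hypothetical helper mirroring the platform defaults listed above."""
    platform = platform.lower()
    # CUDA and ROCm prefer FlashAttention and fall back to PyTorch SDPA.
    if platform in ("cuda", "rocm"):
        return "fa" if flash_attn_supported else "torch_sdpa"
    # MPS and NPU always use PyTorch SDPA.
    if platform in ("mps", "npu"):
        return "torch_sdpa"
    raise ValueError(f"unknown platform: {platform}")
```

An explicit `--attention-backend` value would bypass a default like this one entirely.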
|
||||
|
||||
## Backend options
|
||||
|
||||
@@ -29,6 +30,7 @@ For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBa
|
||||
| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
|
||||
| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
|
||||
| `aiter` | `AITER` | Requires `aiter`. |
|
||||
| `sparse_video_gen_2_attn` | `SPARSE_VIDEO_GEN_2_ATTN` | Requires `svg`. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
|
||||
|
||||
## Selection priority
|
||||
|
||||
@@ -47,7 +49,7 @@ Some backends require additional configuration. You can pass these parameters vi
|
||||
|
||||
### Supported Configuration Parameters
|
||||
|
||||
#### Sliding Tile Attention (`sliding_tile_attn`)
|
||||
**Sliding Tile Attention (`sliding_tile_attn`)**
|
||||
|
||||
| Parameter | Type | Description | Default |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
@@ -55,13 +57,13 @@ Some backends require additional configuration. You can pass these parameters vi
|
||||
| `sta_mode` | `str` | Mode of STA. | `STA_inference` |
|
||||
| `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` |
|
||||
|
||||
#### Video Sparse Attention (`video_sparse_attn`)
|
||||
**Video Sparse Attention (`video_sparse_attn`)**
|
||||
|
||||
| Parameter | Type | Description | Default |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | `0.0` |
|
||||
|
||||
#### V-MoBA (`vmoba_attn`)
|
||||
**V-MoBA (`vmoba_attn`)**
|
||||
|
||||
| Parameter | Type | Description | Default |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
@@ -82,16 +84,17 @@ Some backends require additional configuration. You can pass these parameters vi
|
||||
|
||||
## Platform support matrix
|
||||
|
||||
| Backend | CUDA | ROCm | MPS | Notes |
|
||||
|---|---:|---:|---:|---|
|
||||
| `fa` | ✅ | ✅ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
|
||||
| `torch_sdpa` | ✅ | ✅ | ✅ | Most compatible option across platforms. |
|
||||
| `sliding_tile_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
|
||||
| `sage_attn` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
|
||||
| `sage_attn_3` | ✅ | ❌ | ❌ | CUDA-only (optional dependency). |
|
||||
| `video_sparse_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
|
||||
| `vmoba_attn` | ✅ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
|
||||
| `aiter` | ✅ | ❌ | ❌ | Requires `aiter`. |
|
||||
| Backend | CUDA | ROCm | MPS | NPU | Notes |
|
||||
|---|---:|---:|---:|---:|---|
|
||||
| `fa` | ✅ | ✅ | ❌ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
|
||||
| `torch_sdpa` | ✅ | ✅ | ✅ | ✅ | Most compatible option across platforms. |
|
||||
| `sliding_tile_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
|
||||
| `sage_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
|
||||
| `sage_attn_3` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
|
||||
| `video_sparse_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
|
||||
| `vmoba_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
|
||||
| `aiter` | ✅ | ❌ | ❌ | ❌ | Requires `aiter`. |
|
||||
| `sparse_video_gen_2_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `svg`. |
|
||||
|
||||
## Usage
|
||||
|
||||
@@ -1,9 +1,5 @@
|
||||
# Cache-DiT Acceleration
|
||||
|
||||
> **Note**: This is one of two caching strategies available in SGLang.
|
||||
> For an overview of all caching options, see [caching.md](caching.md).
|
||||
> For TeaCache documentation, see [teacache.md](teacache.md).
|
||||
|
||||
SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss.
|
||||
|
||||
## Overview
|
||||
@@ -136,7 +132,7 @@ sglang generate --model-path black-forest-labs/FLUX.1-dev \
|
||||
SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and
|
||||
which to use cached results.
|
||||
|
||||
#### SCM Presets
|
||||
**SCM Presets**
|
||||
|
||||
SCM is configured with presets:
|
||||
|
||||
@@ -148,7 +144,7 @@ SCM is configured with presets:
|
||||
| `fast` | ~35% | ~3x | Acceptable |
|
||||
| `ultra` | ~25% | ~4x | Lower |
|
||||
|
||||
##### Usage
|
||||
**Usage**
|
||||
|
||||
```bash
|
||||
SGLANG_CACHE_DIT_ENABLED=true \
|
||||
@@ -157,7 +153,7 @@ sglang generate --model-path Qwen/Qwen-Image \
|
||||
--prompt "A futuristic cityscape at sunset"
|
||||
```
|
||||
|
||||
#### Custom SCM Bins
|
||||
**Custom SCM Bins**
|
||||
|
||||
For fine-grained control over which steps to compute vs cache:
|
||||
|
||||
@@ -169,7 +165,7 @@ sglang generate --model-path Qwen/Qwen-Image \
|
||||
--prompt "A futuristic cityscape at sunset"
|
||||
```
|
||||
|
||||
#### SCM Policy
|
||||
**SCM Policy**
|
||||
|
||||
| Policy | Env Variable | Description |
|
||||
|-----------|---------------------------------------|---------------------------------------------|
|
||||
@@ -178,22 +174,8 @@ sglang generate --model-path Qwen/Qwen-Image \
|
||||
|
||||
## Environment Variables
|
||||
|
||||
All Cache-DiT parameters can be set via the following environment variables:
|
||||
|
||||
| Environment Variable | Default | Description |
|
||||
|-------------------------------------|---------|------------------------------------------|
|
||||
| `SGLANG_CACHE_DIT_ENABLED` | false | Enable Cache-DiT acceleration |
|
||||
| `SGLANG_CACHE_DIT_FN` | 1 | First N blocks to always compute |
|
||||
| `SGLANG_CACHE_DIT_BN` | 0 | Last N blocks to always compute |
|
||||
| `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching |
|
||||
| `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
|
||||
| `SGLANG_CACHE_DIT_MC` | 3 | Max continuous cached steps |
|
||||
| `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
|
||||
| `SGLANG_CACHE_DIT_TS_ORDER` | 1 | TaylorSeer order (1 or 2) |
|
||||
| `SGLANG_CACHE_DIT_SCM_PRESET` | none | SCM preset (none/slow/medium/fast/ultra) |
|
||||
| `SGLANG_CACHE_DIT_SCM_POLICY` | dynamic | SCM caching policy |
|
||||
| `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins |
|
||||
| `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins |
|
||||
All Cache-DiT parameters can be configured via environment variables.
|
||||
See [Environment Variables](../../environment_variables.md) for the complete list.
|
||||
|
||||
## Supported Models
|
||||
|
||||
@@ -240,4 +222,4 @@ acceleration still works.
|
||||
## References
|
||||
|
||||
- [Cache-DiT](https://github.com/vipshop/cache-dit)
|
||||
- [SGLang Diffusion](../README.md)
|
||||
- [SGLang Diffusion](../index.md)
|
||||
@@ -1,7 +1,7 @@
|
||||
# TeaCache Acceleration
|
||||
|
||||
> **Note**: This is one of two caching strategies available in SGLang.
|
||||
> For an overview of all caching options, see [caching.md](caching.md).
|
||||
> For an overview of all caching options, see [caching](../index.md).
|
||||
|
||||
TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
|
||||
|
||||
72
docs/diffusion/performance/index.md
Normal file
@@ -0,0 +1,72 @@
|
||||
# Performance Optimization
|
||||
|
||||
SGLang-Diffusion provides multiple performance optimization strategies to accelerate inference. This section covers all available performance tuning options.
|
||||
|
||||
## Overview
|
||||
|
||||
| Optimization | Type | Description |
|
||||
|--------------|------|-------------|
|
||||
| **Cache-DiT** | Caching | Block-level caching with DBCache, TaylorSeer, and SCM |
|
||||
| **TeaCache** | Caching | Timestep-level caching using L1 similarity |
|
||||
| **Attention Backends** | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) |
|
||||
| **Profiling** | Diagnostics | PyTorch Profiler and Nsight Systems guidance |
|
||||
|
||||
## Caching Strategies
|
||||
|
||||
SGLang supports two complementary caching approaches:
|
||||
|
||||
### Cache-DiT
|
||||
|
||||
[Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with advanced strategies. It can achieve up to **1.69x speedup**.
|
||||
|
||||
**Quick Start:**
|
||||
```bash
|
||||
SGLANG_CACHE_DIT_ENABLED=true \
|
||||
sglang generate --model-path Qwen/Qwen-Image \
|
||||
--prompt "A beautiful sunset over the mountains"
|
||||
```
|
||||
|
||||
**Key Features:**
|
||||
- **DBCache**: Dynamic block-level caching based on residual differences
|
||||
- **TaylorSeer**: Taylor expansion-based calibration for optimized caching
|
||||
- **SCM**: Step-level computation masking for additional speedup
|
||||
|
||||
See [Cache-DiT Documentation](cache/cache_dit.md) for detailed configuration.
|
||||
|
||||
### TeaCache
|
||||
|
||||
TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
|
||||
|
||||
**Quick Overview:**
|
||||
- Tracks L1 distance between modulated inputs across timesteps
|
||||
- When accumulated distance is below threshold, reuses cached residual
|
||||
- Supports CFG with separate positive/negative caches
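
The skip decision described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not SGLang's implementation: `TeaCacheSketch`, `relative_l1`, and the `threshold` argument are hypothetical names, inputs are flat lists of floats rather than tensors, and any rescaling of the distance or per-branch CFG handling is left out.

```python
from typing import List, Optional


def relative_l1(curr: List[float], prev: List[float]) -> float:
    """Relative L1 distance between two flattened modulated inputs."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1.0  # guard against division by zero
    return num / den


class TeaCacheSketch:
    """Decides per timestep whether to run the model or reuse a cached residual."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev_input: Optional[List[float]] = None

    def should_compute(self, modulated_input: List[float]) -> bool:
        if self.prev_input is None:
            compute = True  # first step: nothing is cached yet
        else:
            self.accumulated += relative_l1(modulated_input, self.prev_input)
            compute = self.accumulated >= self.threshold
        if compute:
            self.accumulated = 0.0  # reset after a full forward pass
        self.prev_input = list(modulated_input)
        return compute


cache = TeaCacheSketch(threshold=0.1)
steps = [[1.0, 1.0], [1.01, 1.01], [1.02, 1.02], [2.0, 2.0]]
decisions = [cache.should_compute(x) for x in steps]
```

In this toy run the two middle steps change the input by about 1% each, so the accumulated distance stays below the threshold and they would reuse the cached residual. The large jump at the last step forces a full computation. For CFG, one such tracker per branch (positive and negative) would be kept.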
|
||||
|
||||
**Supported Models:** Wan (wan2.1, wan2.2), Hunyuan (HunyuanVideo), Z-Image
|
||||
|
||||
See [TeaCache Documentation](cache/teacache.md) for detailed configuration.
|
||||
|
||||
## Attention Backends
|
||||
|
||||
Different attention backends offer varying performance characteristics depending on your hardware and model:
|
||||
|
||||
- **FlashAttention**: Typically the fastest option on NVIDIA GPUs with fp16/bf16
|
||||
- **SageAttention**: Alternative optimized implementation
|
||||
- **xformers**: Memory-efficient attention
|
||||
- **SDPA**: PyTorch native scaled dot-product attention
|
||||
|
||||
See [Attention Backends](attention_backends.md) for platform support and configuration options.
|
||||
|
||||
## Profiling
|
||||
|
||||
To diagnose performance bottlenecks, SGLang-Diffusion supports profiling tools:
|
||||
|
||||
- **PyTorch Profiler**: Built-in Python profiling
|
||||
- **Nsight Systems**: GPU kernel-level analysis
|
||||
|
||||
See [Profiling Guide](profiling.md) for detailed instructions.
|
||||
|
||||
## References
|
||||
|
||||
- [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
|
||||
- [TeaCache Paper](https://arxiv.org/abs/2411.14324)
|
||||
@@ -23,7 +23,7 @@ To add support for a new diffusion model, you will primarily need to define or c
|
||||
|
||||
3. **`ComposedPipeline` (not a config)**: This is the central class where you define the structure of your model's generation pipeline. You will create a new class that inherits from `ComposedPipelineBase` and, within it, instantiate and chain together the necessary `PipelineStage`s in the correct order. See `ComposedPipelineBase` and `PipelineStage` base definitions:
|
||||
- [`ComposedPipelineBase`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines/composed_pipeline_base.py)
|
||||
- [`PipelineStage`]( https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines/stages/base.py)
|
||||
- [`PipelineStage`](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/runtime/pipelines/stages/base.py)
|
||||
- [Central registry (models/config mapping)](https://github.com/sgl-project/sglang/blob/main/python/sglang/multimodal_gen/registry.py)
|
||||
|
||||
4. **Modules (components referenced by the pipeline)**: Each pipeline references a set of modules that are loaded from the model repository (e.g., Diffusers `model_index.json`) and assembled via the registry/loader. Common modules include:
|
||||
@@ -37,7 +37,7 @@ To add support for a new diffusion model, you will primarily need to define or c
|
||||
|
||||
## Available Pipeline Stages
|
||||
|
||||
You can build your custom `ComposedPipeline` by combining the following available stages as your will. Each stage is responsible for a specific part of the generation process.
|
||||
You can build your custom `ComposedPipeline` by combining the following available stages as needed. Each stage is responsible for a specific part of the generation process.
|
||||
|
||||
| Stage Class | Description |
|
||||
| -------------------------------- | ------------------------------------------------------------------------------------------------------- |
|
||||
@@ -45,7 +45,6 @@ You can build your custom `ComposedPipeline` by combining the following availabl
|
||||
| `TextEncodingStage` | Encodes text prompts into embeddings using one or more text encoders. |
|
||||
| `ImageEncodingStage` | Encodes input images into embeddings, often used in image-to-image tasks. |
|
||||
| `ImageVAEEncodingStage` | Specifically encodes an input image into the latent space using a Variational Autoencoder (VAE). |
|
||||
| `ConditioningStage` | Prepares the conditioning tensors (e.g., from text or image embeddings) for the denoising loop. |
|
||||
| `TimestepPreparationStage` | Prepares the scheduler's timesteps for the diffusion process. |
|
||||
| `LatentPreparationStage` | Creates the initial noisy latent tensor that will be denoised. |
|
||||
| `DenoisingStage` | Executes the main denoising loop, iteratively applying the model (e.g., UNet) to refine the latents. |
|
||||
@@ -88,15 +87,13 @@ To illustrate the process, let's look at how `Qwen-Image-Edit` is implemented. T
|
||||
_required_config_modules = ["processor", "scheduler", "text_encoder", "tokenizer", "transformer", "vae"]
|
||||
|
||||
def create_pipeline_stages(self, server_args: ServerArgs):
|
||||
"""Set up pipeline stages sequentially."""
|
||||
self.add_stage(stage_name="input_validation_stage", stage=InputValidationStage())
|
||||
self.add_stage(stage_name="prompt_encoding_stage_primary", stage=ImageEncodingStage(...))
|
||||
self.add_stage(stage_name="image_encoding_stage_primary", stage=ImageVAEEncodingStage(...))
|
||||
self.add_stage(stage_name="timestep_preparation_stage", stage=TimestepPreparationStage(...))
|
||||
self.add_stage(stage_name="latent_preparation_stage", stage=LatentPreparationStage(...))
|
||||
self.add_stage(stage_name="conditioning_stage", stage=ConditioningStage())
|
||||
self.add_stage(stage_name="denoising_stage", stage=DenoisingStage(...))
|
||||
self.add_stage(stage_name="decoding_stage", stage=DecodingStage(...))
|
||||
self.add_stage(InputValidationStage())
|
||||
self.add_stage(ImageEncodingStage(...))
|
||||
self.add_stage(ImageVAEEncodingStage(...))
|
||||
self.add_stage(TimestepPreparationStage(...))
|
||||
self.add_stage(LatentPreparationStage(...))
|
||||
self.add_stage(DenoisingStage(...))
|
||||
self.add_stage(DecodingStage(...))
|
||||
```
|
||||
The pipeline is constructed by adding stages in order. `Qwen-Image-Edit` uses `ImageEncodingStage` (for prompt and image processing) and `ImageVAEEncodingStage` (for latent extraction) before standard denoising and decoding.
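
The ordering requirement can be illustrated with a small, self-contained sketch. Everything in it (`ToyComposedPipeline`, the three stage functions, the dictionary batch) is a hypothetical stand-in and not SGLang's actual `ComposedPipeline` API. It only shows why stages that consume earlier outputs must be added after the stages that produce them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration only; not SGLang's real classes.
Batch = Dict[str, str]
Stage = Callable[[Batch], Batch]


@dataclass
class ToyComposedPipeline:
    """Runs stages strictly in the order they were added."""

    stages: List[Stage] = field(default_factory=list)

    def add_stage(self, stage: Stage) -> None:
        self.stages.append(stage)

    def run(self, batch: Batch) -> Batch:
        for stage in self.stages:
            batch = stage(batch)
        return batch


def encode_image(batch: Batch) -> Batch:
    # Stands in for an image/prompt encoding stage.
    batch["embeds"] = f"embeds({batch['image']})"
    return batch


def vae_encode(batch: Batch) -> Batch:
    # Stands in for a VAE stage that extracts latents.
    batch["latents"] = f"latents({batch['image']})"
    return batch


def denoise(batch: Batch) -> Batch:
    # Consumes the outputs of both earlier stages.
    batch["denoised"] = f"denoise({batch['embeds']}, {batch['latents']})"
    return batch


pipeline = ToyComposedPipeline()
for stage in (encode_image, vae_encode, denoise):
    pipeline.add_stage(stage)

result = pipeline.run({"image": "cat.png"})
```

Adding `denoise` before the encoding stages in this toy version would fail with a `KeyError`, because the keys it reads would not exist yet.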
|
||||
|
||||
@@ -35,8 +35,6 @@ Its core features include:
|
||||
basic_usage/native_api.ipynb
|
||||
basic_usage/sampling_params.md
|
||||
basic_usage/popular_model_usage.rst
|
||||
basic_usage/diffusion_llms.md
|
||||
basic_usage/diffusion.md
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
@@ -74,11 +72,30 @@ Its core features include:
|
||||
:caption: Supported Models
|
||||
|
||||
supported_models/text_generation/index
|
||||
supported_models/image_generation/index
|
||||
supported_models/retrieval_ranking/index
|
||||
supported_models/specialized/index
|
||||
supported_models/extending/index
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:caption: SGLang Diffusion
|
||||
|
||||
diffusion/index
|
||||
diffusion/installation
|
||||
diffusion/compatibility_matrix
|
||||
diffusion/api/cli
|
||||
diffusion/api/openai_api
|
||||
diffusion/performance/index
|
||||
diffusion/performance/attention_backends
|
||||
diffusion/performance/profiling
|
||||
diffusion/performance/cache/index
|
||||
diffusion/performance/cache/cache_dit
|
||||
diffusion/performance/cache/teacache
|
||||
diffusion/support_new_models
|
||||
diffusion/contributing
|
||||
diffusion/ci_perf
|
||||
diffusion/environment_variables
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: Hardware Platforms
|
||||
@@ -113,6 +130,7 @@ Its core features include:
|
||||
references/custom_chat_template.md
|
||||
references/frontend/frontend_index.rst
|
||||
references/post_training_integration.md
|
||||
references/release_lookup
|
||||
references/learn_more.md
|
||||
|
||||
.. toctree::
|
||||
|
||||
@@ -14,21 +14,26 @@ let currentMetricType = 'throughput'; // throughput, latency, ttft, inputThrough
|
||||
|
||||
// Metric type definitions
|
||||
const metricTypes = {
|
||||
throughput: { label: 'Overall Throughput', unit: 'tokens/sec', field: 'throughput' },
|
||||
outputThroughput: { label: 'Output Throughput', unit: 'tokens/sec', field: 'outputThroughput' },
|
||||
inputThroughput: { label: 'Input Throughput', unit: 'tokens/sec', field: 'inputThroughput' },
|
||||
latency: { label: 'Latency', unit: 'ms', field: 'latency' },
|
||||
ttft: { label: 'Time to First Token', unit: 'ms', field: 'ttft' },
|
||||
accLength: { label: 'Accept Length', unit: 'tokens', field: 'accLength', filterInvalid: true }
|
||||
// Text/VLM metrics
|
||||
throughput: { label: 'Overall Throughput', unit: 'tokens/sec', field: 'throughput', type: 'text' },
|
||||
outputThroughput: { label: 'Output Throughput', unit: 'tokens/sec', field: 'outputThroughput', type: 'text' },
|
||||
inputThroughput: { label: 'Input Throughput', unit: 'tokens/sec', field: 'inputThroughput', type: 'text' },
|
||||
latency: { label: 'Latency', unit: 'ms', field: 'latency', type: 'text' },
|
||||
ttft: { label: 'Time to First Token', unit: 'ms', field: 'ttft', type: 'text' },
|
||||
accLength: { label: 'Accept Length', unit: 'tokens', field: 'accLength', filterInvalid: true, type: 'text' },
|
||||
// Diffusion metrics
|
||||
e2eMs: { label: 'End-to-End Time', unit: 'ms', field: 'e2e_ms', type: 'diffusion' },
|
||||
avgDenoiseMs: { label: 'Avg Denoise Time', unit: 'ms', field: 'avg_denoise_ms', type: 'diffusion' },
|
||||
medianDenoiseMs: { label: 'Median Denoise Time', unit: 'ms', field: 'median_denoise_ms', type: 'diffusion' }
|
||||
};
|
||||
|
||||
// Chart.js default configuration for dark theme
|
||||
Chart.defaults.color = '#8b949e';
|
||||
Chart.defaults.borderColor = '#30363d';
|
||||
Chart.defaults.color = '#94a3b8';
|
||||
Chart.defaults.borderColor = '#1e293b';
|
||||
|
||||
const chartColors = [
|
||||
'#58a6ff', '#3fb950', '#d29922', '#f85149', '#a371f7',
|
||||
'#79c0ff', '#56d364', '#e3b341', '#ff7b72', '#bc8cff'
|
||||
'#22d3ee', '#34d399', '#fbbf24', '#f87171', '#a78bfa',
|
||||
'#67e8f9', '#6ee7b7', '#fcd34d', '#fca5a5', '#c4b5fd'
|
||||
];
|
||||
|
||||
// Initialize the dashboard
|
||||
@@ -53,7 +58,7 @@ async function init() {
|
||||
async function loadData() {
|
||||
// Try local server API first (if running server.py)
|
||||
try {
|
||||
const response = await fetch('/api/metrics');
|
||||
const response = await fetch('/api/metrics', { headers: getAuthHeaders() });
|
||||
if (response.ok) {
|
||||
const data = await response.json();
|
||||
if (data.length > 0 && data[0].results && data[0].results.length > 0) {
|
||||
@@ -142,32 +147,51 @@ async function fetchMetricsForRun(run) {
|
||||
}
|
||||
}
|
||||
|
||||
// Helper function to detect if result is diffusion type
|
||||
function isDiffusionResult(result) {
|
||||
return result.test_type === 'diffusion' || (result.tests && !result.benchmarks);
|
||||
}
|
||||
|
||||
// Populate filter dropdowns
|
||||
function populateFilters() {
|
||||
const gpuConfigs = new Set();
|
||||
const models = new Set();
|
||||
const testNames = new Set(); // For diffusion tests
|
||||
const batchSizes = new Set();
|
||||
const ioLengths = new Set();
|
||||
|
||||
allMetricsData.forEach(run => {
|
||||
run.results.forEach(result => {
|
||||
gpuConfigs.add(result.gpu_config);
|
||||
models.add(result.model);
|
||||
// Try new structure first (benchmarks_by_io_len), fall back to flat benchmarks
|
||||
if (result.benchmarks_by_io_len) {
|
||||
Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => {
|
||||
ioLengths.add(ioKey);
|
||||
ioData.benchmarks.forEach(bench => {
|
||||
batchSizes.add(bench.batch_size);
|
||||
|
||||
// Handle diffusion results
|
||||
if (isDiffusionResult(result)) {
|
||||
models.add(result.test_suite || 'diffusion');
|
||||
if (result.tests) {
|
||||
result.tests.forEach(test => {
|
||||
testNames.add(test.test_name);
|
||||
});
|
||||
});
|
||||
} else if (result.benchmarks) {
|
||||
result.benchmarks.forEach(bench => {
|
||||
batchSizes.add(bench.batch_size);
|
||||
if (bench.input_len && bench.output_len) {
|
||||
ioLengths.add(`${bench.input_len}_${bench.output_len}`);
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
// Handle text/VLM results
|
||||
else {
|
||||
models.add(result.model);
|
||||
// Try new structure first (benchmarks_by_io_len), fall back to flat benchmarks
|
||||
if (result.benchmarks_by_io_len) {
|
||||
Object.entries(result.benchmarks_by_io_len).forEach(([ioKey, ioData]) => {
|
||||
ioLengths.add(ioKey);
|
||||
ioData.benchmarks.forEach(bench => {
|
||||
batchSizes.add(bench.batch_size);
|
||||
});
|
||||
});
|
||||
} else if (result.benchmarks) {
|
||||
result.benchmarks.forEach(bench => {
|
||||
batchSizes.add(bench.batch_size);
|
||||
if (bench.input_len && bench.output_len) {
|
||||
ioLengths.add(`${bench.input_len}_${bench.output_len}`);
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
});
|
||||
});
|
||||
@@ -345,7 +369,16 @@ function createMetricTabs() {
|
||||
const tabsContainer = document.getElementById('metric-tabs');
|
||||
tabsContainer.innerHTML = '';
|
||||
|
||||
Object.entries(metricTypes).forEach(([key, metric], index) => {
|
||||
// Detect if current data is diffusion or text
|
||||
const isDiffusion = detectCurrentDataType() === 'diffusion';
|
||||
const dataType = isDiffusion ? 'diffusion' : 'text';
|
||||
|
||||
// Filter metrics based on data type
|
||||
const relevantMetrics = Object.entries(metricTypes).filter(([key, metric]) =>
|
||||
metric.type === dataType
|
||||
);
|
||||
|
||||
relevantMetrics.forEach(([key, metric], index) => {
|
||||
const tab = document.createElement('div');
|
||||
tab.className = index === 0 ? 'tab active' : 'tab';
|
||||
tab.textContent = metric.label;
|
||||
@@ -353,6 +386,31 @@ function createMetricTabs() {
|
||||
tab.onclick = () => selectMetricTab(key, tab);
|
||||
tabsContainer.appendChild(tab);
|
||||
});
|
||||
|
||||
// Set initial metric type
|
||||
if (relevantMetrics.length > 0) {
|
||||
currentMetricType = relevantMetrics[0][0];
|
||||
}
|
||||
}
|
||||
|
||||
function detectCurrentDataType() {
|
||||
// Check if currently selected model/GPU config has diffusion data
|
||||
const gpuFilter = document.getElementById('gpu-filter')?.value;
|
||||
const modelFilter = currentModel;
|
||||
|
||||
if (!gpuFilter || !modelFilter) return 'text';
|
||||
|
||||
for (const run of allMetricsData) {
|
||||
for (const result of run.results) {
|
||||
if (result.gpu_config === gpuFilter) {
|
||||
const resultModel = result.test_suite || result.model;
|
||||
if (resultModel === modelFilter && isDiffusionResult(result)) {
|
||||
return 'diffusion';
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return 'text';
|
||||
}
|
||||
|
||||
function selectMetricTab(metricKey, tabElement) {
|
||||
@@ -374,6 +432,8 @@ function handleModelFilterChange(model) {
|
||||
updateVariantFilter();
|
||||
// Update IO length filter based on new model selection
|
||||
updateIoLenFilter();
|
||||
// Recreate metric tabs in case data type changed (text vs diffusion)
|
||||
createMetricTabs();
|
||||
updateCharts();
|
||||
}
|
||||
|
||||
@@ -383,6 +443,8 @@ function handleGpuFilterChange() {
|
||||
updateVariantFilter();
|
||||
// Update IO length filter based on new GPU selection
|
||||
updateIoLenFilter();
|
||||
// Recreate metric tabs in case data type changed (text vs diffusion)
|
||||
createMetricTabs();
|
||||
updateCharts();
|
||||
}
|
||||
|
||||
@@ -518,6 +580,7 @@ function prepareChartData(gpuFilter, modelFilter, variantFilter, ioLenFilter, ba
|
||||
// Prepare chart data grouped by batch size - each batch size is a separate series
|
||||
function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFilter, batchFilter) {
|
||||
const batchDataMap = new Map(); // batch_size -> Map of variant -> data
|
||||
const testDataMap = new Map(); // For diffusion: test_name -> data
|
||||
|
||||
allMetricsData.forEach(run => {
|
||||
const runDate = new Date(run.run_date);
|
||||
@@ -525,6 +588,37 @@ function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFil
|
||||
run.results.forEach(result => {
|
||||
// Apply filters - GPU and Model are required (no "all" option)
|
||||
if (result.gpu_config !== gpuFilter) return;
|
||||
|
||||
// Handle diffusion results
|
||||
if (isDiffusionResult(result)) {
|
||||
const resultModel = result.test_suite || 'diffusion';
|
||||
if (resultModel !== modelFilter) return;
|
||||
|
||||
if (result.tests) {
|
||||
result.tests.forEach(test => {
|
||||
const testName = test.test_name;
|
||||
if (!testDataMap.has(testName)) {
|
||||
testDataMap.set(testName, {
|
||||
label: testName,
|
||||
data: [],
|
||||
model: resultModel,
|
||||
testName: testName
|
||||
});
|
||||
}
|
||||
|
||||
testDataMap.get(testName).data.push({
|
||||
x: runDate,
|
||||
e2e_ms: test.e2e_ms,
|
||||
avg_denoise_ms: test.avg_denoise_ms,
|
||||
median_denoise_ms: test.median_denoise_ms,
|
||||
runId: run.run_id
|
||||
});
|
||||
});
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
// Handle text/VLM results
|
||||
if (result.model !== modelFilter) return;
|
||||
if (variantFilter !== 'all' && result.variant !== variantFilter) return;
|
||||
|
||||
@@ -622,6 +716,17 @@ function prepareChartDataByBatch(gpuFilter, modelFilter, variantFilter, ioLenFil
|
||||
|
||||
// Sort data points by date and convert to array format
|
||||
const result = {};
|
||||
|
||||
// For diffusion data, use test names as "batch sizes"
|
||||
if (testDataMap.size > 0) {
|
||||
testDataMap.forEach((series, testName) => {
|
||||
series.data.sort((a, b) => a.x - b.x);
|
||||
result[testName] = [series]; // Each test is its own series
|
||||
});
|
||||
return result;
|
||||
}
|
||||
|
||||
// For text/VLM data, use batch sizes
|
||||
batchDataMap.forEach((variantMap, batchSize) => {
|
||||
variantMap.forEach(series => {
|
||||
series.data.sort((a, b) => a.x - b.x);
|
||||
@@ -642,7 +747,16 @@ function updateMetricChart(chartDataByBatch, metricType) {
|
||||
activeCharts = [];
|
||||
|
||||
const metric = metricTypes[metricType];
|
||||
const batchSizes = Object.keys(chartDataByBatch).sort((a, b) => parseInt(a) - parseInt(b));
|
||||
const isDiffusion = metric.type === 'diffusion';
|
||||
|
||||
// For diffusion, keys are test names; for text, keys are batch sizes
|
||||
const keys = Object.keys(chartDataByBatch);
|
||||
if (!isDiffusion) {
|
||||
keys.sort((a, b) => parseInt(a) - parseInt(b));
|
||||
} else {
|
||||
keys.sort(); // Alphabetical sort for test names
|
||||
}
|
||||
const batchSizes = keys; // Keep variable name for compatibility
|
||||
|
||||
if (batchSizes.length === 0) {
|
||||
container.innerHTML = '<div class="no-data">No data available for the selected filters</div>';
|
||||
@@ -682,7 +796,8 @@ function updateMetricChart(chartDataByBatch, metricType) {
|
||||
|
||||
const title = document.createElement('div');
|
||||
title.className = 'batch-chart-title';
|
||||
title.textContent = `Batch Size: ${batchSize}`;
|
||||
// For diffusion, show test name; for text, show batch size
|
||||
title.textContent = isDiffusion ? `Test: ${batchSize}` : `Batch Size: ${batchSize}`;
|
||||
chartWrapper.appendChild(title);
|
||||
|
||||
const chartContainer = document.createElement('div');
|
||||
@@ -726,12 +841,13 @@ function getChartOptions(yAxisLabel) {
|
||||
}
|
||||
},
|
||||
tooltip: {
|
||||
backgroundColor: '#21262d',
|
||||
borderColor: '#30363d',
|
||||
backgroundColor: '#1a2332',
|
||||
borderColor: 'rgba(148, 163, 184, 0.1)',
|
||||
borderWidth: 1,
|
||||
titleFont: { size: 13 },
|
||||
bodyFont: { size: 12 },
|
||||
padding: 12
|
||||
titleFont: { size: 13, family: "'DM Sans', sans-serif" },
|
||||
bodyFont: { size: 12, family: "'JetBrains Mono', monospace" },
|
||||
padding: 14,
|
||||
cornerRadius: 8
|
||||
}
|
||||
},
|
||||
scales: {
|
||||
@@ -744,7 +860,7 @@ function getChartOptions(yAxisLabel) {
|
||||
}
|
||||
},
|
||||
grid: {
|
||||
color: '#21262d'
|
||||
color: 'rgba(148, 163, 184, 0.06)'
|
||||
}
|
||||
},
|
||||
y: {
|
||||
@@ -753,7 +869,7 @@ function getChartOptions(yAxisLabel) {
|
||||
text: yAxisLabel
|
||||
},
|
||||
grid: {
|
||||
color: '#21262d'
|
||||
color: 'rgba(148, 163, 184, 0.06)'
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -832,5 +948,109 @@ function formatNumber(num) {
|
||||
return num.toFixed(1);
|
||||
}
|
||||
|
||||
// Authentication state
|
||||
let authToken = sessionStorage.getItem('dashboard_auth_token') || null;
|
||||
|
||||
// Get auth headers for API requests
|
||||
function getAuthHeaders() {
|
||||
const headers = {};
|
||||
if (authToken) {
|
||||
headers['Authorization'] = `Bearer ${authToken}`;
|
||||
}
|
||||
return headers;
|
||||
}
|
||||
|
||||
// Check if server requires authentication and show/hide login accordingly
|
||||
async function checkAuthAndInit() {
|
||||
const loginOverlay = document.getElementById('login-overlay');
|
||||
const dashboardContainer = document.getElementById('dashboard-container');
|
||||
|
||||
try {
|
||||
const response = await fetch('/api/auth-check');
|
||||
if (response.ok) {
|
||||
const data = await response.json();
|
||||
if (!data.auth_required) {
|
||||
// No auth required - skip login, show dashboard directly
|
||||
loginOverlay.style.display = 'none';
|
||||
dashboardContainer.style.display = 'block';
|
||||
init();
|
||||
return;
|
||||
}
|
||||
}
|
||||
} catch (e) {
|
||||
// Server not available (e.g. static hosting) - skip login
|
||||
loginOverlay.style.display = 'none';
|
||||
dashboardContainer.style.display = 'block';
|
||||
init();
|
||||
return;
|
||||
}
|
||||
|
||||
// Auth is required - check if we have a valid token from a previous session
|
||||
if (authToken) {
|
||||
try {
|
||||
const testResponse = await fetch('/api/metrics', {
|
||||
headers: getAuthHeaders()
|
||||
});
|
||||
if (testResponse.ok) {
|
||||
loginOverlay.style.display = 'none';
|
||||
dashboardContainer.style.display = 'block';
|
||||
init();
|
||||
return;
|
||||
}
|
||||
} catch (e) {
|
||||
// Token invalid or expired
|
||||
}
|
||||
// Clear invalid token
|
||||
authToken = null;
|
||||
sessionStorage.removeItem('dashboard_auth_token');
|
||||
}
|
||||
|
||||
// Show login form
|
||||
loginOverlay.style.display = 'flex';
|
||||
dashboardContainer.style.display = 'none';
|
||||
}
|
||||
|
||||
// Handle login form submission
|
||||
async function handleLogin(event) {
|
||||
event.preventDefault();
|
||||
|
||||
const username = document.getElementById('login-username').value;
|
||||
const password = document.getElementById('login-password').value;
|
||||
const errorEl = document.getElementById('login-error');
|
||||
const loginBtn = document.getElementById('login-btn');
|
||||
|
||||
errorEl.textContent = '';
|
||||
loginBtn.disabled = true;
|
||||
loginBtn.textContent = 'Signing in...';
|
||||
|
||||
try {
|
||||
const response = await fetch('/api/login', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ username, password })
|
||||
});
|
||||
|
||||
const data = await response.json();
|
||||
|
||||
if (response.ok && data.token) {
|
||||
authToken = data.token;
|
||||
sessionStorage.setItem('dashboard_auth_token', authToken);
|
||||
|
||||
document.getElementById('login-overlay').style.display = 'none';
|
||||
document.getElementById('dashboard-container').style.display = 'block';
|
||||
init();
|
||||
} else {
|
||||
errorEl.textContent = data.error || 'Invalid username or password';
|
||||
}
|
||||
} catch (e) {
|
||||
errorEl.textContent = 'Unable to connect to server';
|
||||
} finally {
|
||||
loginBtn.disabled = false;
|
||||
loginBtn.textContent = 'Sign In';
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
||||
|
||||
// Initialize on page load
|
||||
document.addEventListener('DOMContentLoaded', init);
|
||||
document.addEventListener('DOMContentLoaded', checkAuthAndInit);
|
||||
|
||||
File diff suppressed because it is too large
@@ -12,13 +12,19 @@ Usage:
|
||||
python server.py --port 8080
|
||||
python server.py --host 0.0.0.0 # Allow external access
|
||||
python server.py --fetch-on-start
|
||||
python server.py --username admin --password secret # Enable authentication
|
||||
DASHBOARD_USERNAME=admin DASHBOARD_PASSWORD=secret python server.py # Via env vars
|
||||
python server.py --refresh-interval 12 # Auto-refresh data every 12 hours
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import hmac
|
||||
import http.server
|
||||
import io
|
||||
import json
|
||||
import os
|
||||
import secrets
|
||||
import socketserver
|
||||
import threading
|
||||
import time
|
||||
@@ -44,6 +50,47 @@ metrics_cache = {
|
||||
CACHE_TTL = 300 # 5 minutes
|
||||
REQUEST_TIMEOUT = 30 # seconds
|
||||
|
||||
# Authentication configuration (set via CLI flags)
|
||||
auth_config = {
|
||||
"enabled": False,
|
||||
"username": None,
|
||||
"password_hash": None, # SHA-256 hash of the password
|
||||
"active_tokens": {}, # token -> expiry timestamp
|
||||
}
|
||||
auth_lock = threading.Lock()
|
||||
AUTH_TOKEN_TTL = 3600 # 1 hour
|
||||
|
||||
|
||||
def hash_password(password):
|
||||
"""Hash a password using SHA-256 for constant-time comparison."""
|
||||
return hashlib.sha256(password.encode("utf-8")).hexdigest()
|
||||
|
||||
|
||||
def create_auth_token():
|
||||
"""Create a new session token."""
|
||||
token = secrets.token_hex(32)
|
||||
with auth_lock:
|
||||
# Clean up expired tokens
|
||||
now = time.time()
|
||||
auth_config["active_tokens"] = {
|
||||
t: exp for t, exp in auth_config["active_tokens"].items() if exp > now
|
||||
}
|
||||
auth_config["active_tokens"][token] = now + AUTH_TOKEN_TTL
|
||||
return token
|
||||
|
||||
|
||||
def verify_auth_token(token):
|
||||
"""Verify a session token is valid and not expired."""
|
||||
if not token:
|
||||
return False
|
||||
with auth_lock:
|
||||
expiry = auth_config["active_tokens"].get(token)
|
||||
if expiry and expiry > time.time():
|
||||
return True
|
||||
# Remove expired token
|
||||
auth_config["active_tokens"].pop(token, None)
|
||||
return False
|
||||
|
||||
|
||||
def get_github_token():
|
||||
"""Get GitHub token from environment or gh CLI."""
|
||||
@@ -187,12 +234,47 @@ def update_cache_async():
|
||||
metrics_cache["updating"] = False
|
||||
|
||||
|
||||
def start_periodic_refresh(interval_hours):
|
||||
"""Start a background thread that refreshes the cache periodically."""
|
||||
interval_seconds = interval_hours * 3600
|
||||
|
||||
def refresh_loop():
|
||||
while True:
|
||||
time.sleep(interval_seconds)
|
||||
print(f"Periodic refresh triggered (every {interval_hours}h)")
|
||||
update_cache_async()
|
||||
|
||||
thread = threading.Thread(target=refresh_loop, daemon=True)
|
||||
thread.start()
|
||||
print(f"Periodic refresh enabled: every {interval_hours} hours")
|
||||
|
||||
|
||||
class DashboardHandler(http.server.SimpleHTTPRequestHandler):
|
||||
"""HTTP request handler for the dashboard."""
|
||||
|
||||
def __init__(self, *args, directory=None, **kwargs):
|
||||
super().__init__(*args, directory=directory, **kwargs)
|
||||
|
||||
def _send_json(self, data, status=200):
|
||||
"""Send a JSON response."""
|
||||
self.send_response(status)
|
||||
self.send_header("Content-Type", "application/json")
|
||||
self.send_header("Access-Control-Allow-Origin", "*")
|
||||
self.end_headers()
|
||||
self.wfile.write(json.dumps(data).encode())
|
||||
|
||||
def _check_auth(self):
|
||||
"""Check if request is authenticated. Returns True if OK, sends 401 and returns False otherwise."""
|
||||
if not auth_config["enabled"]:
|
||||
return True
|
||||
auth_header = self.headers.get("Authorization", "")
|
||||
if auth_header.startswith("Bearer "):
|
||||
token = auth_header[7:]
|
||||
if verify_auth_token(token):
|
||||
return True
|
||||
self._send_json({"error": "Unauthorized"}, status=401)
|
||||
return False
|
||||
|
||||
def do_GET(self):
|
||||
parsed = urlparse(self.path)
|
||||
|
||||
@@ -201,13 +283,55 @@ class DashboardHandler(http.server.SimpleHTTPRequestHandler):
|
||||
self.send_error(400, "Invalid path")
|
||||
return
|
||||
|
||||
if parsed.path == "/api/metrics":
|
||||
self.handle_metrics_api(parsed)
|
||||
if parsed.path == "/api/auth-check":
|
||||
self.handle_auth_check()
|
||||
elif parsed.path == "/api/metrics":
|
||||
if self._check_auth():
|
||||
self.handle_metrics_api(parsed)
|
||||
elif parsed.path == "/api/refresh":
|
||||
self.handle_refresh_api()
|
||||
if self._check_auth():
|
||||
self.handle_refresh_api()
|
||||
else:
|
||||
super().do_GET()
|
||||
|
||||
def do_POST(self):
|
||||
parsed = urlparse(self.path)
|
||||
|
||||
if parsed.path == "/api/login":
|
||||
self.handle_login()
|
||||
else:
|
||||
self.send_error(404, "Not Found")
|
||||
|
||||
def handle_auth_check(self):
|
||||
"""Tell the frontend whether authentication is required."""
|
||||
self._send_json({"auth_required": auth_config["enabled"]})
|
||||
|
||||
def handle_login(self):
|
||||
"""Validate username/password and return a session token."""
|
||||
content_length = int(self.headers.get("Content-Length", 0))
|
||||
if content_length == 0 or content_length > 4096:
|
||||
self._send_json({"error": "Invalid request"}, status=400)
|
||||
return
|
||||
|
||||
try:
|
||||
body = json.loads(self.rfile.read(content_length))
|
||||
except (json.JSONDecodeError, ValueError):
|
||||
self._send_json({"error": "Invalid JSON"}, status=400)
|
||||
return
|
||||
|
||||
username = body.get("username", "")
|
||||
password = body.get("password", "")
|
||||
|
||||
if hmac.compare_digest(
|
||||
username, auth_config["username"]
|
||||
) and hmac.compare_digest(
|
||||
hash_password(password), auth_config["password_hash"]
|
||||
):
|
||||
token = create_auth_token()
|
||||
self._send_json({"token": token})
|
||||
else:
|
||||
self._send_json({"error": "Invalid username or password"}, status=401)
|
||||
|
||||
def handle_metrics_api(self, parsed):
|
||||
"""Handle /api/metrics endpoint."""
|
||||
# Check cache with thread safety
|
||||
@@ -222,21 +346,12 @@ class DashboardHandler(http.server.SimpleHTTPRequestHandler):
|
||||
# Trigger background update
|
||||
threading.Thread(target=update_cache_async, daemon=True).start()
|
||||
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "application/json")
|
||||
self.send_header("Access-Control-Allow-Origin", "*")
|
||||
self.end_headers()
|
||||
self.wfile.write(json.dumps(data).encode())
|
||||
self._send_json(data)
|
||||
|
||||
def handle_refresh_api(self):
|
||||
"""Handle /api/refresh endpoint."""
|
||||
threading.Thread(target=update_cache_async, daemon=True).start()
|
||||
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "application/json")
|
||||
self.send_header("Access-Control-Allow-Origin", "*")
|
||||
self.end_headers()
|
||||
self.wfile.write(json.dumps({"status": "refreshing"}).encode())
|
||||
self._send_json({"status": "refreshing"})
|
||||
|
||||
def log_message(self, format, *args):
|
||||
"""Custom log format."""
|
||||
@@ -254,8 +369,33 @@ def main():
|
||||
parser.add_argument(
|
||||
"--fetch-on-start", action="store_true", help="Fetch metrics on startup"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--refresh-interval",
|
||||
type=float,
|
||||
default=12,
|
||||
help="Auto-refresh interval in hours (default: 12, set to 0 to disable)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--username",
|
||||
default=os.environ.get("DASHBOARD_USERNAME"),
|
||||
help="Username for dashboard authentication (or set DASHBOARD_USERNAME env var)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--password",
|
||||
default=os.environ.get("DASHBOARD_PASSWORD"),
|
||||
help="Password for dashboard authentication (or set DASHBOARD_PASSWORD env var)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Configure authentication if both username and password are provided
|
||||
if args.username and args.password:
|
||||
auth_config["enabled"] = True
|
||||
auth_config["username"] = args.username
|
||||
auth_config["password_hash"] = hash_password(args.password)
|
||||
print(f"Authentication enabled for user: {args.username}")
|
||||
elif args.username or args.password:
|
||||
parser.error("Both --username and --password must be provided together")
|
||||
|
||||
# Change to dashboard directory
|
||||
dashboard_dir = Path(__file__).parent
|
||||
os.chdir(dashboard_dir)
|
||||
@@ -264,6 +404,9 @@ def main():
|
||||
print("Fetching initial metrics data...")
|
||||
update_cache_async()
|
||||
|
||||
if args.refresh_interval > 0:
|
||||
start_periodic_refresh(args.refresh_interval)
|
||||
|
||||
handler = lambda *a, **kw: DashboardHandler(*a, directory=str(dashboard_dir), **kw)
|
||||
|
||||
with socketserver.TCPServer((args.host, args.port), handler) as httpd:
|
||||
|
||||
@@ -126,7 +126,6 @@ export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
|
||||
unset TASK_QUEUE_ENABLE
|
||||
export SGLANG_NPU_USE_MLAPO=1
|
||||
export SGLANG_USE_FIA_NZ=1
|
||||
export ENABLE_MOE_NZ=1
|
||||
|
||||
# Keep max-running-requests <= max-cuda-graph-bs * dp_size; performance degrades significantly when this value is exceeded.
|
||||
python -m sglang.launch_server \
|
||||
|
||||
39
docs/platforms/ascend_npu_environment_variables.md
Normal file
@@ -0,0 +1,39 @@
# Environment Variables

SGLang supports several Ascend NPU environment variables that configure its runtime behavior.
This document lists the commonly used ones and will be updated over time.

## Directly Used in SGLang

| Environment Variable | Description | Default Value |
|---|---|---|
| `SGLANG_NPU_USE_MLAPO` | Uses the `MLAPO` fusion operator in the attention <br/> preprocessing stage of MLA models. | `false` |
| `SGLANG_USE_FIA_NZ` | Reshapes the KV cache into the FIA NZ format.<br/> Must be enabled together with `SGLANG_NPU_USE_MLAPO`. | `false` |
| `SGLANG_NPU_USE_MULTI_STREAM` | Enables dual-stream computation of shared experts <br/> and routed experts in DeepSeek models.<br/> Also enables dual-stream computation in the DeepSeek NSA Indexer. | `false` |
| `SGLANG_NPU_DISABLE_ACL_FORMAT_WEIGHT` | Disables casting model weight tensors to a specific NPU <br/> ACL format. | `false` |
| `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | The maximum number of tokens dispatched on each rank. | `128` |
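
The dependency between `SGLANG_USE_FIA_NZ` and `SGLANG_NPU_USE_MLAPO` can be checked before launching the server. The sketch below is only an illustration: the helper names `is_enabled` and `check_ascend_env` are hypothetical, and the set of values treated as "enabled" is an assumption, not SGLang's documented parsing rule.

```python
import os
from typing import List, Mapping

# Assumption: these strings count as "enabled"; SGLang's own parsing may differ.
TRUTHY = {"1", "true", "yes", "on"}


def is_enabled(env: Mapping[str, str], name: str) -> bool:
    """Return True if the variable is set to a value we treat as enabled."""
    return env.get(name, "").strip().lower() in TRUTHY


def check_ascend_env(env: Mapping[str, str]) -> List[str]:
    """Hypothetical pre-launch check; returns a list of human-readable problems."""
    problems = []
    if is_enabled(env, "SGLANG_USE_FIA_NZ") and not is_enabled(
        env, "SGLANG_NPU_USE_MLAPO"
    ):
        problems.append(
            "SGLANG_USE_FIA_NZ requires SGLANG_NPU_USE_MLAPO to be enabled"
        )
    return problems


# Check the current process environment before starting the server.
for problem in check_ascend_env(os.environ):
    print(f"warning: {problem}")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in a unit test.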

## Used in DeepEP Ascend

| Environment Variable | Description | Default Value |
|---|---|---|
| `DEEPEP_NORMAL_LONG_SEQ_PER_ROUND_TOKENS` | Enables the ant-moving function in the dispatch stage. Sets <br/> the number of tokens transmitted per round on each rank. | `8192` |
| `DEEPEP_NORMAL_LONG_SEQ_ROUND` | Enables the ant-moving function in the dispatch stage. Sets <br/> the number of rounds transmitted on each rank. | `1` |
| `DEEPEP_NORMAL_COMBINE_ENABLE_LONG_SEQ` | Enables the ant-moving function in the combine stage. <br/> The value `0` means disabled. | `0` |
| `MOE_ENABLE_TOPK_NEG_ONE` | Must be enabled when the expert IDs processed by <br/> DeepEP contain -1. | `0` |
| `DEEP_NORMAL_MODE_USE_INT8_QUANT` | Quantizes x to int8 and returns (tensor, scales) in the dispatch operator. | `0` |

## Others

| Environment Variable | Description | Default Value |
|---|---|---|
| `TASK_QUEUE_ENABLE` | Controls the optimization level of the dispatch queue<br/> for the task_queue operator. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/730/comref/Envvariables/docs/zh/environment_variable_reference/TASK_QUEUE_ENABLE.md) | `1` |
| `INF_NAN_MODE_ENABLE` | Controls whether the chip uses saturation mode or INF_NAN mode. [Detail](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha001/apiref/envref/envref_07_0056.html) | `1` |
| `STREAMS_PER_DEVICE` | Configures the maximum number of streams for the stream pool. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/720/comref/Envvariables/Envir_041.html) | `32` |
| `PYTORCH_NPU_ALLOC_CONF` | Controls the behavior of the cache allocator. <br/>This variable changes memory usage and may cause performance fluctuations. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/700/comref/Envvariables/Envir_012.html) | |
| `ASCEND_MF_STORE_URL` | The address of the config store in MemFabric during PD separation, <br/>which is generally set to the IP address of the P primary node<br/> with an arbitrary port number. | |
| `ASCEND_LAUNCH_BLOCKING` | Controls whether synchronous mode is enabled during operator execution. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/710/comref/Envvariables/Envir_006.html) | `0` |
| `HCCL_OP_EXPANSION_MODE` | Configures the expansion position for communication algorithm scheduling. [Detail](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha001/apiref/envref/envref_07_0094.html) | |
| `HCCL_BUFFSIZE` | Controls the size of the buffer area for shared data between two NPUs. <br/>The unit is MB, and the value must be greater than or equal to 1. [Detail](https://www.hiascend.com/document/detail/zh/Pytorch/60RC3/ptmoddevg/trainingmigrguide/performance_tuning_0047.html) | `200` |
| `HCCL_SOCKET_IFNAME` | Configures the name of the network card used by the Host <br/>during HCCL initialization. [Detail](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/apiref/envvar/envref_07_0075.html) | |
| `GLOO_SOCKET_IFNAME` | Configures the network interface name for GLOO communication. | |
|
||||
194
docs/platforms/ascend_npu_glm5_examples.md
Normal file
@@ -0,0 +1,194 @@
|
||||
# GLM-5 examples
|
||||
|
||||
## Introduction
|
||||
|
||||
The GLM (General Language Model) series is an open-source bilingual large language model family jointly developed by the KEG Laboratory of Tsinghua University and Zhipu AI. The series performs strongly on Chinese NLP tasks, owing to its unified pre-training framework and bilingual capabilities. [GLM-5](https://huggingface.co/zai-org/GLM-5) adopts the DeepSeek-V3/V3.2 architecture, including sparse attention (DSA) and multi-token prediction (MTP). Ascend provides day-0 support for GLM-5 on the SGLang inference framework, with low-code enablement and compatibility with the mainstream distributed parallel capabilities of the current SGLang framework. Developers are welcome to download and try it.
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Model Weight
|
||||
|
||||
- `GLM-5.0` (BF16 version): [Download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
|
||||
- `GLM-5.0-w4a8` (quantized version, without MTP): [Download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
|
||||
- You can use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.
|
||||
|
||||
|
||||
### Installation
|
||||
|
||||
The dependencies required for the NPU runtime environment are bundled into a Docker image that you can pull directly.
|
||||
|
||||
```{code-block} bash
|
||||
#Atlas 800 A3
|
||||
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-a3-glm5
|
||||
#Atlas 800 A2
|
||||
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:cann8.5.0-910b-glm5
|
||||
|
||||
#start container
|
||||
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
|
||||
--net=host \
|
||||
-v /var/queue_schedule:/var/queue_schedule \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /usr/local/sbin:/usr/local/sbin \
|
||||
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
|
||||
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
|
||||
--device=/dev/davinci0:/dev/davinci0 \
|
||||
--device=/dev/davinci1:/dev/davinci1 \
|
||||
--device=/dev/davinci2:/dev/davinci2 \
|
||||
--device=/dev/davinci3:/dev/davinci3 \
|
||||
--device=/dev/davinci4:/dev/davinci4 \
|
||||
--device=/dev/davinci5:/dev/davinci5 \
|
||||
--device=/dev/davinci6:/dev/davinci6 \
|
||||
--device=/dev/davinci7:/dev/davinci7 \
|
||||
--device=/dev/davinci8:/dev/davinci8 \
|
||||
--device=/dev/davinci9:/dev/davinci9 \
|
||||
--device=/dev/davinci10:/dev/davinci10 \
|
||||
--device=/dev/davinci11:/dev/davinci11 \
|
||||
--device=/dev/davinci12:/dev/davinci12 \
|
||||
--device=/dev/davinci13:/dev/davinci13 \
|
||||
--device=/dev/davinci14:/dev/davinci14 \
|
||||
--device=/dev/davinci15:/dev/davinci15 \
|
||||
--device=/dev/davinci_manager:/dev/davinci_manager \
|
||||
--device=/dev/hisi_hdc:/dev/hisi_hdc \
|
||||
--entrypoint=bash \
|
||||
swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:${TAG}
|
||||
```
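The sixteen `--device` flags above follow one pattern, so they can be generated rather than typed by hand, which avoids path typos. The following is a minimal sketch, assuming the NPUs are exposed as `/dev/davinci0` through `/dev/davinci15` as in the command above; the variable name `device_flags` is only illustrative.

```shell
# Build one --device flag per NPU (indices 0..15 are an assumption
# taken from the docker run command above).
device_flags=""
i=0
while [ "$i" -le 15 ]; do
  device_flags="$device_flags --device=/dev/davinci$i:/dev/davinci$i"
  i=$((i + 1))
done
# Print the generated flags for inspection.
echo "$device_flags"
```

Pass `$device_flags` unquoted to `docker run` so that the shell splits it into separate arguments.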
|
||||
|
||||
Note: With this image, you need to update transformers to the main branch.
|
||||
``` shell
|
||||
# reinstall transformers
|
||||
pip install git+https://github.com/huggingface/transformers.git
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
- The quantized model `glm5_w4a8` can be deployed on a single Atlas 800 A3 (64G × 16).
|
||||
|
||||
Run the following script to execute online inference.
|
||||
|
||||
```shell
|
||||
# high performance cpu
|
||||
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
sysctl -w vm.swappiness=0
|
||||
sysctl -w kernel.numa_balancing=0
|
||||
sysctl -w kernel.sched_migration_cost_ns=50000
|
||||
# bind cpu
|
||||
export SGLANG_SET_CPU_AFFINITY=1
|
||||
|
||||
unset https_proxy
|
||||
unset http_proxy
|
||||
unset HTTPS_PROXY
|
||||
unset HTTP_PROXY
|
||||
unset ASCEND_LAUNCH_BLOCKING
|
||||
# cann
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
|
||||
export STREAMS_PER_DEVICE=32
|
||||
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
|
||||
export SGLANG_ENABLE_SPEC_V2=1
|
||||
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
|
||||
export SGLANG_NPU_USE_MULTI_STREAM=1
|
||||
export HCCL_BUFFSIZE=1000
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
export HCCL_SOCKET_IFNAME=lo
|
||||
export GLOO_SOCKET_IFNAME=lo
|
||||
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path $MODEL_PATH \
|
||||
--attention-backend ascend \
|
||||
--device npu \
|
||||
--tp-size 16 --nnodes 1 --node-rank 0 \
|
||||
--chunked-prefill-size 16384 --max-prefill-tokens 280000 \
|
||||
--trust-remote-code \
|
||||
--host 127.0.0.1 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--port 8000 \
|
||||
--served-model-name glm-5 \
|
||||
--cuda-graph-bs 16 \
|
||||
--quantization modelslim \
|
||||
--moe-a2a-backend deepep --deepep-mode auto
|
||||
```
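Once the server is up, a quick request can confirm that it responds. The following is a sketch, assuming the server was started with the script above (`--host 127.0.0.1`, `--port 8000`, `--served-model-name glm-5`) and that the OpenAI-compatible `/v1/chat/completions` endpoint is enabled; `BASE_URL`, the prompt, and `max_tokens` are illustrative values.

```shell
# Address of the server started above; override BASE_URL if yours differs.
BASE_URL="${BASE_URL:-http://127.0.0.1:8000}"

# Minimal chat request; "glm-5" matches --served-model-name in the launch script.
payload='{"model": "glm-5", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'

# Send the request; print a hint instead of failing silently if the server is down.
curl -s --max-time 5 "$BASE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "request failed: is the server running?"
```

For long generations, raise or drop the 5-second `--max-time` limit.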
|
||||
|
||||
### Multi-node Deployment
|
||||
|
||||
- `GLM-5-bf16`: requires at least 2 Atlas 800 A3 (64G × 16).
|
||||
|
||||
**A3 series**
|
||||
|
||||
Set the IP addresses of the two nodes in the script, then run the same script on both nodes.
|
||||
|
||||
**node 0/1**
|
||||
|
||||
```shell
|
||||
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
sysctl -w vm.swappiness=0
|
||||
sysctl -w kernel.numa_balancing=0
|
||||
sysctl -w kernel.sched_migration_cost_ns=50000
|
||||
# bind cpu
|
||||
export SGLANG_SET_CPU_AFFINITY=1
|
||||
|
||||
unset https_proxy
|
||||
unset http_proxy
|
||||
unset HTTPS_PROXY
|
||||
unset HTTP_PROXY
|
||||
unset ASCEND_LAUNCH_BLOCKING
|
||||
# cann
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
|
||||
export STREAMS_PER_DEVICE=32
|
||||
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
|
||||
export SGLANG_ENABLE_SPEC_V2=1
|
||||
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
|
||||
export SGLANG_NPU_USE_MULTI_STREAM=1
|
||||
export HCCL_BUFFSIZE=1000
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
|
||||
# Run ifconfig on both nodes and find the interface whose inet addr matches the node IP. That is the public interface; set its name here.
|
||||
export HCCL_SOCKET_IFNAME=lo
|
||||
export GLOO_SOCKET_IFNAME=lo
|
||||
|
||||
|
||||
P_IP=('your ip1' 'your ip2')
|
||||
P_MASTER="${P_IP[0]}:your port"
|
||||
|
||||
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
|
||||
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
|
||||
for i in "${!P_IP[@]}";
|
||||
do
|
||||
if [[ "$LOCAL_HOST1" == "${P_IP[$i]}" || "$LOCAL_HOST2" == "${P_IP[$i]}" ]];
|
||||
then
|
||||
echo "${P_IP[$i]}"
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path $MODEL_PATH \
|
||||
--attention-backend ascend \
|
||||
--device npu \
|
||||
--tp-size 32 --nnodes 2 --node-rank $i --dist-init-addr $P_MASTER \
|
||||
--chunked-prefill-size 16384 --max-prefill-tokens 131072 \
|
||||
--trust-remote-code \
|
||||
--host 127.0.0.1 \
|
||||
--mem-fraction-static 0.8 \
|
||||
--port 8000 \
|
||||
--served-model-name glm-5 \
|
||||
--cuda-graph-max-bs 16 \
|
||||
--disable-radix-cache
|
||||
NODE_RANK=$i
|
||||
break
|
||||
fi
|
||||
done
|
||||
|
||||
```
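The script comment about choosing `HCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` can be automated. The following is a sketch, assuming the `ip` tool from iproute2 is available and that `NODE_IP` holds this node's address; the helper name `find_ifname` is hypothetical and not part of SGLang or CANN.

```shell
# find_ifname IP: read `ip -o -4 addr show` output on stdin and print the
# name of the interface whose IPv4 address equals IP.
find_ifname() {
  awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# Intended usage (commented out so the sketch runs without `ip` installed):
#   IFNAME=$(ip -o -4 addr show | find_ifname "$NODE_IP")
#   export HCCL_SOCKET_IFNAME="$IFNAME"
#   export GLOO_SOCKET_IFNAME="$IFNAME"
```

In the one-line output of `ip -o -4 addr show`, the second field is the interface name and the fourth is the address with its prefix length, which is why the helper splits on `/`.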
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
Not tested yet.
|
||||
|
||||
### Using Benchmark
|
||||
|
||||
Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md) for details.
|
||||
231
docs/platforms/ascend_npu_qwen3_5_examples.md
Normal file
@@ -0,0 +1,231 @@
|
||||
# Qwen3.5 examples
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
### Installation
|
||||
|
||||
The dependencies required for the NPU runtime environment are bundled into a Docker image hosted on quay.io, which you can pull directly.
|
||||
|
||||
```{code-block} bash
|
||||
#Atlas 800 A3
|
||||
docker pull quay.io/ascend/sglang:v0.5.9-cann8.5.0-a3
|
||||
#Atlas 800 A2
|
||||
docker pull quay.io/ascend/sglang:v0.5.9-cann8.5.0-910b
|
||||
|
||||
#start container
|
||||
docker run -itd --shm-size=16g --privileged=true --name ${NAME} \
|
||||
--net=host \
|
||||
-v /var/queue_schedule:/var/queue_schedule \
|
||||
-v /etc/ascend_install.info:/etc/ascend_install.info \
|
||||
-v /usr/local/sbin:/usr/local/sbin \
|
||||
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
|
||||
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
|
||||
--device=/dev/davinci0:/dev/davinci0 \
|
||||
--device=/dev/davinci1:/dev/davinci1 \
|
||||
--device=/dev/davinci2:/dev/davinci2 \
|
||||
--device=/dev/davinci3:/dev/davinci3 \
|
||||
--device=/dev/davinci4:/dev/davinci4 \
|
||||
--device=/dev/davinci5:/dev/davinci5 \
|
||||
--device=/dev/davinci6:/dev/davinci6 \
|
||||
--device=/dev/davinci7:/dev/davinci7 \
|
||||
--device=/dev/davinci8:/dev/davinci8 \
|
||||
--device=/dev/davinci9:/dev/davinci9 \
|
||||
--device=/dev/davinci10:/dev/davinci10 \
|
||||
--device=/dev/davinci11:/dev/davinci11 \
|
||||
--device=/dev/davinci12:/dev/davinci12 \
|
||||
--device=/dev/davinci13:/dev/davinci13 \
|
||||
--device=/dev/davinci14:/dev/davinci14 \
|
||||
--device=/dev/davinci15:/dev/davinci15 \
|
||||
--device=/dev/davinci_manager:/dev/davinci_manager \
|
||||
--device=/dev/hisi_hdc:/dev/hisi_hdc \
|
||||
--entrypoint=bash \
|
||||
quay.io/ascend/sglang:${tag}
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Single-node Deployment
|
||||
|
||||
Run the following script to execute online inference.
|
||||
|
||||
#### Qwen3.5 397B
|
||||
|
||||
```shell
|
||||
# high performance cpu
|
||||
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
sysctl -w vm.swappiness=0
|
||||
sysctl -w kernel.numa_balancing=0
|
||||
sysctl -w kernel.sched_migration_cost_ns=50000
|
||||
# bind cpu
|
||||
export SGLANG_SET_CPU_AFFINITY=1
|
||||
|
||||
unset https_proxy
|
||||
unset http_proxy
|
||||
unset HTTPS_PROXY
|
||||
unset HTTP_PROXY
|
||||
unset ASCEND_LAUNCH_BLOCKING
|
||||
# cann
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
|
||||
export STREAMS_PER_DEVICE=32
|
||||
export HCCL_BUFFSIZE=1000
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
export HCCL_SOCKET_IFNAME=lo
|
||||
export GLOO_SOCKET_IFNAME=lo
|
||||
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path $MODEL_PATH \
|
||||
--attention-backend ascend \
|
||||
--device npu \
|
||||
--tp-size 16 --nnodes 1 --node-rank 0 \
|
||||
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
|
||||
--disable-radix-cache \
|
||||
--trust-remote-code \
|
||||
--host 127.0.0.1 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--port 8000 \
|
||||
--cuda-graph-bs 16 \
|
||||
--quantization modelslim \
|
||||
--enable-multimodal \
|
||||
--mm-attention-backend ascend_attn \
|
||||
--dtype bfloat16
|
||||
```
|
||||
|
||||
#### Qwen3.5 122B
|
||||
|
||||
```shell
|
||||
# high performance cpu
|
||||
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
sysctl -w vm.swappiness=0
|
||||
sysctl -w kernel.numa_balancing=0
|
||||
sysctl -w kernel.sched_migration_cost_ns=50000
|
||||
# bind cpu
|
||||
export SGLANG_SET_CPU_AFFINITY=1
|
||||
|
||||
unset https_proxy
|
||||
unset http_proxy
|
||||
unset HTTPS_PROXY
|
||||
unset HTTP_PROXY
|
||||
unset ASCEND_LAUNCH_BLOCKING
|
||||
# cann
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
|
||||
export STREAMS_PER_DEVICE=32
|
||||
export HCCL_BUFFSIZE=1000
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
export HCCL_SOCKET_IFNAME=lo
|
||||
export GLOO_SOCKET_IFNAME=lo
|
||||
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path $MODEL_PATH \
|
||||
--attention-backend ascend \
|
||||
--device npu \
|
||||
--tp-size 8 --nnodes 1 --node-rank 0 \
|
||||
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
|
||||
--disable-radix-cache \
|
||||
--trust-remote-code \
|
||||
--host 127.0.0.1 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--port 8000 \
|
||||
--cuda-graph-bs 16 \
|
||||
--quantization modelslim \
|
||||
--enable-multimodal \
|
||||
--mm-attention-backend ascend_attn \
|
||||
--dtype bfloat16
|
||||
```
|
||||
|
||||
#### Qwen3.5 35B
|
||||
|
||||
```shell
|
||||
# high performance cpu
|
||||
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
sysctl -w vm.swappiness=0
|
||||
sysctl -w kernel.numa_balancing=0
|
||||
sysctl -w kernel.sched_migration_cost_ns=50000
|
||||
# bind cpu
|
||||
export SGLANG_SET_CPU_AFFINITY=1
|
||||
|
||||
unset https_proxy
|
||||
unset http_proxy
|
||||
unset HTTPS_PROXY
|
||||
unset HTTP_PROXY
|
||||
unset ASCEND_LAUNCH_BLOCKING
|
||||
# cann
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
|
||||
export STREAMS_PER_DEVICE=32
|
||||
export HCCL_BUFFSIZE=1000
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
export HCCL_SOCKET_IFNAME=lo
|
||||
export GLOO_SOCKET_IFNAME=lo
|
||||
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path $MODEL_PATH \
|
||||
--attention-backend ascend \
|
||||
--device npu \
|
||||
--tp-size 2 --nnodes 1 --node-rank 0 \
|
||||
--chunked-prefill-size 4096 --max-prefill-tokens 280000 \
|
||||
--disable-radix-cache \
|
||||
--trust-remote-code \
|
||||
--host 127.0.0.1 \
|
||||
--mem-fraction-static 0.7 \
|
||||
--port 8000 \
|
||||
--cuda-graph-bs 16 \
|
||||
--quantization modelslim \
|
||||
--enable-multimodal \
|
||||
--mm-attention-backend ascend_attn \
|
||||
--dtype bfloat16
|
||||
```
|
||||
|
||||
#### Qwen3.5 27B
|
||||
|
||||
```shell
|
||||
# high performance cpu
|
||||
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
sysctl -w vm.swappiness=0
|
||||
sysctl -w kernel.numa_balancing=0
|
||||
sysctl -w kernel.sched_migration_cost_ns=50000
|
||||
# bind cpu
|
||||
export SGLANG_SET_CPU_AFFINITY=1
|
||||
|
||||
unset https_proxy
|
||||
unset http_proxy
|
||||
unset HTTPS_PROXY
|
||||
unset HTTP_PROXY
|
||||
unset ASCEND_LAUNCH_BLOCKING
|
||||
# cann
|
||||
source /usr/local/Ascend/ascend-toolkit/set_env.sh
|
||||
source /usr/local/Ascend/nnal/atb/set_env.sh
|
||||
|
||||
export STREAMS_PER_DEVICE=32
|
||||
export HCCL_BUFFSIZE=1000
|
||||
export HCCL_OP_EXPANSION_MODE=AIV
|
||||
export HCCL_SOCKET_IFNAME=lo
|
||||
export GLOO_SOCKET_IFNAME=lo
|
||||
|
||||
python3 -m sglang.launch_server \
|
||||
--model-path $MODEL_PATH \
|
||||
--attention-backend ascend \
|
||||
--device npu \
|
||||
--tp-size 2 \
|
||||
--chunked-prefill-size -1 --max-prefill-tokens 120000 \
|
||||
--disable-radix-cache \
|
||||
--trust-remote-code \
|
||||
--host 127.0.0.1 \
|
||||
--mem-fraction-static 0.8 \
|
||||
--port 8000 \
|
||||
--cuda-graph-bs 32 \
|
||||
--enable-multimodal \
|
||||
--mm-attention-backend ascend_attn
|
||||
```
|
||||
|
||||
### Prefill-Decode Disaggregation
|
||||
|
||||
Not tested yet.
|
||||
|
||||
### Using Benchmark
|
||||
|
||||
Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md) for details.
|
||||
@@ -12,3 +12,6 @@ Ascend NPUs
|
||||
mindspore_backend.md
|
||||
ascend_contribution_guide.md
|
||||
ascend_npu_best_practice.md
|
||||
ascend_npu_qwen3_5_examples.md
|
||||
ascend_npu_glm5_examples.md
|
||||
ascend_npu_environment_variables.md
|
||||
|
||||
@@ -17,8 +17,8 @@ SGLang supports various environment variables that can be used to configure its
|
||||
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | `20` |
|
||||
| `SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL` | The interval of passes to collect the metric of selected count of physical experts on each layer and GPU rank. 0 means disabled. | `0` |
|
||||
| `SGLANG_FORWARD_UNKNOWN_TOOLS` | Forward unknown tool calls to clients instead of dropping them | `false` (drop unknown tools) |
|
||||
| `SGLANG_QUEUED_TIMEOUT_MS` | Timeout (in ms) for requests in the waiting queue | `-1` |
|
||||
| `SGLANG_FORWARD_TIMEOUT_MS` | Timeout (in ms) for requests in the forward batch | `-1` |
|
||||
| `SGLANG_REQ_WAITING_TIMEOUT` | Timeout (in seconds) for requests waiting in the queue before being scheduled | `-1` |
|
||||
| `SGLANG_REQ_RUNNING_TIMEOUT` | Timeout (in seconds) for requests running in the decode batch | `-1` |
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
@@ -76,6 +76,7 @@ SGLang supports various environment variables that can be used to configure its
|
||||
| --- | --- | --- |
|
||||
| `SGLANG_MORI_FP8_DISP` | Use FP8 for dispatch | `"false"` |
|
||||
| `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` | Maximum number of dispatch tokens per rank for MORI-EP buffer allocation | `4096` |
|
||||
| `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` | Threshold for switching between `InterNodeV1` and `InterNodeV1LL` kernel types. `InterNodeV1LL` is used if `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK` is less than or equal to this threshold; otherwise, `InterNodeV1` is used. | `256` |
|
||||
| `SGLANG_MORI_QP_PER_TRANSFER` | Number of RDMA Queue Pairs (QPs) used per transfer operation | `1` |
|
||||
| `SGLANG_MORI_POST_BATCH_SIZE` | Number of RDMA work requests posted in a single batch to each QP | `-1` |
|
||||
| `SGLANG_MORI_NUM_WORKERS` | Number of worker threads in the RDMA executor thread pool | `1` |
|
||||
|
||||
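The selection rule described for `SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD` can be written out as a small check. This is only an illustration of the documented rule, not the actual SGLang implementation; the function name `select_kernel_type` is hypothetical, and the fallback values are the defaults listed in the table.

```shell
# select_kernel_type MAX_TOKENS THRESHOLD: print the kernel type implied by
# the documented rule (InterNodeV1LL when MAX_TOKENS <= THRESHOLD).
select_kernel_type() {
  if [ "$1" -le "$2" ]; then
    echo "InterNodeV1LL"
  else
    echo "InterNodeV1"
  fi
}

# Fall back to the documented defaults (4096 and 256) when the variables are unset.
select_kernel_type \
  "${SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK:-4096}" \
  "${SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD:-256}"
```

With both variables unset, the defaults give 4096 > 256, so the rule selects `InterNodeV1`.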