Benjamin F 43ed1ec77a refactor(dsv4): isolate DeepSeek V4 Flash behind plugin registries (#47)
Squash of 5 kt-side commits on top of upstream main: the four v4-2604B
follow-up fixes (env defaults flip, topk kwargs drop, hf-transformers
backup gate, SwiGLU torch.compile collapse) plus the dsv4-plugin-redo
refactor. Three other v4-2604B fixes (bf16 cpu_buf, mxfp4 SwiGLU clamp,
kt swiglu_limit) are already squash-merged into upstream as part of
PR #44 (6ac4f82e8) and were skipped during rebase.

Goal: prevent DSV4 module bugs / missing dependencies from breaking
non-DSV4 model loads (Qwen, GLM, Kimi, etc.), while preserving full
DSV4-Flash functionality on supported hardware.

## Architecture

Four new plugin-registry modules in base sglang that DSV4 self-attaches
to at module-load time (a minimal sketch of the registry pattern follows
the list):

- layers/moe/quant_method_registry.py     — chained-wrap MoE quant methods
                                            (mxfp4_deepseek priority=10,
                                             kt_ep priority=20)
- mem_cache/pool_registry.py              — KV pool factory dispatch
                                            (DeepSeekV4TokenToKVPool)
- managers/coordinator_registry.py        — request coordinator factory
                                            (HiSparseCoordinator)
- managers/forward_hooks_registry.py      — scheduler / runner lifecycle
                                            event dispatch (HiSparse hook
                                            adapter)
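
All four registries follow the same shape: a module-level table plus a
register hook, so base code can dispatch without importing any DSV4
class. Below is a minimal sketch of the chained-wrap quant-method
registry; only the module path and the two priorities come from this
commit, while `register_quant_wrapper` / `wrap_quant_method` and their
signatures are illustrative assumptions:

```python
# layers/moe/quant_method_registry.py -- minimal sketch, not the real file.
from typing import Callable, Dict, Tuple

# name -> (priority, factory). A factory takes the current quant method and
# the layer, returning either a wrapped method or its input unchanged.
_WRAPPERS: Dict[str, Tuple[int, Callable]] = {}

def register_quant_wrapper(name: str, priority: int):
    def deco(factory: Callable):
        _WRAPPERS[name] = (priority, factory)
        return factory
    return deco

def wrap_quant_method(quant_method, layer):
    # Chained wrap in ascending priority: mxfp4_deepseek (10) before kt_ep (20).
    for _prio, factory in sorted(_WRAPPERS.values(), key=lambda pf: pf[0]):
        quant_method = factory(quant_method, layer)
    return quant_method
```

A plugin self-registers at import time, e.g. by decorating its factory
with `@register_quant_wrapper("kt_ep", priority=20)`; that is what makes
the side-effect imports below sufficient.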

DSV4 plugin entry: models/deepseek_v4.py runs side-effect imports of
kt_ep_wrapper, mxfp4_deepseek, hisparse_coordinator,
deepseekv4_memory_pool. Each of those self-registers; non-DSV4 models
never trigger these imports when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn.
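
A sketch of the entry file's shape; the four plugin module names come
from this commit, while the package paths are assumptions:

```python
# models/deepseek_v4.py -- plugin entry (import paths illustrative).
from sglang.srt.layers.moe import kt_ep_wrapper          # noqa: F401
from sglang.srt.layers.moe import mxfp4_deepseek         # noqa: F401
from sglang.srt.managers import hisparse_coordinator     # noqa: F401
from sglang.srt.mem_cache import deepseekv4_memory_pool  # noqa: F401

# Each imported module ends with a self-registration call (assumed form):
#   register_quant_wrapper("mxfp4_deepseek", priority=10)(Mxfp4MoEMethod)
# Importing models/deepseek_v4.py therefore wires the whole plugin up;
# skipping the arch keeps all of these modules out of sys.modules.
```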

## isinstance → duck-type tags

Replaced 12+ isinstance(_, DSV4Class) checks across base files with
class-attribute tags (_quant_wrapper_id, _is_v4_token_pool,
_is_dsv4_backend_radix). Base files no longer need to import DSV4
classes just to test object identity.
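
A self-contained sketch of the tag idiom, using the _is_v4_token_pool
tag from the list above (the class bodies are stand-ins):

```python
class TokenToKVPool:  # stand-in for the real base pool class
    pass

# Plugin side (deepseekv4_memory_pool): the class carries a tag attribute.
class DeepSeekV4TokenToKVPool(TokenToKVPool):
    _is_v4_token_pool = True

# Base side, replacing `isinstance(pool, DeepSeekV4TokenToKVPool)`:
def is_v4_pool(pool) -> bool:
    return getattr(pool, "_is_v4_token_pool", False)

assert is_v4_pool(DeepSeekV4TokenToKVPool()) and not is_v4_pool(TokenToKVPool())
```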

## Bundled v4-2604B fixes (originally separate commits)

- environ.py: flip 25 SGLANG_DSV4_* / SGLANG_OPT_* env defaults to OFF
  so non-DSV4 paths default to upstream behavior.
- moe/topk.py: drop the two kwargs from the fused_topk call in
  select_experts's else branch that PR #38 left incompatible with
  non-DSV4 callers.
- utils/hf_transformers_utils.py: gate the deepseek backup-config path
  on _peek_is_deepseek_arch so non-deepseek models (Qwen3, GLM, Kimi)
  with no top-level num_hidden_layers don't RuntimeError on startup.
- moe/fused_moe_triton/fused_moe.py: collapse the 60-line DSV4-specific
  SwiGLU clamp branch in fused_experts_impl down to 5 lines via a
  reused torch.compile dispatch helper (_swiglu_clamp_silu_mul; sketched
  after this list).
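
A hedged sketch of what the reused helper could look like; only the name
_swiglu_clamp_silu_mul comes from the commit, and the clamp semantics
(swiglu_limit-style clamping of the gate/up halves) are an assumption:

```python
import torch
import torch.nn.functional as F

# Compiling once at module scope is what lets each call site in
# fused_experts_impl shrink to a few lines.
@torch.compile(dynamic=True)
def _swiglu_clamp_silu_mul(gate_up: torch.Tensor, limit: float) -> torch.Tensor:
    gate, up = gate_up.chunk(2, dim=-1)  # gate/up halves of the MoE projection
    gate = gate.clamp(max=limit)         # clamp before the nonlinearity
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up
```

Fusing clamp + SiLU + multiply into one compiled region also avoids
extra memory round-trips between the three elementwise ops.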

## Pre-existing PR #38 bugs surfaced and fixed in this branch

- configs/deepseek_v4.py: the config class was effectively
  double-`@dataclass`-decorated (once explicitly, once via transformers
  v5+ PretrainedConfig.__init_subclass__), which in some builds left raw
  dataclasses.Field objects from default_factory(...) on
  quantization_config / rope_scaling / compress_ratios, causing
  `'Field' object has no attribute 'to_dict'` at runtime. Rewritten
  to the traditional __init__ kwargs idiom, matching all other sglang
  configs (afmoe, chatglm, dbrx, bailing_hybrid, ...).
- utils/hf_transformers_utils.py: _load_deepseek_temp_model hardcoded
  config_json["model_type"] = "deepseek_v3" even for V4 architecture,
  causing AutoConfig to resolve transformers' DeepseekV3Config (which
  doesn't expose rope_theta / compress_rope_theta / compress_ratios at
  the top level in transformers-kt 5.6.0). Now picks "deepseek_v4"
  for DeepseekV4ForCausalLM architecture.
- models/deepseek_v2.py: SGLANG_DSV4_MODE=2604 in the operator's shell
  caused a config.num_hash_layers AttributeError for non-DSV4 configs
  whose models reuse DeepseekV2MoE (e.g., GlmMoeDsaConfig). Now gated
  on is_deepseek_compressed(config).
- models/deepseek_v4.py: side-effect plugin imports wrapped in
  try/except so a sibling failure (e.g., flashinfer < 0.6.9 trips the
  module-load version check in mxfp4_deepseek) doesn't block
  DeepseekV4ForCausalLM from registering with ModelRegistry.
- _V4MoE subclass replaces the is_deepseek_v4 boolean-flag pollution in
  DeepseekV2MoE; V4 NextN draft layers bypass hash MoE via a
  _compute_is_hash override (see the sketch after this list).
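
A runnable sketch of the subclass-over-flag pattern from the last
bullet; the names _V4MoE and _compute_is_hash come from this commit,
while the constructor arguments, the route() helper, and the hash-layer
criterion are illustrative assumptions:

```python
class DeepseekV2MoE:
    def _compute_is_hash(self, layer_id: int) -> bool:
        return False  # base class: hash MoE never applies, no V4 flag needed

    def route(self, layer_id: int) -> str:
        return "hash-moe" if self._compute_is_hash(layer_id) else "dense-moe"

class _V4MoE(DeepseekV2MoE):
    def __init__(self, num_hash_layers: int, is_nextn_draft: bool = False):
        self.num_hash_layers = num_hash_layers
        self.is_nextn_draft = is_nextn_draft

    def _compute_is_hash(self, layer_id: int) -> bool:
        # NextN draft layers bypass hash MoE entirely.
        return (not self.is_nextn_draft) and layer_id < self.num_hash_layers

assert _V4MoE(4).route(0) == "hash-moe"
assert _V4MoE(4, is_nextn_draft=True).route(0) == "dense-moe"
```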

## Robustness fixes from E2E hardware testing

- Triton kernels MXFP4 path: force num_stages=2 in the
  triton_kernels.opt_flags constraints so the bare
  `assert num_stages >= 1` cannot trip on compute capabilities outside
  the tested matrix.
- launch_server.py: sweep stale ninja `lock` / `.ninja_lock` files
  under ~/.cache/torch_extensions older than 30 minutes (configurable
  via SGLANG_STALE_LOCK_AGE_MINUTES) so a SIGKILL'd build doesn't wedge
  the next launch indefinitely (sketched below).
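
A sketch of the sweep (the directory and env var come from the commit;
the exact lock-file patterns and function name are assumptions):

```python
import os
import time
from pathlib import Path

def sweep_stale_build_locks() -> None:
    """Run once at launch, before any torch extension build starts."""
    max_age_min = float(os.environ.get("SGLANG_STALE_LOCK_AGE_MINUTES", "30"))
    cutoff = time.time() - max_age_min * 60
    ext_root = Path.home() / ".cache" / "torch_extensions"
    if not ext_root.is_dir():
        return
    for pattern in ("**/lock", "**/*.ninja_lock"):
        for lock in ext_root.glob(pattern):
            try:
                if lock.is_file() and lock.stat().st_mtime < cutoff:
                    lock.unlink()  # left behind by a SIGKILL'd build
            except OSError:
                pass  # raced with a live build; leave it alone
```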

## Verified

- E2E pass on Qwen2.5-7B, Qwen3.5-FP8, Qwen3.5-35B-A3B-FP8,
  Qwen3-Coder-Next-FP8, Kimi-K2.5 (non-DSV4 models, hardware confirmed).
- E2E pass on DeepSeek-V4-Flash with TP=8, MXFP4 routed experts, KT-EP
  CPU/GPU split, hash-MoE, NSA sparse attention (after pinning
  flashinfer>=0.6.9, apache-tvm-ffi==0.1.9, tilelang>=0.1.8).
- 0 DSV4 modules in sys.modules when SGLANG_DISABLED_MODEL_ARCHS skips
  deepseek_v4 / deepseek_v4_nextn — DSV4 plugin failures cannot affect
  non-DSV4 startup.
- pyproject.toml unchanged: drop-in replacement for kt-sglang pre-DSV4
  packaging.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>




Blog | Documentation | Roadmap | Join Slack | Weekly Dev Meeting | Slides

News

  • [2026/01] 🔥 SGLang Diffusion accelerates video and image generation (blog).
  • [2025/12] SGLang provides day-0 support for latest open models (MiMo-V2-Flash, Nemotron 3 Nano, Mistral Large 3, LLaDA 2.0 Diffusion LLM, MiniMax M2).
  • [2025/10] 🔥 SGLang now runs natively on TPU with the SGLang-Jax backend (blog).
  • [2025/09] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput (blog).
  • [2025/09] SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention (blog).
  • [2025/08] SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking (Roadmap, Large-scale EP, Highlights, AITER/MoRI, Wave).
More
  • [2025/10] PyTorch Conference 2025 SGLang Talk (slide).
  • [2025/10] SGLang x Nvidia SF Meetup on 10/2 (recap).
  • [2025/08] SGLang provides day-0 support for OpenAI gpt-oss model (instructions)
  • [2025/06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z (a16z blog).
  • [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs (blog).
  • [2025/06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput (blog).
  • [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X (AMD blog)
  • [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine (PyTorch blog)
  • [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU (AMD blog)
  • [2025/01] SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. (instructions, AMD blog, 10+ other companies)
  • [2024/12] v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs (blog).
  • [2024/10] The First SGLang Online Meetup (slides).
  • [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision (blog).
  • [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) (blog).
  • [2024/02] SGLang enables 3x faster JSON decoding with compressed finite state machine (blog).
  • [2024/01] SGLang provides up to 5x faster inference with RadixAttention (blog).
  • [2024/01] SGLang powers the serving of the official LLaVA v1.6 release demo (usage).

About

SGLang is a high-performance serving framework for large language models and multimodal models. It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters. Its core features include:

  • Fast Runtime: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
  • Broad Model Support: Supports a wide range of language models (Llama, Qwen, DeepSeek, Kimi, GLM, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse), reward models (Skywork), and diffusion models (WAN, Qwen-Image), with easy extensibility for adding new models. Compatible with most Hugging Face models and OpenAI APIs.
  • Extensive Hardware Support: Runs on NVIDIA GPUs (GB200/B300/H100/A100/Spark), AMD GPUs (MI355/MI300), Intel Xeon CPUs, Google TPUs, Ascend NPUs, and more.
  • Active Community: SGLang is open-source and supported by a vibrant community with widespread industry adoption, powering over 400,000 GPUs worldwide.
  • RL & Post-Training Backbone: SGLang is a proven rollout backend for reinforcement learning, with native RL integrations and adoption by well-known post-training frameworks such as AReaL, Miles, slime, Tunix, verl, and more.

Getting Started

Benchmark and Performance

Learn more in the release blogs: v0.2 blog, v0.3 blog, v0.4 blog, Large-scale expert parallelism, GB200 rack-scale parallelism.

Adoption and Sponsorship

SGLang has been deployed at large scale, generating trillions of tokens in production each day. It is trusted and adopted by a wide range of leading enterprises and institutions, including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, Oracle Cloud, Google Cloud, Microsoft Azure, AWS, Atlas Cloud, Voltage Park, Nebius, DataCrunch, Novita, InnoMatrix, MIT, UCLA, the University of Washington, Stanford, UC Berkeley, Tsinghua University, Jam & Tea Studios, Baseten, and other major technology organizations across North America and Asia. As an open-source LLM inference engine, SGLang has become the de facto industry standard, with deployments running on over 400,000 GPUs worldwide. SGLang is currently hosted under the non-profit open-source organization LMSYS.


Contact Us

For enterprises interested in adopting or deploying SGLang at scale, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at sglang@lmsys.org.

Acknowledgment

We learned the design and reused code from the following projects: Guidance, vLLM, LightLLM, FlashInfer, Outlines, and LMQL.
