Mirror of https://github.com/kvcache-ai/sglang.git, synced 2026-05-14 01:34:58 +00:00.
Squash of 5 kt-side commits on top of upstream main: the four v4-2604B
follow-up fixes (env defaults flip, topk kwargs drop, hf-transformers
backup gate, SwiGLU torch.compile collapse) plus the dsv4-plugin-redo
refactor. Three other v4-2604B fixes (bf16 cpu_buf, mxfp4 SwiGLU clamp,
kt swiglu_limit) are already squash-merged into upstream as part of
PR #44 (6ac4f82e8) and were skipped during rebase.
Goal: prevent DSV4 module bugs / missing dependencies from breaking
non-DSV4 model loads (Qwen, GLM, Kimi, etc.), while preserving full
DSV4-Flash functionality on supported hardware.
## Architecture
Four new plugin-registry modules in base sglang that DSV4 self-attaches
to at module-load time:
- layers/moe/quant_method_registry.py — chained-wrap MoE quant methods
  (mxfp4_deepseek priority=10, kt_ep priority=20)
- mem_cache/pool_registry.py — KV pool factory dispatch
  (DeepSeekV4TokenToKVPool)
- managers/coordinator_registry.py — request coordinator factory
  (HiSparseCoordinator)
- managers/forward_hooks_registry.py — scheduler / runner lifecycle
  event dispatch (HiSparse hook adapter)
DSV4 plugin entry: models/deepseek_v4.py runs side-effect imports of
kt_ep_wrapper, mxfp4_deepseek, hisparse_coordinator,
deepseekv4_memory_pool. Each of those self-registers; non-DSV4 models
never trigger these imports when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn.
## isinstance → duck-type tags
Replaced 12+ isinstance(_, DSV4Class) checks across base files with
class-attribute tags (_quant_wrapper_id, _is_v4_token_pool,
_is_dsv4_backend_radix). Base files no longer need to import DSV4
classes just to test object identity.
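The tag names (_is_v4_token_pool etc.) are from this commit; the surrounding classes and check helper below are a hypothetical sketch of the pattern:

```python
# Hypothetical sketch: class-attribute tags instead of isinstance checks,
# so base files never import DSV4 classes just to test object identity.

class DeepSeekV4TokenToKVPool:       # lives in the DSV4 plugin
    _is_v4_token_pool = True         # duck-type tag (name from this commit)

class PlainTokenToKVPool:            # upstream pool, carries no tag
    pass

def is_v4_pool(pool) -> bool:
    # Base-side check: reads the tag, falls back to False for any
    # object that doesn't carry it.
    return getattr(pool, "_is_v4_token_pool", False)
```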
## Bundled v4-2604B fixes (originally separate commits)
- environ.py: flip 25 SGLANG_DSV4_* / SGLANG_OPT_* env defaults to OFF
so non-DSV4 paths default to upstream behavior.
- moe/topk.py: drop 2 kwargs from the fused_topk call in the
  select_experts else-branch; PR #38 had left them incompatible with
  non-DSV4 callers.
- utils/hf_transformers_utils.py: gate the deepseek backup-config path
on _peek_is_deepseek_arch so non-deepseek models (Qwen3, GLM, Kimi)
with no top-level num_hidden_layers don't RuntimeError on startup.
- moe/fused_moe_triton/fused_moe.py: collapse the 60-line DSV4-specific
SwiGLU clamp branch in fused_experts_impl down to 5 lines via a
reused torch.compile dispatch helper (_swiglu_clamp_silu_mul).
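For reference, a scalar sketch of one plausible clamped-SwiGLU formulation (modeled on a gpt-oss-style swiglu limit: gate clamped from above, up clamped symmetrically). The real _swiglu_clamp_silu_mul is a torch.compile'd tensor helper and its exact clamp form may differ:

```python
import math

def swiglu_clamp_silu_mul(gate: float, up: float, limit: float = 7.0) -> float:
    """Clamped SwiGLU for one element (hypothetical formulation):
    silu(min(gate, limit)) * (clip(up, -limit, limit) + 1).
    """
    g = min(gate, limit)                 # clamp gate from above only
    u = max(-limit, min(up, limit))      # clamp up symmetrically
    silu_g = g / (1.0 + math.exp(-g))    # silu(g) = g * sigmoid(g)
    return silu_g * (u + 1.0)
```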
## Pre-existing PR #38 bugs surfaced and fixed in this branch
- configs/deepseek_v4.py: was `@dataclass`-decorated a second time by
  transformers v5+ PretrainedConfig.__init_subclass__, which in some
  builds stripped the field(default_factory=...) defaults from
  quantization_config / rope_scaling / compress_ratios, causing
  `'Field' object has no attribute 'to_dict'` at runtime. Rewritten
  to the traditional __init__ kwargs idiom used by all other sglang
  configs (afmoe, chatglm, dbrx, bailing_hybrid, ...).
- utils/hf_transformers_utils.py: _load_deepseek_temp_model hardcoded
config_json["model_type"] = "deepseek_v3" even for V4 architecture,
causing AutoConfig to resolve transformers' DeepseekV3Config (which
doesn't expose rope_theta / compress_rope_theta / compress_ratios at
the top level in transformers-kt 5.6.0). Now picks "deepseek_v4"
for DeepseekV4ForCausalLM architecture.
- models/deepseek_v2.py: SGLANG_DSV4_MODE=2604 in operator's shell
caused config.num_hash_layers AttributeError on non-DSV4 configs
inheriting DeepseekV2MoE (e.g., GlmMoeDsaConfig). Now gated on
is_deepseek_compressed(config).
- models/deepseek_v4.py: side-effect plugin imports wrapped in
try/except so a sibling failure (e.g., flashinfer < 0.6.9 trips the
module-load version check in mxfp4_deepseek) doesn't block
DeepseekV4ForCausalLM from registering with ModelRegistry.
- _V4MoE subclass replaces is_deepseek_v4 boolean flag pollution in
DeepseekV2MoE — V4 NextN draft layers bypass hash MoE via
_compute_is_hash override.
## Robustness fixes from E2E hardware testing
- Triton kernels MXFP4 path: force num_stages=2 in
triton_kernels.opt_flags constraints to defend against the bare
`assert num_stages >= 1` for capabilities outside the tested matrix.
- launch_server.py: sweep stale ninja `lock` / `.ninja_lock` files
under ~/.cache/torch_extensions older than 30 minutes (configurable
via SGLANG_STALE_LOCK_AGE_MINUTES) so a SIGKILL'd build doesn't wedge
the next launch indefinitely.
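A sketch of the stale-lock sweep. The cache location, 30-minute default, and SGLANG_STALE_LOCK_AGE_MINUTES override are from this commit; the function name and glob patterns are assumptions:

```python
import os
import time
from pathlib import Path

def sweep_stale_locks(cache_dir: Path, now=None):
    """Remove ninja `lock` / `*.ninja_lock` files under cache_dir older
    than SGLANG_STALE_LOCK_AGE_MINUTES (default 30), so a SIGKILL'd
    build cannot wedge the next launch. Returns the removed paths."""
    max_age_s = 60 * int(os.environ.get("SGLANG_STALE_LOCK_AGE_MINUTES", "30"))
    now = time.time() if now is None else now
    removed = []
    for pattern in ("**/lock", "**/*.ninja_lock"):
        for lock in cache_dir.glob(pattern):
            if now - lock.stat().st_mtime > max_age_s:
                lock.unlink(missing_ok=True)
                removed.append(lock)
    return removed
```

Called at launch_server.py startup against ~/.cache/torch_extensions before any extension build begins.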
## Verified
- E2E pass on Qwen2.5-7B, Qwen3.5-FP8, Qwen3.5-35B-A3B-FP8,
Qwen3-Coder-Next-FP8, Kimi-K2.5 (non-DSV4 models, hardware confirmed).
- E2E pass on DeepSeek-V4-Flash with TP=8, MXFP4 routed experts, KT-EP
CPU/GPU split, hash-MoE, NSA sparse attention (after pinning
flashinfer>=0.6.9, apache-tvm-ffi==0.1.9, tilelang>=0.1.8).
- 0 DSV4 modules in sys.modules when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn — DSV4 plugin failures cannot affect
non-DSV4 startup.
- pyproject.toml unchanged: drop-in replacement for kt-sglang pre-DSV4
packaging.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>