refactor(dsv4): isolate DeepSeek V4 Flash behind plugin registries (#47)
Squash of 5 kt-side commits on top of upstream main: the four v4-2604B
follow-up fixes (env defaults flip, topk kwargs drop, hf-transformers
backup gate, SwiGLU torch.compile collapse) plus the dsv4-plugin-redo
refactor. Three other v4-2604B fixes (bf16 cpu_buf, mxfp4 SwiGLU clamp,
kt swiglu_limit) are already squash-merged into upstream as part of
PR #44 (6ac4f82e8) and were skipped during rebase.

Goal: prevent DSV4 module bugs / missing dependencies from breaking
non-DSV4 model loads (Qwen, GLM, Kimi, etc.), while preserving full
DSV4-Flash functionality on supported hardware.

## Architecture

Four new plugin-registry modules in base sglang that DSV4 self-attaches
to at module-load time (a minimal registry sketch follows the list):

- layers/moe/quant_method_registry.py     — chained-wrap MoE quant methods
                                            (mxfp4_deepseek priority=10,
                                             kt_ep priority=20)
- mem_cache/pool_registry.py              — KV pool factory dispatch
                                            (DeepSeekV4TokenToKVPool)
- managers/coordinator_registry.py        — request coordinator factory
                                            (HiSparseCoordinator)
- managers/forward_hooks_registry.py      — scheduler / runner lifecycle
                                            event dispatch (HiSparse hook
                                            adapter)
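
A minimal, hypothetical sketch of the chained-wrap pattern the first
registry uses; the function names and signatures below are assumptions,
not the real quant_method_registry.py API:

```python
from typing import Callable, List, Tuple

_WRAPPERS: List[Tuple[int, str, Callable]] = []

def register_quant_wrapper(name: str, priority: int, wrapper: Callable) -> None:
    """Called at import time by plugins, e.g. mxfp4_deepseek (priority=10)
    and kt_ep (priority=20)."""
    _WRAPPERS.append((priority, name, wrapper))

def wrap_quant_method(base_method):
    """Base sglang calls this once; lower priorities wrap first, so they
    end up innermost in the chain."""
    method = base_method
    for _priority, _name, wrapper in sorted(_WRAPPERS, key=lambda t: t[0]):
        method = wrapper(method)  # a wrapper may return `method` unchanged
    return method
```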

DSV4 plugin entry: models/deepseek_v4.py runs side-effect imports of
kt_ep_wrapper, mxfp4_deepseek, hisparse_coordinator,
deepseekv4_memory_pool. Each of those self-registers; non-DSV4 models
never trigger these imports when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn.
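
The plugin entry can be sketched as a loop of side-effect imports. Only
the four module names come from this commit; the flat import paths and
logger handling below are assumptions (the try/except hardening is noted
again in the bug list further down):

```python
import importlib
import logging

logger = logging.getLogger(__name__)

_DSV4_PLUGIN_MODULES = [
    "kt_ep_wrapper",           # self-registers the kt_ep quant wrapper
    "mxfp4_deepseek",          # self-registers the mxfp4_deepseek wrapper
    "hisparse_coordinator",    # self-registers HiSparseCoordinator + hooks
    "deepseekv4_memory_pool",  # self-registers DeepSeekV4TokenToKVPool
]

for _mod in _DSV4_PLUGIN_MODULES:
    try:
        importlib.import_module(_mod)  # imported purely for side effects
    except Exception as exc:
        # A sibling plugin failure must not block DeepseekV4ForCausalLM
        # from registering with ModelRegistry.
        logger.warning("DSV4 plugin %s failed to load: %s", _mod, exc)
```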

## isinstance → duck-type tags

Replaced 12+ isinstance(_, DSV4Class) checks across base files with
class-attribute tags (_quant_wrapper_id, _is_v4_token_pool,
_is_dsv4_backend_radix). Base files no longer need to import DSV4
classes just to test object identity.
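
The mechanics in sketch form; the attribute and class names are from
this commit, while the helper function is hypothetical:

```python
# Before: the base file had to import the DSV4 class just to test identity:
#     from ...deepseekv4_memory_pool import DeepSeekV4TokenToKVPool
#     if isinstance(pool, DeepSeekV4TokenToKVPool): ...

# After: the DSV4 class carries a tag, and base files only read the attribute.
class DeepSeekV4TokenToKVPool:          # lives in the DSV4 plugin module
    _is_v4_token_pool = True

def is_v4_token_pool(pool) -> bool:     # hypothetical base-side helper
    return getattr(pool, "_is_v4_token_pool", False)
```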

## Bundled v4-2604B fixes (originally separate commits)

- environ.py: flip 25 SGLANG_DSV4_* / SGLANG_OPT_* env defaults to OFF
  so non-DSV4 paths default to upstream behavior.
- moe/topk.py: drop the two kwargs from the fused_topk call in the
  select_experts else branch that PR #38 left incompatible with
  non-DSV4 callers.
- utils/hf_transformers_utils.py: gate the deepseek backup-config path
  on _peek_is_deepseek_arch so non-deepseek models (Qwen3, GLM, Kimi)
  with no top-level num_hidden_layers don't raise a RuntimeError on
  startup.
- moe/fused_moe_triton/fused_moe.py: collapse the 60-line DSV4-specific
  SwiGLU clamp branch in fused_experts_impl down to 5 lines via a
  reused torch.compile dispatch helper (_swiglu_clamp_silu_mul).
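
A hedged sketch of what that reused helper might look like; only the
_swiglu_clamp_silu_mul name comes from the commit, and the clamp
semantics below are an assumption modeled on the usual limited-SwiGLU
formulation:

```python
import torch
import torch.nn.functional as F

@torch.compile(dynamic=True)
def _swiglu_clamp_silu_mul(gate_up: torch.Tensor, limit: float) -> torch.Tensor:
    # Split the fused gate/up projection, clamp per the swiglu_limit,
    # then apply SiLU(gate) * up in one compiled kernel.
    gate, up = gate_up.chunk(2, dim=-1)
    gate = gate.clamp(max=limit)
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up
```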

## Pre-existing PR #38 bugs surfaced and fixed in this branch

- configs/deepseek_v4.py: transformers v5+'s
  PretrainedConfig.__init_subclass__ applied `@dataclass` a second
  time, which in some builds stripped the default_factory(...) defaults
  from quantization_config / rope_scaling / compress_ratios and caused
  `'Field' object has no attribute 'to_dict'` at runtime. Rewritten in
  the traditional __init__-kwargs idiom used by all other sglang
  configs (afmoe, chatglm, dbrx, bailing_hybrid, ...).
- utils/hf_transformers_utils.py: _load_deepseek_temp_model hardcoded
  config_json["model_type"] = "deepseek_v3" even for V4 architecture,
  causing AutoConfig to resolve transformers' DeepseekV3Config (which
  doesn't expose rope_theta / compress_rope_theta / compress_ratios at
  the top level in transformers-kt 5.6.0). Now picks "deepseek_v4"
  for DeepseekV4ForCausalLM architecture.
- models/deepseek_v2.py: SGLANG_DSV4_MODE=2604 in the operator's shell
  caused an AttributeError on config.num_hash_layers for non-DSV4
  configs (e.g., GlmMoeDsaConfig) whose models reuse DeepseekV2MoE.
  Now gated on is_deepseek_compressed(config).
- models/deepseek_v4.py: side-effect plugin imports wrapped in
  try/except so a sibling failure (e.g., flashinfer < 0.6.9 trips the
  module-load version check in mxfp4_deepseek) doesn't block
  DeepseekV4ForCausalLM from registering with ModelRegistry.
- A _V4MoE subclass replaces the is_deepseek_v4 boolean-flag pollution
  in DeepseekV2MoE; V4 NextN draft layers bypass hash MoE via a
  _compute_is_hash override.
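
The subclass shape, as a sketch; the class and method names are from
the commit, while the constructor and the num_hash_layers condition are
assumptions:

```python
import torch

class DeepseekV2MoE(torch.nn.Module):
    """Base MoE: no is_deepseek_v4 flag; hash MoE is simply never taken."""
    def _compute_is_hash(self, layer_id: int) -> bool:
        return False

class _V4MoE(DeepseekV2MoE):
    """All V4-only routing behavior lives here, not in base-class branches."""
    def __init__(self, config, layer_id: int, is_nextn: bool = False):
        super().__init__()
        self.config, self.layer_id, self.is_nextn = config, layer_id, is_nextn

    def _compute_is_hash(self, layer_id: int) -> bool:
        if self.is_nextn:          # NextN draft layers bypass hash MoE
            return False
        # Assumed condition: only the first num_hash_layers use hash MoE.
        return layer_id < getattr(self.config, "num_hash_layers", 0)
```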

## Robustness fixes from E2E hardware testing

- Triton kernels MXFP4 path: force num_stages=2 in
  triton_kernels.opt_flags constraints to defend against the bare
  `assert num_stages >= 1` for capabilities outside the tested matrix.
- launch_server.py: sweep stale ninja `lock` / `.ninja_lock` files
  under ~/.cache/torch_extensions older than 30 minutes (configurable
  via SGLANG_STALE_LOCK_AGE_MINUTES) so a SIGKILL'd build doesn't wedge
  the next launch indefinitely.
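
The sweep is small enough to sketch; the env var name and 30-minute
default are from the commit, while the lock-file glob patterns below
are assumptions:

```python
import os
import time
from pathlib import Path

def sweep_stale_ninja_locks() -> None:
    max_age_min = float(os.environ.get("SGLANG_STALE_LOCK_AGE_MINUTES", "30"))
    cutoff = time.time() - max_age_min * 60.0
    ext_dir = Path.home() / ".cache" / "torch_extensions"
    for pattern in ("**/lock", "**/*.ninja_lock"):
        for lock in ext_dir.glob(pattern):
            try:
                if lock.is_file() and lock.stat().st_mtime < cutoff:
                    lock.unlink()  # stale lock left by a SIGKILL'd build
            except OSError:
                pass  # racing process already removed it; ignore
```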

## Verified

- E2E pass on Qwen2.5-7B, Qwen3.5-FP8, Qwen3.5-35B-A3B-FP8,
  Qwen3-Coder-Next-FP8, Kimi-K2.5 (non-DSV4 models, hardware confirmed).
- E2E pass on DeepSeek-V4-Flash with TP=8, MXFP4 routed experts, KT-EP
  CPU/GPU split, hash-MoE, NSA sparse attention (after pinning
  flashinfer>=0.6.9, apache-tvm-ffi==0.1.9, tilelang>=0.1.8).
- 0 DSV4 modules in sys.modules when SGLANG_DISABLED_MODEL_ARCHS skips
  deepseek_v4 / deepseek_v4_nextn, so DSV4 plugin failures cannot
  affect non-DSV4 startup (see the check sketched after this list).
- pyproject.toml unchanged: drop-in replacement for kt-sglang pre-DSV4
  packaging.
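
One way the sys.modules claim can be checked in a subprocess; this is a
test sketch, not the project's actual verification, and the env var
value format, the import that populates the registry, and the substring
match are all assumptions:

```python
import subprocess
import sys

probe = (
    "import os, sys;"
    "os.environ['SGLANG_DISABLED_MODEL_ARCHS'] = 'deepseek_v4,deepseek_v4_nextn';"
    "import sglang.srt.models;"  # assumed to populate ModelRegistry
    "bad = [m for m in sys.modules if 'deepseek_v4' in m];"
    "sys.exit(1 if bad else 0)"
)
assert subprocess.run([sys.executable, "-c", probe]).returncode == 0
```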

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>