6716 Commits

Author SHA1 Message Date
Benjamin
5d6bef9f61 fix(v4): auto-bump swa_full_tokens_ratio to fit chunks_in_flight (#60) 2026-06-25 16:19:52 +08:00
usrlocalben
1344f647a6 fix: kt_ep_wrapper silently fails to import (#59) 2026-06-24 12:11:25 +08:00
chenghanke
37eecb4e99 Enable DeepSeek V4 Flash inference on Ampere GPUs (#58)
* feat: enable DeepSeek V4 Flash inference on Ampere (SM_86) GPUs without FP8

Ampere GPUs (SM_86, e.g. RTX 3090) lack native FP8 support, preventing
DeepSeek V4's compressed attention from running. This commit adds BF16
fallback paths throughout the V4 Flash attention pipeline.

Changes ported from deepseek-v4-ampere branch, excluding all MTP
(Multi-Token Prediction) speculative decoding code.

=== BF16 KV Cache Mode ===
- mem_cache/deepseekv4_memory_pool.py: Add use_bf16_cache parameter,
  SetBf16KAndS Triton kernel, BF16 byte/token size calculations
- model_runner_kv_cache_mixin.py: BF16 cell size calculation for
  compressed attention
- memory_profiler.py: BF16 memory profile calculations

=== V4 Flash NSA/Compressed Attention BF16 Kernels ===
- nsa/v4_triton_kernel.py: FP8 LUT decode for SM<90 + full BF16 sparse
  attention kernel (decode_sparse_attention_bf16)
- nsa/index_buf_accessor_v4.py: SetBf16KAndS class with Triton kernel
  and torch fallback; NopeBf16RopeBf16Pack data class
- nsa/quant_k_cache_v4.py: BF16 quantize/dequantize path
- nsa/triton_kernel.py: SM version check for act_quant fallback
- nsa/index_buf_accessor.py: SM version check fallback to vanilla path
- compressed/indexer.py: tilelang-based BF16 paged MQA logits kernels
  (bf16_paged_mqa_logits_tilelang, bf16_direct_paged_mqa_logits_tilelang)
  + torch fallback for CUDA graph safety
- compressed/compressor.py: SM-based dispatch for BF16 vs FP8 compressor
- compressed/paged_prefill.py: SWA BF16 cache store dispatch
- debug_flash_mla_adapter.py: BF16 decode dispatch
- deepseek_v4_backend_radix.py: BF16 SWA key storage dispatch

=== SM_89 (Ada Lovelace) Capability Guard Fixes ===
- All compute capability checks use cc < (8, 9) tuple comparison
  instead of major < 9, correctly excluding SM_89 (RTX 4090) which
  has native FP8 support and should not use BF16 fallbacks

=== Function Call / Reasoning Parser ===
- function_call/deepseekv4_detector.py: JSON format + structural tag
- function_call/function_call_parser.py: Register DeepSeekV4Detector
- parser/reasoning_parser.py: DeepSeekV3Detector with explicit_thinking

=== Infrastructure ===
- cuda_graph_runner.py: Fix KTMoEWrapper.set_capture_batch_sizes() to
  pass num_tokens (bs * num_tokens_per_bs) instead of num_seqs, since
  KExpertsCPUBuffer.get_buffer() indexes by hidden_states.shape[0]
- server_args.py: Document V4 compressed attention FP8 dtype requirement

* revert: restore upstream docstring and remove unused logging in DeepSeekV4Detector

Revert the class docstring to the concise upstream version and remove
unused  / .
Keep  as the only functional addition.

Co-Authored-By: Claude <noreply@anthropic.com>

* Update python/sglang/srt/layers/attention/compressed/paged_prefill.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Benjamin <159887351+yyj6666667@users.noreply.github.com>
2026-06-22 13:53:55 +08:00
Benjamin
8b636f9008 Feat/minimax m3 (#56)
* feat(minimax-m3): port MiniMax-M3 to kvcache base + initial MXFP8 SM90 path

Port MiniMax M3 inference support from trial-m3 (PR #27944) to the
kvcache-ai/sglang main branch, plus the runtime fixes and MXFP8-on-SM90
enablement landed in subsequent bring-up patches.

Core port (model + sparse attention + JIT + CUDA + pools + parsers)
- minimax_m3.py (1924 lines), VL variant (minimax_m3_vl.py), VL config.
- Sparse attention: MiniMaxHybridAttnBackend, MiniMaxSparseAttnBackend
  with MSA (SM100), piecewise CUDA graph, TopK block selection.
- JIT kernels: QK norm + RoPE, decode TopK, KV index store, UE8M0 quant.
- CUDA kernels: fused QK norm, fused KV store, decode TopK, per-token quant.
- Memory pool: MiniMaxSparseKVPool (dual-pool: main KV + sparse index).
- KT integration: swiglu_alpha plumbing in kt_ep_wrapper + server_args.
- Function-call parser: MinimaxM3Detector (]<]minimax[>[ token format).
- Reasoning parser: MiniMaxM3Detector (<mm:think> blocks).
- MXFP8 MoE configs: H100/H20/H200/A800 tuning for E=128/129.
- Tests: JIT kernel, sparse pool, function call, E2E, AMD benchmarks.
- Infrastructure: forward_context, cuda_graph_config, fused_moe (upstream).

Initial MXFP8 SM90 enablement (block-fp8 conversion pipeline)
- MXFP8 -> block-FP8 [128,128] conversion pipeline + M3 model defaults
  (loader converts raw MXFP8 weights to block-FP8 at load time).
- Allow Triton MoE backend for MXFP8 on SM90 (Hopper / H20 / H100):
  Fp8MoEMethod accepts MXFP8 weights, runs the block-FP8 path after
  conversion.
- Disable piecewise CUDA graph for the MXFP8 block-fp8 conversion path
  (the conversion runs once at load, but piecewise capture re-traced
  the load step).
- Adapt KT EP wrapper for yyj6666667/ktransformers feat/mxfp8-moe-m3:
  CPU-side MXFP8 byte-copy + GPU shadow gpu_method holding block-FP8
  view, so hybrid forwards take the same Fp8MoEMethod path as pure GPU.

Runtime fixes for kvcache base
- get_rope_config compatibility, dense-sparse decode guard for the
  hybrid attn backend, VL processor tokenizer access on kvcache's
  Processor wrapper.

Source: yyj6666667/sglang trial-m3 (3db7af4d4) / sgl-project/sglang PR
#27944. Target: kvcache-ai/sglang main (51032b712).

This commit is the consolidation of the original 7 bring-up commits
(3073753, 8c597a7, 3b9dbfe, bd6c876, cc84e6e, b712433, 77d514a) — see
archive/pre-port-squash-20260618 for the pre-squash history.

* fix(minimax-m3): port upstream infra so M3 launches on kvcache base

Bring-up patches surfacing as the M3 launch pipeline crashes through:
module-load → ServerArgs → model construct → attn backend → KV pool →
CUDA graph. Each fix is the smallest delta from upstream M3 expectations
to the current kvcache fork.

env (environ.py)
- Register all env vars M3 references at __init__ / capture time
  (EnvBool(False) to keep current default behaviour):
  SGLANG_OPT_USE_MINIMAX_FUSED_QKNORM_ROPE,
  SGLANG_OPT_USE_MINIMAX_FUSED_KV_INDEX_STORE,
  SGLANG_OPT_USE_MINIMAX_DECODE_TOPK_RADIX,
  SGLANG_OPT_USE_BF16_ROUTER_GEMM,
  SGLANG_OPT_USE_MINIMAX_DENSE_SPARSE_DECODE,
  SGLANG_DISABLE_MSA.

moe utils (layers/moe/utils.py)
- Backport upstream get_moe_padding_size(is_aiter_moe) + AITER/TRITON
  padding constants.

triton_utils import paths (moe_runner/triton_utils/{fused_moe,mxfp8_moe}.py)
- Rewrite imports to use existing .fused_moe_triton.* symbols (sibling
  .moe_runner.triton_utils.* modules don't exist on kvcache).
- Wrap sglang.jit_kernel.activation import in try/except (M3 only needs
  swiglu_no_interleaved_with_alpha_and_limit, a 5-line torch.compile fn).

mxfp8_native.py
- Port upstream native MXFP8 quant module (mxfp8_e4m3_quantize consumed
  by mxfp8_moe.py).

schedule_batch.MultimodalProcessorOutput
- Add upstream dataclass; M3 VL processor returns it from process().

FusedMoE.__init__
- Accept interleaved: bool = False; assert against True (kvcache triton
  MoE assumes non-interleaved W13).

KV pool selection (model_runner_kv_cache_mixin.py)
- Instantiate MiniMaxSparseKVPool directly when is_minimax_sparse(hf_config);
  pool_registry's M3 registration is gated behind is_v4_model and never
  fires for M3.

MiniMaxSparse/HybridAttnBackend
- Add init_forward_metadata wrapper and legacy capture/replay adapters
  that translate to upstream's _out_graph(in_capture=) + _in_graph split.

KT-EP wrapper (kt_ep_wrapper.py)
- Remove duplicate swiglu_alpha kwarg and out-of-scope hf_config reference
  in create_kt_config_from_server_args (SyntaxError took out KT-EP, every
  rank fell back to standard FP8 weight allocation and OOM'd at TP=8).

cuda_graph_config (cuda_graph_config.py)
- check_cuda_graph_backend reads flat fields (disable_cuda_graph,
  enable_piecewise_cuda_graph, ...) when the aggregate cuda_graph_config
  object is absent (kvcache keeps the cuda-graph args flat).

* fix(minimax-m3): support apply_routed_scaling_factor_on_output in fused_topk

select_experts asserted `not apply_routed_scaling_factor_on_output` on
the non-grouped / non-sqrtsoftplus paths (topk.py:1024). M3's MoE passes
`apply_routed_scaling_factor_on_output=True` to TopK, which falls into
the fused_topk branch (scoring_func != "sqrtsoftplus"), so capture-time
forward hit AssertionError.

fused_topk itself does not take the kwarg. The semantic is just
"multiply topk_weights by routed_scaling_factor on the output side", so
apply it manually after the fused_topk call.

* refactor(minimax-m3): read swiglu alpha/limit from MoeRunnerConfig, drop CLI flags

kt_ep_wrapper.py
- KTEPWrapperMethod.create_weights() now pulls gemm1_alpha /
  gemm1_clamp_limit / swiglu_limit from layer.moe_runner_config
  (which the model file populates from HF config — minimax_m3.py
  forwards config.swiglu_alpha / swiglu_limit; deepseek_v2.py
  forwards config.swiglu_limit). This is the same data source the
  GPU runners already use, so CPU and GPU experts can no longer
  drift to different alpha/limit values (the silu-vs-swigluoai
  mismatch from 8469f58a83).
- Remove KTConfig.swiglu_alpha / swiglu_limit fields and their
  wiring in create_kt_config_from_server_args.
- Drop the SGLANG_DSV4_2604_SUBMODE env fallback for swiglu_limit:
  DSV4-2604B's limit=10 already flows through MoeRunnerConfig via
  deepseek_v2.py, no env override needed. The unrelated 2604B
  path-checker bump later in the file is kept.

server_args.py
- Delete --kt-swiglu-alpha / --kt-swiglu-limit CLI flags and their
  ServerArgs fields. Launch scripts that passed them (M3 hybrid)
  must drop the two flags after upgrading.

Verified on qjh005 (TP=4 GPUs 4-7, no CLI flags): kt-kernel's
[M3-PROBE-2] / [M3-PROBE-3] prints show layer_idx=4
swiglu_alpha=1.702 swiglu_limit=7.0 reaching the wrapper, matching
M3's HF config defaults.

* feat(minimax-m3): native MXFP8 path + shared-expert routing on SM90

Previously M3 on H20 (SM90) was forced through the lossy MXFP8 -> block-FP8
[128,128] convert path, collapsing 294,912 mxfp8 scales (uint8 ue8m0 [1,32])
per expert w13 into 576 block-fp8 scales (fp32). M3 was trained with mxfp8
granularity; the squash + a missing shared-expert routing step + a missing
M3 thinking_mode branch combined to give garbage output (0% GSM8K on pure
sglang). Net result of this commit: pure sglang GSM8K 87%, hybrid
(--kt-num-gpu-experts=8 + layerwise prefill) 89% over 200 questions.

Fp8MoEMethod / Fp8LinearMethod (layers/quantization/fp8.py)
  Add use_mxfp8_native_sm90 gate (CUDA + SM90 + use_mxfp8 + not SM100+).
  When True: process_weights_after_loading keeps weights raw MXFP8 (no
  convert), apply() routes to get_triton_quant_info -> TritonMoeQuantInfo
  (use_mxfp8=True) for MoE and to dot_scaled_mxfp8_blockscaled_linear (the
  Triton tl.dot_scaled path in mxfp8_native.py, no SM100 swizzle) for
  dense linear. Add a no-op _process_mxfp8_linear_weight_scale so M3's
  MiniMaxM3FusedQKVIndex no longer crashes when its scale-finalize hook
  runs on the raw-MXFP8 path.

TritonMoeQuantInfo / TritonRunnerCore / fused_experts_none_to_triton
(layers/moe/moe_runner/triton.py)
  Add use_mxfp8 bool field (default False; non-MXFP8 callers unaffected).
  Both run() and the registered fused-func dispatch route to
  triton_utils/mxfp8_moe.py:fused_experts_mxfp8 (formerly dead code: zero
  callers) when use_mxfp8 is set. interleaved=False is hardcoded since
  M3's FusedMoE asserts it.

KT-EP layerwise prefill (layers/moe/kt_ep_wrapper.py)
  Byte-copy MXFP8 only; shadow gpu_method stays in MXFP8 view so apply()
  takes the same native MXFP8 path as the steady-state hybrid forward.

Shared-expert routing (layers/moe/topk.py)
  Installed sgl_kernel.topk_sigmoid is 5-arg, doesn't accept
  num_fused_shared_experts. Without intervention M3's shared expert
  (id = num_local_experts = 128) was never routed to, killing the
  always-active component. select_experts now appends the shared id to
  topk_ids[:, -1] and assigns the matching weight (sum(routed)/rsf, then
  renormalize routed-only, then *= rsf), mirroring biased_grouped_topk
  semantics. Brings GSM8K from 0% to ~85%+.

M3 thinking_mode in serving_chat (entrypoints/openai/serving_chat.py)
  Equivalent of upstream 49dde29cc6 + 297abb2838. reasoning_parser.py
  factory already had the M3 branch; serving_chat._get_reasoning_from_request
  was missing it, so every M3 request fell through to `return True` (force
  reasoning), shoving the answer into reasoning_content. Now: thinking_mode
  == "enabled" forces reasoning, the other three (disabled / adaptive /
  unset) let the detector self-handle the <mm:think> tag.
2026-06-21 17:09:49 +08:00
usrlocalben
31124d0ae7 fix: kt_ep_wrapper silently fails to import after a2f451315 (#57)
Upstream sglang removed the gptq_marlin* kernels to the JIT system.
- Added to JIT in sglang upstream PR #18543
- Removed from sgl-kernel in sglang upstream PR #19241

At that moment the kt_ep_wrapper should have been updated but was not.

In many cases the trouble is not observed since sgl-kernel is often
installed from (stale) prebuilt wheels. When building from source
however, e.g. for CUDA123/sm120, the produced binary does not have
gptq_marlin_repack and kt_ep_wrapper is not imported.
2026-06-21 14:03:44 +08:00
Jianwei Dong
a2a8a7e9e0 support glm5.2
Merge/sgl 27114 onto 537eb762
2026-06-16 16:33:56 +08:00
yyj
ee149528a3 fix(dsa): wire skip_topk-gated indexer for GlmMoeDsa to unblock GLM-5.2
Three changes to make GlmMoeDsaForCausalLM (e.g. GLM-5.2-0610-Provider-FP8)
load and forward end-to-end on top of the PR 27114 cherry-pick:

1. DeepseekV2AttentionMLA.__init__: add the is_nextn parameter that PR 27114
   assumed was already in scope. It was not in kt-sglang base; the auto-merge
   inserted self.is_nextn = is_nextn without the corresponding param, raising
   NameError at construction. Also thread is_nextn= through the DecoderLayer
   call site.

2. Compute self.skip_topk / self.next_skip_topk BEFORE creating the Indexer,
   and only instantiate Indexer when (not skip_topk) or is_nextn. GLM-5.2's
   checkpoint ships indexer weights only for "full" layers (driven by
   index_topk_freq=4, index_skip_topk_offset=3); creating an Indexer module
   on shared layers either zero-inits its weights and emits out-of-range
   topk indices that hang the NSA gather kernel (observed: 5-minute
   watchdog timeout at the first forward) or fails a strict load.

3. Gate the three call sites that invoke self.indexer
   (deepseek_v2.py forward_absorb_prepare alt-stream + non-alt branches,
   plus deepseek_common/attention_forward_methods/forward_mha.py
   forward_normal_prepare) on
   (not self.skip_topk) or (self.is_nextn and forward_batch.topk_indices is None)
   and use forward_batch.topk_indices as the cross-layer carrier so shared
   layers reuse the most recent full layer's topk_indices.

Smoke-verified on qjh007 8x H20: TP=8, eager mode, --mem-fraction-static
0.97, --kv-cache-dtype fp8_e4m3. Three diverse prompts return coherent
text; e2e_latency 6-14s for 24-48 tokens.

DecoderLayer.forward still returns the original 2-tuple. The full Tier C
plumbing (3-tuple decoder return plus a signature-level prev_topk_indices
argument) remains future work; the ForwardBatch-carried path here is
sufficient for serving GLM-5.2 in eager mode.
2026-06-13 20:19:01 +08:00
Yuxuan Zhang
dd9ba529f6 [Bugfix] Restore overridden HF config fields and support index_skip_topk_offset for DSA topk sharing (#27114) 2026-06-13 18:41:26 +08:00
Jiaheng Dai
51032b7127 feat: support end-to-end KT LoRA serving for Qwen3.5 MoE (#53) 2026-06-05 15:37:28 +08:00
Benjamin F
ebaff7729b fix: regressions (scheduler hang, cuda graph TypeError, MXFP4 cache, rsf double-apply) (#50)
* fix(v4-flash): remove broken MXFP4 weight cache + fix rsf double-apply

move routed_scaling_factor application from inside
apply_v4_triton_kernels_moe to the caller (mxfp4_deepseek.apply),
mirroring the trtllm path convention. This fixes a latent double-apply
when SGLANG_OPT_MXFP4_FUSE_RSF_SHARED_ADD is enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(scheduler): revert PR #38 req_pool changes that break TP-only mode

PR #38 introduced changes that together cause scheduler hang on
TP-only configurations with max_running_requests=1:

1. scheduler.py: Removed `if self.pp_size > 1:` guard in
   get_num_allocatable_reqs, causing TP-only mode to check
   available_size() unconditionally.

2. memory_pool.py: Changed free_slots from `range(size)` to
   `range(1, size)` to reserve index 0. With max_running_requests=1,
   this produces empty free_slots list.

3. scheduler_runtime_checker_mixin.py: Changed expected_free from
   `req_total_size` to `req_total_size - 1` to match the reserved slot.

This fix reverts all 4 locations to v0.6.1.post1 behavior.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(cuda_graph): use out-of-band _replay_forward_batch for non-DSV4 backends

Cherry-pick fix from upstream 3ffc34dbe to resolve TypeError when
non-DSV4 backends (TritonAttnBackend, etc.) receive unexpected
out_cache_loc kwarg during CUDA graph replay.

Instead of passing out_cache_loc as a parameter (which requires all
backends to update their signatures), use an out-of-band attribute:
- Set attn_backend._replay_forward_batch before the call
- DSV4 backend reads out_cache_loc from this attribute
- Clear the attribute after the call

Conflict resolution: kept kt-sglang's attribute path
`self.model_runner.attn_backend` (vs upstream's `self.attn_backend`).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: remove undefined _GraphBucket reference in cuda graph replay

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-14 14:00:27 +08:00
Benjamin F
43ed1ec77a refactor(dsv4): isolate DeepSeek V4 Flash behind plugin registries (#47)
Squash of 5 kt-side commits on top of upstream main: the four v4-2604B
follow-up fixes (env defaults flip, topk kwargs drop, hf-transformers
backup gate, SwiGLU torch.compile collapse) plus the dsv4-plugin-redo
refactor. Three other v4-2604B fixes (bf16 cpu_buf, mxfp4 SwiGLU clamp,
kt swiglu_limit) are already squash-merged into upstream as part of
PR #44 (6ac4f82e8) and were skipped during rebase.

Goal: prevent DSV4 module bugs / missing dependencies from breaking
non-DSV4 model loads (Qwen, GLM, Kimi, etc.), while preserving full
DSV4-Flash functionality on supported hardware.

## Architecture

Four new plugin-registry modules in base sglang that DSV4 self-attaches
to at module-load time:

- layers/moe/quant_method_registry.py     — chained-wrap MoE quant methods
                                            (mxfp4_deepseek priority=10,
                                             kt_ep priority=20)
- mem_cache/pool_registry.py              — KV pool factory dispatch
                                            (DeepSeekV4TokenToKVPool)
- managers/coordinator_registry.py        — request coordinator factory
                                            (HiSparseCoordinator)
- managers/forward_hooks_registry.py      — scheduler / runner lifecycle
                                            event dispatch (HiSparse hook
                                            adapter)

DSV4 plugin entry: models/deepseek_v4.py runs side-effect imports of
kt_ep_wrapper, mxfp4_deepseek, hisparse_coordinator,
deepseekv4_memory_pool. Each of those self-registers; non-DSV4 models
never trigger these imports when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn.

## isinstance → duck-type tags

Replaced 12+ isinstance(_, DSV4Class) checks across base files with
class-attribute tags (_quant_wrapper_id, _is_v4_token_pool,
_is_dsv4_backend_radix). Base files no longer need to import DSV4
classes just to test object identity.

## Bundled v4-2604B fixes (originally separate commits)

- environ.py: flip 25 SGLANG_DSV4_* / SGLANG_OPT_* env defaults to OFF
  so non-DSV4 paths default to upstream behavior.
- moe/topk.py: drop 2 kwargs from select_experts else-branch's fused_topk
  call that PR #38 left incompatible with non-DSV4 callers.
- utils/hf_transformers_utils.py: gate the deepseek backup-config path
  on _peek_is_deepseek_arch so non-deepseek models (Qwen3, GLM, Kimi)
  with no top-level num_hidden_layers don't RuntimeError on startup.
- moe/fused_moe_triton/fused_moe.py: collapse the 60-line DSV4-specific
  SwiGLU clamp branch in fused_experts_impl down to 5 lines via a
  reused torch.compile dispatch helper (_swiglu_clamp_silu_mul).

## Pre-existing PR #38 bugs surfaced and fixed in this branch

- configs/deepseek_v4.py: was double-`@dataclass`-decorated by
  transformers v5+ PretrainedConfig.__init_subclass__, which stripped
  default_factory(...) from quantization_config / rope_scaling /
  compress_ratios in some builds, causing
  `'Field' object has no attribute 'to_dict'` at runtime. Rewritten
  to traditional __init__ kwargs idiom, matching all other sglang
  configs (afmoe, chatglm, dbrx, bailing_hybrid, ...).
- utils/hf_transformers_utils.py: _load_deepseek_temp_model hardcoded
  config_json["model_type"] = "deepseek_v3" even for V4 architecture,
  causing AutoConfig to resolve transformers' DeepseekV3Config (which
  doesn't expose rope_theta / compress_rope_theta / compress_ratios at
  the top level in transformers-kt 5.6.0). Now picks "deepseek_v4"
  for DeepseekV4ForCausalLM architecture.
- models/deepseek_v2.py: SGLANG_DSV4_MODE=2604 in operator's shell
  caused config.num_hash_layers AttributeError on non-DSV4 configs
  inheriting DeepseekV2MoE (e.g., GlmMoeDsaConfig). Now gated on
  is_deepseek_compressed(config).
- models/deepseek_v4.py: side-effect plugin imports wrapped in
  try/except so a sibling failure (e.g., flashinfer < 0.6.9 trips the
  module-load version check in mxfp4_deepseek) doesn't block
  DeepseekV4ForCausalLM from registering with ModelRegistry.
- _V4MoE subclass replaces is_deepseek_v4 boolean flag pollution in
  DeepseekV2MoE — V4 NextN draft layers bypass hash MoE via
  _compute_is_hash override.

## Robustness fixes from E2E hardware testing

- Triton kernels MXFP4 path: force num_stages=2 in
  triton_kernels.opt_flags constraints to defend against the bare
  `assert num_stages >= 1` for capabilities outside the tested matrix.
- launch_server.py: sweep stale ninja `lock` / `.ninja_lock` files
  under ~/.cache/torch_extensions older than 30 minutes (configurable
  via SGLANG_STALE_LOCK_AGE_MINUTES) so a SIGKILL'd build doesn't wedge
  the next launch indefinitely.

## Verified

- E2E pass on Qwen2.5-7B, Qwen3.5-FP8, Qwen3.5-35B-A3B-FP8,
  Qwen3-Coder-Next-FP8, Kimi-K2.5 (non-DSV4 models, hardware confirmed).
- E2E pass on DeepSeek-V4-Flash with TP=8, MXFP4 routed experts, KT-EP
  CPU/GPU split, hash-MoE, NSA sparse attention (after pinning
  flashinfer>=0.6.9, apache-tvm-ffi==0.1.9, tilelang>=0.1.8).
- 0 DSV4 modules in sys.modules when SGLANG_DISABLED_MODEL_ARCHS skips
  deepseek_v4 / deepseek_v4_nextn — DSV4 plugin failures cannot affect
  non-DSV4 startup.
- pyproject.toml unchanged: drop-in replacement for kt-sglang pre-DSV4
  packaging.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 16:33:18 +08:00
Benjamin F
6ac4f82e8f fix(v4-flash): bundle V4-2604B SwiGLU clamp + hybrid SWA chunked-prefill hang fix (#44)
* fix(kt-ep): match cpu_buf dtype to kt-kernel's bf16 scale write for MXFP4

kt-kernel's write_weights_to_buffer (operators/amx/fp4-moe.hpp) writes
gate/up scales as bf16 via fast_fp32_to_bf16, but mxfp4_deepseek allocates
w13/w2_weight_scale_inv as fp32. The 2x element-size mismatch caused
kt-kernel to fill only the first half of cpu_buf in fp32-element terms;
after Phase 3 .to(float8_e8m0fnu) the second half (= up_proj rows) became
2^-127, zeroing dequantized up_proj weights for all experts loaded via
the kt double-buffered pipeline. Single-chunk GPU prefill on V4-Flash
MXFP4 produced mode-collapsed garbage as a result.

Allocate the cpu_buf with bf16 dtype for these two scale tensors so
kt-kernel's write fills it exactly; gpu_t[e].copy_(cpu_buf[slot]) then
performs the bf16->fp32 dtype cast automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v4-2604b): apply SwiGLU clamp on triton-kernels GPU MoE path

The trtllm and deep_gemm paths both apply a 2604B-specific asymmetric
gate/up clamp (gate.clamp(max=limit); up.clamp(-limit, limit)) on the
gemm1 output before silu_and_mul. The triton-kernels path (default GPU
MoE on every capability outside _TRTLLM_FP4_CAPS, including SM_120
RTX 5090) was constructing a bare matmul_ogs → silu_and_mul → matmul_ogs
sequence with no clamp, leaving routed-expert outputs numerically
inconsistent with the trtllm reference on long-prompt / large-activation
tokens.

Threads moe_runner_config.swiglu_limit through DeepSeekMxfp4MoEMethod.apply
to apply_v4_triton_kernels_moe; semantics match
moe_runner/deep_gemm.py:_apply_swiglu_limit verbatim. No-op when
submode != 2604B (swiglu_limit is None).

Origin: sglang 本身.

* feat(v4-2604b): pass swiglu_limit through KTEPWrapper to kt-kernel

The kt CPU expert path was applying plain silu(g)*u with no clamp,
diverging from the trtllm `gemm1_clamp_limit` and deep_gemm
`_apply_swiglu_limit` references on long-prompt / large-activation
tokens. Companion changes in kt-kernel
(`feat/v4-2604b-swiglu-clamp:d10bd3d`) plumb a `swiglu_limit` field
through `MOEConfig` into the AMX `act_fn`; this commit passes the
value through the kt-sglang bridge.

The KTMoEWrapper is constructed in `create_weights`, before
`create_moe_runner` delivers the full `MoeRunnerConfig`, but the
value is fully determined by SGLANG_DSV4_2604_SUBMODE which is fixed
at process start, so we read the env directly here. Mirrors the
`assert swiglu_limit == 10` in moe_runner/deep_gemm.py and the
`torch.full(..., swiglu_limit, ...)` constructor in
mxfp4_deepseek.py:177-186.

Origin: kt-sglang 耦合.

* fix(scheduler): correct inverted chunked_req check that hangs hybrid SWA chunked prefill

In _get_new_batch_prefill_raw the inline comment explicitly says
"Ignore the check if self.chunked_req is not None" but the code below
used `is not None`, which is the opposite. With --disable-radix-cache +
hybrid SWA + multi-chunk prompt, the chunked_req keeps holding its
req_pool slot across chunks (ChunkCache.cache_unfinished_req does not
release it), and ReqToTokenPool initialises free_slots = list(range(1,
size)) wasting index 0, so once chunked_req takes the only available
slot the check fires forever and the scheduler returns None on every
iteration -> silent hang (chunk1 prefill completes, chunk2 never starts;
TP CPU 60-145% busy spin; client request never returns).

The sister check at line 2065 (`and self.chunked_req is None: return
None`) is correctly inverted; this brings line 2082 in line with the
comment and with that sister check.

Repro (DeepSeek V4 Flash, hybrid SWA, page_size=256):
  --disable-radix-cache --chunked-prefill-size 2048 \
  --tensor-parallel-size 4 --max-running-requests 2
  + a prompt > 2048 tokens (forces multi-chunk)

Before: chunk1 prefill runs, then silent hang or false-positive
    "token_to_kv_pool_allocator memory leak detected" SIGQUIT
    (the hybrid leak check is also too strict; addressed in a
    follow-up commit).
After: 5001-token English prompt -> 3 chunks, HTTP 200 in 26.4s;
    6695-token Chinese prompt -> 4 chunks, HTTP 200 in 52.2s.

Origin: sglang itself (not kt-sglang coupling). Reproduces on pip-
installed upstream sglang as well as on the kt third_party submodule.

* fix(scheduler): skip self_check_during_idle when in-flight work still holds KV slots

Defensive guard for the same bug class as the previous commit. When the
scheduler enters the idle branch with chunked_req != None or a non-empty
running_batch / waiting_queue, the in-flight KV slots are not yet freed
nor cached. _check_hybrid_memory then reports them as leaked because
its formula `full_num_used != 0` does not subtract protected_size /
in-flight usage the way _check_radix_cache_memory does. The result was
a SIGQUIT-on-false-positive: 4 TP ranks raise simultaneously and the
server dies mid-request.

The other branches of self_check_during_idle (DisaggregationMode.PREFILL
and .DECODE) already early-return on similar in-flight conditions; this
patch adds the equivalent guard for DisaggregationMode.NULL which had
no such check. The same pattern is used at scheduler.py line 1372 and
process_input_requests around line 1370.

This guard is no longer load-bearing once the scheduler.py 2082 fix is
in (chunked prefill advances every iter, the scheduler never reaches
batch=None mid-request), but is kept as defence-in-depth against any
future path that produces a double-None batch frame.

Origin: sglang itself.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 23:00:52 +08:00
Benjamin F
c9edb75e0c Fix/v4flash gpu prefill fallback mxfp4 (#41) 2026-05-03 19:31:25 +08:00
Benjamin F
40d3a822af fix(v4-flash): runtime check that flashinfer >= 0.6.9 before MXFP4 MoE imports (#40) 2026-05-03 13:10:01 +08:00
Peilin Li
2bcdc9b63d fix(qwen3.5): normalize MoE nested configs
Normalize Qwen3.5 nested text/vision configs so Qwen3.5 MoE checkpoints expose config attributes after Transformers v5 loading.
2026-05-02 23:45:43 +08:00
Codex
51ea403866 fix(qwen3.5): normalize moe nested configs 2026-05-02 15:44:48 +00:00
Benjamin F
3cbd49c291 feat(deepseek-v4-flash): DeepSeek V4 Flash model + consumer-GPU (SM_120) support (#38)
Adds end-to-end DeepSeek V4 Flash inference support, with portable
fallbacks for hardware lacking vendor kernels (e.g. RTX 5090 / SM_120).

Model
- Register DeepSeekV4 model + config + reasoning parser (HF AutoConfig).
- Server args: V4-specific dispatch flags + KT hybrid CPU/GPU MoE wrapper.

NSA (sparse FP8 MLA decode)
- CUDA path uses the flash_mla wheel on whitelisted caps {(9, 0)}.
- Other caps fall back to a portable Triton kernel
  (decode_sparse_attention_triton, ported from vLLM PR #40929) — handles
  FP8/BF16 dual cache, swa+extra dispatch, and attention sink.

MoE (MXFP4 routed experts)
- cap (10, 0): trtllm-fp4 (existing path, unchanged).
- All other caps: OpenAI triton_kernels matmul_ogs with strided MXFP4
  layout (_use_strided_layout / _patch_strided_mxfp).
- KT hybrid wrapper splits experts between CPU (kt-kernel cpuinfer) and
  GPU (kt-num-gpu-experts), with cudaLaunchHostFunc for CG-capturable
  CPU triggers.

Capability-driven dispatch hardening
- DeepGEMM, fp8-paged-mqa-logits, NSA flash_mla, MoE trtllm-fp4 all
  gated by torch.cuda.get_device_capability() whitelists; non-whitelist
  caps automatically fall to portable Triton/TileLang paths.
- launch_server.py auto-injects FLASHINFER_CUDA_ARCH_LIST /
  TORCH_CUDA_ARCH_LIST (with `a` suffix per cap) before flashinfer
  imports — zero arch-list env required at launch.

Co-authored-by: Oql <1692110604@qq.com>
2026-05-02 21:45:22 +08:00
Codex
8be8492fd2 fix: accept raw fp8 kt method alias 2026-04-29 18:12:20 +00:00
Codex
d647a4241b fix: align sglang kt with transformers kt 2026-04-29 17:44:33 +00:00
Codex
8a86fb89c7 fix: align sglang-kt torch 2.9.1 runtime baseline 2026-04-29 16:05:47 +00:00
Benjamin F
1d10fc8f95 [fix](Qwen3-30b-A3B): fix gibberish output by correcting RoPE write-back (#31)
- Restore in-place storage for q/k and ensure k_cache is updated with the rotated values.
2026-04-27 21:34:59 +08:00
Codex
c819c37879 fix(packaging): align torch deps for kt 0.6.1 2026-04-25 20:20:00 +08:00
ouqingliang
eb0898d217 [refactor]: rename KT fallback logs to layerwise prefill 2026-04-09 03:26:44 +00:00
ouqingliang
a53cb6d078 add kt-numa-nodes 2026-03-31 09:30:52 +00:00
ouqingliang
6b5135e1a1 fix speculative worker and moe layer stability 2026-03-31 02:37:57 +00:00
djw
a45b8d6976 [fix]: update self-referencing extras from sglang to sglang-kt 2026-03-04 08:53:14 +00:00
djw
6b8b5f4649 Use static versioning for sglang-kt, starting at 0.5.2 2026-03-04 06:41:40 +00:00
djw
480c3229d5 Rename PyPI package from sglang to sglang-kt 2026-03-04 06:31:51 +00:00
ouqingliang
b3356b6c46 fix(kt): synchronize INT4 double-buffer slot reuse in fallback prefill 2026-03-03 03:45:30 +00:00
ouqingliang
f1a12b9a93 fix(moe): harden marlin routing and int4 param resolution 2026-03-02 09:35:59 +00:00
xwy-amd8
8d06c338d4 fix(kt): fix Kimi K2.5 RAWINT4 CUDA graph capture crash
Three fixes for Kimi K2.5 RAWINT4 failing to start with CUDA graph:

1. fused_marlin_moe.py: Fix IndentationError from bad merge conflict
   resolution — imports were left outside the `if _is_cuda:` block.

2. fused_marlin_moe.py: Add early return for E=0/M=0. When
   kt-num-gpu-experts=0, GPU expert weights are empty tensors (E=0).
   The marlin MoE kernel crashes on these empty inputs. Return zeros
   so KT CPU experts can contribute the full result.

3. deepseek_v2.py: Skip dual-stream path for KT wrapper. The
   forward_normal_dual_stream uses alt_stream for shared expert
   parallelism, which conflicts with KT wrapper internal _cpu_stream
   during CUDA graph capture.

Fixes #1866
2026-03-01 23:12:05 +08:00
Chen Hongtao
a8b821aee8 fix(kt): robust quant detection in layerwise fallback (#23)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 19:45:03 +08:00
chenht2022
529d06ac2b fix(kt): harden expert remap and metadata fallback 2026-02-28 05:54:52 +00:00
xwy-amd8
48b12817a0 Fix MiMo-V2-Flash KTransformers compatibility
- kt_ep_wrapper.py: normalize list-form moe_layer_freq to int
  MiMo-V2-Flash uses per-layer mask [0,1,1,...] instead of int freq
- mimo_v2_flash.py: use getattr for pad_token_id (not in MiMo config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 03:53:32 +00:00
xwy-amd8
e6428614ab Fix: add DeepseekV3ForCausalLM to MLA detection list
_load_deepseek_v32_model() rewrites architectures from
DeepseekV32ForCausalLM to DeepseekV3ForCausalLM for transformers
compatibility, but the MLA detection list did not include
DeepseekV3ForCausalLM, causing use_mla_backend=False and
MHATokenToKVPool to be created instead of NSATokenToKVPool/MLATokenToKVPool.
2026-02-26 15:23:59 +00:00
xwy-amd8
2a1bafeb16 Revert "Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list"
This reverts commit 01202ee43d.
2026-02-26 14:10:22 +00:00
xwy-amd8
01202ee43d Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list
DeepseekV32ForCausalLM was missing from the model_arch guard in
_handle_model_specific_adjustments(), so is_deepseek_nsa() was never
reached for V3.2 models. This caused the NSA attention backend to not
be auto-selected, leading to q_rope TypeError with flashinfer or
incorrect behavior with other backends.

Upstream bug introduced in sgl-project/sglang#13687 (commit 618ca2380)
which refactored the flat is_deepseek_nsa() check into a nested block
under model_arch guard but only listed DeepseekV3ForCausalLM.
2026-02-26 14:03:55 +00:00
xwy-amd8
6a63993e9f Revert "Skip KT CPU-GPU coordination during CUDA graph capture"
This reverts commit 2ba1f0dea6.
2026-02-26 12:57:31 +00:00
xwy-amd8
2ba1f0dea6 Skip KT CPU-GPU coordination during CUDA graph capture
During CUDA graph capture (regular or PCG), torch.cuda.synchronize()
and CPU-GPU expert coordination are not allowed. Detect capture mode
via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(),
and delegate directly to the GPU method in those cases.

This enables running Qwen3.5 with --attention-backend triton without
--disable-cuda-graph, improving decode from ~11 tok/s to ~65 tok/s.
2026-02-26 08:34:30 +00:00
xwy-amd8
4605b77c7f Fix: revert kt_ep_wrapper.py for kt-kernel 0.5.1 compat, fix rope_scaling property access 2026-02-26 07:57:30 +00:00
xwy-amd8
a2f4513154 Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN 2026-02-26 07:52:14 +00:00
Yilong Zhao
de3d1e7669 [misc] use ORJSONResponse in http-server generate (#19191) 2026-02-25 21:26:25 -08:00
Alison Shao
0fd44ff342 Fix NSA CP positions mismatch in eagle NextN model (#19367) 2026-02-25 20:14:33 -08:00
Xinyu Zhang
119c91cb8b Skip signal handler registration when not on main thread (#18752) 2026-02-25 19:30:05 -08:00
Minglei Zhu
b3202fe6d0 [PCG] fix piecewise cuda graph for Qwen3.5 (#19220) 2026-02-26 11:16:52 +08:00
Alison Shao
a0a8f1473c [Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
2026-02-25 18:45:54 -08:00
sglang-bot
6e82183f5a [Disagg] Route disagg prefill results through process_batch_result (#19364) 2026-02-25 18:38:39 -08:00
fzyzcjy
265eb56d44 Support multi-step alignment and pipeline integration in dump comparator (#19378) 2026-02-26 10:23:22 +08:00
Yuan Luo
4e843f1216 [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache (#19148)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
2026-02-26 10:23:10 +08:00
fzyzcjy
f9a2f0398f Support token aligner planning and execution in dump comparator (#19377) 2026-02-26 10:04:33 +08:00