* feat(minimax-m3): port MiniMax-M3 to kvcache base + initial MXFP8 SM90 path
Port MiniMax M3 inference support from trial-m3 (PR #27944) to the
kvcache-ai/sglang main branch, plus the runtime fixes and MXFP8-on-SM90
enablement landed in subsequent bring-up patches.
Core port (model + sparse attention + JIT + CUDA + pools + parsers)
- minimax_m3.py (1924 lines), VL variant (minimax_m3_vl.py), VL config.
- Sparse attention: MiniMaxHybridAttnBackend, MiniMaxSparseAttnBackend
with MSA (SM100), piecewise CUDA graph, TopK block selection.
- JIT kernels: QK norm + RoPE, decode TopK, KV index store, UE8M0 quant.
- CUDA kernels: fused QK norm, fused KV store, decode TopK, per-token quant.
- Memory pool: MiniMaxSparseKVPool (dual-pool: main KV + sparse index).
- KT integration: swiglu_alpha plumbing in kt_ep_wrapper + server_args.
- Function-call parser: MinimaxM3Detector (]<]minimax[>[ token format).
- Reasoning parser: MiniMaxM3Detector (<mm:think> blocks).
- MXFP8 MoE configs: H100/H20/H200/A800 tuning for E=128/129.
- Tests: JIT kernel, sparse pool, function call, E2E, AMD benchmarks.
- Infrastructure: forward_context, cuda_graph_config, fused_moe (upstream).
Initial MXFP8 SM90 enablement (block-fp8 conversion pipeline)
- MXFP8 -> block-FP8 [128,128] conversion pipeline + M3 model defaults
(loader converts raw MXFP8 weights to block-FP8 at load time).
- Allow Triton MoE backend for MXFP8 on SM90 (Hopper / H20 / H100):
Fp8MoEMethod accepts MXFP8 weights, runs the block-FP8 path after
conversion.
- Disable piecewise CUDA graph for the MXFP8 block-fp8 conversion path
(the conversion runs once at load, but piecewise capture re-traced
the load step).
- Adapt KT EP wrapper for yyj6666667/ktransformers feat/mxfp8-moe-m3:
CPU-side MXFP8 byte-copy + GPU shadow gpu_method holding block-FP8
view, so hybrid forwards take the same Fp8MoEMethod path as pure GPU.
Runtime fixes for kvcache base
- get_rope_config compatibility, dense-sparse decode guard for the
hybrid attn backend, VL processor tokenizer access on kvcache's
Processor wrapper.
Source: yyj6666667/sglang trial-m3 (3db7af4d4) / sgl-project/sglang PR
#27944. Target: kvcache-ai/sglang main (51032b712).
This commit is the consolidation of the original 7 bring-up commits
(3073753, 8c597a7, 3b9dbfe, bd6c876, cc84e6e, b712433, 77d514a) — see
archive/pre-port-squash-20260618 for the pre-squash history.
* fix(minimax-m3): port upstream infra so M3 launches on kvcache base
Bring-up patches for the crashes that surfaced one after another as the
M3 launch pipeline advanced through:
module-load → ServerArgs → model construct → attn backend → KV pool →
CUDA graph. Each fix is the smallest delta from upstream M3 expectations
to the current kvcache fork.
env (environ.py)
- Register all env vars M3 references at __init__ / capture time
(EnvBool(False) to keep current default behaviour):
SGLANG_OPT_USE_MINIMAX_FUSED_QKNORM_ROPE,
SGLANG_OPT_USE_MINIMAX_FUSED_KV_INDEX_STORE,
SGLANG_OPT_USE_MINIMAX_DECODE_TOPK_RADIX,
SGLANG_OPT_USE_BF16_ROUTER_GEMM,
SGLANG_OPT_USE_MINIMAX_DENSE_SPARSE_DECODE,
SGLANG_DISABLE_MSA.
moe utils (layers/moe/utils.py)
- Backport upstream get_moe_padding_size(is_aiter_moe) + AITER/TRITON
padding constants.
triton_utils import paths (moe_runner/triton_utils/{fused_moe,mxfp8_moe}.py)
- Rewrite imports to use existing .fused_moe_triton.* symbols (sibling
.moe_runner.triton_utils.* modules don't exist on kvcache).
- Wrap sglang.jit_kernel.activation import in try/except (M3 only needs
swiglu_no_interleaved_with_alpha_and_limit, a 5-line torch.compile fn).
mxfp8_native.py
- Port upstream native MXFP8 quant module (mxfp8_e4m3_quantize consumed
by mxfp8_moe.py).
schedule_batch.MultimodalProcessorOutput
- Add upstream dataclass; M3 VL processor returns it from process().
FusedMoE.__init__
- Accept interleaved: bool = False; assert against True (kvcache triton
MoE assumes non-interleaved W13).
KV pool selection (model_runner_kv_cache_mixin.py)
- Instantiate MiniMaxSparseKVPool directly when is_minimax_sparse(hf_config);
pool_registry's M3 registration is gated behind is_v4_model and never
fires for M3.
MiniMaxSparse/HybridAttnBackend
- Add init_forward_metadata wrapper and legacy capture/replay adapters
that translate to upstream's _out_graph(in_capture=) + _in_graph split.
KT-EP wrapper (kt_ep_wrapper.py)
- Remove duplicate swiglu_alpha kwarg and out-of-scope hf_config reference
in create_kt_config_from_server_args (SyntaxError took out KT-EP, every
rank fell back to standard FP8 weight allocation and OOM'd at TP=8).
cuda_graph_config (cuda_graph_config.py)
- check_cuda_graph_backend reads flat fields (disable_cuda_graph,
enable_piecewise_cuda_graph, ...) when the aggregate cuda_graph_config
object is absent (kvcache keeps the cuda-graph args flat).
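A minimal sketch of the flat-field fallback described above. The helper name
and the assumption that the aggregate object exposes the same field name are
illustrative only; they are not taken from cuda_graph_config.py.

```python
from types import SimpleNamespace


def cuda_graph_disabled(server_args) -> bool:
    """Hypothetical helper: prefer the aggregate config, else the flat field."""
    cfg = getattr(server_args, "cuda_graph_config", None)
    if cfg is not None:
        # Upstream-style layout (assumed): settings live on an aggregate object.
        return bool(cfg.disable_cuda_graph)
    # kvcache-style layout: the cuda-graph args are flat ServerArgs fields.
    return bool(getattr(server_args, "disable_cuda_graph", False))


flat_args = SimpleNamespace(disable_cuda_graph=True)
aggregate_args = SimpleNamespace(
    cuda_graph_config=SimpleNamespace(disable_cuda_graph=False)
)
```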
* fix(minimax-m3): support apply_routed_scaling_factor_on_output in fused_topk
select_experts asserted `not apply_routed_scaling_factor_on_output` on
the non-grouped / non-sqrtsoftplus paths (topk.py:1024). M3's MoE passes
`apply_routed_scaling_factor_on_output=True` to TopK, which falls into
the fused_topk branch (scoring_func != "sqrtsoftplus"), so capture-time
forward hit AssertionError.
fused_topk itself does not take the kwarg. The semantic is just
"multiply topk_weights by routed_scaling_factor on the output side", so
apply it manually after the fused_topk call.
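A pure-Python sketch of the post-call scaling. fused_topk_stub is a
hypothetical stand-in for the real kernel (which works on tensors); only the
"multiply after the call" step mirrors the fix.

```python
from typing import List, Tuple


def fused_topk_stub(
    scores: List[List[float]], top_k: int
) -> Tuple[List[List[float]], List[List[int]]]:
    """Hypothetical stand-in: per-row top-k weights and expert ids."""
    weights, ids = [], []
    for row in scores:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:top_k]
        ids.append(order)
        weights.append([row[i] for i in order])
    return weights, ids


def select_with_output_scaling(
    scores, top_k, routed_scaling_factor, apply_routed_scaling_factor_on_output
):
    # The kernel stand-in does not take the kwarg, so scale its output here.
    topk_weights, topk_ids = fused_topk_stub(scores, top_k)
    if apply_routed_scaling_factor_on_output:
        topk_weights = [
            [w * routed_scaling_factor for w in row] for row in topk_weights
        ]
    return topk_weights, topk_ids


weights, ids = select_with_output_scaling([[0.1, 0.6, 0.3]], 2, 2.0, True)
```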
* refactor(minimax-m3): read swiglu alpha/limit from MoeRunnerConfig, drop CLI flags
kt_ep_wrapper.py
- KTEPWrapperMethod.create_weights() now pulls gemm1_alpha /
gemm1_clamp_limit / swiglu_limit from layer.moe_runner_config
(which the model file populates from HF config — minimax_m3.py
forwards config.swiglu_alpha / swiglu_limit; deepseek_v2.py
forwards config.swiglu_limit). This is the same data source the
GPU runners already use, so CPU and GPU experts can no longer
drift to different alpha/limit values (the silu-vs-swigluoai
mismatch from 8469f58a83).
- Remove KTConfig.swiglu_alpha / swiglu_limit fields and their
wiring in create_kt_config_from_server_args.
- Drop the SGLANG_DSV4_2604_SUBMODE env fallback for swiglu_limit:
DSV4-2604B's limit=10 already flows through MoeRunnerConfig via
deepseek_v2.py, no env override needed. The unrelated 2604B
path-checker bump later in the file is kept.
server_args.py
- Delete --kt-swiglu-alpha / --kt-swiglu-limit CLI flags and their
ServerArgs fields. Launch scripts that passed them (M3 hybrid)
must drop the two flags after upgrading.
Verified on qjh005 (TP=4 GPUs 4-7, no CLI flags): kt-kernel's
[M3-PROBE-2] / [M3-PROBE-3] prints show layer_idx=4
swiglu_alpha=1.702 swiglu_limit=7.0 reaching the wrapper, matching
M3's HF config defaults.
* feat(minimax-m3): native MXFP8 path + shared-expert routing on SM90
Previously M3 on H20 (SM90) was forced through the lossy MXFP8 -> block-FP8
[128,128] convert path, collapsing 294,912 mxfp8 scales (uint8 ue8m0 [1,32])
per expert w13 into 576 block-fp8 scales (fp32). M3 was trained with mxfp8
granularity; the squash + a missing shared-expert routing step + a missing
M3 thinking_mode branch combined to give garbage output (0% GSM8K on pure
sglang). Net result of this commit: pure sglang GSM8K 87%, hybrid
(--kt-num-gpu-experts=8 + layerwise prefill) 89% over 200 questions.
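The scale counts above follow from the two group sizes alone. A quick check
of the arithmetic (variable names are illustrative):

```python
# One MXFP8 scale covers a [1, 32] group; one block-FP8 scale covers [128, 128].
MXFP8_GROUP_ELEMS = 1 * 32
BLOCK_FP8_TILE_ELEMS = 128 * 128

mxfp8_scales_per_w13 = 294_912      # from the commit message
block_fp8_scales_per_w13 = 576      # from the commit message

# Both scale layouts must cover the same number of weight elements.
w13_elems = mxfp8_scales_per_w13 * MXFP8_GROUP_ELEMS
assert w13_elems == block_fp8_scales_per_w13 * BLOCK_FP8_TILE_ELEMS

# Each block-FP8 scale replaces this many MXFP8 scales.
scales_collapsed_per_block = BLOCK_FP8_TILE_ELEMS // MXFP8_GROUP_ELEMS
print(w13_elems, scales_collapsed_per_block)
```

Every [128,128] block therefore stands in for 512 independent ue8m0 scales,
which is the granularity loss the native path avoids.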
Fp8MoEMethod / Fp8LinearMethod (layers/quantization/fp8.py)
Add use_mxfp8_native_sm90 gate (CUDA + SM90 + use_mxfp8 + not SM100+).
When True: process_weights_after_loading keeps weights raw MXFP8 (no
convert), apply() routes to get_triton_quant_info -> TritonMoeQuantInfo
(use_mxfp8=True) for MoE and to dot_scaled_mxfp8_blockscaled_linear (the
Triton tl.dot_scaled path in mxfp8_native.py, no SM100 swizzle) for
dense linear. Add a no-op _process_mxfp8_linear_weight_scale so M3's
MiniMaxM3FusedQKVIndex no longer crashes when its scale-finalize hook
runs on the raw-MXFP8 path.
TritonMoeQuantInfo / TritonRunnerCore / fused_experts_none_to_triton
(layers/moe/moe_runner/triton.py)
Add use_mxfp8 bool field (default False; non-MXFP8 callers unaffected).
Both run() and the registered fused-func dispatch route to
triton_utils/mxfp8_moe.py:fused_experts_mxfp8 (formerly dead code: zero
callers) when use_mxfp8 is set. interleaved=False is hardcoded since
M3's FusedMoE asserts it.
KT-EP layerwise prefill (layers/moe/kt_ep_wrapper.py)
Byte-copy MXFP8 only; shadow gpu_method stays in MXFP8 view so apply()
takes the same native MXFP8 path as the steady-state hybrid forward.
Shared-expert routing (layers/moe/topk.py)
Installed sgl_kernel.topk_sigmoid is 5-arg, doesn't accept
num_fused_shared_experts. Without intervention M3's shared expert
(id = num_local_experts = 128) was never routed to, killing the
always-active component. select_experts now appends the shared id to
topk_ids[:, -1] and assigns the matching weight (sum(routed)/rsf, then
renormalize routed-only, then *= rsf), mirroring biased_grouped_topk
semantics. Brings GSM8K from 0% to ~85%+.
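One reading of the weight assignment, sketched on a single token with plain
lists. It assumes the last top-k slot is reserved for the shared expert and
that the renormalization divides every weight by the routed-only sum (as in
upstream biased_grouped_topk); the function name is hypothetical.

```python
from typing import List, Tuple


def append_shared_expert(
    topk_ids: List[int], topk_weights: List[float], shared_id: int, rsf: float
) -> Tuple[List[int], List[float]]:
    """Fill the last slot with the shared expert and rescale the weights."""
    routed = topk_weights[:-1]
    routed_sum = sum(routed)
    ids = topk_ids[:-1] + [shared_id]
    # Shared weight = sum(routed) / rsf, written into the last slot.
    weights = routed + [routed_sum / rsf]
    # Renormalize using the routed-only sum as the denominator.
    weights = [w / routed_sum for w in weights]
    # Apply the routed scaling factor on the output side.
    weights = [w * rsf for w in weights]
    return ids, weights


# 128 routed experts, so the shared expert id is 128 (from the commit message).
ids, weights = append_shared_expert([5, 17, 0], [0.5, 0.3, 0.0], 128, 2.0)
```

Under these assumptions the shared expert always ends up with weight 1.0 and
the routed weights sum to rsf.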
M3 thinking_mode in serving_chat (entrypoints/openai/serving_chat.py)
Equivalent of upstream 49dde29cc6 + 297abb2838. reasoning_parser.py
factory already had the M3 branch; serving_chat._get_reasoning_from_request
was missing it, so every M3 request fell through to `return True` (force
reasoning), shoving the answer into reasoning_content. Now: thinking_mode
== "enabled" forces reasoning, the other three (disabled / adaptive /
unset) let the detector self-handle the <mm:think> tag.
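The M3 branch reduces to a single comparison. A sketch with a hypothetical
function name (the real logic lives in
serving_chat._get_reasoning_from_request):

```python
from typing import Optional


def m3_force_reasoning(thinking_mode: Optional[str]) -> bool:
    """Return True only when the request explicitly enables thinking.

    For "disabled", "adaptive", or an unset value the detector is left to
    handle the <mm:think> tag itself.
    """
    return thinking_mode == "enabled"
```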
Upstream sglang moved the gptq_marlin* kernels to the JIT system.
- Added to JIT in sglang upstream PR #18543
- Removed from sgl-kernel in sglang upstream PR #19241
kt_ep_wrapper should have been updated at that point but was not.
The breakage often goes unnoticed because sgl-kernel is usually
installed from (stale) prebuilt wheels. When building from source,
however, e.g. for CUDA123/sm120, the produced binary does not contain
gptq_marlin_repack, so kt_ep_wrapper fails to import.
Three changes to make GlmMoeDsaForCausalLM (e.g. GLM-5.2-0610-Provider-FP8)
load and forward end-to-end on top of the PR 27114 cherry-pick:
1. DeepseekV2AttentionMLA.__init__: add the is_nextn parameter that PR 27114
assumed was already in scope. It was not in kt-sglang base; the auto-merge
inserted self.is_nextn = is_nextn without the corresponding param, raising
NameError at construction. Also thread is_nextn= through the DecoderLayer
call site.
2. Compute self.skip_topk / self.next_skip_topk BEFORE creating the Indexer,
and only instantiate Indexer when (not skip_topk) or is_nextn. GLM-5.2's
checkpoint ships indexer weights only for "full" layers (driven by
index_topk_freq=4, index_skip_topk_offset=3); creating an Indexer module
on shared layers either zero-inits its weights and emits out-of-range
topk indices that hang the NSA gather kernel (observed: 5-minute
watchdog timeout at the first forward) or fails a strict load.
3. Gate the three call sites that invoke self.indexer
(deepseek_v2.py forward_absorb_prepare alt-stream + non-alt branches,
plus deepseek_common/attention_forward_methods/forward_mha.py
forward_normal_prepare) on
(not self.skip_topk) or (self.is_nextn and forward_batch.topk_indices is None)
and use forward_batch.topk_indices as the cross-layer carrier so shared
layers reuse the most recent full layer's topk_indices.
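The gate from item 3 and the ForwardBatch carrier can be sketched as below.
ForwardBatchStub and layer_topk_indices are hypothetical names; only the
boolean condition is taken from the text above.

```python
from typing import Callable, List, Optional


class ForwardBatchStub:
    """Hypothetical stand-in for ForwardBatch: carries indices across layers."""

    def __init__(self) -> None:
        self.topk_indices: Optional[List[int]] = None


def should_run_indexer(skip_topk: bool, is_nextn: bool, forward_batch) -> bool:
    return (not skip_topk) or (is_nextn and forward_batch.topk_indices is None)


def layer_topk_indices(
    skip_topk: bool,
    is_nextn: bool,
    forward_batch: ForwardBatchStub,
    run_indexer: Callable[[], List[int]],
) -> Optional[List[int]]:
    if should_run_indexer(skip_topk, is_nextn, forward_batch):
        # "Full" layer: compute fresh indices and publish them on the batch.
        forward_batch.topk_indices = run_indexer()
    # Shared layer: reuse whatever the most recent full layer published.
    return forward_batch.topk_indices
```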
Smoke-verified on qjh007 8x H20: TP=8, eager mode, --mem-fraction-static
0.97, --kv-cache-dtype fp8_e4m3. Three diverse prompts return coherent
text; e2e_latency 6-14s for 24-48 tokens.
DecoderLayer.forward still returns the original 2-tuple. The full Tier C
plumbing (3-tuple decoder return plus a signature-level prev_topk_indices
argument) remains future work; the ForwardBatch-carried path here is
sufficient for serving GLM-5.2 in eager mode.
* fix(v4-flash): remove broken MXFP4 weight cache + fix rsf double-apply
Move routed_scaling_factor application from inside
apply_v4_triton_kernels_moe to the caller (mxfp4_deepseek.apply),
mirroring the trtllm path convention. This fixes a latent double-apply
when SGLANG_OPT_MXFP4_FUSE_RSF_SHARED_ADD is enabled.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(scheduler): revert PR #38 req_pool changes that break TP-only mode
PR #38 introduced changes that together cause scheduler hang on
TP-only configurations with max_running_requests=1:
1. scheduler.py: Removed `if self.pp_size > 1:` guard in
get_num_allocatable_reqs, causing TP-only mode to check
available_size() unconditionally.
2. memory_pool.py: Changed free_slots from `range(size)` to
`range(1, size)` to reserve index 0. With max_running_requests=1,
this produces empty free_slots list.
3. scheduler_runtime_checker_mixin.py: Changed expected_free from
`req_total_size` to `req_total_size - 1` to match the reserved slot.
This fix reverts all three changes to v0.6.1.post1 behavior.
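The empty free list in item 2 is easy to reproduce in isolation:

```python
size = 1  # max_running_requests=1

free_slots_before_pr38 = list(range(size))     # index 0 is usable
free_slots_after_pr38 = list(range(1, size))   # index 0 reserved, nothing left

print(free_slots_before_pr38, free_slots_after_pr38)
```

With no free slot, an unconditional available_size() check can never admit a
request, which matches the hang described above.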
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(cuda_graph): use out-of-band _replay_forward_batch for non-DSV4 backends
Cherry-pick fix from upstream 3ffc34dbe to resolve TypeError when
non-DSV4 backends (TritonAttnBackend, etc.) receive unexpected
out_cache_loc kwarg during CUDA graph replay.
Instead of passing out_cache_loc as a parameter (which requires all
backends to update their signatures), use an out-of-band attribute:
- Set attn_backend._replay_forward_batch before the call
- DSV4 backend reads out_cache_loc from this attribute
- Clear the attribute after the call
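A sketch of the out-of-band pattern with hypothetical class and method names;
only the attribute name _replay_forward_batch comes from this commit. The
try/finally is an illustrative choice to guarantee the attribute is cleared.

```python
from types import SimpleNamespace


class LegacyBackendStub:
    """Backend whose replay signature knows nothing about out_cache_loc."""

    def replay(self, bs: int):
        return ("legacy", bs)


class Dsv4LikeBackendStub:
    """Backend that needs out_cache_loc and reads it from the side channel."""

    _replay_forward_batch = None

    def replay(self, bs: int):
        return ("dsv4", bs, self._replay_forward_batch.out_cache_loc)


def replay_with_side_channel(backend, bs: int, forward_batch):
    backend._replay_forward_batch = forward_batch  # set before the call
    try:
        return backend.replay(bs)                  # signature unchanged
    finally:
        backend._replay_forward_batch = None       # clear after the call


batch = SimpleNamespace(out_cache_loc=[4, 5])
```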
Conflict resolution: kept kt-sglang's attribute path
`self.model_runner.attn_backend` (vs upstream's `self.attn_backend`).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix: remove undefined _GraphBucket reference in cuda graph replay
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Squash of 5 kt-side commits on top of upstream main: the four v4-2604B
follow-up fixes (env defaults flip, topk kwargs drop, hf-transformers
backup gate, SwiGLU torch.compile collapse) plus the dsv4-plugin-redo
refactor. Three other v4-2604B fixes (bf16 cpu_buf, mxfp4 SwiGLU clamp,
kt swiglu_limit) are already squash-merged into upstream as part of
PR #44 (6ac4f82e8) and were skipped during rebase.
Goal: prevent DSV4 module bugs / missing dependencies from breaking
non-DSV4 model loads (Qwen, GLM, Kimi, etc.), while preserving full
DSV4-Flash functionality on supported hardware.
## Architecture
Four new plugin-registry modules in base sglang that DSV4 self-attaches
to at module-load time:
- layers/moe/quant_method_registry.py — chained-wrap MoE quant methods
(mxfp4_deepseek priority=10,
kt_ep priority=20)
- mem_cache/pool_registry.py — KV pool factory dispatch
(DeepSeekV4TokenToKVPool)
- managers/coordinator_registry.py — request coordinator factory
(HiSparseCoordinator)
- managers/forward_hooks_registry.py — scheduler / runner lifecycle
event dispatch (HiSparse hook
adapter)
DSV4 plugin entry: models/deepseek_v4.py runs side-effect imports of
kt_ep_wrapper, mxfp4_deepseek, hisparse_coordinator,
deepseekv4_memory_pool. Each of those self-registers; non-DSV4 models
never trigger these imports when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn.
## isinstance → duck-type tags
Replaced 12+ isinstance(_, DSV4Class) checks across base files with
class-attribute tags (_quant_wrapper_id, _is_v4_token_pool,
_is_dsv4_backend_radix). Base files no longer need to import DSV4
classes just to test object identity.
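A sketch of the tag check. The attribute name _is_v4_token_pool is from this
commit; the classes and helper are hypothetical.

```python
class BasePoolStub:
    """Any pool that knows nothing about DSV4."""


class V4LikePoolStub(BasePoolStub):
    # Class-attribute tag set by the DSV4 side.
    _is_v4_token_pool = True


def is_v4_token_pool(pool) -> bool:
    # No DSV4 import needed: absent tag means "not a V4 pool".
    return getattr(pool, "_is_v4_token_pool", False)
```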
## Bundled v4-2604B fixes (originally separate commits)
- environ.py: flip 25 SGLANG_DSV4_* / SGLANG_OPT_* env defaults to OFF
so non-DSV4 paths default to upstream behavior.
- moe/topk.py: drop 2 kwargs from select_experts else-branch's fused_topk
call that PR #38 left incompatible with non-DSV4 callers.
- utils/hf_transformers_utils.py: gate the deepseek backup-config path
on _peek_is_deepseek_arch so non-deepseek models (Qwen3, GLM, Kimi)
with no top-level num_hidden_layers don't RuntimeError on startup.
- moe/fused_moe_triton/fused_moe.py: collapse the 60-line DSV4-specific
SwiGLU clamp branch in fused_experts_impl down to 5 lines via a
reused torch.compile dispatch helper (_swiglu_clamp_silu_mul).
## Pre-existing PR #38 bugs surfaced and fixed in this branch
- configs/deepseek_v4.py: was double-`@dataclass`-decorated by
transformers v5+ PretrainedConfig.__init_subclass__, which stripped
default_factory(...) from quantization_config / rope_scaling /
compress_ratios in some builds, causing
`'Field' object has no attribute 'to_dict'` at runtime. Rewritten
to traditional __init__ kwargs idiom, matching all other sglang
configs (afmoe, chatglm, dbrx, bailing_hybrid, ...).
- utils/hf_transformers_utils.py: _load_deepseek_temp_model hardcoded
config_json["model_type"] = "deepseek_v3" even for V4 architecture,
causing AutoConfig to resolve transformers' DeepseekV3Config (which
doesn't expose rope_theta / compress_rope_theta / compress_ratios at
the top level in transformers-kt 5.6.0). Now picks "deepseek_v4"
for DeepseekV4ForCausalLM architecture.
- models/deepseek_v2.py: SGLANG_DSV4_MODE=2604 in operator's shell
caused config.num_hash_layers AttributeError on non-DSV4 configs
inheriting DeepseekV2MoE (e.g., GlmMoeDsaConfig). Now gated on
is_deepseek_compressed(config).
- models/deepseek_v4.py: side-effect plugin imports wrapped in
try/except so a sibling failure (e.g., flashinfer < 0.6.9 trips the
module-load version check in mxfp4_deepseek) doesn't block
DeepseekV4ForCausalLM from registering with ModelRegistry.
- _V4MoE subclass replaces is_deepseek_v4 boolean flag pollution in
DeepseekV2MoE — V4 NextN draft layers bypass hash MoE via
_compute_is_hash override.
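The guarded side-effect imports in models/deepseek_v4.py can be sketched as
follows. The helper name and the module names in the example call are
placeholders; the point is that one failing sibling is logged and skipped.

```python
import importlib
import logging
from typing import Iterable, List

logger = logging.getLogger(__name__)


def import_plugins(module_names: Iterable[str]) -> List[str]:
    """Import each plugin for its side effects; skip the ones that fail."""
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except Exception as exc:  # e.g. a version check raising at import
            logger.warning("plugin %s failed to load: %s", name, exc)
    return loaded


# "json" stands in for a healthy plugin, the second name for a broken one.
loaded = import_plugins(["json", "hypothetical_missing_plugin_xyz"])
```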
## Robustness fixes from E2E hardware testing
- Triton kernels MXFP4 path: force num_stages=2 in
triton_kernels.opt_flags constraints to defend against the bare
`assert num_stages >= 1` for capabilities outside the tested matrix.
- launch_server.py: sweep stale ninja `lock` / `.ninja_lock` files
under ~/.cache/torch_extensions older than 30 minutes (configurable
via SGLANG_STALE_LOCK_AGE_MINUTES) so a SIGKILL'd build doesn't wedge
the next launch indefinitely.
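A sketch of the stale-lock sweep under the rules stated above. Function names
are hypothetical and the real launch_server.py code may differ; the file
names, directory, 30-minute default, and env var come from this commit.

```python
import os
import time
from typing import List, Optional

LOCK_NAMES = {"lock", ".ninja_lock"}


def stale_lock_age_minutes() -> float:
    return float(os.environ.get("SGLANG_STALE_LOCK_AGE_MINUTES", "30"))


def sweep_stale_locks(
    root: str, max_age_minutes: float, now: Optional[float] = None
) -> List[str]:
    """Remove lock files under root whose mtime is older than the limit."""
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fname not in LOCK_NAMES:
                continue
            path = os.path.join(dirpath, fname)
            try:
                if now - os.path.getmtime(path) > max_age_minutes * 60:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # Another process may have removed it first; ignore.
                continue
    return removed


def sweep_default_cache() -> List[str]:
    root = os.path.expanduser("~/.cache/torch_extensions")
    return sweep_stale_locks(root, stale_lock_age_minutes())
```

Fresh locks are left alone, so a build that is still running in another
process is not disturbed.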
## Verified
- E2E pass on Qwen2.5-7B, Qwen3.5-FP8, Qwen3.5-35B-A3B-FP8,
Qwen3-Coder-Next-FP8, Kimi-K2.5 (non-DSV4 models, hardware confirmed).
- E2E pass on DeepSeek-V4-Flash with TP=8, MXFP4 routed experts, KT-EP
CPU/GPU split, hash-MoE, NSA sparse attention (after pinning
flashinfer>=0.6.9, apache-tvm-ffi==0.1.9, tilelang>=0.1.8).
- 0 DSV4 modules in sys.modules when SGLANG_DISABLED_MODEL_ARCHS skips
deepseek_v4 / deepseek_v4_nextn — DSV4 plugin failures cannot affect
non-DSV4 startup.
- pyproject.toml unchanged: drop-in replacement for kt-sglang pre-DSV4
packaging.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(kt-ep): match cpu_buf dtype to kt-kernel's bf16 scale write for MXFP4
kt-kernel's write_weights_to_buffer (operators/amx/fp4-moe.hpp) writes
gate/up scales as bf16 via fast_fp32_to_bf16, but mxfp4_deepseek allocates
w13/w2_weight_scale_inv as fp32. The 2x element-size mismatch caused
kt-kernel to fill only the first half of cpu_buf in fp32-element terms;
after Phase 3 .to(float8_e8m0fnu) the second half (= up_proj rows) became
2^-127, zeroing dequantized up_proj weights for all experts loaded via
the kt double-buffered pipeline. Single-chunk GPU prefill on V4-Flash
MXFP4 produced mode-collapsed garbage as a result.
Allocate the cpu_buf with bf16 dtype for these two scale tensors so
kt-kernel's write fills it exactly; gpu_t[e].copy_(cpu_buf[slot]) then
performs the bf16->fp32 dtype cast automatically.
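The half-filled buffer can be reproduced with plain byte accounting. This
sketch assumes a zero-initialised buffer and a tiny element count; it models
the size mismatch only, not the kt-kernel code.

```python
import struct

n = 8  # scale elements: first half gate rows, second half up rows

# Old behaviour: buffer allocated as fp32 (4 bytes per element).
cpu_buf = bytearray(n * 4)

# The writer emits n bf16 values (2 bytes each); 0x3F80 is bf16 1.0.
payload = struct.pack(f"<{n}H", *([0x3F80] * n))
cpu_buf[: len(payload)] = payload

# Reinterpreting as fp32 shows only the first half was touched.
as_fp32 = struct.unpack(f"<{n}f", bytes(cpu_buf))
untouched_half = as_fp32[n // 2 :]


def e8m0_to_float(exponent_byte: int) -> float:
    """Decode an e8m0 scale: 2 ** (byte - 127)."""
    return 2.0 ** (exponent_byte - 127)


# An exponent byte of 0 decodes to 2^-127, effectively zeroing the weights.
smallest_scale = e8m0_to_float(0)
```

Allocating the buffer as bf16 makes the payload and the buffer the same size,
so every row receives a real scale.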
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(v4-2604b): apply SwiGLU clamp on triton-kernels GPU MoE path
The trtllm and deep_gemm paths both apply a 2604B-specific asymmetric
gate/up clamp (gate.clamp(max=limit); up.clamp(-limit, limit)) on the
gemm1 output before silu_and_mul. The triton-kernels path (default GPU
MoE on every capability outside _TRTLLM_FP4_CAPS, including SM_120
RTX 5090) was constructing a bare matmul_ogs → silu_and_mul → matmul_ogs
sequence with no clamp, leaving routed-expert outputs numerically
inconsistent with the trtllm reference on long-prompt / large-activation
tokens.
Threads moe_runner_config.swiglu_limit through DeepSeekMxfp4MoEMethod.apply
to apply_v4_triton_kernels_moe; semantics match
moe_runner/deep_gemm.py:_apply_swiglu_limit verbatim. No-op when
submode != 2604B (swiglu_limit is None).
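A scalar sketch of the asymmetric clamp followed by silu_and_mul. The real
path operates on tensors; the function names here are illustrative.

```python
import math
from typing import Optional


def silu(x: float) -> float:
    # Numerically stable form for large negative inputs.
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu_with_clamp(gate: float, up: float, limit: Optional[float]) -> float:
    if limit is not None:
        gate = min(gate, limit)             # gate.clamp(max=limit)
        up = max(-limit, min(up, limit))    # up.clamp(-limit, limit)
    return silu(gate) * up                  # silu_and_mul


clamped = swiglu_with_clamp(100.0, -50.0, 10.0)
unclamped = swiglu_with_clamp(100.0, -50.0, None)
```

The gate is bounded only from above while up is bounded on both sides, and
passing limit=None reproduces the unclamped behaviour.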
Origin: sglang itself.
* feat(v4-2604b): pass swiglu_limit through KTEPWrapper to kt-kernel
The kt CPU expert path was applying plain silu(g)*u with no clamp,
diverging from the trtllm `gemm1_clamp_limit` and deep_gemm
`_apply_swiglu_limit` references on long-prompt / large-activation
tokens. Companion changes in kt-kernel
(`feat/v4-2604b-swiglu-clamp:d10bd3d`) plumb a `swiglu_limit` field
through `MOEConfig` into the AMX `act_fn`; this commit passes the
value through the kt-sglang bridge.
The KTMoEWrapper is constructed in `create_weights`, before
`create_moe_runner` delivers the full `MoeRunnerConfig`, but the
value is fully determined by SGLANG_DSV4_2604_SUBMODE which is fixed
at process start, so we read the env directly here. Mirrors the
`assert swiglu_limit == 10` in moe_runner/deep_gemm.py and the
`torch.full(..., swiglu_limit, ...)` constructor in
mxfp4_deepseek.py:177-186.
Origin: kt-sglang coupling.
* fix(scheduler): correct inverted chunked_req check that hangs hybrid SWA chunked prefill
In _get_new_batch_prefill_raw the inline comment explicitly says
"Ignore the check if self.chunked_req is not None" but the code below
used `is not None`, which is the opposite. With --disable-radix-cache +
hybrid SWA + multi-chunk prompt, the chunked_req keeps holding its
req_pool slot across chunks (ChunkCache.cache_unfinished_req does not
release it), and ReqToTokenPool initialises free_slots = list(range(1,
size)) wasting index 0, so once chunked_req takes the only available
slot the check fires forever and the scheduler returns None on every
iteration -> silent hang (chunk1 prefill completes, chunk2 never starts;
TP CPU 60-145% busy spin; client request never returns).
The sister check at line 2065 (`and self.chunked_req is None: return
None`) is correctly inverted; this brings line 2082 in line with the
comment and with that sister check.
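The difference between the two conditions, reduced to a predicate. The
surrounding "no allocatable request slots" test is an assumption about the
shape of the check; only the `is None` versus `is not None` flip comes from
this commit.

```python
def bail_out_buggy(num_allocatable_reqs: int, chunked_req) -> bool:
    # Old code: bails precisely when a chunked request is in flight.
    return num_allocatable_reqs <= 0 and chunked_req is not None


def bail_out_fixed(num_allocatable_reqs: int, chunked_req) -> bool:
    # Fixed: the check is ignored while a chunked request is in flight,
    # because that request already owns its req_pool slot.
    return num_allocatable_reqs <= 0 and chunked_req is None


in_flight = object()  # stands in for a chunked request between chunks
```

With the only slot held by the in-flight chunked request, the old predicate
returns True on every iteration, so chunk 2 never gets scheduled.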
Repro (DeepSeek V4 Flash, hybrid SWA, page_size=256):
--disable-radix-cache --chunked-prefill-size 2048 \
--tensor-parallel-size 4 --max-running-requests 2
+ a prompt > 2048 tokens (forces multi-chunk)
Before: chunk1 prefill runs, then silent hang or false-positive
"token_to_kv_pool_allocator memory leak detected" SIGQUIT
(the hybrid leak check is also too strict; addressed in a
follow-up commit).
After: 5001-token English prompt -> 3 chunks, HTTP 200 in 26.4s;
6695-token Chinese prompt -> 4 chunks, HTTP 200 in 52.2s.
Origin: sglang itself (not kt-sglang coupling). Reproduces on pip-
installed upstream sglang as well as on the kt third_party submodule.
* fix(scheduler): skip self_check_during_idle when in-flight work still holds KV slots
Defensive guard for the same bug class as the previous commit. When the
scheduler enters the idle branch with chunked_req != None or a non-empty
running_batch / waiting_queue, the in-flight KV slots are not yet freed
nor cached. _check_hybrid_memory then reports them as leaked because
its formula `full_num_used != 0` does not subtract protected_size /
in-flight usage the way _check_radix_cache_memory does. The result was
a SIGQUIT-on-false-positive: 4 TP ranks raise simultaneously and the
server dies mid-request.
The other branches of self_check_during_idle (DisaggregationMode.PREFILL
and .DECODE) already early-return on similar in-flight conditions; this
patch adds the equivalent guard for DisaggregationMode.NULL which had
no such check. The same pattern is used at scheduler.py line 1372 and
process_input_requests around line 1370.
This guard is no longer load-bearing once the scheduler.py 2082 fix is
in (chunked prefill advances every iter, the scheduler never reaches
batch=None mid-request), but is kept as defence-in-depth against any
future path that produces a double-None batch frame.
Origin: sglang itself.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes for Kimi K2.5 RAWINT4 failing to start with CUDA graph:
1. fused_marlin_moe.py: Fix IndentationError from bad merge conflict
resolution — imports were left outside the `if _is_cuda:` block.
2. fused_marlin_moe.py: Add early return for E=0/M=0. When
kt-num-gpu-experts=0, GPU expert weights are empty tensors (E=0).
The marlin MoE kernel crashes on these empty inputs. Return zeros
so KT CPU experts can contribute the full result.
3. deepseek_v2.py: Skip dual-stream path for KT wrapper. The
forward_normal_dual_stream uses alt_stream for shared expert
parallelism, which conflicts with KT wrapper internal _cpu_stream
during CUDA graph capture.
Fixes #1866
- kt_ep_wrapper.py: normalize list-form moe_layer_freq to int
MiMo-V2-Flash uses per-layer mask [0,1,1,...] instead of int freq
- mimo_v2_flash.py: use getattr for pad_token_id (not in MiMo config)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_load_deepseek_v32_model() rewrites architectures from
DeepseekV32ForCausalLM to DeepseekV3ForCausalLM for transformers
compatibility, but the MLA detection list did not include
DeepseekV3ForCausalLM, causing use_mla_backend=False and
MHATokenToKVPool to be created instead of NSATokenToKVPool/MLATokenToKVPool.
DeepseekV32ForCausalLM was missing from the model_arch guard in
_handle_model_specific_adjustments(), so is_deepseek_nsa() was never
reached for V3.2 models. This caused the NSA attention backend to not
be auto-selected, leading to q_rope TypeError with flashinfer or
incorrect behavior with other backends.
Upstream bug introduced in sgl-project/sglang#13687 (commit 618ca2380)
which refactored the flat is_deepseek_nsa() check into a nested block
under model_arch guard but only listed DeepseekV3ForCausalLM.
During CUDA graph capture (regular or PCG), torch.cuda.synchronize()
and CPU-GPU expert coordination are not allowed. Detect capture mode
via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(),
and delegate directly to the GPU method in those cases.
This enables running Qwen3.5 with --attention-backend triton without
--disable-cuda-graph, improving decode from ~11 tok/s to ~65 tok/s.