sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-06-29 19:28:51 +00:00

Author	SHA1	Message	Date
Benjamin	5d6bef9f61	fix(v4): auto-bump swa_full_tokens_ratio to fit chunks_in_flight (#60 )	2026-06-25 16:19:52 +08:00
usrlocalben	1344f647a6	fix: kt_ep_wrapper silently fails to import (#59 )	2026-06-24 12:11:25 +08:00
chenghanke	37eecb4e99	Enable DeepSeek V4 Flash inference on Ampere GPUs (#58 ) * feat: enable DeepSeek V4 Flash inference on Ampere (SM_86) GPUs without FP8 Ampere GPUs (SM_86, e.g. RTX 3090) lack native FP8 support, preventing DeepSeek V4's compressed attention from running. This commit adds BF16 fallback paths throughout the V4 Flash attention pipeline. Changes ported from deepseek-v4-ampere branch, excluding all MTP (Multi-Token Prediction) speculative decoding code. === BF16 KV Cache Mode === - mem_cache/deepseekv4_memory_pool.py: Add use_bf16_cache parameter, SetBf16KAndS Triton kernel, BF16 byte/token size calculations - model_runner_kv_cache_mixin.py: BF16 cell size calculation for compressed attention - memory_profiler.py: BF16 memory profile calculations === V4 Flash NSA/Compressed Attention BF16 Kernels === - nsa/v4_triton_kernel.py: FP8 LUT decode for SM<90 + full BF16 sparse attention kernel (decode_sparse_attention_bf16) - nsa/index_buf_accessor_v4.py: SetBf16KAndS class with Triton kernel and torch fallback; NopeBf16RopeBf16Pack data class - nsa/quant_k_cache_v4.py: BF16 quantize/dequantize path - nsa/triton_kernel.py: SM version check for act_quant fallback - nsa/index_buf_accessor.py: SM version check fallback to vanilla path - compressed/indexer.py: tilelang-based BF16 paged MQA logits kernels (bf16_paged_mqa_logits_tilelang, bf16_direct_paged_mqa_logits_tilelang) + torch fallback for CUDA graph safety - compressed/compressor.py: SM-based dispatch for BF16 vs FP8 compressor - compressed/paged_prefill.py: SWA BF16 cache store dispatch - debug_flash_mla_adapter.py: BF16 decode dispatch - deepseek_v4_backend_radix.py: BF16 SWA key storage dispatch === SM_89 (Ada Lovelace) Capability Guard Fixes === - All compute capability checks use cc < (8, 9) tuple comparison instead of major < 9, correctly excluding SM_89 (RTX 4090) which has native FP8 support and should not use BF16 fallbacks === Function Call / Reasoning Parser === - function_call/deepseekv4_detector.py: JSON format + structural tag - function_call/function_call_parser.py: Register DeepSeekV4Detector - parser/reasoning_parser.py: DeepSeekV3Detector with explicit_thinking === Infrastructure === - cuda_graph_runner.py: Fix KTMoEWrapper.set_capture_batch_sizes() to pass num_tokens (bs * num_tokens_per_bs) instead of num_seqs, since KExpertsCPUBuffer.get_buffer() indexes by hidden_states.shape[0] - server_args.py: Document V4 compressed attention FP8 dtype requirement * revert: restore upstream docstring and remove unused logging in DeepSeekV4Detector Revert the class docstring to the concise upstream version and remove unused / . Keep as the only functional addition. Co-Authored-By: Claude <noreply@anthropic.com> * Update python/sglang/srt/layers/attention/compressed/paged_prefill.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Benjamin <159887351+yyj6666667@users.noreply.github.com>	2026-06-22 13:53:55 +08:00
Benjamin	8b636f9008	Feat/minimax m3 (#56 ) * feat(minimax-m3): port MiniMax-M3 to kvcache base + initial MXFP8 SM90 path Port MiniMax M3 inference support from trial-m3 (PR #27944) to the kvcache-ai/sglang main branch, plus the runtime fixes and MXFP8-on-SM90 enablement landed in subsequent bring-up patches. Core port (model + sparse attention + JIT + CUDA + pools + parsers) - minimax_m3.py (1924 lines), VL variant (minimax_m3_vl.py), VL config. - Sparse attention: MiniMaxHybridAttnBackend, MiniMaxSparseAttnBackend with MSA (SM100), piecewise CUDA graph, TopK block selection. - JIT kernels: QK norm + RoPE, decode TopK, KV index store, UE8M0 quant. - CUDA kernels: fused QK norm, fused KV store, decode TopK, per-token quant. - Memory pool: MiniMaxSparseKVPool (dual-pool: main KV + sparse index). - KT integration: swiglu_alpha plumbing in kt_ep_wrapper + server_args. - Function-call parser: MinimaxM3Detector (]<]minimax[>[ token format). - Reasoning parser: MiniMaxM3Detector (<mm:think> blocks). - MXFP8 MoE configs: H100/H20/H200/A800 tuning for E=128/129. - Tests: JIT kernel, sparse pool, function call, E2E, AMD benchmarks. - Infrastructure: forward_context, cuda_graph_config, fused_moe (upstream). Initial MXFP8 SM90 enablement (block-fp8 conversion pipeline) - MXFP8 -> block-FP8 [128,128] conversion pipeline + M3 model defaults (loader converts raw MXFP8 weights to block-FP8 at load time). - Allow Triton MoE backend for MXFP8 on SM90 (Hopper / H20 / H100): Fp8MoEMethod accepts MXFP8 weights, runs the block-FP8 path after conversion. - Disable piecewise CUDA graph for the MXFP8 block-fp8 conversion path (the conversion runs once at load, but piecewise capture re-traced the load step). - Adapt KT EP wrapper for yyj6666667/ktransformers feat/mxfp8-moe-m3: CPU-side MXFP8 byte-copy + GPU shadow gpu_method holding block-FP8 view, so hybrid forwards take the same Fp8MoEMethod path as pure GPU. Runtime fixes for kvcache base - get_rope_config compatibility, dense-sparse decode guard for the hybrid attn backend, VL processor tokenizer access on kvcache's Processor wrapper. Source: yyj6666667/sglang trial-m3 (3db7af4d4) / sgl-project/sglang PR #27944. Target: kvcache-ai/sglang main (`51032b712`). This commit is the consolidation of the original 7 bring-up commits (3073753, 8c597a7, 3b9dbfe, bd6c876, cc84e6e, b712433, 77d514a) — see archive/pre-port-squash-20260618 for the pre-squash history. * fix(minimax-m3): port upstream infra so M3 launches on kvcache base Bring-up patches surfacing as the M3 launch pipeline crashes through: module-load → ServerArgs → model construct → attn backend → KV pool → CUDA graph. Each fix is the smallest delta from upstream M3 expectations to the current kvcache fork. env (environ.py) - Register all env vars M3 references at __init__ / capture time (EnvBool(False) to keep current default behaviour): SGLANG_OPT_USE_MINIMAX_FUSED_QKNORM_ROPE, SGLANG_OPT_USE_MINIMAX_FUSED_KV_INDEX_STORE, SGLANG_OPT_USE_MINIMAX_DECODE_TOPK_RADIX, SGLANG_OPT_USE_BF16_ROUTER_GEMM, SGLANG_OPT_USE_MINIMAX_DENSE_SPARSE_DECODE, SGLANG_DISABLE_MSA. moe utils (layers/moe/utils.py) - Backport upstream get_moe_padding_size(is_aiter_moe) + AITER/TRITON padding constants. triton_utils import paths (moe_runner/triton_utils/{fused_moe,mxfp8_moe}.py) - Rewrite imports to use existing .fused_moe_triton.* symbols (sibling .moe_runner.triton_utils.* modules don't exist on kvcache). - Wrap sglang.jit_kernel.activation import in try/except (M3 only needs swiglu_no_interleaved_with_alpha_and_limit, a 5-line torch.compile fn). mxfp8_native.py - Port upstream native MXFP8 quant module (mxfp8_e4m3_quantize consumed by mxfp8_moe.py). schedule_batch.MultimodalProcessorOutput - Add upstream dataclass; M3 VL processor returns it from process(). FusedMoE.__init__ - Accept interleaved: bool = False; assert against True (kvcache triton MoE assumes non-interleaved W13). KV pool selection (model_runner_kv_cache_mixin.py) - Instantiate MiniMaxSparseKVPool directly when is_minimax_sparse(hf_config); pool_registry's M3 registration is gated behind is_v4_model and never fires for M3. MiniMaxSparse/HybridAttnBackend - Add init_forward_metadata wrapper and legacy capture/replay adapters that translate to upstream's _out_graph(in_capture=) + _in_graph split. KT-EP wrapper (kt_ep_wrapper.py) - Remove duplicate swiglu_alpha kwarg and out-of-scope hf_config reference in create_kt_config_from_server_args (SyntaxError took out KT-EP, every rank fell back to standard FP8 weight allocation and OOM'd at TP=8). cuda_graph_config (cuda_graph_config.py) - check_cuda_graph_backend reads flat fields (disable_cuda_graph, enable_piecewise_cuda_graph, ...) when the aggregate cuda_graph_config object is absent (kvcache keeps the cuda-graph args flat). * fix(minimax-m3): support apply_routed_scaling_factor_on_output in fused_topk select_experts asserted `not apply_routed_scaling_factor_on_output` on the non-grouped / non-sqrtsoftplus paths (topk.py:1024). M3's MoE passes `apply_routed_scaling_factor_on_output=True` to TopK, which falls into the fused_topk branch (scoring_func != "sqrtsoftplus"), so capture-time forward hit AssertionError. fused_topk itself does not take the kwarg. The semantic is just "multiply topk_weights by routed_scaling_factor on the output side", so apply it manually after the fused_topk call. * refactor(minimax-m3): read swiglu alpha/limit from MoeRunnerConfig, drop CLI flags kt_ep_wrapper.py - KTEPWrapperMethod.create_weights() now pulls gemm1_alpha / gemm1_clamp_limit / swiglu_limit from layer.moe_runner_config (which the model file populates from HF config — minimax_m3.py forwards config.swiglu_alpha / swiglu_limit; deepseek_v2.py forwards config.swiglu_limit). This is the same data source the GPU runners already use, so CPU and GPU experts can no longer drift to different alpha/limit values (the silu-vs-swigluoai mismatch from 8469f58a83). - Remove KTConfig.swiglu_alpha / swiglu_limit fields and their wiring in create_kt_config_from_server_args. - Drop the SGLANG_DSV4_2604_SUBMODE env fallback for swiglu_limit: DSV4-2604B's limit=10 already flows through MoeRunnerConfig via deepseek_v2.py, no env override needed. The unrelated 2604B path-checker bump later in the file is kept. server_args.py - Delete --kt-swiglu-alpha / --kt-swiglu-limit CLI flags and their ServerArgs fields. Launch scripts that passed them (M3 hybrid) must drop the two flags after upgrading. Verified on qjh005 (TP=4 GPUs 4-7, no CLI flags): kt-kernel's [M3-PROBE-2] / [M3-PROBE-3] prints show layer_idx=4 swiglu_alpha=1.702 swiglu_limit=7.0 reaching the wrapper, matching M3's HF config defaults. * feat(minimax-m3): native MXFP8 path + shared-expert routing on SM90 Previously M3 on H20 (SM90) was forced through the lossy MXFP8 -> block-FP8 [128,128] convert path, collapsing 294,912 mxfp8 scales (uint8 ue8m0 [1,32]) per expert w13 into 576 block-fp8 scales (fp32). M3 was trained with mxfp8 granularity; the squash + a missing shared-expert routing step + a missing M3 thinking_mode branch combined to give garbage output (0% GSM8K on pure sglang). Net result of this commit: pure sglang GSM8K 87%, hybrid (--kt-num-gpu-experts=8 + layerwise prefill) 89% over 200 questions. Fp8MoEMethod / Fp8LinearMethod (layers/quantization/fp8.py) Add use_mxfp8_native_sm90 gate (CUDA + SM90 + use_mxfp8 + not SM100+). When True: process_weights_after_loading keeps weights raw MXFP8 (no convert), apply() routes to get_triton_quant_info -> TritonMoeQuantInfo (use_mxfp8=True) for MoE and to dot_scaled_mxfp8_blockscaled_linear (the Triton tl.dot_scaled path in mxfp8_native.py, no SM100 swizzle) for dense linear. Add a no-op _process_mxfp8_linear_weight_scale so M3's MiniMaxM3FusedQKVIndex no longer crashes when its scale-finalize hook runs on the raw-MXFP8 path. TritonMoeQuantInfo / TritonRunnerCore / fused_experts_none_to_triton (layers/moe/moe_runner/triton.py) Add use_mxfp8 bool field (default False; non-MXFP8 callers unaffected). Both run() and the registered fused-func dispatch route to triton_utils/mxfp8_moe.py:fused_experts_mxfp8 (formerly dead code: zero callers) when use_mxfp8 is set. interleaved=False is hardcoded since M3's FusedMoE asserts it. KT-EP layerwise prefill (layers/moe/kt_ep_wrapper.py) Byte-copy MXFP8 only; shadow gpu_method stays in MXFP8 view so apply() takes the same native MXFP8 path as the steady-state hybrid forward. Shared-expert routing (layers/moe/topk.py) Installed sgl_kernel.topk_sigmoid is 5-arg, doesn't accept num_fused_shared_experts. Without intervention M3's shared expert (id = num_local_experts = 128) was never routed to, killing the always-active component. select_experts now appends the shared id to topk_ids[:, -1] and assigns the matching weight (sum(routed)/rsf, then renormalize routed-only, then *= rsf), mirroring biased_grouped_topk semantics. Brings GSM8K from 0% to ~85%+. M3 thinking_mode in serving_chat (entrypoints/openai/serving_chat.py) Equivalent of upstream 49dde29cc6 + 297abb2838. reasoning_parser.py factory already had the M3 branch; serving_chat._get_reasoning_from_request was missing it, so every M3 request fell through to `return True` (force reasoning), shoving the answer into reasoning_content. Now: thinking_mode == "enabled" forces reasoning, the other three (disabled / adaptive / unset) let the detector self-handle the <mm:think> tag.	2026-06-21 17:09:49 +08:00
usrlocalben	31124d0ae7	fix: kt_ep_wrapper silently fails to import after `a2f451315` (#57 ) Upstream sglang removed the gptq_marlin* kernels to the JIT system. - Added to JIT in sglang upstream PR #18543 - Removed from sgl-kernel in sglang upstream PR #19241 At that moment the kt_ep_wrapper should have been updated but was not. In many cases the trouble is not observed since sgl-kernel is often installed from (stale) prebuilt wheels. When building from source however, e.g. for CUDA123/sm120, the produced binary does not have gptq_marlin_repack and kt_ep_wrapper is not imported.	2026-06-21 14:03:44 +08:00
Jianwei Dong	a2a8a7e9e0	support glm5.2 Merge/sgl 27114 onto `537eb762`	2026-06-16 16:33:56 +08:00
yyj	ee149528a3	fix(dsa): wire skip_topk-gated indexer for GlmMoeDsa to unblock GLM-5.2 Three changes to make GlmMoeDsaForCausalLM (e.g. GLM-5.2-0610-Provider-FP8) load and forward end-to-end on top of the PR 27114 cherry-pick: 1. DeepseekV2AttentionMLA.__init__: add the is_nextn parameter that PR 27114 assumed was already in scope. It was not in kt-sglang base; the auto-merge inserted self.is_nextn = is_nextn without the corresponding param, raising NameError at construction. Also thread is_nextn= through the DecoderLayer call site. 2. Compute self.skip_topk / self.next_skip_topk BEFORE creating the Indexer, and only instantiate Indexer when (not skip_topk) or is_nextn. GLM-5.2's checkpoint ships indexer weights only for "full" layers (driven by index_topk_freq=4, index_skip_topk_offset=3); creating an Indexer module on shared layers either zero-inits its weights and emits out-of-range topk indices that hang the NSA gather kernel (observed: 5-minute watchdog timeout at the first forward) or fails a strict load. 3. Gate the three call sites that invoke self.indexer (deepseek_v2.py forward_absorb_prepare alt-stream + non-alt branches, plus deepseek_common/attention_forward_methods/forward_mha.py forward_normal_prepare) on (not self.skip_topk) or (self.is_nextn and forward_batch.topk_indices is None) and use forward_batch.topk_indices as the cross-layer carrier so shared layers reuse the most recent full layer's topk_indices. Smoke-verified on qjh007 8x H20: TP=8, eager mode, --mem-fraction-static 0.97, --kv-cache-dtype fp8_e4m3. Three diverse prompts return coherent text; e2e_latency 6-14s for 24-48 tokens. DecoderLayer.forward still returns the original 2-tuple. The full Tier C plumbing (3-tuple decoder return plus a signature-level prev_topk_indices argument) remains future work; the ForwardBatch-carried path here is sufficient for serving GLM-5.2 in eager mode.	2026-06-13 20:19:01 +08:00
Yuxuan Zhang	dd9ba529f6	[Bugfix] Restore overridden HF config fields and support index_skip_topk_offset for DSA topk sharing (#27114 )	2026-06-13 18:41:26 +08:00
Jiaheng Dai	51032b7127	feat: support end-to-end KT LoRA serving for Qwen3.5 MoE (#53 )	2026-06-05 15:37:28 +08:00
Benjamin F	ebaff7729b	fix: regressions (scheduler hang, cuda graph TypeError, MXFP4 cache, rsf double-apply) (#50 ) * fix(v4-flash): remove broken MXFP4 weight cache + fix rsf double-apply move routed_scaling_factor application from inside apply_v4_triton_kernels_moe to the caller (mxfp4_deepseek.apply), mirroring the trtllm path convention. This fixes a latent double-apply when SGLANG_OPT_MXFP4_FUSE_RSF_SHARED_ADD is enabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(scheduler): revert PR #38 req_pool changes that break TP-only mode PR #38 introduced changes that together cause scheduler hang on TP-only configurations with max_running_requests=1: 1. scheduler.py: Removed `if self.pp_size > 1:` guard in get_num_allocatable_reqs, causing TP-only mode to check available_size() unconditionally. 2. memory_pool.py: Changed free_slots from `range(size)` to `range(1, size)` to reserve index 0. With max_running_requests=1, this produces empty free_slots list. 3. scheduler_runtime_checker_mixin.py: Changed expected_free from `req_total_size` to `req_total_size - 1` to match the reserved slot. This fix reverts all 4 locations to v0.6.1.post1 behavior. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix(cuda_graph): use out-of-band _replay_forward_batch for non-DSV4 backends Cherry-pick fix from upstream 3ffc34dbe to resolve TypeError when non-DSV4 backends (TritonAttnBackend, etc.) receive unexpected out_cache_loc kwarg during CUDA graph replay. Instead of passing out_cache_loc as a parameter (which requires all backends to update their signatures), use an out-of-band attribute: - Set attn_backend._replay_forward_batch before the call - DSV4 backend reads out_cache_loc from this attribute - Clear the attribute after the call Conflict resolution: kept kt-sglang's attribute path `self.model_runner.attn_backend` (vs upstream's `self.attn_backend`). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: remove undefined _GraphBucket reference in cuda graph replay --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-14 14:00:27 +08:00
Benjamin F	43ed1ec77a	refactor(dsv4): isolate DeepSeek V4 Flash behind plugin registries (#47 ) Squash of 5 kt-side commits on top of upstream main: the four v4-2604B follow-up fixes (env defaults flip, topk kwargs drop, hf-transformers backup gate, SwiGLU torch.compile collapse) plus the dsv4-plugin-redo refactor. Three other v4-2604B fixes (bf16 cpu_buf, mxfp4 SwiGLU clamp, kt swiglu_limit) are already squash-merged into upstream as part of PR #44 (`6ac4f82e8`) and were skipped during rebase. Goal: prevent DSV4 module bugs / missing dependencies from breaking non-DSV4 model loads (Qwen, GLM, Kimi, etc.), while preserving full DSV4-Flash functionality on supported hardware. ## Architecture Four new plugin-registry modules in base sglang that DSV4 self-attaches to at module-load time: - layers/moe/quant_method_registry.py — chained-wrap MoE quant methods (mxfp4_deepseek priority=10, kt_ep priority=20) - mem_cache/pool_registry.py — KV pool factory dispatch (DeepSeekV4TokenToKVPool) - managers/coordinator_registry.py — request coordinator factory (HiSparseCoordinator) - managers/forward_hooks_registry.py — scheduler / runner lifecycle event dispatch (HiSparse hook adapter) DSV4 plugin entry: models/deepseek_v4.py runs side-effect imports of kt_ep_wrapper, mxfp4_deepseek, hisparse_coordinator, deepseekv4_memory_pool. Each of those self-registers; non-DSV4 models never trigger these imports when SGLANG_DISABLED_MODEL_ARCHS skips deepseek_v4 / deepseek_v4_nextn. ## isinstance → duck-type tags Replaced 12+ isinstance(_, DSV4Class) checks across base files with class-attribute tags (_quant_wrapper_id, _is_v4_token_pool, _is_dsv4_backend_radix). Base files no longer need to import DSV4 classes just to test object identity. ## Bundled v4-2604B fixes (originally separate commits) - environ.py: flip 25 SGLANG_DSV4_* / SGLANG_OPT_* env defaults to OFF so non-DSV4 paths default to upstream behavior. - moe/topk.py: drop 2 kwargs from select_experts else-branch's fused_topk call that PR #38 left incompatible with non-DSV4 callers. - utils/hf_transformers_utils.py: gate the deepseek backup-config path on _peek_is_deepseek_arch so non-deepseek models (Qwen3, GLM, Kimi) with no top-level num_hidden_layers don't RuntimeError on startup. - moe/fused_moe_triton/fused_moe.py: collapse the 60-line DSV4-specific SwiGLU clamp branch in fused_experts_impl down to 5 lines via a reused torch.compile dispatch helper (_swiglu_clamp_silu_mul). ## Pre-existing PR #38 bugs surfaced and fixed in this branch - configs/deepseek_v4.py: was double-`@dataclass`-decorated by transformers v5+ PretrainedConfig.__init_subclass__, which stripped default_factory(...) from quantization_config / rope_scaling / compress_ratios in some builds, causing `'Field' object has no attribute 'to_dict'` at runtime. Rewritten to traditional __init__ kwargs idiom, matching all other sglang configs (afmoe, chatglm, dbrx, bailing_hybrid, ...). - utils/hf_transformers_utils.py: _load_deepseek_temp_model hardcoded config_json["model_type"] = "deepseek_v3" even for V4 architecture, causing AutoConfig to resolve transformers' DeepseekV3Config (which doesn't expose rope_theta / compress_rope_theta / compress_ratios at the top level in transformers-kt 5.6.0). Now picks "deepseek_v4" for DeepseekV4ForCausalLM architecture. - models/deepseek_v2.py: SGLANG_DSV4_MODE=2604 in operator's shell caused config.num_hash_layers AttributeError on non-DSV4 configs inheriting DeepseekV2MoE (e.g., GlmMoeDsaConfig). Now gated on is_deepseek_compressed(config). - models/deepseek_v4.py: side-effect plugin imports wrapped in try/except so a sibling failure (e.g., flashinfer < 0.6.9 trips the module-load version check in mxfp4_deepseek) doesn't block DeepseekV4ForCausalLM from registering with ModelRegistry. - _V4MoE subclass replaces is_deepseek_v4 boolean flag pollution in DeepseekV2MoE — V4 NextN draft layers bypass hash MoE via _compute_is_hash override. ## Robustness fixes from E2E hardware testing - Triton kernels MXFP4 path: force num_stages=2 in triton_kernels.opt_flags constraints to defend against the bare `assert num_stages >= 1` for capabilities outside the tested matrix. - launch_server.py: sweep stale ninja `lock` / `.ninja_lock` files under ~/.cache/torch_extensions older than 30 minutes (configurable via SGLANG_STALE_LOCK_AGE_MINUTES) so a SIGKILL'd build doesn't wedge the next launch indefinitely. ## Verified - E2E pass on Qwen2.5-7B, Qwen3.5-FP8, Qwen3.5-35B-A3B-FP8, Qwen3-Coder-Next-FP8, Kimi-K2.5 (non-DSV4 models, hardware confirmed). - E2E pass on DeepSeek-V4-Flash with TP=8, MXFP4 routed experts, KT-EP CPU/GPU split, hash-MoE, NSA sparse attention (after pinning flashinfer>=0.6.9, apache-tvm-ffi==0.1.9, tilelang>=0.1.8). - 0 DSV4 modules in sys.modules when SGLANG_DISABLED_MODEL_ARCHS skips deepseek_v4 / deepseek_v4_nextn — DSV4 plugin failures cannot affect non-DSV4 startup. - pyproject.toml unchanged: drop-in replacement for kt-sglang pre-DSV4 packaging. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 16:33:18 +08:00
Benjamin F	6ac4f82e8f	fix(v4-flash): bundle V4-2604B SwiGLU clamp + hybrid SWA chunked-prefill hang fix (#44 ) * fix(kt-ep): match cpu_buf dtype to kt-kernel's bf16 scale write for MXFP4 kt-kernel's write_weights_to_buffer (operators/amx/fp4-moe.hpp) writes gate/up scales as bf16 via fast_fp32_to_bf16, but mxfp4_deepseek allocates w13/w2_weight_scale_inv as fp32. The 2x element-size mismatch caused kt-kernel to fill only the first half of cpu_buf in fp32-element terms; after Phase 3 .to(float8_e8m0fnu) the second half (= up_proj rows) became 2^-127, zeroing dequantized up_proj weights for all experts loaded via the kt double-buffered pipeline. Single-chunk GPU prefill on V4-Flash MXFP4 produced mode-collapsed garbage as a result. Allocate the cpu_buf with bf16 dtype for these two scale tensors so kt-kernel's write fills it exactly; gpu_t[e].copy_(cpu_buf[slot]) then performs the bf16->fp32 dtype cast automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v4-2604b): apply SwiGLU clamp on triton-kernels GPU MoE path The trtllm and deep_gemm paths both apply a 2604B-specific asymmetric gate/up clamp (gate.clamp(max=limit); up.clamp(-limit, limit)) on the gemm1 output before silu_and_mul. The triton-kernels path (default GPU MoE on every capability outside _TRTLLM_FP4_CAPS, including SM_120 RTX 5090) was constructing a bare matmul_ogs → silu_and_mul → matmul_ogs sequence with no clamp, leaving routed-expert outputs numerically inconsistent with the trtllm reference on long-prompt / large-activation tokens. Threads moe_runner_config.swiglu_limit through DeepSeekMxfp4MoEMethod.apply to apply_v4_triton_kernels_moe; semantics match moe_runner/deep_gemm.py:_apply_swiglu_limit verbatim. No-op when submode != 2604B (swiglu_limit is None). Origin: sglang 本身. * feat(v4-2604b): pass swiglu_limit through KTEPWrapper to kt-kernel The kt CPU expert path was applying plain silu(g)u with no clamp, diverging from the trtllm `gemm1_clamp_limit` and deep_gemm `_apply_swiglu_limit` references on long-prompt / large-activation tokens. Companion changes in kt-kernel (`feat/v4-2604b-swiglu-clamp:d10bd3d`) plumb a `swiglu_limit` field through `MOEConfig` into the AMX `act_fn`; this commit passes the value through the kt-sglang bridge. The KTMoEWrapper is constructed in `create_weights`, before `create_moe_runner` delivers the full `MoeRunnerConfig`, but the value is fully determined by SGLANG_DSV4_2604_SUBMODE which is fixed at process start, so we read the env directly here. Mirrors the `assert swiglu_limit == 10` in moe_runner/deep_gemm.py and the `torch.full(..., swiglu_limit, ...)` constructor in mxfp4_deepseek.py:177-186. Origin: kt-sglang 耦合. fix(scheduler): correct inverted chunked_req check that hangs hybrid SWA chunked prefill In _get_new_batch_prefill_raw the inline comment explicitly says "Ignore the check if self.chunked_req is not None" but the code below used `is not None`, which is the opposite. With --disable-radix-cache + hybrid SWA + multi-chunk prompt, the chunked_req keeps holding its req_pool slot across chunks (ChunkCache.cache_unfinished_req does not release it), and ReqToTokenPool initialises free_slots = list(range(1, size)) wasting index 0, so once chunked_req takes the only available slot the check fires forever and the scheduler returns None on every iteration -> silent hang (chunk1 prefill completes, chunk2 never starts; TP CPU 60-145% busy spin; client request never returns). The sister check at line 2065 (`and self.chunked_req is None: return None`) is correctly inverted; this brings line 2082 in line with the comment and with that sister check. Repro (DeepSeek V4 Flash, hybrid SWA, page_size=256): --disable-radix-cache --chunked-prefill-size 2048 \ --tensor-parallel-size 4 --max-running-requests 2 + a prompt > 2048 tokens (forces multi-chunk) Before: chunk1 prefill runs, then silent hang or false-positive "token_to_kv_pool_allocator memory leak detected" SIGQUIT (the hybrid leak check is also too strict; addressed in a follow-up commit). After: 5001-token English prompt -> 3 chunks, HTTP 200 in 26.4s; 6695-token Chinese prompt -> 4 chunks, HTTP 200 in 52.2s. Origin: sglang itself (not kt-sglang coupling). Reproduces on pip- installed upstream sglang as well as on the kt third_party submodule. * fix(scheduler): skip self_check_during_idle when in-flight work still holds KV slots Defensive guard for the same bug class as the previous commit. When the scheduler enters the idle branch with chunked_req != None or a non-empty running_batch / waiting_queue, the in-flight KV slots are not yet freed nor cached. _check_hybrid_memory then reports them as leaked because its formula `full_num_used != 0` does not subtract protected_size / in-flight usage the way _check_radix_cache_memory does. The result was a SIGQUIT-on-false-positive: 4 TP ranks raise simultaneously and the server dies mid-request. The other branches of self_check_during_idle (DisaggregationMode.PREFILL and .DECODE) already early-return on similar in-flight conditions; this patch adds the equivalent guard for DisaggregationMode.NULL which had no such check. The same pattern is used at scheduler.py line 1372 and process_input_requests around line 1370. This guard is no longer load-bearing once the scheduler.py 2082 fix is in (chunked prefill advances every iter, the scheduler never reaches batch=None mid-request), but is kept as defence-in-depth against any future path that produces a double-None batch frame. Origin: sglang itself. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 23:00:52 +08:00
Benjamin F	c9edb75e0c	Fix/v4flash gpu prefill fallback mxfp4 (#41 )	2026-05-03 19:31:25 +08:00
Benjamin F	40d3a822af	fix(v4-flash): runtime check that flashinfer >= 0.6.9 before MXFP4 MoE imports (#40 )	2026-05-03 13:10:01 +08:00
Peilin Li	2bcdc9b63d	fix(qwen3.5): normalize MoE nested configs Normalize Qwen3.5 nested text/vision configs so Qwen3.5 MoE checkpoints expose config attributes after Transformers v5 loading.	2026-05-02 23:45:43 +08:00
Codex	51ea403866	fix(qwen3.5): normalize moe nested configs	2026-05-02 15:44:48 +00:00
Benjamin F	3cbd49c291	feat(deepseek-v4-flash): DeepSeek V4 Flash model + consumer-GPU (SM_120) support (#38 ) Adds end-to-end DeepSeek V4 Flash inference support, with portable fallbacks for hardware lacking vendor kernels (e.g. RTX 5090 / SM_120). Model - Register DeepSeekV4 model + config + reasoning parser (HF AutoConfig). - Server args: V4-specific dispatch flags + KT hybrid CPU/GPU MoE wrapper. NSA (sparse FP8 MLA decode) - CUDA path uses the flash_mla wheel on whitelisted caps {(9, 0)}. - Other caps fall back to a portable Triton kernel (decode_sparse_attention_triton, ported from vLLM PR #40929) — handles FP8/BF16 dual cache, swa+extra dispatch, and attention sink. MoE (MXFP4 routed experts) - cap (10, 0): trtllm-fp4 (existing path, unchanged). - All other caps: OpenAI triton_kernels matmul_ogs with strided MXFP4 layout (_use_strided_layout / _patch_strided_mxfp). - KT hybrid wrapper splits experts between CPU (kt-kernel cpuinfer) and GPU (kt-num-gpu-experts), with cudaLaunchHostFunc for CG-capturable CPU triggers. Capability-driven dispatch hardening - DeepGEMM, fp8-paged-mqa-logits, NSA flash_mla, MoE trtllm-fp4 all gated by torch.cuda.get_device_capability() whitelists; non-whitelist caps automatically fall to portable Triton/TileLang paths. - launch_server.py auto-injects FLASHINFER_CUDA_ARCH_LIST / TORCH_CUDA_ARCH_LIST (with `a` suffix per cap) before flashinfer imports — zero arch-list env required at launch. Co-authored-by: Oql <1692110604@qq.com>	2026-05-02 21:45:22 +08:00
Codex	8be8492fd2	fix: accept raw fp8 kt method alias	2026-04-29 18:12:20 +00:00
Codex	d647a4241b	fix: align sglang kt with transformers kt	2026-04-29 17:44:33 +00:00
Codex	8a86fb89c7	fix: align sglang-kt torch 2.9.1 runtime baseline	2026-04-29 16:05:47 +00:00
Benjamin F	1d10fc8f95	[fix](Qwen3-30b-A3B): fix gibberish output by correcting RoPE write-back (#31 ) - Restore in-place storage for q/k and ensure k_cache is updated with the rotated values.	2026-04-27 21:34:59 +08:00
Codex	c819c37879	fix(packaging): align torch deps for kt 0.6.1	2026-04-25 20:20:00 +08:00
ouqingliang	eb0898d217	[refactor]: rename KT fallback logs to layerwise prefill	2026-04-09 03:26:44 +00:00
ouqingliang	a53cb6d078	add kt-numa-nodes	2026-03-31 09:30:52 +00:00
ouqingliang	6b5135e1a1	fix speculative worker and moe layer stability	2026-03-31 02:37:57 +00:00
djw	a45b8d6976	[fix]: update self-referencing extras from sglang to sglang-kt	2026-03-04 08:53:14 +00:00
djw	6b8b5f4649	Use static versioning for sglang-kt, starting at 0.5.2	2026-03-04 06:41:40 +00:00
djw	480c3229d5	Rename PyPI package from sglang to sglang-kt	2026-03-04 06:31:51 +00:00
ouqingliang	b3356b6c46	fix(kt): synchronize INT4 double-buffer slot reuse in fallback prefill	2026-03-03 03:45:30 +00:00
ouqingliang	f1a12b9a93	fix(moe): harden marlin routing and int4 param resolution	2026-03-02 09:35:59 +00:00
xwy-amd8	8d06c338d4	fix(kt): fix Kimi K2.5 RAWINT4 CUDA graph capture crash Three fixes for Kimi K2.5 RAWINT4 failing to start with CUDA graph: 1. fused_marlin_moe.py: Fix IndentationError from bad merge conflict resolution — imports were left outside the `if _is_cuda:` block. 2. fused_marlin_moe.py: Add early return for E=0/M=0. When kt-num-gpu-experts=0, GPU expert weights are empty tensors (E=0). The marlin MoE kernel crashes on these empty inputs. Return zeros so KT CPU experts can contribute the full result. 3. deepseek_v2.py: Skip dual-stream path for KT wrapper. The forward_normal_dual_stream uses alt_stream for shared expert parallelism, which conflicts with KT wrapper internal _cpu_stream during CUDA graph capture. Fixes #1866	2026-03-01 23:12:05 +08:00
Chen Hongtao	a8b821aee8	fix(kt): robust quant detection in layerwise fallback (#23 ) Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>	2026-02-28 19:45:03 +08:00
chenht2022	529d06ac2b	fix(kt): harden expert remap and metadata fallback	2026-02-28 05:54:52 +00:00
xwy-amd8	48b12817a0	Fix MiMo-V2-Flash KTransformers compatibility - kt_ep_wrapper.py: normalize list-form moe_layer_freq to int MiMo-V2-Flash uses per-layer mask [0,1,1,...] instead of int freq - mimo_v2_flash.py: use getattr for pad_token_id (not in MiMo config) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 03:53:32 +00:00
xwy-amd8	e6428614ab	Fix: add DeepseekV3ForCausalLM to MLA detection list _load_deepseek_v32_model() rewrites architectures from DeepseekV32ForCausalLM to DeepseekV3ForCausalLM for transformers compatibility, but the MLA detection list did not include DeepseekV3ForCausalLM, causing use_mla_backend=False and MHATokenToKVPool to be created instead of NSATokenToKVPool/MLATokenToKVPool.	2026-02-26 15:23:59 +00:00
xwy-amd8	2a1bafeb16	Revert "Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list" This reverts commit `01202ee43d`.	2026-02-26 14:10:22 +00:00
xwy-amd8	01202ee43d	Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list DeepseekV32ForCausalLM was missing from the model_arch guard in _handle_model_specific_adjustments(), so is_deepseek_nsa() was never reached for V3.2 models. This caused the NSA attention backend to not be auto-selected, leading to q_rope TypeError with flashinfer or incorrect behavior with other backends. Upstream bug introduced in sgl-project/sglang#13687 (commit `618ca2380`) which refactored the flat is_deepseek_nsa() check into a nested block under model_arch guard but only listed DeepseekV3ForCausalLM.	2026-02-26 14:03:55 +00:00
xwy-amd8	6a63993e9f	Revert "Skip KT CPU-GPU coordination during CUDA graph capture" This reverts commit `2ba1f0dea6`.	2026-02-26 12:57:31 +00:00
xwy-amd8	2ba1f0dea6	Skip KT CPU-GPU coordination during CUDA graph capture During CUDA graph capture (regular or PCG), torch.cuda.synchronize() and CPU-GPU expert coordination are not allowed. Detect capture mode via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(), and delegate directly to the GPU method in those cases. This enables running Qwen3.5 with --attention-backend triton without --disable-cuda-graph, improving decode from ~11 tok/s to ~65 tok/s.	2026-02-26 08:34:30 +00:00
xwy-amd8	4605b77c7f	Fix: revert kt_ep_wrapper.py for kt-kernel 0.5.1 compat, fix rope_scaling property access	2026-02-26 07:57:30 +00:00
xwy-amd8	a2f4513154	Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN	2026-02-26 07:52:14 +00:00
Yilong Zhao	de3d1e7669	[misc] use ORJSONResponse in http-server generate (#19191 )	2026-02-25 21:26:25 -08:00
Alison Shao	0fd44ff342	Fix NSA CP positions mismatch in eagle NextN model (#19367 )	2026-02-25 20:14:33 -08:00
Xinyu Zhang	119c91cb8b	Skip signal handler registration when not on main thread (#18752 )	2026-02-25 19:30:05 -08:00
Minglei Zhu	b3202fe6d0	[PCG] fix piecewise cuda graph for Qwen3.5 (#19220 )	2026-02-26 11:16:52 +08:00
Alison Shao	a0a8f1473c	[Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363 ) Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net> Co-authored-by: sglang-bot <sglangbot@gmail.com>	2026-02-25 18:45:54 -08:00
sglang-bot	6e82183f5a	[Disagg] Route disagg prefill results through `process_batch_result` (#19364 )	2026-02-25 18:38:39 -08:00
fzyzcjy	265eb56d44	Support multi-step alignment and pipeline integration in dump comparator (#19378 )	2026-02-26 10:23:22 +08:00
Yuan Luo	4e843f1216	[DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache (#19148 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>	2026-02-26 10:23:10 +08:00
fzyzcjy	f9a2f0398f	Support token aligner planning and execution in dump comparator (#19377 )	2026-02-26 10:04:33 +08:00

1 2 3 4 5 ...

6716 Commits