10204 Commits

ouqingliang
eb0898d217 [refactor]: rename KT fallback logs to layerwise prefill 2026-04-09 03:26:44 +00:00
ouqingliang
a53cb6d078 add kt-numa-nodes 2026-03-31 09:30:52 +00:00
ouqingliang
6b5135e1a1 fix speculative worker and moe layer stability 2026-03-31 02:37:57 +00:00
Jianwei Dong
f6adb4f473 Merge pull request #26 from kvcache-ai/fix/sglang-kt-self-referencing-extras
Fix/sglang kt self referencing extras
2026-03-04 16:54:25 +08:00
djw
a45b8d6976 [fix]: update self-referencing extras from sglang to sglang-kt 2026-03-04 08:53:14 +00:00
djw
6b8b5f4649 Use static versioning for sglang-kt, starting at 0.5.2 2026-03-04 06:41:40 +00:00
Jianwei Dong
7241d118b4 Merge pull request #25 from kvcache-ai/rename-to-sglang-kt
Rename PyPI package from sglang to sglang-kt
2026-03-04 14:33:22 +08:00
djw
480c3229d5 Rename PyPI package from sglang to sglang-kt 2026-03-04 06:31:51 +00:00
ouqingliang
b3356b6c46 fix(kt): synchronize INT4 double-buffer slot reuse in fallback prefill 2026-03-03 03:45:30 +00:00
ouqingliang
f1a12b9a93 fix(moe): harden marlin routing and int4 param resolution 2026-03-02 09:35:59 +00:00
xwy-amd8
8d06c338d4 fix(kt): fix Kimi K2.5 RAWINT4 CUDA graph capture crash
Three fixes for Kimi K2.5 RAWINT4 failing to start with CUDA graph:

1. fused_marlin_moe.py: Fix IndentationError from bad merge conflict
   resolution — imports were left outside the `if _is_cuda:` block.

2. fused_marlin_moe.py: Add early return for E=0/M=0. When
   kt-num-gpu-experts=0, GPU expert weights are empty tensors (E=0).
   The marlin MoE kernel crashes on these empty inputs. Return zeros
   so KT CPU experts can contribute the full result.

3. deepseek_v2.py: Skip dual-stream path for KT wrapper. The
   forward_normal_dual_stream path uses alt_stream for shared-expert
   parallelism, which conflicts with the KT wrapper's internal
   _cpu_stream during CUDA graph capture.

Fixes #1866
2026-03-01 23:12:05 +08:00
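Fix 2 in the commit above describes an early-return guard for empty expert inputs. A minimal sketch of that guard pattern, assuming plain nested lists in place of tensors (`moe_forward_or_zeros` and its parameters are illustrative names, not the actual fused_marlin_moe.py API):

```python
def moe_forward_or_zeros(hidden_states, expert_weights, kernel):
    """Sketch of an E=0/M=0 early-return guard in front of a MoE kernel."""
    M = len(hidden_states)   # number of tokens
    E = len(expert_weights)  # number of GPU experts
    if E == 0 or M == 0:
        # With kt-num-gpu-experts=0 the GPU expert weights are empty (E == 0)
        # and the marlin kernel would crash on empty inputs; return zeros so
        # the KT CPU experts can contribute the full result when summed in.
        K = len(hidden_states[0]) if hidden_states else 0
        return [[0.0] * K for _ in range(M)]
    return kernel(hidden_states, expert_weights)
```

The point of returning zeros rather than skipping the call entirely is that downstream code can sum the GPU and CPU expert contributions unconditionally.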
Chen Hongtao
a8b821aee8 fix(kt): robust quant detection in layerwise fallback (#23)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 19:45:03 +08:00
Chen Hongtao
e49cfbfb44 Merge pull request #22 from kvcache-ai/fix/mistral-kt-loader-remap
fix(kt): harden expert remap and metadata fallback
2026-02-28 17:50:45 +08:00
chenht2022
529d06ac2b fix(kt): harden expert remap and metadata fallback 2026-02-28 05:54:52 +00:00
xwy-amd8
48b12817a0 Fix MiMo-V2-Flash KTransformers compatibility
- kt_ep_wrapper.py: normalize list-form moe_layer_freq to int
  MiMo-V2-Flash uses a per-layer mask [0,1,1,...] instead of an int freq
- mimo_v2_flash.py: use getattr for pad_token_id (not in MiMo config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 03:53:32 +00:00
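The normalization in the commit above can be sketched as follows; `normalize_moe_layer_freq` and the collapse rule are assumptions for illustration, not the wrapper's actual code:

```python
def normalize_moe_layer_freq(freq):
    """Hypothetical sketch: collapse a per-layer MoE mask into an int frequency."""
    if isinstance(freq, int):
        return freq
    # MiMo-V2-Flash ships a per-layer mask such as [0, 1, 1, 1]
    # (0 = dense layer, 1 = MoE layer). A stride-style int cannot express
    # an arbitrary mask, so collapse any mask that enables MoE anywhere to
    # frequency 1, leaving per-layer selection to the mask's consumer.
    return 1 if any(freq) else 0
```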
xwy-amd8
e6428614ab Fix: add DeepseekV3ForCausalLM to MLA detection list
_load_deepseek_v32_model() rewrites architectures from
DeepseekV32ForCausalLM to DeepseekV3ForCausalLM for transformers
compatibility, but the MLA detection list did not include
DeepseekV3ForCausalLM, causing use_mla_backend=False and
MHATokenToKVPool to be created instead of NSATokenToKVPool/MLATokenToKVPool.
2026-02-26 15:23:59 +00:00
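The pitfall in the commit above can be reconstructed as a membership check: once the loader rewrites the architecture name, the rewritten name must also appear in the detection list. The function and set below are illustrative, assumed from the commit message rather than taken from the source:

```python
# After the fix, the rewritten name DeepseekV3ForCausalLM is in the list too.
MLA_ARCHITECTURES = {"DeepseekV32ForCausalLM", "DeepseekV3ForCausalLM"}

def select_kv_pool(architecture):
    # _load_deepseek_v32_model() rewrites DeepseekV32ForCausalLM to
    # DeepseekV3ForCausalLM for transformers compatibility, so MLA
    # detection must match the rewritten name or the model silently
    # falls back to the MHA KV pool.
    use_mla_backend = architecture in MLA_ARCHITECTURES
    return "MLATokenToKVPool" if use_mla_backend else "MHATokenToKVPool"
```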
xwy-amd8
2a1bafeb16 Revert "Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list"
This reverts commit 01202ee43d.
2026-02-26 14:10:22 +00:00
xwy-amd8
01202ee43d Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list
DeepseekV32ForCausalLM was missing from the model_arch guard in
_handle_model_specific_adjustments(), so is_deepseek_nsa() was never
reached for V3.2 models. This caused the NSA attention backend to not
be auto-selected, leading to q_rope TypeError with flashinfer or
incorrect behavior with other backends.

Upstream bug introduced in sgl-project/sglang#13687 (commit 618ca2380)
which refactored the flat is_deepseek_nsa() check into a nested block
under model_arch guard but only listed DeepseekV3ForCausalLM.
2026-02-26 14:03:55 +00:00
xwy-amd8
6a63993e9f Revert "Skip KT CPU-GPU coordination during CUDA graph capture"
This reverts commit 2ba1f0dea6.
2026-02-26 12:57:31 +00:00
xwy-amd8
2ba1f0dea6 Skip KT CPU-GPU coordination during CUDA graph capture
During CUDA graph capture (regular or PCG), torch.cuda.synchronize()
and CPU-GPU expert coordination are not allowed. Detect capture mode
via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(),
and delegate directly to the GPU method in those cases.

This enables running Qwen3.5 with --attention-backend triton without
--disable-cuda-graph, improving decode from ~11 tok/s to ~65 tok/s.
2026-02-26 08:34:30 +00:00
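The guard described in the (later reverted) commit above is a simple dispatch on capture mode. In this torch-free sketch the capture check is injected as a callable standing in for `torch.cuda.is_current_stream_capturing()` and `is_in_piecewise_cuda_graph()`; all names here are illustrative:

```python
def forward_experts(x, gpu_forward, cpu_gpu_forward, is_capturing):
    if is_capturing():
        # torch.cuda.synchronize() and CPU-GPU expert coordination are not
        # allowed while a CUDA graph is being captured, so delegate directly
        # to the GPU-only path in that case.
        return gpu_forward(x)
    return cpu_gpu_forward(x)
```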
xwy-amd8
4605b77c7f Fix: revert kt_ep_wrapper.py for kt-kernel 0.5.1 compat, fix rope_scaling property access 2026-02-26 07:57:30 +00:00
xwy-amd8
a2f4513154 Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN 2026-02-26 07:52:14 +00:00
Feng Su
d2b3c7fb14 [Tracing] update script for converting otel tracing data to perfetto format (#19396) 2026-02-26 14:03:22 +08:00
Yilong Zhao
de3d1e7669 [misc] use ORJSONResponse in http-server generate (#19191) 2026-02-25 21:26:25 -08:00
Alison Shao
0fd44ff342 Fix NSA CP positions mismatch in eagle NextN model (#19367) 2026-02-25 20:14:33 -08:00
Xinyu Zhang
119c91cb8b Skip signal handler registration when not on main thread (#18752) 2026-02-25 19:30:05 -08:00
Qi Yuhang
88ad3b894a [sgl-kernel][Feat][B200][2/N] Support MXFP8 Grouped GEMM in Blackwell (#14640)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 11:23:37 +08:00
Minglei Zhu
b3202fe6d0 [PCG] fix piecewise cuda graph for Qwen3.5 (#19220) 2026-02-26 11:16:52 +08:00
Alison Shao
a0a8f1473c [Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
2026-02-25 18:45:54 -08:00
sglang-bot
6e82183f5a [Disagg] Route disagg prefill results through process_batch_result (#19364) 2026-02-25 18:38:39 -08:00
Xiaoyu Zhang
914ed34757 update jit_kernel codeowners (#19385) 2026-02-26 10:36:34 +08:00
fzyzcjy
265eb56d44 Support multi-step alignment and pipeline integration in dump comparator (#19378) 2026-02-26 10:23:22 +08:00
Yuan Luo
4e843f1216 [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache (#19148)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
2026-02-26 10:23:10 +08:00
Michael
f230967e65 [AMD] Fix ROCm Docker builds, update apache-tvm-ffi (#19359) 2026-02-26 10:16:28 +08:00
fzyzcjy
f9a2f0398f Support token aligner planning and execution in dump comparator (#19377) 2026-02-26 10:04:33 +08:00
fzyzcjy
d34d5aca07 Support loading token aligner data in dump comparator (#19376) 2026-02-26 10:03:56 +08:00
fzyzcjy
e8dd14519d Add aligner entrypoint and bundle handler in dump comparator (#19375) 2026-02-26 10:03:22 +08:00
pansicheng
2ad475b4ed use flashinfer.sampling (#18696) 2026-02-26 10:02:38 +08:00
fzyzcjy
2739d7df62 Reorganize modules and pipeline in dump comparator (#19374) 2026-02-26 10:00:13 +08:00
fzyzcjy
508b8e3387 Handle warnings via sink for structured output and add pair in dump comparator (#19373) 2026-02-26 09:59:15 +08:00
fzyzcjy
46321ee70e Support dumping rid for correlation across passes in dump comparator (#19372) 2026-02-26 09:57:57 +08:00
Yuan Luo
7c9e8e2def [Re-land][jit kernel] Support per_token_group_quant_8bit jit kernel (#19140)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
2026-02-26 09:53:57 +08:00
Linyu Wu
beabaa8d37 [Kernel Slimming] Migrate marlin moe kernel to JIT (#19181)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 09:05:13 +08:00
Daniel Cámpora
350190487b Flashinfer MOE FP8 support for Mistral Large 3. (#15422)
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-02-25 15:00:37 -08:00
Liangsheng Yin
c60dcc40bb [Logging] Guard log_prefill_stats against idle batches in disagg prefill (#19361) 2026-02-25 13:31:52 -08:00
YAMY
08957c88ea [Logging] Fix prefill side logging in pd disagg (#19350) 2026-02-25 12:42:18 -08:00
Kangyan-Zhou
306c552639 Revert "Fix HybridAttnBackend forward for linear attention" (#19356) 2026-02-25 11:49:50 -08:00
HAI
a0f3361023 Update (#19351) 2026-02-25 11:37:07 -08:00
jacky.cheng
b2c46fc60b [AMD] Support Qwen3-Coder-Next on AMD platform (#18355)
Co-authored-by: yichiche@amd.com <jacky.cheng>
2026-02-25 11:06:22 -08:00
Alison Shao
cc1ca61c81 fix: add --cuda-graph-max-bs to DSV3 FA3 FP8 KV cache test (#19307) 2026-02-25 10:27:31 -08:00