ouqingliang
eb0898d217
[refactor]: rename KT fallback logs to layerwise prefill
2026-04-09 03:26:44 +00:00
ouqingliang
a53cb6d078
add kt-numa-nodes
2026-03-31 09:30:52 +00:00
ouqingliang
6b5135e1a1
fix speculative worker and moe layer stability
2026-03-31 02:37:57 +00:00
Jianwei Dong
f6adb4f473
Merge pull request #26 from kvcache-ai/fix/sglang-kt-self-referencing-extras
Fix/sglang kt self referencing extras
2026-03-04 16:54:25 +08:00
djw
a45b8d6976
[fix]: update self-referencing extras from sglang to sglang-kt
2026-03-04 08:53:14 +00:00
djw
6b8b5f4649
Use static versioning for sglang-kt, starting at 0.5.2
2026-03-04 06:41:40 +00:00
Jianwei Dong
7241d118b4
Merge pull request #25 from kvcache-ai/rename-to-sglang-kt
Rename PyPI package from sglang to sglang-kt
2026-03-04 14:33:22 +08:00
djw
480c3229d5
Rename PyPI package from sglang to sglang-kt
2026-03-04 06:31:51 +00:00
ouqingliang
b3356b6c46
fix(kt): synchronize INT4 double-buffer slot reuse in fallback prefill
2026-03-03 03:45:30 +00:00
ouqingliang
f1a12b9a93
fix(moe): harden marlin routing and int4 param resolution
2026-03-02 09:35:59 +00:00
xwy-amd8
8d06c338d4
fix(kt): fix Kimi K2.5 RAWINT4 CUDA graph capture crash
Three fixes for Kimi K2.5 RAWINT4 failing to start with CUDA graph:
1. fused_marlin_moe.py: Fix IndentationError from bad merge conflict
resolution — imports were left outside the `if _is_cuda:` block.
2. fused_marlin_moe.py: Add early return for E=0/M=0. When
kt-num-gpu-experts=0, GPU expert weights are empty tensors (E=0).
The marlin MoE kernel crashes on these empty inputs. Return zeros
so KT CPU experts can contribute the full result.
3. deepseek_v2.py: Skip the dual-stream path for the KT wrapper.
forward_normal_dual_stream uses alt_stream for shared-expert
parallelism, which conflicts with the KT wrapper's internal _cpu_stream
during CUDA graph capture.
Fixes #1866
2026-03-01 23:12:05 +08:00
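The E=0/M=0 early return in fix 2 above can be sketched in plain Python. The function name, arguments, and list-based tensors here are illustrative stand-ins for the real fused_marlin_moe entry point, not its actual signature:

```python
# Illustrative guard for fix 2: with kt-num-gpu-experts=0 the GPU expert
# weights are empty (E=0) and the real marlin MoE kernel crashes on empty
# inputs. Returning zeros lets the KT CPU experts supply the full result
# when the partial outputs are summed. All names here are hypothetical.

def fused_marlin_moe_sketch(hidden_states, expert_weights):
    M = len(hidden_states)   # number of tokens
    E = len(expert_weights)  # number of GPU-resident experts
    if E == 0 or M == 0:
        hidden_dim = len(hidden_states[0]) if M else 0
        # Zero contribution from the (empty) GPU expert set.
        return [[0.0] * hidden_dim for _ in range(M)]
    raise NotImplementedError("real marlin MoE kernel launch omitted")
```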
Chen Hongtao
a8b821aee8
fix(kt): robust quant detection in layerwise fallback (#23)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 19:45:03 +08:00
Chen Hongtao
e49cfbfb44
Merge pull request #22 from kvcache-ai/fix/mistral-kt-loader-remap
fix(kt): harden expert remap and metadata fallback
2026-02-28 17:50:45 +08:00
chenht2022
529d06ac2b
fix(kt): harden expert remap and metadata fallback
2026-02-28 05:54:52 +00:00
xwy-amd8
48b12817a0
Fix MiMo-V2-Flash KTransformers compatibility
- kt_ep_wrapper.py: normalize list-form moe_layer_freq to int
MiMo-V2-Flash uses a per-layer mask [0,1,1,...] instead of an int freq
- mimo_v2_flash.py: use getattr for pad_token_id (not in MiMo config)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 03:53:32 +00:00
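The moe_layer_freq normalization described above can be sketched as a small helper. The helper name is hypothetical (not the real kt_ep_wrapper.py API), and mapping any truthy per-layer mask to a frequency of 1 is an assumption about the common [0,1,1,...] case:

```python
# Illustrative normalization of moe_layer_freq: MiMo-V2-Flash supplies a
# per-layer 0/1 mask instead of the int frequency the wrapper expects.
# Collapsing any truthy mask to frequency 1 is an assumption for the
# common [0, 1, 1, ...] shape; the helper name is hypothetical.

def normalize_moe_layer_freq(freq):
    """Return an int layer frequency from either an int or a per-layer mask."""
    if isinstance(freq, int):
        return freq
    if isinstance(freq, (list, tuple)):
        # Mask of 0/1 flags marking MoE layers: any MoE layer -> freq 1.
        return 1 if any(freq) else 0
    raise TypeError(f"unsupported moe_layer_freq: {freq!r}")
```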
xwy-amd8
e6428614ab
Fix: add DeepseekV3ForCausalLM to MLA detection list
_load_deepseek_v32_model() rewrites architectures from
DeepseekV32ForCausalLM to DeepseekV3ForCausalLM for transformers
compatibility, but the MLA detection list did not include
DeepseekV3ForCausalLM, causing use_mla_backend=False and
MHATokenToKVPool to be created instead of NSATokenToKVPool/MLATokenToKVPool.
2026-02-26 15:23:59 +00:00
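The detection-list fix above boils down to a membership check over the model's architectures list. This is a minimal sketch; the set contents and function name are illustrative, not the actual sglang code:

```python
# Illustrative MLA detection: after _load_deepseek_v32_model() rewrites
# DeepseekV32ForCausalLM to DeepseekV3ForCausalLM, the rewritten name must
# also appear in the detection list, or use_mla_backend stays False and the
# wrong KV pool is created. Set contents here are assumed for illustration.

MLA_ARCHS = {
    "DeepseekV2ForCausalLM",
    "DeepseekV3ForCausalLM",  # the entry this fix adds
}

def use_mla_backend(architectures):
    """True when any listed architecture should use the MLA backend."""
    return any(arch in MLA_ARCHS for arch in architectures)
```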
xwy-amd8
2a1bafeb16
Revert "Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list"
This reverts commit 01202ee43d.
2026-02-26 14:10:22 +00:00
xwy-amd8
01202ee43d
Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list
DeepseekV32ForCausalLM was missing from the model_arch guard in
_handle_model_specific_adjustments(), so is_deepseek_nsa() was never
reached for V3.2 models. This caused the NSA attention backend to not
be auto-selected, leading to q_rope TypeError with flashinfer or
incorrect behavior with other backends.
Upstream bug introduced in sgl-project/sglang#13687 (commit 618ca2380 )
which refactored the flat is_deepseek_nsa() check into a nested block
under model_arch guard but only listed DeepseekV3ForCausalLM.
2026-02-26 14:03:55 +00:00
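The nested guard this commit widens (and which the revert above undoes) can be sketched as follows. The allowlist contents and names are illustrative, not the actual _handle_model_specific_adjustments() code:

```python
# Illustrative nested guard: the upstream refactor moved the flat
# is_deepseek_nsa() check under a model_arch allowlist, so any architecture
# missing from that list (here, DeepseekV32ForCausalLM before this commit)
# never reaches the NSA auto-selection at all. Names are hypothetical.

NSA_GUARD_ARCHS = {"DeepseekV3ForCausalLM", "DeepseekV32ForCausalLM"}

def maybe_select_nsa_backend(model_arch, is_deepseek_nsa):
    if model_arch in NSA_GUARD_ARCHS:  # outer guard added by the refactor
        if is_deepseek_nsa():          # only reached for listed archs
            return "nsa"
    return None
```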
xwy-amd8
6a63993e9f
Revert "Skip KT CPU-GPU coordination during CUDA graph capture"
This reverts commit 2ba1f0dea6.
2026-02-26 12:57:31 +00:00
xwy-amd8
2ba1f0dea6
Skip KT CPU-GPU coordination during CUDA graph capture
During CUDA graph capture (regular or PCG), torch.cuda.synchronize()
and CPU-GPU expert coordination are not allowed. Detect capture mode
via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(),
and delegate directly to the GPU method in those cases.
This enables running Qwen3.5 with --attention-backend triton without
--disable-cuda-graph, improving decode from ~11 tok/s to ~65 tok/s.
2026-02-26 08:34:30 +00:00
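The capture-mode bypass described above (later reverted, per the commit that follows it in this log) amounts to checking for active capture and delegating to the GPU-only path. In this sketch the capture predicates are injected as plain callables for illustration; the real code queries torch.cuda.is_current_stream_capturing() and is_in_piecewise_cuda_graph():

```python
# Illustrative capture-mode bypass: torch.cuda.synchronize() and CPU-GPU
# expert coordination are not allowed while a CUDA graph is being captured,
# so during capture the wrapper delegates straight to the GPU method.
# All names are hypothetical; predicates are injected for testability.

def forward_experts(x, gpu_forward, coordinated_forward,
                    is_capturing, in_piecewise_graph):
    if is_capturing() or in_piecewise_graph():
        # Capture in progress: skip CPU expert coordination entirely.
        return gpu_forward(x)
    return coordinated_forward(x)
```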
xwy-amd8
4605b77c7f
Fix: revert kt_ep_wrapper.py for kt-kernel 0.5.1 compat, fix rope_scaling property access
2026-02-26 07:57:30 +00:00
xwy-amd8
a2f4513154
Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN
2026-02-26 07:52:14 +00:00
Feng Su
d2b3c7fb14
[Tracing] update script for converting otel tracing data to perfetto format (#19396)
2026-02-26 14:03:22 +08:00
Yilong Zhao
de3d1e7669
[misc] use ORJSONResponse in http-server generate (#19191)
2026-02-25 21:26:25 -08:00
Alison Shao
0fd44ff342
Fix NSA CP positions mismatch in eagle NextN model (#19367)
2026-02-25 20:14:33 -08:00
Xinyu Zhang
119c91cb8b
Skip signal handler registration when not on main thread (#18752)
2026-02-25 19:30:05 -08:00
Qi Yuhang
88ad3b894a
[sgl-kernel][Feat][B200][2/N] Support MXFP8 Grouped GEMM in Blackwell (#14640)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 11:23:37 +08:00
Minglei Zhu
b3202fe6d0
[PCG] fix piecewise cuda graph for Qwen3.5 (#19220)
2026-02-26 11:16:52 +08:00
Alison Shao
a0a8f1473c
[Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
2026-02-25 18:45:54 -08:00
sglang-bot
6e82183f5a
[Disagg] Route disagg prefill results through process_batch_result (#19364)
2026-02-25 18:38:39 -08:00
Xiaoyu Zhang
914ed34757
update jit_kernel codeowners (#19385)
2026-02-26 10:36:34 +08:00
fzyzcjy
265eb56d44
Support multi-step alignment and pipeline integration in dump comparator (#19378)
2026-02-26 10:23:22 +08:00
Yuan Luo
4e843f1216
[DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache (#19148)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
2026-02-26 10:23:10 +08:00
Michael
f230967e65
[AMD] Fix ROCm Docker builds, update apache-tvm-ffi (#19359)
2026-02-26 10:16:28 +08:00
fzyzcjy
f9a2f0398f
Support token aligner planning and execution in dump comparator (#19377)
2026-02-26 10:04:33 +08:00
fzyzcjy
d34d5aca07
Support loading token aligner data in dump comparator (#19376)
2026-02-26 10:03:56 +08:00
fzyzcjy
e8dd14519d
Add aligner entrypoint and bundle handler in dump comparator (#19375)
2026-02-26 10:03:22 +08:00
pansicheng
2ad475b4ed
use flashinfer.sampling (#18696)
2026-02-26 10:02:38 +08:00
fzyzcjy
2739d7df62
Reorganize modules and pipeline in dump comparator (#19374)
2026-02-26 10:00:13 +08:00
fzyzcjy
508b8e3387
Handle warnings via sink for structured output and add pair in dump comparator (#19373)
2026-02-26 09:59:15 +08:00
fzyzcjy
46321ee70e
Support dumping rid for correlation across passes in dump comparator (#19372)
2026-02-26 09:57:57 +08:00
Yuan Luo
7c9e8e2def
[Re-land][jit kernel] Support per_token_group_quant_8bit jit kernel (#19140)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
2026-02-26 09:53:57 +08:00
Linyu Wu
beabaa8d37
[Kernel Slimming] Migrate marlin moe kernel to JIT (#19181)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 09:05:13 +08:00
Daniel Cámpora
350190487b
Flashinfer MOE FP8 support for Mistral Large 3. (#15422)
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-02-25 15:00:37 -08:00
Liangsheng Yin
c60dcc40bb
[Logging] Guard log_prefill_stats against idle batches in disagg prefill (#19361)
2026-02-25 13:31:52 -08:00
YAMY
08957c88ea
[Logging] Fix prefill side logging in pd disagg (#19350)
2026-02-25 12:42:18 -08:00
Kangyan-Zhou
306c552639
Revert "Fix HybridAttnBackend forward for linear attention" (#19356)
2026-02-25 11:49:50 -08:00
HAI
a0f3361023
Update (#19351)
2026-02-25 11:37:07 -08:00
jacky.cheng
b2c46fc60b
[AMD] Support Qwen3-Coder-Next on AMD platform (#18355)
Co-authored-by: jacky.cheng <yichiche@amd.com>
2026-02-25 11:06:22 -08:00
Alison Shao
cc1ca61c81
fix: add --cuda-graph-max-bs to DSV3 FA3 FP8 KV cache test (#19307)
2026-02-25 10:27:31 -08:00