6551 Commits

Author SHA1 Message Date
Minglei Zhu
b3202fe6d0 [PCG] fix piecewise cuda graph for Qwen3.5 (#19220) 2026-02-26 11:16:52 +08:00
Alison Shao
a0a8f1473c [Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
2026-02-25 18:45:54 -08:00
sglang-bot
6e82183f5a [Disagg] Route disagg prefill results through process_batch_result (#19364) 2026-02-25 18:38:39 -08:00
fzyzcjy
265eb56d44 Support multi-step alignment and pipeline integration in dump comparator (#19378) 2026-02-26 10:23:22 +08:00
Yuan Luo
4e843f1216 [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache (#19148)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
2026-02-26 10:23:10 +08:00
fzyzcjy
f9a2f0398f Support token aligner planning and execution in dump comparator (#19377) 2026-02-26 10:04:33 +08:00
fzyzcjy
d34d5aca07 Support loading token aligner data in dump comparator (#19376) 2026-02-26 10:03:56 +08:00
fzyzcjy
e8dd14519d Add aligner entrypoint and bundle handler in dump comparator (#19375) 2026-02-26 10:03:22 +08:00
pansicheng
2ad475b4ed use flashinfer.sampling (#18696) 2026-02-26 10:02:38 +08:00
fzyzcjy
2739d7df62 Reorganize modules and pipeline in dump comparator (#19374) 2026-02-26 10:00:13 +08:00
fzyzcjy
508b8e3387 Handle warnings via sink for structured output and add pair in dump comparator (#19373) 2026-02-26 09:59:15 +08:00
fzyzcjy
46321ee70e Support dumping rid for correlation across passes in dump comparator (#19372) 2026-02-26 09:57:57 +08:00
Yuan Luo
7c9e8e2def [Re-land][jit kernel] Support per_token_group_quant_8bit jit kernel (#19140)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
2026-02-26 09:53:57 +08:00
Linyu Wu
beabaa8d37 [Kernel Slimming] Migrate marlin moe kernel to JIT (#19181)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 09:05:13 +08:00
Daniel Cámpora
350190487b Flashinfer MOE FP8 support for Mistral Large 3. (#15422)
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-02-25 15:00:37 -08:00
Liangsheng Yin
c60dcc40bb [Logging] Guard log_prefill_stats against idle batches in disagg prefill (#19361) 2026-02-25 13:31:52 -08:00
YAMY
08957c88ea [Logging] Fix prefill side logging in pd disagg (#19350) 2026-02-25 12:42:18 -08:00
Kangyan-Zhou
306c552639 Revert "Fix HybridAttnBackend forward for linear attention" (#19356) 2026-02-25 11:49:50 -08:00
jacky.cheng
b2c46fc60b [AMD] Support Qwen3-Coder-Next on AMD platform (#18355)
Co-authored-by: yichiche@amd.com <jacky.cheng>
2026-02-25 11:06:22 -08:00
Makcum888e
0217e82a08 [diffusion] Clean code (#19325) 2026-02-25 21:16:03 +03:00
Even Zhou
2fb239450e Revert "bugfix: prioritize init_npu_backend to fix various initialization bugs" (#19343) 2026-02-25 23:04:30 +08:00
Yuhao Yang
c7c4a1cbbd refactor linear attention backend (#18622)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2026-02-25 23:02:44 +08:00
Mick
471acd98b9 [diffusion] logging: improve logging (#19312) 2026-02-25 23:00:35 +08:00
Qingfu Wen
59b9d1e86d [diffusion] improve: improve fuse_scale_shift_kernel with non-blocking op (#18710)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-02-25 21:04:20 +08:00
akhilg-nv
c144e55462 Fix HybridAttnBackend forward for linear attention (#19006) 2026-02-25 21:02:37 +08:00
Zheng Li
d38c0e537d fix(dense): fix Qwen3.5 dense model precision bug in TP_SIZE>1 (#19070) 2026-02-25 20:54:42 +08:00
Even Zhou
cdc411160b [NPU] Fix a corner case where FusedMoE.top_k is not explicitly declared (#19287) 2026-02-25 20:49:59 +08:00
Mick
9840cd3f68 [diffusion] chore: enable sequence shard for wan by default (#19311) 2026-02-25 18:21:44 +08:00
billishyahao
60eeef7370 [AMD][with CI Fix] support two batch overlapping for mori ep (#19216)
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2026-02-25 02:14:08 -08:00
GMI Xiao Jin
c4ef33862b [diffusion] fix: fix bugs to let LTX-2 pipeline support latest Sglang Args pipelines (#19295) 2026-02-25 17:30:36 +08:00
Mohammad Miadh Angkad
671b595570 Fix trtllm_mha fp8 SWA KV index translation (#19107) 2026-02-25 17:02:17 +08:00
Julian Huang
a55f658835 [Misc] Normalize --host parameter to use plain hostname without scheme (#19309)
Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-02-25 00:37:24 -08:00
YAMY
f75abb4521 [Fix][Qwen3.5] Fix KV cache slice transfer for GQA models with replicated KV heads (#19086)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-02-25 16:26:44 +08:00
huangtingwei
d40cb2f725 [HiCache] Support heterogeneous tp for hicache storage (#18541)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-02-25 00:13:57 -08:00
Wang, Yi
3d879c69e9 refactor: extract device-to-backend mapping into get_default_distributed_backend (#19202)
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-02-24 23:50:26 -08:00
Hexq0210
d0bb140034 [NPU] bugfix for model Qwen3-Coder-Next at weight shape transpose for npu. (#18700)
Co-authored-by: McZyWu <zhuoyun.wu.23@ucl.ac.uk>
2026-02-25 15:46:20 +08:00
xutizhou
a1b39c1c26 Perf/fuse mamba state scatter mtp verify (#18088) 2026-02-25 15:40:55 +08:00
lw9527
4a3a787f1e [Fix] Kimi K2.5 support pp (#18434)
Co-authored-by: Ilya Boytsov <ilya.boytsov@aleph-alpha.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-02-25 15:22:11 +08:00
Shangming Cai
8d9ee6669e Fix comment for tp_rank calculation in dp_attention (#19306) 2026-02-25 15:19:10 +08:00
Hubert Lu
17b0affbdf [AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs (#13747)
Co-authored-by: yctseng0211 <yctseng@amd.com>
2026-02-24 23:11:55 -08:00
YAMY
73fe389dd1 [Qwen3.5] Raise Exception when radix_cache and extra_buffer are enabled at the same time (#19169) 2026-02-25 15:04:37 +08:00
Liangsheng Yin
76d5410e01 [Disagg] Fix decode querying unregistered dp_rank when prefill dp_size is 1 (#19305)
Co-authored-by: Yangmin Li <yangminl@nvidia.com>
2026-02-24 22:52:39 -08:00
Hubert Lu
8bd644765f [AMD] Enable ROCm kvcache JIT path and add AMD CI coverage. (#18992)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-25 14:15:05 +08:00
Michael
ca09d71cf0 Fix nightly grok failure on rotary embedding import (#19192)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
2026-02-25 13:25:16 +08:00
jacky.cheng
e138f7960a [AMD] Fix accuracy while using --enable-dp-attention (#19247)
Co-authored-by: yichiche@amd.com <jacky.cheng>
2026-02-24 20:50:28 -08:00
Liangsheng Yin
ab0f608788 [PD-Disagg] Fix bootstrap server race condition when prefill workers not yet registered (#19288)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-24 20:22:16 -08:00
Liangsheng Yin
539f772f54 [PD-Disagg] Fully support external DP dispatch w/ PD-disaggregation mode. (#19268)
Co-authored-by: Ratish P <114130421+ratish1@users.noreply.github.com>
2026-02-24 19:58:01 -08:00
Mick
241ee90164 [diffusion] chore: tiny fix pyproject.toml (#19256) 2026-02-25 11:57:53 +08:00
Shangming Cai
0fac2796b6 [PD-Disagg] Improve KVManager init across all backends (#19240)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-02-25 10:37:09 +08:00
siyu
c0fdfd4b92 Delete mm.feature after decode phase (#17324) 2026-02-24 18:13:03 -08:00