Commit Graph

6296 Commits

Author SHA1 Message Date
Ratish P
ddfe147377 [diffusion]: Improve layerwise offload buffer reuse and shared-storage handling (#18611) 2026-02-15 22:17:51 +08:00
Mick
3feb48139e [diffusion] quant: add support for svdquant and nunchaku (#18549)
Co-authored-by: AichenF <aichenf@nvidia.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
2026-02-15 20:43:00 +08:00
Michael
88010e9601 [AMD] Fix nightly 1-GPU test failures and bench_serving regression (#18761)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
2026-02-15 20:36:47 +08:00
fzyzcjy
4c7f986c6b Extract dumper and prefill delayer tests common utils (#18857) 2026-02-15 18:33:23 +08:00
haowen-han
b992828ad2 fix: fix bug on kimi2.5 with dp2 and tp4 (#18604)
Co-authored-by: hanhaowen <hanhaowen@baidu.com>
2026-02-15 16:32:13 +08:00
Ratish P
274bf6607a [diffusion] fix: enable torch.compile for UlyssesAttention (#18840) 2026-02-15 15:54:27 +08:00
zhangxiaolei123456
ad1bdb93df perf: add minimax-2.5 fused_moe tuning config for h20 (#18833) 2026-02-15 15:46:56 +08:00
jackey hua
922fbc21e2 [Perf] Tune MiniMax M2 fused moe kernel on H100 GPU (#18851) 2026-02-15 15:30:52 +08:00
andyluo7
944a9f6fcf Fix/qwen3 5 amd rope cutedsl fallback (#18753)
Co-authored-by: seungrokj <seungrok.jung@amd.com>
2026-02-14 22:09:44 -08:00
muse-coder
91230dcca8 [FIX] Correct JIT kernel compilation on newer GPUs with outdated driver metadata. (#18496)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-15 12:14:39 +08:00
Bhavneek Singh
1ce3420784 Model: Support IBM Granite (Dense/Mamba + MoE) (#18040) 2026-02-15 11:24:41 +08:00
Lianmin Zheng
b33769786f [Auto Sync] Update grpc_request_manager.py, tokenizer_manag... (20260214) (#18838)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-02-14 18:12:32 -08:00
Guangda Liu
190fa8246f Fix model loading for DeepSeek-V3.2-AWQ (#16907)
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com>
2026-02-15 09:39:53 +08:00
Lianmin Zheng
8b2020584c [Auto Sync] Update test_deterministic.py (20260214) (#18839)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jiayi Yuan <34369239+jy-yuan@users.noreply.github.com>
2026-02-14 17:19:30 -08:00
Xiaoyu Zhang
4067d9487d [diffusion] feat: opt vae decode with channels_last_3d (#18540) 2026-02-14 23:19:45 +08:00
Xiaoyu Zhang
c29394e3c8 [kernel slimming] Move fast_hadamard_transform to jit_kernel (#18475) 2026-02-14 23:06:21 +08:00
Kangyan-Zhou
ae95869292 Enable SGLANG_ENABLE_SPEC_V2 for nightly speculative decoding tests (#18719)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 23:00:33 +08:00
Raayan Dhar
92cdd398cd feat: Support mrope_section with rope_type: "yarn" (#13313)
Signed-off-by: Raayan Dhar raayan.dhar@gmail.com <raayan.dhar@gmail.com>
Signed-off-by: raayandhar <raayan.dhar@gmail.com>
2026-02-14 22:51:44 +08:00
Ke Bao
f51e9d9ca1 Add ci test for ring model (#18829) 2026-02-14 22:20:23 +08:00
ybyang
c8aa2a6534 Fix dsv32 encode_messages (#18126) 2026-02-14 16:44:13 +08:00
Johnsonms
34132d6da5 Kernel: optimize decoding metadata in NSA multi-spec backend with fused kernels (#17554) 2026-02-14 16:40:15 +08:00
Yuan Luo
fa0ef6e4f7 [VLM][LLM] Optimize fused_moe triton kernel tma (#18782)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-02-14 14:35:26 +08:00
JD
f6c18c3a85 Fix/partial gen from waiting queue miss metadata (#17610) 2026-02-13 19:04:08 -08:00
R0CKSTAR
45a4697d45 [diffusion][MUSA] fix: MUSA platform breakage caused by PR #13662 (#18456)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-02-14 11:00:39 +08:00
qmzznbxhl
066b0b70d9 Handle abort for retracted requests in disagg decode prealloc queue (#18705)
Co-authored-by: sunhailiang <sunhailiang@baidu.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2026-02-13 18:39:39 -08:00
shuwenn
bd39de7d5e [Env] centralize hicache vars in environ.py (#17204) 2026-02-13 18:02:31 -08:00
Liangsheng Yin
dcea74d63f Add timeout abort kits for normal / eagle. (#18815) 2026-02-13 17:57:30 -08:00
Liangsheng Yin
4474fb98b4 [PD-Disagg] Fix double free when prebuilt batch is aborted. (#18822) 2026-02-13 17:46:35 -08:00
Leon Gao
ab0fb248fd feat: add SGLANG_DISTRIBUTED_INIT_METHOD_OVERRIDE env var (#18743) 2026-02-14 09:37:33 +08:00
Minglei Zhu
8be18c655d [Perf] refactor piecewise cuda graph support of Qwen3-Next (#17613) 2026-02-14 09:30:50 +08:00
shuwenn
3299c4f9c1 [CI] feat: add early exit to wait_for_server when process dies (#18602) 2026-02-13 16:46:09 -08:00
Mohammad Miadh Angkad
1be41e9036 [FlashInfer] Bump FlashInfer version from 0.6.2 to 0.6.3 (#18448) 2026-02-14 07:43:33 +08:00
JD
191d354f53 fix double-free kv cache for requests that have already finished and been freed during preemption (#18694) 2026-02-13 13:17:44 -08:00
Lianmin Zheng
008ea46af1 [Auto Sync] Update loader.py, weight_utils.py (20260213) (#18779)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Xiuyu Li <xiuyu@x.ai>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2026-02-13 12:22:50 -08:00
Qi Jia
4c6afbeeaa [bugfix] fix mamba slot leak when scheduling fails with radix cache (#15840) (#16067)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2026-02-13 23:43:57 +08:00
dongjiyingdjy
8b4c364960 refactor context parallel state (#17213)
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2026-02-13 23:18:17 +08:00
Linyu Wu
0012d6a4eb [Kernel Slimming] Migrate GPTQ-Marlin repack kernel to JIT (#18543)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-13 22:29:22 +08:00
Mick
37273408eb [diffusion] chore: use batched P2P ops in VAE parallel decoding (#18728) 2026-02-13 22:11:20 +08:00
triple-mu
acc940d302 [diffusion] fix typo (#18790) 2026-02-13 21:59:39 +08:00
R0CKSTAR
07633349c9 [diffusion] fix: webui task_type check (#18462)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-02-13 21:19:16 +08:00
Mick
efdd676d56 [diffusion] refactor: merge redundant default_dtype and param_dtype parameters in FSDP loader (#18789) 2026-02-13 21:18:02 +08:00
Kaixi
98ad284ebf Added cuda availability guard (#18480) 2026-02-13 20:18:34 +08:00
Ke Bao
a0ebaa6498 Cleanup debug log for Ring model (#18793) 2026-02-13 18:36:20 +08:00
Ke Bao
eacab2868a Adjust mamba cache allocation (#18786) 2026-02-13 18:06:23 +08:00
Yinghai Lu
e4b2b57620 [schedule] Fix streaming return of customized_info (#18654) 2026-02-13 17:19:16 +08:00
Xinwei Qiang
356e338607 [diffusion] feat: support SparseVideoGen2 attention backend (#17507)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-02-13 16:20:46 +08:00
ant-yy
d97eb111a3 Support LingV2_5 model (#18598)
Co-authored-by: zhangkaihong.zkh <zhangkaihong.zkh@antgroup.com>
Co-authored-by: 有禾 <zhangdonghao.zdh@antgroup.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: 悠扬 <youyang.zmy@antgroup.com>
Co-authored-by: xinxingyang <xinxing.yangxx@antgroup.com>
Co-authored-by: zmy460290 <zmy460290@antgroup.com>
2026-02-13 16:09:15 +08:00
Xiaoyu Zhang
013a199bc6 [CI] Skip cutedsl gdn performance test in jit_kernel ci (#18783) 2026-02-13 15:49:30 +08:00
Shangming Cai
1f39bf6523 [Bugfix] Add warnings when NSA indexer cache indice mismatch in PD module (#18727) 2026-02-13 15:20:05 +08:00
Liangsheng Yin
e6f7a372ef Rename request timeout env vars for waiting/running stages (#18766) 2026-02-12 22:58:40 -08:00