7855 Commits

Author SHA1 Message Date
Lianmin Zheng
44e67c6835 Remove deprecated double sparsity feature (#23009) 2026-04-17 13:33:12 -07:00
andyluo7
9df6107dca [AMD] Enable DFLASH speculative decoding on ROCm (#22342)
Signed-off-by: Andy Luo <andyluo7@users.noreply.github.com>
Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>
2026-04-17 13:10:14 -07:00
shuwenn
90c76d665e [HiCache] fix: HiCacheFile component key suffixing (#22891)
Co-authored-by: Zhangheng <hzh0425@apache.org>
2026-04-17 13:06:28 -07:00
YC Yen-Ching Tseng
5d4e899477 [AMD] Fix AMD Multimodal Test - skip nvfp4 tests (#23045) 2026-04-17 09:02:39 -07:00
Jincong Chen
2bac219d0c [Perf] Precompute gemma_weight to avoid redundant add on every forward (#22673) 2026-04-17 23:37:41 +08:00
Xiaoyu Zhang
83c5119d01 [diffusion] CI: fix ModelOpt B200 CI artifact coverage (#22955) 2026-04-17 23:33:42 +08:00
Mick
5de89ea942 [diffusion] CI: fix auto-partition (#23076) 2026-04-17 22:37:24 +08:00
Opher Lieber
6e3bbef568 expose num_embeddings in VocabParallelEmbeddingWithLoRA (#22547) 2026-04-17 02:35:13 -07:00
Jonah Bernard
0d031335ed [Pipeline Parallelism][Bug] Fix scheduler hang in pipeline parallelism setup (#23006) 2026-04-17 14:50:47 +08:00
Duyi-Wang
8c190f6b91 [AMD] Add SGLANG_MORI_MOE_MAX_INPUT_TOKENS to truncate dispatch before MoE. (#22952) 2026-04-16 23:40:15 -07:00
RichardoMu
7390eddf28 feat(observability): add OpenTelemetry tracing for speculative decoding (#19545)
Co-authored-by: Mu Huai <tianbowen.tbw@antgroup.com>
2026-04-17 14:01:58 +08:00
narutolhy
5fa0c6a52e Allow piecewise CUDA graph with speculative decoding (#22128)
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 13:39:30 +08:00
Xiaoyu Zhang
91679d935d [codex] Update diffusion skills (#23028) 2026-04-17 13:29:26 +08:00
blzheng
0dcfae5553 [CPU] Add gemma4_rmsnorm_cpu kernel (#22842)
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-17 13:03:16 +08:00
YC Yen-Ching Tseng
f0f0148167 Revert "feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143)" (#23031) 2026-04-16 21:53:25 -07:00
Zhangheng
7d47f40a96 [UnifiedRadixTree]: Add HiCache hook interface for TreeComponent (#22924) 2026-04-17 12:09:41 +08:00
Byron Hsu
cf9845f8e3 [Bug Fix] Ensure prefill_info_table is populated before honoring disagg_prefill_dp_rank (#22990)
Co-authored-by: Byron Hsu <byron+per@periodiclabs.ai>
2026-04-17 11:10:31 +08:00
Jan Bernlöhr
04a53955b9 feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) 2026-04-16 20:08:19 -07:00
Yuhao Yang
a77abbe005 [VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport (#22662) 2026-04-17 10:38:36 +08:00
Yuxuan Zhang
16d11c2a10 Fix for the low-probability garbled output issue in the GLM-5 series models. (#22811) 2026-04-17 09:52:13 +08:00
Makcum888e
e353630b57 [Diffusion] [NPU] Fix multimodal gen CI (#22879) 2026-04-17 04:09:44 +03:00
Egor Filimonov
ba850d3a9d [Bugfix] [NPU] Fix check_env on Ascend for CANN 8.5 (#22888) 2026-04-17 04:05:20 +03:00
Mick
3d2d57c6cc [diffusion] refactor: extract LTX2 image encoding from denoising stage (#22976) 2026-04-17 08:35:15 +08:00
Daifeng Li
2cc52d8326 feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143) 2026-04-16 16:51:32 -07:00
pdasgup
f639425ff0 add check for none status code in FinishAbort (#22535)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-04-16 16:21:07 -07:00
Tarushii Goel
2211b4d9c6 [sgl] improve accuracy of additional page requirement during spec decode (#22406) 2026-04-16 15:50:51 -07:00
Liangsheng Yin
db7a751d48 refactor: extract FanOutCommunicator and use declarative spec table (#22967) 2026-04-16 15:37:19 -07:00
mqhc2020
52f0b86f5d [AMD] Qwen3.5 MXFP4 breaks after shared expert fusion is enabled (#22948)
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
2026-04-16 15:25:33 -07:00
Liangsheng Yin
c83ef4fdb6 use envs in server_args (#22994) 2026-04-16 15:01:33 -07:00
Xinyu Zhang
c0172aef6e [Ray] Bind scheduler actors to GPU-local NUMA node (#22989)
Co-authored-by: xyuzh <xyuzh@users.noreply.github.com>
2026-04-16 14:52:15 -07:00
Xinyu Zhang
d430034bde [Ray] Support multi-replica serving by making scheduler actor names unique (#22917) 2026-04-16 14:51:01 -07:00
Qiaolin Yu
a87806a65f [misc] refine outdated comments for chain-style multi-layer MTP (#22996) 2026-04-16 14:49:43 -07:00
ybyang
41258f874d [PD]feat(bench): add --fake-prefill flag for decode-only stress testing (#22973) 2026-04-16 13:57:55 -07:00
Yuhao Yang
9da998a882 [diffusion] feat: disaggregated diffusion (#21701) 2026-04-16 23:51:32 +08:00
Liangsheng Yin
62309f09db fix(loads): preserve include filtering after watching mode switch (#22959) 2026-04-16 03:04:53 -07:00
ybyang
03fef357a6 fix(loads): switch get_loads_communicator to watching mode (#22919) 2026-04-16 02:12:22 -07:00
ybyang
fbd6dc3565 fix: normalize tool message content for GLM5.1 chat template (#22595) 2026-04-16 16:48:38 +08:00
Aleksi Vesanto
aaa682346e [diffusion] model: Properly validate device for Mistral 3 attention (#22690) 2026-04-16 00:29:23 -07:00
Lianmin Zheng
35da90cb76 [misc] Configure logging before ServerArgs.__post_init__ (#22926) 2026-04-15 23:53:15 -07:00
yuefeng Wu
65bc839a5f [Fix] eagle/eagle3 speculative decoding conflicts with xgrammar in NPU (#20989)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-15 23:34:23 -07:00
Bi Xue
c43716a357 [sgl] provide an option to send control req to all dp ranks rank0 (#22758) 2026-04-16 14:24:26 +08:00
Byron Hsu
3600465e81 [Bug Fix] Remove follow_bootstrap_room fast path in PD disaggregation DP rank resolution (#22901) 2026-04-15 22:53:29 -07:00
LHXuuu
e7ad7c587a [EPD][VLM] Support Kimi VL EPD (#22490)
Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>
2026-04-16 12:40:02 +08:00
CYYYC0310
58c6b871b2 Remove compatibility restriction between Pipeline Parallelism and Mixed Chunked Prefill (#22920)
Co-authored-by: cyy <cy02433585@alibaba-inc.com>
2026-04-16 11:25:31 +08:00
Xinyuan Tong
34fef07a15 Upgrade transformers to 5.5.3 and refactor hf_transformers_utils into subpackage (#21569) 2026-04-15 20:03:44 -07:00
JINZ
14e122cdee [BugFix][RadixTree]:Fix stale eviction assertion in HiMambaRadixCache host eviction path (#22592)
Co-authored-by: Zhangheng <hzh0425@apache.org>
2026-04-16 10:49:30 +08:00
Yuhao Yang
b8794baa6d [Step3p5] Optimize allreduce in MoE layers (#22773) 2026-04-16 09:33:12 +08:00
Liangsheng Yin
a4cf2ea128 streaming session: spec v2 bonus accounting + comprehensive test matrix (#22651) 2026-04-15 17:12:41 -07:00
Xinyu Zhang
e8c6e5466c [Ray] Auto-create placement group in RayEngine when none is detected (#22898) 2026-04-15 15:17:52 -07:00
Qiaolin Yu
0b1b07db72 [misc] fix ray folder lint (#22905) 2026-04-15 15:08:18 -07:00