7954 Commits

Author SHA1 Message Date
Byron Hsu
ba4e9d2ac2 Apply should_use_dp_reduce_scatterv guard to remaining MoE models (follow-up to #23731) (#23732)
Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2026-04-25 20:36:16 -07:00
Byron Hsu
71029abd64 Fix Qwen3 MoE: also guard EP all-reduce with not use_reduce_scatter (follow-up to #23731) (#23734)
Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 20:35:52 -07:00
Byron Hsu
99b59b279c Fix Qwen3 MoE double-reduce when DP attention + EP + reduce_scatterv (#23729) (#23731)
Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
2026-04-25 15:28:28 -07:00
AlbeeSo
e0a4522370 [typo] fix typo in parallel_state (#23710) 2026-04-25 09:33:33 -07:00
Mick
03849496ad jit_kernel: tolerate FA3 kernels without out arg (#23717) 2026-04-25 23:42:33 +08:00
1874.
046c14a3ed [NPU] Support GGUF quantization for Ascend NPU (dense + MoE) (#17883)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-04-25 17:16:47 +03:00
gjsheu
e708ea6d94 [diffusion] fix: restore cache-dit support for LTX2 (#23235)
Co-authored-by: gengjinsong <gengjinsong@huawei.com>
2026-04-25 18:10:43 +08:00
Aleksi Vesanto
50ce2708ca [diffusion] fix: Fix FLUX.1/2 graph breaks (#23648) 2026-04-25 17:54:52 +08:00
kk
393252f514 [AMD] fused qk gemma norm kernels to reduce four kernels (#23575)
Co-authored-by: root <root@smci355-ccs-aus-g12-26.cs-aus.dcgpu>
2026-04-25 00:30:01 -07:00
Артем Савкин
bd523dd60d [NPU] [Bugfix] [Diffusion] Fixed gray images at the generation output (#23266)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-04-25 10:20:38 +03:00
Yujing
6175946db7 [Feature]Add MSProbe dump support in SGLang (#18349) 2026-04-25 10:12:50 +03:00
Yujun Dong
21835fb0af [HiCache] Prevent move_hybrid_indices from polluting radix-tree node host state (#23427)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-25 14:27:42 +08:00
DarkSharpness
82254bd9c5 [JIT Kernel] Reland JIT activation (#22094)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:00:28 -07:00
YC Yen-Ching Tseng
adc59325bc [AMD] Optimize MiniMax-M2.5 - enable fused Triton kernel for FP8 KV cache write in aiter decode path (#23620) 2026-04-24 22:23:49 -07:00
YC Yen-Ching Tseng
fb272d27db [AMD] Optimize MiniMax-M2.5 - use aiter biased_grouped_topk for sigmoid scoring in MoE routing (#23611) 2026-04-24 22:18:08 -07:00
Shenxiu Liu
8471c9ebe6 Skip torch.cuda.empty_cache() in weight update flush path (#22998) 2026-04-25 12:41:39 +08:00
Yuhao Yang
4a3fe2a091 model: support parakeet nemotron encoder (#23568)
Co-authored-by: trangdough <trangtdo22@gmail.com>
2026-04-25 11:00:23 +08:00
Jackey Hua
465abadd3c Add fused moe triton config for Qwen3.5-397B-A17B-FP8 (#23682) 2026-04-24 18:35:32 -07:00
Xinyi Song
76da28f6d6 [AMD][bugfix] add gate rocm >= 7.2 for bpreshuffle (#23671) 2026-04-24 13:26:16 -07:00
Jia Guo
587fd15bd2 perf: eliminate attention DtoD copy by passing pre-allocated output to FA (#21985)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-24 12:05:16 -07:00
Xinyuan Tong
6d03861476 support Hy3 preview (#23533)
Co-authored-by: pengmeng <pengmeng@tencent.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: chengvjiang <chengvjiang@tencent.com>
Co-authored-by: russellfeng <russellfeng@tencent.com>
2026-04-24 12:03:24 -07:00
Lianmin Zheng
6344b546c8 Deprecate --collect-tokens-histogram, auto-collect with --enable-metrics (#23595) 2026-04-24 12:00:16 -07:00
Mick
05696527ea [diffusion] feat: support LoRA for LTX2.3 (#23649) 2026-04-25 01:52:41 +08:00
Kang Yifei
baa0aa670f [HiCache & HybridModel] 3FS backend support DSA & mamba model (#23241)
Co-authored-by: 墨已 <kangyifei.kyf@alibaba-inc.com>
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-25 00:48:01 +08:00
Kangrui Du
92d262f710 [diffusion] RL: add per-step rollout options for SDE and trajectory capture (#23151) 2026-04-24 23:26:16 +08:00
Siju Samuel
bca3dd958a [Intel GPU] Enable pipeline parallelism on XPU (#23645) 2026-04-24 19:52:44 +08:00
Yuwei An
60bbb800db [Experimental] Breakable Piecewise Cuda Graph (#22218)
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 04:33:05 -07:00
Mick
b3b03369a5 [diffusion] fix: unify LTX-2.3 HQ codepath gates for all LTX-2.3 variants (#23624) 2026-04-24 17:44:08 +08:00
Shangming Cai
b8d883398d Revert "[Intel GPU] Enable pipeline parallelism on XPU" (#23641) 2026-04-24 17:36:35 +08:00
Hubert Lu
4cb0c4e1f3 [AMD] Fix memory access fault when --page-size > 1 with speculative decoding on AMD GPUs (#23596) 2026-04-23 23:56:36 -07:00
Mick
cd1fa7506a [diffusion] model: support LTX2.3 high quality pipeline (#23366) 2026-04-24 14:18:20 +08:00
Shaojun Zhou
59724e90a9 model: support Moss-VL (#23454) 2026-04-24 11:14:29 +08:00
Siju Samuel
bf98eb3ab7 [Intel GPU] Enable pipeline parallelism on XPU (#23472)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-04-24 10:41:51 +08:00
popsiclexu
b35213be11 [MUSA][16/N] Add MUSA backend support for layers and DeepSeek models (V2/V3/R1) (#22774)
Co-authored-by: popsiclexu <zhenxue.xu@mthreads.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-23 18:59:51 -07:00
R0CKSTAR
87e50f20f6 [Apple Silicon][MLX] Cache seq_lens-derived tensors in BatchedDecodeContext (#23470)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2026-04-23 18:12:26 -07:00
Mick
c0166355ae [diffusion] CI: minor refactor CI (#23576) 2026-04-24 08:48:31 +08:00
Cheng Wan
d9c72bdd2b Skip unselected experts in flashinfer_trtllm (#23493) 2026-04-23 17:30:19 -07:00
Cheng Wan
000a2525e1 Move expert_mask_gpu from FusedMoE layer to StandardDispatcher (#23585)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:17:27 -07:00
Lianmin Zheng
95d021b523 Pre-set SWA cache location in CudaGraphRunner (#23552) 2026-04-23 16:51:29 -07:00
Lianmin Zheng
bb962b0046 Fix MoE no_combine: skip router weight in down projection (#23545) 2026-04-23 16:47:58 -07:00
Sundara Raman Ramachandran
cf88fdcc9c Expose child process PIDs from Engine for health check support (#23320) 2026-04-23 16:44:49 -07:00
sglang-bot
f3b88e080a chore: bump flashinfer version to 0.6.8.post1 (#23281)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-23 15:23:03 -07:00
Byron Hsu
17210350fd [PD+DP] Allow PrefillDelayer in disaggregated-prefill mode (#23588) 2026-04-23 14:51:16 -07:00
Alex Nails
579bd0b152 [bug fix] has_fp8_weights_in_checkpoint: handle HF repo IDs, not just local paths (#23542)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:56:49 -07:00
WangHao-hw
80125febb1 [BUGFIX]Fix Ascend backend pre-allocated range in NPU Graph Mode. (#22778) 2026-04-24 01:23:35 +08:00
Jinghong Li
c6872fc8fb Fix: fallback to torch API when NVML memory query is not supported (#23426)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-04-23 19:26:04 +03:00
Jie Hao
86ed0680d7 feat: add OpenTelemetry tracing to DiffGenerator (#21254) 2026-04-23 09:25:23 -07:00
Arseniy Mironov
76e4c5a1f8 [Diffusion][NPU][Bugfix] Ascend_fa crashes when sequence parallelism is used. (#23572)
Co-authored-by: Napkin-AI <arseniy.mironov.dev@gmail.com>
2026-04-23 19:21:30 +03:00
Baichuan
54e21bb3a5 [fix] Fix dynamic chunking profiling crash on GLM-5 models (#23060)
Co-authored-by: liubaichuan <liubaichuan@infini-ai.com>
2026-04-23 19:30:57 +08:00
Xinyi Song
cd459af4e2 [AMD] Use bpreshuffle FP8 blockscale GEMM to replace ABScale GEMM (#23319)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-04-23 01:51:30 -07:00