Commit Graph

7206 Commits

Author SHA1 Message Date
Zhang Yiyang (SII)
a3ed2e4d29 [diffusion][CI] Add CI for MOVA model inference (#20430)
Co-authored-by: Luo <139519292+0-693@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-24 21:28:16 +03:00
Yen-Ching Tseng
71f5ae3f9a [AMD] Fix AMD Nightly Test - Transformers 5.3.0 incompatibility and gemma2-27b kv issue (#21193)
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
2026-03-24 10:41:44 -07:00
Elizaveta Martirosian
9f4d8ac99f [Diffusion][NPU] Add support for Hunyuan3D (#20352)
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
2026-03-24 16:18:49 +03:00
shadowxz109
1b4933d45d [NPU][ModelSlim] adapt w2 quant layer for Minimax2.5 (#20905) 2026-03-24 20:57:18 +08:00
Aleksi Vesanto
eefb504f84 [diffusion] model: Fix FLUX.1 output correctness (#21041)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-24 15:17:33 +03:00
Mohammad Miadh Angkad
4fbb311234 [Fix][Eval] Keep --dataset-path scoped to longbench_v2 (#21156) 2026-03-24 02:25:11 -07:00
Thomas Wang
855d15adf6 [AMD] Tilelang sparse fwd for dsv32 mi355/mi300 (#19945) 2026-03-24 02:01:39 -07:00
Shunkangz
dac148167c Enable the qwen3 test (#21195)
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2026-03-23 23:40:59 -07:00
Xiaoyu Zhang
69f02e36e8 [diffusion] Fix torch.zeros typo in causal wan (#21250) 2026-03-24 14:39:16 +08:00
Xiaoyu Zhang
d9f97b2115 Refine diffusion skills and align JIT kernel docs with the new CI flow (#21283) 2026-03-24 14:38:36 +08:00
Cheng Wan
c01ee848b0 Revert "fix: use consistent time denominator for throughput metrics in bench_one_batch_server" (#21276) 2026-03-23 22:14:54 -07:00
Baidu-AIAK
6491728797 [Perf] Overlap NSA-CP key all-gather with query computation for DeepSeek-V3.2 (#20438)
Co-authored-by: Shurui Jia <18817781975@163.com>
Co-authored-by: Baidu-AIAK <baiduaiak~123>
2026-03-23 21:31:48 -07:00
Lianmin Zheng
260abe1fb1 Refactor JIT kernel CI to use run_suite.py registration system (#21239) 2026-03-23 21:17:27 -07:00
hzh0425
0986bed8e2 [HiCache][HybridModel]: Support mamba state offloading & HybridCacheController (#20457)
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
2026-03-23 20:02:50 -07:00
Ratish P
2b1d3c935e [diffusion] fix Z-Image SP sharding for portrait and padded resolutions (#21042)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-03-24 10:15:33 +08:00
Yuxuan Zhang
fcaad42b00 [Bug Fix] GLM-V / GLM-OCR: field detection for transformers 5.x and MTP omission fix (#21134) 2026-03-23 13:19:48 -07:00
Baizhou Zhang
ed316a26ef Fix CP in-seq-split method for DeepSeek V32 and update related tests (#21192) 2026-03-23 12:34:10 -07:00
Lianmin Zheng
27ac831a84 docs: improve CI and testing documentation (#21202)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 10:48:50 -07:00
jacky.cheng
b4d3fb001d [AMD] Add fused GemmaRMSNorm forward_hip to use aiter/vllm kernels for qwen3.5 (#21188) 2026-03-23 10:21:36 -07:00
Johnsonms
777edb6ef7 Fix(jit): support rmsnorm for hidden_size in {64, 128, 256} (#20661) 2026-03-23 23:17:44 +08:00
Yuan Luo
5bdc07d974 [Qwen3.5] Fuse split/reshape/cat ops in GDN projection with Triton kernel (#21019)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-23 23:17:01 +08:00
McZyWu
8662ba7db4 [NPU] bugfix for import sgl-kernel error (#21200) 2026-03-23 19:52:36 +08:00
strgrb
80d4a0753a fix fused_set_kv_buffer for rope with Ling-v2 (#20316) 2026-03-23 19:20:40 +08:00
McZyWu
4641e5a3d2 [NPU] enhance accuracy for model minimaxm2 from 16.5% to 95.5% (#17695) 2026-03-23 19:06:38 +08:00
XDaoHong
2d288ba8c9 [Bugfix] fix npu get kv_item_lens in PD separation when use ASCEND_US… (#15852)
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
2026-03-23 15:56:47 +08:00
kpham-sgl
59cb9a9da6 [Spec][Ngram] 3/N: Fix synchronization issues in Ngram.cpp (#21186)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 00:37:07 -07:00
Lianmin Zheng
814202704b ci: unify PR test suite naming (#21187) 2026-03-23 00:18:45 -07:00
yudian0504
3d312643b9 [BUGFIX] Fix CP residual size mismatch crash when tp_size == attn_cp_size (#21170) 2026-03-23 00:12:58 -07:00
Lianmin Zheng
7757a9ddd0 ci: remove IS_BLACKWELL env var; auto-detect Blackwell (#21118) 2026-03-22 23:44:48 -07:00
kpham-sgl
bc4aaab6a1 [Spec][Ngram] 2/N: Rename branch length to max trie depth (#21181)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 23:35:25 -07:00
Cheng Wan
d6b12c401c Revert "[bugfix] Fix PPMissingLayer AttributeError when Using PP" (#21189) 2026-03-22 23:28:36 -07:00
Zhiqiang Xie
13f4f010d8 HiSparse for Sparse Attention (#20343) 2026-03-22 23:09:31 -07:00
Lianmin Zheng
7050011dee Enable JIT clamp_position and resolve_future_token_ids on ROCm (#21116) 2026-03-22 22:33:54 -07:00
Yuhao Yang
32a85ef128 [diffusion] CI: auto-skip diffusion tests when required pipeline class is missing from diffusers (#21139) 2026-03-23 12:15:21 +08:00
Mohammad Miadh Angkad
d8a5b1dbaf [Bugfix] Work around FlashInfer unified transport issue on GB (#20039) 2026-03-22 21:10:25 -07:00
Xiaoyu Zhang
a94d67d44b [SKILL] fix(bench): Support model-specific DenoisingStage variants in… (#21137) 2026-03-23 12:08:00 +08:00
fanghao
2b47bd3a34 [Bug Fix] Fix non-streaming request abort failure when --enable-metrics is enabled (#20625) 2026-03-22 19:58:49 -07:00
yuumn
889e8489e9 [diffusion] model: support FireRed-Image-Edit (#20862)
Co-authored-by: yuumn <1010797597@qq.com>
2026-03-23 10:27:07 +08:00
Cishoon
999bad5aba Fix VRAM leak in overlap scheduling with structured output (#20640) (#20697) 2026-03-22 17:07:39 -07:00
Yilong Zhao
343998865a perf: pad max-num-requests in decode cuda graph for higher coverage (#20978) 2026-03-22 17:06:16 -07:00
Ziang Li
ce0541404f [FlashInfer v0.6.6][RL] Support fp8-last-n-bf16 RL for flashinfer_trtllm_routed moe backend (#20214) 2026-03-22 11:17:01 -07:00
Xiaoyu Zhang
c1fe5de69c [Diffusion] Clean up diffusion Triton kernels and modernize custom op registration (#21122)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-22 22:38:57 +08:00
Xiaoyu Zhang
766d225fcc Add SGLang CUDA crash API logging inspired by FlashInfer (#20910) 2026-03-22 16:39:40 +08:00
Shunkangz
bb737d7a82 Support Qwen3 MoE context parallel (#18233)
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2026-03-22 01:27:20 -07:00
kpham-sgl
6d160b42bb [Spec][Ngram] 1/N: Reference based Speculative Decoding refactor (#20393) 2026-03-22 00:55:10 -07:00
Xiaoyu Zhang
1b65c0d259 [Diffusion] Fix torch.compile RMSNorm fallback for Z-Image (#20962)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-22 15:38:22 +08:00
Bowen Li
3bc595acbc [FlashAttn] Add fused triton kernel for normal_decode_set_metadata (#20778)
Co-authored-by: kinza99 <dh18324568312@163.com>
2026-03-22 15:12:29 +08:00
Mick
f7fc2c8592 [diffusion] fix: fix accuracy for some image models (#20679) 2026-03-22 15:11:57 +08:00
shuwenn
2fba2bdad1 refactor: Remove dead code from utils/common.py (#20668) 2026-03-21 21:54:17 -07:00
Lianmin Zheng
76e4a8662c Replace clamp_position with JIT kernel + platform dispatch (#20999)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 21:26:26 -07:00