Xiaoyu Zhang
|
c1fe5de69c
|
[Diffusion] Clean up diffusion Triton kernels and modernize custom op registration (#21122)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-03-22 22:38:57 +08:00 |
|
Xiaoyu Zhang
|
766d225fcc
|
Add SGLang CUDA crash API logging inspired by FlashInfer (#20910)
|
2026-03-22 16:39:40 +08:00 |
|
Shunkangz
|
bb737d7a82
|
Support Qwen3 MoE context parallel (#18233)
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
|
2026-03-22 01:27:20 -07:00 |
|
kpham-sgl
|
6d160b42bb
|
[Spec][Ngram] 1/N: Reference based Speculative Decoding refactor (#20393)
|
2026-03-22 00:55:10 -07:00 |
|
Xiaoyu Zhang
|
1b65c0d259
|
[Diffusion] Fix torch.compile RMSNorm fallback for Z-Image (#20962)
Co-authored-by: Mick <mickjagger19@icloud.com>
|
2026-03-22 15:38:22 +08:00 |
|
Bowen Li
|
3bc595acbc
|
[FlashAttn] Add fused triton kernel for normal_decode_set_metadata (#20778)
Co-authored-by: kinza99 <dh18324568312@163.com>
|
2026-03-22 15:12:29 +08:00 |
|
Mick
|
f7fc2c8592
|
[diffusion] fix: fix accuracy for some image models (#20679)
|
2026-03-22 15:11:57 +08:00 |
|
shuwenn
|
2fba2bdad1
|
refactor: Remove dead code from utils/common.py (#20668)
|
2026-03-21 21:54:17 -07:00 |
|
Lianmin Zheng
|
76e4a8662c
|
Replace clamp_position with JIT kernel + platform dispatch (#20999)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-03-21 21:26:26 -07:00 |
|
Changyi Yang
|
c1794e2944
|
[diffusion] fix: fix Sana corrupted output by removing spurious QK norm layers (#20656)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
|
2026-03-22 12:06:49 +08:00 |
|
Yuhao Yang
|
c32e35a2a5
|
[diffusion] CI: fix picklingerror for diffusion models using diffusers backend (#20854)
|
2026-03-22 11:51:03 +08:00 |
|
Mick
|
6dfa8a40bc
|
[diffusion] CI: make auxiliary coverage explicit and simplify testcases (#20983)
|
2026-03-21 20:18:23 +08:00 |
|
KnightLTC
|
a0862f00c2
|
dbrx instruct npu support (#17121)
Co-authored-by: McZyWu <zhuoyun.wu.23@ucl.ac.uk>
|
2026-03-21 17:10:35 +08:00 |
|
Alison Shao
|
852e112ebf
|
[Qwen3.5] Fix broken pipeline parallelism layer splitting (#21070)
Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
|
2026-03-21 01:02:51 -07:00 |
|
Lianmin Zheng
|
dba6fb3d30
|
Fix streaming logprobs corruption caused by shared mutable list reference (#21030)
|
2026-03-21 00:18:48 -07:00 |
|
kk
|
3f0ba021fc
|
[AMD] Improve openai/gpt-oss performance (#21020)
Co-authored-by: root <root@smci355-ccs-aus-m15-21.cs-aus.dcgpu>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
|
2026-03-20 23:16:47 -07:00 |
|
Baizhou Zhang
|
67cad3e69e
|
Revert "Support CuteDSL mm_fp4 backend" (#21077)
|
2026-03-20 22:47:47 -07:00 |
|
Xiaoyu Zhang
|
c076968c52
|
[CI] Remove obsolete AOT-only jit-kernel benchmarks after sgl-kernel 4.0 (#21075)
|
2026-03-21 13:40:42 +08:00 |
|
Baizhou Zhang
|
5f3393c04c
|
Fix deepseek-v32-fp4 b200 ci (#21072)
|
2026-03-20 22:28:40 -07:00 |
|
Alison Shao
|
048d90e165
|
Revert "[AMD] Add MoE weights and scales padding" (#21067)
|
2026-03-20 20:26:17 -07:00 |
|
shuwenn
|
6c91590e1b
|
[HiCache] refactor: hicache normalization flow and compatibility checks (#19669)
|
2026-03-20 18:38:44 -07:00 |
|
mqhc2020
|
9419453713
|
[AMD] Add MoE weights and scales padding (#18684)
|
2026-03-20 14:55:09 -07:00 |
|
YC Yen-Ching Tseng
|
f97c09dac1
|
[AMD] Enable aiter unified attention for non-SWA models (Qwen3-VL) (#20897)
Co-authored-by: wunhuang <wunhuang@amd.com>
|
2026-03-20 12:07:41 -07:00 |
|
fzyzcjy
|
146700db68
|
Add e2e demo test in dump comparator (#21031)
|
2026-03-20 22:41:01 +08:00 |
|
fzyzcjy
|
6703cc4484
|
Enhance output formatting in dump comparator (#21029)
|
2026-03-20 22:04:50 +08:00 |
|
fzyzcjy
|
fdbcb8156e
|
Refactor dp_utils to use ParallelAxis enum in dump comparator (#21028)
|
2026-03-20 22:04:20 +08:00 |
|
fzyzcjy
|
154395ab7d
|
Support s≡t dimension name equivalence in dump comparator (#21027)
|
2026-03-20 22:03:34 +08:00 |
|
fzyzcjy
|
cc22601d28
|
Validate replicated axes orthogonality in dump comparator (#21026)
|
2026-03-20 22:02:40 +08:00 |
|
fzyzcjy
|
2f01950a0e
|
Support jointly-determined axes inference in dump comparator (#21025)
|
2026-03-20 22:01:26 +08:00 |
|
fzyzcjy
|
ecd7e40d20
|
Support dependent axis auto-resolution in dump comparator (#21024)
|
2026-03-20 21:56:39 +08:00 |
|
Lianmin Zheng
|
104b10f70a
|
refactor: consolidate is_in_ci (jit_kernel, sgl-kernel benchmarks, tests) (#21009)
|
2026-03-20 05:55:36 -07:00 |
|
Артем Савкин
|
9fbe6800aa
|
[NPU] [Diffusion] Update CI performance baseline for Wan2.2-T2V-A14B-Diffusers-w8a8 (#20997)
|
2026-03-20 15:54:12 +03:00 |
|
xingsy97
|
f41832795e
|
Add compile-time 256-bit vector guard for pre-Blackwell (#19794)
|
2026-03-20 18:25:12 +08:00 |
|
DarkSharpness
|
2dd9196079
|
[JIT Kernel][Feature] Support JIT custom all reduce (rewrite as v2) (#19880)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
|
2026-03-20 18:24:07 +08:00 |
|
Muqi Li
|
2099943a49
|
Fix scale_step_k computation in the fp8_kernel (#20819)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
|
2026-03-20 18:09:31 +08:00 |
|
Jia Guo
|
ec01ef9092
|
Fix torch.compile/dynamo crash with Qwen3 QK-norm in piecewise CUDA g… (#19818)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-03-20 18:05:09 +08:00 |
|
Prozac614
|
fa89d152c0
|
[diffusion] CI: fix hunyuan3d JIT cache (#20773)
Co-authored-by: daiweitao <dwti614707404@163.com>
|
2026-03-20 17:51:55 +08:00 |
|
Lianmin Zheng
|
a0a4dae67f
|
Revert "Fix DeepSeek V32 FP4 test" (#21003)
|
2026-03-20 02:19:28 -07:00 |
|
Lianmin Zheng
|
112b628227
|
Replace _resolve_future_token_ids with JIT kernel + platform dispatch (#20976)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-03-20 01:47:03 -07:00 |
|
Baizhou Zhang
|
c82d20d48e
|
Fix DeepSeek V32 FP4 test (#20984)
|
2026-03-20 01:04:32 -07:00 |
|
Yilong Zhao
|
26f709e97d
|
misc: make prefill-delayer compatible with multiple types of mem pool (#20979)
|
2026-03-20 00:05:53 -07:00 |
|
Yilong Zhao
|
95327458ee
|
misc: add BatchTokenizerReq hook into dp controller (#20981)
|
2026-03-19 23:59:53 -07:00 |
|
Lianmin Zheng
|
712a48c5d2
|
ci: move metrics scripts under scripts/ci/utils (#20986)
|
2026-03-19 23:47:57 -07:00 |
|
lviy
|
46a76af97b
|
[Bugifx] qwen3 rope parameter compatibility (#20931)
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2026-03-19 22:22:01 -07:00 |
|
Jia Guo
|
87549f8f0b
|
perf(mamba): use Triton conv1d for non-contiguous input to avoid .contiguous() copy (#20469)
|
2026-03-19 19:38:46 -07:00 |
|
Vedant V Jhaveri
|
db995fba47
|
perf(kimi_linear): replace einops rearrange with native torch ops in Kimi-Linear KDA path (#20396)
|
2026-03-20 10:38:12 +08:00 |
|
ehuaa
|
fa0d8f6629
|
perf: avoid unnecessary gpu-cpu sync in eagle_info (#20266)
Co-authored-by: root <qianhao@zhejianglab.org>
|
2026-03-19 19:37:29 -07:00 |
|
Mohammad Miadh Angkad
|
3d749c49ca
|
[JIT Kernel] Fix NVFP4 multi-arch compilation failure (#20874)
|
2026-03-20 10:30:04 +08:00 |
|
cs-cat
|
22e378af86
|
Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 (#20368)
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
|
2026-03-20 09:28:54 +08:00 |
|
Yuan Luo
|
d9794ef9f7
|
[Qwen3-Next] Fuse Qwen3-Next GDN's qkvz_proj and ba_proj (#19321)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2026-03-20 09:25:29 +08:00 |
|