fzyzcjy | fdbcb8156e | Refactor dp_utils to use ParallelAxis enum in dump comparator (#21028) | 2026-03-20 22:04:20 +08:00
fzyzcjy | 154395ab7d | Support s≡t dimension name equivalence in dump comparator (#21027) | 2026-03-20 22:03:34 +08:00
fzyzcjy | cc22601d28 | Validate replicated axes orthogonality in dump comparator (#21026) | 2026-03-20 22:02:40 +08:00
fzyzcjy | 2f01950a0e | Support jointly-determined axes inference in dump comparator (#21025) | 2026-03-20 22:01:26 +08:00
fzyzcjy | ecd7e40d20 | Support dependent axis auto-resolution in dump comparator (#21024) | 2026-03-20 21:56:39 +08:00
Lianmin Zheng | 104b10f70a | refactor: consolidate is_in_ci (jit_kernel, sgl-kernel benchmarks, tests) (#21009) | 2026-03-20 05:55:36 -07:00
Артем Савкин | 9fbe6800aa | [NPU] [Diffusion] Update CI performance baseline for Wan2.2-T2V-A14B-Diffusers-w8a8 (#20997) | 2026-03-20 15:54:12 +03:00
xingsy97 | f41832795e | Add compile-time 256-bit vector guard for pre-Blackwell (#19794) | 2026-03-20 18:25:12 +08:00
DarkSharpness | 2dd9196079 | [JIT Kernel][Feature] Support JIT custom all reduce (rewrite as v2) (#19880); Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> | 2026-03-20 18:24:07 +08:00
Muqi Li | 2099943a49 | Fix scale_step_k computation in the fp8_kernel (#20819); Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> | 2026-03-20 18:09:31 +08:00
Jia Guo | ec01ef9092 | Fix torch.compile/dynamo crash with Qwen3 QK-norm in piecewise CUDA g… (#19818); Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> | 2026-03-20 18:05:09 +08:00
Prozac614 | fa89d152c0 | [diffusion] CI: fix hunyuan3d JIT cache (#20773); Co-authored-by: daiweitao <dwti614707404@163.com> | 2026-03-20 17:51:55 +08:00
Lianmin Zheng | a0a4dae67f | Revert "Fix DeepSeek V32 FP4 test" (#21003) | 2026-03-20 02:19:28 -07:00
Lianmin Zheng | 112b628227 | Replace _resolve_future_token_ids with JIT kernel + platform dispatch (#20976); Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> | 2026-03-20 01:47:03 -07:00
Baizhou Zhang | c82d20d48e | Fix DeepSeek V32 FP4 test (#20984) | 2026-03-20 01:04:32 -07:00
Yilong Zhao | 26f709e97d | misc: make prefill-delayer compatible with multiple types of mem pool (#20979) | 2026-03-20 00:05:53 -07:00
Yilong Zhao | 95327458ee | misc: add BatchTokenizerReq hook into dp controller (#20981) | 2026-03-19 23:59:53 -07:00
Lianmin Zheng | 712a48c5d2 | ci: move metrics scripts under scripts/ci/utils (#20986) | 2026-03-19 23:47:57 -07:00
lviy | 46a76af97b | [Bugifx] qwen3 rope parameter compatibility (#20931); Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> | 2026-03-19 22:22:01 -07:00
Jia Guo | 87549f8f0b | perf(mamba): use Triton conv1d for non-contiguous input to avoid .contiguous() copy (#20469) | 2026-03-19 19:38:46 -07:00
Vedant V Jhaveri | db995fba47 | perf(kimi_linear): replace einops rearrange with native torch ops in Kimi-Linear KDA path (#20396) | 2026-03-20 10:38:12 +08:00
ehuaa | fa0d8f6629 | perf: avoid unnecessary gpu-cpu sync in eagle_info (#20266); Co-authored-by: root <qianhao@zhejianglab.org> | 2026-03-19 19:37:29 -07:00
Mohammad Miadh Angkad | 3d749c49ca | [JIT Kernel] Fix NVFP4 multi-arch compilation failure (#20874) | 2026-03-20 10:30:04 +08:00
cs-cat | 22e378af86 | Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 (#20368); Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com> | 2026-03-20 09:28:54 +08:00
Yuan Luo | d9794ef9f7 | [Qwen3-Next] Fuse Qwen3-Next GDN's qkvz_proj and ba_proj (#19321); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2026-03-20 09:25:29 +08:00
Baizhou Zhang | 42f4b7276c | Revert "feat(mm)(grpc): compute M-RoPE positions for preprocessed VL inputs" (#20956) | 2026-03-19 18:03:04 -07:00
Liangsheng Yin | 2b53e660de | Simplify streaming session logprob handling (#20955) | 2026-03-19 17:09:40 -07:00
Leon Gao | 63c38aba5e | Fix token leak with logprob_start_len=0 in streaming sessions (#20557) | 2026-03-19 15:37:27 -07:00
Brayden Zhong | b42b9f6e1a | Support CuteDSL mm_fp4 backend (#18801) | 2026-03-19 14:20:01 -07:00
Yuwei An | d8ece7fb22 | [Tiny Fix] Filter lru related warning with pcg (#20940); Signed-off-by: yuweia <ayw.sirius19@gmail.com> | 2026-03-19 13:20:49 -07:00
Lianmin Zheng | 0949b138af | Simplify server startup output (#20885); Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> | 2026-03-19 13:11:37 -07:00
Xinyuan Tong | a02cff7f2b | [Fix] Patch is_flash_attn_2_available for flash-attn-4 in VLM input format test (#20946) | 2026-03-19 13:00:51 -07:00
AlfredYong | c562e0d13b | [feat] Enhance Kimi-K2/K2.5 function call and reasoning detection (#19552); Co-authored-by: alfredyyang <alfredyyang@tencent.com>; Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> | 2026-03-19 12:57:57 -07:00
Mohammad Miadh Angkad | 29ced9c162 | [UX] Suppress noisy httpx/httpcore INFO logs (#20944) | 2026-03-19 10:58:41 -07:00
Xinyu Zhang | 319bb4974c | [Fix] RayEngine multi-node: co-locate rank0 scheduler with Engine and fix CUDA device setting (#20722) | 2026-03-19 10:27:16 -07:00
Cao E | 274581fb77 | Add support for more batch sizes in cpu_graph_runner (#13881) | 2026-03-19 09:50:56 -07:00
kk | c8f0122acf | Fix gpu-fault issue when run deepseek-r1 and enable dp (#20841); Co-authored-by: wunhuang <wunhuang@amd.com> | 2026-03-19 02:36:12 -07:00
khalilzhk | 574572b21b | [BugFix] bug fix for DeepSeek eagle3 in Attn-DP mode (#20492) | 2026-03-19 14:48:46 +08:00
Shangming Cai | fd05532da1 | Add logging for BootstrapServer for CI diagnosis (#20844); Signed-off-by: Shangming Cai <csmthu@gmail.com> | 2026-03-19 14:42:12 +08:00
blzheng | a98b456c70 | [CPU] Add frontend support for Gemma (#12590) | 2026-03-18 23:02:26 -07:00
jianan-gu | 8d4fcf2f7b | [CPU] Fix MoE layer support for DeepSeek-OCR models (#12555) | 2026-03-18 22:57:55 -07:00
Matti Varjokallio | 85fe8c6793 | [AMD] Use aiter_dsv3_router_gemm kernel if number of experts <= 256. (#18451) | 2026-03-18 22:40:48 -07:00
kk | 126cd5cfae | gpt-oss decode performance optimization (#20392); Co-authored-by: wunhuang <wunhuang@amd.com> | 2026-03-18 22:30:03 -07:00
blzheng | cd22aa27a9 | [CPU] Add FP8 Bmm support (#9744); Co-authored-by: Fan Yin <1106310035@qq.com> | 2026-03-18 22:19:48 -07:00
Zaili Wang | 2f4babe32b | [CPU] support LayerNorm with 3D shape (#15075); Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> | 2026-03-18 22:15:24 -07:00
blzheng | dc6aa26ce9 | [CPU] Add mrope kernel for Qwen3-vl (#12531); Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> | 2026-03-18 22:12:48 -07:00
Juan Muneton | 4052b53227 | fix scheduler for non-cuda devices and disable piecewise cuda graph f… (#19992); Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> | 2026-03-18 21:54:19 -07:00
Ling Zhang | f85455ab24 | [Bugfix] fix qwen3vl hang when --mm-enable-dp-encoder is enable (#20759) | 2026-03-18 21:51:39 -07:00
Ethan (Yusheng) Su | 7f6f1a3ab1 | [LoRA][II] Add fused MOE LoRA Triton kernel and tests (#19711) | 2026-03-18 19:58:14 -07:00
R0CKSTAR | 7553b7dcb0 | chore: extract diffusion_common in python/pyproject_other.toml (#20803); Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> | 2026-03-19 10:39:16 +08:00