sushil Dubey
|
e26c73c4e9
|
[diffusion] platform: support Intel XPU (#17920)
Signed-off-by: sushil.dubey <sushil.dubey@intel.com>
Signed-off-by: Sushil Dubey <sushil.dubey@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
|
2026-04-11 15:09:02 +08:00 |
|
Yen-Ching Tseng
|
cf1436d6ae
|
[AMD] Diffusion - Enable rocm miopen tuning on vae (#22428)
|
2026-04-10 22:47:25 -07:00 |
|
Jacob0226
|
7e4e1dcd7a
|
[AMD] Fuse RMSNorm + FP8 per-token quant for GLM-4.7-FP8 (#21403)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-04-10 22:45:31 -07:00 |
|
Khoa Pham
|
aeeff58cd4
|
[Spec][Ngram] Clean up unused stateless batchMatch (#22487)
|
2026-04-10 21:52:56 -07:00 |
|
Khoa Pham
|
04bd8e1218
|
[Spec][Ngram] Return token counts in list_external_corpora API (#22471)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-10 21:50:02 -07:00 |
|
Zhangheng
|
f2af00d05a
|
[HiSparse-pd] Add device-buffer budget and fix logical pool admission in decode side (#22453)
|
2026-04-11 12:30:38 +08:00 |
|
Alex Nails
|
8eac618a8d
|
[tokenizer] lazy text accumulation + use deltas directly for streaming (#22548)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-10 21:26:04 -07:00 |
|
Liangsheng Yin
|
c7f93a2ce7
|
[metrics] Add PoolStats.update_scheduler_stats to deduplicate metrics assignment (#22559)
|
2026-04-10 21:04:18 -07:00 |
|
Bi Xue
|
d30b3efa84
|
[sgl] _ATTN_TP and _ATTN_CP use message queue for broadcast on CPU (#22205)
|
2026-04-10 20:52:49 -07:00 |
|
Xinyuan Tong
|
7c6db40540
|
Fix tool call constrained decoding and parsing for models with native formats (#21593)
|
2026-04-10 20:37:23 -07:00 |
|
Liangsheng Yin
|
c2821dfbe9
|
[mem] Introduce PoolStats dataclass; unify pool metrics and token_usage (#22554)
|
2026-04-10 20:35:50 -07:00 |
|
Yuhao Yang
|
16f306fd85
|
[VLM] GPU Image Preprocessing for Kimi-K2.5 (#22368)
|
2026-04-11 11:13:30 +08:00 |
|
Yilong Zhao
|
58f863956c
|
cuda graph: adjust capture time num-non-padded-tokens to align capture with replay (#22404)
|
2026-04-11 10:27:50 +08:00 |
|
Mick
|
0b4f5c9fcb
|
[diffusion] CI: improve readability and fix bug of early-return (#22507)
|
2026-04-11 10:08:44 +08:00 |
|
Alison Shao
|
75223c5404
|
[Diffusion][CI] Fix nunchaku unit test broken by #22365 (#22560)
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
|
2026-04-10 17:49:56 -07:00 |
|
Liangsheng Yin
|
b4a1d8fd71
|
[mem] Fix idle token_usage missing mamba_usage; add FIXME for naming (#22555)
|
2026-04-10 16:20:33 -07:00 |
|
Alex Nails
|
0af9166474
|
[tokenizer] improve non streaming request processing + some small fixes. (#20310)
|
2026-04-10 15:46:12 -07:00 |
|
ori
|
f7a1740101
|
[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (#22051)
Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>
|
2026-04-10 14:18:39 -07:00 |
|
Minglei Zhu
|
6af34b95b6
|
perf: precompute FA3 scheduler_metadata to eliminate per-layer prepare_varlen_num_blocks (#21104)
Co-authored-by: zminglei <zminglei@linkedin.com>
|
2026-04-10 13:57:54 -07:00 |
|
Zhongdongming Dai
|
4ace144fae
|
feat: update ModelExpress metadata API to SourceIdentity-based schema (#21222)
|
2026-04-10 13:45:05 -07:00 |
|
Cheng Wan
|
6d95602ea3
|
Reduce GPU memory for MoE parallel groups (#22515)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-04-10 13:23:23 -07:00 |
|
satyamk7054
|
059b287e25
|
Add offline auto-tuning for LoRA CSGMV kernel (#20391)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
|
2026-04-10 13:10:43 -07:00 |
|
Qiaolin Yu
|
d8831355a3
|
Fix multi_layer_eagle_worker_v2 draft extend selection, add chain style multi layer mtp test (#22340)
Co-authored-by: 0xNullPath <luyan@nvidia.com>
|
2026-04-10 12:44:52 -07:00 |
|
Trevor Morris
|
7dbd0dd9f0
|
MiniMax-M2.5 - Support dp attention, dp reduce scatter, FP4 all gather, AR fusion in prepare_attn (#20067)
|
2026-04-10 12:41:27 -07:00 |
|
KrishnanPrash
|
a937ec31be
|
fix: server crash when stop_token_ids contains null (#22175)
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
|
2026-04-10 11:42:23 -07:00 |
|
Jia Guo
|
5cb4ea1d4d
|
perf: enable inductor combo_kernels for horizontal fusion (#21977)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-04-11 01:01:14 +08:00 |
|
Tarushii Goel
|
2ba94136ce
|
[sgl] improve mamba_track_indices perf in specdec (#22380)
|
2026-04-11 00:39:53 +08:00 |
|
Bi Xue
|
f652135d52
|
[sgl] fix using symmetric memory issues for attention_tp (#22286)
|
2026-04-11 00:26:18 +08:00 |
|
Ratish P
|
8227187d47
|
[SKILL]: add component accuracy guidance to the diffusion add-model skill (#22460)
|
2026-04-10 23:08:31 +08:00 |
|
Ratish P
|
cf5ad12612
|
[diffusion][CI]: route multimodal component accuracy through run_suite (#21960)
|
2026-04-10 23:06:03 +08:00 |
|
kingkingleeljj
|
84194c25c1
|
[BugFix] fix the bug of minimax_m2.5 model that causes repeated outputs when using tp16 (#20967)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-04-10 22:21:19 +08:00 |
|
Xiaoyu Zhang
|
1ff51555f2
|
[Diffusion] modelopt diffusion fp8 support for flux1/flux2 and wan2.2 (#22365)
|
2026-04-10 20:56:57 +08:00 |
|
Yujun Dong
|
8ba9646044
|
Make GDN support non-continuous B/A Tensor input to fix the accuracy regression of Qwen3.5-27B (#22312)
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
|
2026-04-10 18:58:13 +08:00 |
|
Jincong Chen
|
0668a7f51a
|
[Perf] Remove two operations in gdn_backend extend verify path (#22444)
|
2026-04-10 17:53:57 +08:00 |
|
Shangming Cai
|
1c76f322df
|
[HiCache] Add CP support for HiCache (#20977)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
2026-04-10 17:52:51 +08:00 |
|
Cheng Wan
|
37107bee6f
|
[Observability] Add pending token count to prefill log and get_load (#22480)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-04-10 02:05:21 -07:00 |
|
Lee Nau
|
c554dc5c64
|
Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE (#21339)
|
2026-04-10 01:35:56 -07:00 |
|
Mick
|
7c6b5c095c
|
[diffusion] fix: fix flux2 i2i accuracy (#22423)
|
2026-04-10 16:16:51 +08:00 |
|
Liangsheng Yin
|
6cf7f210bf
|
Add page_size to admission token budget check (#22495)
|
2026-04-10 01:16:04 -07:00 |
|
Jacob0226
|
dd41764487
|
[AMD][HIP] NSA: bf16 passthrough from RMSNorm to eliminate FP8 dequantization (#22258)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-04-10 01:08:32 -07:00 |
|
jianan-gu
|
2ab141547d
|
[CPU] Add apply_routed_scaling_factor_on_output support for biased_grouped_topk fusion (#22413)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
|
2026-04-10 15:16:05 +08:00 |
|
Polisetty V R K Jyothendra Varma
|
599cce4d82
|
[Intel GPU] import flash_attn functions from sgl_kernel only (#22438)
|
2026-04-10 15:10:00 +08:00 |
|
xieminghe1
|
18f41ac427
|
[Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (#22316)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: xq25478 <xq25478@qq.com>
|
2026-04-10 14:56:05 +08:00 |
|
Tarushii Goel
|
0334d4b7e8
|
[sgl] Fix mamba tracking calculation in spec dec (#22239)
|
2026-04-10 14:46:16 +08:00 |
|
Ethan (Yusheng) Su
|
6d79c60995
|
[Lora] Lora kimi support (#22381)
|
2026-04-09 22:31:53 -07:00 |
|
Liangsheng Yin
|
722e25a621
|
Fix SWA eviction boundary and page-align chunked prefill (#22470)
|
2026-04-09 22:09:43 -07:00 |
|
Ke Bao
|
e77bfba24d
|
Fix NCCL AllGather hanging issue for Qwen3 Next MTP (#22458)
|
2026-04-10 11:40:54 +08:00 |
|
Kangyan-Zhou
|
89553ff82b
|
[Observability] Add Prometheus metrics endpoint for gRPC mode (#20801)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-09 20:04:54 -07:00 |
|
LHXuuu
|
42ffb168b3
|
[EPD][VLM] Support Kimi K25 EPD (#22269)
Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>
|
2026-04-10 10:58:42 +08:00 |
|
jacky.cheng
|
d283808457
|
[AMD] Replace triton rotary_emb with aiter rotary_emb for Wan2.2 denoise (#22422)
|
2026-04-09 18:21:02 -07:00 |
|