Commit Graph

7855 Commits

Author SHA1 Message Date
sushil Dubey
e26c73c4e9 [diffusion] platform: support Intel XPU (#17920)
Signed-off-by: sushil.dubey <sushil.dubey@intel.com>
Signed-off-by: Sushil Dubey <sushil.dubey@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-11 15:09:02 +08:00
Yen-Ching Tseng
cf1436d6ae [AMD] Diffusion - Enable rocm miopen tuning on vae (#22428) 2026-04-10 22:47:25 -07:00
Jacob0226
7e4e1dcd7a [AMD] Fuse RMSNorm + FP8 per-token quant for GLM-4.7-FP8 (#21403)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 22:45:31 -07:00
Khoa Pham
aeeff58cd4 [Spec][Ngram] Clean up unused stateless batchMatch (#22487) 2026-04-10 21:52:56 -07:00
Khoa Pham
04bd8e1218 [Spec][Ngram] Return token counts in list_external_corpora API (#22471)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:50:02 -07:00
Zhangheng
f2af00d05a [HiSparse-pd] Add device-buffer budget and fix logical pool admission in decode side (#22453) 2026-04-11 12:30:38 +08:00
Alex Nails
8eac618a8d [tokenizer] lazy text accumulation + use deltas directly for streaming (#22548)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 21:26:04 -07:00
Liangsheng Yin
c7f93a2ce7 [metrics] Add PoolStats.update_scheduler_stats to deduplicate metrics assignment (#22559) 2026-04-10 21:04:18 -07:00
Bi Xue
d30b3efa84 [sgl] _ATTN_TP and _ATTN_CP use message queue for broadcast on CPU (#22205) 2026-04-10 20:52:49 -07:00
Xinyuan Tong
7c6db40540 Fix tool call constrained decoding and parsing for models with native formats (#21593) 2026-04-10 20:37:23 -07:00
Liangsheng Yin
c2821dfbe9 [mem] Introduce PoolStats dataclass; unify pool metrics and token_usage (#22554) 2026-04-10 20:35:50 -07:00
Yuhao Yang
16f306fd85 [VLM] GPU Image Preprocessing for Kimi-K2.5 (#22368) 2026-04-11 11:13:30 +08:00
Yilong Zhao
58f863956c cuda graph: adjust capture time num-non-padded-tokens to align capture with replay (#22404) 2026-04-11 10:27:50 +08:00
Mick
0b4f5c9fcb [diffusion] CI: improve readability and fix bug of early-return (#22507) 2026-04-11 10:08:44 +08:00
Alison Shao
75223c5404 [Diffusion][CI] Fix nunchaku unit test broken by #22365 (#22560)
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
2026-04-10 17:49:56 -07:00
Liangsheng Yin
b4a1d8fd71 [mem] Fix idle token_usage missing mamba_usage; add FIXME for naming (#22555) 2026-04-10 16:20:33 -07:00
Alex Nails
0af9166474 [tokenizer] improve non streaming request processing + some small fixes. (#20310) 2026-04-10 15:46:12 -07:00
ori
f7a1740101 [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (#22051)
Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>
2026-04-10 14:18:39 -07:00
Minglei Zhu
6af34b95b6 perf: precompute FA3 scheduler_metadata to eliminate per-layer prepare_varlen_num_blocks (#21104)
Co-authored-by: zminglei <zminglei@linkedin.com>
2026-04-10 13:57:54 -07:00
Zhongdongming Dai
4ace144fae feat: update ModelExpress metadata API to SourceIdentity-based schema (#21222) 2026-04-10 13:45:05 -07:00
Cheng Wan
6d95602ea3 Reduce GPU memory for MoE parallel groups (#22515)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 13:23:23 -07:00
satyamk7054
059b287e25 Add offline auto-tuning for LoRA CSGMV kernel (#20391)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-04-10 13:10:43 -07:00
Qiaolin Yu
d8831355a3 Fix multi_layer_eagle_worker_v2 draft extend selection, add chain style multi layer mtp test (#22340)
Co-authored-by: 0xNullPath <luyan@nvidia.com>
2026-04-10 12:44:52 -07:00
Trevor Morris
7dbd0dd9f0 MiniMax-M2.5 - Support dp attention, dp reduce scatter, FP4 all gather, AR fusion in prepare_attn (#20067) 2026-04-10 12:41:27 -07:00
KrishnanPrash
a937ec31be fix: server crash when stop_token_ids contains null (#22175)
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
2026-04-10 11:42:23 -07:00
Jia Guo
5cb4ea1d4d perf: enable inductor combo_kernels for horizontal fusion (#21977)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 01:01:14 +08:00
Tarushii Goel
2ba94136ce [sgl] improve mamba_track_indices perf in specdec (#22380) 2026-04-11 00:39:53 +08:00
Bi Xue
f652135d52 [sgl] fix using symmetric memory issues for attention_tp (#22286) 2026-04-11 00:26:18 +08:00
Ratish P
8227187d47 [SKILL]: add component accuracy guidance to the diffusion add-model skill (#22460) 2026-04-10 23:08:31 +08:00
Ratish P
cf5ad12612 [diffusion][CI]: route multimodal component accuracy through run_suite (#21960) 2026-04-10 23:06:03 +08:00
kingkingleeljj
84194c25c1 [BugFix] fix the bug of minimax_m2.5 model that causes repeated outputs when using tp16 (#20967)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 22:21:19 +08:00
Xiaoyu Zhang
1ff51555f2 [Diffusion] modelopt diffusion fp8 support for flux1/flux2 and wan2.2 (#22365) 2026-04-10 20:56:57 +08:00
Yujun Dong
8ba9646044 Make GDN support non-continuous B/A Tensor input to fix the accuracy regression of Qwen3.5-27B (#22312)
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
2026-04-10 18:58:13 +08:00
Jincong Chen
0668a7f51a [Perf] Remove two operations in gdn_backend extend verify path (#22444) 2026-04-10 17:53:57 +08:00
Shangming Cai
1c76f322df [HiCache] Add CP support for HiCache (#20977)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-04-10 17:52:51 +08:00
Cheng Wan
37107bee6f [Observability] Add pending token count to prefill log and get_load (#22480)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 02:05:21 -07:00
Lee Nau
c554dc5c64 Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE (#21339) 2026-04-10 01:35:56 -07:00
Mick
7c6b5c095c [diffusion] fix: fix flux2 i2i accuracy (#22423) 2026-04-10 16:16:51 +08:00
Liangsheng Yin
6cf7f210bf Add page_size to admission token budget check (#22495) 2026-04-10 01:16:04 -07:00
Jacob0226
dd41764487 [AMD][HIP] NSA: bf16 passthrough from RMSNorm to eliminate FP8 dequantization (#22258)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:08:32 -07:00
jianan-gu
2ab141547d [CPU] Add apply_routed_scaling_factor_on_output support for biased_grouped_topk fusion (#22413)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-10 15:16:05 +08:00
Polisetty V R K Jyothendra Varma
599cce4d82 [Intel GPU] import flash_attn functions from sgl_kernel only (#22438) 2026-04-10 15:10:00 +08:00
xieminghe1
18f41ac427 [Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (#22316)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
Co-authored-by: xq25478 <xq25478@qq.com>
2026-04-10 14:56:05 +08:00
Tarushii Goel
0334d4b7e8 [sgl] Fix mamba tracking calculation in spec dec (#22239) 2026-04-10 14:46:16 +08:00
Ethan (Yusheng) Su
6d79c60995 [Lora] Lora kimi support (#22381) 2026-04-09 22:31:53 -07:00
Liangsheng Yin
722e25a621 Fix SWA eviction boundary and page-align chunked prefill (#22470) 2026-04-09 22:09:43 -07:00
Ke Bao
e77bfba24d Fix NCCL AllGather hanging issue for Qwen3 Next MTP (#22458) 2026-04-10 11:40:54 +08:00
Kangyan-Zhou
89553ff82b [Observability] Add Prometheus metrics endpoint for gRPC mode (#20801)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 20:04:54 -07:00
LHXuuu
42ffb168b3 [EPD][VLM] Support Kimi K25 EPD (#22269)
Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>
2026-04-10 10:58:42 +08:00
jacky.cheng
d283808457 [AMD] Replace triton rotary_emb with aiter rotary_emb for Wan2.2 denoise (#22422) 2026-04-09 18:21:02 -07:00