7855 Commits

Author SHA1 Message Date
lviy
944355c66f [Bugfix] Fix model output corruption caused by EPLB rebalance (Eager and CUDA Graph modes) (#18213)
Co-authored-by: FortPercent <49947620+FortPercent@users.noreply.github.com>
2026-03-17 18:30:24 -07:00
Liangsheng Yin
4d3976b6c5 [HiCache] Check in-flight async ops in is_fully_idle() before attach/detach (#20746) 2026-03-17 17:28:26 -07:00
Qiaolin Yu
c5d2528bff Revert "[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars." (#20797) 2026-03-17 17:28:09 -07:00
Shangming Cai
2acb20f53b [Disagg] Non-blocking try_ensure_parallel_info in pending queue, consolidate rank mapping into PrefillServerInfo (#20785)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-17 17:26:18 -07:00
Rain Jiang
cb1e63aba4 bump fa4 to official released fa4 pkg (#20303) 2026-03-17 17:22:56 -07:00
Jincong Chen
c77d7c629e [Bugfix] Fix MTP prefill cuda graph logging (#20279)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 16:36:52 -07:00
Kaixi
744b1c9e6f Added fallback to individual copy_ (#20683) 2026-03-17 14:44:38 -07:00
Kangyan-Zhou
3d8fc9a0ca Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792) 2026-03-17 11:59:02 -07:00
Артем Савкин
09f5097fe4 [NPU] [Bugfix] [diffusion] Fix NZ performance bug for diffusion models (#20684)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 21:23:09 +03:00
Shu Wang
d35fea1b2b [Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787) 2026-03-17 10:02:45 -07:00
Yongfei Xu
17031120b8 [DeepSeek v3.2][Bugfix] get_index_k_scale_buffer support cp (#18280) 2026-03-17 09:54:54 -07:00
Serge Panev
466ff20e51 [Model] Fix NemotronH OOM on unified-mem systems: stream weights + safetensors cleanup (#20580)
Signed-off-by: Serge Panev <spanev@nvidia.com>
2026-03-17 09:47:58 -07:00
Yuhao Yang
24a27d5320 vlm: support piecewise cuda graph for Kimi-K2.5 (#20747) 2026-03-18 00:32:07 +08:00
heziiop
b5f3eaecbc [NPU] Support dequant_swiglu_quant & moe_init_routing_v2 & npu_moe_token_unpermute for W8A8 MoE decode (#19913) 2026-03-17 21:39:29 +08:00
Mick
5717834f1f [diffusion] refactor: cleanup parallel_state.py (#20760)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 21:21:42 +08:00
Shangming Cai
17c81a3e07 Revert "[PD] Make pending reqs resolving more robust" (#20779) 2026-03-17 20:31:12 +08:00
YAMY
cfead25bbf [Qwen3.5] mamba slice fix (Prefill TP != Decode TP & decode TP size>1) (#20655)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-03-17 19:30:58 +08:00
AMD-yanfeiwang
966ae87d02 [AMD] avoid correction_bias_dtype dtype convert (#20692) 2026-03-17 02:55:05 -07:00
Liangsheng Yin
5270a06488 [Disagg] Fix health check false-positive in disagg is_fully_idle (#20756) 2026-03-17 17:18:54 +08:00
Duyi-Wang
385a35bd11 [AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars. (#20647) 2026-03-17 01:13:42 -07:00
Junhao Liu
ee106757df [diffusion] fix: fix Diffusers backend ignores model-specific sampling parameter (#20080)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-17 16:10:46 +08:00
akhilg-nv
9a697ceabb [Fix #20389] Illegal memory access in triton attention for large token counts (#20390) 2026-03-17 00:42:11 -07:00
Ratish P
e3277b3be2 [diffusion]: remove stale offload-manager in LTX2 AV denoising (#20624) 2026-03-17 15:14:00 +08:00
DefTruth
025691cd9e [diffusion] chore: bump up cache-dit & support quant for diffusers backend (#20361) 2026-03-17 12:51:31 +08:00
Rocky Song
079a1fd35e [Bugfix] Fix write-through events not processed when scheduler is idle (#20560) 2026-03-16 21:49:59 -07:00
Shangming Cai
5d5c31c6e4 [PP] Add CP pyobj broadcasting when enable dynamic CPP (#20738) 2026-03-17 12:20:11 +08:00
MMuzzammil1
855ec7017d Add check to provide hicache-storage-backend when enabling kv caching on Decode Side in PD Disaggregation (#20732)
Signed-off-by: Mohd Muzzammil <me.muzzammil@samsung.com>
2026-03-17 11:25:14 +08:00
Hubert Lu
943f34f642 Add NCCL/RCCL pre-warming to reduce P99 TTFT cold-start latency (#20477)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 20:23:14 -07:00
Jay Shaik
e4d06b3db2 Fix /generate JSON serialization for non-finite top_logprobs (#20714) 2026-03-16 20:07:12 -07:00
shuwenn
515b3a323d feat: support human-readable suffixes (25.6k, 1M, 1Mi) for token CLI (#20577) 2026-03-16 20:05:33 -07:00
psaab
9f56b471aa [Network] Use NetworkAddress for dist_init_method and loopback fallbacks (#20657)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-16 19:59:49 -07:00
Jason Yao
4dbec2dd2b [typo] Fix typos in comments and log messages in common.py (#20723) 2026-03-16 19:26:59 -07:00
Qiaolin Yu
7d87a6a071 Fix spec v1 token_ids_logprobs (#20718) 2026-03-16 19:23:28 -07:00
Mick
474a851ae3 [diffusion] fix: fix sampling params incorrectly override in cli (#20689) 2026-03-17 08:48:10 +08:00
Mick
1eea744855 [diffusion] CI: enable UT (#20690) 2026-03-17 07:44:04 +08:00
roikoren755
5ef5806160 [Nemotron] Small reasoning parser fix (#20284) 2026-03-16 13:29:40 -07:00
Bruce Wu
70a6fb53af Enable embedding lookup/lora_a logic for chunked backend (#17692)
Co-authored-by: Bruce Wu <mogicianwu@fb.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
2026-03-16 11:37:58 -07:00
Douglas Yang
061ec582bf fix: adding teacache.params back to sampling params as intended (#20665) 2026-03-16 11:27:06 -07:00
ybyang
289cbcf482 fix: support PP2+CP8+TP8 (PP with context parallelism) (#19548) 2026-03-16 16:51:47 +00:00
Xiaoyu Zhang
6489f77733 [Diffusion] Fix compile graph broken by flashinfer rope (#20699) 2026-03-16 23:14:27 +08:00
Du Bin
d3c0f4376a Fix AssertionError crash in disagg prefill inflight queue with PP (#20686) 2026-03-16 22:38:59 +08:00
Xiaoyu Zhang
15097c5c3b Release sglang kernel 0.4.0 (#20440)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-16 20:34:58 +08:00
sky
3d58cd16d9 [DP Attention] Optimize dp_padding_mode selection for dp_size=1 in extend mode (#20406)
Signed-off-by: wangfakang <fakangwang@gmail.com>
2026-03-16 18:44:42 +08:00
Xun Sun
549fbcc864 [5/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible (#12068)
Co-authored-by: Hank Han <hanhan.hank@bytedance.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>
2026-03-16 18:40:58 +08:00
Xiaoyu Zhang
3055b6906d [Diffusion] Document torch.compile graph-break checks in diffusion benchmark skills (#20681) 2026-03-16 17:41:40 +08:00
Mick
485597e651 [diffusion] fix: fix some sampling args passed via cli are omitted (#20630) 2026-03-16 16:55:30 +08:00
Sugar920
895e56097c Add NPU basic function testcases (#19382)
Co-authored-by: cy <chenyang08056032@163.com>
Co-authored-by: Cherry_ming <136634645@qq.com>
2026-03-16 15:09:56 +08:00
shuwenn
42f18fe560 [HiCache] fix: release write-through lock_ref during decode (#20049) 2026-03-16 14:49:31 +08:00
Ke Bao
39336f5812 Precompute swa cache location (#20449) 2026-03-16 14:38:08 +08:00
Zheng Wengang
135af6dc92 [EPD][VLM] support video/audio input (#17824)
Co-authored-by: siyu <liusy58@linux.alibaba.com>
2026-03-16 14:18:21 +08:00