7087 Commits

Author SHA1 Message Date
maocheng23
4e8829e4cd Replace topk_ids with curr_topk_ids in fused_moe.py (#20302) 2026-03-18 21:57:05 +00:00
Chad Voegele
a3196d08b8 [MiniMax M2] Fix KV cache scale loading (#20870)
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 14:54:43 -07:00
Xinyuan Tong
6b8a6545b2 Add Mistral Small 4 (Pixtral) support (#20708)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Alex Nails <alexnails@radixark.ai>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: dbari <dbari@users.noreply.github.com>
2026-03-18 14:15:32 -07:00
Trevor Morris
df1d046de2 Add packed_modules_mapping for MiniMax-M2 (#19995) 2026-03-18 14:10:01 -07:00
Xinyuan Tong
d1e95af282 Upgrade transformers==5.3.0 (#17784)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Alison Shao <alisonshao@mac.lan>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-18 13:50:43 -07:00
Bruce Wu
e5750a572c Support TP for lora lm_head layer (#18511)
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
2026-03-18 13:48:03 -07:00
ishandhanani
8f0f36c64b [1/2] Add ModelExpress coordination for remote instance weight loading - matching TP (#19920)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
2026-03-18 13:38:32 -07:00
Yaochen Han
c7a71740a5 [NPU][diffusion] npu support enable_torch_compile for torchair backend on diffusion models (#20687) 2026-03-18 22:40:35 +03:00
Vladislav Nosivskoy
b9dba851a0 Fix streaming token ids data loss under load (#19977)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
2026-03-18 12:23:45 -07:00
Gabriel Wu
70876ae93b fix: guard configure_deep_gemm_num_sms when JIT disabled (#20868)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:15:20 -07:00
Jackie
a6c7bb54eb [Perf] Optimize waiting queue update with set usage (#20503) 2026-03-18 09:56:24 -07:00
jianan-gu
21c4fc6334 [DP encoder] Fix pos_emb layer TP issue when DP encoder enabled for Qwen3 VL (#20788) 2026-03-18 17:14:47 +08:00
Thomas Wang
c0a4408f78 [AMD] Fix dpsk-v32 accuracy issue on mi355 (#20840) 2026-03-18 02:06:15 -07:00
billishyahao
f0d7a3f427 [AMD][TBO] Fix mori ep dual stream accuracy (#19888) 2026-03-18 02:00:55 -07:00
Shangming Cai
8b46f1f4ec [PD] Add retry interval in ensure_prefill_info (#20832)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-18 16:02:20 +08:00
Chuan (Richard) Li
93422f27d6 [AMD][AITER] Guard _use_mla_ps_kernel with self.use_mla in draft_extend_v2 paths (#20409) 2026-03-18 00:45:22 -07:00
R0CKSTAR
ead9d7aa43 [diffusion] fix: fix vae model offload on mps (#20607)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-18 15:44:59 +08:00
chenxu214
532470bcca [NPU] add new fusion operator DispatchFFNCombine (#20245) 2026-03-18 15:22:04 +08:00
jinke
ae15fca192 [Bugfix] fix hicache mooncake backend extra config loading (#16808)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: jinke15 <jinke15@jd.com>
2026-03-18 15:07:39 +08:00
xingsy97
d20e9a20fa [JIT] Inject target architecture flag into JIT compilation (#20103) 2026-03-17 23:16:49 -07:00
xingsy97
f78d5c3b3c [JIT Kernel] Add hadamard kernel test and benchmark (#20030) 2026-03-17 23:16:35 -07:00
Артем Савкин
c64681f162 [Bugfix] [diffusion] Fix cache-dit with sp-degree only (#19965)
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-18 14:05:12 +08:00
Kangyan-Zhou
b6055e59cd [HiCache] Reduce per-request backup log noise (#20813)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 22:47:14 -07:00
Viacheslav
30a35ecd90 Add gigachat3.1 parser (#19886)
Signed-off-by: Viacheslav Barinov <vvadbarinov@sberbank.ru>
Signed-off-by: Viacheslav Bv <viacheslav.teh@gmail.com>
Co-authored-by: Viacheslav Barinov <vvadbarinov@sberbank.ru>
2026-03-17 22:45:01 -07:00
Evgueni Petrov
2e860233ca rocm: fix oom when loading fp8 weights close to size of available vram (#19941) 2026-03-17 22:44:19 -07:00
shiyu7
0acc1d3c9a fix: change qwen 3.5 linear attention a_log to fp32 (#19961)
Co-authored-by: sunqi.7 <sunqi.7@bytedance.com>
2026-03-17 22:42:06 -07:00
Brayden Zhong
88c40ec16d Use Flashinfer for target_verify in GDN model for SM120 (#20604) 2026-03-17 22:40:56 -07:00
Brayden Zhong
97d5386a21 Use TRTLLM allreduce fusion for Qwen 3.5 (#19889) 2026-03-17 22:40:22 -07:00
Yuan Luo
9c87e137ee [GDN] Support GDN packed decode (#20627) 2026-03-18 13:20:07 +08:00
Kaixi Hou
4cc19862ef [NVIDIA] Integrate FlashInfer decode kernel (Blackwell) for Qwen3.5 (#19150)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 13:11:18 +08:00
hzh0425
c43d495dd5 [RadixTree][9/N Refactor]: Support unified init_load_back params (#20590) 2026-03-18 11:19:52 +08:00
Mick
f15b3338c9 Revert "[Bugfix] Fix GLM-4.6V vision regression in glm4v_moe and glm_ocr" (#20740) 2026-03-18 10:09:50 +08:00
lviy
944355c66f [Bugfix] Fix model output corruption caused by EPLB rebalance (Eager and CUDA Graph modes) (#18213)
Co-authored-by: FortPercent <49947620+FortPercent@users.noreply.github.com>
2026-03-17 18:30:24 -07:00
Liangsheng Yin
4d3976b6c5 [HiCache] Check in-flight async ops in is_fully_idle() before attach/detach (#20746) 2026-03-17 17:28:26 -07:00
Qiaolin Yu
c5d2528bff Revert "[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars." (#20797) 2026-03-17 17:28:09 -07:00
Shangming Cai
2acb20f53b [Disagg] Non-blocking try_ensure_parallel_info in pending queue, consolidate rank mapping into PrefillServerInfo (#20785)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-17 17:26:18 -07:00
Rain Jiang
cb1e63aba4 bump fa4 to official released fa4 pkg (#20303) 2026-03-17 17:22:56 -07:00
Jincong Chen
c77d7c629e [Bugfix] Fix MTP prefill cuda graph logging (#20279)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 16:36:52 -07:00
Kaixi
744b1c9e6f Added fallback to individual copy_ (#20683) 2026-03-17 14:44:38 -07:00
Kangyan-Zhou
3d8fc9a0ca Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792) 2026-03-17 11:59:02 -07:00
Артем Савкин
09f5097fe4 [NPU] [Bugfix] [diffusion] Fix NZ performance bug for diffusion models (#20684)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 21:23:09 +03:00
Shu Wang
d35fea1b2b [Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787) 2026-03-17 10:02:45 -07:00
Yongfei Xu
17031120b8 [DeepSeek v3.2][Bugfix] get_index_k_scale_buffer support cp (#18280) 2026-03-17 09:54:54 -07:00
Serge Panev
466ff20e51 [Model] Fix NemotronH OOM on unified-mem systems: stream weights + safetensors cleanup (#20580)
Signed-off-by: Serge Panev <spanev@nvidia.com>
2026-03-17 09:47:58 -07:00
Yuhao Yang
24a27d5320 vlm: support piecewise cuda graph for Kimi-K2.5 (#20747) 2026-03-18 00:32:07 +08:00
heziiop
b5f3eaecbc [NPU] Support dequant_swiglu_quant & moe_init_routing_v2 & npu_moe_token_unpermute for W8A8 MoE decode (#19913) 2026-03-17 21:39:29 +08:00
Mick
5717834f1f [diffusion] refactor: cleanup parallel_state.py (#20760)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 21:21:42 +08:00
Shangming Cai
17c81a3e07 Revert "[PD] Make pending reqs resolving more robust" (#20779) 2026-03-17 20:31:12 +08:00
YAMY
cfead25bbf [Qwen3.5] mamba slice fix (Prefill TP != Decode TP & decode TP size>1) (#20655)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-03-17 19:30:58 +08:00
AMD-yanfeiwang
966ae87d02 [AMD] avoid correction_bias_dtype dtype convert (#20692) 2026-03-17 02:55:05 -07:00