7855 Commits

Author SHA1 Message Date
Baizhou Zhang
776709efe8 [3/n] deepseek_v2.py Refactor: Migrate MLA forward method in deepseek_v2.py (#19122) 2026-02-27 13:37:29 -08:00
wufann
7e46aafebb [AMD] Enable cudagraph for aiter nsa backend and add aiter impl for nsa pr… (#18526) 2026-02-27 13:18:32 -08:00
Shu Wang
1b75d0d1a9 Fix BatchMLAPagedAttentionWrapper query/qo_inptr mismatch for EAGLE (#15601) 2026-02-27 11:35:45 -08:00
ishandhanani
6a1480ce45 Fix HiCacheNixl TypeError: mem_pool_host passed as file_path (#19517) 2026-02-27 10:59:32 -08:00
Mohammad Miadh Angkad
35ef38c61b Remove gpt-oss hybrid swa gate for trtllm_mha (#19079) 2026-02-27 10:30:00 -08:00
Michael
1b79934d34 [AMD] Fix AMD CI test of TestToolChoiceLfm2Moe (#19113)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>
2026-02-27 10:18:15 -08:00
R0CKSTAR
fe4bc8ebd5 [diffusion] fix: MulAdd 4D path (shift indexing) (#18673)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-02-28 01:52:57 +08:00
Makcum888e
b1249ac909 [Diffusion] [NPU] [CI] fix CI performance (#19486) 2026-02-27 18:23:02 +03:00
Yuan Luo
d2885a9094 [Qwen3-Next] Support gdn fused_rms_norm_gated (#19434)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-02-27 23:08:08 +08:00
joesun
ca5f2e2ed1 [diffusion] fix: Support default response_format=url in /v1/images/generations to avoid 400 errors when response_format is omitted (#19360)
Co-authored-by: Makcum888e <79456407+Makcum888e@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-27 19:47:38 +08:00
AMD-yanfeiwang
f69ca93d49 [AMD] remove redundancy H2D op in aiter attention backend (#19416)
Co-authored-by: root <root@mia1-p01-g28.mia.tensorwave.lan>
2026-02-27 02:15:55 -08:00
Yilong Zhao
07da4bed7b [cache] add conservative estimation (#19482) 2026-02-27 18:14:46 +08:00
fxmarty-amd
9496bbd7b1 [AMD] Use tilelang as default NSA attention backend dispatch on AMD Instinct (#18319) 2026-02-27 01:43:34 -08:00
ympcMark
43fade5f69 [4/N] (Elastic EP) Back up Expert Weights in DRAM (#17374)
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
2026-02-27 15:59:13 +08:00
KnightLTC
eef44ec916 [NPU]kimi k2 thinking bugfix (#19387)
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-02-27 15:36:38 +08:00
Cao E
4f0f6cd9d0 Add torch.compile support for qwen3-next on CPU (#12444) 2026-02-26 23:28:03 -08:00
KnightLTC
bc9190435b llama4 npu adapt (#17123)
Co-authored-by: cy <chenyang08056032@163.com>
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-02-27 14:52:37 +08:00
Ken J
f0c2089597 [vlm][internVL] Support processor and embedding inputs for InternVL (#19127) 2026-02-26 22:46:48 -08:00
triple-mu
8a56cc5836 [diffusion] model: Fix a performance bug in the Wan model about usp (#19340) 2026-02-27 14:36:23 +08:00
elvischenv
af3eccc9ab [Perf] Eliminate the slice op for Flashinfer trtllm_fp4_block_scale_moe (#15731)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2026-02-26 22:23:40 -08:00
wingedge777
d566816d83 fix qwen3_vl visual module loading (#19333) 2026-02-26 19:44:05 -08:00
Alison Shao
52c8a3632a Fix missing StandardCombineInput import in BF16 flashinfer_trtllm MoE (#19400)
Co-authored-by: Alison Shao <alisonshao@MacBook-Pro-D2W773R9CD.local>
2026-02-26 18:32:08 -08:00
Alison Shao
c2dce06d9f Fix parallel tool call test for speculative decoding variants (#19370)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: Alison Shao <alisonshao@MacBook-Pro-D2W773R9CD.local>
2026-02-26 18:31:20 -08:00
littleyellowbicycle
5b5c509480 [NPU][feature adapt]remote load weight feature adp npu (#17968)
Co-authored-by: littleYellowBicycle <liguo29@huawei.com>
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-02-27 10:10:36 +08:00
Yizhong Cao
e567215e44 [bugfix]fix fa4 decoding (#19388)
Co-authored-by: caoyizhong.cyz <caoyizhong.cyz@alibaba-inc.com>
2026-02-27 09:48:21 +08:00
Sam (Kesen Li)
5194eef88b [Fix] A followup fix for TRTLLM BF16 MoE (#15303) 2026-02-26 17:14:20 -08:00
billishyahao
f0d78f2e20 [AMD] Fix weight load shape mismatch for amd dsr1 0528 mxfp4 (#19425) 2026-02-26 17:11:59 -08:00
fzyzcjy
4add6ec0f6 Enhance displaying and debuggability in dump comparator (#19466) 2026-02-27 09:06:19 +08:00
fzyzcjy
23cbbd6d41 Integrate packed data context parallel in dump comparator (#19464) 2026-02-27 08:28:36 +08:00
Liangsheng Yin
1a32c0db4d [Minor] Rename misleading chunked to reusing in ReqToTokenPool.alloc() (#19465) 2026-02-26 16:24:03 -08:00
fzyzcjy
8293a914a6 Support token align with packed CP data in dump comparator (#19463) 2026-02-27 08:12:54 +08:00
fzyzcjy
695e93b91f Make reorderer support packed format with CP in dump comparator (#19462) 2026-02-27 08:12:18 +08:00
fzyzcjy
e3cdf6b1a3 Support CP packed format in unsharder in dump comparator (#19461) 2026-02-27 08:11:47 +08:00
fzyzcjy
f2d1b7c4ad Support sequence parallel and trivial dims in dump comparator (#19460) 2026-02-27 08:11:15 +08:00
fzyzcjy
eb2ada3804 Support non-packed format when aligning tokens in dump comparator (#19459) 2026-02-27 08:10:29 +08:00
fzyzcjy
e1e0cfd856 Use named tensors in dump comparator (#19458) 2026-02-27 08:09:55 +08:00
fzyzcjy
eb0e905fc3 Update token layout and cleanup printer in dump comparator (#19457) 2026-02-27 08:09:15 +08:00
fzyzcjy
8ac64e1487 Support unifying axis ordering in dump comparator (#19456) 2026-02-27 08:08:32 +08:00
fzyzcjy
425d333ee3 Support token dim in arbitrary location in dump comparator (#19455) 2026-02-27 08:07:38 +08:00
Thomas Wang
5172c37845 [AMD] Use fused GEMM with FP8 cast for FP8 prefill (#19422) 2026-02-26 15:14:52 -08:00
RoyWang
a1ef8e2cc0 [AMD] optimize Kimi K2.5 fused_moe_triton performance by tuning (#19228) 2026-02-26 11:50:13 -08:00
Shangming Cai
288300aafd [PD] Tiny code cleanup for prefill info registering (#19414)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-02-27 01:35:26 +08:00
zhangheng
e4b708d3e9 [Spec V2] Support specV2 for mamba hybrid attention (#18808)
Co-authored-by: Yi Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: Hanming Lu <hanming@x.ai>
2026-02-27 00:36:01 +08:00
DefTruth
78d6674c45 [diffusion] feat: support hybrid parallelism for diffusers backend (#19405) 2026-02-27 00:06:08 +08:00
Shangming Cai
e55e65535e [Bugfix] Add rids to the batch filtering for two batch overlap (#19418) 2026-02-26 06:57:25 -08:00
Shangming Cai
97f1fa5e6b [NPU] Fix disaggregation metadata buffer bootstrap_room_dtype for npu backend (#19423) 2026-02-26 21:10:50 +08:00
khalilzhk
86eb80007e [NPU] support Kimi-K2.5 on NPU (#19331) 2026-02-26 20:41:44 +08:00
AlfredYong
bdc1e46e5a [Qwen3.5] Qwen3.5-27B inference repeat bug fix (#19411) 2026-02-26 20:11:29 +08:00
Xiaoyu Zhang
74c8e7b215 refactor(jit_kernel): reduce duplication and separate test code (#19323) 2026-02-26 18:30:49 +08:00
Junhao Liu
a7152df2e3 [diffusion ] CLI: Fix typo in CLI usage doc string (#19316) 2026-02-26 13:24:14 +03:00