8078 Commits

Author SHA1 Message Date
Aurick Qiao
bfccc8e504 Allow configuring NIXL backend parameters from env (#24169) 2026-05-01 18:30:43 -07:00
Mick
193b977572 [diffusion] chore: clean scheduler (#24229) 2026-05-02 09:30:06 +08:00
Liangsheng Yin
cb8fbd53fc Reserve slot 0 as padding in all req pools (#24243) 2026-05-01 16:41:36 -07:00
Cheng Wan
b47fab6f5d [bugfix] Support MIXED forward mode in TBO splitter for DP attention (#24241) 2026-05-01 16:01:23 -07:00
Lucia Fang
05de73efd1 [core/model] Use explicit model arch for Llama4 attention backend auto-selection (#24232) 2026-05-01 15:49:30 -07:00
Liangsheng Yin
8a530468fd [Bug] Size mamba mappings from req pool, not mamba pool (#24244) 2026-05-01 15:45:20 -07:00
Yuxuan Zhang
79bc2505a5 [Bug Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5.1 (#23037) 2026-05-01 13:53:52 -07:00
Lucia Fang
b58fa60a1f [core/attention] Add SGLANG_FLASHINFER_USE_PAGED env to force paged wrapper (#24165) 2026-05-01 12:52:46 -07:00
Lianmin Zheng
ece8a1a788 Refactor device timer, clean up metrics collector, and add fwd occupancy metric (#24197) 2026-05-01 10:25:25 -07:00
JINZ
4a50cd781e [BugFix][HiMamba] Fix host-protected node deletion in HiMamba tombstone del (#23696)
Co-authored-by: diemchai <diemchai@tencent.com>
Co-authored-by: Zhangheng <hzh0425@apache.org>
2026-05-01 21:57:47 +08:00
ishandhanani
5b7ce417d0 [P/D disagg] - support decode side radix cache (#19746) 2026-05-01 21:55:34 +08:00
Cheng Wan
d48095ba53 Bypass torch.cuda.use_mem_pool generator-CM in SymmetricMemoryContext (#24190) 2026-05-01 01:25:49 -07:00
Lianmin Zheng
d9e8a4a7f8 [SWA] Ensure we use pre-computed SWA cache location during prefill (#24138)
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com>
Co-authored-by: Yinghai Lu <yinghai@meta.com>
2026-05-01 00:01:49 -07:00
Yanbin Jiang
8975479f87 [LoRA][MOE] Fix EP correctness in MoE LoRA slicing and virtual-experts kernels (#24171) 2026-04-30 22:42:10 -07:00
Mick
9d84268705 [diffusion] refactor: introduce component residency manager (#23771) 2026-05-01 11:10:41 +08:00
Cheng Wan
108bfd8b6a [MoE] Add Aiter MoE runner backend and purge aiter.fused_moe from quant methods (#23597)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:50:52 -07:00
Yilong Zhao
f67292539f spec: gate dp mlp sync with server args (#24177) 2026-04-30 16:29:41 -07:00
Polisetty V R K Jyothendra Varma
da7f890788 [Intel GPU] Integrate flash_mla_decode in Intel XPU attention backend (#23557)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-01 07:21:28 +08:00
shubham singhal
e35ac95cdc [Test] Add XPU device support to unit tests (#22236)
Co-authored-by: vshekhawat-hlab <vshekhawat@habana.ai>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-01 07:18:51 +08:00
Roopak Srivastava
9c5cad3914 Use device-agnostic helpers for Mamba tests and core ops (#20234)
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-01 07:14:53 +08:00
Kalyan Kumar
8a9e424faa Replace hardcoded CUDA device with get_device() for XPU support (#13599)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-01 07:13:46 +08:00
Lawrence Wu
f75a8b6220 fix: support HybridLinearAttnBackend in TboAttnBackend (#20114) 2026-04-30 15:40:13 -07:00
Hubert Lu
d57671527a Fix LFM2 ShortConv Mamba State Indexing (#23975) 2026-04-30 15:23:39 -07:00
Xinyuan Tong
989a16187d [Bench] Fix bench_serving missing reasoning_content stream chunks (#23954) 2026-04-30 15:00:27 -07:00
Erik Wijmans
c04b20dc88 Fix KeyError in prepare_lora_batch when lora_ids contains None (#21974) 2026-04-30 14:50:16 -04:00
ori
71e89e9003 [MUSA][19/N] Support qwen series models (#23654)
Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>
2026-04-30 11:26:47 -07:00
Zhonghua Deng
651af06a0b [Feature] Xiaomi MiMo-V2.5 day0 support (#23811)
Co-authored-by: 张袁 <zhangyuan36@xiaomi.com>
Co-authored-by: 刘安岐 <liuanqi6@xiaomi.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2026-05-01 00:02:26 +08:00
jianzhao-xu
aa74911448 [NPU] fix some npu error with OffloaderV2 (#19541)
Co-authored-by: Jianzhao Xu <xujianchao@huawei.com>
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-04-30 15:05:35 +03:00
Yaochen Han
577dbc4ab9 [4/N] Quantization Refactor: AWQ schemes and Kernel call and weight init split (#21126) 2026-04-30 14:51:01 +03:00
Qiaolin Yu
583929c0a1 fix the compatibility between --moe-dense-tp-size 1 and piecewise cuda graph (#23972) 2026-04-30 02:12:13 -07:00
Opher Lieber
99c0b62f1e allow requests with exactly context_len total tokens (#22546)
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
2026-04-30 01:12:06 -07:00
Ethan (Yusheng) Su
125f75db72 fix(lora): avoid CUDA graph-breaking scalar assignment in seg_indptr (#23738) 2026-04-30 01:11:45 -07:00
billishyahao
692979a8d9 [AMD] Support sdma path for moriep (#23929) 2026-04-29 23:57:00 -07:00
Shaojun Zhou
4f0b44c5c6 [fix] moss-vl: use Conv3dLayer and remove no-op flat_encoder_result (#23932) 2026-04-30 14:19:45 +08:00
kkyyxhll
936c9c2355 fix(qwen3_5): broadcast per-tensor scale in _make_packed_weight_loader for FP8 models (#23062) 2026-04-30 14:16:57 +08:00
Jay Thakur
bcb34da9f9 Add deterministic mode for XPU operations (#16793)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-30 13:39:06 +08:00
Opher Lieber
c8c1c9261d LoRA support for qwen3.5 and nemotron3 (#23594)
Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
2026-04-29 21:51:53 -07:00
Mick
0b1fbdba15 [diffusion] CI: change ground truth upload path and improve publish script (#24120) 2026-04-30 12:26:10 +08:00
Yuxuan Zhang
d040333c95 [Bug Fix] missing index/KV transfer for MTP layer in NSA disaggregation (#23539) 2026-04-30 11:55:45 +08:00
yudian0504
2d2be5d7b2 [PD][Bugfix] fix mamba cache capping (#22462)
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: yizhang2077 <1109276519@qq.com>
2026-04-30 10:57:55 +08:00
MingxuZh
62136073f9 pin the version of xgrammar to v0.1.32 (#24010)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-30 10:13:08 +08:00
heziiop
3553fd0322 [NPU] add split_qkv_tp_rmsnorm_rope ops for minimax2 & fix eagle3 hidden states capture in dp attn mode (#23190) 2026-04-30 08:51:22 +08:00
Lianmin Zheng
e60c60eff0 [SWA] Fix missing mamba_indices parameter in cpu copy interface (#24026) 2026-04-29 17:33:38 -07:00
Kangyan-Zhou
6575aea128 [CI] Fix black formatting on main (unblocks PR #21247 lint) (#24093)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:59:17 -07:00
Jimmy Shong
3d31ac2672 [Fix] FP8 Qwen3-Next quant error by removing fallback fused shards (#23973) 2026-04-29 17:33:47 -04:00
jsheng_Linkedin
850021378a [Score API] Hoist query placeholder scan and specialize PositionalEmbeds stacking (#23513)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:51:53 -07:00
Qiaolin Yu
79dbfe4505 Use spec v2 by default (#21062) 2026-04-29 13:40:42 -07:00
Zhongdongming Dai
7389743d85 feat: Support modelexpress p2p RDMA transfer (#23105) 2026-04-29 12:57:40 -07:00
jsheng_Linkedin
db84a8ebbb [Model] Qwen3ForPooledOutput: forward get_input_embeddings to inner model (#23434) 2026-04-29 12:25:06 -07:00
Chang Min Bark
3272af2f00 [Apple Silicon] [MLX] MLX decode partial overlap scheduling for generation (async eval) (#22416)
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: Alex Nails <alex.nails@radixark.ai>
2026-04-29 12:21:14 -07:00