Commit Graph

7855 Commits

Author SHA1 Message Date
strgrb
738ebfd330 KDA: fuse qkv conv and support stride for fused_sigmoid_gating_delta_rule_update_kernel (#19506) 2026-03-04 22:45:53 +08:00
YeChang Guo
6910c1b281 [Feature][NPU]: add runtime support for GPTQ-quantized MoE models (#16364)
Co-authored-by: GuoYechang <52730608+GuoYechang@users.noreply.github.com>
Co-authored-by: root <root@localhost.localdomain>
2026-03-04 16:02:19 +03:00
Shangming Cai
c2b66d320d [HiCache] Add an env var to control transfer engine reuse (#19867) 2026-03-04 20:36:32 +08:00
chenxu214
88cfa6c11d [NPU]Releasing redundant memory of w13_weight and nz when the ascend_fuseep feature is enabled (#19813) 2026-03-04 19:26:29 +08:00
sky
17119a697d Optimization: Reduce the number of D2H operations (#19424)
Signed-off-by: wangfakang <fakangwang@gmail.com>
2026-03-04 16:32:42 +08:00
Mohammad Miadh Angkad
09fa012ba7 Fix /health regression from early prebound socket listen (#19805) 2026-03-03 23:00:46 -08:00
Yuhao Yang
115f879958 Helios: Real Real-Time Long Video Generation Model (#19782) 2026-03-04 14:58:04 +08:00
qwe
562c3ff2d0 [Feature] implement the standard multi-layer MTP for step3p5 (#18564)
Co-authored-by: mei ran <meiran0528@gmail.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
2026-03-03 22:48:53 -08:00
DefTruth
e9b5706545 [diffusion] feat: support torch compile for diffusers backend (#19673) 2026-03-04 14:08:45 +08:00
Michael
c6850ac30c [AMD] Fix Qwen3-Coder-Next: Add missing k_scale/v_scale args to extend_attention_fwd in aiter_backend (#19736) 2026-03-03 22:01:08 -08:00
Jue Wang
5972f97f11 Remove naive rotary forward overriding. (#19263) 2026-03-03 21:50:40 -08:00
ybyang
ac1f07487a Fix triton alloc extend kernel (#19780) 2026-03-03 21:01:16 -08:00
Bi Xue
73bf2c5bdc [sgl]add pin_mem to remove cpu->gpu copy sync point (#19795)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-03 21:00:51 -08:00
sglang-bot
b7f7df7ee6 [NSA] Fix line-too-long lint in can_nsa_prefill_cp_round_robin_split (#19829) 2026-03-03 20:34:22 -08:00
Yuhao Yang
ca44aa25af Fix dp_attention crash when dp_size < tp_size in warmup dummy run (#19760) 2026-03-03 19:43:13 -08:00
Ruihang Li
da9dcbc906 [diffusion] fix: fix corrupted image editing outputs in Multi-GPU SP mode for FLUX.2-klein models (#19454) 2026-03-04 11:35:46 +08:00
Baidu-AIAK
6851613b93 [Bugfix] For cp: Fixed hang problem in prefix cache and kvcache support fp8 in-seq-split mode (#19656)
Co-authored-by: vincent <vincent@vincentdeMacBook-Pro.local>
2026-03-03 19:19:46 -08:00
Xiaoyu Zhang
4348976f80 [Diffusion] Refactor diffusion benchmark/profile skill to reuse diffusion-perf skill and clarify profiling trigger (#19783) 2026-03-04 10:54:42 +08:00
Yuan Luo
82e7139c06 [VLM] Support cos sin cache for Ernie4.5-VL (#19743)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-04 10:54:23 +08:00
Xiaoyu Zhang
115e9a1acd [Diffusion] Delete useless _ulysses_input_split func (#19786) 2026-03-04 10:45:11 +08:00
xieminghe1
ee5ccde0ad support fused_moe_triton and moe_sum_all_reduce kernel fusion[reduce … (#19672)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
2026-03-04 10:30:33 +08:00
Charles Chen
d22c6a3847 fix: Properly return abort error for streaming requests if the abort is triggered by scheduler (#19357) 2026-03-03 17:18:15 -08:00
Brayden Zhong
e2af840c3d Various SM120 improvements (#19721) 2026-03-03 16:46:13 -08:00
Hao Jin
a69b943356 [SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models (#18154)
Co-authored-by: Hao Jin <Hao Jin>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-03-04 08:39:46 +08:00
Hubert Lu
441045a7bf [AMD] Fix EAGLE3 speculative decoding with aiter attention backend (#19362) 2026-03-03 16:12:13 -08:00
Jiayi Yan
753da27535 [Bugfix] fix parse_lscpu_topology bug (#18520) 2026-03-03 15:15:36 -08:00
Cao E
069c7e4188 Fix CI failures (#19303) 2026-03-03 15:04:44 -08:00
Yi Zhong
b8c71f895e Add tuned triton==3.5.1 h200 tp2, tp4 for qwen 3 next (#15948)
Signed-off-by: vincentzed
2026-03-03 14:47:19 -08:00
Yi Zhong
0c760c4cd7 Add tuned triton==3.5.1 b200 tp2, tp4 for qwen 3 next (#15917)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
2026-03-03 14:47:05 -08:00
Jonah Bernard
fb37c0a400 [args] Add Expert Parallelism Argument To SRT Runner (#18492)
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
2026-03-03 14:16:35 -08:00
Praneth Paruchuri
f7897def96 [Feature] Improve weight loading log (#18651)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-03 14:16:13 -08:00
Brayden Zhong
9305f0e58d Support triton_kernels for GPT-OSS on SM120 (#19718)
Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>
2026-03-03 14:14:01 -08:00
Sam (Kesen Li)
5b2e2750b5 Enable XQA for SM90 and SM120 (#17115)
Co-authored-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2026-03-03 14:09:44 -08:00
Kangyan-Zhou
dc92f88a21 Enhance bench_multiturn.py with OpenAI API support and richer metrics (#19724)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:48:04 -08:00
Guy Stone
f749802402 [Score API][18132] return token usage in Score API response (#18381) 2026-03-03 13:45:35 -08:00
almaslof
b0f26698f5 feat(benchmark script): add similar to vllm --ready-check-timeout-sec parameter (#15466) 2026-03-03 13:44:38 -08:00
Karthik Koralla
85ab6a7f54 cli: Add lazy imports and fail-fast config validation (RFC #9853) (#19368)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-03 13:03:49 -08:00
xrwang8
cedb86a950 Feature:Reserve HTTP server port before model loading to immediately detect port conflicts instead of failing after several minutes of model loading. (#17754)
Signed-off-by: xrwang8 <xrwang8@gmail.com>
2026-03-03 11:59:18 -08:00
doujiang24
2e1b9e2547 Fix routed_dp_rank boundary validation (#19762)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-03-03 11:55:15 -08:00
yefei12
85f7a0aa30 feat: support Kimi K2.5 for Eagle3 (#19689)
Co-authored-by: chenyefei.cyf <chenyefei.cyf@U-9V5T77LW-2356.local>
Co-authored-by: GeLee-Q <865038696@qq.com>
Co-authored-by: Gao016 <yngao016@163.com>
Co-authored-by: sxl1993 <1218197792@qq.com>
2026-03-03 13:41:15 -05:00
xutizhou
c6377bbbca feat(gdn): add FlashInfer K-last SSM layout support for GDN prefill and decode for Hopper (#18361)
Co-authored-by: HongliMi <106042350+HongliMi@users.noreply.github.com>
Co-authored-by: xiaozhoupy <181108106+zhou9402@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Avery Yingyi Huang <averyh@nvidia.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
2026-03-03 20:30:48 +08:00
Jasonzhang517
d939e26585 [model gateway][0/N] router EPD support: add encoder grpc server backend support (#16552)
Co-authored-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Co-authored-by: Zongyao Chen <solar1s@163.com>
2026-03-03 19:38:15 +08:00
Shangming Cai
facde4c6d3 [PD] Enable all CP ranks for KVCache transfer (#19765)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-03 19:35:21 +08:00
shengzhaotian
365ca1edb5 [NPU] bugs fix: fix a condition bug when using speculative inference on Qwen3 and Qwen3 moe (#19532) 2026-03-03 17:59:25 +08:00
Muqi Li
666caaf9ce [Tool Call] Stream DeepSeek-V3.2 function call parameters in JSON format. (#16091)
Co-authored-by: Huixxi <uestc.hugo@gmail.com>
2026-03-03 01:46:29 -08:00
Shaun Kotek
4c95953b77 Fix/nemotron mtp quantized (#19433) 2026-03-03 01:07:46 -08:00
Charles Chen
af0d35b224 Fix: Reject requests with a duplicate request ID which can cause server crash/hang (#19035)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-03 00:33:25 -08:00
Muqi Li
6af0448cc9 [Bugfix] Catch errors when DeepSeek-V3.2 generates malformed JSON (#18174) 2026-03-03 00:10:07 -08:00
Liangsheng Yin
7a2d3df96f Apply default stream to priority 0 in scheduling. (#16438) 2026-03-03 00:05:27 -08:00
Zack Yu
07b8d763ef feat: Add FP8 KV cache support for Triton attention backend (#18882) 2026-03-02 23:38:34 -08:00