6750 Commits

Author SHA1 Message Date
Mohammad Miadh Angkad
09fa012ba7 Fix /health regression from early prebound socket listen (#19805) 2026-03-03 23:00:46 -08:00
Yuhao Yang
115f879958 Helios: Real Real-Time Long Video Generation Model (#19782) 2026-03-04 14:58:04 +08:00
qwe
562c3ff2d0 [Feature] implement the standard multi-layer MTP for step3p5 (#18564)
Co-authored-by: mei ran <meiran0528@gmail.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
2026-03-03 22:48:53 -08:00
DefTruth
e9b5706545 [diffusion] feat: support torch compile for diffusers backend (#19673) 2026-03-04 14:08:45 +08:00
Michael
c6850ac30c [AMD] Fix Qwen3-Coder-Next: Add missing k_scale/v_scale args to extend_attention_fwd in aiter_backend (#19736) 2026-03-03 22:01:08 -08:00
Jue Wang
5972f97f11 Remove naive rotary forward overriding. (#19263) 2026-03-03 21:50:40 -08:00
ybyang
ac1f07487a Fix triton alloc extend kernel (#19780) 2026-03-03 21:01:16 -08:00
Bi Xue
73bf2c5bdc [sgl]add pin_mem to remove cpu->gpu copy sync point (#19795)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-03 21:00:51 -08:00
sglang-bot
b7f7df7ee6 [NSA] Fix line-too-long lint in can_nsa_prefill_cp_round_robin_split (#19829) 2026-03-03 20:34:22 -08:00
Yuhao Yang
ca44aa25af Fix dp_attention crash when dp_size < tp_size in warmup dummy run (#19760) 2026-03-03 19:43:13 -08:00
Ruihang Li
da9dcbc906 [diffusion] fix: fix corrupted image editing outputs in Multi-GPU SP mode for FLUX.2-klein models (#19454) 2026-03-04 11:35:46 +08:00
Baidu-AIAK
6851613b93 [Bugfix] For cp: Fixed hang problem in prefix cache and kvcache support fp8 in-seq-split mode (#19656)
Co-authored-by: vincent <vincent@vincentdeMacBook-Pro.local>
2026-03-03 19:19:46 -08:00
Xiaoyu Zhang
4348976f80 [Diffusion] Refactor diffusion benchmark/profile skill to reuse diffusion-perf skill and clarify profiling trigger (#19783) 2026-03-04 10:54:42 +08:00
Yuan Luo
82e7139c06 [VLM] Support cos sin cache for Ernie4.5-VL (#19743)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-04 10:54:23 +08:00
Xiaoyu Zhang
115e9a1acd [Diffusion] Delete useless _ulysses_input_split func (#19786) 2026-03-04 10:45:11 +08:00
xieminghe1
ee5ccde0ad support fused_moe_triton and moe_sum_all_reduce kernel fusion[reduce … (#19672)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
2026-03-04 10:30:33 +08:00
Charles Chen
d22c6a3847 fix: Properly return abort error for streaming requests if the abort is triggered by scheduler (#19357) 2026-03-03 17:18:15 -08:00
Brayden Zhong
e2af840c3d Various SM120 improvements (#19721) 2026-03-03 16:46:13 -08:00
Hao Jin
a69b943356 [SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models (#18154)
Co-authored-by: Hao Jin <Hao Jin>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-03-04 08:39:46 +08:00
Hubert Lu
441045a7bf [AMD] Fix EAGLE3 speculative decoding with aiter attention backend (#19362) 2026-03-03 16:12:13 -08:00
Jiayi Yan
753da27535 [Bugfix] fix parse_lscpu_topology bug (#18520) 2026-03-03 15:15:36 -08:00
Cao E
069c7e4188 Fix CI failures (#19303) 2026-03-03 15:04:44 -08:00
Yi Zhong
b8c71f895e Add tuned triton==3.5.1 h200 tp2, tp4 for qwen 3 next (#15948)
Signed-off-by: vincentzed
2026-03-03 14:47:19 -08:00
Yi Zhong
0c760c4cd7 Add tuned triton==3.5.1 b200 tp2, tp4 for qwen 3 next (#15917)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
2026-03-03 14:47:05 -08:00
Jonah Bernard
fb37c0a400 [args] Add Expert Parallelism Argument To SRT Runner (#18492)
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
2026-03-03 14:16:35 -08:00
Praneth Paruchuri
f7897def96 [Feature] Improve weight loading log (#18651)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-03 14:16:13 -08:00
Brayden Zhong
9305f0e58d Support triton_kernels for GPT-OSS on SM120 (#19718)
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
2026-03-03 14:14:01 -08:00
Sam (Kesen Li)
5b2e2750b5 Enable XQA for SM90 and SM120 (#17115)
Co-authored-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2026-03-03 14:09:44 -08:00
Kangyan-Zhou
dc92f88a21 Enhance bench_multiturn.py with OpenAI API support and richer metrics (#19724)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:48:04 -08:00
Guy Stone
f749802402 [Score API][18132] return token usage in Score API response (#18381) 2026-03-03 13:45:35 -08:00
almaslof
b0f26698f5 feat(benchmark script): add similar to vllm --ready-check-timeout-sec parameter (#15466) 2026-03-03 13:44:38 -08:00
Karthik Koralla
85ab6a7f54 cli: Add lazy imports and fail-fast config validation (RFC #9853) (#19368)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-03 13:03:49 -08:00
xrwang8
cedb86a950 Feature:Reserve HTTP server port before model loading to immediately detect port conflicts instead of failing after several minutes of model loading. (#17754)
Signed-off-by: xrwang8 <xrwang8@gmail.com>
2026-03-03 11:59:18 -08:00
doujiang24
2e1b9e2547 Fix routed_dp_rank boundary validation (#19762)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-03-03 11:55:15 -08:00
yefei12
85f7a0aa30 feat: support Kimi K2.5 for Eagle3 (#19689)
Co-authored-by: chenyefei.cyf <chenyefei.cyf@U-9V5T77LW-2356.local>
Co-authored-by: GeLee-Q <865038696@qq.com>
Co-authored-by: Gao016 <yngao016@163.com>
Co-authored-by: sxl1993 <1218197792@qq.com>
2026-03-03 13:41:15 -05:00
xutizhou
c6377bbbca feat(gdn): add FlashInfer K-last SSM layout support for GDN prefill and decode for Hopper (#18361)
Co-authored-by: HongliMi <106042350+HongliMi@users.noreply.github.com>
Co-authored-by: xiaozhoupy <181108106+zhou9402@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Avery Yingyi Huang <averyh@nvidia.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
2026-03-03 20:30:48 +08:00
Jasonzhang517
d939e26585 [model gateway][0/N] router EPD support: add encoder grpc server backend support (#16552)
Co-authored-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Co-authored-by: Zongyao Chen <solar1s@163.com>
2026-03-03 19:38:15 +08:00
Shangming Cai
facde4c6d3 [PD] Enable all CP ranks for KVCache transfer (#19765)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-03 19:35:21 +08:00
shengzhaotian
365ca1edb5 [NPU] bugs fix: fix a condition bug when using speculative inference on Qwen3 and Qwen3 moe (#19532) 2026-03-03 17:59:25 +08:00
Muqi Li
666caaf9ce [Tool Call] Stream DeepSeek-V3.2 function call parameters in JSON format. (#16091)
Co-authored-by: Huixxi <uestc.hugo@gmail.com>
2026-03-03 01:46:29 -08:00
Shaun Kotek
4c95953b77 Fix/nemotron mtp quantaized (#19433) 2026-03-03 01:07:46 -08:00
Charles Chen
af0d35b224 Fix: Reject requests with a duplicate request ID which can cause server crash/hang (#19035)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-03 00:33:25 -08:00
Muqi Li
6af0448cc9 [Bugfix] Catch errors when DeepSeek-V3.2 generates malformed JSON (#18174) 2026-03-03 00:10:07 -08:00
Liangsheng Yin
7a2d3df96f Apply default stream to priority 0 in scheduling. (#16438) 2026-03-03 00:05:27 -08:00
Zack Yu
07b8d763ef feat: Add FP8 KV cache support for Triton attention backend (#18882) 2026-03-02 23:38:34 -08:00
赵晨阳
62480ebb1b [SGLang-Diffusion] Fix custom op fake impl missing eps default for torch.compile (#19725)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-03-03 15:24:36 +08:00
1StepForever
3c01b44700 [Fix] NPU deepep hccl buffer and fix IPC safe check (#17804) 2026-03-03 14:56:06 +08:00
Xinyuan Tong
dbf1247fe0 Add KimiK2Detector with tool interruption support (#19696)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-03-03 14:04:49 +08:00
Yuzhen Zhou
63003a39cf [BUG] Support tuple hidden_states from fused MXFP4/FP8 quantization (#19643) 2026-03-02 20:39:06 -08:00
Alison Shao
fe9d85d93c Fix CompressedTensorsMxInt4MoE abstract method and relax GPQA baseline (#19726)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
2026-03-02 19:03:21 -08:00