strgrb
|
738ebfd330
|
KDA: fuse qkv conv and support stride for fused_sigmoid_gating_delta_rule_update_kernel (#19506)
|
2026-03-04 22:45:53 +08:00 |
|
YeChang Guo
|
6910c1b281
|
[Feature][NPU]: add runtime support for GPTQ-quantized MoE models (#16364)
Co-authored-by: GuoYechang <52730608+GuoYechang@users.noreply.github.com>
Co-authored-by: root <root@localhost.localdomain>
|
2026-03-04 16:02:19 +03:00 |
|
Shangming Cai
|
c2b66d320d
|
[HiCache] Add an env var to control transfer engine reuse (#19867)
|
2026-03-04 20:36:32 +08:00 |
|
chenxu214
|
88cfa6c11d
|
[NPU]Releasing redundant memory of w13_weight and nz when the ascend_fuseep feature is enabled (#19813)
|
2026-03-04 19:26:29 +08:00 |
|
sky
|
17119a697d
|
Optimization: Reduce the number of D2H operations (#19424)
Signed-off-by: wangfakang <fakangwang@gmail.com>
|
2026-03-04 16:32:42 +08:00 |
|
Mohammad Miadh Angkad
|
09fa012ba7
|
Fix /health regression from early prebound socket listen (#19805)
|
2026-03-03 23:00:46 -08:00 |
|
Yuhao Yang
|
115f879958
|
Helios: Real Real-Time Long Video Generation Model (#19782)
|
2026-03-04 14:58:04 +08:00 |
|
qwe
|
562c3ff2d0
|
[Feature] implement the standard multi-layer MTP for step3p5 (#18564)
Co-authored-by: mei ran <meiran0528@gmail.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
|
2026-03-03 22:48:53 -08:00 |
|
DefTruth
|
e9b5706545
|
[diffusion] feat: support torch compile for diffusers backend (#19673)
|
2026-03-04 14:08:45 +08:00 |
|
Michael
|
c6850ac30c
|
[AMD] Fix Qwen3-Coder-Next: Add missing k_scale/v_scale args to extend_attention_fwd in aiter_backend (#19736)
|
2026-03-03 22:01:08 -08:00 |
|
Jue Wang
|
5972f97f11
|
Remove naive rotary forward overriding. (#19263)
|
2026-03-03 21:50:40 -08:00 |
|
ybyang
|
ac1f07487a
|
Fix triton alloc extend kernel (#19780)
|
2026-03-03 21:01:16 -08:00 |
|
Bi Xue
|
73bf2c5bdc
|
[sgl]add pin_mem to remove cpu->gpu copy sync point (#19795)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2026-03-03 21:00:51 -08:00 |
|
sglang-bot
|
b7f7df7ee6
|
[NSA] Fix line-too-long lint in can_nsa_prefill_cp_round_robin_split (#19829)
|
2026-03-03 20:34:22 -08:00 |
|
Yuhao Yang
|
ca44aa25af
|
Fix dp_attention crash when dp_size < tp_size in warmup dummy run (#19760)
|
2026-03-03 19:43:13 -08:00 |
|
Ruihang Li
|
da9dcbc906
|
[diffusion] fix: fix corrupted image editing outputs in Multi-GPU SP mode for FLUX.2-klein models (#19454)
|
2026-03-04 11:35:46 +08:00 |
|
Baidu-AIAK
|
6851613b93
|
[Bugfix] For cp: Fixed hang problem in prefix cache and kvcache support fp8 in-seq-split mode (#19656)
Co-authored-by: vincent <vincent@vincentdeMacBook-Pro.local>
|
2026-03-03 19:19:46 -08:00 |
|
Xiaoyu Zhang
|
4348976f80
|
[Diffusion] Refactor diffusion benchmark/profile skill to reuse diffusion-perf skill and clarify profiling trigger (#19783)
|
2026-03-04 10:54:42 +08:00 |
|
Yuan Luo
|
82e7139c06
|
[VLM] Support cos sin cache for Ernie4.5-VL (#19743)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2026-03-04 10:54:23 +08:00 |
|
Xiaoyu Zhang
|
115e9a1acd
|
[Diffusion] Delete useless _ulysses_input_split func (#19786)
|
2026-03-04 10:45:11 +08:00 |
|
xieminghe1
|
ee5ccde0ad
|
support fused_moe_triton and moe_sum_all_reduce kernel fusion[reduce … (#19672)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
|
2026-03-04 10:30:33 +08:00 |
|
Charles Chen
|
d22c6a3847
|
fix: Properly return abort error for streaming requests if the abort is triggered by scheduler (#19357)
|
2026-03-03 17:18:15 -08:00 |
|
Brayden Zhong
|
e2af840c3d
|
Various SM120 improvements (#19721)
|
2026-03-03 16:46:13 -08:00 |
|
Hao Jin
|
a69b943356
|
[SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models (#18154)
Co-authored-by: Hao Jin <Hao Jin>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
|
2026-03-04 08:39:46 +08:00 |
|
Hubert Lu
|
441045a7bf
|
[AMD] Fix EAGLE3 speculative decoding with aiter attention backend (#19362)
|
2026-03-03 16:12:13 -08:00 |
|
Jiayi Yan
|
753da27535
|
[Bugfix] fix parse_lscpu_topology bug (#18520)
|
2026-03-03 15:15:36 -08:00 |
|
Cao E
|
069c7e4188
|
Fix CI failures (#19303)
|
2026-03-03 15:04:44 -08:00 |
|
Yi Zhong
|
b8c71f895e
|
Add tuned triton==3.5.1 h200 tp2, tp4 for qwen 3 next (#15948)
Signed-off-by: vincentzed
|
2026-03-03 14:47:19 -08:00 |
|
Yi Zhong
|
0c760c4cd7
|
Add tuned triton==3.5.1 b200 tp2, tp4 for qwen 3 next (#15917)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
|
2026-03-03 14:47:05 -08:00 |
|
Jonah Bernard
|
fb37c0a400
|
[args] Add Expert Parallelism Argument To SRT Runner (#18492)
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
|
2026-03-03 14:16:35 -08:00 |
|
Praneth Paruchuri
|
f7897def96
|
[Feature] Improve weight loading log (#18651)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-03-03 14:16:13 -08:00 |
|
Brayden Zhong
|
9305f0e58d
|
Support triton_kernels for GPT-OSS on SM120 (#19718)
Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>
|
2026-03-03 14:14:01 -08:00 |
|
Sam (Kesen Li)
|
5b2e2750b5
|
Enable XQA for SM90 and SM120 (#17115)
Co-authored-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
|
2026-03-03 14:09:44 -08:00 |
|
Kangyan-Zhou
|
dc92f88a21
|
Enhance bench_multiturn.py with OpenAI API support and richer metrics (#19724)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-03-03 13:48:04 -08:00 |
|
Guy Stone
|
f749802402
|
[Score API][18132] return token usage in Score API response (#18381)
|
2026-03-03 13:45:35 -08:00 |
|
almaslof
|
b0f26698f5
|
feat(benchmark script): add similar to vllm --ready-check-timeout-sec parameter (#15466)
|
2026-03-03 13:44:38 -08:00 |
|
Karthik Koralla
|
85ab6a7f54
|
cli: Add lazy imports and fail-fast config validation (RFC #9853) (#19368)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2026-03-03 13:03:49 -08:00 |
|
xrwang8
|
cedb86a950
|
Feature:Reserve HTTP server port before model loading to immediately detect port conflicts instead of failing after several minutes of model loading. (#17754)
Signed-off-by: xrwang8 <xrwang8@gmail.com>
|
2026-03-03 11:59:18 -08:00 |
|
doujiang24
|
2e1b9e2547
|
Fix routed_dp_rank boundary validation (#19762)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
|
2026-03-03 11:55:15 -08:00 |
|
yefei12
|
85f7a0aa30
|
feat: support Kimi K2.5 for Eagle3 (#19689)
Co-authored-by: chenyefei.cyf <chenyefei.cyf@U-9V5T77LW-2356.local>
Co-authored-by: GeLee-Q <865038696@qq.com>
Co-authored-by: Gao016 <yngao016@163.com>
Co-authored-by: sxl1993 <1218197792@qq.com>
|
2026-03-03 13:41:15 -05:00 |
|
xutizhou
|
c6377bbbca
|
feat(gdn): add FlashInfer K-last SSM layout support for GDN prefill and decode for Hopper (#18361)
Co-authored-by: HongliMi <106042350+HongliMi@users.noreply.github.com>
Co-authored-by: xiaozhoupy <181108106+zhou9402@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Avery Yingyi Huang <averyh@nvidia.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
|
2026-03-03 20:30:48 +08:00 |
|
Jasonzhang517
|
d939e26585
|
[model gateway][0/N] router EPD support: add encoder grpc server backend support (#16552)
Co-authored-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Co-authored-by: Zongyao Chen <solar1s@163.com>
|
2026-03-03 19:38:15 +08:00 |
|
Shangming Cai
|
facde4c6d3
|
[PD] Enable all CP ranks for KVCache transfer (#19765)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
2026-03-03 19:35:21 +08:00 |
|
shengzhaotian
|
365ca1edb5
|
[NPU] bugs fix: fix a condition bug when using speculative inference on Qwen3 and Qwen3 moe (#19532)
|
2026-03-03 17:59:25 +08:00 |
|
Muqi Li
|
666caaf9ce
|
[Tool Call] Stream DeepSeek-V3.2 function call parameters in JSON format. (#16091)
Co-authored-by: Huixxi <uestc.hugo@gmail.com>
|
2026-03-03 01:46:29 -08:00 |
|
Shaun Kotek
|
4c95953b77
|
Fix/nemotron mtp quantaized (#19433)
|
2026-03-03 01:07:46 -08:00 |
|
Charles Chen
|
af0d35b224
|
Fix: Reject requests with a duplicate request ID which can cause server crash/hang (#19035)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2026-03-03 00:33:25 -08:00 |
|
Muqi Li
|
6af0448cc9
|
[Bugfix] Catch errors when DeepSeek-V3.2 generates malformed JSON (#18174)
|
2026-03-03 00:10:07 -08:00 |
|
Liangsheng Yin
|
7a2d3df96f
|
Apply default stream to priority 0 in scheduling. (#16438)
|
2026-03-03 00:05:27 -08:00 |
|
Zack Yu
|
07b8d763ef
|
feat: Add FP8 KV cache support for Triton attention backend (#18882)
|
2026-03-02 23:38:34 -08:00 |
|