7019 Commits

Author SHA1 Message Date
Bruce Wu
70a6fb53af Enable embedding lookup/lora_a logic for chunked backend (#17692)
Co-authored-by: Bruce Wu <mogicianwu@fb.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
2026-03-16 11:37:58 -07:00
Douglas Yang
061ec582bf fix: adding teacache.params back to sampling params as intended (#20665) 2026-03-16 11:27:06 -07:00
ybyang
289cbcf482 fix: support PP2+CP8+TP8 (PP with context parallelism) (#19548) 2026-03-16 16:51:47 +00:00
Xiaoyu Zhang
6489f77733 [Diffusion] Fix compile graph broken by flashinfer rope (#20699) 2026-03-16 23:14:27 +08:00
Du Bin
d3c0f4376a Fix AssertionError crash in disagg prefill inflight queue with PP (#20686) 2026-03-16 22:38:59 +08:00
Xiaoyu Zhang
15097c5c3b Release sglang kernel 0.4.0 (#20440)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-16 20:34:58 +08:00
sky
3d58cd16d9 [DP Attention] Optimize dp_padding_mode selection for dp_size=1 in extend mode (#20406)
Signed-off-by: wangfakang <fakangwang@gmail.com>
2026-03-16 18:44:42 +08:00
Xun Sun
549fbcc864 [5/N] (Elastic EP) Use GPU P2P to exchange expert weights during EPLB as much as possible (#12068)
Co-authored-by: Hank Han <hanhan.hank@bytedance.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>
2026-03-16 18:40:58 +08:00
Xiaoyu Zhang
3055b6906d [Diffusion] Document torch.compile graph-break checks in diffusion benchmark skills (#20681) 2026-03-16 17:41:40 +08:00
Mick
485597e651 [diffusion] fix: fix some sampling args passed via cli are omitted (#20630) 2026-03-16 16:55:30 +08:00
Sugar920
895e56097c Add NPU basic function testcases (#19382)
Co-authored-by: cy <chenyang08056032@163.com>
Co-authored-by: Cherry_ming <136634645@qq.com>
2026-03-16 15:09:56 +08:00
shuwenn
42f18fe560 [HiCache] fix: release write-through lock_ref during decode (#20049) 2026-03-16 14:49:31 +08:00
Ke Bao
39336f5812 Precompute swa cache location (#20449) 2026-03-16 14:38:08 +08:00
Zheng Wengang
135af6dc92 [EPD][VLM] support video/audio input (#17824)
Co-authored-by: siyu <liusy58@linux.alibaba.com>
2026-03-16 14:18:21 +08:00
Shangming Cai
738cbde902 [PD] Make pending reqs resolving more robust (#20505)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-16 14:12:13 +08:00
pansicheng
97b2a89334 [RadixTree][8/N Refactor]: unify lock interface (#20330) 2026-03-16 11:49:51 +08:00
Liangsheng Yin
f0458e0b49 [Utils] Move network/socket utilities from common.py to network.py (#20646) 2026-03-15 20:35:24 -07:00
Javier Torres
afc71bae3a feat: Add 'none' reasoning effort to ChatCompletionRequest (#20556) 2026-03-15 20:25:48 -07:00
gaopengff
f4393bf3f6 Fix correctness test issue for bench_one_batch (#20650) 2026-03-15 20:05:36 -07:00
Xiaoyu Zhang
e1eb25880f [Diffusion] Add a benchmark for rmsnorm/fuse_add_rmsnorm (#20632) 2026-03-16 09:50:33 +08:00
Zhirui
35c249b4de [OpenAI] Log raw request payload for --log-requests (#20605) 2026-03-15 17:45:00 -07:00
Liangsheng Yin
d852f26cb6 Fix dual-stack socket handling: IPV6_V6ONLY, IPv4-first, is_port_available all-family check (#20643) 2026-03-15 17:17:23 -07:00
jellysnack
53f831691a fix: propagate grammar errors and improve llguidance backend (#20467) 2026-03-15 16:11:18 -07:00
psaab
1145805e7d Fix socket utilities and reserve_port for IPv6 dual-stack support (#20491)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-15 14:29:10 -07:00
Ke Bao
e2be31824f [CI] Add ut coverage tool (#20628) 2026-03-15 21:13:45 +08:00
Yuhao Yang
1c456a0af5 VLM: add Conv2dLayer/Conv3dLayer to fix PyTorch 2.9.1 CuDNN Conv3d (#20282)
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2026-03-15 19:17:44 +08:00
Kit Fraser-Taliente
7c773ddb0a [Fix] Slice input_embeds to extend_input_len in prepare_for_extend (#20376) 2026-03-15 00:07:05 -07:00
Juan Muneton
7458407437 Fix InternVL and vision attention for non-CUDA backends (e.g. XPU) (#19997)
Co-authored-by: Yang Wang <mr.yang.wang@outlook.com>
2026-03-14 23:24:41 -07:00
shuwenn
1ac6a26464 fix: Nemotron chunk size alias (#20458) 2026-03-14 23:23:39 -07:00
Liangsheng Yin
fc7f9c1de7 Rename --stream-output to --incremental-streaming-output (#20614) 2026-03-14 23:22:33 -07:00
shuwenn
538acb4c46 fix: Add .text property to HttpResponse to prevent AttributeError (#20518) 2026-03-14 22:59:32 -07:00
Yuhao Yang
a6ecf050be diffusion: fix helios accuracy issue (#20036) 2026-03-15 13:55:51 +08:00
sglang-bot
93afe15b43 chore: bump flashinfer version to 0.6.6 (#20480)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-03-14 13:05:10 -07:00
Xiaowei Wang
574dbe23b2 Add piecewise cuda graph for Qwen3-Next FP8 flashinfer_trtllm moe backend (#18184) 2026-03-14 13:03:31 -07:00
Baizhou Zhang
39008955ff Revert "[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars." (#20602) 2026-03-14 12:12:42 -07:00
Xiaoyu Zhang
5ab2cfe9a8 [Diffusion] Clean upstream fa3 in hopper (#20576) 2026-03-14 23:41:23 +08:00
Yuan Luo
22e67876d6 [Omni] Optimize AudioEncoder for Qwen3_Omni_Thinker (#18185)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-14 23:00:17 +08:00
Ratish P
574aa2d723 [diffusion]: remove stale offload-manager cleanup in denoising stage (#20587) 2026-03-14 22:56:57 +08:00
Xiaoyu Zhang
25e38216b6 [kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277) 2026-03-14 16:45:54 +08:00
Mohammad Miadh Angkad
75a7879fd4 [Model] Support Nemotron 3 Super NVFP4 (#20407) 2026-03-14 00:56:26 -07:00
SoluMilken
c95dc88f86 [CI] migrate ascend-gptq from test/srt to test/registered (#19628) 2026-03-14 00:28:57 -07:00
Xiaoyu Zhang
f9e4221b71 [Diffusion] add mova and hunyuanvideo to perf skills (#20563) 2026-03-14 13:49:50 +08:00
Shangming Cai
99a3b25c9b [PP] Fix recv tensor dict potential race condition (#20341)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-14 13:35:01 +08:00
Xinyuan Tong
c330b687a1 [Bugfix] Fix GLM-4.6V vision regression in glm4v_moe and glm_ocr (#20463) 2026-03-13 21:48:28 -07:00
ziruiliu
dfd0a77a9a [bugfix] Add prev_prefix_len parameter to HiMambaRadixCache's _insert_helper() (#20539) 2026-03-14 09:54:14 +08:00
Duyi-Wang
0eea80bc00 [AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars. (#20453) 2026-03-13 14:03:17 -07:00
YC Tseng
c37ef7f18b [AMD] diffusion refactor: move ROCM VAE optimization to Platform abstraction (#20496) 2026-03-13 13:10:05 -07:00
Simo Lin
654fc02cf1 [gRPC] Extract gRPC servicer into standalone package (#20478)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
2026-03-13 09:13:29 -07:00
Xiaoyu Zhang
be7a0311a0 [Diffusion] Fix and validate diffusion skills benchmarking/profiling workflow (#20528) 2026-03-13 21:11:37 +08:00
Leon Gao
b1246c50f8 Fix chunked prefill and KV cache leaks for streaming sessions (#20476)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-13 02:36:55 -07:00