8252 Commits

Author SHA1 Message Date
Yuxuan Zhang
d49fc092cb [Bug Fix] GLM-5.1: drop constexpr on page_indice_batch_offset, skip offloader post_init on draft worker, support N=32 in copy_to_gpu_no_ce (#23550) 2026-05-09 15:43:45 +08:00
Liangsheng Yin
78da0d3106 [Spec] Move accept_tokens off EagleDraftInput; pass via method arg (#24735) 2026-05-08 23:24:18 -07:00
Chi McIsaac
8e534e8f15 [diffusion] fix: fix diffusers executor crash when component residency manager is absent (#24573) 2026-05-09 11:45:06 +08:00
storyicon
590b13b513 [diffusion] fix: fix NCCL deadlock in ulysses sp when sequence length has remainder (#24694)
Signed-off-by: storyicon <storyicon@foxmail.com>
2026-05-09 11:05:37 +08:00
Polisetty V R K Jyothendra Varma
50ed01674e fix is_arch_support_pdl function usage (#24600)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-09 09:39:34 +08:00
Liangsheng Yin
1613bae412 [Spec] Disambiguate verified_id into bonus_token(s) / accept_tokens (#24724) 2026-05-08 18:24:33 -07:00
Yuan Luo
a61a14f416 [KDA] Optimize prefill kernels with diagonal and recompute fuse (#24271)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-05-09 08:52:51 +08:00
Brayden Zhong
9ee830346f Disable Custom AR V2 when in multi-node (#24729)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-08 17:50:05 -07:00
Cheng Wan
d1c5937428 env: add SGLANG_RADIX_FORCE_MISS to force radix prefix-cache miss (#24726)
Co-authored-by: sihan-zzz <228612289+sihan-zzz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:46:38 -07:00
YAMY
560829a171 feat(scheduler): add adaptive queue-based prefill delayer trigger (#23189) 2026-05-08 16:54:30 -07:00
YAMY
6971a03fe6 fix(fa3): skip scheduler_metadata precompute under DP attention (#24632) 2026-05-08 16:19:20 -07:00
Niko Ma
62c2e091f6 [PD] MORI-IO: Add state transfer, inline transfer model, and high-concurrency fixes (#22665) 2026-05-08 16:07:22 -07:00
Jimmy Shong
fa8985486e [test/fix]: isolate VLM MMMU eval output dirs to fix nightly-4-gpu cross-test pollution (#24623)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-08 15:01:53 -07:00
Jimmy Shong
096ad02b06 [Model] Laguna-XS.2 Model Support (#24204) 2026-05-09 05:43:13 +08:00
Cheng Wan
7b707c9222 disable the combination of --enable-two-batch-overlap and --enforce-s… (#24720) 2026-05-08 14:27:35 -07:00
Yuhao Yang
09912fd89d Remove unnecessary bf16 assert in rotate_activation (#24686) 2026-05-09 05:00:52 +08:00
Yilong Zhao
f30d1d0b0a logits: remove blocking H2D copy (#24627) 2026-05-08 13:22:13 -07:00
Ethan Feng
672f778512 [NemotronH] Fix expert scale weight loading (#24434) 2026-05-08 12:37:06 -07:00
zhongdaor-nv
2cf1a4ab38 feat: Add KV events for Mamba radix cache (#23678)
Signed-off-by: zhongdaor-nv <220807034+zhongdaor-nv@users.noreply.github.com>
Co-authored-by: zhongdaor-nv <220807034+zhongdaor-nv@users.noreply.github.com>
2026-05-08 11:53:36 -07:00
Lianmin Zheng
e40e339c72 Filter non-int token ids in benchmark and observe decode-side bootstrap/alloc metrics (#24684) 2026-05-08 11:45:37 -07:00
Mick
73b8eda103 [diffusion] fix: fix FA3 varlen out argument handling (#24688) 2026-05-08 19:01:49 +08:00
fanxingran
7f8e7a9130 fix(aiter): drop FP8 KV upcast; use native FP8 path in paged_attentio… (#24129)
Co-authored-by: fanxingran <fanxingran@amd.com>
2026-05-08 02:47:48 -07:00
jacky.cheng
f21d4868dc [AMD] Replace naive triton RMSNorm with aiter RMSNorm for diffusion model (#24360) 2026-05-08 02:44:13 -07:00
YC Yen-Ching Tseng
e1150f66db [AMD][diffusion] Temporal-unfolded batched Conv2D for ROCm VAE decode (#22971) 2026-05-08 02:32:14 -07:00
Brayden Zhong
80d0226b68 Turn on JIT custom AR implementation by default (#24363)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-08 02:05:31 -07:00
HAI
73792629d4 [AMD] Intro SGLANG_DIFFUSION_AITER_FP8_ATTN (#24677) 2026-05-08 01:31:00 -07:00
jacky.cheng
b22d3cd606 [AMD] Support fp8 MLA for diffusion model (#20319) 2026-05-08 00:56:24 -07:00
Yibo Cai
55d8223c2b [sgl-kernel/cpu] support w8a8 int8 model for arm cpu (#16045)
skip gpu test as this one is not related to gpu backend.
2026-05-08 14:47:06 +08:00
JoyFuture
e1bc001872 fix(mimo_v2): auto-disable multimodal when vision/audio configs are absent (#24652) 2026-05-08 13:40:08 +08:00
maocheng23
7deed98e1b [fix] /pause_generation and /continue_generation wrong for --tokenizer-worker-num > 1 (#24462)
Co-authored-by: lawrence-harmonic <185285563+lawrence-harmonic@users.noreply.github.com>
2026-05-07 21:32:21 -07:00
Mick
2afb450501 [diffusion] optimize: optimize frame returns path (#24616) 2026-05-08 12:10:09 +08:00
johnnycxm
cdf5771f91 [MUSA][17/N] ci: Add MUSA diffusion, sgl-kernel tests, and CI workflow support (#20672)
Co-authored-by: ximin.chen <ximin.chen@mthreads.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2026-05-07 20:45:21 -07:00
Brayden Zhong
5fa3bb2eaf Enable flashinfer::trtllm_allreduce_fusion with PDL (#23765)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-08 10:41:10 +08:00
shuwenn
d9dddd4d7d [SPEC V2][2/N] feat: adaptive spec support spec v2 (#23336)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2026-05-07 18:33:47 -07:00
Liangsheng Yin
35870d55ac Deepseek V4 (#23882)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: yueming-yuan <yym022502@gmail.com>
Co-authored-by: DarkSharpness <2040703891@qq.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Qiaolin Yu <90088090+qiaolin-yu@users.noreply.github.com>
Co-authored-by: Ethan (Yusheng) Su <11704492+yushengsu-thu@users.noreply.github.com>
Co-authored-by: Mingyi <27337995+wisclmy0611@users.noreply.github.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Yihao Wang <42559837+againstentropy@users.noreply.github.com>
2026-05-07 18:32:21 -07:00
Mandepudi Rani Chowdary
55224fff08 Add Arm64 CPU Phase 1A CI bootstrap (#22123)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-08 09:28:23 +08:00
Lianmin Zheng
3c3f0bd55e Cache empty MatchResult in RadixCache (#24470) 2026-05-07 17:13:20 -07:00
Baizhou Zhang
c4bb3ce273 Fix stuck when enabling MTP on DSA models (#24635) 2026-05-07 17:06:28 -07:00
Liangsheng Yin
95fb722dd2 Add registry for custom speculative algorithms (#23991) 2026-05-07 16:11:45 -07:00
Revanth Reddy Airre
c2c57068da fix(http): apply SGLANG_TIMEOUT_KEEP_ALIVE in common.py (#24323)
Signed-off-by: Revanth Reddy Airre <revanthreddy@hippocraticai.com>
2026-05-07 16:01:41 -07:00
Xinyuan Tong
5b589ed2e7 feat(constrained): two-phase reasoning grammar + --enable-strict-thinking (#23953) 2026-05-07 14:21:51 -07:00
Xinyuan Tong
af2a2ac618 fix(function_call): handle Kimi-K2.5 bare numeric tool call IDs (#23950) 2026-05-07 14:20:02 -07:00
Xinyuan Tong
d8f9d32a05 feat(reasoning): auto-detect reasoning/tool-call parser from chat template (#23952) 2026-05-07 14:19:16 -07:00
Khoa Pham
d2c1034163 [Gemma 4] Adding MTP support (#24436)
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
2026-05-07 14:08:41 -07:00
Xinyuan Tong
f1395af543 fix(openai): map reasoning.enabled to thinking AND enable_thinking (#23951) 2026-05-07 14:01:35 -07:00
R0CKSTAR
9cffa5ed6f [MUSA] Bump torchada to 0.1.54 (#24592)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-05-07 11:45:49 -07:00
GXIN
90a618e37b [NPU][diffusion] add selectable parallel VAE decode strategies (#23248)
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-07 21:37:03 +03:00
Junlin Wu
80a6014243 [diffusion][npu][quant] Add MXFP8 quantization support for Wan2.2 Diffusion on Ascend NPU (#20922)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-07 21:30:56 +03:00
McZyWu
7d397ad23d [NPU]Support model Trinity-mini for Npu, accuracy 90% (#18172)
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-05-07 20:58:18 +03:00
Mick
b0225a69dc [diffusion] optimize: precompute LTX2 guidance perturbation states (#24494) 2026-05-08 01:18:42 +08:00