sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-02 13:27:09 +00:00

Author	SHA1	Message	Date
Yuxuan Zhang	d49fc092cb	[Bug Fix] GLM-5.1: drop constexpr on page_indice_batch_offset, skip offloader post_init on draft worker, support N=32 in copy_to_gpu_no_ce (#23550)	2026-05-09 15:43:45 +08:00
Liangsheng Yin	78da0d3106	[Spec] Move `accept_tokens` off `EagleDraftInput`; pass via method arg (#24735)	2026-05-08 23:24:18 -07:00
Chi McIsaac	8e534e8f15	[diffusion] fix: fix diffusers executor crash when component residency manager is absent (#24573)	2026-05-09 11:45:06 +08:00
storyicon	590b13b513	[diffusion] fix: fix NCCL deadlock in ulysses sp when sequence length has remainder (#24694) Signed-off-by: storyicon <storyicon@foxmail.com>	2026-05-09 11:05:37 +08:00
Polisetty V R K Jyothendra Varma	50ed01674e	fix is_arch_support_pdl function usage (#24600) Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-05-09 09:39:34 +08:00
Liangsheng Yin	1613bae412	[Spec] Disambiguate `verified_id` into `bonus_token(s)` / `accept_tokens` (#24724)	2026-05-08 18:24:33 -07:00
Yuan Luo	a61a14f416	[KDA] Optimize prefill kernels with diagonal and recompute fuse (#24271) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-05-09 08:52:51 +08:00
Brayden Zhong	9ee830346f	Disable Custom AR V2 when in multi-node (#24729) Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>	2026-05-08 17:50:05 -07:00
Cheng Wan	d1c5937428	env: add SGLANG_RADIX_FORCE_MISS to force radix prefix-cache miss (#24726) Co-authored-by: sihan-zzz <228612289+sihan-zzz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:46:38 -07:00
YAMY	560829a171	feat(scheduler): add adaptive queue-based prefill delayer trigger (#23189)	2026-05-08 16:54:30 -07:00
YAMY	6971a03fe6	fix(fa3): skip scheduler_metadata precompute under DP attention (#24632)	2026-05-08 16:19:20 -07:00
Niko Ma	62c2e091f6	[PD] MORI-IO: Add state transfer, inline transfer model, and high-concurrency fixes (#22665)	2026-05-08 16:07:22 -07:00
Jimmy Shong	fa8985486e	[test/fix]: isolate VLM MMMU eval output dirs to fix nightly-4-gpu cross-test pollution (#24623) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-08 15:01:53 -07:00
Jimmy Shong	096ad02b06	[Model] Laguna-XS.2 Model Support (#24204)	2026-05-09 05:43:13 +08:00
Cheng Wan	7b707c9222	disable the combination of --enable-two-batch-overlap and --enforce-s… (#24720)	2026-05-08 14:27:35 -07:00
Yuhao Yang	09912fd89d	Remove unnecessary bf16 assert in rotate_activation (#24686)	2026-05-09 05:00:52 +08:00
Yilong Zhao	f30d1d0b0a	logits: remove blocking H2D copy (#24627)	2026-05-08 13:22:13 -07:00
Ethan Feng	672f778512	[NemotronH] Fix expert scale weight loading (#24434)	2026-05-08 12:37:06 -07:00
zhongdaor-nv	2cf1a4ab38	feat: Add KV events for Mamba radix cache (#23678) Signed-off-by: zhongdaor-nv <220807034+zhongdaor-nv@users.noreply.github.com> Co-authored-by: zhongdaor-nv <220807034+zhongdaor-nv@users.noreply.github.com>	2026-05-08 11:53:36 -07:00
Lianmin Zheng	e40e339c72	Filter non-int token ids in benchmark and observe decode-side bootstrap/alloc metrics (#24684)	2026-05-08 11:45:37 -07:00
Mick	73b8eda103	[diffusion] fix: fix FA3 varlen out argument handling (#24688)	2026-05-08 19:01:49 +08:00
fanxingran	7f8e7a9130	fix(aiter): drop FP8 KV upcast; use native FP8 path in paged_attentio… (#24129) Co-authored-by: fanxingran <fanxingran@amd.com>	2026-05-08 02:47:48 -07:00
jacky.cheng	f21d4868dc	[AMD] Replace naive triton RMSNorm with aiter RMSNorm for diffusion model (#24360)	2026-05-08 02:44:13 -07:00
YC Yen-Ching Tseng	e1150f66db	[AMD][diffusion] Temporal-unfolded batched Conv2D for ROCm VAE decode (#22971)	2026-05-08 02:32:14 -07:00
Brayden Zhong	80d0226b68	Turn on JIT custom AR implementation by default (#24363) Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>	2026-05-08 02:05:31 -07:00
HAI	73792629d4	[AMD] Intro SGLANG_DIFFUSION_AITER_FP8_ATTN (#24677)	2026-05-08 01:31:00 -07:00
jacky.cheng	b22d3cd606	[AMD] Support fp8 MLA for diffusion model (#20319)	2026-05-08 00:56:24 -07:00
Yibo Cai	55d8223c2b	[sgl-kernel/cpu] support w8a8 int8 model for arm cpu (#16045) skip gpu test as this one is not related to gpu backend.	2026-05-08 14:47:06 +08:00
JoyFuture	e1bc001872	fix(mimo_v2): auto-disable multimodal when vision/audio configs are absent (#24652)	2026-05-08 13:40:08 +08:00
maocheng23	7deed98e1b	[fix] /pause_generation and /continue_generation wrong for --tokenizer-worker-num > 1 (#24462) Co-authored-by: lawrence-harmonic <185285563+lawrence-harmonic@users.noreply.github.com>	2026-05-07 21:32:21 -07:00
Mick	2afb450501	[diffusion] optimize: optimize frame returns path (#24616)	2026-05-08 12:10:09 +08:00
johnnycxm	cdf5771f91	[MUSA][17/N] ci: Add MUSA diffusion, sgl-kernel tests, and CI workflow support (#20672) Co-authored-by: ximin.chen <ximin.chen@mthreads.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2026-05-07 20:45:21 -07:00
Brayden Zhong	5fa3bb2eaf	Enable `flashinfer::trtllm_allreduce_fusion` with PDL (#23765) Co-authored-by: b8zhong <b8zhong@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-08 10:41:10 +08:00
shuwenn	d9dddd4d7d	[SPEC V2][2/N] feat: adaptive spec support spec v2 (#23336) Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>	2026-05-07 18:33:47 -07:00
Liangsheng Yin	35870d55ac	Deepseek V4 (#23882) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: yueming-yuan <yym022502@gmail.com> Co-authored-by: DarkSharpness <2040703891@qq.com> Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com> Co-authored-by: yhyang201 <yhyang201@users.noreply.github.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: Qiaolin Yu <90088090+qiaolin-yu@users.noreply.github.com> Co-authored-by: Ethan (Yusheng) Su <11704492+yushengsu-thu@users.noreply.github.com> Co-authored-by: Mingyi <27337995+wisclmy0611@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Yihao Wang <42559837+againstentropy@users.noreply.github.com>	2026-05-07 18:32:21 -07:00
Mandepudi Rani Chowdary	55224fff08	Add Arm64 CPU Phase 1A CI bootstrap (#22123) Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-05-08 09:28:23 +08:00
Lianmin Zheng	3c3f0bd55e	Cache empty MatchResult in RadixCache (#24470)	2026-05-07 17:13:20 -07:00
Baizhou Zhang	c4bb3ce273	Fix stuck when enabling MTP on DSA models (#24635)	2026-05-07 17:06:28 -07:00
Liangsheng Yin	95fb722dd2	Add registry for custom speculative algorithms (#23991)	2026-05-07 16:11:45 -07:00
Revanth Reddy Airre	c2c57068da	fix(http): apply SGLANG_TIMEOUT_KEEP_ALIVE in common.py (#24323) Signed-off-by: Revanth Reddy Airre <revanthreddy@hippocraticai.com>	2026-05-07 16:01:41 -07:00
Xinyuan Tong	5b589ed2e7	feat(constrained): two-phase reasoning grammar + --enable-strict-thinking (#23953)	2026-05-07 14:21:51 -07:00
Xinyuan Tong	af2a2ac618	fix(function_call): handle Kimi-K2.5 bare numeric tool call IDs (#23950)	2026-05-07 14:20:02 -07:00
Xinyuan Tong	d8f9d32a05	feat(reasoning): auto-detect reasoning/tool-call parser from chat template (#23952)	2026-05-07 14:19:16 -07:00
Khoa Pham	d2c1034163	[Gemma 4] Adding MTP support (#24436) Co-authored-by: Pengyu Chen <pychen96@gmail.com>	2026-05-07 14:08:41 -07:00
Xinyuan Tong	f1395af543	fix(openai): map reasoning.enabled to thinking AND enable_thinking (#23951)	2026-05-07 14:01:35 -07:00
R0CKSTAR	9cffa5ed6f	[MUSA] Bump torchada to 0.1.54 (#24592) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-05-07 11:45:49 -07:00
GXIN	90a618e37b	[NPU][diffusion] add selectable parallel VAE decode strategies (#23248) Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-07 21:37:03 +03:00
Junlin Wu	80a6014243	✨ [diffusion][npu][quant] Add MXFP8 quantization support for Wan2.2 Diffusion on Ascend NPU (#20922) Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2026-05-07 21:30:56 +03:00
McZyWu	7d397ad23d	[NPU]Support model Trinity-mini for Npu, accuracy 90% (#18172) Co-authored-by: sglang-npu-bot <sglangnpu@163.com>	2026-05-07 20:58:18 +03:00
Mick	b0225a69dc	[diffusion] optimize: precompute LTX2 guidance perturbation states (#24494)	2026-05-08 01:18:42 +08:00

8252 Commits