sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 12:17:09 +00:00

Author	SHA1	Message	Date
Bingxu Chen	d84470079d	[AMD] Fix Grok-2 nightly: avoid multimodal misdetection from auto-populated vision_config (#23383)	2026-04-26 21:54:36 -07:00
Jia Guo	bead2e3470	perf: optimize PCG inductor path for FP8 models (redo of #21734) (#23227) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:34:27 -07:00
Byron Hsu	85376a6119	refactor(moe): centralize post-experts all-reduce skip predicate (#23748) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:29:59 -07:00
iridiumine	32c3513816	[NPU] Support MTP for Qwen3.5 (#20918)	2026-04-27 10:44:17 +08:00
Kangyan-Zhou	35591c7d51	fix(lora): don't assert on non-LoRA lm_head adapter weights (#23433) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:10:07 -07:00
Mick	a392ae8879	[diffusion] feat: accelerate multiple-outputs generation (#23759)	2026-04-27 01:47:33 +08:00
jianan-gu	10fd0faccd	[CPU] Add Qwen3.5 model optimization for CPU (#19484) Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2026-04-26 10:12:36 -07:00
Liwansi	7d49564431	[NPU]Fix support_triton bug (#23604)	2026-04-26 21:34:56 +08:00
Cheng Wan	c7878dbb6d	[MoE] Deprecate act_and_mul_triton; fold filter_expert into JIT silu/gelu_and_mul (#23707) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:41:35 -07:00
Mick	d49a0377de	[diffusion] refactor: make timestep scheduler request-local (#23716)	2026-04-26 15:59:53 +08:00
sglang-bot	9003f24e2b	chore: bump sglang-kernel version to 0.4.1.post1 (#23733) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Kangyan Zhou <zky314343421@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:23:49 -07:00
Byron Hsu	ba4e9d2ac2	Apply should_use_dp_reduce_scatterv guard to remaining MoE models (follow-up to #23731) (#23732) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>	2026-04-25 20:36:16 -07:00
Byron Hsu	71029abd64	Fix Qwen3 MoE: also guard EP all-reduce with not use_reduce_scatter (follow-up to #23731) (#23734) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 20:35:52 -07:00
Byron Hsu	99b59b279c	Fix Qwen3 MoE double-reduce when DP attention + EP + reduce_scatterv (#23729) (#23731) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>	2026-04-25 15:28:28 -07:00
AlbeeSo	e0a4522370	[typo] fix typo in parallel_state (#23710)	2026-04-25 09:33:33 -07:00
Mick	03849496ad	jit_kernel: tolerate FA3 kernels without out arg (#23717)	2026-04-25 23:42:33 +08:00
1874.	046c14a3ed	[NPU] Support GGUF quantization for Ascend NPU (dense + MoE) (#17883) Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-04-25 17:16:47 +03:00
gjsheu	e708ea6d94	[diffusion] fix: restore cache-dit support for LTX2 (#23235) Co-authored-by: gengjinsong <gengjinsong@huawei.com>	2026-04-25 18:10:43 +08:00
Aleksi Vesanto	50ce2708ca	[diffusion] fix: Fix FLUX.1/2 graph breaks (#23648)	2026-04-25 17:54:52 +08:00
kk	393252f514	[AMD] fused qk gemma norm kernels to reduce four kernels (#23575) Co-authored-by: root <root@smci355-ccs-aus-g12-26.cs-aus.dcgpu>	2026-04-25 00:30:01 -07:00
Артем Савкин	bd523dd60d	[NPU] [Bugfix] [Diffusion] Fixed gray images at the generation output (#23266) Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-04-25 10:20:38 +03:00
Yujing	6175946db7	[Feature]Add MSProbe dump support in SGLang (#18349)	2026-04-25 10:12:50 +03:00
Yujun Dong	21835fb0af	[HiCache] Prevent move_hybrid_indices from polluting radix-tree node host state (#23427) Co-authored-by: hzh0425 <hzh0425@apache.org>	2026-04-25 14:27:42 +08:00
DarkSharpness	82254bd9c5	[JIT Kernel] Reland JIT activation (#22094) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Cheng Wan <chwan@rice.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:00:28 -07:00
YC Yen-Ching Tseng	adc59325bc	[AMD] Optimize MiniMax-M2.5 - enable fused Triton kernel for FP8 KV cache write in aiter decode path (#23620)	2026-04-24 22:23:49 -07:00
YC Yen-Ching Tseng	fb272d27db	[AMD] Optimize MiniMax-M2.5 - use aiter biased_grouped_topk for sigmoid scoring in MoE routing (#23611)	2026-04-24 22:18:08 -07:00
Shenxiu Liu	8471c9ebe6	Skip torch.cuda.empty_cache() in weight update flush path (#22998)	2026-04-25 12:41:39 +08:00
Yuhao Yang	4a3fe2a091	model: support parakeet nemotron encoder (#23568) Co-authored-by: trangdough <trangtdo22@gmail.com>	2026-04-25 11:00:23 +08:00
Jackey Hua	465abadd3c	Add fused moe triton config for Qwen3.5-397B-A17B-FP8 (#23682)	2026-04-24 18:35:32 -07:00
Xinyi Song	76da28f6d6	[AMD][bugfix] add gate rocm >= 7.2 for bpreshuffle (#23671)	2026-04-24 13:26:16 -07:00
Jia Guo	587fd15bd2	perf: eliminate attention DtoD copy by passing pre-allocated output to FA (#21985) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-24 12:05:16 -07:00
Xinyuan Tong	6d03861476	support Hy3 preview (#23533) Co-authored-by: pengmeng <pengmeng@tencent.com> Co-authored-by: Qiaolin-Yu <liin1211@outlook.com> Co-authored-by: chengvjiang <chengvjiang@tencent.com> Co-authored-by: russellfeng <russellfeng@tencent.com>	2026-04-24 12:03:24 -07:00
Lianmin Zheng	6344b546c8	Deprecate --collect-tokens-histogram, auto-collect with --enable-metrics (#23595)	2026-04-24 12:00:16 -07:00
Mick	05696527ea	[diffusion] feat: support LoRA for LTX2.3 (#23649)	2026-04-25 01:52:41 +08:00
Kang Yifei	baa0aa670f	[HiCache & HybridModel] 3FS backend support DSA & mamba model (#23241) Co-authored-by: 墨已 <kangyifei.kyf@alibaba-inc.com> Co-authored-by: hzh0425 <hzh0425@apache.org>	2026-04-25 00:48:01 +08:00
Kangrui Du	92d262f710	[diffusion] RL: add per-step rollout options for SDE and trajectory capture (#23151)	2026-04-24 23:26:16 +08:00
Siju Samuel	bca3dd958a	[Intel GPU] Enable pipeline parallelism on XPU (#23645)	2026-04-24 19:52:44 +08:00
Yuwei An	60bbb800db	[Experimental] Breakable Piecewise Cuda Graph (#22218) Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-24 04:33:05 -07:00
Mick	b3b03369a5	[diffusion] fix: unify LTX-2.3 HQ codepath gates for all LTX-2.3 variants (#23624)	2026-04-24 17:44:08 +08:00
Shangming Cai	b8d883398d	Revert "[Intel GPU] Enable pipeline parallelism on XPU" (#23641)	2026-04-24 17:36:35 +08:00
Hubert Lu	4cb0c4e1f3	[AMD] Fix memory access fault when `--page-size > 1` with speculative decoding on AMD GPUs (#23596)	2026-04-23 23:56:36 -07:00
Mick	cd1fa7506a	[diffusion] model: support LTX2.3 high quality pipeline (#23366)	2026-04-24 14:18:20 +08:00
Shaojun Zhou	59724e90a9	model: support Moss-VL (#23454)	2026-04-24 11:14:29 +08:00
Siju Samuel	bf98eb3ab7	[Intel GPU] Enable pipeline parallelism on XPU (#23472) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2026-04-24 10:41:51 +08:00
popsiclexu	b35213be11	[MUSA][16/N] Add MUSA backend support for layers and DeepSeek models (V2/V3/R1) (#22774) Co-authored-by: popsiclexu <zhenxue.xu@mthreads.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-04-23 18:59:51 -07:00
R0CKSTAR	87e50f20f6	[Apple Silicon][MLX] Cache seq_lens-derived tensors in BatchedDecodeContext (#23470) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2026-04-23 18:12:26 -07:00
Mick	c0166355ae	[diffusion] CI: minor refactor CI (#23576)	2026-04-24 08:48:31 +08:00
Cheng Wan	d9c72bdd2b	Skip unselected experts in flashinfer_trtllm (#23493)	2026-04-23 17:30:19 -07:00
Cheng Wan	000a2525e1	Move expert_mask_gpu from FusedMoE layer to StandardDispatcher (#23585) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:17:27 -07:00
Lianmin Zheng	95d021b523	Pre-set SWA cache location in CudaGraphRunner (#23552)	2026-04-23 16:51:29 -07:00

7965 Commits