sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-06-30 19:57:52 +00:00

Author	SHA1	Message	Date
sushil Dubey	e26c73c4e9	[diffusion] platform: support Intel XPU (#17920 ) Signed-off-by: sushil.dubey <sushil.dubey@intel.com> Signed-off-by: Sushil Dubey <sushil.dubey@intel.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-04-11 15:09:02 +08:00
YC Yen-Ching Tseng	cf1436d6ae	[AMD] Diffusion - Enable rocm miopen tuning on vae (#22428 )	2026-04-10 22:47:25 -07:00
Jacob0226	7e4e1dcd7a	[AMD] Fuse RMSNorm + FP8 per-token quant for GLM-4.7-FP8 (#21403 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-10 22:45:31 -07:00
Khoa Pham	aeeff58cd4	[Spec][Ngram] Clean up unused stateless `batchMatch` (#22487 )	2026-04-10 21:52:56 -07:00
Khoa Pham	04bd8e1218	[Spec][Ngram] Return token counts in list_external_corpora API (#22471 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:50:02 -07:00
Zhangheng	f2af00d05a	[HiSparse-pd] Add device-buffer budget and fix logical pool admission in decode side (#22453 )	2026-04-11 12:30:38 +08:00
Alex Nails	8eac618a8d	[tokenizer] lazy text accumulation + use deltas directly for streaming (#22548 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 21:26:04 -07:00
Liangsheng Yin	c7f93a2ce7	[metrics] Add `PoolStats.update_scheduler_stats` to deduplicate metrics assignment (#22559 )	2026-04-10 21:04:18 -07:00
Bi Xue	d30b3efa84	[sgl] _ATTN_TP and _ATTN_CP use message queue for broadcast on CPU (#22205 )	2026-04-10 20:52:49 -07:00
Xinyuan Tong	7c6db40540	Fix tool call constrained decoding and parsing for models with native formats (#21593 )	2026-04-10 20:37:23 -07:00
Liangsheng Yin	c2821dfbe9	[mem] Introduce PoolStats dataclass; unify pool metrics and token_usage (#22554 )	2026-04-10 20:35:50 -07:00
Yuhao Yang	16f306fd85	[VLM] GPU Image Preprocessing for Kimi-K2.5 (#22368 )	2026-04-11 11:13:30 +08:00
Yilong Zhao	58f863956c	cuda graph: adjust capture time num-non-padded-tokens to align capture with replay (#22404 )	2026-04-11 10:27:50 +08:00
Mick	0b4f5c9fcb	[diffusion] CI: improve readability and fix bug of early-return (#22507 )	2026-04-11 10:08:44 +08:00
Alison Shao	75223c5404	[Diffusion][CI] Fix nunchaku unit test broken by #22365 (#22560 ) Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>	2026-04-10 17:49:56 -07:00
Liangsheng Yin	b4a1d8fd71	[mem] Fix idle token_usage missing mamba_usage; add FIXME for naming (#22555 )	2026-04-10 16:20:33 -07:00
Alex Nails	0af9166474	[tokenizer] improve non streaming request processing + some small fixes. (#20310 )	2026-04-10 15:46:12 -07:00
ori	f7a1740101	[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (#22051 ) Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>	2026-04-10 14:18:39 -07:00
Minglei Zhu	6af34b95b6	perf: precompute FA3 scheduler_metadata to eliminate per-layer prepare_varlen_num_blocks (#21104 ) Co-authored-by: zminglei <zminglei@linkedin.com>	2026-04-10 13:57:54 -07:00
Zhongdongming Dai	4ace144fae	feat: update ModelExpress metadata API to SourceIdentity-based schema (#21222 )	2026-04-10 13:45:05 -07:00
Cheng Wan	6d95602ea3	Reduce GPU memory for MoE parallel groups (#22515 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-10 13:23:23 -07:00
satyamk7054	059b287e25	Add offline auto-tuning for LoRA CSGMV kernel (#20391 ) Co-authored-by: Satyam Kumar <satyamk@linkedin.com>	2026-04-10 13:10:43 -07:00
Qiaolin Yu	d8831355a3	Fix multi_layer_eagle_worker_v2 draft extend selection, add chain style multi layer mtp test (#22340 ) Co-authored-by: 0xNullPath <luyan@nvidia.com>	2026-04-10 12:44:52 -07:00
Trevor Morris	7dbd0dd9f0	MiniMax-M2.5 - Support dp attention, dp reduce scatter, FP4 all gather, AR fusion in prepare_attn (#20067 )	2026-04-10 12:41:27 -07:00
KrishnanPrash	a937ec31be	fix: server crash when stop_token_ids contains null (#22175 ) Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>	2026-04-10 11:42:23 -07:00
Jia Guo	5cb4ea1d4d	perf: enable inductor combo_kernels for horizontal fusion (#21977 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-11 01:01:14 +08:00
Tarushii Goel	2ba94136ce	[sgl] improve mamba_track_indices perf in specdec (#22380 )	2026-04-11 00:39:53 +08:00
Bi Xue	f652135d52	[sgl] fix using symmetric memory issues for attention_tp (#22286 )	2026-04-11 00:26:18 +08:00
Ratish P	8227187d47	[SKILL]: add component accuracy guidance to the diffusion add-model skill (#22460 )	2026-04-10 23:08:31 +08:00
Ratish P	cf5ad12612	[diffusion][CI]: route multimodal component accuracy through run_suite (#21960 )	2026-04-10 23:06:03 +08:00
kingkingleeljj	84194c25c1	[BugFix] fix the bug of minimax_m2.5 model that causes repeated outputs when using tp16 (#20967 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-10 22:21:19 +08:00
Xiaoyu Zhang	1ff51555f2	[Diffusion] modelopt diffusion fp8 support for flux1/flux2 and wan2.2 (#22365 )	2026-04-10 20:56:57 +08:00
Yujun Dong	8ba9646044	Make GDN support non-continuous B/A Tensor input to fix the accuracy regression of Qwen3.5-27B (#22312 ) Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>	2026-04-10 18:58:13 +08:00
Jincong Chen	0668a7f51a	[Perf] Remove two operations in gdn_backend extend verify path (#22444 )	2026-04-10 17:53:57 +08:00
Shangming Cai	1c76f322df	[HiCache] Add CP support for HiCache (#20977 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-04-10 17:52:51 +08:00
Cheng Wan	37107bee6f	[Observability] Add pending token count to prefill log and get_load (#22480 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-10 02:05:21 -07:00
Lee Nau	c554dc5c64	Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE (#21339 )	2026-04-10 01:35:56 -07:00
Mick	7c6b5c095c	[diffusion] fix: fix flux2 i2i accuracy (#22423 )	2026-04-10 16:16:51 +08:00
Liangsheng Yin	6cf7f210bf	Add page_size to admission token budget check (#22495 )	2026-04-10 01:16:04 -07:00
Jacob0226	dd41764487	[AMD][HIP] NSA: bf16 passthrough from RMSNorm to eliminate FP8 dequantization (#22258 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-10 01:08:32 -07:00
jianan-gu	2ab141547d	[CPU] Add apply_routed_scaling_factor_on_output support for biased_grouped_topk fusion (#22413 ) Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-04-10 15:16:05 +08:00
Polisetty V R K Jyothendra Varma	599cce4d82	[Intel GPU] import flash_attn functions from sgl_kernel only (#22438 )	2026-04-10 15:10:00 +08:00
xieminghe1	18f41ac427	[Reland] DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (#22316 ) Co-authored-by: undefined <zhouchen.arrebol@jd.com> Co-authored-by: xq25478 <xq25478@qq.com>	2026-04-10 14:56:05 +08:00
Tarushii Goel	0334d4b7e8	[sgl] Fix mamba tracking calculation in spec dec (#22239 )	2026-04-10 14:46:16 +08:00
Ethan (Yusheng) Su	6d79c60995	[Lora] Lora kimi support (#22381 )	2026-04-09 22:31:53 -07:00
Liangsheng Yin	722e25a621	Fix SWA eviction boundary and page-align chunked prefill (#22470 )	2026-04-09 22:09:43 -07:00
Ke Bao	e77bfba24d	Fix NCCL AllGather hanging issue for Qwen3 Next MTP (#22458 )	2026-04-10 11:40:54 +08:00
Kangyan-Zhou	89553ff82b	[Observability] Add Prometheus metrics endpoint for gRPC mode (#20801 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 20:04:54 -07:00
LHXuuu	42ffb168b3	[EPD][VLM] Support Kimi K25 EPD (#22269 ) Signed-off-by: LHXuuu <xulianhao.xlh@antgroup.com>	2026-04-10 10:58:42 +08:00
jacky.cheng	d283808457	[AMD] Replace triton rotary_emb with aiter rotary_emb for Wan2.2 denoise (#22422 )	2026-04-09 18:21:02 -07:00

7855 Commits