sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-06-30 19:57:52 +00:00

Author	SHA1	Message	Date
Liangsheng Yin	f9792166c3	trim_overshoot: cap swa_evicted_seqlen + unit test (#22900)	2026-04-15 15:05:35 -07:00
Xinyu Zhang	13a2cd748d	[Ray] Add data parallel (DP) and DP attention support to RayEngine (#21887) Co-authored-by: xyuzh <xyuzh@users.noreply.github.com>	2026-04-15 15:00:48 -07:00
Sundara Raman Ramachandran	4927975427	[Score API] Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel (#22427)	2026-04-15 14:58:56 -07:00
Lee Nau	4e480d5785	Harden FlashInfer FP4 imports in standard dispatcher (#21776)	2026-04-15 14:54:49 -07:00
Liangsheng Yin	efc267ca29	streaming session: trim spec v2 overshoot in cache_finished_req (#22897)	2026-04-15 14:15:46 -07:00
Lianmin Zheng	43925d179d	[Speculative] Fix Eagle3/DFLASH aux hidden state capture during CUDA graph init (#22836)	2026-04-15 14:04:54 -07:00
Kurt Shuster	32d9fe5a32	[lora] Speedup triton backend `sgemm` calls with better grid (#22386)	2026-04-15 13:47:07 -07:00
Jimmy Shong	28e915b474	[Bugfix] Preserve auto-detected quant_config for GLM NextN draft model (#22823)	2026-04-15 13:25:36 -07:00
Yuhao Yang	8686f42acb	[VLM] Enable per-image ViT cache and avoid TP CUDA context creation for Kimi-K2.5 (#22858)	2026-04-16 01:14:24 +08:00
huangtingwei	7d7fdc1309	[HiCache]Fix CP support for hybrid model (#22782) Co-authored-by: hzh0425 <hzh0425@apache.org>	2026-04-15 23:50:29 +08:00
Xiaoyu Zhang	695ab705cb	[diffusion] quant: update modelopt quantization docs and CI coverage (#22772)	2026-04-15 21:30:28 +08:00
Mick	80718492dd	[diffusion] CI: reset thresholds (#22854)	2026-04-15 21:11:00 +08:00
Zhangheng	0a5c9728a1	[HiSparse][BugFix]: Fix the memory leak issue during health checks. (#22882)	2026-04-15 19:49:54 +08:00
Liangsheng Yin	ce31934ca8	Streaming session: fix retract tail leak via _free_tail (#22862)	2026-04-15 01:44:27 -07:00
huangtingwei	3511c2deb4	[HiCache] Fix memory host free logic when share_indices_with_anchor enabled (#22767) Co-authored-by: hzh0425 <hzh0425@apache.org>	2026-04-15 16:31:18 +08:00
Liangsheng Yin	aa78564e1a	Refactor streaming session abort handling (#22790)	2026-04-15 00:13:05 -07:00
Hubert Lu	b2af34be54	[AMD] Optimize _append_shared_to_topk_output by a single fused Triton kernel for Qwen3.5 (#22844) Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-04-14 23:50:32 -07:00
Mick	e95c2e73bd	[diffusion] CI: refactor diffusion ci and reduce redundancy (#22810)	2026-04-15 10:12:29 +08:00
Kangrui Du	47ac830c07	[diffusion] rl: support standalone rollout api, denoising environment backpass and sp-aligned log-prob for T2I post-training (#22604) Co-authored-by: MikukuOvO <mikukuovo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 10:10:38 +08:00
Lianmin Zheng	adb310b976	Cleanup server_args.py and minor code tidying (#22820)	2026-04-14 18:52:41 -07:00
Chen, Zhentao	ea05ea5abe	[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 (#20736) Co-authored-by: Chen, Todd <zhenchen@amd.com> Co-authored-by: jacky.cheng <yichiche@amd.com>	2026-04-14 18:52:36 -07:00
Piotr Mazurek	46c8a597ef	[VLM] fix LFM2-VL offline inference and GPU JPEG decode (#22448)	2026-04-15 09:13:25 +08:00
Alex Nails	8092431316	[serving] replace O(n²) stream_buffer string concat with integer offset (#22606) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-04-14 14:48:44 -07:00
Liangsheng Yin	36891ab514	Rename _alive_streaming_session_count; use _is_streaming helper (#22755)	2026-04-14 13:26:03 -07:00
Liangsheng Yin	0cb7295698	Fix streaming session busy-check double-counting via active_pool_idxs (#22753)	2026-04-14 13:11:06 -07:00
mingyue300	b4616dcbf5	[BugFix] Fix EAGLE speculative decoding missing grammar-based finish … (#21723)	2026-04-14 12:43:50 -07:00
Mick	d2f479e544	[diffusion] chore: auto-enable best parallel setting if unspecified (#22763) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 00:02:05 +08:00
Bi Xue	070c6a2489	[sgl] perf optimization for eplb (#21232)	2026-04-14 22:52:17 +08:00
Mick	c5e95080d2	[diffusion] model: support Ltx 2.3 two stage ti2v (#22667)	2026-04-14 22:10:08 +08:00
lawtherWu	454228e071	hicache storage backend mooncake support ascend hixl (#20016)	2026-04-14 20:51:06 +08:00
Jia Guo	6da3aba6a5	perf: optimize PCG inductor path for FP8 models (#21734) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:51:27 +08:00
xutizhou	3cb3f7c018	fix: EPLB dispatch OOB when shared experts fusion enabled under DeepEP (#22525)	2026-04-14 02:33:27 -07:00
Jincong Chen	6760c790bd	[bugfix] avoid attention padding tokens computation in pcg (#17706)	2026-04-14 16:08:23 +08:00
Michael	eab045b2b7	[AMD] Add MiniMax-M2.7 accuracy and performance nightly tests (#22722) Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-04-14 00:30:11 -07:00
xiaobochen-amd	d7ecab5113	[ROCm]fix(aiter): cast fp8 prefill output back to model dtype (#22626) Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>	2026-04-14 00:25:09 -07:00
Xiaoyu Zhang	f97c608caa	[diffusion] quant: add FLUX.1-dev modelopt nvfp4 support (#22672)	2026-04-14 15:00:59 +08:00
Colin Z	b10f852118	GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix (#22543) Co-authored-by: HAI <hixiao@gmail.com>	2026-04-13 23:56:48 -07:00
YAMY	657945c338	Replace all-reduce + dp_scatter with reduce_scatterv for DP attention (#22642)	2026-04-13 21:51:10 -07:00
ishandhanani	520ce526b9	Restore Qwen3 rope config fallback (#22739)	2026-04-13 21:47:37 -07:00
Xuwei	a9a2ae4a68	[Anthropic] Fix clock mismatch in received_time causing negative Prometheus metrics (#22247) Signed-off-by: Xuwei Li <lixuwei.xy@gmail.com>	2026-04-13 21:22:00 -07:00
huangtingwei	e9d6b9eb2d	[HiCache & HybridModel] mooncake backend support DSA & mamba model (#21259) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: hzh0425 <hzh0425@apache.org> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>	2026-04-13 18:47:36 -07:00
ishandhanani	cc449ac4e5	feat(metrics): expose raw KV cache pool token counts as prometheus gauges (#22726)	2026-04-13 18:30:36 -07:00
huangtingwei	945d73824f	[HiSparse] Clarify decode token usage logs (#22331)	2026-04-13 18:03:25 -07:00
yuki-brook	1ec018f27a	[Feature] Add SiMM as sglang HiCache Storage backend (#18016)	2026-04-13 17:12:37 -07:00
Liangsheng Yin	33a3ba256f	Delete dead rematch path in SessionAwareCache.release_session (#22735)	2026-04-13 17:02:40 -07:00
Lianmin Zheng	9fb00ede15	Clean up TokenizerManager and req_time_stats: reduce overhead and simplify (#21646)	2026-04-13 16:47:32 -07:00
Jia Guo	a2b5111962	perf: skip KV cache in FA backend for embedding mode (#21971) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 16:27:52 -07:00
Lianmin Zheng	8f9553bccb	[Misc] Migrate SGLANG_SET_CPU_AFFINITY to envs and refactor model config building (#22730)	2026-04-13 16:10:31 -07:00
mqhc2020	f4f9e68189	[AMD] Add MoE weights and scales padding (#21097) Co-authored-by: HAI <hixiao@gmail.com>	2026-04-13 15:50:15 -07:00
Yilong Zhao	b1efce342c	env: add knob to control SWA eviction interval (#22645)	2026-04-13 15:37:59 -07:00

7855 Commits