sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 12:17:09 +00:00

Author	SHA1	Message	Date
Junhao Liu	051427c0a3	[diffusion] benchmark: add SLO metric in bench_serving (#18907 ) Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-03-08 22:35:57 +08:00
liubiyongge	cc73355a1f	[Feature] Add SLRU eviction policy & fix RadixCache hit_count bug (#18843 ) Co-authored-by: zhangheng <hzh0425@apache.org>	2026-03-08 21:30:55 +08:00
Mick	2c183350be	[diffusion] fix: fix wrong dit config for qwen-image-edit-plus-2511 (#20123 )	2026-03-08 20:08:36 +08:00
Ratish P	ab9de886c5	[diffusion] reduce LayerwiseOffloadManager reserved GPU memory (#20042 )	2026-03-08 19:26:17 +08:00
Liangsheng Yin	29f3a5396e	[Minor] Add `SessionSlot.is_holding_kv` property for readability (#20120 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-08 03:25:13 -07:00
Liangsheng Yin	36b557d2c9	Fix streaming session with paged KV cache (SWA/MLA) (#20070 ) Co-authored-by: Yilong Zhao <74357408+happierpig@users.noreply.github.com> Co-authored-by: Aurick Qiao <6137920+aurickq@users.noreply.github.com>	2026-03-08 03:00:32 -07:00
yuyu5333	230fb55899	[Performance] Decode Offload improves the long texts performance 100% through dynamic block offload. (#17216 ) Co-authored-by: zhangheng <hzh0425@apache.org>	2026-03-08 17:16:53 +08:00
Yuan Luo	97a2a9be0f	[VLM] Replace conv3d proj with linear for GLM4V (#20033 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-03-07 22:50:47 -08:00
Fan Lin	7fb282a96f	[diffusion] fix: fix bug of copy_if (#20094 ) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>	2026-03-08 14:27:58 +08:00
xingsy97	7f9f85d4c8	[diffusion] feat: make QwenImageLayered resolution configurable (#20044 )	2026-03-08 14:26:05 +08:00
Lancer	a73369c39f	[diffusion] chore: ensure CFG Zero Star numerical stability for Helios model (#20091 ) Signed-off-by: Lancer <maruixiang6688@gmail.com>	2026-03-08 14:25:14 +08:00
shuwenn	72f6dfcc31	fix: add ModelScope cache lookup and speculative path support (#20098 )	2026-03-07 22:23:16 -08:00
Liangsheng Yin	d02c515ee8	Decouple scheduler log printing from metrics collection (#20107 )	2026-03-07 22:09:10 -08:00
Baizhou Zhang	d28f35240a	[V32/GLM5] Change default setting of V32 nvfp4 on TP4 (#20086 )	2026-03-07 15:13:25 -08:00
Alison Shao	0f62da6953	[CI] Show test partition assignments after checkout (#20085 ) Co-authored-by: Alison Shao <alisonshao@mac.lan>	2026-03-07 13:50:49 -08:00
VDV1985	45bd30e29d	[NPU] make torch_native lora backend a little bit faster (#17228 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Egor Filimonov <44640852+ssshinigami@users.noreply.github.com> Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-03-07 20:14:46 +03:00
Ke Bao	5867c3fa80	Support HiCache for MambaRadixCache (#19663 ) Co-authored-by: hzh0425 <hzh0425@apache.org>	2026-03-08 00:36:25 +08:00
Bingxu Chen	17721b00fd	[AMD] Fix Tensor Memory Aliasing (#19928 )	2026-03-07 08:06:10 -08:00
Yuan Luo	7da590d4d0	[Qwen3.5] Support Qwen3.5 Pipeline Parallelism (#19670 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-03-07 23:34:08 +08:00
YeChang Guo	13bdc7bf4a	[Feature][NPU]: add runtime support for AutoRound quantized models (#16699 ) Co-authored-by: root <root@localhost.localdomain> Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-03-07 18:03:55 +03:00
Артем Савкин	5297b02c88	[Diffusion] [NPU] Wan2.2-T2V-A14B-Diffusers modelslim quantization support (#17996 ) Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-03-07 17:26:44 +03:00
xingsy97	f8d4eb7022	[Docs] Add docstrings to JIT kernel include headers (#19770 )	2026-03-07 20:48:00 +08:00
Ratish P	ef6540b439	[diffusion]: add width/height passthrough for OpenAI image API (#19970 )	2026-03-07 20:43:46 +08:00
David Wang	19c51fe2fa	fix(rope): restore K writeback in fused rope + kv store kernel (#19636 )	2026-03-07 20:41:35 +08:00
Fan Yin	43d6a32045	[sgl-kernel] rebase FlashMLA 0217 (#18902 ) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2026-03-07 00:30:52 -08:00
danielafrimi	f8bbf56de7	Refactor NemotronHConfig to canonical layers_block_type and add MTP block-type support (#19950 ) Signed-off-by: dafrimi <dafrimi@nvidia.com>	2026-03-06 23:22:03 -08:00
Lancer	b91fb8393e	[diffusion] fix: fix multi-prompt generation and support multiple prompts in cli (#19960 ) Signed-off-by: Lancer <maruixiang6688@gmail.com>	2026-03-07 13:01:59 +08:00
Eitan Turok	31e93e4486	[diffusion] fix: fix TeaCache silently fails with --enable-teacache (#19964 )	2026-03-07 13:00:11 +08:00
Qiaolin Yu	925185f9ec	Fix flashinfer backend with pcg (#20061 )	2026-03-06 20:01:43 -08:00
Feng Su	8a411a9a2a	[Tracing] Remove the deprecated tracing code from mini_lb (#19409 )	2026-03-07 11:19:23 +08:00
Mohammad Miadh Angkad	f88acf8780	[JIT Kernel] Reland NVFP4 kernels to JIT (#20012 )	2026-03-07 10:31:08 +08:00
Yilong Zhao	6ffc74efd7	[Metrics] Add overlap bubble timing, full KV usage gauge, and prefill cuda graph tracking (#19982 )	2026-03-06 17:41:27 -08:00
shubham singhal	a0d085c16d	Adding correct path for module not found error while collecting test (#19778 ) Co-authored-by: sys-lpot-val <sys_lpot_val@intel.com>	2026-03-06 16:26:16 -08:00
R0CKSTAR	e818f8219a	Fix none-comparison (E711) warnings (#19745 ) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2026-03-06 16:15:21 -08:00
R0CKSTAR	0c4f98ed4e	[diffusion] hardware: add set_musa_arch on MUSA (misc, 15/N) (#19381 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-03-06 16:14:41 -08:00
MARATRIX	069d4c577b	Fix Kimi K2.5 PP layer range exposure for PD disaggregation (#19959 ) Signed-off-by: yafeng.li <yafeng.li@mthreads.com>	2026-03-06 16:14:02 -08:00
Liangsheng Yin	ddcecdea49	[Core] Unify `max_num_reqs` `dp_size` division for pool sizing (#20063 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-06 16:12:59 -08:00
Kangyan-Zhou	7a12255b6e	fix: set first_token_time before computing decode_throughput for single-batch completions (#19984 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 16:11:41 -08:00
Aurick Qiao	5c8e28698c	Add cleanup for _ATTN_TP in parallel_state.py (#19978 )	2026-03-06 15:43:31 -08:00
Shu Wang	61de303f0a	Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe (#19189 )	2026-03-06 15:15:04 -08:00
Kangyan-Zhou	e89069ee64	Fallback to torch.cuda.mem_get_info() when nvidia-smi is unavailable (#18957 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 15:00:08 -08:00
Liangsheng Yin	604db4471d	[Core] Clarify memory variable naming in model runner (#20060 )	2026-03-06 14:00:46 -08:00
Liangsheng Yin	7a6cf0e9ba	[Core] Extract `_calculate_mamba_ratio` and `_init_pools` from `init_memory_pool` (#20058 )	2026-03-06 13:37:22 -08:00
Mohammad Miadh Angkad	759700c808	Fix SM120 `triton_kernels` MXFP4 `block_k` for GPT-OSS (#20040 )	2026-03-06 10:53:08 -08:00
R0CKSTAR	de1a0afcbc	[MUSA][10/N] Add GGUF support (#18357 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-03-06 10:50:35 -08:00
JohnHerry	e8f2b80340	[diffusion] improve: improve code readability of DenoisingStage (#20003 )	2026-03-06 23:23:44 +08:00
xingsy97	54634b9a40	[Kernel] Dispatch exp/sin/cos through dtype_trait (#19798 )	2026-03-06 22:57:52 +08:00
Johnsonms	2d266c73ea	Migrate renorm kernels from sgl-kernel to FlashInfer JIT (#18854 )	2026-03-06 22:53:28 +08:00
Xiaoyu Zhang	6d22c9f369	[Diffusion] Move hf kernels diffusion cuda kernels skills to SGLD (#20001 ) Co-authored-by: Mick <mickjagger19@icloud.com>	2026-03-06 22:16:06 +08:00
Yuan Luo	f7de9375ac	[GDN][Qwen3-Next][Qwen3.5] Fuse fused_gdn_gating and fused_recurrent_gated_delta_rule_update in verify_target (#19775 )	2026-03-06 21:42:44 +08:00
7855 Commits