sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 04:08:10 +00:00

Author	SHA1	Message	Date
Xuwei	a9a2ae4a68	[Anthropic] Fix clock mismatch in received_time causing negative Prometheus metrics (#22247) Signed-off-by: Xuwei Li <lixuwei.xy@gmail.com>	2026-04-13 21:22:00 -07:00
huangtingwei	e9d6b9eb2d	[HiCache & HybridModel] mooncake backend support DSA & mamba model (#21259) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: hzh0425 <hzh0425@apache.org> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>	2026-04-13 18:47:36 -07:00
ishandhanani	cc449ac4e5	feat(metrics): expose raw KV cache pool token counts as prometheus gauges (#22726)	2026-04-13 18:30:36 -07:00
huangtingwei	945d73824f	[HiSparse] Clarify decode token usage logs (#22331)	2026-04-13 18:03:25 -07:00
yuki-brook	1ec018f27a	[Feature] Add SiMM as sglang HiCache Storage backend (#18016)	2026-04-13 17:12:37 -07:00
Liangsheng Yin	33a3ba256f	Delete dead rematch path in SessionAwareCache.release_session (#22735)	2026-04-13 17:02:40 -07:00
Lianmin Zheng	9fb00ede15	Clean up TokenizerManager and req_time_stats: reduce overhead and simplify (#21646)	2026-04-13 16:47:32 -07:00
Jia Guo	a2b5111962	perf: skip KV cache in FA backend for embedding mode (#21971) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-13 16:27:52 -07:00
Lianmin Zheng	8f9553bccb	[Misc] Migrate SGLANG_SET_CPU_AFFINITY to envs and refactor model config building (#22730)	2026-04-13 16:10:31 -07:00
mqhc2020	f4f9e68189	[AMD] Add MoE weights and scales padding (#21097) Co-authored-by: HAI <hixiao@gmail.com>	2026-04-13 15:50:15 -07:00
Yilong Zhao	b1efce342c	env: add knob to control SWA eviction interval (#22645)	2026-04-13 15:37:59 -07:00
Lianmin Zheng	f81b6e8f51	[Misc] Add @cache_once to is_arch_support_pdl in jit_kernel (#22724)	2026-04-13 14:42:49 -07:00
Baizhou Zhang	b441317aa4	Revert "Upgrade CI default CUDA version from 12.9 to 13.0" (#22727)	2026-04-13 14:39:24 -07:00
Lianmin Zheng	ba7bcca6b3	Use reshape instead of contiguous().view() in TRTLLMHAAttnBackend (#22517)	2026-04-13 14:29:12 -07:00
Kurt Shuster	ff13dfee45	[lora][moe] Virtual experts for LoRA MoE (#22122) Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>	2026-04-13 21:19:30 +00:00
ishandhanani	6b2bf66cd9	fix[glm4.7 flash]: properly detect `gfx95_quant_format` (#22720)	2026-04-13 13:10:07 -07:00
Asish Kumar	39810762d2	fix: use describe mode for SGLang version detection (#22600) Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>	2026-04-13 09:45:45 -07:00
DarkSharpness	314d6ecf08	[Feature][JIT Kernel] Fused TP QK norm For Minimax (#20673) Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2026-04-13 20:29:47 +08:00
Xiaole Guo	4df60434d7	[diffusion] model: support stable-diffusion-3-medium-diffusers (#19225) Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: Kangrui Du <kangruidu@gmail.com> Co-authored-by: Xiaole Guo <gxlvera@gmail.com>	2026-04-13 16:07:06 +08:00
Chandrakant Khandelwal	1e9eecfa36	[Intel GPU] Enable sgl-kernel-xpu fused_experts MoE kernel path for GPT-OSS bf16 models. (#22417)	2026-04-13 13:45:48 +08:00
Mick	d524f110ac	[diffusion] refactor: streamline denoising stages (#22633)	2026-04-13 13:34:37 +08:00
Polisetty V R K Jyothendra Varma	7d2c11970c	[Intel GPU] Upgrade pytorch xpu version to 2.11 (#21908) Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-04-13 13:16:24 +08:00
Zhangheng	5549d910c6	[hisparse]: Adding ci for hisparse kvcache-swap-in jit-kernel (#22155)	2026-04-13 12:50:29 +08:00
Zhangheng	305b42935a	[HiSparse]: Add benchmark for hisparse kernel (#22187)	2026-04-13 12:49:18 +08:00
Alison Shao	3f4fbc165d	Upgrade CI default CUDA version from 12.9 to 13.0 (#21441)	2026-04-12 21:48:40 -07:00
Mohammad Miadh Angkad	4dbd59850b	Add bfloat16 KV cache validation for HiSparse (#22505)	2026-04-13 12:41:42 +08:00
Xiaoyu Zhang	fae0a2fc3c	[codex] Add LTX-2.3 benchmark skill recipes (#22631)	2026-04-13 12:23:32 +08:00
Mick	bf022e177c	Revert "[Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support (#22574)" (#22649)	2026-04-13 11:17:32 +08:00
Zhangheng	bc59cc0f96	[RaidxTree Refactor]: Support Unified HybridRadixTree V2 (#21206) Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: yizhang2077 <1109276519@qq.com> Co-authored-by: xiezhq-hermann <xiezhq@stanford.edu>	2026-04-13 10:28:22 +08:00
Ziang Li	5593539942	[RL] Refactor NVFP4 shuffling/swizzling to in-place replacement (#22204)	2026-04-12 19:08:45 -07:00
blzheng	934e19a610	[CPU] Fix argument issues in qkv_proj_with_rope_fused_weight and bmm… (#21367) Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-04-13 09:59:13 +08:00
Liangsheng Yin	da6b8e1448	Extract pause_resume_in_place kit; rename test_abort to test_scheduler_control (#22647)	2026-04-12 18:49:37 -07:00
Lawrence Wu	28e40d873c	fix(PD): respect pause_generation in disagg event loops (#20908)	2026-04-12 18:07:51 -07:00
ishandhanani	c1ab68b45e	fix: streaming session race condition + some metrics (#21875) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2026-04-12 18:05:23 -07:00
Xiaoyu Zhang	37fc47c645	diffusion: fix layerwise offload for ModelOpt quantized DiTs (#22594)	2026-04-13 08:01:54 +08:00
Xiaoyu Zhang	03a1a7b81c	[Diffusion] Add FLUX.1-dev ModelOpt NVFP4 support (#22574)	2026-04-13 07:57:41 +08:00
Kurt Shuster	f81b6df3a3	[lora] Fix partial MoE rank loading, VL lm_head, strict loading, deepseek on-demand (#21864) Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>	2026-04-12 16:25:02 -07:00
Khoa Pham	1f8df97054	Fix broken streaming response with --incremental-streaming-output (#22549) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 15:05:58 -07:00
Zhiyu	d4ad30b94c	[diffusion] quant: enable modelopt quantized FLUX deployment (#20082) Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:35:33 +08:00
Mick	495ef8ec64	[diffusion] model: support LTX2.3 two stage (#22182)	2026-04-12 22:15:57 +08:00
Ziang Li	31453bb76a	[RL] Fix weight update for mxfp8 flashinfer_cutlass gemm backend (#22484)	2026-04-12 13:02:17 +00:00
Mohammad Miadh Angkad	bcc0c65aa8	[DSA] Hopper FP8 FlashMLA KV padding (#22372)	2026-04-12 02:19:17 -07:00
Kurt Shuster	0e0091c6c8	[server] Add --quantization unquant to explicitly opt out of quantization (#21863)	2026-04-12 02:17:22 -07:00
Wenyao Gao	4dfc8e1c3f	VLM: support passing --mm-process-config for all models (#18467)	2026-04-12 17:08:05 +08:00
Liangsheng Yin	f1eb4ca90c	Fix streaming session busy check double-counting; add compat CI tests (#22213)	2026-04-12 01:48:16 -07:00
Ke Bao	bc1bfbf607	Fix swa input length limitation (#22597)	2026-04-12 16:03:35 +08:00
Liangsheng Yin	f2377a00cb	Add SWA support for runtime busy memory check (#21499)	2026-04-12 00:39:51 -07:00
wufann	19cb918653	[Not-Merge][AMD] GLM-5 performance optimization (#21166)	2026-04-11 23:58:11 -07:00
Hubert Lu	edaa5973d4	[AMD][No-Merge] Simplify fused allreduce + RMSNorm and remove hidden_dim allowlist (#21986) Co-authored-by: HAI <hixiao@gmail.com>	2026-04-11 23:47:08 -07:00
Xinyuan Tong	9a4e8089ff	[Whisper] Batch encoder forward for concurrent prefill requests (#22361)	2026-04-12 14:15:14 +08:00

7716 Commits