sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 04:08:10 +00:00

Author	SHA1	Message	Date
sglang-bot	46bf19cdab	chore: bump flashinfer version to 0.6.7.post2 (#22097 ) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>	2026-04-04 02:16:25 -07:00
narutolhy	24763256b9	[Speculative Decoding] Add FA4-based Spec Support (#21080 ) Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>	2026-04-04 02:09:45 -07:00
Yuhao Yang	34d5765e2f	[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer (#22038 )	2026-04-04 16:55:17 +08:00
Piotr Mazurek	b5e8c4b9e3	model: support LFM2-VL (Liquid Foundation Model 2 Vision-Language) (#21230 ) Co-authored-by: Piotr Mazurek <piotr.mazurek@liquid.ai>	2026-04-04 16:36:04 +08:00
R0CKSTAR	1fb4bf3558	[diffusion] fix: validate attention backend for Ring Attention in USPAttention (#21828 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-04-04 16:24:02 +08:00
harrisonlimh	9fa12d605a	Add dsv3 router gemm benchmark on blackwell (#17707 )	2026-04-04 01:18:01 -07:00
Xiaoyu Zhang	82ea4906cf	[diffusion] Default NVFP4 to CUTLASS and add all-model shape benchmarks (#22091 )	2026-04-04 16:14:38 +08:00
Ethan (Yusheng) Su	ff8e47edf9	[5/n] Lora support cuda graph (#21647 )	2026-04-04 00:31:46 -07:00
Douglas Yang	a94c3804c2	fix: mistral embedding regression fix (#21913 )	2026-04-04 00:11:51 -07:00
Chi McIsaac	005e582d06	[diffusion] improve: norm fusion for z-image (#18762 ) Signed-off-by: Chi McIsaac <chixie.mcisaac@gmail.com> Co-authored-by: yihanc <yingluosanqian@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com>	2026-04-04 14:01:01 +08:00
Qiaolin Yu	ef13031243	Tiny fix step3.5-flash launch crash (#22076 )	2026-04-03 22:25:25 -07:00
Ziang Li	990c7590b8	[RL] Support mxfp8 DeepSeek V3 (#21280 )	2026-04-03 21:57:45 -07:00
faceless void	de9859073f	Add `--stream-response-default-include-usage` server flag (#16711 )	2026-04-03 21:36:00 -07:00
CHEN Xi	31c9d8e885	[Diffusion] Fix weight scale swizzle and add large-M kernel config for FLUX.2-dev-NVFP4 (#22064 )	2026-04-04 11:50:30 +08:00
Yilong Zhao	fe92f3563c	dp: add profile req hook (#22083 )	2026-04-03 20:47:09 -07:00
Yuxuan Zhang	b7ae3b5a9a	GLM-4.7 and GLM-4.7-Flash Loading and import format (#21851 )	2026-04-03 20:44:08 -07:00
Prozac614	db3d4f4b76	[diffusion] model: support two stage pipeline of LTX-2 (#20707 ) Co-authored-by: daiweitao <dwti614707404@163.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>	2026-04-04 09:37:28 +08:00
Liangsheng Yin	95cdbce34f	[Test] Extract common PD server setup into base fixture (#22080 )	2026-04-03 16:37:12 -07:00
Lawrence Wu	9593d434c4	fix: pause_generation should not populate running_batch on prefill nodes (#20273 )	2026-04-03 16:16:06 -07:00
Sundara Raman Ramachandran	90e86800f4	[Score API] Implement EngineScoreMixin for scoring functionality and refactor Tok… (#21342 )	2026-04-03 15:17:42 -07:00
Baizhou Zhang	ac1e437f6a	Revert "[Feature] JIT activation and update skills (by codex)" (#22078 )	2026-04-03 15:04:15 -07:00
Mohammad Miadh Angkad	8cb337c8ea	[Bugfix] Temporarily skip TRTLLM attention on (G)B300 (SM103) to avoid high-concurrency hang (#21906 ) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2026-04-03 14:19:13 -07:00
Yz Xiao	1d7a53dd03	[Fix] XGrammarGrammarBackend reset to clear inherited cache (#22054 )	2026-04-03 14:17:59 -07:00
sglang-bot	84118acf50	chore: bump sglang-kernel version to 0.4.1 (#22009 ) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>	2026-04-03 13:58:35 -07:00
Lianmin Zheng	eb407b80f3	[Kernel] Make FA3/FA4 imports lazy in FlashAttentionBackend (#22028 )	2026-04-03 13:49:00 -07:00
Brayden Zhong	6aafe756b9	Revert "[Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+… (#22047 )	2026-04-03 13:12:30 -07:00
Shiyan Deng	0c9dc098e7	Fix DP attention worker port binding for IPv6 support (#21917 ) Signed-off-by: Shiyan Deng <dsy842974287@meta.com>	2026-04-03 12:39:39 -07:00
Zhangheng	ed3435e37f	[HiSparse]: Optimize server args checking-HiSparse is temporarily only available for DSA models. (#22065 )	2026-04-04 02:23:56 +08:00
Mick	151f727163	[diffusion] fix: fix gated repo failing the generate cmd (#22040 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-04 00:43:11 +08:00
DarkSharpness	44e5d35703	[Feature][JIT Kernel] JIT activation and update skills (by codex) (#21766 ) Co-authored-by: weiminc <tnwilly@gmail.com>	2026-04-03 23:28:54 +08:00
Mick	030fb1c4b1	refactor: replace mm_inputs dict with MultimodalProcessorOutput (#21738 )	2026-04-03 23:26:37 +08:00
Ke Bao	9f409d0749	[CI] Adjust CI server launch timeout (#22045 )	2026-04-03 22:38:07 +08:00
Xiaoyu Zhang	ee9d922f5a	Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup" (#22046 )	2026-04-03 21:32:08 +08:00
Kangyan-Zhou	56ac9c9932	[Fix] Add _MOE_TP to graph_capture for MoE models with ep>1 (#21907 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2026-04-03 02:33:16 -07:00
Khoa Pham	cd75d54fc5	[Bugfix] Fix CUDA graph replay issues in trtllm_mla draft_extend (#21987 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 01:45:13 -07:00
shuwenn	4f84ce5807	[CI] ci: add test_http_server_auth.py to CI (#21866 )	2026-04-03 16:32:18 +08:00
Thomas Wang	7431db7392	[AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (#21511 )	2026-04-03 00:58:23 -07:00
Kelon	ad0516d9c1	[NPU] optimize glm4.7 (#19246 )	2026-04-03 15:44:07 +08:00
Shangming Cai	d82097a0df	[PD] Tiny register info field cleanup for mooncake backend (#22016 )	2026-04-03 15:13:44 +08:00
Ricardo-M-L	24f52e66d3	fix: remove duplicate words in comments (#22007 )	2026-04-03 00:05:39 -07:00
Yuzhen Zhou	6b876a7710	[ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (#21825 )	2026-04-02 23:43:55 -07:00
Zhangheng	4d097047f2	[PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. (#21591 ) Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2026-04-02 23:06:12 -07:00
kk	5bcbc9757c	[AMD] Resolve the performance degression when launch server with "--enable-aiter-allreduce-fusion" (#21947 ) Co-authored-by: wunhuang <wunhuang@amd.com>	2026-04-02 22:10:24 -07:00
DarkSharpness	d1b7c3907d	[Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (#20871 )	2026-04-03 12:33:17 +08:00
Baizhou Zhang	efa7b2d5d3	Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (#22002 )	2026-04-02 20:42:13 -07:00
lviy	5f0df1e2ad	[Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (#21519 )	2026-04-02 20:13:53 -07:00
Yuhao Yang	69e89a1fcc	[VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (#21899 )	2026-04-03 11:04:41 +08:00
narutolhy	8897ac58f0	[PP] qwen3 vl skip layer id for pp (#19135 )	2026-04-03 10:51:53 +08:00
Mook	991f3aa5b3	[Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) (#19652 )	2026-04-03 10:48:15 +08:00
Khoa Pham	2b5aed94f5	Remove maxItems=1 restriction when tool_choice is specified (#20208 )	2026-04-03 02:35:24 +00:00

7489 Commits