sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 12:17:09 +00:00

Author	SHA1	Message	Date
LiYomi	1d6424d5ad	fix: Mistral Small 4 fails to start due to config/weight format mismatch (#21620 ) Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 01:57:35 -07:00
strgrb	b246269444	fix mamba cache leak when adder fails to add a matched req. (#21404 )	2026-03-30 16:45:49 +08:00
Baizhou Zhang	62a63eeff7	[Fix] Fix weight_loader property assignment for qwen3-next FP8 models (#21662 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 01:35:59 -07:00
Hubert Lu	e6071e60c0	[AMD] Support AMD MXFP4 Qwen3.5-397B-A17B model (#21234 )	2026-03-30 01:14:18 -07:00
kk	b9a68c304e	[AMD] Fused rope kv store (#21315 ) Co-authored-by: wunhuang <wunhuang@amd.com>	2026-03-30 00:05:41 -07:00
blzheng	ed01e1d5d6	[CPU] add kernel apply_rotary_pos_emb_cpu for Qwen3-VL and Qwen3-Omni (#13121 ) Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-03-29 23:43:46 -07:00
Aishwarya Ramasethu	c32ee48886	MFU metrics in Prometheus (#19395 )	2026-03-29 23:40:06 -07:00
Polisetty V R K Jyothendra Varma	f0303fd07e	[Intel GPU] Enable DeepSeek R1 inference on XPU (#18461 ) Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>	2026-03-29 22:35:59 -07:00
Feng Su	9b4dd27478	[Fix] Fix Qwen3.5 MoE model loading and Mamba cache sharding in PP mode (#21448 ) Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>	2026-03-30 11:57:26 +08:00
Liangsheng Yin	c06ca1526c	Fix circular reference in CustomTestCase.__init_subclass__ (#21650 ) Co-authored-by: wan4ch <wan4ch@gmail.com>	2026-03-29 20:38:12 -07:00
Lianmin Zheng	9f7792415a	Clean up TokenizerManager: remove dead code and improve rid validation (#21639 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 15:12:49 -07:00
Lianmin Zheng	f3970b17ef	[Cleanup] Remove unused BatchMultimodalOutput and BatchMultimodalDecodeReq (#21640 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 14:54:25 -07:00
Lianmin Zheng	1d9c8e8c9e	Simplify routed experts test and move base64 encoding to tokenizer manager (#21634 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 12:44:01 -07:00
Mohammad Miadh Angkad	2acdda1d85	[Fix] Remove redundant allreduce fusion block and skip TP=1 (#20621 )	2026-03-29 12:30:40 -07:00
wili	bda94fc779	[Fix] SGLANG_USE_CUDA_IPC_TRANSPORT=1 and SGLANG_ENABLE_MM_SPLITTING=1 do not work at the same time. (#19915 )	2026-03-30 01:15:26 +08:00
saatwiknagpal	d2440dcf58	[VLM] perf: optimize CUDA IPC for multimodal transfer by caching IPC pool handles (#21418 )	2026-03-30 00:20:38 +08:00
wili	5bb9ca0e63	[Feature] Optimizations for JPEG input on NVIDIA GPU (#19749 )	2026-03-30 00:06:14 +08:00
Bi Xue	42c46e6334	[sgl] disable piecewise cuda graph when a model doesn't have layers (#21565 )	2026-03-29 23:04:20 +08:00
Hanlin Bi	aa9177152e	fix cuda graph capturing error in sm120 mxfp8 triton path (#19835 )	2026-03-29 01:59:24 -07:00
Liangsheng Yin	fec9961a1f	Clean up _wait_for_scheduler_ready implementation (#21626 )	2026-03-29 01:02:33 -07:00
psaab	d2fa8d67ba	Wrap IPv6 addresses in gRPC, bench_serving, and log messages (#21236 ) Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2026-03-29 00:36:31 -07:00
shuwenn	18074e25dc	fix: scheduler launch hang when non-current rank dies (#20287 )	2026-03-29 00:28:45 -07:00
Simon (Jiyou) Li	22e4733ab9	Add subprocess liveness monitor to detect scheduler crashes (#18582 ) Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com>	2026-03-29 00:09:13 -07:00
Kangyan-Zhou	9d64a82173	feat(ci): add GB300 nightly benchmark test suites (#21487 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:54:03 -07:00
Lianmin Zheng	ba6b501f3a	Clean up detokenizer and remove dead multimodal_gen code (#21588 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:44:40 -07:00
Xiaoyu Zhang	516cff97a3	[Diffusion] Align diffusion benchmark skill presets with nightly comparison cases (#21616 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-29 12:12:17 +08:00
Yuan Luo	343a7ac652	[GDN] Fuse GDN kkt + solve_tril into one kernel (#21411 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-03-29 12:02:07 +08:00
jacky.cheng	c86f6c2831	[AMD] Add peft>=0.18.0 to diffusion_hip deps for transformers 5.x compat for AMD diffusion model (#21442 ) Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-03-28 20:28:05 -07:00
Yuhao Yang	4e69f14b95	fix bench_serving sglang backend to support image dataset (#21294 )	2026-03-29 10:02:11 +08:00
eigen	3ab9afd653	fix: piecewise_cuda_graph get correct qo_indptr (#21452 ) Co-authored-by: Avery Huang <averyh@nvidia.com>	2026-03-28 15:57:29 -07:00
Shu Wang	efebcab43e	Support skip-softmax attention (#19089 )	2026-03-28 15:55:48 -07:00
Xinyuan Tong	ced69c9f84	feat: enable CUDA graph and timestamp for the whisper model(#21190 )	2026-03-29 01:46:03 +08:00
Yuhao Yang	57cf4790ca	[VLM] Optimize ShmPointerMMData for multi-pickle safety and deferred unwrap (#21465 )	2026-03-28 23:11:12 +08:00
Mick	fc9de157f9	[diffusion] feat: support overlay model materialization (#21600 )	2026-03-28 23:02:38 +08:00
Aditya Sharma	627e162335	[diffusion] fix: fix Flux2-Klein prompt tokenization length to 512 and add regression coverage (#21407 )	2026-03-28 17:28:02 +08:00
Baizhou Zhang	edd4d54023	[Clean] Remove deprecated environs (#21536 )	2026-03-28 00:35:44 -07:00
Liangsheng Yin	402628e560	Patch transformers is_base_mistral in CI to avoid HF 429 rate limiting (#21586 )	2026-03-27 22:19:36 -07:00
Jianying	daf02bde33	Fix Piecewise CUDA Graph crash with `-enable-mixed-chunk` (#20441 ) Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>	2026-03-27 21:56:21 -07:00
Liangsheng Yin	19b1f75186	Fix HFRunner hang when subprocess dies during init (#21582 )	2026-03-27 21:22:42 -07:00
Yuhao Yang	5ef56682b8	reduce CPU peak memory in multimodal tensor hashing (#21123 )	2026-03-28 11:09:16 +08:00
Fengyuan Yu	9fa7b974fd	[diffusion] chore: remove redundant identity preprocess_text functions(#20633 ) Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>	2026-03-28 10:07:30 +08:00
Eitan Turok	e570ca96f6	[diffusion] refactor: Unify `TeaCacheParams` and `WanTeaCacheParams` (#20706 ) Co-authored-by: Mick <mickjagger19@icloud.com>	2026-03-28 09:51:44 +08:00
Mick	f0c68fbefd	[diffusion] UX: aggregate expected dtype-cast logs during weight loading (#21552 )	2026-03-28 09:50:40 +08:00
Trevor Morris	7160b6cb76	[NVIDIA] Enable automatic NUMA configuration (#19452 )	2026-03-27 18:44:13 -07:00
Vladislav Nosivskoy	c37200f5e4	Scope streaming backlog coalescing to incremental_streaming_output mode (#21037 ) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2026-03-27 17:29:54 -07:00
Qiaolin Yu	a27651d5e0	Remove sync when enabling return_logprob (#20972 )	2026-03-27 16:36:28 -07:00
Ethan (Yusheng) Su	6d48719e31	[1/n] lora support - Auto detect lora target modules (#21439 ) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2026-03-27 16:08:36 -07:00
narutolhy	9b29131961	fix tp capture in vit cuda graph (#17255 )	2026-03-27 22:38:18 +00:00
Muqi Li	38ad251738	feat: add gc_threshold arg (#21481 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-27 13:42:46 -07:00
huangtingwei	d864622a68	[Hicache & JIT_kernel] Support page first layout & mla jit kernel (#18311 )	2026-03-27 08:54:36 -07:00


7326 Commits