sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 12:17:09 +00:00

Author	SHA1	Message	Date
Kangyan-Zhou	f5a4a5429f	Revert early HTTP port reservation (#17754, #19805) (#20468) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:17:33 -07:00
Ethan (Yusheng) Su	af2807e146	[LoRA][I] Add MOE LoRA JIT alignment kernel and tests (#19710) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Jonah Bernard <96398205+Jonahcb@users.noreply.github.com>	2026-03-12 12:23:46 -07:00
Yuhao Yang	a57a44739f	[diffusion] deps: upgrade diffusers from 0.36.0 to 0.37.0 (#20318)	2026-03-12 19:17:28 +08:00
kk	318a40fdfb	[Bug-fix] Fix gpu fault when run the test with dp-attention-enabled and max-concurrency is over 256 (#20399) Co-authored-by: wunhuang <wunhuang@amd.com>	2026-03-12 02:32:03 -07:00
Ratish P	4e5ca92249	[diffusion]: clear file-path-only outputs on all ranks to prevent TP GPU memory skew (#20353)	2026-03-12 17:29:09 +08:00
jacky.cheng	1e2983c98e	[AMD] Fix FP8 assertion failure in aiter MLA decode by falling back to self.k_scale (#19935)	2026-03-12 01:48:51 -07:00
roikoren755	067353f67b	[Test] Refactor KL divergence and prefix cache branching to kits (#19715)	2026-03-12 16:11:59 +08:00
0xNullPath	46b558445d	Fix default_max_tokens compute error in responses api when mtp is opened (#18932)	2026-03-12 16:00:48 +08:00
Hexq0210	dd82678b2d	[NPU] Support mamba cache transfer for NPU (#20364)	2026-03-12 12:49:21 +08:00
Mook	abc672e717	[Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305)	2026-03-12 04:04:30 +00:00
Ke Bao	ae7c2397b9	Fix FA3 swa spec pg_size > 1 (#20369)	2026-03-12 11:42:01 +08:00
Yuan Luo	649d6f2bc8	[GDN] Change Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#20283)	2026-03-12 10:53:12 +08:00
huangtingwei	8787cf4566	Fix the scope of io_backend in NSATokenToKVPoolHost (#20327)	2026-03-12 10:33:11 +08:00
Vedant V Jhaveri	9b55a98a67	perf(qwen3_5): replace einops rearrange with torch.flatten in GatedDe… (#20386)	2026-03-12 09:51:27 +08:00
Vedant V Jhaveri	25bd83033d	Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models (#19903)	2026-03-12 09:16:38 +08:00
fy	677e446e51	[NPU] Convert cu_window_seqlens to CPU for npu_flush_attention_unpad operator (#20328)	2026-03-12 09:08:43 +08:00
Hubert Lu	67f02681c9	[AMD] Support speculative decoding v2 for aiter backend on ROCm/HIP (#17450) Co-authored-by: kkHuang-amd <wunhuang@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-03-11 17:01:01 -07:00
shuwenn	acab24a76a	fix: gracefully abort last request in retract_decode on OOM (#19881)	2026-03-11 15:13:03 -07:00
doujiang24	88d2fc19b1	feature: support X-Data-Parallel-Rank header to specific dp-rank. (#19832) Signed-off-by: doujiang24 <doujiang24@gmail.com>	2026-03-11 14:53:33 -07:00
Shangming Cai	af4c28904d	[PD] Fix the infinite loop in deocde resolve_pending_reqs (#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-03-11 14:11:19 -07:00
haNa-meister	252ef90fc2	[Generative Score API] Fix on prefill-only scheduler running batch loss track problem (#14320) Co-authored-by: Wenyan Yao <wenyao@linkedin.com> Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>	2026-03-11 13:15:50 -07:00
satyamk7054	a54d71e967	[Benchmark] Add sglang-embedding backend to bench_serving (#20017) Co-authored-by: Satyam Kumar <satyamk@linkedin.com>	2026-03-11 13:13:16 -07:00
Rain Jiang	61b228239e	bump sgl-fa4 version to 4.0.5 to loose torch deps (#20378)	2026-03-11 13:08:09 -07:00
BingjiaWang	006bd44cf9	[deepseekv3.2] fix get_k_and_s_triton kenel for 128K seqlen case bug (#19319) Co-authored-by: abing <wangbingjia.wbj@alibaba-inc.com>	2026-03-11 12:56:33 -07:00
Kazami Michiru	e6a6cd1f0c	[Fix] Reset `output_ids` for requests with `input_embeds` during retraction (#14110)	2026-03-11 12:42:21 -07:00
R0CKSTAR	dae5c6cadf	[diffusion] doc: add Moore Threads as a supported vendor (#20146) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-03-11 10:15:15 -07:00
Артем Савкин	ed42af99a9	[NPU] [Quantization] w4a4 MoE layer support (#18924)	2026-03-11 16:52:35 +03:00
Yoray Zack	9991debde3	[Feature] Integrate Elastic NIXL-EP into SGLang (#19248) Signed-off-by: Barak Biber <bbiber@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Barak Biber <bbiber@nvidia.com>	2026-03-11 17:37:43 +08:00
Xiaoyu Zhang	680d9d98e4	Fix cutedsl ci error (#20309) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2026-03-11 16:17:35 +08:00
qy-seu	456934fed5	feat: fix update last_receive_tstamp logic for health-check in multi-token-worker mode (#20256)	2026-03-11 00:23:22 -07:00
Liangsheng Yin	61cad15d28	[Utils] Add `NetworkAddress` abstraction for IPv6-safe address handling (#20306)	2026-03-11 00:07:37 -07:00
Kurkur	55e6acf834	[NPU][QwenVL] Support qwen image preprocess on npu (#20189)	2026-03-11 15:03:08 +08:00
Xuhao Zhang	57b093dc34	[NPU]MindSpore backend support eagle3 (#17098) Co-authored-by: wangtiance <tiancew@qq.com> Co-authored-by: Tiance Wang <wangtiance@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-03-11 09:11:19 +03:00
zhaoshang	18cfeabd33	Add SGLANG_SORT_WEIGHT_FILES env var for sequential I/O optimization (#20194) Signed-off-by: zhaoshang <zhaoshangsjtu@linux.alibaba.com>	2026-03-11 14:10:53 +08:00
Mick	8c8a487468	[diffusion] doc: add diffusion-optimal-perf (#20311)	2026-03-11 12:20:09 +08:00
Aleksi Vesanto	c8bbe5010a	[diffusion] feat: add AITER Sage attention backend (#20178)	2026-03-11 12:17:45 +08:00
xieminghe1	21a0015aa3	[PCG]add piecewise cuda graph support for marlin linear (#20119) Co-authored-by: undefined <zhouchen.arrebol@jd.com>	2026-03-11 10:57:08 +08:00
Polisetty V R K Jyothendra Varma	b2dd104ade	[Intel GPU] Upgrade pytorch xpu version to 2.10 (#20254) Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>	2026-03-10 18:47:25 -07:00
Kurkur	16ec4f3a4a	Integrate the AddRmsNorm operator (#19939)	2026-03-11 09:05:04 +08:00
Liangsheng Yin	50953aea8d	[Scheduler] Unify idle checks into `is_fully_idle()` and fix weight update test (#20296)	2026-03-10 17:50:23 -07:00
Michael	dc4380e33a	[AMD] [DeepSeek-OCR-2 Day 0] Enable DeepSeek-OCR-2 on AMD GPUs and add nightly test (#19732)	2026-03-10 17:04:35 -07:00
Qiaolin Yu	09a118fafe	Support return_logprob for spec v2 (overlap safe) (#19801) Co-authored-by: Ratish1 <ratish1501@gmail.com> Co-authored-by: Ratish1 <formula733@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>	2026-03-10 15:38:27 -07:00
Ziang Li	76ee4bb98c	[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537)	2026-03-10 15:37:57 -07:00
Qiaolin Yu	bd460e9565	add logprob related params in bench_serving (#20218)	2026-03-10 15:04:57 -07:00
R0CKSTAR	db97f193b7	[diffusion][llm] macOS support (#19549) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com>	2026-03-10 13:11:07 -07:00
Qiaolin Yu	a3d88a247b	Enable piecewise-cuda-graph when logprob_start_len = -1 (#19453)	2026-03-10 12:50:57 -07:00
fxmarty-amd	031d0a2aad	[Qwen-MOE] Fix memory duplication issues in case layers weights are re-assigned during weight loading (#18255)	2026-03-10 17:34:56 +00:00
Xinyuan Tong	11d9c36c2f	Replace soundfile+torchaudio with torchcodec AudioDecoder in load_audio (#20190)	2026-03-10 17:26:29 +00:00
Mick	e1f0b3181a	[diffusion] fix: adjust convert_hf_to_fp8 to be compatible with more dits (#20281)	2026-03-11 01:21:54 +08:00
Xiaoyu Zhang	60cc06297e	[4/n jit_kernel restruct] speed up CI tests and add benchmark workflow (#20268)	2026-03-10 21:37:41 +08:00


7855 Commits