sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 20:27:57 +00:00

Author	SHA1	Message	Date
Baizhou Zhang	39008955ff	Revert "[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars." (#20602)	2026-03-14 12:12:42 -07:00
Xiaoyu Zhang	5ab2cfe9a8	[Diffusion] Clean upstream fa3 in hopper (#20576)	2026-03-14 23:41:23 +08:00
Yuan Luo	22e67876d6	[Omni] Optimize AudioEncoder for Qwen3_Omni_Thinker (#18185) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-03-14 23:00:17 +08:00
Ratish P	574aa2d723	[diffusion]: remove stale offload-manager cleanup in denoising stage (#20587)	2026-03-14 22:56:57 +08:00
Xiaoyu Zhang	25e38216b6	[kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277)	2026-03-14 16:45:54 +08:00
Mohammad Miadh Angkad	75a7879fd4	[Model] Support Nemotron 3 Super NVFP4 (#20407)	2026-03-14 00:56:26 -07:00
SoluMilken	c95dc88f86	[CI] migrate ascend-gptq from `test/srt` to `test/registered` (#19628)	2026-03-14 00:28:57 -07:00
Xiaoyu Zhang	f9e4221b71	[Diffusion] add mova and hunyuanvideo to perf skills (#20563)	2026-03-14 13:49:50 +08:00
Shangming Cai	99a3b25c9b	[PP] Fix recv tensor dict potential race condition (#20341) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-03-14 13:35:01 +08:00
Xinyuan Tong	c330b687a1	[Bugfix] Fix GLM-4.6V vision regression in glm4v_moe and glm_ocr (#20463)	2026-03-13 21:48:28 -07:00
ziruiliu	dfd0a77a9a	[bugfix] Add prev_prefix_len parameter to HiMambaRadixCache's _insert_helper() (#20539)	2026-03-14 09:54:14 +08:00
Duyi-Wang	0eea80bc00	[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars. (#20453)	2026-03-13 14:03:17 -07:00
YC Tseng	c37ef7f18b	[AMD] diffusion refactor: move ROCM VAE optimization to Platform abstraction (#20496)	2026-03-13 13:10:05 -07:00
Simo Lin	654fc02cf1	[gRPC] Extract gRPC servicer into standalone package (#20478) Signed-off-by: Simo Lin <linsimo.mark@gmail.com>	2026-03-13 09:13:29 -07:00
Xiaoyu Zhang	be7a0311a0	[Diffusion] Fix and validate diffusion skills benchmarking/profiling workflow (#20528)	2026-03-13 21:11:37 +08:00
Leon Gao	b1246c50f8	Fix chunked prefill and KV cache leaks for streaming sessions (#20476) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>	2026-03-13 02:36:55 -07:00
Ke Bao	287dc12b05	Fix hicache log metrics (#20504)	2026-03-13 16:29:58 +08:00
Baizhou Zhang	f8668d9e78	[Fix] Add fallback for flashinfer allreduce fusion (#20384)	2026-03-13 01:24:55 -07:00
Mick	b638b25b22	[diffusion] UX: suppress excessive logging from httpx and httpcore (#20452)	2026-03-13 14:43:09 +08:00
seungrokj	9c8777c80f	[AMD][Qwen3.5] aiter a8w8 gemm configuration (#19826) Signed-off-by: seungrokj <seungrok.jung@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-03-12 23:23:58 -07:00
Antonin Vidon	63ecdcbb18	Expose async LoRA interface to Offline Engine (#18636)	2026-03-12 23:09:47 -07:00
StonyPort	d4e68ead1d	[quant] Ignore FP8 quantization layers (#20340) Co-authored-by: qiuxuan.lzw <qiuxuan.lzw@alibaba-inc.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-13 13:59:39 +08:00
Xiaoyu Zhang	e00328d1e5	[Diffusion] Opt qwen-image-edit with fuse_residual_layernorm_scale_shift_gate_select01_kernel (#20395) Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>	2026-03-13 13:15:22 +08:00
hzh0425	197f807134	[RadixTree][7/N Refactor]: Refactor mamba radix tree, release dup kvcache in insert func (#19429)	2026-03-13 12:28:32 +08:00
Liangsheng Yin	f605612b87	[HTTP] Fix `/GET` HTTP route when ollama endpoint is not set. (#20494)	2026-03-12 20:54:32 -07:00
Xiaoyu Zhang	7ecf07b8f4	[jit_kernel] Temporarily Skip Flaky JIT Kernel GDN Test and Add PR Label (#20436)	2026-03-13 09:34:22 +08:00
Pai Liu	65dd08153d	Fix Test* mixin classes being collected as standalone pytest tests (#20417)	2026-03-12 18:18:45 -07:00
LinyuanLi	9865f11421	[bugfix] fix bug when enable prefill delay and DP (#20134)	2026-03-13 08:51:09 +08:00
zzhpro	c21ddbc785	[Minor] fix type annotations and invalid method calls in constrained … (#20132)	2026-03-12 16:42:46 -07:00
YC Tseng	78a467c74a	[AMD] [diffusion] feat: enable AITer GroupNorm for VAE decode on ROCm (#20170) Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-03-12 16:38:19 -07:00
Kangyan-Zhou	f5a4a5429f	Revert early HTTP port reservation (#17754, #19805) (#20468) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 16:17:33 -07:00
Ethan (Yusheng) Su	af2807e146	[LoRA][I] Add MOE LoRA JIT alignment kernel and tests (#19710) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Jonah Bernard <96398205+Jonahcb@users.noreply.github.com>	2026-03-12 12:23:46 -07:00
Yuhao Yang	a57a44739f	[diffusion] deps: upgrade diffusers from 0.36.0 to 0.37.0 (#20318)	2026-03-12 19:17:28 +08:00
kk	318a40fdfb	[Bug-fix] Fix gpu fault when run the test with dp-attention-enabled and max-concurrency is over 256 (#20399) Co-authored-by: wunhuang <wunhuang@amd.com>	2026-03-12 02:32:03 -07:00
Ratish P	4e5ca92249	[diffusion]: clear file-path-only outputs on all ranks to prevent TP GPU memory skew (#20353)	2026-03-12 17:29:09 +08:00
jacky.cheng	1e2983c98e	[AMD] Fix FP8 assertion failure in aiter MLA decode by falling back to self.k_scale (#19935)	2026-03-12 01:48:51 -07:00
roikoren755	067353f67b	[Test] Refactor KL divergence and prefix cache branching to kits (#19715)	2026-03-12 16:11:59 +08:00
0xNullPath	46b558445d	Fix default_max_tokens compute error in responses api when mtp is opened (#18932)	2026-03-12 16:00:48 +08:00
Hexq0210	dd82678b2d	[NPU] Support mamba cache transfer for NPU (#20364)	2026-03-12 12:49:21 +08:00
Mook	abc672e717	[Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305)	2026-03-12 04:04:30 +00:00
Ke Bao	ae7c2397b9	Fix FA3 swa spec pg_size > 1 (#20369)	2026-03-12 11:42:01 +08:00
Yuan Luo	649d6f2bc8	[GDN] Change Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#20283)	2026-03-12 10:53:12 +08:00
huangtingwei	8787cf4566	Fix the scope of io_backend in NSATokenToKVPoolHost (#20327)	2026-03-12 10:33:11 +08:00
Vedant V Jhaveri	9b55a98a67	perf(qwen3_5): replace einops rearrange with torch.flatten in GatedDe… (#20386)	2026-03-12 09:51:27 +08:00
Vedant V Jhaveri	25bd83033d	Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models (#19903)	2026-03-12 09:16:38 +08:00
fy	677e446e51	[NPU] Convert cu_window_seqlens to CPU for npu_flush_attention_unpad operator (#20328)	2026-03-12 09:08:43 +08:00
Hubert Lu	67f02681c9	[AMD] Support speculative decoding v2 for aiter backend on ROCm/HIP (#17450) Co-authored-by: kkHuang-amd <wunhuang@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-03-11 17:01:01 -07:00
shuwenn	acab24a76a	fix: gracefully abort last request in retract_decode on OOM (#19881)	2026-03-11 15:13:03 -07:00
doujiang24	88d2fc19b1	feature: support X-Data-Parallel-Rank header to specific dp-rank. (#19832) Signed-off-by: doujiang24 <doujiang24@gmail.com>	2026-03-11 14:53:33 -07:00
Shangming Cai	af4c28904d	[PD] Fix the infinite loop in decode resolve_pending_reqs (#20371) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-03-11 14:11:19 -07:00

6985 Commits