7855 Commits

Author SHA1 Message Date
Kangyan-Zhou
f5a4a5429f Revert early HTTP port reservation (#17754, #19805) (#20468)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:17:33 -07:00
Ethan (Yusheng) Su
af2807e146 [LoRA][I] Add MOE LoRA JIT alignment kernel and tests (#19710)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jonah Bernard <96398205+Jonahcb@users.noreply.github.com>
2026-03-12 12:23:46 -07:00
Yuhao Yang
a57a44739f [diffusion] deps: upgrade diffusers from 0.36.0 to 0.37.0 (#20318) 2026-03-12 19:17:28 +08:00
kk
318a40fdfb [Bug-fix] Fix gpu fault when run the test with dp-attention-enabled and max-concurrency is over 256 (#20399)
Co-authored-by: wunhuang <wunhuang@amd.com>
2026-03-12 02:32:03 -07:00
Ratish P
4e5ca92249 [diffusion]: clear file-path-only outputs on all ranks to prevent TP GPU memory skew (#20353) 2026-03-12 17:29:09 +08:00
jacky.cheng
1e2983c98e [AMD] Fix FP8 assertion failure in aiter MLA decode by falling back to self.k_scale (#19935) 2026-03-12 01:48:51 -07:00
roikoren755
067353f67b [Test] Refactor KL divergence and prefix cache branching to kits (#19715) 2026-03-12 16:11:59 +08:00
0xNullPath
46b558445d Fix default_max_tokens compute error in responses api when mtp is opened (#18932) 2026-03-12 16:00:48 +08:00
Hexq0210
dd82678b2d [NPU] Support mamba cache transfer for NPU (#20364) 2026-03-12 12:49:21 +08:00
Mook
abc672e717 [Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305) 2026-03-12 04:04:30 +00:00
Ke Bao
ae7c2397b9 Fix FA3 swa spec pg_size > 1 (#20369) 2026-03-12 11:42:01 +08:00
Yuan Luo
649d6f2bc8 [GDN] Change Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#20283) 2026-03-12 10:53:12 +08:00
huangtingwei
8787cf4566 Fix the scope of io_backend in NSATokenToKVPoolHost (#20327) 2026-03-12 10:33:11 +08:00
Vedant V Jhaveri
9b55a98a67 perf(qwen3_5): replace einops rearrange with torch.flatten in GatedDe… (#20386) 2026-03-12 09:51:27 +08:00
Vedant V Jhaveri
25bd83033d Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models (#19903) 2026-03-12 09:16:38 +08:00
fy
677e446e51 [NPU] Convert cu_window_seqlens to CPU for npu_flush_attention_unpad operator (#20328) 2026-03-12 09:08:43 +08:00
Hubert Lu
67f02681c9 [AMD] Support speculative decoding v2 for aiter backend on ROCm/HIP (#17450)
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-03-11 17:01:01 -07:00
shuwenn
acab24a76a fix: gracefully abort last request in retract_decode on OOM (#19881) 2026-03-11 15:13:03 -07:00
doujiang24
88d2fc19b1 feature: support X-Data-Parallel-Rank header to specific dp-rank. (#19832)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
2026-03-11 14:53:33 -07:00
Shangming Cai
af4c28904d [PD] Fix the infinite loop in deocde resolve_pending_reqs (#20371)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-11 14:11:19 -07:00
haNa-meister
252ef90fc2 [Generative Score API] Fix on prefill-only scheduler running batch loss track problem (#14320)
Co-authored-by: Wenyan Yao <wenyao@linkedin.com>
Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>
2026-03-11 13:15:50 -07:00
satyamk7054
a54d71e967 [Benchmark] Add sglang-embedding backend to bench_serving (#20017)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-03-11 13:13:16 -07:00
Rain Jiang
61b228239e bump sgl-fa4 version to 4.0.5 to loose torch deps (#20378) 2026-03-11 13:08:09 -07:00
BingjiaWang
006bd44cf9 [deepseekv3.2] fix get_k_and_s_triton kenel for 128K seqlen case bug (#19319)
Co-authored-by: abing <wangbingjia.wbj@alibaba-inc.com>
2026-03-11 12:56:33 -07:00
Kazami Michiru
e6a6cd1f0c [Fix] Reset output_ids for requests with input_embeds during retraction (#14110) 2026-03-11 12:42:21 -07:00
R0CKSTAR
dae5c6cadf [diffusion] doc: add Moore Threads as a supported vendor (#20146)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-03-11 10:15:15 -07:00
Артем Савкин
ed42af99a9 [NPU] [Quantization] w4a4 MoE layer support (#18924) 2026-03-11 16:52:35 +03:00
Yoray Zack
9991debde3 [Feature] Integrate Elastic NIXL-EP into SGLang (#19248)
Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Barak Biber <bbiber@nvidia.com>
2026-03-11 17:37:43 +08:00
Xiaoyu Zhang
680d9d98e4 Fix cutedsl ci error (#20309)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-11 16:17:35 +08:00
qy-seu
456934fed5 feat: fix update last_receive_tstamp logic for health-check in multi-token-worker mode (#20256) 2026-03-11 00:23:22 -07:00
Liangsheng Yin
61cad15d28 [Utils] Add NetworkAddress abstraction for IPv6-safe address handling (#20306) 2026-03-11 00:07:37 -07:00
Kurkur
55e6acf834 [NPU][QwenVL] Support qwen image preprocess on npu (#20189) 2026-03-11 15:03:08 +08:00
Xuhao Zhang
57b093dc34 [NPU]MindSpore backend support eagle3 (#17098)
Co-authored-by: wangtiance <tiancew@qq.com>
Co-authored-by: Tiance Wang <wangtiance@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-11 09:11:19 +03:00
zhaoshang
18cfeabd33 Add SGLANG_SORT_WEIGHT_FILES env var for sequential I/O optimization (#20194)
Signed-off-by: zhaoshang <zhaoshangsjtu@linux.alibaba.com>
2026-03-11 14:10:53 +08:00
Mick
8c8a487468 [diffusion] doc: add diffusion-optimal-perf (#20311) 2026-03-11 12:20:09 +08:00
Aleksi Vesanto
c8bbe5010a [diffusion] feat: add AITER Sage attention backend (#20178) 2026-03-11 12:17:45 +08:00
xieminghe1
21a0015aa3 [PCG]add piecewise cuda graph support for marlin linear (#20119)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
2026-03-11 10:57:08 +08:00
Polisetty V R K Jyothendra Varma
b2dd104ade [Intel GPU] Upgrade pytorch xpu version to 2.10 (#20254)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
2026-03-10 18:47:25 -07:00
Kurkur
16ec4f3a4a Integrate the AddRmsNorm operator (#19939) 2026-03-11 09:05:04 +08:00
Liangsheng Yin
50953aea8d [Scheduler] Unify idle checks into is_fully_idle() and fix weight update test (#20296) 2026-03-10 17:50:23 -07:00
Michael
dc4380e33a [AMD] [DeepSeek-OCR-2 Day 0] Enable DeepSeek-OCR-2 on AMD GPUs and add nightly test (#19732) 2026-03-10 17:04:35 -07:00
Qiaolin Yu
09a118fafe Support return_logprob for spec v2 (overlap safe) (#19801)
Co-authored-by: Ratish1 <ratish1501@gmail.com>
Co-authored-by: Ratish1 <formula733@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-10 15:38:27 -07:00
Ziang Li
76ee4bb98c [FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537) 2026-03-10 15:37:57 -07:00
Qiaolin Yu
bd460e9565 add logprob related params in bench_serving (#20218) 2026-03-10 15:04:57 -07:00
R0CKSTAR
db97f193b7 [diffusion][llm] macOS support (#19549)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-10 13:11:07 -07:00
Qiaolin Yu
a3d88a247b Enable piecewise-cuda-graph when logprob_start_len = -1 (#19453) 2026-03-10 12:50:57 -07:00
fxmarty-amd
031d0a2aad [Qwen-MOE] Fix memory duplication issues in case layers weights are re-assigned during weight loading (#18255) 2026-03-10 17:34:56 +00:00
Xinyuan Tong
11d9c36c2f Replace soundfile+torchaudio with torchcodec AudioDecoder in load_audio (#20190) 2026-03-10 17:26:29 +00:00
Mick
e1f0b3181a [diffusion] fix: adjust convert_hf_to_fp8 to be compatible with more dits (#20281) 2026-03-11 01:21:54 +08:00
Xiaoyu Zhang
60cc06297e [4/n jit_kernel restruct] speed up CI tests and add benchmark workflow (#20268) 2026-03-10 21:37:41 +08:00