Kangyan-Zhou
|
f5a4a5429f
|
Revert early HTTP port reservation (#17754, #19805) (#20468)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-03-12 16:17:33 -07:00 |
|
Ethan (Yusheng) Su
|
af2807e146
|
[LoRA][I] Add MOE LoRA JIT alignment kernel and tests (#19710)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Jonah Bernard <96398205+Jonahcb@users.noreply.github.com>
|
2026-03-12 12:23:46 -07:00 |
|
Yuhao Yang
|
a57a44739f
|
[diffusion] deps: upgrade diffusers from 0.36.0 to 0.37.0 (#20318)
|
2026-03-12 19:17:28 +08:00 |
|
kk
|
318a40fdfb
|
[Bug-fix] Fix GPU fault when running the test with dp-attention enabled and max-concurrency over 256 (#20399)
Co-authored-by: wunhuang <wunhuang@amd.com>
|
2026-03-12 02:32:03 -07:00 |
|
Ratish P
|
4e5ca92249
|
[diffusion]: clear file-path-only outputs on all ranks to prevent TP GPU memory skew (#20353)
|
2026-03-12 17:29:09 +08:00 |
|
jacky.cheng
|
1e2983c98e
|
[AMD] Fix FP8 assertion failure in aiter MLA decode by falling back to self.k_scale (#19935)
|
2026-03-12 01:48:51 -07:00 |
|
roikoren755
|
067353f67b
|
[Test] Refactor KL divergence and prefix cache branching to kits (#19715)
|
2026-03-12 16:11:59 +08:00 |
|
0xNullPath
|
46b558445d
|
Fix default_max_tokens compute error in responses api when mtp is opened (#18932)
|
2026-03-12 16:00:48 +08:00 |
|
Hexq0210
|
dd82678b2d
|
[NPU] Support mamba cache transfer for NPU (#20364)
|
2026-03-12 12:49:21 +08:00 |
|
Mook
|
abc672e717
|
[Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305)
|
2026-03-12 04:04:30 +00:00 |
|
Ke Bao
|
ae7c2397b9
|
Fix FA3 swa spec pg_size > 1 (#20369)
|
2026-03-12 11:42:01 +08:00 |
|
Yuan Luo
|
649d6f2bc8
|
[GDN] Change Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#20283)
|
2026-03-12 10:53:12 +08:00 |
|
huangtingwei
|
8787cf4566
|
Fix the scope of io_backend in NSATokenToKVPoolHost (#20327)
|
2026-03-12 10:33:11 +08:00 |
|
Vedant V Jhaveri
|
9b55a98a67
|
perf(qwen3_5): replace einops rearrange with torch.flatten in GatedDe… (#20386)
|
2026-03-12 09:51:27 +08:00 |
|
Vedant V Jhaveri
|
25bd83033d
|
Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models (#19903)
|
2026-03-12 09:16:38 +08:00 |
|
fy
|
677e446e51
|
[NPU] Convert cu_window_seqlens to CPU for npu_flush_attention_unpad operator (#20328)
|
2026-03-12 09:08:43 +08:00 |
|
Hubert Lu
|
67f02681c9
|
[AMD] Support speculative decoding v2 for aiter backend on ROCm/HIP (#17450)
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
|
2026-03-11 17:01:01 -07:00 |
|
shuwenn
|
acab24a76a
|
fix: gracefully abort last request in retract_decode on OOM (#19881)
|
2026-03-11 15:13:03 -07:00 |
|
doujiang24
|
88d2fc19b1
|
feature: support X-Data-Parallel-Rank header to specify dp-rank. (#19832)
Signed-off-by: doujiang24 <doujiang24@gmail.com>
|
2026-03-11 14:53:33 -07:00 |
|
Shangming Cai
|
af4c28904d
|
[PD] Fix the infinite loop in decode resolve_pending_reqs (#20371)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
|
2026-03-11 14:11:19 -07:00 |
|
haNa-meister
|
252ef90fc2
|
[Generative Score API] Fix on prefill-only scheduler running batch loss track problem (#14320)
Co-authored-by: Wenyan Yao <wenyao@linkedin.com>
Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>
|
2026-03-11 13:15:50 -07:00 |
|
satyamk7054
|
a54d71e967
|
[Benchmark] Add sglang-embedding backend to bench_serving (#20017)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
|
2026-03-11 13:13:16 -07:00 |
|
Rain Jiang
|
61b228239e
|
bump sgl-fa4 version to 4.0.5 to loosen torch deps (#20378)
|
2026-03-11 13:08:09 -07:00 |
|
BingjiaWang
|
006bd44cf9
|
[deepseekv3.2] fix get_k_and_s_triton kernel for 128K seqlen case bug (#19319)
Co-authored-by: abing <wangbingjia.wbj@alibaba-inc.com>
|
2026-03-11 12:56:33 -07:00 |
|
Kazami Michiru
|
e6a6cd1f0c
|
[Fix] Reset output_ids for requests with input_embeds during retraction (#14110)
|
2026-03-11 12:42:21 -07:00 |
|
R0CKSTAR
|
dae5c6cadf
|
[diffusion] doc: add Moore Threads as a supported vendor (#20146)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
|
2026-03-11 10:15:15 -07:00 |
|
Артем Савкин
|
ed42af99a9
|
[NPU] [Quantization] w4a4 MoE layer support (#18924)
|
2026-03-11 16:52:35 +03:00 |
|
Yoray Zack
|
9991debde3
|
[Feature] Integrate Elastic NIXL-EP into SGLang (#19248)
Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Barak Biber <bbiber@nvidia.com>
|
2026-03-11 17:37:43 +08:00 |
|
Xiaoyu Zhang
|
680d9d98e4
|
Fix cutedsl ci error (#20309)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2026-03-11 16:17:35 +08:00 |
|
qy-seu
|
456934fed5
|
feat: fix update last_receive_tstamp logic for health-check in multi-token-worker mode (#20256)
|
2026-03-11 00:23:22 -07:00 |
|
Liangsheng Yin
|
61cad15d28
|
[Utils] Add NetworkAddress abstraction for IPv6-safe address handling (#20306)
|
2026-03-11 00:07:37 -07:00 |
|
Kurkur
|
55e6acf834
|
[NPU][QwenVL] Support qwen image preprocess on npu (#20189)
|
2026-03-11 15:03:08 +08:00 |
|
Xuhao Zhang
|
57b093dc34
|
[NPU]MindSpore backend support eagle3 (#17098)
Co-authored-by: wangtiance <tiancew@qq.com>
Co-authored-by: Tiance Wang <wangtiance@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
|
2026-03-11 09:11:19 +03:00 |
|
zhaoshang
|
18cfeabd33
|
Add SGLANG_SORT_WEIGHT_FILES env var for sequential I/O optimization (#20194)
Signed-off-by: zhaoshang <zhaoshangsjtu@linux.alibaba.com>
|
2026-03-11 14:10:53 +08:00 |
|
Mick
|
8c8a487468
|
[diffusion] doc: add diffusion-optimal-perf (#20311)
|
2026-03-11 12:20:09 +08:00 |
|
Aleksi Vesanto
|
c8bbe5010a
|
[diffusion] feat: add AITER Sage attention backend (#20178)
|
2026-03-11 12:17:45 +08:00 |
|
xieminghe1
|
21a0015aa3
|
[PCG]add piecewise cuda graph support for marlin linear (#20119)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
|
2026-03-11 10:57:08 +08:00 |
|
Polisetty V R K Jyothendra Varma
|
b2dd104ade
|
[Intel GPU] Upgrade pytorch xpu version to 2.10 (#20254)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
|
2026-03-10 18:47:25 -07:00 |
|
Kurkur
|
16ec4f3a4a
|
Integrate the AddRmsNorm operator (#19939)
|
2026-03-11 09:05:04 +08:00 |
|
Liangsheng Yin
|
50953aea8d
|
[Scheduler] Unify idle checks into is_fully_idle() and fix weight update test (#20296)
|
2026-03-10 17:50:23 -07:00 |
|
Michael
|
dc4380e33a
|
[AMD] [DeepSeek-OCR-2 Day 0] Enable DeepSeek-OCR-2 on AMD GPUs and add nightly test (#19732)
|
2026-03-10 17:04:35 -07:00 |
|
Qiaolin Yu
|
09a118fafe
|
Support return_logprob for spec v2 (overlap safe) (#19801)
Co-authored-by: Ratish1 <ratish1501@gmail.com>
Co-authored-by: Ratish1 <formula733@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2026-03-10 15:38:27 -07:00 |
|
Ziang Li
|
76ee4bb98c
|
[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537)
|
2026-03-10 15:37:57 -07:00 |
|
Qiaolin Yu
|
bd460e9565
|
add logprob related params in bench_serving (#20218)
|
2026-03-10 15:04:57 -07:00 |
|
R0CKSTAR
|
db97f193b7
|
[diffusion][llm] macOS support (#19549)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
|
2026-03-10 13:11:07 -07:00 |
|
Qiaolin Yu
|
a3d88a247b
|
Enable piecewise-cuda-graph when logprob_start_len = -1 (#19453)
|
2026-03-10 12:50:57 -07:00 |
|
fxmarty-amd
|
031d0a2aad
|
[Qwen-MOE] Fix memory duplication issues in case layers weights are re-assigned during weight loading (#18255)
|
2026-03-10 17:34:56 +00:00 |
|
Xinyuan Tong
|
11d9c36c2f
|
Replace soundfile+torchaudio with torchcodec AudioDecoder in load_audio (#20190)
|
2026-03-10 17:26:29 +00:00 |
|
Mick
|
e1f0b3181a
|
[diffusion] fix: adjust convert_hf_to_fp8 to be compatible with more dits (#20281)
|
2026-03-11 01:21:54 +08:00 |
|
Xiaoyu Zhang
|
60cc06297e
|
[4/n jit_kernel restruct] speed up CI tests and add benchmark workflow (#20268)
|
2026-03-10 21:37:41 +08:00 |
|