Commit Graph

8284 Commits

Author SHA1 Message Date
Teng Ma
00126648e2 [PD] Add EFA disaggregation transport 2026-05-10 23:51:29 +08:00
Teng Ma
c49ee54e73 Merge remote-tracking branch 'origin/main' into pr-21859-support-multi-protocol
# Conflicts:
#	python/sglang/srt/server_args.py
#	test/registered/unit/server_args/test_server_args.py
2026-05-10 22:35:49 +08:00
RunningLeon
335dbd60b4 Support Intern-S2-Preview (#24875) 2026-05-10 22:17:30 +08:00
Ke Bao
59faf986b2 [PD] Unify dsv4 dispatch with swa (#24888) 2026-05-10 22:01:13 +08:00
Yuhao Yang
2f06867128 Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (#24775)
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
2026-05-10 19:03:37 +08:00
Yuhao Yang
bd0aa22309 Fix PD bootstrap failure handling (#24772)
Co-authored-by: Cheng Wan <chwan@rice.edu>
2026-05-10 19:02:47 +08:00
Liangsheng Yin
8cc16c9974 [Spec] Cleanup idle stub and shape-check patterns (#24881) 2026-05-10 02:39:53 -07:00
Cheng Wan
c7f674e427 [Bug] Add dsv4 state_type branch to mooncake disaggregation (#24878)
Co-authored-by: Cheng Wan <cheng.wan@radixark.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:13:46 -07:00
Liangsheng Yin
d08744238a [Spec V1] Split draft-extend phase from EagleDraftInput into new EagleDraftExtendInput (#24859) 2026-05-10 01:07:45 -07:00
Yuan Luo
d3fd91ed97 [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (#24696)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-05-10 00:24:12 -07:00
Qiaolin Yu
a87fb399de [spec decoding] support kimi-k2.5-eagle3-mla (#24826) 2026-05-09 23:57:39 -07:00
shuwenn
b4d347e86e [SPEC V2] fix: skip stale state updates in spec-v2 overlap (#23456) 2026-05-09 23:56:24 -07:00
Byron Hsu
cfd3fd00d0 [RL] Call torch.cuda.empty_cache() for in-place pause mode to avoid OOM (#24854)
Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-09 23:36:52 -07:00
Chi McIsaac
44efc23a9a [diffusion] CI: add cache-dit CI tests (#19213) 2026-05-10 13:38:41 +08:00
Byron Hsu
1e6c6d1f07 [Utils] Make request dump robust to unpicklable server_args and large meta_info (#24767)
Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-09 21:41:41 -07:00
Stefan He
9578ba1b57 [Utils] Refactor device cache emptying (#24861)
Co-authored-by: Biao He <biao@Biaos-MacBook-Air.local>
2026-05-09 21:28:00 -07:00
Byron Hsu
47483001b6 [PrefillDelayer] support NCCL all-gather for cross-DP info sync (#24768)
Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-09 21:20:03 -07:00
Byron Hsu
7edb4c3cea [NUMA+Ray] Fix NUMA NVML handle resolution under shuffled CUDA_VISIBLE_DEVICES (#24766)
Co-authored-by: Byron Hsu <byron@periodiclabs.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-09 21:18:39 -07:00
Liangsheng Yin
c95454b341 speculative: drop dead params/returns/no-ops (#24865) 2026-05-09 15:53:31 -07:00
Charles Chen
12f42f2e7e Support Gemma3/4 + Eagle3 (#23976) 2026-05-09 13:34:56 -07:00
luchangli
8087e07d52 [UnifiedRadixTree]: Align cache_empty_result with RadixTree (#24779)
Co-authored-by: Zhangheng <hzh0425@apache.org>
2026-05-09 23:52:22 +08:00
Baizhou Zhang
ef5e9f8aba [DSV4] Cherry pick missing commits from deepseek_v4 branch and enhance tests (#24793)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>
2026-05-09 04:15:37 -07:00
Brayden Zhong
4b23f6bdc5 Fix performance regression on Deepseek V3 on moe-runner-backend=triton on SM90 (#24562)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-09 03:49:12 -07:00
Brayden Zhong
05d1ab51e8 Enable PDL for various kernels in DSV32/GLM5 (#23965)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-09 03:42:56 -07:00
shuwenn
d5564c2a96 fix(fa3): translate page table to SWA loc in EAGLE3 topk>1 spec metadata (#24617) 2026-05-09 18:22:45 +08:00
JoyFuture
a309f1f8f4 fix(cuda_graph): zero out_cache_loc_swa on pad and use int32 (hybrid-SWA accuracy fix) (#24743) 2026-05-09 18:22:12 +08:00
Brayden Zhong
f4b7e73699 Enable trtllm-gen BF16 MoE for MTP (#24260)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-09 03:14:17 -07:00
sglang-npu-bot
f1a9a455e0 Revert "[NPU] fix profiler on npu" (#24815) 2026-05-09 17:53:02 +08:00
zhaozx-cn
e2527df8b6 [NPU] fix profiler on npu (#24685)
Signed-off-by: zhaozx-cn <zhaozx2116@163.com>
2026-05-09 17:48:24 +08:00
Jia Guo
fd636410a2 Restrict fa_skip_kv_cache to non-MLA backends (#24097) 2026-05-09 09:25:02 +00:00
Brayden Zhong
8f33bee31b Reland Cute-DSL FP4 dense GEMM (#23590)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-09 02:20:58 -07:00
Yuxuan Zhang
d49fc092cb [Bug Fix] GLM-5.1: drop constexpr on page_indice_batch_offset, skip offloader post_init on draft worker, support N=32 in copy_to_gpu_no_ce (#23550) 2026-05-09 15:43:45 +08:00
Liangsheng Yin
78da0d3106 [Spec] Move accept_tokens off EagleDraftInput; pass via method arg (#24735) 2026-05-08 23:24:18 -07:00
Chi McIsaac
8e534e8f15 [diffusion] fix: fix diffusers executor crash when component residency manager is absent (#24573) 2026-05-09 11:45:06 +08:00
storyicon
590b13b513 [diffusion] fix: fix NCCL deadlock in ulysses sp when sequence length has remainder (#24694)
Signed-off-by: storyicon <storyicon@foxmail.com>
2026-05-09 11:05:37 +08:00
Polisetty V R K Jyothendra Varma
50ed01674e fix is_arch_support_pdl function usage (#24600)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-05-09 09:39:34 +08:00
Liangsheng Yin
1613bae412 [Spec] Disambiguate verified_id into bonus_token(s) / accept_tokens (#24724) 2026-05-08 18:24:33 -07:00
Yuan Luo
a61a14f416 [KDA] Optimize prefill kernels with diagonal and recompute fuse (#24271)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-05-09 08:52:51 +08:00
Brayden Zhong
9ee830346f Disable Custom AR V2 when in multi-node (#24729)
Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>
2026-05-08 17:50:05 -07:00
Cheng Wan
d1c5937428 env: add SGLANG_RADIX_FORCE_MISS to force radix prefix-cache miss (#24726)
Co-authored-by: sihan-zzz <228612289+sihan-zzz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:46:38 -07:00
YAMY
560829a171 feat(scheduler): add adaptive queue-based prefill delayer trigger (#23189) 2026-05-08 16:54:30 -07:00
YAMY
6971a03fe6 fix(fa3): skip scheduler_metadata precompute under DP attention (#24632) 2026-05-08 16:19:20 -07:00
Niko Ma
62c2e091f6 [PD] MORI-IO: Add state transfer, inline transfer model, and high-concurrency fixes (#22665) 2026-05-08 16:07:22 -07:00
Jimmy Shong
fa8985486e [test/fix]: isolate VLM MMMU eval output dirs to fix nightly-4-gpu cross-test pollution (#24623)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-08 15:01:53 -07:00
Jimmy Shong
096ad02b06 [Model] Laguna-XS.2 Model Support (#24204) 2026-05-09 05:43:13 +08:00
Cheng Wan
7b707c9222 disable the combination of --enable-two-batch-overlap and --enforce-s… (#24720) 2026-05-08 14:27:35 -07:00
Yuhao Yang
09912fd89d Remove unnecessary bf16 assert in rotate_activation (#24686) 2026-05-09 05:00:52 +08:00
Yilong Zhao
f30d1d0b0a logits: remove blocking H2D copy (#24627) 2026-05-08 13:22:13 -07:00
Ethan Feng
672f778512 [NemotronH] Fix expert scale weight loading (#24434) 2026-05-08 12:37:06 -07:00
zhongdaor-nv
2cf1a4ab38 feat: Add KV events for Mamba radix cache (#23678)
Signed-off-by: zhongdaor-nv <220807034+zhongdaor-nv@users.noreply.github.com>
Co-authored-by: zhongdaor-nv <220807034+zhongdaor-nv@users.noreply.github.com>
2026-05-08 11:53:36 -07:00