4372 Commits

Author SHA1 Message Date
gongwei-130
97be66c358 fix sgl-kernel version (#12723) 2025-11-05 19:01:03 -08:00
Lianmin Zheng
c7d57d5bb3 Fix CI and style (#12658) 2025-11-05 15:08:15 -08:00
Kaixi Hou
141278048e [NVIDIA] Fix unit test of MoE and add it to nightly ci (#12709) 2025-11-05 14:33:18 -08:00
Shu Wang
82f39dc11d Add mm_fp4 trtllm backend (#12406) 2025-11-05 14:31:46 -08:00
Atream
627bac649c Support Expert Deferral Mechanism in KTransformers (#12586)
Co-authored-by: Chen Hongtao <56470055+chenht2022@users.noreply.github.com>
Co-authored-by: chenht2022 <cht22@mails.tsinghua.edu.cn>
2025-11-05 13:41:52 -08:00
Morpheus Guo
c8547ecddd Enable Aiter Attention for VL model (#12699)
Co-authored-by: yuechguo <yuechguo@amd.com>
2025-11-05 13:01:23 -08:00
Mick
7bc1dae095 WIP: initial multimodal-gen support (#12484)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: JiLi <leege233@gmail.com>
Co-authored-by: CHEN Xi <78632976+RubiaCx@users.noreply.github.com>
Co-authored-by: laixin <xielx@shanghaitech.edu.cn>
Co-authored-by: SolitaryThinker <wlsaidhi@gmail.com>
Co-authored-by: jzhang38 <a1286225768@gmail.com>
Co-authored-by: BrianChen1129 <yongqichcd@gmail.com>
Co-authored-by: Kevin Lin <42618777+kevin314@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: rlsu9 <r3su@ucsd.edu>
Co-authored-by: Jinzhe Pan <48981407+eigensystem@users.noreply.github.com>
Co-authored-by: foreverpiano <pianoqwz@qq.com>
Co-authored-by: RandNMR73 <notomatthew31@gmail.com>
Co-authored-by: PorridgeSwim <yz3883@columbia.edu>
Co-authored-by: Jiali Chen <90408393+gary-chenjl@users.noreply.github.com>
2025-11-05 12:28:52 -08:00
Lianmin Zheng
fb2e816e83 Fix server args for gpt oss so users can override the moe runner backend (#12696) 2025-11-05 11:36:59 -08:00
bigmoyan
508d2f7aa2 add Kimi k2 reasoning parser (#12702)
Signed-off-by: wangzhengtao <wangzhengtao@msh.team>
2025-11-06 00:37:54 +08:00
Yuxuan Zhang
a889c85459 [Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. (#12455) 2025-11-06 00:03:45 +08:00
Yuhong Guo
4d84f886e7 Refactor --debug-tensor-dump-layers to list (#12691) 2025-11-05 03:30:01 -08:00
yinghui
dc4f541823 fix trtllm_mla attention backend when disabling cuda graph. (#12687) 2025-11-05 01:35:02 -08:00
zejunchen-zejun
0648eb482d [Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling (#11641)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-11-04 23:41:46 -08:00
yinghui
b88fab3111 fix: add seed bench_serving to cache key, remove redundant function definition. (#12680) 2025-11-04 23:39:11 -08:00
Glen Liu
cbf23dbbfa [Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads (#12175) 2025-11-04 21:41:40 -08:00
ai-easy-cpu
48641435d6 fix typo of args description in sglang.profiler (#12486)
Co-authored-by: AI-bot-easy <litchys0123@outlook.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-04 20:15:13 -08:00
Liangsheng Yin
44b1b394a4 [PD-Disagg] Check finish after pop tranferred (#12638) 2025-11-05 11:18:09 +08:00
Kaixi Hou
0711d1509b [NVIDIA] Fix cutedsl backend of MoE (#12353) 2025-11-04 18:54:55 -08:00
sglang-bot
09938e1f82 chore: bump SGLang version to 0.5.4.post3 (#12639) 2025-11-04 18:32:11 -08:00
Nicolas Castet
2340798353 Register allgather/reducescatter buffers with symm memory (#12572) 2025-11-04 17:11:36 -08:00
soaringk
44da737770 [fix] Handle escaped characters in GLM tool call parser to prevent double serialization (#12456) 2025-11-04 16:48:14 -08:00
Baizhou Zhang
d22d044734 Revert "Enable memory saver for hybrid model" (#12648) 2025-11-04 16:22:06 -08:00
Kaixi Hou
34f7564df0 [NVIDIA] Fix wrong symmetric sizes for fp4 cases (#12640) 2025-11-04 14:19:37 -08:00
Johnsonms
1cfbbc42d8 [Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 (#12645) 2025-11-04 13:57:32 -08:00
Lianmin Zheng
55dfb539cf [Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) (#12647)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-11-04 13:56:14 -08:00
Baizhou Zhang
42889acbd0 [hotfix] Fix deepep w4a8 bug (#12642) 2025-11-04 13:55:59 -08:00
Trevor Morris
211f4070e5 fix: Lazy import mooncake-ep to fix extra gpu contexts being created (#12641) 2025-11-04 12:28:36 -08:00
Liangsheng Yin
befa41a152 Fix output_ids inconsistency (#12628) 2025-11-05 01:43:08 +08:00
Liangsheng Yin
30b26ee9d0 Add io struct naming check back (#12634) 2025-11-05 01:15:01 +08:00
Liangsheng Yin
aa797d013d [Test] Merge all constrained decoding tests. (#12633) 2025-11-05 00:43:06 +08:00
Ke Bao
7cee07a067 Fix skip layer in get_quant_method (#12632) 2025-11-04 23:27:46 +08:00
Yuan Luo
bb517fe393 [HotFix] Disable torch dynamo for mrope_triton kernel (#12593)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-11-04 23:26:56 +08:00
fzyzcjy
ff0b64e1e6 Ensure GPU work is finished when release memory occupation call is finished (#12592) 2025-11-04 18:01:27 +08:00
Liangsheng Yin
0678beaaee [sepc-v2] Fix imcompatibility with constrained decoding (#12615) 2025-11-04 17:27:31 +08:00
Minglei Zhu
c14cc47e39 [Deterministic] Optimize bmm_batch_invariant op (#12522) 2025-11-04 00:33:31 -08:00
Trevor Morris
dbcf85b7f0 Add --speculative-moe-runner-backend server arg (#10183) 2025-11-04 00:20:56 -08:00
Zhao Chen
d5fa019c36 feat: limit peak memory usage when computing logprobs (#6318)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>
2025-11-03 23:53:20 -08:00
Junrong Lin
173e0f704f Enable memory saver for hybrid model (#11974) 2025-11-04 14:55:26 +08:00
Lianmin Zheng
f600866a44 Improve the metrics for PD (#12580)
Co-authored-by: Kan Wu <wukanustc@gmail.com>
Co-authored-by: cctry <shiyang@x.ai>
2025-11-03 22:10:57 -08:00
ishandhanani
93be7e863e fix: respect --ignore-eos in PD case for benchmarking (#12597) 2025-11-03 21:44:14 -08:00
fzyzcjy
60b0754cc9 Tiny fix ExpertDistributionReq error (#11760) 2025-11-04 13:39:25 +08:00
Zhao Chen
0b24af4d79 test: support return logprobs in bench_offline_throughput test (#12462)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-03 21:38:48 -08:00
Jonah Bernard
a209fb05c1 [Qwen3 VL] Add LoRA support for Qwen 3 VL (#12165) 2025-11-03 20:32:54 -08:00
Hanming Lu
48d6bea1ea [GDN/SWA] mamba and swa radix cache edge case fix (#12111)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-11-04 11:03:37 +08:00
akhilg-nv
e607850fcf Enable mixed type LayerNorm kernel for NSA indexer (#12044) 2025-11-03 16:50:41 -08:00
Lianmin Zheng
243c064df2 Remove the dependency of nccl.h in symmetric memory (#12571) 2025-11-03 16:11:00 -08:00
b8zhong
d31d48b341 update usage of trtllm_fp8_per_tensor_scale_moe (#12569) 2025-11-03 14:25:32 -08:00
fzyzcjy
8834260739 Super tiny dump server info such as args in bench for post analysis (#12550) 2025-11-03 14:24:08 -08:00
fzyzcjy
fd7a72d62d Super tiny allow profile activities in bench_serving (#12549) 2025-11-03 14:23:18 -08:00
Yi Zhang
21a8fa16ea tiny optimize for bench serving (#12553) 2025-11-03 14:13:18 -08:00