4365 Commits

Author SHA1 Message Date
Lianmin Zheng
fb2e816e83 Fix server args for gpt oss so users can override the moe runner backend (#12696) 2025-11-05 11:36:59 -08:00
bigmoyan
508d2f7aa2 add Kimi k2 reasoning parser (#12702)
Signed-off-by: wangzhengtao <wangzhengtao@msh.team>
2025-11-06 00:37:54 +08:00
Yuxuan Zhang
a889c85459 [Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. (#12455) 2025-11-06 00:03:45 +08:00
Yuhong Guo
4d84f886e7 Refactor --debug-tensor-dump-layers to list (#12691) 2025-11-05 03:30:01 -08:00
yinghui
dc4f541823 fix trtllm_mla attention backend when disabling cuda graph. (#12687) 2025-11-05 01:35:02 -08:00
zejunchen-zejun
0648eb482d [Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling (#11641)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-11-04 23:41:46 -08:00
yinghui
b88fab3111 fix: add seed bench_serving to cache key, remove redundant function definition. (#12680) 2025-11-04 23:39:11 -08:00
Glen Liu
cbf23dbbfa [Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads (#12175) 2025-11-04 21:41:40 -08:00
ai-easy-cpu
48641435d6 fix typo of args description in sglang.profiler (#12486)
Co-authored-by: AI-bot-easy <litchys0123@outlook.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-04 20:15:13 -08:00
Liangsheng Yin
44b1b394a4 [PD-Disagg] Check finish after pop tranferred (#12638) 2025-11-05 11:18:09 +08:00
Kaixi Hou
0711d1509b [NVIDIA] Fix cutedsl backend of MoE (#12353) 2025-11-04 18:54:55 -08:00
sglang-bot
09938e1f82 chore: bump SGLang version to 0.5.4.post3 (#12639) 2025-11-04 18:32:11 -08:00
Nicolas Castet
2340798353 Register allgather/reducescatter buffers with symm memory (#12572) 2025-11-04 17:11:36 -08:00
soaringk
44da737770 [fix] Handle escaped characters in GLM tool call parser to prevent double serialization (#12456) 2025-11-04 16:48:14 -08:00
Baizhou Zhang
d22d044734 Revert "Enable memory saver for hybrid model" (#12648) 2025-11-04 16:22:06 -08:00
Kaixi Hou
34f7564df0 [NVIDIA] Fix wrong symmetric sizes for fp4 cases (#12640) 2025-11-04 14:19:37 -08:00
Johnsonms
1cfbbc42d8 [Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 (#12645) 2025-11-04 13:57:32 -08:00
Lianmin Zheng
55dfb539cf [Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) (#12647)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-11-04 13:56:14 -08:00
Baizhou Zhang
42889acbd0 [hotfix] Fix deepep w4a8 bug (#12642) 2025-11-04 13:55:59 -08:00
Trevor Morris
211f4070e5 fix: Lazy import mooncake-ep to fix extra gpu contexts being created (#12641) 2025-11-04 12:28:36 -08:00
Liangsheng Yin
befa41a152 Fix output_ids inconsistency (#12628) 2025-11-05 01:43:08 +08:00
Liangsheng Yin
30b26ee9d0 Add io struct naming check back (#12634) 2025-11-05 01:15:01 +08:00
Liangsheng Yin
aa797d013d [Test] Merge all constrained decoding tests. (#12633) 2025-11-05 00:43:06 +08:00
Ke Bao
7cee07a067 Fix skip layer in get_quant_method (#12632) 2025-11-04 23:27:46 +08:00
Yuan Luo
bb517fe393 [HotFix] Disable torch dynamo for mrope_triton kernel (#12593)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-11-04 23:26:56 +08:00
fzyzcjy
ff0b64e1e6 Ensure GPU work is finished when release memory occupation call is finished (#12592) 2025-11-04 18:01:27 +08:00
Liangsheng Yin
0678beaaee [sepc-v2] Fix imcompatibility with constrained decoding (#12615) 2025-11-04 17:27:31 +08:00
Minglei Zhu
c14cc47e39 [Deterministic] Optimize bmm_batch_invariant op (#12522) 2025-11-04 00:33:31 -08:00
Trevor Morris
dbcf85b7f0 Add --speculative-moe-runner-backend server arg (#10183) 2025-11-04 00:20:56 -08:00
Zhao Chen
d5fa019c36 feat: limit peak memory usage when computing logprobs (#6318)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>
2025-11-03 23:53:20 -08:00
Junrong Lin
173e0f704f Enable memory saver for hybrid model (#11974) 2025-11-04 14:55:26 +08:00
Lianmin Zheng
f600866a44 Improve the metrics for PD (#12580)
Co-authored-by: Kan Wu <wukanustc@gmail.com>
Co-authored-by: cctry <shiyang@x.ai>
2025-11-03 22:10:57 -08:00
ishandhanani
93be7e863e fix: respect --ignore-eos in PD case for benchmarking (#12597) 2025-11-03 21:44:14 -08:00
fzyzcjy
60b0754cc9 Tiny fix ExpertDistributionReq error (#11760) 2025-11-04 13:39:25 +08:00
Zhao Chen
0b24af4d79 test: support return logprobs in bench_offline_throughput test (#12462)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-03 21:38:48 -08:00
Jonah Bernard
a209fb05c1 [Qwen3 VL] Add LoRA support for Qwen 3 VL (#12165) 2025-11-03 20:32:54 -08:00
Hanming Lu
48d6bea1ea [GDN/SWA] mamba and swa radix cache edge case fix (#12111)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-11-04 11:03:37 +08:00
akhilg-nv
e607850fcf Enable mixed type LayerNorm kernel for NSA indexer (#12044) 2025-11-03 16:50:41 -08:00
Lianmin Zheng
243c064df2 Remove the dependency of nccl.h in symmetric memory (#12571) 2025-11-03 16:11:00 -08:00
b8zhong
d31d48b341 update usage of trtllm_fp8_per_tensor_scale_moe (#12569) 2025-11-03 14:25:32 -08:00
fzyzcjy
8834260739 Super tiny dump server info such as args in bench for post analysis (#12550) 2025-11-03 14:24:08 -08:00
fzyzcjy
fd7a72d62d Super tiny allow profile activities in bench_serving (#12549) 2025-11-03 14:23:18 -08:00
Yi Zhang
21a8fa16ea tiny optimize for bench serving (#12553) 2025-11-03 14:13:18 -08:00
Lianmin Zheng
7a21d8b276 Reduce the overhead of nccl symmetric memory (#12524)
Co-authored-by: Nicolas Castet <ncastet@nvidia.com>
2025-11-03 11:56:27 -08:00
Jonah Bernard
6ef23b9833 [Test] Add parameters to SRTRunner (#12227) 2025-11-03 11:20:56 -08:00
fzyzcjy
385599cb04 Fix error when calling quantization (#12548) 2025-11-03 10:17:43 -08:00
Yueyang Pan
952fbe47cb fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk (#11330)
Signed-off-by: PanJason <pyyjason@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
2025-11-03 09:26:03 -08:00
Liangsheng Yin
edb2569356 [hot-fix] Fix broken CI (#12564) 2025-11-04 00:03:25 +08:00
Liangsheng Yin
3529c061bb [spec v2] Fix output repetition by speculative sampling error (#12561) 2025-11-03 23:00:17 +08:00
harrisonlimh
ffb32a8548 Conditionally recapture cuda graph after model weight update from disk (#12060) 2025-11-03 05:51:27 -08:00