Lianmin Zheng
|
fb2e816e83
|
Fix server args for gpt oss so users can override the moe runner backend (#12696)
|
2025-11-05 11:36:59 -08:00 |
|
bigmoyan
|
508d2f7aa2
|
add Kimi k2 reasoning parser (#12702)
Signed-off-by: wangzhengtao <wangzhengtao@msh.team>
|
2025-11-06 00:37:54 +08:00 |
|
Yuxuan Zhang
|
a889c85459
|
[Grammar Fix] GLM-4-MOE self.first_k_dense_replace is undefined. (#12455)
|
2025-11-06 00:03:45 +08:00 |
|
Yuhong Guo
|
4d84f886e7
|
Refactor --debug-tensor-dump-layers to list (#12691)
|
2025-11-05 03:30:01 -08:00 |
|
yinghui
|
dc4f541823
|
fix trtllm_mla attention backend when disabling cuda graph. (#12687)
|
2025-11-05 01:35:02 -08:00 |
|
zejunchen-zejun
|
0648eb482d
|
[Profiler] Add SGLANG_PROFILE_RECORD_SHAPES for recording shapes when profiling (#11641)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
|
2025-11-04 23:41:46 -08:00 |
|
yinghui
|
b88fab3111
|
fix: add seed bench_serving to cache key, remove redundant function definition. (#12680)
|
2025-11-04 23:39:11 -08:00 |
|
Glen Liu
|
cbf23dbbfa
|
[Feature] add --lora-request-distribution arg to bench_serving.py and support skewed and distinct workloads (#12175)
|
2025-11-04 21:41:40 -08:00 |
|
ai-easy-cpu
|
48641435d6
|
fix typo of args description in sglang.profiler (#12486)
Co-authored-by: AI-bot-easy <litchys0123@outlook.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-11-04 20:15:13 -08:00 |
|
Liangsheng Yin
|
44b1b394a4
|
[PD-Disagg] Check finish after pop transferred (#12638)
|
2025-11-05 11:18:09 +08:00 |
|
Kaixi Hou
|
0711d1509b
|
[NVIDIA] Fix cutedsl backend of MoE (#12353)
|
2025-11-04 18:54:55 -08:00 |
|
sglang-bot
|
09938e1f82
|
chore: bump SGLang version to 0.5.4.post3 (#12639)
|
2025-11-04 18:32:11 -08:00 |
|
Nicolas Castet
|
2340798353
|
Register allgather/reducescatter buffers with symm memory (#12572)
|
2025-11-04 17:11:36 -08:00 |
|
soaringk
|
44da737770
|
[fix] Handle escaped characters in GLM tool call parser to prevent double serialization (#12456)
|
2025-11-04 16:48:14 -08:00 |
|
Baizhou Zhang
|
d22d044734
|
Revert "Enable memory saver for hybrid model" (#12648)
|
2025-11-04 16:22:06 -08:00 |
|
Kaixi Hou
|
34f7564df0
|
[NVIDIA] Fix wrong symmetric sizes for fp4 cases (#12640)
|
2025-11-04 14:19:37 -08:00 |
|
Johnsonms
|
1cfbbc42d8
|
[Bug] Fix NSA Backend KV-Buffer Shape Mismatch in DeepSeek-V3.2 (#12645)
|
2025-11-04 13:57:32 -08:00 |
|
Lianmin Zheng
|
55dfb539cf
|
[Auto Sync] Update scheduler_metrics_mixin.py, collector.py (20251104) (#12647)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
|
2025-11-04 13:56:14 -08:00 |
|
Baizhou Zhang
|
42889acbd0
|
[hotfix] Fix deepep w4a8 bug (#12642)
|
2025-11-04 13:55:59 -08:00 |
|
Trevor Morris
|
211f4070e5
|
fix: Lazy import mooncake-ep to fix extra gpu contexts being created (#12641)
|
2025-11-04 12:28:36 -08:00 |
|
Liangsheng Yin
|
befa41a152
|
Fix output_ids inconsistency (#12628)
|
2025-11-05 01:43:08 +08:00 |
|
Liangsheng Yin
|
30b26ee9d0
|
Add io struct naming check back (#12634)
|
2025-11-05 01:15:01 +08:00 |
|
Liangsheng Yin
|
aa797d013d
|
[Test] Merge all constrained decoding tests. (#12633)
|
2025-11-05 00:43:06 +08:00 |
|
Ke Bao
|
7cee07a067
|
Fix skip layer in get_quant_method (#12632)
|
2025-11-04 23:27:46 +08:00 |
|
Yuan Luo
|
bb517fe393
|
[HotFix] Disable torch dynamo for mrope_triton kernel (#12593)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-11-04 23:26:56 +08:00 |
|
fzyzcjy
|
ff0b64e1e6
|
Ensure GPU work is finished when release memory occupation call is finished (#12592)
|
2025-11-04 18:01:27 +08:00 |
|
Liangsheng Yin
|
0678beaaee
|
[spec-v2] Fix incompatibility with constrained decoding (#12615)
|
2025-11-04 17:27:31 +08:00 |
|
Minglei Zhu
|
c14cc47e39
|
[Deterministic] Optimize bmm_batch_invariant op (#12522)
|
2025-11-04 00:33:31 -08:00 |
|
Trevor Morris
|
dbcf85b7f0
|
Add --speculative-moe-runner-backend server arg (#10183)
|
2025-11-04 00:20:56 -08:00 |
|
Zhao Chen
|
d5fa019c36
|
feat: limit peak memory usage when computing logprobs (#6318)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>
|
2025-11-03 23:53:20 -08:00 |
|
Junrong Lin
|
173e0f704f
|
Enable memory saver for hybrid model (#11974)
|
2025-11-04 14:55:26 +08:00 |
|
Lianmin Zheng
|
f600866a44
|
Improve the metrics for PD (#12580)
Co-authored-by: Kan Wu <wukanustc@gmail.com>
Co-authored-by: cctry <shiyang@x.ai>
|
2025-11-03 22:10:57 -08:00 |
|
ishandhanani
|
93be7e863e
|
fix: respect --ignore-eos in PD case for benchmarking (#12597)
|
2025-11-03 21:44:14 -08:00 |
|
fzyzcjy
|
60b0754cc9
|
Tiny fix ExpertDistributionReq error (#11760)
|
2025-11-04 13:39:25 +08:00 |
|
Zhao Chen
|
0b24af4d79
|
test: support return logprobs in bench_offline_throughput test (#12462)
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-11-03 21:38:48 -08:00 |
|
Jonah Bernard
|
a209fb05c1
|
[Qwen3 VL] Add LoRA support for Qwen 3 VL (#12165)
|
2025-11-03 20:32:54 -08:00 |
|
Hanming Lu
|
48d6bea1ea
|
[GDN/SWA] mamba and swa radix cache edge case fix (#12111)
Co-authored-by: yizhang2077 <1109276519@qq.com>
|
2025-11-04 11:03:37 +08:00 |
|
akhilg-nv
|
e607850fcf
|
Enable mixed type LayerNorm kernel for NSA indexer (#12044)
|
2025-11-03 16:50:41 -08:00 |
|
Lianmin Zheng
|
243c064df2
|
Remove the dependency of nccl.h in symmetric memory (#12571)
|
2025-11-03 16:11:00 -08:00 |
|
b8zhong
|
d31d48b341
|
update usage of trtllm_fp8_per_tensor_scale_moe (#12569)
|
2025-11-03 14:25:32 -08:00 |
|
fzyzcjy
|
8834260739
|
Super tiny dump server info such as args in bench for post analysis (#12550)
|
2025-11-03 14:24:08 -08:00 |
|
fzyzcjy
|
fd7a72d62d
|
Super tiny allow profile activities in bench_serving (#12549)
|
2025-11-03 14:23:18 -08:00 |
|
Yi Zhang
|
21a8fa16ea
|
tiny optimize for bench serving (#12553)
|
2025-11-03 14:13:18 -08:00 |
|
Lianmin Zheng
|
7a21d8b276
|
Reduce the overhead of nccl symmetric memory (#12524)
Co-authored-by: Nicolas Castet <ncastet@nvidia.com>
|
2025-11-03 11:56:27 -08:00 |
|
Jonah Bernard
|
6ef23b9833
|
[Test] Add parameters to SRTRunner (#12227)
|
2025-11-03 11:20:56 -08:00 |
|
fzyzcjy
|
385599cb04
|
Fix error when calling quantization (#12548)
|
2025-11-03 10:17:43 -08:00 |
|
Yueyang Pan
|
952fbe47cb
|
fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk (#11330)
Signed-off-by: PanJason <pyyjason@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
|
2025-11-03 09:26:03 -08:00 |
|
Liangsheng Yin
|
edb2569356
|
[hot-fix] Fix broken CI (#12564)
|
2025-11-04 00:03:25 +08:00 |
|
Liangsheng Yin
|
3529c061bb
|
[spec v2] Fix output repetition by speculative sampling error (#12561)
|
2025-11-03 23:00:17 +08:00 |
|
harrisonlimh
|
ffb32a8548
|
Conditionally recapture cuda graph after model weight update from disk (#12060)
|
2025-11-03 05:51:27 -08:00 |
|