4323 Commits

Author SHA1 Message Date
Yi Zhang
21a8fa16ea tiny optimize for bench serving (#12553) 2025-11-03 14:13:18 -08:00
Lianmin Zheng
7a21d8b276 Reduce the overhead of nccl symmetric memory (#12524)
Co-authored-by: Nicolas Castet <ncastet@nvidia.com>
2025-11-03 11:56:27 -08:00
Jonah Bernard
6ef23b9833 [Test] Add parameters to SRTRunner (#12227) 2025-11-03 11:20:56 -08:00
fzyzcjy
385599cb04 Fix error when calling quantization (#12548) 2025-11-03 10:17:43 -08:00
Yueyang Pan
952fbe47cb fix: fix the bug which leads qwen2_5_vl to crash with mixed_chunk (#11330)
Signed-off-by: PanJason <pyyjason@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
2025-11-03 09:26:03 -08:00
Liangsheng Yin
edb2569356 [hot-fix] Fix broken CI (#12564) 2025-11-04 00:03:25 +08:00
Liangsheng Yin
3529c061bb [spec v2] Fix output repetition by speculative sampling error (#12561) 2025-11-03 23:00:17 +08:00
harrisonlimh
ffb32a8548 Conditionally recapture cuda graph after model weight update from disk (#12060) 2025-11-03 05:51:27 -08:00
Atream
14d8064803 fix: Fix KTransformers hybrid inference with int8 quantization and format (#12536) 2025-11-03 04:59:39 -08:00
yinghui
de0b10cf5c fix: move dummy format loader check before quantization checks (#12532) 2025-11-02 23:41:30 -08:00
Baizhou Zhang
6e29446e45 [hotfix] Remove flashinfer-jit-cache from pyproject (#12530) 2025-11-02 22:11:05 -08:00
Yineng Zhang
0c3543d7d5 chore: upgrade flashinfer 0.5.0 (#12523)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-11-02 20:54:12 -08:00
Haian Huang(深度眸)
65f1d065c5 [Bug] Fix Intern-S1 model accuracy and support /generate interface with input_ids (#12367) 2025-11-02 20:22:33 -08:00
Johnsonms
9434a0e50f [Refact] Remove hardcoded KV cache dimension in MLATokenToKVPool (#12502) 2025-11-02 19:49:53 -08:00
Lianmin Zheng
20315697f4 move all get_stream in sgl_kernel to c++ to reduce the launch overhead (#12521) 2025-11-02 13:15:05 -08:00
fzyzcjy
c9db79117f Super tiny fix naming in bench serving scripts (#12515) 2025-11-02 12:43:10 -08:00
Hanming Lu
66fb9b1307 [ServerArgs] allow --mamba-ssm-dtype extend (#12481) 2025-11-02 11:50:04 -08:00
Yuan Luo
819fc59123 Add prefix for torch symm mem (#12506)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-11-02 11:23:05 -08:00
kousakawang
7efd8b3d1f [FEAT] Shared mem pool based cuda ipc for multi-modal data transport (#11917)
Co-authored-by: kousakawang <wanghanpei@bytedance.com>
Co-authored-by: Yuan Luo <4908075+yuan-luo@users.noreply.github.com>
2025-11-02 16:46:37 +08:00
Ho-Ren (Jack) Chuang
76196b3cbf feat: Add FP4 (E2M1) KV Cache Support with Quantization Utilities for MLA (#10078)
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Co-authored-by: Yichen Wang <yichen.wang@bytedance.com>
2025-11-01 22:24:58 -07:00
Binyao Jiang
3451fc3280 [Feature] Qwen3-Next & FLA: Support MTP topk>1; Up to 6% faster (#11133)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-11-01 19:47:56 -07:00
Zhihao Lyu
c550ab9125 [Ascend] Add Ascend NPU support for sglang.check_env & rework proposal (#11052)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2025-11-01 19:26:45 -07:00
Xun Sun
0afd68321b Update Mooncake EP's a2a interface (#12391) 2025-11-01 18:48:47 -07:00
Johnsonms
6f858930c8 [Bug] test_flashattn_mla_backend errors in Hopper #12487 (#12488) 2025-11-01 18:28:06 -07:00
hzh0425
6b634493c3 [HICache / PD]: Support offloading incremental KV cache in decode side. (#11966) 2025-11-01 14:59:37 -07:00
Xinyuan Tong
d2a8f71c2f [feat] Add SGLANG_TOOL_STRICT_LEVEL for tool-call behavior control (#12423)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-11-01 13:15:02 -07:00
Ke Bao
69193f7122 Filter tokenizer warning for kimi models (#12485) 2025-11-01 16:27:31 +08:00
yinghui
d5b6e50fe8 perf: trtllm mla performance minor improvements (#12435) 2025-10-31 22:48:02 -07:00
Liangsheng Yin
9632e48f5d [hot fix] Remove from python.sglang.xxx (#12483) 2025-11-01 11:00:05 +08:00
Qiaolin Yu
59cce5941a Use sgl fp4 quant kernel by default (#12482) 2025-10-31 19:51:28 -07:00
Surya-Gunukula
795e98f8a6 Forward unknown tool calls instead of dropping (#12226) 2025-11-01 02:10:35 +00:00
Shangming Cai
358ae3563d Tiny fix eos handling for PD disaggregation (#12334)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-10-31 17:57:10 -07:00
sglang-bot
41c10e67fc chore: bump SGLang version to 0.5.4.post2 (#12439) 2025-10-31 17:38:50 -07:00
Xinyuan Tong
0bfe1d145c fa3 & trtllm_mha spec overlap (#11874) 2025-10-31 17:38:13 -07:00
Ke Bao
a4bf5c6ad2 Support Kimi Linear (#12469)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-10-31 14:03:35 -07:00
fzyzcjy
30ad107028 Try to allow NCCL cumem for multi node nvlink case (#11987) 2025-10-31 12:48:25 -07:00
Ke Bao
f7f9e41b36 Fix run benchmark (#12473) 2025-11-01 02:39:48 +08:00
ishandhanani
263eab9f5d fix: dummy health check server not accessible on non-zero rank nodes (#12297) 2025-10-31 11:34:57 -07:00
fzyzcjy
25257d8e00 Tiny assert no running requests when releasing memory to avoid IMA (#12341) 2025-11-01 01:28:53 +08:00
daniel, chen
cf0c24150a add served model name in bench serving (#12428) 2025-11-01 01:28:11 +08:00
huangtingwei
5538e05cb1 fix default env var for mooncake store (#12429) 2025-11-01 01:25:33 +08:00
Yuan Luo
c30ebb9300 [VLM] Optimize async mm data process mechanism (#12066)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-11-01 01:24:53 +08:00
ykcombat
41efcaeb45 [Feature] PD-Multiplexing Context and Scheduler, lazy import spatial. (#12275) 2025-11-01 00:40:01 +08:00
0xNullPath
70562969b9 [Bug] OOM (Out-of-Memory) errors for extreme testing scenarios (min_tokens=2) (#11757)
Signed-off-by: Yan Lu <luyan@nvidia.com>
2025-11-01 00:28:41 +08:00
Ke Bao
0095e01874 Fix lint in deepseek-ocr (#12470) 2025-11-01 00:08:19 +08:00
Xinyuan Tong
684864814b Feat: deepseek-ocr logits processor (#12415)
Co-authored-by: xinyuant <xinyuant@usc.edu>
2025-10-31 23:35:22 +08:00
sjtu_shenhai
410225b719 [Bug fix] Fix severe memory waste issue with torch.empty pin_memory (#12266) 2025-10-31 21:30:37 +08:00
Liangsheng Yin
2c9aebea70 Simplify watchdog (#12463) 2025-10-31 21:17:38 +08:00
Kindyaa
bc741073a3 fix:watchdog thread exception (#12328) 2025-10-31 20:54:50 +08:00
Yuhong Guo
2f6af1a3de Enable bailing_moe to support TP=16 (#12369) 2025-10-31 19:32:49 +08:00