4180 Commits

Author SHA1 Message Date
Elfie Guo
a1d5bc4cce Avoid using flashinfer_allreduce_fusion when dp attention is enabled. (#11632) 2025-10-26 12:31:14 -07:00
Zijian Zhang
a8023891f6 model: support NVILA and NVILA Lite (#10399) 2025-10-26 09:58:09 -07:00
fzyzcjy
0103f374ba Support DeepGEMM for deterministic inference (#12142) 2025-10-26 22:36:17 +08:00
zyksir
96a5a949f6 [Fix] fix allreduce bug in Piecewise Graph (#12106) 2025-10-26 21:15:48 +08:00
Liangsheng Yin
ea385ae85a Fix ITL metrics when using openai endpoint with spec (#12156) 2025-10-26 18:06:25 +08:00
Kai-Hsun Chen
6371f7af27 [quantization] AWQ Marlin doesn't work when dtype is bfloat16 (#11494)
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-26 15:49:45 +08:00
Liangsheng Yin
8491c794ad [misc] depdencies & enviroment flag (#12113) 2025-10-26 14:52:35 +08:00
Liangsheng Yin
bda3758fac [log] Make forward iter count optional (#12116) 2025-10-26 14:51:07 +08:00
Lianmin Zheng
7b36c47b3b Clean up attention backend selection code & Other minor rename (#12136) 2025-10-25 23:50:12 -07:00
Kaixi Hou
ff60406429 [NVIDIA] Change default quant method for model_opt (#11991) 2025-10-25 22:04:57 -07:00
fzyzcjy
c001deba37 Make bmm batch invariant injection optional (#12118) 2025-10-26 10:18:35 +08:00
Lianmin Zheng
8e70064c37 Clean up server launch code and multi tokenizer (#12132) 2025-10-25 16:40:27 -07:00
Baizhou Zhang
4b0ac1d52a Update sgl-kernel version to 0.3.16.post4 (#12125) 2025-10-25 14:33:33 -07:00
YAMY
c8492978a1 Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation (#12115) 2025-10-25 12:28:26 -07:00
Lianmin Zheng
4caca1ba04 Clean up server args & Add CI scripts (#12124) 2025-10-25 11:53:57 -07:00
Lianmin Zheng
ea13cb1452 [Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) (#12083)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-10-24 22:28:01 -07:00
vipwangerxiao
8982418957 Fix 'KeyError' for per_token expert distribution recorder (#9501)
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Co-authored-by: Peng Wang <rocking@linux.alibaba.com>
2025-10-25 03:28:50 +00:00
fzyzcjy
20bd2271e2 Support true on-policy (#12058) 2025-10-25 10:23:42 +08:00
Cheng Wan
649949807f [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE (#12054) 2025-10-24 19:16:17 -07:00
fzyzcjy
d7056c5236 Enhance tests in deterministic kernels (#12070) 2025-10-25 08:53:22 +08:00
Jinwu
13bf565d60 [2/N]Support DeepSeek-R1 w4a8 low latency deepep (#8464)
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: Shangchuan Huang <2510421000@qq.com>
2025-10-24 17:41:16 -07:00
yinghui
e51046beaa perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph (#12093) 2025-10-24 16:05:44 -07:00
Minglei Zhu
f4b78d137c [1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU (#12000) 2025-10-24 15:17:28 -07:00
Jonah Bernard
4b046a72d3 docs(server-arguments): add allowed options for each argument (#11560) 2025-10-24 11:49:20 -07:00
ishandhanani
14203432b4 fix(compile_utils, ep_moe): update environment variable and dtype check (#12034) 2025-10-24 11:00:12 -07:00
Yuanhang Sun
0bfa394aff [Fix]: HiCache hasher failed when EAGLE mode enabled (#12025) 2025-10-24 23:53:13 +08:00
fzyzcjy
e04340bf48 Fix multi processing serializer bug (#11958) 2025-10-24 22:53:45 +08:00
Xiaoyu Zhang
8470133852 [b200] fix piecewise cuda graph launch bug (#12067) 2025-10-24 22:36:39 +08:00
Muqi Li
93ef9a094d [Profiler] expand '~' for torch_profiler_output_dir (#11999) 2025-10-24 17:20:46 +08:00
Muqi Li
b04cd3d487 Add 'gguf' to project dependencies (#12046) 2025-10-24 17:16:19 +08:00
Yuan Luo
7ef5d8afd4 Revise POINTSV15Chat model (#12049)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-24 17:09:45 +08:00
Qiaolin Yu
71d41212e4 Fix dpsk-r1-fp4 launching crash (#12063) 2025-10-24 17:04:50 +08:00
Xinyuan Tong
b9fb74f3bc fix: bench_serving ITL calculation when using spec-decoding (#12064) 2025-10-24 17:02:44 +08:00
ybyang
e15b63a182 [Fix] fix missing ipc_name of __getitem__ in some IO structs (#12053)
Signed-off-by: ybyang <ybyang7@iflytek.com>
2025-10-24 16:59:14 +08:00
Yuxuan Zhang
4060ed37cb Refactoring GLM-4.5 and GLM-4.5V related implementations (#11800) 2025-10-24 08:22:36 +00:00
fzyzcjy
2342605ef0 Tiny cleanup send_single (#12056) 2025-10-23 23:53:42 -07:00
Rain Jiang
8e797a47f0 fix: the hardcode hf repo name comparison for deepseek-ocr (#12031)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-23 21:37:56 -07:00
Zaili Wang
aa3003f116 Add gguf dependency for cpu/xpu (#12041) 2025-10-23 21:13:17 -07:00
Yongfei Xu
4793ec7d1a Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once (#10953) 2025-10-23 20:58:10 -07:00
Zaili Wang
92009bd28e fix: fix MMMU loading issue (#11759)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-23 20:21:38 -07:00
Baizhou Zhang
4ef981e2b6 Revert "[Fix] Fix lint to pass CI" (#12042) 2025-10-23 19:44:58 -07:00
Baizhou Zhang
69ed8b67a8 [Fix] Fix lint to pass CI (#12037) 2025-10-23 19:39:38 -07:00
narutolhy
1801cd199f support more model in piecewise cuda graph (#11745) 2025-10-24 10:31:39 +08:00
Lianmin Zheng
ffc722a690 Revert "lang: support direct video inference" (#12038) 2025-10-23 19:21:31 -07:00
thelongestusernameofall
49afb3d9d9 Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 (#11909)
Co-authored-by: Chengxing Xie <xiechengxing34@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-23 19:12:40 -07:00
b8zhong
f80371ff8c Use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk (#11816) 2025-10-23 19:12:15 -07:00
Jonah Bernard
62eff37ba1 Refactor Triton-kernel MoE runner integration (#11795) 2025-10-23 18:47:28 -07:00
b8zhong
47e12e082e Enable Llama 4 + TRTLLM MHA (#12003) 2025-10-23 18:22:58 -07:00
Mick
823b442945 lang: support direct video inference (#9936)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2025-10-23 18:12:39 -07:00
Fan Yin
14a4d80e57 [8/n] decouple quantization impl from vllm dependency - gguf srt (#11964)
Co-authored-by: Peng Zhang <zhuangsen.zp@antgroup.com>
2025-10-23 18:12:00 -07:00