Elfie Guo
|
a1d5bc4cce
|
Avoid using flashinfer_allreduce_fusion when dp attention is enabled. (#11632)
|
2025-10-26 12:31:14 -07:00 |
|
Zijian Zhang
|
a8023891f6
|
model: support NVILA and NVILA Lite (#10399)
|
2025-10-26 09:58:09 -07:00 |
|
fzyzcjy
|
0103f374ba
|
Support DeepGEMM for deterministic inference (#12142)
|
2025-10-26 22:36:17 +08:00 |
|
zyksir
|
96a5a949f6
|
[Fix] fix allreduce bug in Piecewise Graph (#12106)
|
2025-10-26 21:15:48 +08:00 |
|
Liangsheng Yin
|
ea385ae85a
|
Fix ITL metrics when using openai endpoint with spec (#12156)
|
2025-10-26 18:06:25 +08:00 |
|
Kai-Hsun Chen
|
6371f7af27
|
[quantization] AWQ Marlin doesn't work when dtype is bfloat16 (#11494)
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
|
2025-10-26 15:49:45 +08:00 |
|
Liangsheng Yin
|
8491c794ad
|
[misc] dependencies & environment flag (#12113)
|
2025-10-26 14:52:35 +08:00 |
|
Liangsheng Yin
|
bda3758fac
|
[log] Make forward iter count optional (#12116)
|
2025-10-26 14:51:07 +08:00 |
|
Lianmin Zheng
|
7b36c47b3b
|
Clean up attention backend selection code & Other minor rename (#12136)
|
2025-10-25 23:50:12 -07:00 |
|
Kaixi Hou
|
ff60406429
|
[NVIDIA] Change default quant method for model_opt (#11991)
|
2025-10-25 22:04:57 -07:00 |
|
fzyzcjy
|
c001deba37
|
Make bmm batch invariant injection optional (#12118)
|
2025-10-26 10:18:35 +08:00 |
|
Lianmin Zheng
|
8e70064c37
|
Clean up server launch code and multi tokenizer (#12132)
|
2025-10-25 16:40:27 -07:00 |
|
Baizhou Zhang
|
4b0ac1d52a
|
Update sgl-kernel version to 0.3.16.post4 (#12125)
|
2025-10-25 14:33:33 -07:00 |
|
YAMY
|
c8492978a1
|
Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation (#12115)
|
2025-10-25 12:28:26 -07:00 |
|
Lianmin Zheng
|
4caca1ba04
|
Clean up server args & Add CI scripts (#12124)
|
2025-10-25 11:53:57 -07:00 |
|
Lianmin Zheng
|
ea13cb1452
|
[Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) (#12083)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
|
2025-10-24 22:28:01 -07:00 |
|
vipwangerxiao
|
8982418957
|
Fix 'KeyError' for per_token expert distribution recorder (#9501)
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Co-authored-by: Peng Wang <rocking@linux.alibaba.com>
|
2025-10-25 03:28:50 +00:00 |
|
fzyzcjy
|
20bd2271e2
|
Support true on-policy (#12058)
|
2025-10-25 10:23:42 +08:00 |
|
Cheng Wan
|
649949807f
|
[10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE (#12054)
|
2025-10-24 19:16:17 -07:00 |
|
fzyzcjy
|
d7056c5236
|
Enhance tests in deterministic kernels (#12070)
|
2025-10-25 08:53:22 +08:00 |
|
Jinwu
|
13bf565d60
|
[2/N]Support DeepSeek-R1 w4a8 low latency deepep (#8464)
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: Shangchuan Huang <2510421000@qq.com>
|
2025-10-24 17:41:16 -07:00 |
|
yinghui
|
e51046beaa
|
perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph (#12093)
|
2025-10-24 16:05:44 -07:00 |
|
Minglei Zhu
|
f4b78d137c
|
[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU (#12000)
|
2025-10-24 15:17:28 -07:00 |
|
Jonah Bernard
|
4b046a72d3
|
docs(server-arguments): add allowed options for each argument (#11560)
|
2025-10-24 11:49:20 -07:00 |
|
ishandhanani
|
14203432b4
|
fix(compile_utils, ep_moe): update environment variable and dtype check (#12034)
|
2025-10-24 11:00:12 -07:00 |
|
Yuanhang Sun
|
0bfa394aff
|
[Fix]: HiCache hasher failed when EAGLE mode enabled (#12025)
|
2025-10-24 23:53:13 +08:00 |
|
fzyzcjy
|
e04340bf48
|
Fix multi processing serializer bug (#11958)
|
2025-10-24 22:53:45 +08:00 |
|
Xiaoyu Zhang
|
8470133852
|
[b200] fix piecewise cuda graph launch bug (#12067)
|
2025-10-24 22:36:39 +08:00 |
|
Muqi Li
|
93ef9a094d
|
[Profiler] expand '~' for torch_profiler_output_dir (#11999)
|
2025-10-24 17:20:46 +08:00 |
|
Muqi Li
|
b04cd3d487
|
Add 'gguf' to project dependencies (#12046)
|
2025-10-24 17:16:19 +08:00 |
|
Yuan Luo
|
7ef5d8afd4
|
Revise POINTSV15Chat model (#12049)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-10-24 17:09:45 +08:00 |
|
Qiaolin Yu
|
71d41212e4
|
Fix dpsk-r1-fp4 launching crash (#12063)
|
2025-10-24 17:04:50 +08:00 |
|
Xinyuan Tong
|
b9fb74f3bc
|
fix: bench_serving ITL calculation when using spec-decoding (#12064)
|
2025-10-24 17:02:44 +08:00 |
|
ybyang
|
e15b63a182
|
[Fix] fix missing ipc_name of __getitem__ in some IO structs (#12053)
Signed-off-by: ybyang <ybyang7@iflytek.com>
|
2025-10-24 16:59:14 +08:00 |
|
Yuxuan Zhang
|
4060ed37cb
|
Refactoring GLM-4.5 and GLM-4.5V related implementations (#11800)
|
2025-10-24 08:22:36 +00:00 |
|
fzyzcjy
|
2342605ef0
|
Tiny cleanup send_single (#12056)
|
2025-10-23 23:53:42 -07:00 |
|
Rain Jiang
|
8e797a47f0
|
fix: the hardcode hf repo name comparison for deepseek-ocr (#12031)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-23 21:37:56 -07:00 |
|
Zaili Wang
|
aa3003f116
|
Add gguf dependency for cpu/xpu (#12041)
|
2025-10-23 21:13:17 -07:00 |
|
Yongfei Xu
|
4793ec7d1a
|
Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once (#10953)
|
2025-10-23 20:58:10 -07:00 |
|
Zaili Wang
|
92009bd28e
|
fix: fix MMMU loading issue (#11759)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-23 20:21:38 -07:00 |
|
Baizhou Zhang
|
4ef981e2b6
|
Revert "[Fix] Fix lint to pass CI" (#12042)
|
2025-10-23 19:44:58 -07:00 |
|
Baizhou Zhang
|
69ed8b67a8
|
[Fix] Fix lint to pass CI (#12037)
|
2025-10-23 19:39:38 -07:00 |
|
narutolhy
|
1801cd199f
|
support more model in piecewise cuda graph (#11745)
|
2025-10-24 10:31:39 +08:00 |
|
Lianmin Zheng
|
ffc722a690
|
Revert "lang: support direct video inference" (#12038)
|
2025-10-23 19:21:31 -07:00 |
|
thelongestusernameofall
|
49afb3d9d9
|
Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 (#11909)
Co-authored-by: Chengxing Xie <xiechengxing34@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-23 19:12:40 -07:00 |
|
b8zhong
|
f80371ff8c
|
Use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk (#11816)
|
2025-10-23 19:12:15 -07:00 |
|
Jonah Bernard
|
62eff37ba1
|
Refactor Triton-kernel MoE runner integration (#11795)
|
2025-10-23 18:47:28 -07:00 |
|
b8zhong
|
47e12e082e
|
Enable Llama 4 + TRTLLM MHA (#12003)
|
2025-10-23 18:22:58 -07:00 |
|
Mick
|
823b442945
|
lang: support direct video inference (#9936)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
|
2025-10-23 18:12:39 -07:00 |
|
Fan Yin
|
14a4d80e57
|
[8/n] decouple quantization impl from vllm dependency - gguf srt (#11964)
Co-authored-by: Peng Zhang <zhuangsen.zp@antgroup.com>
|
2025-10-23 18:12:00 -07:00 |
|