Yuxuan Zhang
|
4060ed37cb
|
Refactoring GLM-4.5 and GLM-4.5V related implementations (#11800)
|
2025-10-24 08:22:36 +00:00 |
|
fzyzcjy
|
2342605ef0
|
Tiny cleanup send_single (#12056)
|
2025-10-23 23:53:42 -07:00 |
|
Rain Jiang
|
8e797a47f0
|
fix: the hardcode hf repo name comparison for deepseek-ocr (#12031)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-23 21:37:56 -07:00 |
|
Zaili Wang
|
aa3003f116
|
Add gguf dependency for cpu/xpu (#12041)
|
2025-10-23 21:13:17 -07:00 |
|
Yongfei Xu
|
4793ec7d1a
|
Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once (#10953)
|
2025-10-23 20:58:10 -07:00 |
|
Zaili Wang
|
92009bd28e
|
fix: fix MMMU loading issue (#11759)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-23 20:21:38 -07:00 |
|
Baizhou Zhang
|
4ef981e2b6
|
Revert "[Fix] Fix lint to pass CI" (#12042)
|
2025-10-23 19:44:58 -07:00 |
|
Baizhou Zhang
|
69ed8b67a8
|
[Fix] Fix lint to pass CI (#12037)
|
2025-10-23 19:39:38 -07:00 |
|
narutolhy
|
1801cd199f
|
support more model in piecewise cuda graph (#11745)
|
2025-10-24 10:31:39 +08:00 |
|
Lianmin Zheng
|
ffc722a690
|
Revert "lang: support direct video inference" (#12038)
|
2025-10-23 19:21:31 -07:00 |
|
thelongestusernameofall
|
49afb3d9d9
|
Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 (#11909)
Co-authored-by: Chengxing Xie <xiechengxing34@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-23 19:12:40 -07:00 |
|
b8zhong
|
f80371ff8c
|
Use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk (#11816)
|
2025-10-23 19:12:15 -07:00 |
|
Jonah Bernard
|
62eff37ba1
|
Refactor Triton-kernel MoE runner integration (#11795)
|
2025-10-23 18:47:28 -07:00 |
|
b8zhong
|
47e12e082e
|
Enable Llama 4 + TRTLLM MHA (#12003)
|
2025-10-23 18:22:58 -07:00 |
|
Mick
|
823b442945
|
lang: support direct video inference (#9936)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
|
2025-10-23 18:12:39 -07:00 |
|
Fan Yin
|
14a4d80e57
|
[8/n] decouple quantization impl from vllm dependency - gguf srt (#11964)
Co-authored-by: Peng Zhang <zhuangsen.zp@antgroup.com>
|
2025-10-23 18:12:00 -07:00 |
|
sglang-bot
|
1053e1be17
|
chore: bump SGLang version to 0.5.4 (#12027)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
|
2025-10-23 18:01:40 -07:00 |
|
Roger Young
|
dbd9435dc1
|
Fix mamba radix cache eviction logic in alloc_req_slots (#11616)
Signed-off-by: rogeryoungh <rogeryoungh@foxmail.com>
|
2025-10-23 13:07:43 -07:00 |
|
b8zhong
|
8ae9d4bb41
|
Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" (#12028)
|
2025-10-23 12:42:59 -07:00 |
|
Nicolas Castet
|
1c304aa9bc
|
Log iteration # for prefill and decode (#9366)
|
2025-10-23 12:28:03 -07:00 |
|
Mick
|
770529a731
|
model: support deepseek-ocr (#11891)
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Shi Shuai <126407087+shuaills@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-10-24 03:15:17 +08:00 |
|
ErvinXie
|
39c237f02c
|
Add AWQ quantization support for NPU. (#10158)
Co-authored-by: Alisehen <814073252@qq.com>
Co-authored-by: Yaochen Han <48639761+Alisehen@users.noreply.github.com>
Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
|
2025-10-23 12:08:05 -07:00 |
|
Lianmin Zheng
|
ab07cd3e5a
|
[Auto Sync] Update test_deterministic_utils.py (20251023) (#12022)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
|
2025-10-23 11:20:45 -07:00 |
|
Netanel Haber
|
a98496834b
|
Feature/nano v2 offline modelopt fp8 and nvfp4 (#12018)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
|
2025-10-23 11:16:46 -07:00 |
|
Teng Ma
|
96a5e4dd79
|
[Feature] Support loading weights from ckpt engine worker (#11755)
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com>
Co-authored-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2025-10-23 09:23:30 -07:00 |
|
cctry
|
b0b4f71679
|
[Fix] memory leak by overlap + retract (#11981)
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
|
2025-10-23 22:59:23 +08:00 |
|
Liangsheng Yin
|
6c18addb6f
|
Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" (#12015)
|
2025-10-23 21:27:58 +08:00 |
|
Liangsheng Yin
|
32852fe9e9
|
Move memory runtime checker to mixin class (#12014)
|
2025-10-23 20:53:26 +08:00 |
|
Netanel Haber
|
d6fee73d1f
|
Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 (#11866)
|
2025-10-23 17:29:02 +08:00 |
|
Qiaolin Yu
|
36a4cad7b0
|
Support overlap-spec-v2 with trtllm_mla attention backend (#11821)
|
2025-10-23 16:55:35 +08:00 |
|
yinghui
|
c23eda8589
|
Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend (#11985)
|
2025-10-22 22:44:45 -07:00 |
|
Jue WANG
|
138ff23187
|
Allow to disable batch decoding. (#11944)
|
2025-10-22 21:57:12 -07:00 |
|
blzheng
|
13fb8b5489
|
[CPU] Optimize FP16 decode_attention_cpu (#10652)
|
2025-10-22 21:39:51 -07:00 |
|
Zaili Wang
|
007b849b0e
|
[CPU] misc updates (#11906)
|
2025-10-22 21:10:05 -07:00 |
|
Johnny
|
e7aa4664b3
|
[NVIDIA] Build CUDA 13 (#11299)
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
|
2025-10-22 20:03:12 -07:00 |
|
b8zhong
|
4d4feccbb2
|
[ROCm] Remove vLLM rope dependency & use AITER impl (#11322)
|
2025-10-22 19:17:34 -07:00 |
|
jacky.cheng
|
99c92ff24b
|
[AMD] Support a new flag to disable quant on parallelLinear layer if required (#11811)
|
2025-10-22 19:16:15 -07:00 |
|
Chang Su
|
6ade6a02d4
|
[grpc] Support gRPC standard health check (#11955)
|
2025-10-22 16:59:09 -07:00 |
|
Christian Bahls
|
164302c7df
|
Implement BGE-M3 Sparse Embeddings in SGLang (#10869)
Co-authored-by: Christian Bahls <christian.bahls@planet-ai.de>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-10-22 13:46:16 -07:00 |
|
jiahanc
|
eec9e471ca
|
[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
|
2025-10-22 13:11:16 -07:00 |
|
Lianmin Zheng
|
6d535b719f
|
Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980)
|
2025-10-22 11:50:26 -07:00 |
|
yuho
|
fdcb1d13c5
|
[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977)
|
2025-10-22 11:29:55 -07:00 |
|
Hongbo Xu
|
d7e834d6ba
|
[6/n]decouple quantization implementation from vLLM dependency (#10750)
|
2025-10-23 02:07:55 +08:00 |
|
Fan Yin
|
1d097aac87
|
[Fix] Remove unused import from triton_kernels_moe.py (#11967)
Co-authored-by: Shangming Cai <171321666+shangmingcai@users.noreply.github.com>
|
2025-10-22 21:02:57 +08:00 |
|
996_icu
|
88568c01eb
|
[model] Support POINTSV15Chat (#9651)
Co-authored-by: josephyou <josephyou@tencent.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: root <root@TENCENT64.site>
|
2025-10-22 16:58:17 +08:00 |
|
Hank Han
|
904655c5fd
|
[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606)
Co-authored-by: Xun Sun <UNIDY2002@outlook.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2025-10-22 01:13:31 -07:00 |
|
Xun Sun
|
e028af6998
|
Fix mooncake dispatcher (#11908)
|
2025-10-22 01:11:49 -07:00 |
|
Zhiyu
|
80b2b3207a
|
Enable native ModelOpt quantization support (3/3) (#10154)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
|
2025-10-21 21:44:29 -07:00 |
|
Liangsheng Yin
|
9d61205dac
|
[lint] improve ruff check (#11922)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
|
2025-10-22 11:32:50 +08:00 |
|
Chang Su
|
70f6309cd4
|
[router][grpc] Support v1/responses API (#11926)
|
2025-10-21 17:41:48 -07:00 |
|