Commit Graph

4840 Commits

Author SHA1 Message Date
PiteXChen
dc7bdc7329 bugfix[schedule]: Excessive preemption occurs when preempting running requests to schedule new prefill requests. (#12494)
Signed-off-by: CLFutureX <chenyongqyl@163.com>
2025-11-30 22:29:26 +08:00
Liangsheng Yin
0a9d64530d Support grammar + spec + reasoning (#14163) 2025-11-30 21:19:57 +08:00
fzyzcjy
340c613ab5 Support numactl bind for CPU and memory before process starts (#14156) 2025-11-30 17:00:33 +08:00
fzyzcjy
36b729c2b8 Implement profiler v2 and fix stage mixture bug (#14148) 2025-11-30 16:59:52 +08:00
Tianhao Zhou
67e6ef4b2d feat: longcat flash add aux layers capture for eagle3 (#14161) 2025-11-30 00:50:55 -08:00
strgrb
65ba5ab8b1 add cpp files for cpp_radix_tree to pyproject.toml. (#14052) 2025-11-30 13:05:04 +08:00
WenhaoZhang
990023e59b [diffusion] lora: Fix LoRA weight merging for torch.nn.Linear layers from diffusers modules (#14150)
Co-authored-by: niehen6174 <niehen.6174@gmail.com>
2025-11-30 12:44:12 +08:00
fzyzcjy
0ae4b1ad81 Show errors when misusing env variables (#14154) 2025-11-30 10:57:35 +08:00
fzyzcjy
94cd64a7b0 Support checking fp8 params in weight_checker (#14147) 2025-11-30 09:08:59 +08:00
fzyzcjy
b870271a50 Fix spec v2 does not support RL update weights from tensor (#14146) 2025-11-30 09:08:05 +08:00
fzyzcjy
22ee9b0111 Super tiny add more info in dumper (#14145) 2025-11-30 09:07:39 +08:00
fzyzcjy
9d0e5f1f74 Tiny fix DeepGEMM precompile rank check (#14136) 2025-11-30 09:07:17 +08:00
Kangyan-Zhou
1d3d8b3418 Fix Minimax M2 loading issue (#13956) 2025-11-29 17:07:19 -05:00
Lianmin Zheng
155a9e7237 Fix condition for streaming output_ids in tokenizer manager (#13759)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-11-29 13:56:15 -08:00
gongwei-130
3339c81072 fix RuntimeError: RMSNorm failed with error code an illegal memory access was encountered (#14135) 2025-11-29 12:17:41 -08:00
Yuhao Yang
f03ea34a3d add runtime check for PyTorch 2.9.1 + CuDNN < 9.15 to prevent Conv3d performance issues (#14119) 2025-11-29 10:05:54 -05:00
fzyzcjy
4cafc835d3 Super tiny fix typo (#14131) 2025-11-29 21:08:31 +08:00
Mick
c6a52f4411 [diffusion] chore: add resolution shortcuts for sampling params (#14129) 2025-11-29 18:00:21 +08:00
elvischenv
848ee57067 feat: support flashinfer kernel autotune (#12306)
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2025-11-29 00:05:37 -08:00
Cheng Wan
0fe74af563 Remove incorrect deep_gemm assertions from server_args.py (#14113)
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
2025-11-28 20:25:39 -08:00
Mick
0a362d653f [diffusion] log: unify generation performance logging (#14117) 2025-11-29 12:21:59 +08:00
fjybiocs
143b57b805 enable piecewise cuda graph for prefill server (#13377)
Co-authored-by: serverance.fu <serverance.fu@temu.com>
2025-11-29 12:09:26 +08:00
Yan Ru Pei
f446b51c41 fix: malformed KV events for NVIDIA Dynamo (#13488)
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-11-28 14:55:20 -08:00
Tomer Shmilovich
11b6217aee Fix NIXL OBJ desciptors (#10712)
Co-authored-by: Tomer Shmilovich <tshmilovich@login-eos01.eos.clusters.nvidia.com>
2025-11-28 11:32:07 -08:00
Yuhao Yang
841eb29d3d [diffusion] model: support z-image (#14067) 2025-11-28 21:48:31 +08:00
fzyzcjy
45cf575852 Fix overlap scheduler not take effect when outputing logprobs (#14096) 2025-11-28 18:15:56 +08:00
Mick
0e8ce1e832 [diffusion] refactor: clean useless files (#14094) 2025-11-28 18:14:00 +08:00
Lzhang-hub
ea1e9f6b3c feat: support qwen3_vl vision model dp (#13724) 2025-11-28 17:29:07 +08:00
Lzhang-hub
f6e37d3edb [Bugfix] qwen2.5-vl spec decode accept_len low (#13904)
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
2025-11-28 17:26:32 +08:00
vipwangerxiao
ab9a46d462 Support configuring the request limit per receiving poll (#14076)
Co-authored-by: Peng Wang <peng_wang@linux.alibaba.com>
Co-authored-by: Feng Su <225349073+sufeng-buaa@users.noreply.github.com>
2025-11-28 16:14:21 +08:00
shuwenn
621061f017 [Bugfix] input prompt was not logged (#13936) 2025-11-28 16:00:51 +08:00
Aleksandr Krotov
7daddcdb58 Fix structural_tag tool call with null schema (#14006) 2025-11-27 23:04:16 -08:00
Mick
951028968c [diffusion] refactor: refactor ComponentLoader and support loading native models from diffusers and transformers (#13205) 2025-11-28 14:17:32 +08:00
Mick
3543a04a48 [diffusion] refactor: refactor condition image resize logic (#14079) 2025-11-28 14:06:34 +08:00
fzyzcjy
21af8e73ad Super tiny add comments to SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK (#14048) 2025-11-27 22:16:43 +08:00
Baizhou Zhang
7ab548ef64 [2/2] Refactor DeepGeem requant for FP8 FusedMoE on Blackwell (#13960) 2025-11-27 09:00:26 -05:00
Yixin Dong
6350042696 feat: Naive support Spec V2 + Constrained Decoding (#13425)
Signed-off-by: Ubospica <ubospica@gmail.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2025-11-27 20:31:46 +08:00
fzyzcjy
25758647b1 Support sanity checking weight consistency especially for RL (#13854) 2025-11-27 20:25:12 +08:00
fzyzcjy
2bc8ee8b74 Tiny support 3D tensors in inverse_transform_scale_ue8m0 (#14002) 2025-11-27 20:20:45 +08:00
Jimmy
ab843ced31 [Feat]Add scheduler recv skipper weights to environment configuration (#13855) 2025-11-27 18:16:11 +08:00
Mick
6edffc6391 [diffusion] perf: improve black-forest-labs/FLUX.2-dev (#14040) 2025-11-27 14:49:52 +08:00
gaopengff
077ca70ee4 [Intel XPU]Add xpu support for get_device_memory_capacity (#13895) 2025-11-26 20:55:52 -08:00
Qiaolin Yu
7cb04dc0e5 Use trtllm mha decode kernel for target_verify in speculative decoding (#13976)
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2025-11-26 20:40:34 -08:00
sunxxuns
5443db8759 fix: Fix AMD CI failures with HIP layernorm and PyPI connectivity (#13814)
Co-authored-by: root <root@mi300x8-005.atl1.do.cpe.ice.amd.com>
2025-11-27 11:30:37 +08:00
Stefan He
9f340ab1fb [Piecewise] support disable decode cuda graph when enable piecewise cuda graph (#13965) 2025-11-26 18:35:59 -08:00
alisonshao
6330d6641b Fix flashinfer cutlass MoE output shape for non-FP4-packed inputs (#14028) 2025-11-26 18:09:02 -07:00
Sam
91e8dc371a [Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13761 (#13761)
Co-authored-by: Kaixi Hou <kaixih@nvidia.com>
2025-11-26 17:53:45 -07:00
Lianmin Zheng
231df4b0d4 Cleanup server args (#14027) 2025-11-26 16:32:41 -08:00
ShawnY112358
5155016b56 [feat] update bucketed weights from distributed (#13824)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
2025-11-26 15:30:45 -08:00
Netanel Haber
082b54c689 Support nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 (and nvidia/C-RADIOv2-H) (#12277) 2025-11-26 16:28:52 -07:00