4880 Commits

Author SHA1 Message Date
Mick
1f930cd23d [diffusion] CI: add testcase-wise retry mechanism (#14261) 2025-12-02 11:06:12 +08:00
Kartik Ramesh
11ce05163d Fix NIXL exception message (#14172) 2025-12-02 10:39:45 +08:00
Stefan He
8fe8b63576 Revert "Try to remove wrong logic about max total token in spec decoding" (#14259) 2025-12-01 18:18:03 -08:00
Yuan Luo
26aebf83d3 [VLM] Support Piecewise CUDA Graph for Qwen3-Omni-MOE (#14222)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-12-02 10:12:10 +08:00
Mick
3ab8ae6847 [diffusion] fix: fix Flux.2 condition image resize (#14232)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-02 10:05:44 +08:00
Lianmin Zheng
796d82b107 [Auto Sync] Add max_total_num_tokens metric: Update scheduler_metrics_mixin.py, collector.py (20251202) (#14256)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dan Zheng <dzheng@x.ai>
2025-12-01 16:34:34 -08:00
Lianmin Zheng
1da59e8304 [Auto Sync] optionally disable fake register in Update fp8_kernel.py (20251202) (#14255)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: gauravjain14 <41287729+gauravjain14@users.noreply.github.com>
2025-12-01 16:11:12 -08:00
TomerBN-Nvidia
02af51e4fc Support fp4 fp8 non gated moe (#13794)
Co-authored-by: Roi Koren <roik@nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
2025-12-01 15:26:28 -08:00
Zhiyu
079b173853 Fix a distributed initialization error (#13843)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-12-01 15:10:05 -08:00
YAMY
1f2b84d28d Fix NSA Bug in Centralize NSA Dispatch Logic (#14245) 2025-12-01 13:18:18 -08:00
ishandhanani
07821352fb Revert "Skip weight loading in deepgemm compilation" (#14241) 2025-12-01 12:59:09 -08:00
Byron Hsu
edbeaf3b88 [MM][style] rename inputs_embeds to input_embeds for consistency (#14240) 2025-12-01 11:36:51 -08:00
liupeng374
2e8f54e61e [spec-overlap] bugfix for pd disaggregation and npu (#14088)
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
2025-12-01 22:58:20 +08:00
fzyzcjy
45264554f3 Super tiny fix typo (#14219) 2025-12-01 20:19:17 +08:00
Liangsheng Yin
a2423052f6 Add cuda event based on waiting value (#14214) 2025-12-01 18:51:44 +08:00
Lianmin Zheng
bc3d2a85af [Minor] update docs (#14212) 2025-12-01 02:33:58 -08:00
fzyzcjy
d815d00248 Tiny call cudaProfilerStart only on first rank in node (#14211) 2025-12-01 18:18:45 +08:00
Xiaoyu Zhang
fa9021b21f fix: Increase FlashInfer workspace size for Qwen3VL models (#14173) 2025-12-01 17:54:23 +08:00
Xiaoyu Zhang
9c80072845 Add peak output tokens per second in bench_serving (#14165)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-12-01 17:47:54 +08:00
Yuan Luo
630a693081 [VLM] Boost Memory Pool based CUDA IPC (#14123)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-12-01 17:17:46 +08:00
Mick
7ce8faae28 [diffusion] refactor: remove hard-code of instanceof on PipelineConfig (#14186)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-01 16:35:34 +08:00
fzyzcjy
de153cf76a Fix speculative decoding error when retracting (#14180) 2025-12-01 15:30:13 +08:00
fzyzcjy
f4a0c5c76b Try to remove wrong logic about max total token in spec decoding (#14167) 2025-12-01 15:29:58 +08:00
Binyao Jiang
0f8e53947d [Piecewise] Use same global graph memory pool as the main cuda graph … (#14044)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: BBuf <1182563586@qq.com>
2025-11-30 23:04:10 -08:00
fzyzcjy
e8ba5a668c Support profiling only prefill or decode without the other (#14182) 2025-12-01 14:46:30 +08:00
fzyzcjy
a2960bdd6b Super tiny allow millisecond precision in logging (#14183) 2025-12-01 14:46:09 +08:00
fzyzcjy
487c8d4df3 Tiny add several args to bench serving (#14181) 2025-12-01 14:45:47 +08:00
fzyzcjy
f87b8eab23 Tiny fix transform_scale_ue8m0 wrong output in some scenarios (#14003) 2025-12-01 14:45:27 +08:00
Minglei Zhu
e8542db558 [piecewise] move piecewise_cuda_graph_runner init to model_runner initialize (#14034)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
2025-11-30 22:16:04 -08:00
Lianmin Zheng
6df1e8d628 [Auto Sync] Update backend.py (20251130) (#14153)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
2025-11-30 22:15:02 -08:00
qichu-yun
bd0e690857 [Feature] Enable PTPC FP8 for compressed tensors moe (aiter kernel) (#12181) 2025-11-30 21:54:28 -08:00
Byron Hsu
0825d7f4c6 [piecewise] Refactor VLM to support input embed buffer and remove external embedder hack (#14155) 2025-11-30 21:43:09 -08:00
Yuhao Yang
0b9dbea593 [diffusion] chore: improve z-image (#14104) 2025-12-01 12:26:17 +08:00
Uranus
982db4ebac Feat: GLM-4.6 supports shared experts fusion (#13873)
Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com>
Co-authored-by: Kevin-XiongC <kevin_xiong1997@outlook.com>
Co-authored-by: Mingyi Jin <jinmingyi1998@sina.cn>
2025-12-01 11:33:18 +08:00
Teng Ma
f5f3a5d98c [PD] Support json file configuration for Transfer Engine (#14059)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-12-01 10:47:33 +08:00
YAMY
decb48965d [DeepSeekV3.2] Enable pure TP & Partial DP Attention (#13646) 2025-11-30 15:59:23 -08:00
Fan Yin
c72f0756d2 Fix: fix flashmla fp8 kv cache acc error (#13841)
Co-authored-by: ybyang <ybyang7@iflytek.com>
2025-11-30 13:38:19 -08:00
Baizhou Zhang
f1115cf58d Revert "[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs" (#14171) 2025-11-30 12:49:46 -08:00
Baizhou Zhang
7b03cc6482 [Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs (#14065) 2025-11-30 10:11:42 -08:00
Dongjoo Seo
c15c864b6f Fix LMCache unit test and init bug (#14005)
Signed-off-by: DongDongJu <commisori28@gmail.com>
2025-11-30 23:57:32 +08:00
PiteXChen
dc7bdc7329 bugfix[schedule]: Excessive preemption occurs when preempting running requests to schedule new prefill requests. (#12494)
Signed-off-by: CLFutureX <chenyongqyl@163.com>
2025-11-30 22:29:26 +08:00
Liangsheng Yin
0a9d64530d Support grammar + spec + reasoning (#14163) 2025-11-30 21:19:57 +08:00
fzyzcjy
340c613ab5 Support numactl bind for CPU and memory before process starts (#14156) 2025-11-30 17:00:33 +08:00
fzyzcjy
36b729c2b8 Implement profiler v2 and fix stage mixture bug (#14148) 2025-11-30 16:59:52 +08:00
Tianhao Zhou
67e6ef4b2d feat: longcat flash add aux layers capture for eagle3 (#14161) 2025-11-30 00:50:55 -08:00
strgrb
65ba5ab8b1 add cpp files for cpp_radix_tree to pyproject.toml. (#14052) 2025-11-30 13:05:04 +08:00
WenhaoZhang
990023e59b [diffusion] lora: Fix LoRA weight merging for torch.nn.Linear layers from diffusers modules (#14150)
Co-authored-by: niehen6174 <niehen.6174@gmail.com>
2025-11-30 12:44:12 +08:00
fzyzcjy
0ae4b1ad81 Show errors when misusing env variables (#14154) 2025-11-30 10:57:35 +08:00
fzyzcjy
94cd64a7b0 Support checking fp8 params in weight_checker (#14147) 2025-11-30 09:08:59 +08:00
fzyzcjy
b870271a50 Fix spec v2 does not support RL update weights from tensor (#14146) 2025-11-30 09:08:05 +08:00