Eva20150932-atlascloud
|
7c38eca1e4
|
feat: DeepSeek new v3.2 encoding (#14249)
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
|
2025-12-02 11:41:05 -08:00 |
|
Quanfeng Li
|
427b08e24d
|
Init TBO with dp_padded batch (#11423)
Co-authored-by: Cheng Wan <wan4ch@gmail.com>
Co-authored-by: Yuhao Yao <37280700+yuhyao@users.noreply.github.com>
|
2025-12-02 10:34:26 -08:00 |
|
alisonshao
|
0141ca370f
|
Revert PR #14044: Restore separate memory pool for piecewise CUDA graph (#14278)
|
2025-12-02 09:53:16 -08:00 |
|
alisonshao
|
25a6be4930
|
Fix duplicate download log messages in multi-process environment (#14299)
|
2025-12-02 09:33:18 -08:00 |
|
Mick
|
9530b76630
|
[diffusion] refactor: simplify DmdDenoisingStage (#14269)
|
2025-12-02 18:59:40 +08:00 |
|
Jinyan Chen
|
3067b3f050
|
[diffusion] chore: improve model info registration and searching strategy (#14281)
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
|
2025-12-02 18:28:59 +08:00 |
|
Lianmin Zheng
|
64092c8b55
|
[Auto Sync] Rename is_hybrid to is_hybrid_swa (#14252)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
Co-authored-by: Hanming Lu <hanming@x.ai>
|
2025-12-01 23:24:24 -08:00 |
|
sglang-bot
|
63b9300f00
|
chore: bump sgl-kernel version to 0.3.18.post2 (#14244)
|
2025-12-01 23:14:12 -08:00 |
|
b8zhong
|
236a7c2370
|
fix trtllm mla spec (#13738)
Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>
|
2025-12-01 22:16:25 -08:00 |
|
Roger Young
|
3dabd609fb
|
Optimize topk sigmoid in minimax_m2 (#14047)
Co-authored-by: xuebi <xuebi@minimaxi.com>
|
2025-12-02 14:07:12 +08:00 |
|
Kevin Li
|
c9e2090101
|
fix: Support PP for Mistral Small 3.1 (#14254)
|
2025-12-02 13:04:14 +08:00 |
|
kun-llfl
|
106df4eac5
|
Fix mrope_positions size when req is retracted (#13700)
Signed-off-by: Kun(llfl) <i@imux.top>
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
|
2025-12-02 11:38:20 +08:00 |
|
Mick
|
1f930cd23d
|
[diffusion] CI: add testcase-wise retry mechanism (#14261)
|
2025-12-02 11:06:12 +08:00 |
|
Kartik Ramesh
|
11ce05163d
|
Fix NIXL exception message (#14172)
|
2025-12-02 10:39:45 +08:00 |
|
Stefan He
|
8fe8b63576
|
Revert "Try to remove wrong logic about max total token in spec decoding" (#14259)
|
2025-12-01 18:18:03 -08:00 |
|
Yuan Luo
|
26aebf83d3
|
[VLM] Support Piecewise CUDA Graph for Qwen3-Omni-MOE (#14222)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-12-02 10:12:10 +08:00 |
|
Mick
|
3ab8ae6847
|
[diffusion] fix: fix Flux.2 condition image resize (#14232)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-12-02 10:05:44 +08:00 |
|
Lianmin Zheng
|
796d82b107
|
[Auto Sync] Add max_total_num_tokens metric: Update scheduler_metrics_mixin.py, collector.py (20251202) (#14256)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dan Zheng <dzheng@x.ai>
|
2025-12-01 16:34:34 -08:00 |
|
Lianmin Zheng
|
1da59e8304
|
[Auto Sync] optionally disable fake register in Update fp8_kernel.py (20251202) (#14255)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: gauravjain14 <41287729+gauravjain14@users.noreply.github.com>
|
2025-12-01 16:11:12 -08:00 |
|
TomerBN-Nvidia
|
02af51e4fc
|
Support fp4 fp8 non gated moe (#13794)
Co-authored-by: Roi Koren <roik@nvidia.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
|
2025-12-01 15:26:28 -08:00 |
|
Zhiyu
|
079b173853
|
Fix a distributed initialization error (#13843)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
|
2025-12-01 15:10:05 -08:00 |
|
YAMY
|
1f2b84d28d
|
Fix NSA Bug in Centralize NSA Dispatch Logic (#14245)
|
2025-12-01 13:18:18 -08:00 |
|
ishandhanani
|
07821352fb
|
Revert "Skip weight loading in deepgemm compilation" (#14241)
|
2025-12-01 12:59:09 -08:00 |
|
Byron Hsu
|
edbeaf3b88
|
[MM][style] rename inputs_embeds to input_embeds for consistency (#14240)
|
2025-12-01 11:36:51 -08:00 |
|
liupeng374
|
2e8f54e61e
|
[spec-overlap] bugfix for pd disaggregation and npu (#14088)
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
|
2025-12-01 22:58:20 +08:00 |
|
fzyzcjy
|
45264554f3
|
Super tiny fix typo (#14219)
|
2025-12-01 20:19:17 +08:00 |
|
Liangsheng Yin
|
a2423052f6
|
Add cuda event based on waiting value (#14214)
|
2025-12-01 18:51:44 +08:00 |
|
Lianmin Zheng
|
bc3d2a85af
|
[Minor] update docs (#14212)
|
2025-12-01 02:33:58 -08:00 |
|
fzyzcjy
|
d815d00248
|
Tiny call cudaProfilerStart only on first rank in node (#14211)
|
2025-12-01 18:18:45 +08:00 |
|
Xiaoyu Zhang
|
fa9021b21f
|
fix: Increase FlashInfer workspace size for Qwen3VL models (#14173)
|
2025-12-01 17:54:23 +08:00 |
|
Xiaoyu Zhang
|
9c80072845
|
Add peak output tokens per second in bench_serving (#14165)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
2025-12-01 17:47:54 +08:00 |
|
Yuan Luo
|
630a693081
|
[VLM] Boost Memory Pool based CUDA IPC (#14123)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
|
2025-12-01 17:17:46 +08:00 |
|
Mick
|
7ce8faae28
|
[diffusion] refactor: remove hard-code of instanceof on PipelineConfig (#14186)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-12-01 16:35:34 +08:00 |
|
fzyzcjy
|
de153cf76a
|
Fix speculative decoding error when retracting (#14180)
|
2025-12-01 15:30:13 +08:00 |
|
fzyzcjy
|
f4a0c5c76b
|
Try to remove wrong logic about max total token in spec decoding (#14167)
|
2025-12-01 15:29:58 +08:00 |
|
Binyao Jiang
|
0f8e53947d
|
[Piecewise] Use same global graph memory pool as the main cuda graph … (#14044)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: BBuf <1182563586@qq.com>
|
2025-11-30 23:04:10 -08:00 |
|
fzyzcjy
|
e8ba5a668c
|
Support profiling only prefill or decode without the other (#14182)
|
2025-12-01 14:46:30 +08:00 |
|
fzyzcjy
|
a2960bdd6b
|
Super tiny allow millisecond precision in logging (#14183)
|
2025-12-01 14:46:09 +08:00 |
|
fzyzcjy
|
487c8d4df3
|
Tiny add several args to bench serving (#14181)
|
2025-12-01 14:45:47 +08:00 |
|
fzyzcjy
|
f87b8eab23
|
Tiny fix transform_scale_ue8m0 wrong output in some scenarios (#14003)
|
2025-12-01 14:45:27 +08:00 |
|
Minglei Zhu
|
e8542db558
|
[piecewise] move piecewise_cuda_graph_runner init to model_runner initialize (#14034)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
|
2025-11-30 22:16:04 -08:00 |
|
Lianmin Zheng
|
6df1e8d628
|
[Auto Sync] Update backend.py (20251130) (#14153)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
|
2025-11-30 22:15:02 -08:00 |
|
qichu-yun
|
bd0e690857
|
[Feature] Enable PTPC FP8 for compressed tensors moe (aiter kernel) (#12181)
|
2025-11-30 21:54:28 -08:00 |
|
Byron Hsu
|
0825d7f4c6
|
[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack (#14155)
|
2025-11-30 21:43:09 -08:00 |
|
Yuhao Yang
|
0b9dbea593
|
[diffusion] chore: improve z-image (#14104)
|
2025-12-01 12:26:17 +08:00 |
|
Uranus
|
982db4ebac
|
Feat: GLM-4.6 supports shared experts fusion (#13873)
Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com>
Co-authored-by: Kevin-XiongC <kevin_xiong1997@outlook.com>
Co-authored-by: Mingyi Jin <jinmingyi1998@sina.cn>
|
2025-12-01 11:33:18 +08:00 |
|
Teng Ma
|
f5f3a5d98c
|
[PD] Support json file configuration for Transfer Engine (#14059)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2025-12-01 10:47:33 +08:00 |
|
YAMY
|
decb48965d
|
[DeepSeekV3.2] Enable pure TP & Partial DP Attention (#13646)
|
2025-11-30 15:59:23 -08:00 |
|
Fan Yin
|
c72f0756d2
|
Fix: fix flashmla fp8 kv cache acc error (#13841)
Co-authored-by: ybyang <ybyang7@iflytek.com>
|
2025-11-30 13:38:19 -08:00 |
|
Baizhou Zhang
|
f1115cf58d
|
Revert "[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs" (#14171)
|
2025-11-30 12:49:46 -08:00 |
|