Commit Graph

6437 Commits

Author SHA1 Message Date
Makcum888e
5f81ec1ad5 [Diffusion] Fix get model name when model local path end with "/" (#18918) 2026-02-17 13:19:54 +03:00
Ratish P
f6cc02489f [diffusion]: fix sparse video gen 2 backend being applied to cross-attention (#18900)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-02-17 13:17:46 +03:00
HAI
b158f5d4a2 Revert "[AMD] Fix RotaryEmbedding crash on AMD/ROCm (regression from #17934)" (#18922) 2026-02-17 01:07:50 -08:00
billishyahao
899e2be7d0 [TBO] fix cuda graph intermittently becomes disabled bug (#18320) 2026-02-16 22:18:57 -08:00
Michael
5e3103a787 [AMD] Fix RotaryEmbedding crash on AMD/ROCm (regression from #17934) (#18903)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
2026-02-17 12:59:40 +08:00
Mohammad Miadh Angkad
90a0d66e1e [Tiny] Fix assert syntax warning in compressed_tensors_w4a4_mxint4_moe.py (#18899) 2026-02-17 12:54:30 +08:00
Yilong Zhao
d5307ce022 [misc] adding metadata field in UpdateWeightFromDiskReqInput (#18821) 2026-02-17 12:14:15 +08:00
triple-mu
26b2c63d03 [diffusion] operator: unify rotary embedding impl (#18164)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-17 12:02:48 +08:00
pansicheng
b21390f8f3 Adapt the Qwen2Model._update_causal_mask for transformers==4.57.1 (#18774) 2026-02-17 10:20:41 +08:00
Ratish P
50ca24aebb [diffusion]: fix scheduler crash on ZMQ messages with unexpected frame counts (#17890)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-17 09:45:05 +08:00
Frank Minors
1b659bcb08 Fix GLM-5 fused shared expert (#18804)
Co-authored-by: FrankMinions <liuchen@shinemo.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2026-02-16 19:50:39 +00:00
danielafrimi
0ff24159a5 Fix modelopt FP8 create weights (#18447)
Signed-off-by: root <dafrimi@nvidia.com>
2026-02-17 00:59:50 +08:00
Tamir Baydasov
eba6af385d [2/N] Quantization Refactor: Compressed tensors MoE schemes (#17503)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: Peng Zhang <aniz1905@gmail.com>
2026-02-16 18:03:51 +03:00
Estrella-xx
1b3513a7e4 refactor FAKE transfer backend and remove --disaggregation-decode-enable-fake-auto parameter (#18345) 2026-02-16 17:27:02 +03:00
Ratish P
c1d1337afc [diffusion][Wan]: fix sparse attention backends being applied to cross-attention (#17596) 2026-02-16 21:57:58 +08:00
Mohammad Miadh Angkad
b86c6491fa [Perf] ~9.5x faster Blackwell MXFP4 MoE weight loading (#18858) 2026-02-16 19:47:09 +08:00
Shivam jindal
4f0409f8aa [Model] Add Qwen3ForRewardModel and fix Qwen3ForSequenceClassification (#17992)
Co-authored-by: yes-its-shivam <yes-its-shivam@users.noreply.github.com>
2026-02-16 19:44:41 +08:00
Mick
de833f9e8e Revert "[diffusion]: Improve layerwise offload buffer reuse and shared-storage handling" (#18866) 2026-02-16 18:00:58 +08:00
Mick
d0c94e136a [diffusion] logging: improve peak vram logging (#18865) 2026-02-16 16:44:37 +08:00
Yi Zhong
ed22720c07 [JIT kernel] hd=512,1024 in JIT QK norm (cta based) (#17515)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
2026-02-16 16:07:24 +08:00
Alison Shao
206accd15d Fix GLM-4V processor registration when glm_ocr is unavailable (#18885) 2026-02-16 16:02:31 +08:00
Changyi Yang
61da34ad0b [diffusion] fix: fix LoRA weight snapshot aliasing in unmerge logic (#18883) 2026-02-16 15:39:45 +08:00
Alison Shao
86c181e335 Fix test_lora_qwen3 nightly failure: replace adapter with added_tokens (#18884) 2026-02-16 14:35:06 +08:00
Douglas Yang
f1efb46bdd fix: adding performance logging for nightly diffusion (#18023) 2026-02-16 14:09:00 +08:00
fzyzcjy
f554b3c27b Support dumping gradients, parameters, lazy values (#18881)
Co-authored-by: Yueming Yuan <112649537+yueming-yuan@users.noreply.github.com>
2026-02-16 13:34:06 +08:00
fzyzcjy
9a7d8d5eb0 Collect upper level metadata to dump output (#18880) 2026-02-16 13:31:19 +08:00
fzyzcjy
949792d0c6 Change dump output format to dict with value and metadata (#18879) 2026-02-16 13:30:47 +08:00
fzyzcjy
02816abc0d Flip dumper to disable by default and refactor environment handling (#18878) 2026-02-16 13:29:32 +08:00
Duyi-Wang
5ddc84e33e [AMD] MORI-EP inter kernel type switch (#18437)
Co-authored-by: HAI <hixiao@gmail.com>
2026-02-15 20:59:39 -08:00
Johnsonms
bc79a64d3a [Diff]: support SGLANG_TORCH_PROFILER_DIR environment variable for profiler log directory (#18454)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-16 12:47:29 +08:00
Mick
0af9dcc407 [diffusion] refactor: refactor server_args adjust and validate logics (#18863) 2026-02-16 11:49:06 +08:00
Mick
78b4c9e248 [diffusion] fix: avoid saving output for warmup requests (#18867) 2026-02-16 11:48:28 +08:00
Yuan Luo
8a82c70297 [VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel (#18856)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-02-16 11:19:44 +08:00
Rain Jiang
0ffd0a3995 Nsa trtllm mla sparse fp8 support with Deepseek v3.2 NVFP4 (#18389) 2026-02-16 09:29:54 +08:00
Mike Qiu
b79808bee2 Fix libnuma.so does not exsit (#15355)
Signed-off-by: Michael Qiu <qiudayu.qdy@antgroup.com>
Co-authored-by: Mike_Qiu <qiudayu.qdy@antgroup.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2026-02-16 00:37:50 +08:00
akhilg-nv
48eac1b62d Improve profiler options for bench_serving (#16991) 2026-02-16 00:36:01 +08:00
Chanh Nguyen
597d17dd18 Use ephemeral nccl port via get_free_port() (#18009)
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com>
2026-02-16 00:32:47 +08:00
tjp_zju
7a607c4900 fix_get_quant_method_in_fused_moe_condition (#18459)
Signed-off-by: tom-zju <tanjianpingzju1990@gmail.com>
Co-authored-by: Peng Zhang <aniz1905@gmail.com>
2026-02-16 00:31:42 +08:00
WiwilZ
b2f74d660a fix: add SM110 (Jetson AGX Thor) to Blackwell capability check (#18787) 2026-02-16 00:26:58 +08:00
blake-snc
57f7e06cb9 fix: update Blackwell log/error messages to include SM12x (#18751)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 00:23:51 +08:00
SoluMilken
07a24f1a38 update pre-commit config (#18860) 2026-02-16 00:18:31 +08:00
Ratish P
ddfe147377 [diffusion]: Improve layerwise offload buffer reuse and shared-storage handling (#18611) 2026-02-15 22:17:51 +08:00
Mick
3feb48139e [diffusion] quant: add support for svdquant and nunchaku (#18549)
Co-authored-by: AichenF <aichenf@nvidia.com>
Co-authored-by: jianyingzhu <53300651@qq.com>
2026-02-15 20:43:00 +08:00
Michael
88010e9601 [AMD] Fix nightly 1-GPU test failures and bench_serving regression (#18761)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
2026-02-15 20:36:47 +08:00
fzyzcjy
4c7f986c6b Extract dumper and prefill delayer tests common utils (#18857) 2026-02-15 18:33:23 +08:00
haowen-han
b992828ad2 fix: fix bug on kimi2.5 with dp2 and tp4 (#18604)
Co-authored-by: hanhaowen <hanhaowen@baidu.com>
2026-02-15 16:32:13 +08:00
Ratish P
274bf6607a [diffusion] fix: enable torch.compile for UlyssesAttention (#18840) 2026-02-15 15:54:27 +08:00
zhangxiaolei123456
ad1bdb93df perf: add minimax-2.5 fused_moe tuning config for h20 (#18833) 2026-02-15 15:46:56 +08:00
jackey hua
922fbc21e2 [Perf] Tune MiniMax M2 fused moe kernel on H100 GPU (#18851) 2026-02-15 15:30:52 +08:00
andyluo7
944a9f6fcf Fix/qwen3 5 amd rope cutedsl fallback (#18753)
Co-authored-by: seungrokj <seungrok.jung@amd.com>
2026-02-14 22:09:44 -08:00