7296 Commits

Author SHA1 Message Date
Shu Wang
efebcab43e Support skip-softmax attention (#19089) 2026-03-28 15:55:48 -07:00
Xinyuan Tong
ced69c9f84 feat: enable CUDA graph and timestamp for the whisper model(#21190) 2026-03-29 01:46:03 +08:00
Yuhao Yang
57cf4790ca [VLM] Optimize ShmPointerMMData for multi-pickle safety and deferred unwrap (#21465) 2026-03-28 23:11:12 +08:00
Mick
fc9de157f9 [diffusion] feat: support overlay model materialization (#21600) 2026-03-28 23:02:38 +08:00
Aditya Sharma
627e162335 [diffusion] fix: fix Flux2-Klein prompt tokenization length to 512 and add regression coverage (#21407) 2026-03-28 17:28:02 +08:00
Baizhou Zhang
edd4d54023 [Clean] Remove deprecated environs (#21536) 2026-03-28 00:35:44 -07:00
Liangsheng Yin
402628e560 Patch transformers is_base_mistral in CI to avoid HF 429 rate limiting (#21586) 2026-03-27 22:19:36 -07:00
Jianying
daf02bde33 Fix Piecewise CUDA Graph crash with -enable-mixed-chunk (#20441)
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
2026-03-27 21:56:21 -07:00
Liangsheng Yin
19b1f75186 Fix HFRunner hang when subprocess dies during init (#21582) 2026-03-27 21:22:42 -07:00
Yuhao Yang
5ef56682b8 reduce CPU peak memory in multimodal tensor hashing (#21123) 2026-03-28 11:09:16 +08:00
Fengyuan Yu
9fa7b974fd [diffusion] chore: remove redundant identity preprocess_text functions(#20633)
Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>
2026-03-28 10:07:30 +08:00
Eitan Turok
e570ca96f6 [diffusion] refactor: Unify TeaCacheParams and WanTeaCacheParams (#20706)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-28 09:51:44 +08:00
Mick
f0c68fbefd [diffusion] UX: aggregate expected dtype-cast logs during weight loading (#21552) 2026-03-28 09:50:40 +08:00
Trevor Morris
7160b6cb76 [NVIDIA] Enable automatic NUMA configuration (#19452) 2026-03-27 18:44:13 -07:00
Vladislav Nosivskoy
c37200f5e4 Scope streaming backlog coalescing to incremental_streaming_output mode (#21037)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2026-03-27 17:29:54 -07:00
Qiaolin Yu
a27651d5e0 Remove sync when enabling return_logprob (#20972) 2026-03-27 16:36:28 -07:00
Ethan (Yusheng) Su
6d48719e31 [1/n] lora support - Auto detect lora target modules (#21439)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-27 16:08:36 -07:00
narutolhy
9b29131961 fix tp capture in vit cuda graph (#17255) 2026-03-27 22:38:18 +00:00
Muqi Li
38ad251738 feat: add gc_threshold arg (#21481)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 13:42:46 -07:00
huangtingwei
d864622a68 [Hicache & JIT_kernel] Support page first layout & mla jit kernel (#18311) 2026-03-27 08:54:36 -07:00
Bi Xue
30397e0a1e [rl][sgl] fix tensor mismatch after pause (#21514) 2026-03-27 23:02:30 +08:00
yang1002378395-cmyk
279e7738c5 [diffusion] fix: return None instead of raising RuntimeError when no model info found (#21319)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-27 22:42:39 +08:00
Xiaoyu Zhang
9238bd08a2 [CI] Register missing jit_kernel test files (#21547) 2026-03-27 19:39:08 +08:00
yang1002378395-cmyk
f83b1b73a8 [diffusion] feat: add --strict-ports option for predictable port assignment (#21320)
Co-authored-by: 阳虎 <yanghu@yanghudeMacBook-Pro.local>
2026-03-27 16:40:50 +08:00
zwang86
5fc5c18bed fix(security): replace unsafe pickle.loads with SafeUnpickler for CVE-2026-3989 (#20904) 2026-03-27 00:43:41 -07:00
Khoa Pham
8d4fca5908 [Security] 1/N: Bind ZMQ sockets to localhost to prevent unauthenticated remote access (#21435) 2026-03-26 23:33:49 -07:00
Xiaoyu Zhang
d633ab7349 [Diffusion] Add qknorm rope fuse kernel (#21440) 2026-03-27 14:27:08 +08:00
Xiaoyu Zhang
e8d46f145c Opt jit qknorm_across_heads cuda kernel (#21503) 2026-03-27 13:30:46 +08:00
Johnsonms
8a56a7b04d [jit_kernel] Migrate cast (downcast_fp8) from sgl-kernel AOT to JIT (#19103) 2026-03-27 13:21:44 +08:00
Johnsonms
c531be455e [jit_kernel] Add fused_qknorm_rope JIT kernel (#19059)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-03-27 13:21:28 +08:00
Mick
d7c4c57ace [diffusion] refactor: move format-specific weight loading hooks (quant-related) to a dedicated file (#21366) 2026-03-27 09:58:49 +08:00
Liangsheng Yin
e1ee68d0fc Release mm features on session close and support multiple /rerun-ut specs (#21501) 2026-03-26 18:31:29 -07:00
Aurick Qiao
c2b3e42ad6 Fix sessions with mm inputs (#21269) 2026-03-26 17:38:23 -07:00
Liangsheng Yin
8a4cdcd538 Simplify flush_cache: reject concurrent requests, remove client-side retry (#21490) 2026-03-26 16:31:04 -07:00
Liangsheng Yin
c580ddd19d Fix benchmark generating empty prompts when random_input_len is small (#21492) 2026-03-26 16:24:35 -07:00
Baizhou Zhang
a93065679b Revert "bugfix for weight loading for qwen3-next" (#21496) 2026-03-26 16:17:18 -07:00
SevenJ
2e65c27b29 Api add flush cache timeout (#21413)
Signed-off-by: root <wenjun7j@gmail.com>
2026-03-26 14:44:37 -07:00
Qiaolin Yu
8c3ccef2d9 Fix Kimi K2.5 dp attention+ spec decoding launch crash (#21391) 2026-03-26 14:40:26 -07:00
satyamk7054
be0cca5596 Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native (#20562)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-03-26 14:35:20 -07:00
satyamk7054
e59ea4f6e9 fix: torch-native LoRA for multi-adapter case (#20564)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-03-26 14:34:16 -07:00
Liangsheng Yin
fb90c9d298 [Test] Consolidate eval accuracy test mixins into eval_accuracy_kit (#21047) 2026-03-26 14:26:46 -07:00
Liangsheng Yin
e5b7650353 Fix UnboundLocalError when DetokenizerManager constructor fails (#21471) 2026-03-26 13:00:16 -07:00
Ho-Ren (Jack) Chuang
4b5f63e1b8 FIX: (NSA) Compute topk_indices_offset when NSA prefill flashmla_sparse is used with FP8 KV cache (#20606)
Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
2026-03-26 12:50:50 -07:00
jianzhao-xu
3867c6431a Fix bug in dbrx model (#21445)
Co-authored-by: Jianzhao Xu <xujianchao@huawei.com>
2026-03-26 11:23:30 -07:00
shuwenn
646573e4e8 fix: use get_rope_config() to support models without rope_parameters (#21135) 2026-03-26 11:22:12 -07:00
McZyWu
0906e45cec bugfix for weight loading for qwen3-next (#21313) 2026-03-26 21:21:00 +08:00
Mick
35720d9969 [diffusion] fix: fix qwen-image with nunchaku (#21415) 2026-03-26 16:31:44 +08:00
Anant Sharma
f289d173aa [Deps] Bump xgrammar to 0.1.32 (#21032) 2026-03-26 01:22:37 -07:00
Chen, Zhentao
fd535942ac [AMD]Integrate aiter's fused_topk for softmax scoring in topk function (#21421)
Co-authored-by: Chen, Todd <zhenchen@amd.com>
2026-03-26 00:57:56 -07:00
R0CKSTAR
a305964159 [MLX] Add native MLX execution backend for Apple Silicon Mac (#20342)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2026-03-26 00:09:17 -07:00