7855 Commits

Author SHA1 Message Date
Junhao Liu
051427c0a3 [diffusion] benchmark: add SLO metric for bench_serving (#18907)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-08 22:35:57 +08:00
liubiyongge
cc73355a1f [Feature] Add SLRU eviction policy & fix RadixCache hit_count bug (#18843)
Co-authored-by: zhangheng <hzh0425@apache.org>
2026-03-08 21:30:55 +08:00
Mick
2c183350be [diffusion] fix: fix wrong dit config for qwen-image-edit-plus-2511 (#20123) 2026-03-08 20:08:36 +08:00
Ratish P
ab9de886c5 [diffusion] reduce LayerwiseOffloadManager reserved GPU memory (#20042) 2026-03-08 19:26:17 +08:00
Liangsheng Yin
29f3a5396e [Minor] Add SessionSlot.is_holding_kv property for readability (#20120)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-08 03:25:13 -07:00
Liangsheng Yin
36b557d2c9 Fix streaming session with paged KV cache (SWA/MLA) (#20070)
Co-authored-by: Yilong Zhao <74357408+happierpig@users.noreply.github.com>
Co-authored-by: Aurick Qiao <6137920+aurickq@users.noreply.github.com>
2026-03-08 03:00:32 -07:00
yuyu5333
230fb55899 [Performance] Decode Offload improves the long texts performance 100% through dynamic block offload. (#17216)
Co-authored-by: zhangheng <hzh0425@apache.org>
2026-03-08 17:16:53 +08:00
Yuan Luo
97a2a9be0f [VLM] Replace conv3d proj with linear for GLM4V (#20033)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-07 22:50:47 -08:00
Fan Lin
7fb282a96f [diffusion] fix: fix bug of copy_if (#20094)
Co-authored-by: Yihan Chen <yingluosanqian@gmail.com>
2026-03-08 14:27:58 +08:00
xingsy97
7f9f85d4c8 [diffusion] feat: make QwenImageLayered resolution configurable (#20044) 2026-03-08 14:26:05 +08:00
Lancer
a73369c39f [diffusion] chore: ensure CFG Zero Star numerical stability for Helios model (#20091)
Signed-off-by: Lancer <maruixiang6688@gmail.com>
2026-03-08 14:25:14 +08:00
shuwenn
72f6dfcc31 fix: add ModelScope cache lookup and speculative path support (#20098) 2026-03-07 22:23:16 -08:00
Liangsheng Yin
d02c515ee8 Decouple scheduler log printing from metrics collection (#20107) 2026-03-07 22:09:10 -08:00
Baizhou Zhang
d28f35240a [V32/GLM5] Change default setting of V32 nvfp4 on TP4 (#20086) 2026-03-07 15:13:25 -08:00
Alison Shao
0f62da6953 [CI] Show test partition assignments after checkout (#20085)
Co-authored-by: Alison Shao <alisonshao@mac.lan>
2026-03-07 13:50:49 -08:00
VDV1985
45bd30e29d [NPU] make torch_native lora backend a little bit faster (#17228)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Egor Filimonov <44640852+ssshinigami@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-07 20:14:46 +03:00
Ke Bao
5867c3fa80 Support HiCache for MambaRadixCache (#19663)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-03-08 00:36:25 +08:00
Bingxu Chen
17721b00fd [AMD] Fix Tensor Memory Aliasing (#19928) 2026-03-07 08:06:10 -08:00
Yuan Luo
7da590d4d0 [Qwen3.5] Support Qwen3.5 Pipeline Parallelism (#19670)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-07 23:34:08 +08:00
YeChang Guo
13bdc7bf4a [Feature][NPU]: add runtime support for AutoRound quantized models (#16699)
Co-authored-by: root <root@localhost.localdomain>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-07 18:03:55 +03:00
Артем Савкин
5297b02c88 [Diffusion] [NPU] Wan2.2-T2V-A14B-Diffusers modelslim quantization support (#17996)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-07 17:26:44 +03:00
xingsy97
f8d4eb7022 [Docs] Add docstrings to JIT kernel include headers (#19770) 2026-03-07 20:48:00 +08:00
Ratish P
ef6540b439 [diffusion]: add width/height passthrough for OpenAI image API (#19970) 2026-03-07 20:43:46 +08:00
David Wang
19c51fe2fa fix(rope): restore K writeback in fused rope + kv store kernel (#19636) 2026-03-07 20:41:35 +08:00
Fan Yin
43d6a32045 [sgl-kernel] rebase FlashMLA 0217 (#18902)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-07 00:30:52 -08:00
danielafrimi
f8bbf56de7 Refactor NemotronHConfig to canonical layers_block_type and add MTP block-type support (#19950)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
2026-03-06 23:22:03 -08:00
Lancer
b91fb8393e [diffusion] fix: fix multi-prompt generation and support multiple prompts in cli (#19960)
Signed-off-by: Lancer <maruixiang6688@gmail.com>
2026-03-07 13:01:59 +08:00
Eitan Turok
31e93e4486 [diffusion] fix: fix TeaCache silently fails with --enable-teacache (#19964) 2026-03-07 13:00:11 +08:00
Qiaolin Yu
925185f9ec Fix flashinfer backend with pcg (#20061) 2026-03-06 20:01:43 -08:00
Feng Su
8a411a9a2a [Tracing] Remove the deprecated tracing code from mini_lb (#19409) 2026-03-07 11:19:23 +08:00
Mohammad Miadh Angkad
f88acf8780 [JIT Kernel] Reland NVFP4 kernels to JIT (#20012) 2026-03-07 10:31:08 +08:00
Yilong Zhao
6ffc74efd7 [Metrics] Add overlap bubble timing, full KV usage gauge, and prefill cuda graph tracking (#19982) 2026-03-06 17:41:27 -08:00
shubham singhal
a0d085c16d Adding correct path for module not found error while collecting test (#19778)
Co-authored-by: sys-lpot-val <sys_lpot_val@intel.com>
2026-03-06 16:26:16 -08:00
R0CKSTAR
e818f8219a Fix none-comparison (E711) warnings (#19745)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2026-03-06 16:15:21 -08:00
R0CKSTAR
0c4f98ed4e [diffusion] hardware: add set_musa_arch on MUSA (misc, 15/N) (#19381)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-03-06 16:14:41 -08:00
MARATRIX
069d4c577b Fix Kimi K2.5 PP layer range exposure for PD disaggregation (#19959)
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
2026-03-06 16:14:02 -08:00
Liangsheng Yin
ddcecdea49 [Core] Unify max_num_reqs dp_size division for pool sizing (#20063)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-06 16:12:59 -08:00
Kangyan-Zhou
7a12255b6e fix: set first_token_time before computing decode_throughput for single-batch completions (#19984)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 16:11:41 -08:00
Aurick Qiao
5c8e28698c Add cleanup for _ATTN_TP in parallel_state.py (#19978) 2026-03-06 15:43:31 -08:00
Shu Wang
61de303f0a Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe (#19189) 2026-03-06 15:15:04 -08:00
Kangyan-Zhou
e89069ee64 Fallback to torch.cuda.mem_get_info() when nvidia-smi is unavailable (#18957)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 15:00:08 -08:00
Liangsheng Yin
604db4471d [Core] Clarify memory variable naming in model runner (#20060) 2026-03-06 14:00:46 -08:00
Liangsheng Yin
7a6cf0e9ba [Core] Extract _calculate_mamba_ratio and _init_pools from init_memory_pool (#20058) 2026-03-06 13:37:22 -08:00
Mohammad Miadh Angkad
759700c808 Fix SM120 triton_kernels MXFP4 block_k for GPT-OSS (#20040) 2026-03-06 10:53:08 -08:00
R0CKSTAR
de1a0afcbc [MUSA][10/N] Add GGUF support (#18357)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-03-06 10:50:35 -08:00
JohnHerry
e8f2b80340 [diffusion] improve: improve code readability of DenoisingStage (#20003) 2026-03-06 23:23:44 +08:00
xingsy97
54634b9a40 [Kernel] Dispatch exp/sin/cos through dtype_trait (#19798) 2026-03-06 22:57:52 +08:00
Johnsonms
2d266c73ea Migrate renorm kernels from sgl-kernel to FlashInfer JIT (#18854) 2026-03-06 22:53:28 +08:00
Xiaoyu Zhang
6d22c9f369 [Diffusion] Move hf kernels diffusion cuda kernels skills to SGLD (#20001)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-06 22:16:06 +08:00
Yuan Luo
f7de9375ac [GDN][Qwen3-Next][Qwen3.5] Fuse fused_gdn_gating and fused_recurrent_gated_delta_rule_update in verify_target (#19775) 2026-03-06 21:42:44 +08:00