7326 Commits

Author SHA1 Message Date
LiYomi
1d6424d5ad fix: Mistral Small 4 fails to start due to config/weight format mismatch (#21620)
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:57:35 -07:00
strgrb
b246269444 fix mamba cache leak when adder fails to add a matched req. (#21404) 2026-03-30 16:45:49 +08:00
Baizhou Zhang
62a63eeff7 [Fix] Fix weight_loader property assignment for qwen3-next FP8 models (#21662)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:35:59 -07:00
Hubert Lu
e6071e60c0 [AMD] Support AMD MXFP4 Qwen3.5-397B-A17B model (#21234) 2026-03-30 01:14:18 -07:00
kk
b9a68c304e [AMD] Fused rope kv store (#21315)
Co-authored-by: wunhuang <wunhuang@amd.com>
2026-03-30 00:05:41 -07:00
blzheng
ed01e1d5d6 [CPU] add kernel apply_rotary_pos_emb_cpu for Qwen3-VL and Qwen3-Omni (#13121)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-03-29 23:43:46 -07:00
Aishwarya Ramasethu
c32ee48886 MFU metrics in Prometheus (#19395) 2026-03-29 23:40:06 -07:00
Polisetty V R K Jyothendra Varma
f0303fd07e [Intel GPU] Enable DeepSeek R1 inference on XPU (#18461)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
2026-03-29 22:35:59 -07:00
Feng Su
9b4dd27478 [Fix] Fix Qwen3.5 MoE model loading and Mamba cache sharding in PP mode (#21448)
Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>
2026-03-30 11:57:26 +08:00
Liangsheng Yin
c06ca1526c Fix circular reference in CustomTestCase.__init_subclass__ (#21650)
Co-authored-by: wan4ch <wan4ch@gmail.com>
2026-03-29 20:38:12 -07:00
Lianmin Zheng
9f7792415a Clean up TokenizerManager: remove dead code and improve rid validation (#21639)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:12:49 -07:00
Lianmin Zheng
f3970b17ef [Cleanup] Remove unused BatchMultimodalOutput and BatchMultimodalDecodeReq (#21640)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 14:54:25 -07:00
Lianmin Zheng
1d9c8e8c9e Simplify routed experts test and move base64 encoding to tokenizer manager (#21634)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:44:01 -07:00
Mohammad Miadh Angkad
2acdda1d85 [Fix] Remove redundant allreduce fusion block and skip TP=1 (#20621) 2026-03-29 12:30:40 -07:00
wili
bda94fc779 [Fix] SGLANG_USE_CUDA_IPC_TRANSPORT=1 and SGLANG_ENABLE_MM_SPLITTING=1 do not work at the same time. (#19915) 2026-03-30 01:15:26 +08:00
saatwiknagpal
d2440dcf58 [VLM] perf: optimize CUDA IPC for multimodal transfer by caching IPC pool handles (#21418) 2026-03-30 00:20:38 +08:00
wili
5bb9ca0e63 [Feature] Optimizations for JPEG input on NVIDIA GPU (#19749) 2026-03-30 00:06:14 +08:00
Bi Xue
42c46e6334 [sgl] disable piecewise cuda graph when a model doesn't have layers (#21565) 2026-03-29 23:04:20 +08:00
Hanlin Bi
aa9177152e fix cuda graph capturing error in sm120 mxfp8 triton path (#19835) 2026-03-29 01:59:24 -07:00
Liangsheng Yin
fec9961a1f Clean up _wait_for_scheduler_ready implementation (#21626) 2026-03-29 01:02:33 -07:00
psaab
d2fa8d67ba Wrap IPv6 addresses in gRPC, bench_serving, and log messages (#21236)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-03-29 00:36:31 -07:00
shuwenn
18074e25dc fix: scheduler launch hang when non-current rank dies (#20287) 2026-03-29 00:28:45 -07:00
Simon (Jiyou) Li
22e4733ab9 Add subprocess liveness monitor to detect scheduler crashes (#18582)
Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com>
Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com>
2026-03-29 00:09:13 -07:00
Kangyan-Zhou
9d64a82173 feat(ci): add GB300 nightly benchmark test suites (#21487)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:54:03 -07:00
Lianmin Zheng
ba6b501f3a Clean up detokenizer and remove dead multimodal_gen code (#21588)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:44:40 -07:00
Xiaoyu Zhang
516cff97a3 [Diffusion] Align diffusion benchmark skill presets with nightly comparison cases (#21616)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-29 12:12:17 +08:00
Yuan Luo
343a7ac652 [GDN] Fuse GDN kkt + solve_tril into one kernel (#21411)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-29 12:02:07 +08:00
jacky.cheng
c86f6c2831 [AMD] Add peft>=0.18.0 to diffusion_hip deps for transformers 5.x compat for AMD diffusion model (#21442)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-03-28 20:28:05 -07:00
Yuhao Yang
4e69f14b95 fix bench_serving sglang backend to support image dataset (#21294) 2026-03-29 10:02:11 +08:00
eigen
3ab9afd653 fix: piecewise_cuda_graph get correct qo_indptr (#21452)
Co-authored-by: Avery Huang <averyh@nvidia.com>
2026-03-28 15:57:29 -07:00
Shu Wang
efebcab43e Support skip-softmax attention (#19089) 2026-03-28 15:55:48 -07:00
Xinyuan Tong
ced69c9f84 feat: enable CUDA graph and timestamp for the whisper model (#21190) 2026-03-29 01:46:03 +08:00
Yuhao Yang
57cf4790ca [VLM] Optimize ShmPointerMMData for multi-pickle safety and deferred unwrap (#21465) 2026-03-28 23:11:12 +08:00
Mick
fc9de157f9 [diffusion] feat: support overlay model materialization (#21600) 2026-03-28 23:02:38 +08:00
Aditya Sharma
627e162335 [diffusion] fix: fix Flux2-Klein prompt tokenization length to 512 and add regression coverage (#21407) 2026-03-28 17:28:02 +08:00
Baizhou Zhang
edd4d54023 [Clean] Remove deprecated environs (#21536) 2026-03-28 00:35:44 -07:00
Liangsheng Yin
402628e560 Patch transformers is_base_mistral in CI to avoid HF 429 rate limiting (#21586) 2026-03-27 22:19:36 -07:00
Jianying
daf02bde33 Fix Piecewise CUDA Graph crash with --enable-mixed-chunk (#20441)
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
2026-03-27 21:56:21 -07:00
Liangsheng Yin
19b1f75186 Fix HFRunner hang when subprocess dies during init (#21582) 2026-03-27 21:22:42 -07:00
Yuhao Yang
5ef56682b8 reduce CPU peak memory in multimodal tensor hashing (#21123) 2026-03-28 11:09:16 +08:00
Fengyuan Yu
9fa7b974fd [diffusion] chore: remove redundant identity preprocess_text functions (#20633)
Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>
2026-03-28 10:07:30 +08:00
Eitan Turok
e570ca96f6 [diffusion] refactor: Unify TeaCacheParams and WanTeaCacheParams (#20706)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-28 09:51:44 +08:00
Mick
f0c68fbefd [diffusion] UX: aggregate expected dtype-cast logs during weight loading (#21552) 2026-03-28 09:50:40 +08:00
Trevor Morris
7160b6cb76 [NVIDIA] Enable automatic NUMA configuration (#19452) 2026-03-27 18:44:13 -07:00
Vladislav Nosivskoy
c37200f5e4 Scope streaming backlog coalescing to incremental_streaming_output mode (#21037)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2026-03-27 17:29:54 -07:00
Qiaolin Yu
a27651d5e0 Remove sync when enabling return_logprob (#20972) 2026-03-27 16:36:28 -07:00
Ethan (Yusheng) Su
6d48719e31 [1/n] lora support - Auto detect lora target modules (#21439)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-27 16:08:36 -07:00
narutolhy
9b29131961 fix tp capture in vit cuda graph (#17255) 2026-03-27 22:38:18 +00:00
Muqi Li
38ad251738 feat: add gc_threshold arg (#21481)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 13:42:46 -07:00
huangtingwei
d864622a68 [Hicache & JIT_kernel] Support page first layout & mla jit kernel (#18311) 2026-03-27 08:54:36 -07:00