7373 Commits

Author SHA1 Message Date
Qiaolin Yu
d8db3077ca Fix draft extend cuda graph when spec_step=1 (#21709) 2026-03-31 18:29:56 -07:00
Liangsheng Yin
e4c565f2f2 [Misc] Tiny: Add test network timeouts and dynamic max-parallel for 5090/2-gpu runners (#21800) 2026-03-31 18:27:39 -07:00
Chang Su
1389962f06 [gRPC] Preserve original ImportError in grpc_server.py (#21801)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
2026-03-31 18:22:29 -07:00
Brayden Zhong
6a9b09847c CUTLASS NVFP4 GEMM improvement of SM120 (#21314) 2026-04-01 09:04:34 +08:00
Johnsonms
5bbf347bb3 [jit_kernel] Optimize fused_qknorm_rope: deduplicate sincosf for interleave RoPE (#21654) 2026-04-01 09:04:13 +08:00
Xiaoyu Zhang
cdd7d6a227 Remove obsolete sgl-kernel legacy paths (#21528) 2026-04-01 09:00:20 +08:00
Liangsheng Yin
a8759dd9af Fix killall.py crash when sglang is not yet installed (#21797) 2026-03-31 17:40:58 -07:00
Liangsheng Yin
7581d814ae Add CompletionSampler for non-chat eval in run_eval (#21785) 2026-03-31 16:33:07 -07:00
Yilong Zhao
1f7cee81da [moe] add customized option to moe-a2a-backend (#21786) 2026-03-31 16:32:47 -07:00
Baizhou Zhang
f60f2ccc10 [Fix] Fall back to triton MOE for GPT-OSS on Blackwell with driver >= 595 (#21780)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:52:10 -07:00
weireweire
9191b02eda Fix cuda graph max bs capture upper bound (#21005) 2026-03-31 15:20:56 -07:00
Ethan (Yusheng) Su
3c91ebdf55 [2/n] lora - Shared outer experts and support qwen3_30b_a3b_instruct (#21466)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-03-31 14:06:23 -07:00
Liangsheng Yin
f4505e2ee3 Fix ineffective is_base_mistral CI patch for HF API rate limiting (#21729) 2026-03-31 12:54:34 -07:00
Trevor Morris
b91f78d255 [bugfix] Fix rope theta config for MiniMax after transformers v5 update (#21241) 2026-03-31 11:37:03 -07:00
Michael
8d919bbd44 [AMD] Fix Handle missing rope_theta in get_rope_config for Grok-1 (#21518) 2026-03-31 10:58:12 -07:00
Zhangheng
91048b2a8e [HiMambaTree]: Optimize mamba host lock mechanism (#21750) 2026-03-31 21:52:24 +08:00
R0CKSTAR
e67dbf257a [diffusion] fix: fix Wan2.2-I2V-A14B video max size issue(#21390)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-31 21:49:53 +08:00
Mick
7790645b82 [diffusion] UX: replace deprecated ORJSONResponse with orjson_response (#21755)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 21:41:33 +08:00
JD
20d07c4384 Fix remote weight info nnode>1 and dp>1 (#17389) 2026-03-31 21:17:18 +08:00
Shangming Cai
ca2b2130ba [PD] Tiny cleanup after KVReceiver refactor (#21760)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-31 21:07:57 +08:00
Yuan Luo
c7adca9992 Fix kimi-linear launch server error (#21752)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-31 21:07:08 +08:00
Ke Bao
dbc97456ad Enable evict swa with piecewise cuda graph (#21754) 2026-03-31 20:07:16 +08:00
weireweire
4455d17619 [PD] Refactor Disagg Conn and Fix Hang with total_request/total_tokens Balancing (#21299)
Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal>
2026-03-31 18:01:50 +08:00
R0CKSTAR
6c03ae6fe2 [diffusion] fix: fix typo (#21746)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-03-31 17:51:46 +08:00
xiaoqi
a6a8b9b376 bugfix(model):fix deepstack index out of range error (#21727)
Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com>
2026-03-31 02:41:47 -07:00
Thomas Wang
5628e908ae [AMD] Use tgemm.mm for MoEGate router gemm in deepseek_v2.py (#21657) 2026-03-31 00:55:40 -07:00
xiazhahe
b4cb31f698 [NPU] fix conflict between empty_cache and use_mem_pool (#21507) 2026-03-31 15:37:33 +08:00
Mohammad Miadh Angkad
dd9c9c1b8e Add explicit disable flag for FlashInfer allreduce fusion (#21446) 2026-03-31 00:15:44 -07:00
Yuhao Yang
68a4573627 [diffusion] fix: fix Flux.2 with tp(#21664) 2026-03-31 14:14:59 +08:00
jacky.cheng
8ba992411d [AMD] Fix CI multimodal-gen-test-1-gpu-amd for gen model (#21621) 2026-03-30 23:02:20 -07:00
Jincong Chen
03e4f2858d [Perf]Remove H2D for Qwen3.5 SpecV2 (#20864) 2026-03-31 11:54:58 +08:00
Lewis
33e725b052 [Fix] Update supported custom_mem_pool types for mooncake (#21728)
Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
2026-03-31 11:18:30 +08:00
Xiaoyu Zhang
505eb312ec Revert "DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication" (#21719) 2026-03-31 10:22:01 +08:00
DarkSharpness
4e480982fa [misc] multiprocess compilation to speed up test (#21483) 2026-03-31 08:56:37 +08:00
kk
67c295b5f5 [AMD] fix performance regression issue when run gpt-oss with "--context-length 13824" (#21691) 2026-03-30 16:30:16 -07:00
Zhai Feiyue
daf697afda [AMD] Add SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS env var for configurable KV transfer overlap (#20410)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-03-30 14:37:16 -07:00
Aditya Sharma
d6029de6ad [Bugfix][NPU] Skip FRACTAL_NZ format for MoE weights with unaligned dimensions (#21209)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-30 23:22:17 +03:00
Vedant V Jhaveri
4a9ffc3ab6 fix nemotron capture for non attention layers (#21436) 2026-03-30 12:50:49 -07:00
Yuxuan Zhang
ad064c2f4e [GLM-V and GLM-4.7] Cast to FP32 before gate projection for GLM model. (#21660) 2026-03-30 12:25:27 -07:00
Makcum888e
f4b0e9c64a [diffusion] [NPU] support ring attention on NPU with FA (#21383) 2026-03-30 20:10:55 +03:00
GXIN
752d260c77 [NPU][diffusion]: support parallel decoding of qwen-image (#20757)
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
2026-03-30 20:03:24 +03:00
cen121212
ba6d54d0f0 [NPU] GLM-5 optimize with fused kernels (#18617) 2026-03-30 22:48:15 +08:00
xieminghe1
7119d59747 DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (#14162)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
2026-03-30 22:27:28 +08:00
heziiop
673ffb3116 [NPU] fix eagle3 accept rate (#21255) 2026-03-30 21:58:25 +08:00
GXIN
c5c58c3349 [NPU][Diffusion] fix sp modulate for qwen-image-edit (#20974)
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
2026-03-30 16:18:48 +03:00
Mick
0a1fb42869 [diffusion] CI: relax pr-test threshold (#21682) 2026-03-30 20:23:46 +08:00
Mick
b76730701b [diffusion] feat: enhance overlay mechanism (#21648) 2026-03-30 19:45:34 +08:00
LiYomi
1d6424d5ad fix: Mistral Small 4 fails to start due to config/weight format mismatch (#21620)
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:57:35 -07:00
strgrb
b246269444 fix mamba cache leak when adder fails to add a matched req. (#21404) 2026-03-30 16:45:49 +08:00
Baizhou Zhang
62a63eeff7 [Fix] Fix weight_loader property assignment for qwen3-next FP8 models (#21662)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:35:59 -07:00