JD
20d07c4384
Fix remote weight info nnode>1 and dp>1 (#17389)
2026-03-31 21:17:18 +08:00

Shangming Cai
ca2b2130ba
[PD] Tiny cleanup after KVReceiver refactor (#21760)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-31 21:07:57 +08:00

Yuan Luo
c7adca9992
Fix kimi-linear launch server error (#21752)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-31 21:07:08 +08:00

Ke Bao
dbc97456ad
Enable evict swa with piecewise cuda graph (#21754)
2026-03-31 20:07:16 +08:00

weireweire
4455d17619
[PD] Refactor Disagg Conn and Fix Hang with total_request/total_tokens Balancing (#21299)
Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal>
2026-03-31 18:01:50 +08:00

R0CKSTAR
6c03ae6fe2
[diffusion] fix: fix typo (#21746)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-03-31 17:51:46 +08:00

xiaoqi
a6a8b9b376
bugfix(model):fix deepstack index out of range error (#21727)
Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com>
2026-03-31 02:41:47 -07:00

Thomas Wang
5628e908ae
[AMD] Use tgemm.mm for MoEGate router gemm in deepseek_v2.py (#21657)
2026-03-31 00:55:40 -07:00

xiazhahe
b4cb31f698
[NPU] fix conflict between empty_cache and use_mem_pool (#21507)
2026-03-31 15:37:33 +08:00

Mohammad Miadh Angkad
dd9c9c1b8e
Add explicit disable flag for FlashInfer allreduce fusion (#21446)
2026-03-31 00:15:44 -07:00

Yuhao Yang
68a4573627
[diffusion] fix: fix Flux.2 with tp(#21664)
2026-03-31 14:14:59 +08:00

jacky.cheng
8ba992411d
[AMD] Fix CI multimodal-gen-test-1-gpu-amd for gen model (#21621)
2026-03-30 23:02:20 -07:00

Jincong Chen
03e4f2858d
[Perf]Remove H2D for Qwen3.5 SpecV2 (#20864)
2026-03-31 11:54:58 +08:00

Lewis
33e725b052
[Fix] Update supported custom_mem_pool types for mooncake (#21728)
Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
2026-03-31 11:18:30 +08:00

Xiaoyu Zhang
505eb312ec
Revert "DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication" (#21719)
2026-03-31 10:22:01 +08:00

DarkSharpness
4e480982fa
[misc] multiprocess compilation to speed up test (#21483)
2026-03-31 08:56:37 +08:00

kk
67c295b5f5
[AMD] fix performance regression issue when run gpt-oss with "--context-length 13824" (#21691)
2026-03-30 16:30:16 -07:00

Zhai Feiyue
daf697afda
[AMD] Add SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS env var for configurable KV transfer overlap (#20410)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-03-30 14:37:16 -07:00

Aditya Sharma
d6029de6ad
[Bugfix][NPU] Skip FRACTAL_NZ format for MoE weights with unaligned dimensions (#21209)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-03-30 23:22:17 +03:00

Vedant V Jhaveri
4a9ffc3ab6
fix nemotron capture for non attention layers (#21436)
2026-03-30 12:50:49 -07:00

Yuxuan Zhang
ad064c2f4e
[GLM-V and GLM-4.7] Cast to FP32 before gate projection for GLM model. (#21660)
2026-03-30 12:25:27 -07:00

Makcum888e
f4b0e9c64a
[diffusion] [NPU] support ring attention on NPU with FA (#21383)
2026-03-30 20:10:55 +03:00

GXIN
752d260c77
[NPU][diffusion]: support parallel decoding of qwen-image (#20757)
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
2026-03-30 20:03:24 +03:00

cen121212
ba6d54d0f0
[NPU] GLM-5 optimize with fused kernels (#18617)
2026-03-30 22:48:15 +08:00

xieminghe1
7119d59747
DeepSeek-R1-0528-w4a8: DeepEP Low Latency Dispatch Adopts FP8 Communication (#14162)
Co-authored-by: undefined <zhouchen.arrebol@jd.com>
2026-03-30 22:27:28 +08:00

heziiop
673ffb3116
[NPU] fix eagle3 accept rate (#21255)
2026-03-30 21:58:25 +08:00

GXIN
c5c58c3349
[NPU][Diffusion] fix sp modulate for qwen-image-edit (#20974)
Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local>
2026-03-30 16:18:48 +03:00

Mick
0a1fb42869
[diffusion] CI: relax pr-test threshold (#21682)
2026-03-30 20:23:46 +08:00

Mick
b76730701b
[diffusion] feat: enhance overlay mechanism (#21648)
2026-03-30 19:45:34 +08:00

LiYomi
1d6424d5ad
fix: Mistral Small 4 fails to start due to config/weight format mismatch (#21620)
Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:57:35 -07:00

strgrb
b246269444
fix mamba cache leak when adder fails to add a matched req. (#21404)
2026-03-30 16:45:49 +08:00

Baizhou Zhang
62a63eeff7
[Fix] Fix weight_loader property assignment for qwen3-next FP8 models (#21662)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:35:59 -07:00

Hubert Lu
e6071e60c0
[AMD] Support AMD MXFP4 Qwen3.5-397B-A17B model (#21234)
2026-03-30 01:14:18 -07:00

kk
b9a68c304e
[AMD] Fused rope kv store (#21315)
Co-authored-by: wunhuang <wunhuang@amd.com>
2026-03-30 00:05:41 -07:00

blzheng
ed01e1d5d6
[CPU] add kernel apply_rotary_pos_emb_cpu for Qwen3-VL and Qwen3-Omni (#13121)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-03-29 23:43:46 -07:00

Aishwarya Ramasethu
c32ee48886
MFU metrics in Prometheus (#19395)
2026-03-29 23:40:06 -07:00

Polisetty V R K Jyothendra Varma
f0303fd07e
[Intel GPU] Enable DeepSeek R1 inference on XPU (#18461)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
2026-03-29 22:35:59 -07:00

Feng Su
9b4dd27478
[Fix] Fix Qwen3.5 MoE model loading and Mamba cache sharding in PP mode (#21448)
Co-authored-by: zhangxiaolei123456 <zhangxiaolei.666@bytedance.com>
2026-03-30 11:57:26 +08:00

Liangsheng Yin
c06ca1526c
Fix circular reference in CustomTestCase.__init_subclass__ (#21650)
Co-authored-by: wan4ch <wan4ch@gmail.com>
2026-03-29 20:38:12 -07:00

Lianmin Zheng
9f7792415a
Clean up TokenizerManager: remove dead code and improve rid validation (#21639)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:12:49 -07:00

Lianmin Zheng
f3970b17ef
[Cleanup] Remove unused BatchMultimodalOutput and BatchMultimodalDecodeReq (#21640)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 14:54:25 -07:00

Lianmin Zheng
1d9c8e8c9e
Simplify routed experts test and move base64 encoding to tokenizer manager (#21634)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 12:44:01 -07:00

Mohammad Miadh Angkad
2acdda1d85
[Fix] Remove redundant allreduce fusion block and skip TP=1 (#20621)
2026-03-29 12:30:40 -07:00

wili
bda94fc779
[Fix] SGLANG_USE_CUDA_IPC_TRANSPORT=1 and SGLANG_ENABLE_MM_SPLITTING=1 do not work at the same time. (#19915)
2026-03-30 01:15:26 +08:00

saatwiknagpal
d2440dcf58
[VLM] perf: optimize CUDA IPC for multimodal transfer by caching IPC pool handles (#21418)
2026-03-30 00:20:38 +08:00

wili
5bb9ca0e63
[Feature] Optimizations for JPEG input on NVIDIA GPU (#19749)
2026-03-30 00:06:14 +08:00

Bi Xue
42c46e6334
[sgl] disable piecewise cuda graph when a model doesn't have layers (#21565)
2026-03-29 23:04:20 +08:00

Hanlin Bi
aa9177152e
fix cuda graph capturing error in sm120 mxfp8 triton path (#19835)
2026-03-29 01:59:24 -07:00

Liangsheng Yin
fec9961a1f
Clean up _wait_for_scheduler_ready implementation (#21626)
2026-03-29 01:02:33 -07:00

psaab
d2fa8d67ba
Wrap IPv6 addresses in gRPC, bench_serving, and log messages (#21236)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-03-29 00:36:31 -07:00