6508 Commits

Author SHA1 Message Date
Michael
ca09d71cf0 Fix nightly grok failure on rotary embedding import (#19192)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
2026-02-25 13:25:16 +08:00
jacky.cheng
e138f7960a [AMD] Fix accuracy while using --enable-dp-attention (#19247)
Co-authored-by: yichiche@amd.com <jacky.cheng>
2026-02-24 20:50:28 -08:00
Liangsheng Yin
ab0f608788 [PD-Disagg] Fix bootstrap server race condition when prefill workers not yet registered (#19288)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-24 20:22:16 -08:00
Liangsheng Yin
539f772f54 [PD-Disagg] Fully support external DP dispatch w/ PD-disaggregation mode. (#19268)
Co-authored-by: Ratish P <114130421+ratish1@users.noreply.github.com>
2026-02-24 19:58:01 -08:00
Mick
241ee90164 [diffusion] chore: tiny fix pyproject.toml (#19256) 2026-02-25 11:57:53 +08:00
Shangming Cai
0fac2796b6 [PD-Disagg] Improve KVManager init across all backends (#19240)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-02-25 10:37:09 +08:00
siyu
c0fdfd4b92 Delete mm.feature after decode phase (#17324) 2026-02-24 18:13:03 -08:00
Xiaoyu Zhang
9dff933164 [Kernel Slimming] Remove sgl-kernel AOT marlin kernels (#19241) 2026-02-25 10:08:22 +08:00
Feng Su
3b89302277 Refactor: observability code cleanup (#17862)
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
2026-02-24 18:07:29 -08:00
siyu
245430eaac Encoder Global Cache Manager (#16137)
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
2026-02-24 18:05:43 -08:00
fzyzcjy
b7af58b9af Support replication axis in dump comparator (#19282) 2026-02-25 09:48:43 +08:00
fzyzcjy
2e2b18e870 Support context parallel zigzag reordering in dump comparator (#19281) 2026-02-25 09:46:17 +08:00
fzyzcjy
0de1f4b07b Support multi axis unsharding in dump comparator (#19280) 2026-02-25 09:44:07 +08:00
fzyzcjy
4bb678f28a Full test coverage in dump comparator (#19279) 2026-02-25 09:43:29 +08:00
fzyzcjy
94ca2ac5d7 Support TP unification and enhance tests in dump comparator (#19278) 2026-02-25 09:42:56 +08:00
fzyzcjy
39ba9b5ab5 Support simple unsharding in dumper comparator (#19277) 2026-02-25 09:42:21 +08:00
fzyzcjy
02ca107b2c Support dims annotation and enhance dump loader in dumper (#19276) 2026-02-25 09:41:48 +08:00
fzyzcjy
8b1ab4aaf9 Support agent-friendly output formats in dump comparator (#19275) 2026-02-25 09:40:33 +08:00
fzyzcjy
d7578ce279 Implement simplest dump comparator v2 (#19274) 2026-02-25 09:37:21 +08:00
wxy
9cec98b445 [diffusion] fix: shard timestep_proj in sequence-sharded ti2v (#19237) 2026-02-25 09:33:45 +08:00
Mick
0ede5c54a8 [diffusion] logging: improve request and component load logs (#19253) 2026-02-25 09:32:36 +08:00
silencejade
15f2e36fb9 [NPU] forward_npu uses native impl by default in MultiPlatformOp (#18545) 2026-02-25 09:16:09 +08:00
Chen, Zhentao
b9cf1563de [ROCm] Optimize Deepseek R1 on MI300X (#18242)
Co-authored-by: Chen, Todd <zhenchen@amd.com>
2026-02-24 17:01:14 -08:00
Vladimir221
f1088beb6a [NPU]Optimization of forward_npu for UnquantizedFusedMoEMethod (#13158) 2026-02-25 08:55:20 +08:00
Jonah Bernard
d7a03c7ebf [MoE Refactor] Refactor FlashInferFusedMoE into FusedMoE and flashinfer_trtllm.py (#19266) 2026-02-24 16:47:53 -08:00
Chen, Zhentao
c193a52fa2 [AMD] DSR1/V3 use fp8 bmm in MLA for MI300X (#18624)
Co-authored-by: Chen, Todd <zhenchen@amd.com>
2026-02-24 15:33:33 -08:00
Ratish P
ae6f6e1495 [Refactor] Benchmark: Add typed DatasetArgs/Loader registry and CPU dataset unit tests (#19147)
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2026-02-24 12:22:01 -08:00
Glen Liu
a06425184f [Fix] remove redundant +1 when getting tail_str for Req (#18584) 2026-02-24 10:35:26 -08:00
Makcum888e
9bce3b040c [diffusion] [NPU] Update perf baselines (#19227) 2026-02-24 21:15:16 +03:00
Zihao Wang
e6ad58e5da [diffusion] fix: webui not return data (#19244) 2026-02-24 23:39:06 +08:00
Adarsh Shirawalmath
1efc33c640 [diffusion] refactor: remove enums and unify attention backends (#19149) 2026-02-24 21:21:23 +08:00
triple-mu
ea1bc1c578 [diffusion] feat: optimize all_to_all (#19157) 2026-02-24 21:20:10 +08:00
Yuan Luo
31c7dc9d99 [VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend (#19003)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-02-24 19:49:22 +08:00
xdtbynd
1a83b2c15d fix: fix the bug blocking completion template application (#17010)
Co-authored-by: xdtbynd <supercluster@vip.qq.com>
Co-authored-by: cy <chenyang08056032@163.com>
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-02-24 18:50:08 +08:00
Cheng Wan
6e54361608 Refactor CUDA graph input buffers with shared buffer pool (#19180) 2026-02-24 02:24:40 -08:00
jianzhao-xu
94946764a4 bugfix: The default value of the environment variable is None, so that… (#18309)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-02-24 18:10:21 +08:00
GMI Xiao Jin
fcfd964d7d [diffusion] model: LTX-2 Support PR3 (#19151) 2026-02-24 16:55:28 +08:00
Simo Lin
88a7e48108 fix(grpc): handle embedding requests in _handle_batch_output (#19221)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
2026-02-23 22:32:16 -08:00
Mick
2e053d6eb6 [diffusion] quant: support quant for all dits (#19156)
Co-authored-by: zyzshishui <zyzshishui@gmail.com>
2026-02-24 14:20:54 +08:00
fy
4f25a48d7a support xverse_moe on npu
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-02-24 14:11:20 +08:00
Liangsheng Yin
5e80027ac7 [PD-Disagg] Support FAKE transfer backend for more cases (#19215) 2026-02-23 20:50:43 -08:00
jianzhao-xu
d8ef33a551 bugfix: prioritize init_npu_backend to fix various initialization bugs (#17652) 2026-02-24 12:25:20 +08:00
Liangsheng Yin
feb041f4e5 [PD-Disagg] Improve type hints across all conn.py (#19208) 2026-02-23 19:44:23 -08:00
Liangsheng Yin
ea7ef63e6d [PD-Disagg] Deduplicate common KVManager methods into CommonKVManager (#19205)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-02-23 18:34:21 -08:00
Kangyan-Zhou
8aeb16f3fc fix: add missing blank line after docstring in serving_transcription.py (#19206)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 18:29:38 -08:00
Xinyuan Tong
581bf53e03 Whisper model support & /v1/audio/transcriptions endpoint & benchmark (#16983)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-23 17:28:37 -08:00
Liangsheng Yin
d160c5b9bb [PD-Disagg] Unify prefill info data transition flow, all with PrefillServerInfo (#19195) 2026-02-23 16:15:20 -08:00
Liangsheng Yin
9fac90d85d [CI] Tiny enhance the dp attention load balance benchmark (#19194) 2026-02-23 14:33:32 -08:00
Liangsheng Yin
2aa3fe394e [CI] fix the teardown output of disaggregation test (#19193)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-23 12:41:03 -08:00
Liangsheng Yin
2274bfebb1 [PD-Disagg] Support query dp rank from bootstrap server. (#19168)
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Co-authored-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
2026-02-23 10:59:30 -08:00