Commit Graph

6437 Commits

Author SHA1 Message Date
danielafrimi
3f1df322f9 [FIX] Always support TP > 4 for FP4 Gemm (#17300) 2026-02-05 15:10:26 +08:00
Meng, Hengyu
368936a62b [XPU] Integrate MoE and minor improvements in XPU attention backend (#13561) 2026-02-04 23:09:59 -08:00
Xiaoyu Zhang
dff3ba202a [Diffusion] Support layerwise offload for mova (#18272) 2026-02-05 13:16:07 +08:00
Ch3ngY1
f730c18679 [PD] improve kv offset calculation for MHA model with different tp size (#18163)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-02-05 10:43:23 +08:00
Mick
f218234e4f [diffusion] chore: prohibit Chinese characters usage (#18249) 2026-02-05 09:22:26 +08:00
yinghui
599c5f4922 fix kimi k2.5's moe gemm config init (#18064) 2026-02-04 16:59:01 -08:00
linhaifeng
c1d5cc3b24 [Bugfix] fix a obvious logic error (#18254) 2026-02-04 13:59:58 -08:00
Mohammad Miadh Angkad
efbf39583e Add MoE fused config for Qwen3-Coder-Next-FP8 on H100 TP=2 (#18195) 2026-02-04 13:36:35 -08:00
Zack Yu
2e87c2bd5e fix: fix MockModelRunner in attention tests (#18240) 2026-02-04 13:18:02 -08:00
Michael
6fd878b41d [AMD] Add kimi mi35x nightly test, folder organization and several stability fixes (#17895) 2026-02-04 12:03:57 -08:00
Mick
36a3e78af9 [diffusion] refactor: move model_stages into stages folder (#18248) 2026-02-05 00:23:31 +08:00
RunningLeon
3e7ecb78a6 model: support interns1-pro (#18145)
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
2026-02-05 00:22:44 +08:00
RunningLeon
a6f53cc5e3 entrypoint: support passing spaces_between_special_tokens per request (#17939) 2026-02-04 22:18:36 +08:00
wxy
4c403045ec [diffusion] fix: fix the bug of redundant memory usage on GPU-0 (#18221) 2026-02-04 21:25:23 +08:00
Zhang Yiyang (SII)
0c9a0adc53 [diffusion] chore: clean MOVA codes (#18107) 2026-02-04 21:23:41 +08:00
BingjiaWang
760ae933bb optimize get_topk_ragged by fusing get k and k_scale triton kernel (#16043)
Co-authored-by: abing <wangbingjia.wbj@alibaba-inc.com>
2026-02-04 19:59:41 +08:00
Nicolas Castet
315306d8a9 Make sure we always disable symm memory without dp padding (#18129) 2026-02-04 19:58:28 +08:00
Jincong Chen
a72f4f839c Tiny fix for fp8 moe backend flashinfer_trtllm naming (#18243) 2026-02-04 19:58:04 +08:00
Evrard-Nil
ce02df8592 [diffusion] logging: downgrade default prompt log from info to debug (#17813)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-02-04 19:19:02 +08:00
Cheng Wan
84c09913eb Moving _alloc_extend_naive out of npu allocator (#18200) 2026-02-04 02:09:55 -08:00
zhangheng
be557cbc5f [RadixTree][5/N Refactor]: Introduce pre and post-processing methods for key matching (#18147) 2026-02-04 17:10:46 +08:00
Baizhou Zhang
d279520ba5 [DeepGemm] Add a flag for fast warmup (#18111) 2026-02-04 14:12:13 +08:00
Jianying
4739f2e8d5 [diffusion] kernel: gated residual layernorm scale shift and layernorm scale shift kernel fusion for Qwen-Image, WAN and HunyuanVideo (#14717)
Co-authored-by: AichenF <aichenf@nvidia.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: root <root@a4u8g-0120.ipp2a2.colossus.nvidia.com>
Co-authored-by: Yihan Chen <yingluosanqian@example.com>
Co-authored-by: 陈一涵 <yingluosanqian@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-04 13:46:20 +08:00
strgrb
37c33cc0aa fuse qkvbfg linear into one gemm and f_b g_b into batched gemm. (#17801) 2026-02-04 11:41:26 +08:00
Aurick Qiao
c1d529c196 Fix Session for multimodal and expose it through Engine (#18152) 2026-02-04 10:33:27 +08:00
wxy
da758ed601 [diffusion] fix: fix server cache-dit bug under continuous dynamic requests (#17140) 2026-02-04 09:03:37 +08:00
satyamk7054
793bf9fc06 Update weight rename check for Qwen3 Embeddings (#17535) 2026-02-03 13:55:11 -08:00
Hudson Xing
e867040fc6 add streaming parallel tool call test case (#18097) 2026-02-03 12:46:01 -08:00
R0CKSTAR
7de650c83c [diffusion] hardware: support diffusion models on MTGPU (doc, 6/N) (#17346)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-02-03 12:44:57 -08:00
R0CKSTAR
ec2461bc16 [diffusion] hardware: support diffusion models on MTGPU (multi-GPU, 5/N) (#17318)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-02-03 12:44:22 -08:00
R0CKSTAR
acf724b036 [Diffusion] Only import sgl_kernel in custom op cuda path (SiluAndMul and RMSNorm) (#15592)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-02-03 12:42:58 -08:00
Vladislav Nosivskoy
e166ca8758 [HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics (#17648)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
2026-02-03 11:45:35 -08:00
Even Zhou
d48bbe3bed [CI][NPU] Bugfix import sgl-kernel error (#18173) 2026-02-03 11:39:38 -08:00
DiweiSun
495290aefd enable ut test for xpu devices (#11712)
Co-authored-by: jundu <jun.du@intel.com>
Co-authored-by: Gao, Pengfei <pengfei.gao@intel.com>
2026-02-03 11:15:14 -08:00
elvischenv
99fab2ce67 [Bugfix] Fix Mistral Large 3 NVFP4 TRTLLM MoE (#18065) 2026-02-03 20:32:49 +08:00
Lewis
a45647bce1 [PD] feat: support mooncake intra-node nvlink kv transfer (#17866)
Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
2026-02-03 17:47:52 +08:00
Xiaowei Wang
cc69ac9e7a Warmup before profiling prefill latency for dynamic chunk sizing (#17198)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-02-03 17:45:23 +08:00
Mohammad Miadh Angkad
6f6b9c6e42 [Perf] Use safetensors load_file in multithread loader (#18124) 2026-02-02 23:21:13 -08:00
fatSheep
7a9d9c79d1 [HiCache] fix: apply extra_backend_tag in Mooncake batch_exists (#17265) 2026-02-02 22:54:56 -08:00
Viacheslav
74f716dbd7 Gigachat 3 tool parser and tests (#14765) 2026-02-02 22:28:34 -08:00
Kaixi Hou
4181290efd [NVIDIA] Add --top-k argument to run_eval.py (#18025)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 22:17:53 -08:00
b8zhong
78bf13db44 MoE Refactor: Refactor modelopt_quant.py -> flashinfer_trllm.py (#16685)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2026-02-02 20:45:14 -08:00
Xiaoyu Zhang
eedd472025 [Diffusion] fix serving image_edit get input image bug (#18109) 2026-02-03 12:17:16 +08:00
Hank Han
e484c90cc7 Add triton_fused_moe config for GLM-4.7-FP8 tp8 H20 H20-3e (#18091) 2026-02-03 12:08:23 +08:00
Linyu Wu
9b1619c148 [Move sgl-kernel Kernel to JIT] Add JIT concat MLA kernels (#17889) 2026-02-03 10:49:17 +08:00
Mick
62004fd2be [diffusion] UX: improve logging (#18122) 2026-02-03 10:35:05 +08:00
zhangheng
180594358b [HiCache]: Support DeepSeek v32 cpu offloading (#17415)
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
2026-02-02 18:07:37 -08:00
Xiaoyu Zhang
a1bbc892af [Diffsuion & JIT_kernel] QKNorm cross heads kernel (#18073) 2026-02-03 10:03:17 +08:00
EkiRui
fd983b09b6 [Performance] Optimize radix cache eviction performance (#14339)
Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
2026-02-03 09:44:20 +08:00
Alison Shao
28e2340725 Fix HF hub race condition in CI by coordinating model downloads across TP ranks (#17787)
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2026-02-02 14:57:45 -08:00