danielafrimi
|
3f1df322f9
|
[FIX] Always support TP > 4 for FP4 Gemm (#17300)
|
2026-02-05 15:10:26 +08:00 |
|
Meng, Hengyu
|
368936a62b
|
[XPU] Integrate MoE and minor improvements in XPU attention backend (#13561)
|
2026-02-04 23:09:59 -08:00 |
|
Xiaoyu Zhang
|
dff3ba202a
|
[Diffusion] Support layerwise offload for mova (#18272)
|
2026-02-05 13:16:07 +08:00 |
|
Ch3ngY1
|
f730c18679
|
[PD] improve kv offset calculation for MHA model with different tp size (#18163)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2026-02-05 10:43:23 +08:00 |
|
Mick
|
f218234e4f
|
[diffusion] chore: prohibit Chinese characters usage (#18249)
|
2026-02-05 09:22:26 +08:00 |
|
yinghui
|
599c5f4922
|
fix kimi k2.5's moe gemm config init (#18064)
|
2026-02-04 16:59:01 -08:00 |
|
linhaifeng
|
c1d5cc3b24
|
[Bugfix] fix an obvious logic error (#18254)
|
2026-02-04 13:59:58 -08:00 |
|
Mohammad Miadh Angkad
|
efbf39583e
|
Add MoE fused config for Qwen3-Coder-Next-FP8 on H100 TP=2 (#18195)
|
2026-02-04 13:36:35 -08:00 |
|
Zack Yu
|
2e87c2bd5e
|
fix: fix MockModelRunner in attention tests (#18240)
|
2026-02-04 13:18:02 -08:00 |
|
Michael
|
6fd878b41d
|
[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes (#17895)
|
2026-02-04 12:03:57 -08:00 |
|
Mick
|
36a3e78af9
|
[diffusion] refactor: move model_stages into stages folder (#18248)
|
2026-02-05 00:23:31 +08:00 |
|
RunningLeon
|
3e7ecb78a6
|
model: support interns1-pro (#18145)
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
|
2026-02-05 00:22:44 +08:00 |
|
RunningLeon
|
a6f53cc5e3
|
entrypoint: support passing spaces_between_special_tokens per request (#17939)
|
2026-02-04 22:18:36 +08:00 |
|
wxy
|
4c403045ec
|
[diffusion] fix: fix the bug of redundant memory usage on GPU-0 (#18221)
|
2026-02-04 21:25:23 +08:00 |
|
Zhang Yiyang (SII)
|
0c9a0adc53
|
[diffusion] chore: clean MOVA codes (#18107)
|
2026-02-04 21:23:41 +08:00 |
|
BingjiaWang
|
760ae933bb
|
optimize get_topk_ragged by fusing get k and k_scale triton kernel (#16043)
Co-authored-by: abing <wangbingjia.wbj@alibaba-inc.com>
|
2026-02-04 19:59:41 +08:00 |
|
Nicolas Castet
|
315306d8a9
|
Make sure we always disable symm memory without dp padding (#18129)
|
2026-02-04 19:58:28 +08:00 |
|
Jincong Chen
|
a72f4f839c
|
Tiny fix for fp8 moe backend flashinfer_trtllm naming (#18243)
|
2026-02-04 19:58:04 +08:00 |
|
Evrard-Nil
|
ce02df8592
|
[diffusion] logging: downgrade default prompt log from info to debug (#17813)
Co-authored-by: Mick <mickjagger19@icloud.com>
|
2026-02-04 19:19:02 +08:00 |
|
Cheng Wan
|
84c09913eb
|
Moving _alloc_extend_naive out of npu allocator (#18200)
|
2026-02-04 02:09:55 -08:00 |
|
zhangheng
|
be557cbc5f
|
[RadixTree][5/N Refactor]: Introduce pre and post-processing methods for key matching (#18147)
|
2026-02-04 17:10:46 +08:00 |
|
Baizhou Zhang
|
d279520ba5
|
[DeepGemm] Add a flag for fast warmup (#18111)
|
2026-02-04 14:12:13 +08:00 |
|
Jianying
|
4739f2e8d5
|
[diffusion] kernel: gated residual layernorm scale shift and layernorm scale shift kernel fusion for Qwen-Image, WAN and HunyuanVideo (#14717)
Co-authored-by: AichenF <aichenf@nvidia.com>
Co-authored-by: jianyingzhu <joeyzhu@nvidia.com>
Co-authored-by: root <root@a4u8g-0120.ipp2a2.colossus.nvidia.com>
Co-authored-by: Yihan Chen <yingluosanqian@example.com>
Co-authored-by: 陈一涵 <yingluosanqian@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
|
2026-02-04 13:46:20 +08:00 |
|
strgrb
|
37c33cc0aa
|
fuse qkvbfg linear into one gemm and f_b g_b into batched gemm. (#17801)
|
2026-02-04 11:41:26 +08:00 |
|
Aurick Qiao
|
c1d529c196
|
Fix Session for multimodal and expose it through Engine (#18152)
|
2026-02-04 10:33:27 +08:00 |
|
wxy
|
da758ed601
|
[diffusion] fix: fix server cache-dit bug under continuous dynamic requests (#17140)
|
2026-02-04 09:03:37 +08:00 |
|
satyamk7054
|
793bf9fc06
|
Update weight rename check for Qwen3 Embeddings (#17535)
|
2026-02-03 13:55:11 -08:00 |
|
Hudson Xing
|
e867040fc6
|
add streaming parallel tool call test case (#18097)
|
2026-02-03 12:46:01 -08:00 |
|
R0CKSTAR
|
7de650c83c
|
[diffusion] hardware: support diffusion models on MTGPU (doc, 6/N) (#17346)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
|
2026-02-03 12:44:57 -08:00 |
|
R0CKSTAR
|
ec2461bc16
|
[diffusion] hardware: support diffusion models on MTGPU (multi-GPU, 5/N) (#17318)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
|
2026-02-03 12:44:22 -08:00 |
|
R0CKSTAR
|
acf724b036
|
[Diffusion] Only import sgl_kernel in custom op cuda path (SiluAndMul and RMSNorm) (#15592)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
|
2026-02-03 12:42:58 -08:00 |
|
Vladislav Nosivskoy
|
e166ca8758
|
[HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics (#17648)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
|
2026-02-03 11:45:35 -08:00 |
|
Even Zhou
|
d48bbe3bed
|
[CI][NPU] Bugfix import sgl-kernel error (#18173)
|
2026-02-03 11:39:38 -08:00 |
|
DiweiSun
|
495290aefd
|
enable ut test for xpu devices (#11712)
Co-authored-by: jundu <jun.du@intel.com>
Co-authored-by: Gao, Pengfei <pengfei.gao@intel.com>
|
2026-02-03 11:15:14 -08:00 |
|
elvischenv
|
99fab2ce67
|
[Bugfix] Fix Mistral Large 3 NVFP4 TRTLLM MoE (#18065)
|
2026-02-03 20:32:49 +08:00 |
|
Lewis
|
a45647bce1
|
[PD] feat: support mooncake intra-node nvlink kv transfer (#17866)
Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
|
2026-02-03 17:47:52 +08:00 |
|
Xiaowei Wang
|
cc69ac9e7a
|
Warmup before profiling prefill latency for dynamic chunk sizing (#17198)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2026-02-03 17:45:23 +08:00 |
|
Mohammad Miadh Angkad
|
6f6b9c6e42
|
[Perf] Use safetensors load_file in multithread loader (#18124)
|
2026-02-02 23:21:13 -08:00 |
|
fatSheep
|
7a9d9c79d1
|
[HiCache] fix: apply extra_backend_tag in Mooncake batch_exists (#17265)
|
2026-02-02 22:54:56 -08:00 |
|
Viacheslav
|
74f716dbd7
|
Gigachat 3 tool parser and tests (#14765)
|
2026-02-02 22:28:34 -08:00 |
|
Kaixi Hou
|
4181290efd
|
[NVIDIA] Add --top-k argument to run_eval.py (#18025)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-02-02 22:17:53 -08:00 |
|
b8zhong
|
78bf13db44
|
MoE Refactor: Refactor modelopt_quant.py -> flashinfer_trllm.py (#16685)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
|
2026-02-02 20:45:14 -08:00 |
|
Xiaoyu Zhang
|
eedd472025
|
[Diffusion] fix serving image_edit get input image bug (#18109)
|
2026-02-03 12:17:16 +08:00 |
|
Hank Han
|
e484c90cc7
|
Add triton_fused_moe config for GLM-4.7-FP8 tp8 H20 H20-3e (#18091)
|
2026-02-03 12:08:23 +08:00 |
|
Linyu Wu
|
9b1619c148
|
[Move sgl-kernel Kernel to JIT] Add JIT concat MLA kernels (#17889)
|
2026-02-03 10:49:17 +08:00 |
|
Mick
|
62004fd2be
|
[diffusion] UX: improve logging (#18122)
|
2026-02-03 10:35:05 +08:00 |
|
zhangheng
|
180594358b
|
[HiCache]: Support DeepSeek v32 cpu offloading (#17415)
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
|
2026-02-02 18:07:37 -08:00 |
|
Xiaoyu Zhang
|
a1bbc892af
|
[Diffusion & JIT_kernel] QKNorm cross heads kernel (#18073)
|
2026-02-03 10:03:17 +08:00 |
|
EkiRui
|
fd983b09b6
|
[Performance] Optimize radix cache eviction performance (#14339)
Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>
|
2026-02-03 09:44:20 +08:00 |
|
Alison Shao
|
28e2340725
|
Fix HF hub race condition in CI by coordinating model downloads across TP ranks (#17787)
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
|
2026-02-02 14:57:45 -08:00 |
|