DarkSharpness | d1b7c3907d | [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (#20871) | 2026-04-03 12:33:17 +08:00
Brayden Zhong | 6a9b09847c | CUTLASS NVFP4 GEMM improvement of SM120 (#21314) | 2026-04-01 09:04:34 +08:00
Baizhou Zhang | 67cad3e69e | Revert "Support CuteDSL mm_fp4 backend" (#21077) | 2026-03-20 22:47:47 -07:00
Lianmin Zheng | 104b10f70a | refactor: consolidate is_in_ci (jit_kernel, sgl-kernel benchmarks, tests) (#21009) | 2026-03-20 05:55:36 -07:00
Brayden Zhong | b42b9f6e1a | Support CuteDSL mm_fp4 backend (#18801) | 2026-03-19 14:20:01 -07:00
Qi Yuhang | cb8105fe28 | [sgl-kernel][6/7]Support Expert Specialization Grouped GEMM (#15471) | 2026-03-19 15:39:52 +08:00
Xiaoyu Zhang | 25e38216b6 | [kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277) | 2026-03-14 16:45:54 +08:00
pansicheng | 2ad475b4ed | use flashinfer.sampling (#18696) | 2026-02-26 10:02:38 +08:00
SoluMilken | 07a24f1a38 | update pre-commit config (#18860) | 2026-02-16 00:18:31 +08:00
Xiaoyu Zhang | de2f2880b5 | [JIT sgl-kernel] Jit support per tensor quant (#15709) | 2025-12-25 16:24:37 +08:00
sunxxuns | f2d64e6782 | [amd] Add deterministic all-reduce kernel for AMD (ROCm) (#15340); Co-authored-by: Thomas Wang <1am9trash@gmail.com> | 2025-12-18 23:36:03 -08:00
b8zhong | 4b8901ac0f | Update FP4 GEMM Benchmark (#14449) | 2025-12-16 23:04:56 -08:00
Xiaoyu Zhang | c5947ecd85 | Opt moe align block size kernel (#14133) | 2025-12-02 19:13:55 +08:00
Xiaoyu Zhang | ecefc7904f | [sgl-kernel Code Clean] Remove useless lightning_attention kernel (#13819) | 2025-11-24 18:26:25 +08:00
Roger Young | e72cf13693 | Support moe topk sigmoid kernel (#13049); Co-authored-by: xuebi <xuebi@minimaxi.com> | 2025-11-20 00:24:37 +08:00
Xiaoyu Zhang | 1d3d42bda0 | [opt kimi k2 1 / n] Add kimi k2 moe fused gate (#13287) | 2025-11-15 17:14:19 +08:00
Yuan Luo | 271d3d0d50 | Support mrope triton kernel and add unit test (#11722); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>; Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> | 2025-10-20 11:51:07 +08:00
Qi Yuhang | 6c01844f45 | [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674) | 2025-10-15 13:39:31 -07:00
Qi Yuhang | 9a30914e94 | [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>; Co-authored-by: PGFLMG <1106310035@qq.com>; Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> | 2025-10-12 20:19:21 -07:00
fzyzcjy | 21337b22b9 | Reland [1/2] Optimizations and refactors about quant kernel (#10312); Co-authored-by: Yineng Zhang <me@zhyncs.com> | 2025-10-11 15:59:03 +08:00
Lianmin Zheng | 9b8ebb2798 | move more files under srt/utils (#11285) | 2025-10-09 16:46:15 -07:00
Yuan Luo | 4f42c8cd3e | [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (#11068); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2025-10-07 14:31:11 +00:00
Xiaoyu Zhang | 11965b0daf | Fix sgl-kernel benchmark dead code (#11022) | 2025-09-29 15:06:40 +08:00
Xiaoyu Zhang | c4e314f986 | Restruct sgl-kernel benchmark (#10861) | 2025-09-25 07:45:25 +08:00
Yineng Zhang | 6d55f60e77 | Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) | 2025-09-10 18:24:23 -07:00
hlu1 | 5f1eb20484 | [chore] Remove unused ep_moe cuda kernels (#9956) | 2025-09-06 01:35:50 -07:00
fzyzcjy | 339f8eef09 | [1/2] Optimizations and refactors about quant kernel (#9534) | 2025-09-05 18:45:08 +08:00
fzyzcjy | e85cb1ce9d | Fix quant kernel test errors and benchmark wrong output speeds (#7604) | 2025-08-21 03:48:41 -07:00
Yuan Luo | 53dcc750b6 | [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits (#9060); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2025-08-14 10:56:36 -07:00
henryg | 841810f227 | [Perf] Tunings for SM100 FP8 CUTLASS kernel (#8818) | 2025-08-13 21:59:22 -07:00
fzyzcjy | 9aea255522 | Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077) | 2025-08-12 01:46:40 -07:00
Yuan Luo | 1bd5316873 | fix benchmark fp8 blockwise group gemm (#8815) | 2025-08-06 21:02:21 +08:00
Stefan He | db7343c992 | fix per token cuda kernel hidden dim cannot divide by 16 (#8543) | 2025-08-01 09:27:18 -07:00
Peter Pan | 6bdd27861b | [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (#8013) | 2025-08-01 22:01:24 +08:00
Cheng Wan | a5f5ab4030 | update sgl-kernel for EP: kernel part (#8514); Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>; Co-authored-by: Ke Bao <ispobaoke@gmail.com> | 2025-07-30 22:19:55 -07:00
Elfie Guo | 5c9c275bc8 | Use FlashInfer FP4 gemm. (#8241) | 2025-07-27 01:05:22 -07:00
fzyzcjy | e34cf6ad75 | Fix bench script making input data on L2 cache (#7739) | 2025-07-27 00:30:24 -07:00
Qi Yuhang | 426b74936a | Add nvfp4 scaled mm benchmark. (#8401) | 2025-07-26 23:18:04 -07:00
Hubert Lu | af4b9bae95 | [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (#7135); Co-authored-by: yiakwy-xpu-ml-framework-team <961186938@qq.com>; Co-authored-by: HAI <hixiao@gmail.com> | 2025-07-24 23:44:28 -07:00
Peter Pan | 0f8b538614 | [fix] benchmark : routed_scaling_factor is None (#8059); Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> | 2025-07-22 08:55:35 -07:00
Baizhou Zhang | 282eb59ff3 | Add bf16 output option for dsv3_router_gemm kernel (#7999) | 2025-07-20 09:49:37 +08:00
Yi Zhang | 2998c4bdf4 | [optimize] fuse renormalize into moe_topk_softmax (#7744); Co-authored-by: ispobock <ispobaoke@gmail.com> | 2025-07-03 12:42:44 -07:00
ayrnb | 2c4feaf308 | Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (#7278); Co-authored-by: HydraQYH <QYH820@Outlook.com>; Co-authored-by: TianQiLin666666 <1834987979@qq.com> | 2025-07-02 23:27:03 -07:00
Baizhou Zhang | 7248272ccc | Add dsv3 router gemm kernel (#7627) | 2025-06-29 23:31:55 -07:00
Ke Bao | 04b35190e2 | Add dsv3 fused a gemm to sgl-kernel (#7630) | 2025-06-29 02:52:24 -07:00
Ke Bao | 57ab776910 | Fuse sorted_token_ids padding to moe_align_block_size kernel (#7437) | 2025-06-24 17:44:27 -07:00
xutizhou | 506c4928f5 | feat: integrate deepgemm into EPMoE (#6821); Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>; Co-authored-by: TianQiLin666666 <1834987979@qq.com>; Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> | 2025-06-23 01:38:58 -07:00
JieXin Liang | ab1a4fa5cb | [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla (#7184) | 2025-06-14 12:45:41 -07:00
fzyzcjy | aa46ed34d2 | Remove 200us slow concat kernel (part 1: kernel) (#7145) | 2025-06-13 01:58:29 -07:00
Yuan Luo | 84727a5139 | [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul (#6919); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2025-06-11 20:43:08 -07:00