213 Commits

Author SHA1 Message Date
Baizhou Zhang
6ecd6f84db [CI] Add per-job uv venv isolation and upgrade CI version to Cuda 13 (#23119)
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-19 05:32:36 -07:00
DarkSharpness
d1b7c3907d [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (#20871) 2026-04-03 12:33:17 +08:00
Xiaoyu Zhang
cdd7d6a227 Remove obsolete sgl-kernel legacy paths (#21528) 2026-04-01 09:00:20 +08:00
Lianmin Zheng
27ac831a84 docs: improve CI and testing documentation (#21202)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 10:48:50 -07:00
Xiaoyu Zhang
25e38216b6 [kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277) 2026-03-14 16:45:54 +08:00
Rain Jiang
472eef4071 fa4 cleanup (#19727) 2026-03-05 17:54:25 +08:00
huangtingwei
36dc973cbf [HiCache] refactor page_first_direct io kernel (#18113)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-02-27 11:43:14 -08:00
pansicheng
2ad475b4ed use flashinfer.sampling (#18696) 2026-02-26 10:02:38 +08:00
Xiaoyu Zhang
9dff933164 [Kernel Slimming] Remove sgl-kernel AOT marlin kernels (#19241) 2026-02-25 10:08:22 +08:00
SoluMilken
07a24f1a38 update pre-commit config (#18860) 2026-02-16 00:18:31 +08:00
Xiaoyu Zhang
c29394e3c8 [kernel slimming] Move fast_hadamard_transform to jit_kernel (#18475) 2026-02-14 23:06:21 +08:00
Xiaoyu Zhang
fb74e43707 [Diffusion] Delete sgl-kernel outdated time_embedding kernel (#17278) 2026-01-28 14:18:53 +08:00
Xiaoyu Zhang
67fb492c9a [CI] Fix test_moe_fused_gate error (#17844) 2026-01-28 12:03:17 +08:00
Bingxu Chen
50a2e4345a [AMD CI] Add 2-GPU sgl-kernel Tests (#17555)
Co-authored-by: YC Tseng <yctseng@amd.com>
2026-01-22 21:48:52 -08:00
Michael
53609e5e5b Revert "[Diffusion] Move diffusion time embedding to jit kernel" (#17257) 2026-01-17 21:29:22 +08:00
Xiaoyu Zhang
2cdd4370bc [Diffusion] Move diffusion time embedding to jit kernel (#16879) 2026-01-17 12:21:22 +08:00
Baizhou Zhang
f9fc50acd6 [Tiny] Rename test_sparse_flash_attn.py to fix CI (#16895) 2026-01-11 18:18:29 +08:00
Johnny
b5493f65be [NVIDIA] upstream FA4 (#15182)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-01-11 15:31:28 +08:00
66RING
46be74b4b4 [diffusion] kernel: timestep embedding kernel implementation (#12995)
Co-authored-by: 戚余航 <qiyuhang@bytedance.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
2025-12-19 20:59:50 +08:00
sunxxuns
f2d64e6782 [amd] Add deterministic all-reduce kernel for AMD (ROCm) (#15340)
Co-authored-by: Thomas Wang <1am9trash@gmail.com>
2025-12-18 23:36:03 -08:00
Kevin_Xiong
4792d1f452 [sgl-kernel][1/2] Fused qk_norm_rope for GLM4.6 (#15141) 2025-12-18 17:07:04 +08:00
zyl_keep_moving
a9ce1623cd [kernel][moe] add moe topk fast (#13969)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-12-14 22:26:40 +08:00
Qiaolin Yu
cb8df87fc1 [1/2] Add rope kernel in sgl-kernel (#14334) 2025-12-04 16:45:44 +08:00
Qi Yuhang
16ff892c18 [sgl-kernel][Feat][B200][1/N] Support MXFP8 Grouped GEMM in Blackwell (#13731)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-12-04 10:09:09 +08:00
Xiaoyu Zhang
c5947ecd85 Opt moe align block size kernel (#14133) 2025-12-02 19:13:55 +08:00
Fan Yin
412160f4c1 [sgl-kernel] fix b200 kernel ci (#13907)
Co-authored-by: HydraQYH <qyh820@outlook.com>
2025-11-30 10:15:37 -08:00
Yuan Luo
e12c78aab6 [sgl-kernel][1/2] Fused qk_norm_rope for Qwen3-MoE (#14036)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-11-28 12:25:15 +08:00
Xiaoyu Zhang
ecefc7904f [sgl-kernel Code Clean] Remove useless lightning_attention kernel (#13819) 2025-11-24 18:26:25 +08:00
alisonshao
64480ec712 Add sgl-kernel CI test for Blackwell (B200) (#13301) 2025-11-20 19:02:42 -08:00
Roger Young
e72cf13693 Support moe topk sigmoid kernel (#13049)
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-11-20 00:24:37 +08:00
iLeGend
20e59f9510 Add FP32 dtype support for RoPE - Part1 (#13181) 2025-11-15 11:37:18 -08:00
Xiaoyu Zhang
1d3d42bda0 [opt kimi k2 1 / n] Add kimi k2 moe fused gate (#13287) 2025-11-15 17:14:19 +08:00
Fan Yin
2966367a31 [sgl-kernel] support custom fp8 flashmla kernel (#13087) 2025-11-13 12:45:21 -08:00
Shu Wang
6664083522 Replace [silu_and_mul_]scaled_fp4_group_quant by Flashinfer equivalent (#12376) 2025-11-13 00:26:00 -08:00
Xiaoyu Zhang
05559a4a90 Support hidden_dim % 4 == 0 in per_token_quant_fp8 (#12883) 2025-11-10 17:13:14 +08:00
hlu1
b8ddc296f4 [sgl-kernel][Deepseek V3.2] Add row_starts to topk kernel (#12582)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-11-07 18:33:27 -08:00
Ben Barsdell
fef3a6b63b Restore torch defaults between sgl-kernel tests (#11131) 2025-11-03 23:51:23 -08:00
fzyzcjy
193fbb0bce Super tiny add UT for copy_to_gpu_no_ce (#12270) 2025-11-04 09:40:51 +08:00
Baizhou Zhang
685c06451f [ci] Try fixing broken CIs (#12317) 2025-10-29 01:13:51 -07:00
weiliang
88596739a4 Support running FP4 Deepseek on SM120. (#11708) 2025-10-27 17:37:49 -07:00
huangtingwei
3e6281d0aa [HiCache]Page head layout IO kernel (#11615) 2025-10-26 15:53:50 +08:00
Fan Yin
23afdfd1c2 [sgl-kernel] support flashmla libtorch (#11717) 2025-10-21 21:17:50 -07:00
hlu1
3b80232d06 [DeepseekV32] Add fast_topk_transform_ragged_fused kernel (#11815)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-10-19 17:13:39 -07:00
Johnny
252dc4e112 [NVIDIA] FA3/FA4 Fix (#11606)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-19 17:10:10 -07:00
Fan Yin
3289da5b41 [sgl-kernel] support hadamard (#11663) 2025-10-15 19:00:44 -07:00
Fan Yin
5464457251 [sgl-kernel] Optimize gguf test (#11667) 2025-10-15 15:45:53 -07:00
Qi Yuhang
6c01844f45 [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674) 2025-10-15 13:39:31 -07:00
Qi Yuhang
9a30914e94 [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-10-12 20:19:21 -07:00
PGFLMG
8fdcd98efe [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019) 2025-10-11 14:04:57 -07:00
fzyzcjy
21337b22b9 Reland [1/2] Optimizations and refactors about quant kernel (#10312)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-10-11 15:59:03 +08:00