722 Commits

Author SHA1 Message Date
Qiaolin Yu
90d5e27f79 Enable fa3 PDL by compiling it with corresponding flags (#18756) 2026-02-18 17:12:05 +08:00
blake-snc
0d30896015 fix(sgl-kernel): use >= 120 for SM12x CUDA kernel dispatch (#18750)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 00:44:47 +08:00
blake-snc
5fc328465a fix(sgl-kernel): support CUDA 13 runtime preloading for DGX Spark (#18747)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 00:43:04 +08:00
SoluMilken
07a24f1a38 update pre-commit config (#18860) 2026-02-16 00:18:31 +08:00
Xiaoyu Zhang
c29394e3c8 [kernel slimming] Move fast_hadamard_transform to jit_kernel (#18475) 2026-02-14 23:06:21 +08:00
Xiaoyu Zhang
9e9e949261 speed up sgl-kernel build (#18586) 2026-02-12 23:43:22 +08:00
Baizhou Zhang
2d38b8aca0 Revert "[sgl-kernel] upgrade deepgemm" (#18562) 2026-02-11 01:17:40 +08:00
Xiaoyu Zhang
bec7fe9e65 [sgl-kernel] upgrade deepgemm (#18362) 2026-02-10 21:31:30 +08:00
Lianmin Zheng
75997ebe8d Update author information in pyproject.toml (#18453) 2026-02-08 12:22:55 -08:00
Baizhou Zhang
9fbec79906 Revert "[Build] Enable full kernel in aarch64 wheel" (#18385) 2026-02-07 09:19:07 +08:00
zhangxin81
e3021b65fe support smem in per_token_quant_fp8 kernel (#16725)
Co-authored-by: zhangxin81 <969206500@qq.com>
2026-02-02 17:18:50 +08:00
Yuan Luo
afebb7ab78 Optimize custom-all-reduce (#17674)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-02-01 18:59:31 +08:00
Zaili Wang
97593c9f41 [CPU] toml file update (#17861) 2026-01-31 13:16:06 -08:00
R0CKSTAR
46095f0551 [MUSA] Update 3rd party dir to build/_deps (#18035)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-01-31 12:02:39 -08:00
Yifan Cui
45fe51a28e Reduce topk kernel shared memory from 128KB to 32KB for better occupancy (#17747)
Co-authored-by: Claude <noreply@anthropic.com>
2026-01-30 21:42:21 -08:00
jianan-gu
c35aa0238c [CPU][INT4] Add INT4 kernels for CPU (#8226)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 22:30:13 -08:00
Ma Mingfei
88f7759402 [CPU] optimize flash_attn_varlen_func (#15708) 2026-01-29 22:07:05 -08:00
jianan-gu
336dc4579e [CPU] Optimize Qwen3-next model on CPU (#12525)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
Co-authored-by: Fan Yin <1106310035@qq.com>
2026-01-29 22:03:58 -08:00
Xiaoyu Zhang
fb74e43707 [Diffusion] Delete sgl-kernel outdated time_embedding kernel (#17278) 2026-01-28 14:18:53 +08:00
Xiaoyu Zhang
67fb492c9a [CI] Fix test_moe_fused_gate error (#17844) 2026-01-28 12:03:17 +08:00
Yi Zhong
8acd4d7d7e Make flashMLA work on: Cu13, B300 (#17600)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
2026-01-28 00:12:47 +08:00
R0CKSTAR
628ab5d57b [MUSA][2/N] sgl-kernel build (#17053)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-01-23 14:41:47 -08:00
Bingxu Chen
50a2e4345a [AMD CI] Add 2-GPU sgl-kernel Tests (#17555)
Co-authored-by: YC Tseng <yctseng@amd.com>
2026-01-22 21:48:52 -08:00
Zaili Wang
672eb37534 [CPU][Fix CI] Solidate torch version for sgl-kernel-cpu and fix device orientation error (#17460) 2026-01-22 14:04:50 +08:00
Serge Panev
e95668abc7 [NVIDIA] Fix CUDA arch requirement in nvfp4 cast (#12581)
Signed-off-by: Serge Panev <spanev@nvidia.com>
Co-authored-by: Fan Yin <1106310035@qq.com>
2026-01-21 20:21:11 -08:00
Binyao Jiang
38c233fd04 [Piecewise] Support PCG weak_ref_tensor cuda kernel on AMD (#17291) 2026-01-20 14:05:32 -08:00
Michael
53609e5e5b Revert "[Diffusion] Move diffusion time embedding to jit kernel" (#17257) 2026-01-17 21:29:22 +08:00
Xiaoyu Zhang
2cdd4370bc [Diffusion] Move diffusion time embedding to jit kernel (#16879) 2026-01-17 12:21:22 +08:00
sglang-bot
c86ca12875 chore: bump sgl-kernel version to 0.3.21 (#16888)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-01-14 13:27:49 +08:00
Lianmin Zheng
a4825ed588 Fix kernel type annotations for fp8 quant and logging (#16994) 2026-01-13 18:14:32 -08:00
Xiaoyu Zhang
2ab3ed3e9e Fix sgl-kernel per_token_quant fp8 kernel scale shared_memory bug (#16886) 2026-01-13 23:22:05 +08:00
Hubert Lu
8716589826 [AMD][Diffusion] support timestep embedding kernel for AMD GPUs (#16766) 2026-01-12 22:17:07 -08:00
Baizhou Zhang
f9fc50acd6 [Tiny] Rename test_sparse_flash_attn.py to fix CI (#16895) 2026-01-11 18:18:29 +08:00
Johnny
b5493f65be [NVIDIA] upstream FA4 (#15182)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-01-11 15:31:28 +08:00
MarcoDWei
1c09cbe3ed [Build] Enable full kernel in aarch64 wheel (#16155) 2026-01-07 19:40:03 -08:00
hlu1
12a0292bfd Revert "[sgl-kernel] Update flashmla to include fp8 sparse_mla optimizations" (#16678) 2026-01-08 10:23:06 +08:00
Yingchun Lai
828cd8936f Introduce sgl-kernel Dockerfile (#14066) 2026-01-04 11:19:08 -08:00
Yineng Zhang
5595ae142c docs: fix markdown preview (#16236) 2025-12-31 12:43:57 -08:00
shuwenn
c0fc7a89e7 [sgl-kernel] fix: make sgl-kernel build respect MAX_JOBS (#15575) 2025-12-31 10:44:45 +08:00
Xiaoyu Zhang
de2f2880b5 [JIT sgl-kernel] Jit support per tensor quant (#15709) 2025-12-25 16:24:37 +08:00
sglang-bot
a39126672a chore: bump sgl-kernel version to 0.3.20 (#15564) 2025-12-21 13:15:23 -08:00
Xiaoyu Zhang
7fa4906f4f [sgl-kernel] Streamline kernel size report (Top 20 only) and clean up (#15552) 2025-12-21 10:00:47 +08:00
Hubert Lu
51e2eaa458 [AMD] Support fast_topk kernels in sgl-kernel (#15172) 2025-12-19 22:19:09 -08:00
66RING
46be74b4b4 [diffusion] kernel: timestep embedding kernel implementation (#12995)
Co-authored-by: 戚余航 <qiyuhang@bytedance.com>
Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com>
2025-12-19 20:59:50 +08:00
Fan Yin
65c098592d [sgl-kernel] chore: update deepgemm version (#13402) 2025-12-19 00:20:24 -08:00
sunxxuns
f2d64e6782 [amd] Add deterministic all-reduce kernel for AMD (ROCm) (#15340)
Co-authored-by: Thomas Wang <1am9trash@gmail.com>
2025-12-18 23:36:03 -08:00
Bruce-x-1997
793c96c3d2 [perf]optimize w4afp8 kernel on deepseek-v3-0324 (#12921)
Signed-off-by: bruce.xu <bruce.x@gmicloud.ai>
2025-12-18 18:13:22 +08:00
Kevin_Xiong
4792d1f452 [sgl-kernel][1/2] Fused qk_norm_rope for GLM4.6 (#15141) 2025-12-18 17:07:04 +08:00
Xiaoyu Zhang
56d12b4aea Fix warp illegal instruction in kimi k2 thinking PCG (#15306) 2025-12-18 16:58:23 +08:00
MarcoDWei
ef7c29acd7 Fix issue: ENABLE_BELOW_SM90 cannot be enabled on aarch64 CPU (#12967) 2025-12-18 13:26:42 +08:00