Qiaolin Yu | 90d5e27f79 | Enable fa3 PDL by compiling it with corresponding flags (#18756) | 2026-02-18 17:12:05 +08:00 |
| blake-snc | 0d30896015 | fix(sgl-kernel): use >= 120 for SM12x CUDA kernel dispatch (#18750)<br>Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> | 2026-02-16 00:44:47 +08:00 |
| blake-snc | 5fc328465a | fix(sgl-kernel): support CUDA 13 runtime preloading for DGX Spark (#18747)<br>Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> | 2026-02-16 00:43:04 +08:00 |
| SoluMilken | 07a24f1a38 | update pre-commit config (#18860) | 2026-02-16 00:18:31 +08:00 |
| Xiaoyu Zhang | c29394e3c8 | [kernel slimming] Move fast_hadamard_transform to jit_kernel (#18475) | 2026-02-14 23:06:21 +08:00 |
| Xiaoyu Zhang | 9e9e949261 | speed up sgl-kernel build (#18586) | 2026-02-12 23:43:22 +08:00 |
| Baizhou Zhang | 2d38b8aca0 | Revert "[sgl-kernel] upgrade deepgemm" (#18562) | 2026-02-11 01:17:40 +08:00 |
| Xiaoyu Zhang | bec7fe9e65 | [sgl-kernel] upgrade deepgemm (#18362) | 2026-02-10 21:31:30 +08:00 |
| Lianmin Zheng | 75997ebe8d | Update author information in pyproject.toml (#18453) | 2026-02-08 12:22:55 -08:00 |
| Baizhou Zhang | 9fbec79906 | Revert "[Build] Enable full kernel in aarch64 wheel" (#18385) | 2026-02-07 09:19:07 +08:00 |
| zhangxin81 | e3021b65fe | support smem in per_token_quant_fp8 kernel (#16725)<br>Co-authored-by: zhangxin81 <969206500@qq.com> | 2026-02-02 17:18:50 +08:00 |
| Yuan Luo | afebb7ab78 | Optimize custom-all-reduce (#17674)<br>Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2026-02-01 18:59:31 +08:00 |
| Zaili Wang | 97593c9f41 | [CPU] toml file update (#17861) | 2026-01-31 13:16:06 -08:00 |
| R0CKSTAR | 46095f0551 | [MUSA] Update 3rd party dir to build/_deps (#18035)<br>Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> | 2026-01-31 12:02:39 -08:00 |
| Yifan Cui | 45fe51a28e | Reduce topk kernel shared memory from 128KB to 32KB for better occupancy (#17747)<br>Co-authored-by: Claude <noreply@anthropic.com> | 2026-01-30 21:42:21 -08:00 |
| jianan-gu | c35aa0238c | [CPU][INT4] Add INT4 kernels for CPU (#8226)<br>Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> | 2026-01-29 22:30:13 -08:00 |
| Ma Mingfei | 88f7759402 | [CPU] optimize flash_attn_varlen_func (#15708) | 2026-01-29 22:07:05 -08:00 |
| jianan-gu | 336dc4579e | [CPU] Optimize Qwen3-next model on CPU (#12525)<br>Co-authored-by: Ma Mingfei <mingfei.ma@intel.com><br>Co-authored-by: Fan Yin <1106310035@qq.com> | 2026-01-29 22:03:58 -08:00 |
| Xiaoyu Zhang | fb74e43707 | [Diffusion] Delete sgl-kernel outdated time_embedding kernel (#17278) | 2026-01-28 14:18:53 +08:00 |
| Xiaoyu Zhang | 67fb492c9a | [CI] Fix test_moe_fused_gate error (#17844) | 2026-01-28 12:03:17 +08:00 |
| Yi Zhong | 8acd4d7d7e | Make flashMLA work on: Cu13, B300 (#17600)<br>Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com> | 2026-01-28 00:12:47 +08:00 |
| R0CKSTAR | 628ab5d57b | [MUSA][2/N] sgl-kernel build (#17053)<br>Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> | 2026-01-23 14:41:47 -08:00 |
| Bingxu Chen | 50a2e4345a | [AMD CI] Add 2-GPU sgl-kernel Tests (#17555)<br>Co-authored-by: YC Tseng <yctseng@amd.com> | 2026-01-22 21:48:52 -08:00 |
| Zaili Wang | 672eb37534 | [CPU][Fix CI] Solidate torch version for sgl-kernel-cpu and fix device orientation error (#17460) | 2026-01-22 14:04:50 +08:00 |
| Serge Panev | e95668abc7 | [NVIDIA] Fix CUDA arch requirement in nvfp4 cast (#12581)<br>Signed-off-by: Serge Panev <spanev@nvidia.com><br>Co-authored-by: Fan Yin <1106310035@qq.com> | 2026-01-21 20:21:11 -08:00 |
| Binyao Jiang | 38c233fd04 | [Piecewise] Support PCG weak_ref_tensor cuda kernel on AMD (#17291) | 2026-01-20 14:05:32 -08:00 |
| Michael | 53609e5e5b | Revert "[Diffusion] Move diffusion time embedding to jit kernel" (#17257) | 2026-01-17 21:29:22 +08:00 |
| Xiaoyu Zhang | 2cdd4370bc | [Diffusion] Move diffusion time embedding to jit kernel (#16879) | 2026-01-17 12:21:22 +08:00 |
| sglang-bot | c86ca12875 | chore: bump sgl-kernel version to 0.3.21 (#16888)<br>Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> | 2026-01-14 13:27:49 +08:00 |
| Lianmin Zheng | a4825ed588 | Fix kernel type annotations for fp8 quant and logging (#16994) | 2026-01-13 18:14:32 -08:00 |
| Xiaoyu Zhang | 2ab3ed3e9e | Fix sgl-kernel per_token_quant fp8 kernel scale shared_memory bug (#16886) | 2026-01-13 23:22:05 +08:00 |
| Hubert Lu | 8716589826 | [AMD][Diffusion] support timestep embedding kernel for AMD GPUs (#16766) | 2026-01-12 22:17:07 -08:00 |
| Baizhou Zhang | f9fc50acd6 | [Tiny] Rename test_sparse_flash_attn.py to fix CI (#16895) | 2026-01-11 18:18:29 +08:00 |
| Johnny | b5493f65be | [NVIDIA] upstream FA4 (#15182)<br>Co-authored-by: Qiaolin-Yu <liin1211@outlook.com><br>Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2026-01-11 15:31:28 +08:00 |
| MarcoDWei | 1c09cbe3ed | [Build] Enable full kernel in aarch64 wheel (#16155) | 2026-01-07 19:40:03 -08:00 |
| hlu1 | 12a0292bfd | Revert "[sgl-kernel] Update flashmla to include fp8 sparse_mla optimizations" (#16678) | 2026-01-08 10:23:06 +08:00 |
| Yingchun Lai | 828cd8936f | Introduce sgl-kernel Dockerfile (#14066) | 2026-01-04 11:19:08 -08:00 |
| Yineng Zhang | 5595ae142c | docs: fix markdown preview (#16236) | 2025-12-31 12:43:57 -08:00 |
| shuwenn | c0fc7a89e7 | [sgl-kernel] fix: make sgl-kernel build respect MAX_JOBS (#15575) | 2025-12-31 10:44:45 +08:00 |
| Xiaoyu Zhang | de2f2880b5 | [JIT sgl-kernel] Jit support per tensor quant (#15709) | 2025-12-25 16:24:37 +08:00 |
| sglang-bot | a39126672a | chore: bump sgl-kernel version to 0.3.20 (#15564) | 2025-12-21 13:15:23 -08:00 |
| Xiaoyu Zhang | 7fa4906f4f | [sgl-kernel] Streamline kernel size report (Top 20 only) and clean up (#15552) | 2025-12-21 10:00:47 +08:00 |
| Hubert Lu | 51e2eaa458 | [AMD] Support fast_topk kernels in sgl-kernel (#15172) | 2025-12-19 22:19:09 -08:00 |
| 66RING | 46be74b4b4 | [diffusion] kernel: timestep embedding kernel implementation (#12995)<br>Co-authored-by: 戚余航 <qiyuhang@bytedance.com><br>Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com> | 2025-12-19 20:59:50 +08:00 |
| Fan Yin | 65c098592d | [sgl-kernel] chore: update deepgemm version (#13402) | 2025-12-19 00:20:24 -08:00 |
| sunxxuns | f2d64e6782 | [amd] Add deterministic all-reduce kernel for AMD (ROCm) (#15340)<br>Co-authored-by: Thomas Wang <1am9trash@gmail.com> | 2025-12-18 23:36:03 -08:00 |
| Bruce-x-1997 | 793c96c3d2 | [perf]optimize w4afp8 kernel on deepseek-v3-0324 (#12921)<br>Signed-off-by: bruce.xu <bruce.x@gmicloud.ai> | 2025-12-18 18:13:22 +08:00 |
| Kevin_Xiong | 4792d1f452 | [sgl-kernel][1/2] Fused qk_norm_rope for GLM4.6 (#15141) | 2025-12-18 17:07:04 +08:00 |
| Xiaoyu Zhang | 56d12b4aea | Fix warp illegal instruction in kimi k2 thinking PCG (#15306) | 2025-12-18 16:58:23 +08:00 |
| MarcoDWei | ef7c29acd7 | Fix issue: ENABLE_BELOW_SM90 cannot be enabled on aarch64 CPU (#12967) | 2025-12-18 13:26:42 +08:00 |
|