30 Commits

Author SHA1 Message Date
Baizhou Zhang
d14d368191 [Kernel] Set sgl_per_token_group_quant_8bit_v2 as default choice (#22467) 2026-04-11 01:59:57 -07:00
Xiaoyu Zhang
25e38216b6 [kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277) 2026-03-14 16:45:54 +08:00
Mohammad Miadh Angkad
f88acf8780 [JIT Kernel] Reland NVFP4 kernels to JIT (#20012) 2026-03-07 10:31:08 +08:00
Baizhou Zhang
51e5dc845a Revert "[Kernel Slimming] Migrate NVFP4 kernels to JIT" (#20005) 2026-03-05 19:40:00 -08:00
Mohammad Miadh Angkad
2bdd89a6cd [Kernel Slimming] Migrate NVFP4 kernels to JIT (#19437) 2026-03-05 15:22:28 +08:00
Xiaoyu Zhang
9dff933164 [Kernel Slimming] Remove sgl-kernel AOT marlin kernels (#19241) 2026-02-25 10:08:22 +08:00
SoluMilken
07a24f1a38 update pre-commit config (#18860) 2026-02-16 00:18:31 +08:00
Lianmin Zheng
20315697f4 move all get_stream in sgl_kernel to c++ to reduce the launch overhead (#12521) 2025-11-02 13:15:05 -08:00
zejunchen-zejun
8a6838212a [Fix] fix type issue of env flag value MODELOPT_MAX_TOKENS_PER_EXPERT (#11709) 2025-10-29 09:44:05 -07:00
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
fzyzcjy
a27825ae01 Support not officially supported high sgl-kernel version with low srt version (#11786) 2025-10-19 16:11:59 +08:00
fzyzcjy
21337b22b9 Reland [1/2] Optimizations and refactors about quant kernel (#10312) 2025-10-11 15:59:03 +08:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Yineng Zhang
6d55f60e77 Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) 2025-09-10 18:24:23 -07:00
fzyzcjy
339f8eef09 [1/2] Optimizations and refactors about quant kernel (#9534) 2025-09-05 18:45:08 +08:00
Kaixi Hou
5c34b4f1c7 [NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf (#9556) 2025-08-29 17:17:03 -07:00
Kaixi Hou
e5638573c1 [NVIDA] [1/N] Nvfp4 Masked Gemm: Add quant op for the flashinfer grouped gemm (#9200) 2025-08-22 12:19:45 -07:00
Azure
70bb066ee4 Fix FP4 inference corruption issue in glm4.5-air model (#9346) 2025-08-20 22:13:47 -07:00
Peng Zhang
5aa1ebd242 [2/n]decouple quantization implementation from vLLM dependency (#8112) 2025-08-14 03:19:03 -07:00
Co-authored-by: walker-ai <yiyun.wyt@antgroup.com>
Co-authored-by: leoneo <1320612015@qq.com>
Baizhou Zhang
282eb59ff3 Add bf16 output option for dsv3_router_gemm kernel (#7999) 2025-07-20 09:49:37 +08:00
Baizhou Zhang
7248272ccc Add dsv3 router gemm kernel (#7627) 2025-06-29 23:31:55 -07:00
Ke Bao
04b35190e2 Add dsv3 fused a gemm to sgl-kernel (#7630) 2025-06-29 02:52:24 -07:00
fzyzcjy
5c66c4424f Support new DeepGEMM format in per token group quant (#7146) 2025-06-13 02:00:22 -07:00
Pavani Majety
eb38c7d1ca [1/2] Add Kernel support for Cutlass based Fused FP4 MoE (#6093) 2025-06-02 13:48:03 -07:00
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
HandH1998
4d643f6c7a [1/2] Support Qserve (#6457) 2025-05-21 19:48:59 -07:00
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Yineng Zhang
136b8e6afb fix: remove cublas_grouped_gemm (#5307) 2025-04-11 16:22:37 -07:00
Yineng Zhang
31dfff7da7 use default for torch.ops (#4835) 2025-03-27 19:09:58 -07:00
Trevor Morris
e9f8e42318 Support FP4 gemm (1/2) (#3899) 2025-03-24 19:50:23 -07:00
Chunan Zeng
65c24c28f9 [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) 2025-03-23 23:44:17 -07:00
Yineng Zhang
0a3960f21f fix awq_dequantize (#4333) 2025-03-12 01:04:38 -07:00
Rex
07f944631e Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) 2025-03-12 00:10:02 -07:00
Lianmin Zheng
8abf74e3c9 Rename files in sgl kernel to avoid nested folder structure (#4213) 2025-03-08 22:54:51 -08:00
Co-authored-by: zhyncs <me@zhyncs.com>