composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-18 09:38:17 +00:00

Files

kabrahamAMD c4b2da9cbd implement device batched gemm b scale for wmma (#2825 )

* rebased on top of develop

* fixed missing shuffling and wrong indexing

* added tests for batched_b_scale

* added missing files

* fixed wrong stride computation and removed k batching (for now) due to precision issues

* reinstated k-batching with PRNG constrained to -1..1

* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow

* added k-batching to reference and increased tolerances for test

* changed gemm_b_scale and gemm_universal tests to use correct parameters

* addressed review comments

* ported fixes back to non-batched version of b_scale

* addressed review comments

* run clang-format on older commits

* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior

* added newline at end of file

* reflected changes from multi-abd branch in batched b_scale

* fixed gfx11 issue

* changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor_3 until now, this passed)

* run clang format

* set range of i4 generation to 0...1 for upstream tests to pass. This replicates the previous behavior, which, however, means that it is NOT properly tested.

* reduced range for pk_i4 even further to 0..0

* removed failing xdl instances. Failures were uncovered now that the tests are fixed

* removed generation of int4 values entirely

* divide B buffer by BPackedSize

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

2025-10-16 11:00:42 -07:00

codegen_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_avgpool2d_bwd_nhwc_nhwc.hpp

Pool2d max/avg kernel in the BWD version (#1494 )

2024-09-12 11:47:52 +02:00

device_avgpool3d_bwd_ndhwc_ndhwc.hpp

Average pool backward deviceOP and example (#797 )

2023-08-10 12:04:35 +08:00

device_batched_contraction_multiple_d_wmma_cshuffle.hpp

Disable GridwiseOp prints if env var is off (#2843 )

2025-09-16 17:47:28 +02:00

device_batched_contraction_multiple_d_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_e_permute_xdl.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_gemm_wmma_cshuffle_v3.hpp

Implement batched gemm gemm for RDNA (3 and 4) (#2612 )

2025-09-04 14:10:24 -07:00

device_batched_gemm_gemm_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_multi_d_xdl.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_multiple_d_dl.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_batched_gemm_multiple_d_gemm_multiple_d_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_reduce_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_softmax_gemm_permute_wmma_cshuffle.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_batched_gemm_softmax_gemm_permute_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_softmax_gemm_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_wmma_cshuffle_v3_b_scale.hpp

implement device batched gemm b scale for wmma (#2825 )

2025-10-16 11:00:42 -07:00

device_batched_gemm_wmma_cshuffle_v3.hpp

Wmma support for multiple ABD GEMM (#2803 )

2025-09-22 18:49:06 -07:00

device_batched_gemm_xdl_fpAintB_b_scale.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batched_gemm_xdl.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_batchnorm_backward_impl.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_batchnorm_forward_impl_obsolete.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_batchnorm_forward_impl.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_cgemm_4gemm_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_column_to_image_impl.hpp

upgrade from clang-format-12 to clang-format-18 (#2568 )

2025-07-28 11:34:07 -07:00

device_contraction_multiple_abd_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_contraction_multiple_d_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_contraction_utils.hpp

upgrade from clang-format-12 to clang-format-18 (#2568 )

2025-07-28 11:34:07 -07:00

device_conv2d_backward_weight_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_conv2d_bwd_data_xdl_nhwc_kyxc_nhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_conv2d_fwd_xdl_c_shuffle_bias_activation_add_nhwc_kyxc_nhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_conv2d_fwd_xdl_c_shuffle_bias_activation_nhwc_kyxc_nhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_conv2d_fwd_xdl_nhwc_kyxc_nhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_conv3d_fwd_naive_ndhwc_kzyxc_ndhwk.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_conv3d_fwd_xdl_ndhwc_kzyxc_ndhwk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_convnd_bwd_data_nwc_kxc_nwk_dl.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_convnd_bwd_data_nwc_kxc_nwk_xdl.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_elementwise_dynamic_vector_dims_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_elementwise_normalization_impl.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_elementwise_scale_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_fpAintB_gemm_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_bias_add_reduce_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_dl.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_gemm_dpp.hpp

Code clean-up (#1285 )

2024-05-10 09:41:39 -07:00

device_gemm_multiple_abd_wmma_cshuffle_v3.hpp

Wmma support for multiple ABD GEMM (#2803 )

2025-09-22 18:49:06 -07:00

device_gemm_multiple_abd_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_multiple_d_dl.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_gemm_multiple_d_layernorm_xdl_cshuffle.hpp

[CK] Fix misc issues in CK examples (#2890 )

2025-09-24 11:28:20 -07:00

device_gemm_multiple_d_multiple_r_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_multiple_d_wmma_cshuffle_v3.hpp

Wmma support for multiple ABD GEMM (#2803 )

2025-09-22 18:49:06 -07:00

device_gemm_multiple_d_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_multiple_d_xdl_cshuffle_lds_direct_load.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_multiple_d_xdl_cshuffle_v3_ab_scale.hpp

Add KBatch support for gemm_ab_scale (#2740 )

2025-10-09 08:33:16 +02:00

device_gemm_multiple_d_xdl_cshuffle_v3_b_preshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_multiple_d_xdl_cshuffle_v3_blockscale_bpreshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_multiple_d_xdl_cshuffle_v3.hpp

add the check of granularity for atomic add (#2959 )

2025-10-02 11:15:24 -07:00

device_gemm_multiple_d_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_reduce_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_wmma_cshuffle_v3_b_scale.hpp

Wmma support for multiple ABD GEMM (#2803 )

2025-09-22 18:49:06 -07:00

device_gemm_wmma_cshuffle_v3_common.hpp

Wmma support for multiple ABD GEMM (#2803 )

2025-09-22 18:49:06 -07:00

device_gemm_wmma_cshuffle_v3.hpp

Wmma support for multiple ABD GEMM (#2803 )

2025-09-22 18:49:06 -07:00

device_gemm_wmma_cshuffle_v3r1.hpp

fix compilation errors on RHEL8 and SLES15 (#2967 )

2025-10-03 07:08:49 -07:00

device_gemm_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_xdl_cshuffle_lds_direct_load.hpp

TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 )

2025-09-17 14:50:15 -07:00

device_gemm_xdl_cshuffle_streamk_v3.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle_v2.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle_v3_b_preshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle_v3_b_scale.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle_v3_mx.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle_v3.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle_v3r1.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_layernorm_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_skip_b_lds.hpp

[CK] Fix misc issues in CK examples (#2890 )

2025-09-24 11:28:20 -07:00

device_gemm_xdl_splitk_c_shuffle_lds_direct_load.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_splitk_c_shuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_streamk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl_waveletmodel_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_gemm_xdl.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_contraction_multiple_d_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_conv_bwd_data_multiple_d_wmma_cshuffle.hpp

Move SetZero functions inside the kernels for Grouped Conv (#2255 )

2025-06-11 23:41:03 +02:00

device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_bwd_weight_dl.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_grouped_conv_bwd_weight_explicit_xdl.hpp

Disable bwd weight split-k autodeduce for single stage kernels (#2856 )

2025-09-19 16:27:50 +02:00

device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_bwd_weight_two_stage_xdl_cshuffle.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_bwd_weight_wmma_cshuffle.hpp

Fix grid size calc for bwd wei (#2226 )

2025-05-26 16:51:09 +02:00

device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_bwd_weight_xdl_cshuffle.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_grouped_conv_fwd_dl_nhwc_kyxc_nhwk.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 )

2025-09-17 14:50:15 -07:00

device_grouped_conv_fwd_multiple_d_multiple_r_xdl_cshuffle.hpp

[CK] Fix misc issues in CK examples (#2890 )

2025-09-24 11:28:20 -07:00

device_grouped_conv_fwd_multiple_d_multiple_r.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_grouped_conv_fwd_multiple_d_wmma_cshuffle.hpp

Add Grouped Conv Fwd Large Tensor kernel (#1432 )

2024-08-06 10:06:10 +02:00

device_grouped_conv_fwd_multiple_d_xdl_cshuffle.hpp

Add instances for conv_scale with fp8@bf8->fp8 (#1220 )

2024-04-03 09:08:08 -05:00

device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp

Conv:TF32: add more instances - 2 (#2879 )

2025-10-10 15:28:17 +08:00

device_grouped_conv_utils.hpp

Add support for GKCYX grouped conv fwd (#2015 )

2025-03-26 21:13:38 +01:00

device_grouped_gemm_multi_abd_xdl_fixed_nk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_gemm_multiple_d_dl.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_grouped_gemm_multiple_d_splitk_xdl_cshuffle_two_stage.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_gemm_multiple_d_xdl_cshuffle_tile_loop.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_gemm_softmax_gemm_permute_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_gemm_xdl_fixed_nk.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_gemm_xdl_splitk_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_gemm_xdl.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_grouped_query_attention_forward_wmma.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_image_to_column_impl.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_max_pool_bwd_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_moe_gemm_blockscale.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_moe_gemm.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_moe_mx_gemm_bns.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_moe_mx_gemm_bpreshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_moe_mx_gemm.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

device_multi_query_attention_forward_wmma.hpp

Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564 )

2025-07-28 13:01:07 -07:00

device_multiple_reduce_multiblock.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_multiple_reduce_threadwise.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_normalization_bwd_data_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_bwd_gamma_beta_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_fwd_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_fwd_splitk_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_permute_impl.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_pool2d_fwd_nhwc_nhwc.hpp

Rewrite pool2d fwd (#1462 )

2024-09-11 15:21:00 +02:00

device_pool3d_fwd_ndhwc_ndhwc.hpp

Refactor pool fwd (#815 )

2023-08-15 02:25:28 +08:00

device_put_element_impl.hpp

Maxpool bwd (#750 )

2023-06-19 09:44:22 -05:00

device_reduce_common.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_multiblock.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_threadwise_multi_d.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_threadwise.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_softmax_impl.hpp

Revert "Grouped Gemm with looping over the tiles. (#788 )" (#982 )

2023-10-11 14:27:29 -05:00

device_sparse_embeddings_forward_layernorm.hpp

update the switch condition for buffer built-ins (#2602 )

2025-08-01 14:30:07 -07:00

device_splitk_contraction_multiple_d_xdl_cshuffle.hpp

Extend XDL kernel to Support RDNA3/4 - Part 3 (#2723 )

2025-09-09 11:22:36 +08:00

split_k_arg.hpp

Automatic deduction of split-K value for grouped convolution (#2491 )

2025-07-31 12:08:45 +02:00

split_k_utils.hpp

Automatic deduction of split-K value for grouped convolution (#2491 )

2025-07-31 12:08:45 +02:00