composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 13:11:25 +00:00

Files

Kiefer van Teutem 9e74ae7c89 Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319 )

* Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty.

* Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good.

* Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet.

* Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup.

* Add other layouts for batched gemm wmma cshuffle v3 f16 f16 f16. Now matching XDL (for f16).

* Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts.

* Fixup comments and TODOs

* Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small.

* Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests.

* Take batching into account when using rotating memory or clearing the C tensor.

* Implement small refactors / comments etc. from review.

* Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists.

* Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems.

* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups.

* Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function.

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

2025-06-24 07:28:13 -07:00

codegen_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_avgpool2d_bwd_nhwc_nhwc.hpp

Pool2d max/avg kernel in the BWD version (#1494 )

2024-09-12 11:47:52 +02:00

device_avgpool3d_bwd_ndhwc_ndhwc.hpp

Average pool backward deviceOP and example (#797 )

2023-08-10 12:04:35 +08:00

device_batched_contraction_multiple_d_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_batched_contraction_multiple_d_xdl_cshuffle.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_batched_gemm_e_permute_xdl.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_batched_gemm_gemm_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_multi_d_xdl.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_batched_gemm_multiple_d_dl.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_batched_gemm_multiple_d_gemm_multiple_d_xdl_cshuffle.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

Grouped conv bwd wei explicit GEMM for odd C/K (#2306 )

2025-06-10 11:17:12 +02:00

device_batched_gemm_reduce_xdl_cshuffle.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_batched_gemm_softmax_gemm_permute_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_batched_gemm_softmax_gemm_permute_xdl_cshuffle.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_batched_gemm_softmax_gemm_xdl_cshuffle.hpp

MIGraphX hipRTC fix (#1923 )

2025-03-03 07:55:05 -08:00

device_batched_gemm_wmma_cshuffle_v3.hpp

Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319 )

2025-06-24 07:28:13 -07:00

device_batched_gemm_xdl_fpAintB_b_scale.hpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

device_batched_gemm_xdl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batchnorm_backward_impl.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_batchnorm_forward_impl_obsolete.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_batchnorm_forward_impl.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_cgemm_4gemm_xdl_cshuffle.hpp

Implement GetWorkSpaceSize from BaseOperator. (#1564 )

2024-10-12 14:05:11 +08:00

device_column_to_image_impl.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_contraction_multiple_abd_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_contraction_multiple_d_xdl_cshuffle.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_contraction_utils.hpp

Fix continous dim selection in contraction (#1336 )

2024-06-18 10:26:49 +02:00

device_conv2d_backward_weight_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

Use asynchronous version of hipMemset (#850 )

2023-08-18 11:14:59 +08:00

device_conv2d_bwd_data_xdl_nhwc_kyxc_nhwk.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_conv2d_fwd_xdl_c_shuffle_bias_activation_add_nhwc_kyxc_nhwk.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_conv2d_fwd_xdl_c_shuffle_bias_activation_nhwc_kyxc_nhwk.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_conv2d_fwd_xdl_nhwc_kyxc_nhwk.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_conv3d_fwd_naive_ndhwc_kzyxc_ndhwk.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_conv3d_fwd_xdl_ndhwc_kzyxc_ndhwk.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_convnd_bwd_data_nwc_kxc_nwk_dl.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_convnd_bwd_data_nwc_kxc_nwk_xdl.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_elementwise_dynamic_vector_dims_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_elementwise_normalization_impl.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_elementwise_scale_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_fpAintB_gemm_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_bias_add_reduce_xdl_cshuffle.hpp

Disable XDL kernels on unsupported HW Add ck::is_xdl_supported (#768 )

2023-07-26 07:19:55 -07:00

device_gemm_dl.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_gemm_dpp.hpp

Code clean-up (#1285 )

2024-05-10 09:41:39 -07:00

device_gemm_multiple_abd_xdl_cshuffle.hpp

bf16A_Int8B with fastgelu/bias (#1264 )

2024-04-26 07:26:30 -05:00

device_gemm_multiple_d_dl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_multiple_d_layernorm_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_multiple_d_multiple_r_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_multiple_d_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_multiple_d_xdl_cshuffle_lds_direct_load.hpp

Add basic support for direct loads from global to LDS (#999 )

2023-11-25 13:35:22 +01:00

device_gemm_multiple_d_xdl_cshuffle_v3_ab_scale.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950 )

2025-03-11 10:11:21 -07:00

device_gemm_multiple_d_xdl_cshuffle_v3_b_preshuffle.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862 )

2025-02-20 14:00:27 -08:00

device_gemm_multiple_d_xdl_cshuffle_v3_blockscale_bpreshuffle.hpp

Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 )

2025-06-12 09:25:59 +08:00

device_gemm_multiple_d_xdl_cshuffle_v3.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862 )

2025-02-20 14:00:27 -08:00

device_gemm_multiple_d_xdl_cshuffle.hpp

Narrowing error fix for codegen compilation (#2194 )

2025-05-16 11:11:54 -07:00

device_gemm_reduce_xdl_cshuffle.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_gemm_wmma_cshuffle_v3.hpp

WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 )

2025-06-04 12:22:33 +06:00

device_gemm_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_xdl_cshuffle_lds_direct_load.hpp

Add support for double buffering in direct load GEMM kernel (#1052 )

2023-12-03 23:08:47 +01:00

device_gemm_xdl_cshuffle_streamk_v3.hpp

Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) (#2145 )

2025-06-11 10:59:44 -07:00

device_gemm_xdl_cshuffle_v2.hpp

[GEMM] Optimization for MI200/300. (#1135 )

2024-01-19 07:02:22 -06:00

device_gemm_xdl_cshuffle_v3_b_preshuffle.hpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

device_gemm_xdl_cshuffle_v3_b_scale.hpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

device_gemm_xdl_cshuffle_v3_mx.hpp

Optimized GEMMs for MX FP4/8 (#2294 )

2025-06-05 13:54:15 -06:00

device_gemm_xdl_cshuffle_v3.hpp

CK pk_i4_t test failures fix (SWDEV-518629) (#2075 )

2025-04-14 16:58:57 +08:00

device_gemm_xdl_cshuffle_v3r1.hpp

Universal gemm splitk using reduce (with multi-d) (#1341 )

2024-07-19 22:01:22 +08:00

device_gemm_xdl_cshuffle.hpp

Add Gemm instances for performance improvement (#1018 )

2023-11-07 09:09:58 -06:00

device_gemm_xdl_layernorm_cshuffle.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_gemm_xdl_skip_b_lds.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_gemm_xdl_splitk_c_shuffle_lds_direct_load.hpp

Optimized GEMMs for MX FP4/8 (#2294 )

2025-06-05 13:54:15 -06:00

device_gemm_xdl_splitk_c_shuffle.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_gemm_xdl_streamk.hpp

Add support for more Navi2x and Navi3x models. (#1152 )

2024-02-02 11:35:26 -08:00

device_gemm_xdl_waveletmodel_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_xdl.hpp

Add support for more Navi2x and Navi3x models. (#1152 )

2024-02-02 11:35:26 -08:00

device_grouped_contraction_multiple_d_xdl_cshuffle.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_grouped_conv_bwd_data_multiple_d_wmma_cshuffle.hpp

Move SetZero functions inside the kernels for Grouped Conv (#2255 )

2025-06-11 23:41:03 +02:00

device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp

Grouped conv bwd weight with grouped gemm (#2304 )

2025-06-12 10:15:07 +02:00

device_grouped_conv_bwd_weight_dl.hpp

Add support for NGCHW in grouped conv bwd wei (#1491 )

2024-09-03 10:52:03 +02:00

device_grouped_conv_bwd_weight_explicit_xdl.hpp

Grouped conv bwd wei explicit GEMM for odd C/K (#2306 )

2025-06-10 11:17:12 +02:00

device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle.hpp

Move SetZero functions inside the kernels for Grouped Conv (#2255 )

2025-06-11 23:41:03 +02:00

device_grouped_conv_bwd_weight_two_stage_xdl_cshuffle.hpp

Move SetZero functions inside the kernels for Grouped Conv (#2255 )

2025-06-11 23:41:03 +02:00

device_grouped_conv_bwd_weight_wmma_cshuffle.hpp

Fix grid size calc for bwd wei (#2226 )

2025-05-26 16:51:09 +02:00

device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp

Move SetZero functions inside the kernels for Grouped Conv (#2255 )

2025-06-11 23:41:03 +02:00

device_grouped_conv_bwd_weight_xdl_cshuffle.hpp

Move SetZero functions inside the kernels for Grouped Conv (#2255 )

2025-06-11 23:41:03 +02:00

device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk.hpp

Statically Cast Pointer Offset (#1631 )

2024-11-05 09:59:08 -08:00

device_grouped_conv_fwd_dl_nhwc_kyxc_nhwk.hpp

Add Grouped Conv Fwd Large Tensor kernel (#1432 )

2024-08-06 10:06:10 +02:00

device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3.hpp

Grouped convolution forward with clamp (#2334 )

2025-06-16 15:36:53 +02:00

device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

Grouped convolution forward with clamp (#2334 )

2025-06-16 15:36:53 +02:00

device_grouped_conv_fwd_multiple_d_multiple_r_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_fwd_multiple_d_multiple_r.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_grouped_conv_fwd_multiple_d_wmma_cshuffle.hpp

Add Grouped Conv Fwd Large Tensor kernel (#1432 )

2024-08-06 10:06:10 +02:00

device_grouped_conv_fwd_multiple_d_xdl_cshuffle.hpp

Add instances for conv_scale with fp8@bf8->fp8 (#1220 )

2024-04-03 09:08:08 -05:00

device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp

Grouped convolution forward with clamp (#2334 )

2025-06-16 15:36:53 +02:00

device_grouped_conv_utils.hpp

Add support for GKCYX grouped conv fwd (#2015 )

2025-03-26 21:13:38 +01:00

device_grouped_gemm_multi_abd_xdl_fixed_nk.hpp

Fix grouped gemm check to avoid overflow (#1545 )

2024-10-04 17:32:43 +02:00

device_grouped_gemm_multiple_d_dl.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_grouped_gemm_multiple_d_splitk_xdl_cshuffle_two_stage.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_grouped_gemm_multiple_d_xdl_cshuffle_tile_loop.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_softmax_gemm_permute_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_xdl_fixed_nk.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_xdl_splitk_cshuffle.hpp

Split env.hpp header from the ck.hpp header. (#2049 )

2025-04-03 15:30:21 -07:00

device_grouped_gemm_xdl.hpp

Switch to v2 pipeline for grouped conv bwd data (#2181 )

2025-05-13 10:14:30 +02:00

device_grouped_query_attention_forward_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_image_to_column_impl.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_max_pool_bwd_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_moe_gemm_blockscale.hpp

Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 )

2025-06-12 09:25:59 +08:00

device_moe_gemm.hpp

Moe gemm activation (#2026 )

2025-04-23 10:35:34 +08:00

device_moe_mx_gemm_bns.hpp

Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 )

2025-06-12 09:25:59 +08:00

device_moe_mx_gemm.hpp

Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 )

2025-06-12 09:25:59 +08:00

device_multi_query_attention_forward_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_multiple_reduce_multiblock.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_multiple_reduce_threadwise.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_normalization_bwd_data_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_bwd_gamma_beta_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_fwd_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_fwd_splitk_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_permute_impl.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_pool2d_fwd_nhwc_nhwc.hpp

Rewrite pool2d fwd (#1462 )

2024-09-11 15:21:00 +02:00

device_pool3d_fwd_ndhwc_ndhwc.hpp

Refactor pool fwd (#815 )

2023-08-15 02:25:28 +08:00

device_put_element_impl.hpp

Maxpool bwd (#750 )

2023-06-19 09:44:22 -05:00

device_reduce_common.hpp

Support large: 12d tensor size for reduction kenrel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_multiblock.hpp

Support large: 12d tensor size for reduction kenrel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_threadwise_multi_d.hpp

Support large: 12d tensor size for reduction kenrel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_threadwise.hpp

Support large: 12d tensor size for reduction kenrel (#1465 )

2024-08-13 16:15:47 +02:00

device_softmax_impl.hpp

Revert "Grouped Gemm with looping over the tiles. (#788 )" (#982 )

2023-10-11 14:27:29 -05:00

device_sparse_embeddings_forward_layernorm.hpp

use old instrinsics with staging compiler (#1970 )

2025-03-12 07:29:09 -07:00

device_splitk_contraction_multiple_d_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00