composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 17:19:12 +00:00

Files

Haocong WANG cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950 )

* Added two kernels for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce the number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

2025-03-11 10:11:21 -07:00

codegen_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_avgpool2d_bwd_nhwc_nhwc.hpp

Pool2d max/avg kernel in the BWD version (#1494 )

2024-09-12 11:47:52 +02:00

device_avgpool3d_bwd_ndhwc_ndhwc.hpp

Average pool backward deviceOP and example (#797 )

2023-08-10 12:04:35 +08:00

device_batched_contraction_multiple_d_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_batched_contraction_multiple_d_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_e_permute_xdl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_gemm_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_multi_d_xdl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_multiple_d_dl.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_batched_gemm_multiple_d_gemm_multiple_d_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

Add SplitK support into Batched GEMM V3 (#1729 )

2024-12-13 21:08:35 +01:00

device_batched_gemm_reduce_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_softmax_gemm_permute_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_batched_gemm_softmax_gemm_permute_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batched_gemm_softmax_gemm_xdl_cshuffle.hpp

MIGraphX hipRTC fix (#1923 )

2025-03-03 07:55:05 -08:00

device_batched_gemm_xdl_fpAintB_b_scale.hpp

Added Int4 mixed batch gemm support (#1839 )

2025-02-10 11:17:02 +08:00

device_batched_gemm_xdl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_batchnorm_backward_impl.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_batchnorm_forward_impl_obsolete.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_batchnorm_forward_impl.hpp

Fixed GroupedGemmFixedNK with hipGraph (#1065 )

2023-11-30 15:09:27 -06:00

device_cgemm_4gemm_xdl_cshuffle.hpp

Implement GetWorkSpaceSize from BaseOperator. (#1564 )

2024-10-12 14:05:11 +08:00

device_column_to_image_impl.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_contraction_multiple_abd_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_contraction_multiple_d_xdl_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_contraction_utils.hpp

Fix continuous dim selection in contraction (#1336 )

2024-06-18 10:26:49 +02:00

device_conv2d_backward_weight_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

Use asynchronous version of hipMemset (#850 )

2023-08-18 11:14:59 +08:00

device_conv2d_bwd_data_xdl_nhwc_kyxc_nhwk.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_conv2d_fwd_xdl_c_shuffle_bias_activation_add_nhwc_kyxc_nhwk.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_conv2d_fwd_xdl_c_shuffle_bias_activation_nhwc_kyxc_nhwk.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_conv2d_fwd_xdl_nhwc_kyxc_nhwk.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_conv3d_fwd_naive_ndhwc_kzyxc_ndhwk.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_conv3d_fwd_xdl_ndhwc_kzyxc_ndhwk.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_convnd_bwd_data_nwc_kxc_nwk_dl.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_convnd_bwd_data_nwc_kxc_nwk_xdl.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_elementwise_dynamic_vector_dims_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_elementwise_normalization_impl.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_elementwise_scale_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_fpAintB_gemm_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_bias_add_reduce_xdl_cshuffle.hpp

Disable XDL kernels on unsupported HW Add ck::is_xdl_supported (#768 )

2023-07-26 07:19:55 -07:00

device_gemm_dl.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_dpp.hpp

Code clean-up (#1285 )

2024-05-10 09:41:39 -07:00

device_gemm_multiple_abd_xdl_cshuffle.hpp

bf16A_Int8B with fastgelu/bias (#1264 )

2024-04-26 07:26:30 -05:00

device_gemm_multiple_d_dl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_multiple_d_layernorm_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_multiple_d_multiple_r_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_multiple_d_wmma_cshuffle.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_multiple_d_xdl_cshuffle_lds_direct_load.hpp

Add basic support for direct loads from global to LDS (#999 )

2023-11-25 13:35:22 +01:00

device_gemm_multiple_d_xdl_cshuffle_v3_ab_scale.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950 )

2025-03-11 10:11:21 -07:00

device_gemm_multiple_d_xdl_cshuffle_v3_b_preshuffle.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862 )

2025-02-20 14:00:27 -08:00

device_gemm_multiple_d_xdl_cshuffle_v3.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862 )

2025-02-20 14:00:27 -08:00

device_gemm_multiple_d_xdl_cshuffle.hpp

Rebase the PR #1520 to ROCm repo. (#1574 )

2025-02-20 18:58:14 -08:00

device_gemm_reduce_xdl_cshuffle.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_gemm_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_gemm_xdl_cshuffle_lds_direct_load.hpp

Add support for double buffering in direct load GEMM kernel (#1052 )

2023-12-03 23:08:47 +01:00

device_gemm_xdl_cshuffle_streamk_v3.hpp

BF16 GEMM Stream-K (#1541 )

2025-01-02 10:30:04 -08:00

device_gemm_xdl_cshuffle_v2.hpp

[GEMM] Optimization for MI200/300. (#1135 )

2024-01-19 07:02:22 -06:00

device_gemm_xdl_cshuffle_v3_b_preshuffle.hpp

Ck int4 moe develop (#1949 )

2025-03-10 11:16:44 +08:00

device_gemm_xdl_cshuffle_v3_b_scale.hpp

Implement the fp16xint4 scale weight only kernel for Ali (#1786 )

2025-01-03 18:35:21 +08:00

device_gemm_xdl_cshuffle_v3.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862 )

2025-02-20 14:00:27 -08:00

device_gemm_xdl_cshuffle_v3r1.hpp

Universal gemm splitk using reduce (with multi-d) (#1341 )

2024-07-19 22:01:22 +08:00

device_gemm_xdl_cshuffle.hpp

Add Gemm instances for performance improvement (#1018 )

2023-11-07 09:09:58 -06:00

device_gemm_xdl_layernorm_cshuffle.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_gemm_xdl_skip_b_lds.hpp

replace the ENV macro with CK_ENV (#1296 )

2024-05-17 10:42:51 -07:00

device_gemm_xdl_splitk_c_shuffle_lds_direct_load.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_gemm_xdl_splitk_c_shuffle.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_gemm_xdl_streamk.hpp

Add support for more Navi2x and Navi3x models. (#1152 )

2024-02-02 11:35:26 -08:00

device_gemm_xdl_waveletmodel_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_gemm_xdl.hpp

Add support for more Navi2x and Navi3x models. (#1152 )

2024-02-02 11:35:26 -08:00

device_grouped_contraction_multiple_d_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_bwd_data_multiple_d_wmma_cshuffle.hpp

Support large batch tensors in grouped conv bwd data (#1711 )

2024-12-06 10:55:23 +01:00

device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v1.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_bwd_weight_dl.hpp

Add support for NGCHW in grouped conv bwd wei (#1491 )

2024-09-03 10:52:03 +02:00

device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_bwd_weight_two_stage_xdl_cshuffle.hpp

Conditionally log a DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle warning (#1860 )

2025-02-11 17:25:00 -07:00

device_grouped_conv_bwd_weight_wmma_cshuffle.hpp

Add support for NGCHW in grouped conv bwd wei (#1491 )

2024-09-03 10:52:03 +02:00

device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp

Fix build for gfx950 (#1904 )

2025-02-19 13:47:39 -08:00

device_grouped_conv_bwd_weight_xdl_cshuffle.hpp

Add support for NGCHW in basic grouped conv bwd wei kernel (#1887 )

2025-02-20 10:02:08 +01:00

device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk.hpp

Statically Cast Pointer Offset (#1631 )

2024-11-05 09:59:08 -08:00

device_grouped_conv_fwd_dl_nhwc_kyxc_nhwk.hpp

Add Grouped Conv Fwd Large Tensor kernel (#1432 )

2024-08-06 10:06:10 +02:00

device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_fwd_multiple_d_multiple_r_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_fwd_multiple_d_multiple_r.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_grouped_conv_fwd_multiple_d_wmma_cshuffle.hpp

Add Grouped Conv Fwd Large Tensor kernel (#1432 )

2024-08-06 10:06:10 +02:00

device_grouped_conv_fwd_multiple_d_xdl_cshuffle.hpp

Add instances for conv_scale with fp8@bf8->fp8 (#1220 )

2024-04-03 09:08:08 -05:00

device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_conv_utils.hpp

Add support for NGCHW in grouped conv fwd (#1499 )

2024-09-20 10:45:46 +02:00

device_grouped_gemm_multi_abd_xdl_fixed_nk.hpp

Fix grouped gemm check to avoid overflow (#1545 )

2024-10-04 17:32:43 +02:00

device_grouped_gemm_multiple_d_dl.hpp

LWPCK-2429: Device grouped GEMM uses Async Memcpy (#1695 )

2024-12-02 09:13:56 +01:00

device_grouped_gemm_multiple_d_splitk_xdl_cshuffle_two_stage.hpp

LWPCK-2429: Device grouped GEMM uses Async Memcpy (#1695 )

2024-12-02 09:13:56 +01:00

device_grouped_gemm_multiple_d_xdl_cshuffle_tile_loop.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_softmax_gemm_permute_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_xdl_fixed_nk.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_xdl_splitk_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_gemm_xdl.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00

device_grouped_query_attention_forward_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_image_to_column_impl.hpp

Codegen hipRTC compilation (#1579 )

2025-01-31 09:48:39 -08:00

device_max_pool_bwd_impl.hpp

Refactor elementwise kernels (#1222 )

2024-04-19 13:31:17 +02:00

device_moe_gemm.hpp

ck moe gemm implement (#1936 )

2025-03-05 15:56:55 +08:00

device_multi_query_attention_forward_wmma.hpp

Merging the gfx12 code into public repo. (#1362 )

2024-06-27 00:33:34 -07:00

device_multiple_reduce_multiblock.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_multiple_reduce_threadwise.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_normalization_bwd_data_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_bwd_gamma_beta_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_fwd_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_normalization_fwd_splitk_impl.hpp

layernorm and groupnorm backward data (#1083 )

2023-12-19 04:23:11 +08:00

device_permute_impl.hpp

update copyright headers (#726 )

2023-05-31 18:46:57 -05:00

device_pool2d_fwd_nhwc_nhwc.hpp

Rewrite pool2d fwd (#1462 )

2024-09-11 15:21:00 +02:00

device_pool3d_fwd_ndhwc_ndhwc.hpp

Refactor pool fwd (#815 )

2023-08-15 02:25:28 +08:00

device_put_element_impl.hpp

Maxpool bwd (#750 )

2023-06-19 09:44:22 -05:00

device_reduce_common.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_multiblock.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_threadwise_multi_d.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_reduce_threadwise.hpp

Support large: 12d tensor size for reduction kernel (#1465 )

2024-08-13 16:15:47 +02:00

device_softmax_impl.hpp

Revert "Grouped Gemm with looping over the tiles. (#788 )" (#982 )

2023-10-11 14:27:29 -05:00

device_sparse_embeddings_forward_layernorm.hpp

Replace buffer load/store intrinsics with builtins (#1876 )

2025-03-05 14:33:28 -08:00

device_splitk_contraction_multiple_d_xdl_cshuffle.hpp

Merge from internal (#1857 )

2025-02-07 15:05:05 -07:00