mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 02:05:50 +00:00

Kiefer van Teutem e785241250 [CK_Tile] Refactor amdgcn_mma policy structs (#5272 )

## Motivation
This MR updates the intrinsic layout parameters to make them simpler,
clearer, and more flexible. It also includes a number of small refactors
that reduce boilerplate and code duplication.

## Technical Details
In CK Tile and old CK, the full set of information available in the
intrinsic wrappers, for WMMA and MFMA combined, would be something like:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kAMBlock;
static constexpr index_t kBNBlock;

static constexpr index_t kRepeat;
static constexpr index_t kAMLane;
static constexpr index_t kBNLane;
static constexpr index_t kABK0PerLane;
static constexpr index_t kABKLane;
static constexpr index_t kABK1PerLane;

static constexpr index_t kCMLane;
static constexpr index_t kCNLane;
static constexpr index_t kCM0PerLane;
static constexpr index_t kCM1PerLane;

using kABPs2RHssMajor = sequence<2, 1>;
using kABPs2RHssMinor = sequence<1, 0>;
using kABYs2RHsMajor  = sequence<2, 2>;
using kABYs2RHsMinor  = sequence<0, 2>;

using kCPs2RHssMajor = sequence<1, 2>;
using kCPs2RHssMinor = sequence<1, 0>;
using kCYs2RHsMajor  = sequence<1, 1>;
using kCYs2RHsMinor  = sequence<0, 2>;

using kCTPs2RHssMajor = sequence<2, 1>;
using kCTPs2RHssMinor = sequence<1, 0>;
using kCTYs2RHsMajor  = sequence<2, 2>;
using kCTYs2RHsMinor  = sequence<0, 2>;
```
Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:

```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;

// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;

// Layout parameters
static constexpr index_t kABKPerLane;  // K2 * K0. Always the same, even for different A / B layouts.
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat;     // Used for RDNA3 repeated inputs and CDNA block hiding.
static constexpr index_t kCMPerLane;   // M2 * M0
static constexpr index_t kCMNumAccess; // M2

// Derived properties
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;
```

Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not, these 7 parameters are more general than the original 12. They can handle intrinsic and mid-level features that are currently awkward in CK Tile, such as AttrNumAccess, different A / B layouts, more general block hiding (currently very limited in CK Tile), and future arch features.

Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.
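The relationships stated in the parameter comments (`kABKPerLane` is `K2 * K0`, `kAKNumAccess` is `K2`, and so on) can be checked at compile time. The following is a minimal sketch, not the code in this MR: `example_layout_params`, `derived_layout`, and `vec_sketch_t` are hypothetical names, the numeric values are illustrative only, `std::array` stands in for `ext_vector_t`, and the per-lane vector lengths (`kABKPerLane` elements for A and B, `kCMPerLane` for C) are an assumption.

```cpp
#include <array>
#include <cstddef>

using index_t = int; // stand-in for the library's index type (assumption)

// Stand-in for ext_vector_t: a fixed-size array instead of a compiler vector type.
template <typename T, index_t N>
using vec_sketch_t = std::array<T, static_cast<std::size_t>(N)>;

// Hypothetical parameter set; the values are illustrative, not taken from a real intrinsic.
struct example_layout_params
{
    using ADataType = float;
    using BDataType = float;
    using CDataType = float;

    static constexpr index_t kABKPerLane  = 8;  // K2 * K0
    static constexpr index_t kAKNumAccess = 2;  // K2 for A
    static constexpr index_t kBKNumAccess = 1;  // K2 for B (may differ from A)
    static constexpr index_t kCMPerLane   = 16; // M2 * M0
    static constexpr index_t kCMNumAccess = 4;  // M2
};

// Derives the remaining factors and the vector types from the parameters,
// so that an inconsistent parameter set fails to compile.
template <typename P>
struct derived_layout
{
    static_assert(P::kABKPerLane % P::kAKNumAccess == 0, "kABKPerLane must be a multiple of kAKNumAccess");
    static_assert(P::kABKPerLane % P::kBKNumAccess == 0, "kABKPerLane must be a multiple of kBKNumAccess");
    static_assert(P::kCMPerLane % P::kCMNumAccess == 0, "kCMPerLane must be a multiple of kCMNumAccess");

    // K0 and M0 follow from the products given in the parameter comments.
    static constexpr index_t kAK0 = P::kABKPerLane / P::kAKNumAccess;
    static constexpr index_t kBK0 = P::kABKPerLane / P::kBKNumAccess;
    static constexpr index_t kCM0 = P::kCMPerLane / P::kCMNumAccess;

    // Assumption: each lane's vector holds exactly the per-lane element count.
    using AVecType = vec_sketch_t<typename P::ADataType, P::kABKPerLane>;
    using BVecType = vec_sketch_t<typename P::BDataType, P::kABKPerLane>;
    using CVecType = vec_sketch_t<typename P::CDataType, P::kCMPerLane>;
};
```

Because the vector types are computed rather than written by hand, a parameter set whose vector length disagrees with its layout cannot be expressed in this sketch.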

I added a detailed explanation of the new params in terms of register mappings at the top of amdgcn_mma.hpp.

Other refactorings I did in this MR:

- Add an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. This should also make auto-generating the amdgcn_mma specializations much easier.
- Simplify the MmaOpTraits significantly by only including those parameters that cannot be read directly from the MmaOp itself. This removes duplicated variables and simplifies higher-level code.
- Remove the overloaded term "Block" for intrinsic dimensions and replace it with "Frag". Some spots were already using "Frag" for combined intrinsics; in those cases I changed the term to "Chunk".
- Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).
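
The base-struct idea in the first two bullets can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `amdgcn_mma_base`: `mma_base_sketch`, `lane_vec_t`, `mma_op_small`, `mma_op_large`, and `flops_per_frag` are hypothetical names, `std::array` stands in for the real vector type, and the sizes are illustrative.

```cpp
#include <array>
#include <cstddef>

using index_t = int; // stand-in for the library's index type (assumption)

// Stand-in for a per-lane register vector.
template <typename T, index_t N>
using lane_vec_t = std::array<T, static_cast<std::size_t>(N)>;

// Hypothetical base: every shared member is declared once, from template arguments.
template <typename A, typename B, typename C,
          index_t M, index_t N, index_t K,
          index_t ABKPerLane, index_t CMPerLane>
struct mma_base_sketch
{
    using ADataType = A;
    using BDataType = B;
    using CDataType = C;

    static constexpr index_t kM = M;
    static constexpr index_t kN = N;
    static constexpr index_t kK = K;

    static constexpr index_t kABKPerLane = ABKPerLane;
    static constexpr index_t kCMPerLane  = CMPerLane;

    // Derived once here, so no specialization can declare a mismatching vector type.
    using AVecType = lane_vec_t<A, ABKPerLane>;
    using BVecType = lane_vec_t<B, ABKPerLane>;
    using CVecType = lane_vec_t<C, CMPerLane>;
};

// Hypothetical specializations: each one only lists its parameters.
struct mma_op_small : mma_base_sketch<float, float, float, 16, 16, 4, 1, 4>
{
};

struct mma_op_large : mma_base_sketch<float, float, float, 32, 32, 2, 1, 16>
{
};

// Higher-level code reads sizes straight from the op instead of from a
// separate traits struct that repeats them.
template <typename Op>
constexpr index_t flops_per_frag()
{
    return 2 * Op::kM * Op::kN * Op::kK; // one multiply and one add per MAC
}
```

With this shape, adding an op means adding one line of template arguments, which is also the kind of declaration that is easy to generate automatically.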

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-03-20 09:07:00 -06:00

add_rmsnorm2d_rdquant

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

atomic_add_op

Shuffle fix for gfx950 (#3491 )

2026-01-13 09:21:29 -08:00

batched_gemm

Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335 )

2026-02-06 16:14:28 -08:00

batched_transpose

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

container

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

core

[CK_Tile] Refactor amdgcn_mma policy structs (#5272 )

2026-03-20 09:07:00 -06:00

data_type

[CK_TILE] add tf32 support (#4302 )

2026-03-19 10:17:20 +01:00

elementwise

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

epilogue

[CK_TILE] add tf32 support (#4302 )

2026-03-19 10:17:20 +01:00

flatmm

ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082 )

2026-03-11 15:46:58 -07:00

fmha

[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368 )

2026-03-11 09:59:50 +00:00

gemm

Ck/joye/revert oob check (#5640 )

2026-03-20 12:30:08 +00:00

gemm_block_scale

[CK Tile] Eight Waves pipeline GEMM (#4964 )

2026-03-16 09:30:54 +01:00

gemm_multi_abd

[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 )

2026-01-04 03:28:14 -08:00

gemm_multi_d

[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 )

2026-01-04 03:28:14 -08:00

gemm_mx

[CK_TILE] Enable MXFP6 for MX GEMM op (#5095 )

2026-03-19 18:07:47 -07:00

gemm_persistent_async_input

Add persistent async input scheduler for GEMM kernels (#3520 )

2026-01-20 10:37:09 -08:00

gemm_streamk

[CK_TILE] Rename Stream-K grid function (#4795 )

2026-03-20 03:27:44 -06:00

gemm_streamk_tile_engine

ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082 )

2026-03-11 15:46:58 -07:00

gemm_tile_engine

Disable gemm_blockscale_f8 on gfx90a by default. (#3338 )

2025-12-02 11:33:33 -08:00

gemm_weight_preshuffle

[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045 )

2026-03-03 13:54:08 -08:00

grouped_conv

[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 )

2026-01-04 03:28:14 -08:00

grouped_gemm

[CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050 )

2026-03-12 13:28:24 +00:00

grouped_gemm_abquant

[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 )

2026-01-28 19:45:09 -08:00

grouped_gemm_multi_d

Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335 )

2026-02-06 16:14:28 -08:00

grouped_gemm_preshuffle

[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045 )

2026-03-03 13:54:08 -08:00

grouped_gemm_quant

test: Add umbrella test targets for CK Tile operations (#4301 )

2026-03-03 07:39:32 -08:00

image_to_column

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

layernorm2d

[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045 )

2026-03-03 13:54:08 -08:00

memory_copy

Mx fp6 flatmm (#3601 )

2026-02-02 16:04:40 +08:00

moe_smoothquant

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

moe_sorting

Update unsigned long literals and format specifiers to work correctly in Windows (#3483 )

2026-01-02 22:16:41 -07:00

permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

pooling

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

reduce

test: Add umbrella test targets for CK Tile operations (#4301 )

2026-03-03 07:39:32 -08:00

rmsnorm2d

[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045 )

2026-03-03 13:54:08 -08:00

slice_tile

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

smoothquant

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

topk_softmax

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313 )

2025-11-28 13:49:54 -08:00

utility

[CK_TILE] Optimize ck_tile::sequence to reduce template instantiation depth [2A] (#5028 )

2026-03-11 14:24:54 -06:00

warp_gemm

chore: update copyright header for misc files (#3402 )

2025-12-11 08:25:29 -08:00

CMakeLists.txt

ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082 )

2026-03-11 15:46:58 -07:00

README.md

test: Add umbrella test targets for CK Tile operations (#4301 )

2026-03-03 07:39:32 -08:00

README.md

CK Tile Testing Guide

This document describes the test organization and available test targets for CK Tile operations.

Overview

CK Tile tests are organized with multiple levels of granularity to support different development workflows:

Global test labels - Run tests across all operations
Operation-specific umbrella targets - Run all tests for a specific operation
Individual test executables - Run specific tests

Global Test Labels

These targets run tests across all CK operations (not just CK Tile):

`ninja smoke`

Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).

ninja smoke

`ninja regression`

Run slower, more comprehensive regression tests.

ninja regression

`ninja check`

Run ALL available tests in the entire codebase.

ninja check

Operation-Specific Umbrella Targets

These targets allow you to run all tests for a specific CK Tile operation. This is useful when making changes to a particular operation and wanting to validate all related tests without running the entire test suite.

GEMM Operations

`ck_tile_gemm_tests`

Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)

ninja ck_tile_gemm_tests

Test executables included:

test_ck_tile_gemm_pipeline_mem
test_ck_tile_gemm_pipeline_compv3
test_ck_tile_gemm_pipeline_compv4
test_ck_tile_gemm_pipeline_persistent
test_ck_tile_gemm_pipeline_compv6
test_ck_tile_gemm_pipeline_comp_async (gfx95 only)
test_ck_tile_gemm_pipeline_*_wmma variants (gfx11/gfx12 only)

`ck_tile_gemm_block_scale_tests`

Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)

ninja ck_tile_gemm_block_scale_tests

Test executables included (29 in total), covering:

AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
ABQuant tests (base, padding, preshuffle)
BQuant tests (1D/2D variants, transpose)
BQuant with PreshuffleB (decode/prefill, 1D/2D)
BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
RowColQuant and TensorQuant tests

`ck_tile_gemm_streamk_tests`

Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)

ninja ck_tile_gemm_streamk_tests

Test executables included:

test_ck_tile_streamk_tile_partitioner
test_ck_tile_streamk_reduction
test_ck_tile_streamk_smoke
test_ck_tile_streamk_extended

`ck_tile_grouped_gemm_quant_tests`

Run all grouped GEMM quantization tests

ninja ck_tile_grouped_gemm_quant_tests

Test executables included:

test_ck_tile_grouped_gemm_quant_rowcol
test_ck_tile_grouped_gemm_quant_tensor
test_ck_tile_grouped_gemm_quant_aquant
test_ck_tile_grouped_gemm_quant_bquant
test_ck_tile_grouped_gemm_quant_bquant_preshuffleb

Other Operations

`ck_tile_fmha_tests`

Run all FMHA (Fused Multi-Head Attention) tests

ninja ck_tile_fmha_tests

Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32

`ck_tile_reduce_tests`

Run all reduce operation tests

ninja ck_tile_reduce_tests

Test executables included:

test_ck_tile_reduce2d
test_ck_tile_multi_reduce2d_threadwise
test_ck_tile_multi_reduce2d_multiblock

Individual Test Executables

You can also build and run individual test executables:

Build a specific test

ninja test_ck_tile_gemm_pipeline_mem

Run a specific test directly

./build/bin/test_ck_tile_gemm_pipeline_mem

Run a specific test through ctest

ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure