mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-13 02:27:33 +00:00

Files

Kiefer van Teutem 9fe98c864f [CK Tile] Add Tile Distribution Encoding Calculator (#5515 )

## Motivation

We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.

This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides ABC warp distribution encodings for mapping
matrix fragment coordinates to register coordinates
(lane, vector item) and vice versa. It is able to take CTranpose,
Swizzle, and NumAccessA / NumAccessB template parameters for tweaking
the tile distributions. Swizzle modification will be implemented later.

The current implementation can deal with all intrinsic types and
block-hiding.

This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.

An Example was added that uses the Tile Distr Enc Calc to calc and print
register layouts for Tile Distributions for some of our amdgcn_mma
structs. It also makes sure that the CTranspose modifier works as
intended.

Some additional gfx9 intrinsics were added to test block-hiding layouts
for the different types of C-block-hiding layouts.

The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a bunch of related
sparse code since that would be a lot, and therefore I disabled
test_amdgcn_sparse_mma for now.

The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.

## Test Results

Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.

2026-04-13 08:00:31 +00:00

algorithm

[CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334 )

2026-03-20 15:45:45 +00:00

arch

[CK Tile] Add Tile Distribution Encoding Calculator (#5515 )

2026-04-13 08:00:31 +00:00

container

CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302 )

2026-04-10 11:17:11 -04:00

numeric

CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302 )

2026-04-10 11:17:11 -04:00

tensor

CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302 )

2026-04-10 11:17:11 -04:00

utility

[CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938 )

2026-04-02 15:25:14 -06:00

config.hpp

Add missing gfx1033 to gfx103 group definition in ck (#5141 )

2026-04-03 13:44:38 -06:00

README.md

introducing ck_tile! (#1216 )

2024-04-15 19:27:12 -05:00

README.md

ck_tile/core

ck_tile/core contains every basic functions and structures to create a GPU kernel using ck_tile. User should only include ck_tile/core.hpp this single header to use all the functionality. Everything is under ck_tile namespace. The coding style under this folder should be similar to std (snake_case for structure/function, Camel for template types...)

algorithm/
    coordinate transform and some other reusable algorithm
arch/
    contains some basic device building block like mma, buffer addressing, etc...
container/
    contains basic container data structure, array/sequence/tuple/...
numeric/
    data type, and data type related math
tensor/
    tensor descriptors and tile level API
utility/
    other utility function for both host/device