mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-12 10:08:01 +00:00

Files

Kiefer van Teutem 6cd016dde4 [CK Tile] Add Tile Distribution Encoding Calculator (#5515)

## Motivation

We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.

This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides warp distribution encodings for the A, B, and
C matrices. These encodings map matrix fragment coordinates to register
coordinates (lane, vector item) and vice versa. It accepts CTranspose,
Swizzle, and NumAccessA / NumAccessB template parameters for tweaking
the tile distributions. Swizzle modification will be implemented later.

The current implementation can deal with all intrinsic types and
block-hiding.

This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.

An example was added that uses TileDistrEncCalc to calculate and print
register layouts for Tile Distributions for some of our amdgcn_mma
structs. It also makes sure that the CTranspose modifier works as
intended.

Some additional gfx9 intrinsics were added to test the different types
of C-block-hiding layouts.

The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a bunch of related
sparse code since that would be a lot, and therefore I disabled
test_amdgcn_sparse_mma for now.

The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.

## Test Results

Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.

2026-04-13 08:00:31 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| add_rmsnorm2d_rdquant | CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) | 2026-04-10 11:17:11 -04:00 |
| atomic_add_op | Shuffle fix for gfx950 (#3491) | 2026-01-13 09:21:29 -08:00 |
| batched_gemm | Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335) | 2026-02-06 16:14:28 -08:00 |
| batched_transpose | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| container | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| core | [CK Tile] Add Tile Distribution Encoding Calculator (#5515) | 2026-04-13 08:00:31 +00:00 |
| data_type | [CK_TILE] add tf32 support (#4302) | 2026-03-19 10:17:20 +01:00 |
| elementwise | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| epilogue | [CK_TILE] add tf32 support (#4302) | 2026-03-19 10:17:20 +01:00 |
| flatmm | ck_tile: add gtest unit tests for MX flatmm (gfx950) (#5082) | 2026-03-11 15:46:58 -07:00 |
| fmha | [CK Tile] Add sink token gradient support in FMHA backward pass (#5504) | 2026-04-02 11:17:01 +08:00 |
| gemm | CK: Remove 4 orphaned files with verified replacements (~1,025 lines) (#6303) | 2026-04-10 11:22:31 -04:00 |
| gemm_block_scale | CK: Extract shared boilerplate from 47 gemm_quant test files (#6323) | 2026-04-11 06:00:26 -04:00 |
| gemm_multi_abd | [CK-Tile] move out memory operation from cshuffle epilogue class (#3359) | 2026-01-04 03:28:14 -08:00 |
| gemm_multi_d | [CK-Tile] move out memory operation from cshuffle epilogue class (#3359) | 2026-01-04 03:28:14 -08:00 |
| gemm_mx | [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849) | 2026-03-27 20:36:39 +00:00 |
| gemm_persistent_async_input | Add persistent async input scheduler for GEMM kernels (#3520) | 2026-01-20 10:37:09 -08:00 |
| gemm_streamk | [CK Tile] Stream-K gtest Code Gen (#5722) | 2026-04-02 21:05:44 +00:00 |
| gemm_tile_engine | [CK_TILE] Add pooling in tile_engine (#4469) | 2026-04-01 07:31:46 +00:00 |
| gemm_weight_preshuffle | [CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) | 2026-03-03 13:54:08 -08:00 |
| grouped_conv | [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) | 2026-03-27 10:17:10 +01:00 |
| grouped_gemm | [CK TILE] Skip work if any of Grouped GEMM groups M/N/K are zero. (#5050) | 2026-03-12 13:28:24 +00:00 |
| grouped_gemm_abquant | [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629) | 2026-01-28 19:45:09 -08:00 |
| grouped_gemm_multi_d | Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335) | 2026-02-06 16:14:28 -08:00 |
| grouped_gemm_preshuffle | [CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) | 2026-03-03 13:54:08 -08:00 |
| grouped_gemm_quant | CK: Remove 4 orphaned files with verified replacements (~1,025 lines) (#6303) | 2026-04-10 11:22:31 -04:00 |
| image_to_column | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| layernorm2d | [CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) | 2026-03-03 13:54:08 -08:00 |
| memory_copy | Mx fp6 flatmm (#3601) | 2026-02-02 16:04:40 +08:00 |
| moe_smoothquant | CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) | 2026-04-10 11:17:11 -04:00 |
| moe_sorting | Update unsigned long literals and format specifiers to work correctly in Windows (#3483) | 2026-01-02 22:16:41 -07:00 |
| permute | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| pooling | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| pooling_tile_engine | [CK_TILE] Add pooling in tile_engine (#4469) | 2026-04-01 07:31:46 +00:00 |
| reduce | test: Add umbrella test targets for CK Tile operations (#4301) | 2026-03-03 07:39:32 -08:00 |
| rmsnorm2d | [CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) | 2026-03-03 13:54:08 -08:00 |
| slice_tile | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| smoothquant | CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) | 2026-04-10 11:17:11 -04:00 |
| topk_softmax | chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) | 2025-11-28 13:49:54 -08:00 |
| utility | [CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938) | 2026-04-02 15:25:14 -06:00 |
| warp_gemm | chore: update copyright header for misc files (#3402) | 2025-12-11 08:25:29 -08:00 |
| CMakeLists.txt | [CK Tile] Stream-K gtest Code Gen (#5722) | 2026-04-02 21:05:44 +00:00 |
| README.md | test: Add umbrella test targets for CK Tile operations (#4301) | 2026-03-03 07:39:32 -08:00 |

README.md

# CK Tile Testing Guide

This document describes the test organization and available test targets for CK Tile operations.

## Overview

CK Tile tests are organized with multiple levels of granularity to support different development workflows:

1. Global test labels - Run tests across all operations
2. Operation-specific umbrella targets - Run all tests for a specific operation
3. Individual test executables - Run specific tests

## Global Test Labels

These targets run tests across all CK operations (not just CK Tile):

`ninja smoke`

Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).

ninja smoke

`ninja regression`

Run slower, more comprehensive regression tests.

ninja regression

`ninja check`

Run ALL available tests in the entire codebase.

ninja check
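
The three global targets can be chained so that the cheaper tier gates the more expensive one. The following is a minimal sketch, not part of the CK build: `run_tiers` and the `NINJA` override are hypothetical names introduced here, and the function assumes it is called from a configured build directory.

```shell
# Run the given ninja test targets in order, stopping at the first failure.
# NINJA is a hypothetical override (not provided by CK) so the sequence can
# be previewed with NINJA=echo instead of actually running the tests.
run_tiers() {
    local ninja_cmd="${NINJA:-ninja}"
    local tier
    for tier in "$@"; do
        "$ninja_cmd" "$tier" || return 1
    done
}

# Example: run_tiers smoke regression
```

Running `smoke` first keeps the feedback loop short: `regression` only starts once the fast tests have passed.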

## Operation-Specific Umbrella Targets

These targets allow you to run all tests for a specific CK Tile operation. This is useful when making changes to a particular operation and wanting to validate all related tests without running the entire test suite.

### GEMM Operations

`ck_tile_gemm_tests`

Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)

ninja ck_tile_gemm_tests

Test executables included:

- test_ck_tile_gemm_pipeline_mem
- test_ck_tile_gemm_pipeline_compv3
- test_ck_tile_gemm_pipeline_compv4
- test_ck_tile_gemm_pipeline_persistent
- test_ck_tile_gemm_pipeline_compv6
- test_ck_tile_gemm_pipeline_comp_async (gfx95 only)
- test_ck_tile_gemm_pipeline_*_wmma variants (gfx11/gfx12 only)

`ck_tile_gemm_block_scale_tests`

Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)

ninja ck_tile_gemm_block_scale_tests

Test executables included (29 in total), covering:

- AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
- ABQuant tests (base, padding, preshuffle)
- BQuant tests (1D/2D variants, transpose)
- BQuant with PreshuffleB (decode/prefill, 1D/2D)
- BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
- RowColQuant and TensorQuant tests

`ck_tile_gemm_streamk_tests`

Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)

ninja ck_tile_gemm_streamk_tests

Test executables included:

- test_ck_tile_streamk_tile_partitioner
- test_ck_tile_streamk_reduction
- test_ck_tile_streamk_smoke
- test_ck_tile_streamk_extended

`ck_tile_grouped_gemm_quant_tests`

Run all grouped GEMM quantization tests

ninja ck_tile_grouped_gemm_quant_tests

Test executables included:

- test_ck_tile_grouped_gemm_quant_rowcol
- test_ck_tile_grouped_gemm_quant_tensor
- test_ck_tile_grouped_gemm_quant_aquant
- test_ck_tile_grouped_gemm_quant_bquant
- test_ck_tile_grouped_gemm_quant_bquant_preshuffleb

### Other Operations

`ck_tile_fmha_tests`

Run all FMHA (Fused Multi-Head Attention) tests

ninja ck_tile_fmha_tests

Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32

`ck_tile_reduce_tests`

Run all reduce operation tests

ninja ck_tile_reduce_tests

Test executables included:

- test_ck_tile_reduce2d
- test_ck_tile_multi_reduce2d_threadwise
- test_ck_tile_multi_reduce2d_multiblock
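
When scripting around these targets, it can help to look up the umbrella target from the operation being changed. The sketch below is illustrative only: `umbrella_target` is a hypothetical helper, and the operation names on the left-hand side are an assumption (they mirror the test directory names), while the target names are the ones listed in this section.

```shell
# Map an operation name to one of the umbrella targets described above.
# The operation names are an assumption (they mirror the test directory
# names); anything unrecognized falls back to the global "check" target.
umbrella_target() {
    case "$1" in
        gemm)               echo ck_tile_gemm_tests ;;
        gemm_block_scale)   echo ck_tile_gemm_block_scale_tests ;;
        gemm_streamk)       echo ck_tile_gemm_streamk_tests ;;
        grouped_gemm_quant) echo ck_tile_grouped_gemm_quant_tests ;;
        fmha)               echo ck_tile_fmha_tests ;;
        reduce)             echo ck_tile_reduce_tests ;;
        *)                  echo check ;;
    esac
}

# Example: ninja "$(umbrella_target fmha)"
```

Falling back to `check` is the conservative choice here: an operation without a dedicated umbrella target still gets tested, at the cost of running the whole suite.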

## Individual Test Executables

You can also build and run individual test executables:

Build a specific test

ninja test_ck_tile_gemm_pipeline_mem

Run a specific test directly

./build/bin/test_ck_tile_gemm_pipeline_mem

Run a specific test through ctest

ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure
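
Because `ctest -R` takes a regular expression, several individual tests can be selected in one invocation. The helper below is a small sketch under two assumptions: `test_regex` is a hypothetical name introduced here, and each test is registered with ctest under the same name as its executable, as the command above suggests.

```shell
# Build an anchored regex that matches exactly the given test names.
# test_regex is a hypothetical helper, not something CK provides.
test_regex() {
    local joined="" name
    for name in "$@"; do
        # Join the names with "|", without a leading separator.
        joined="${joined:+$joined|}$name"
    done
    printf '^(%s)$\n' "$joined"
}

# Example (from the build directory):
#   ctest -R "$(test_regex test_ck_tile_streamk_smoke test_ck_tile_streamk_extended)" --output-on-failure
```

Anchoring with `^` and `$` avoids accidentally matching other tests whose names share a prefix with the ones requested.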