mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 03:07:02 +00:00

Files

Sami Remes a3a12b8945 [rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)

[CK_TILE] Enable full transpose layout support for MX GEMM pipeline (#5813)

## Enable full transpose layout support for MX GEMM pipeline (32x32x64 MFMA)

### Summary

This PR enables all four matrix layout combinations (Row/Col, Row/Row,
Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp
tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously,
only the canonical `A=RowMajor, B=ColumnMajor` layout was supported.

### Changes

**Kernel-side transpose support:**

- **`warp_gemm_attribute_mfma.hpp`**: Introduce `kSplitFactor` logic in
`get_warp_dstr_encoding` to split the K-dimension distribution encoding
when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This
satisfies the `TransposeTileDistributionTraits` suffix validation
required by `load_tile_transpose`. The distribution encoding now also
receives the `DataType` template parameter to compute the split factor
based on packed element size.

- **`gemm_pipeline_ag_bg_cr_comp_async.hpp`**: Uncomment and enable the
`InputTileDistributionTraits` logic to properly transform LDS load tile
distributions for transposed reads. Add `static_assert`s to catch
misconfigurations where a layout requires transpose loads but the warp
tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr`
limits).

- **`load_tile_transpose.hpp`**: Fix `DataVec` sizing for packed types
(`pk_fp4_t`) — divide `vecLoadSize` by `PackedSize` to prevent buffer
overflow when each physical element contains multiple logical values.

- **`warp_gemm_attribute_mfma_impl.hpp`**: Set `kDefaultScale` to
`0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()`
overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring
correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`.

- **`warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`**: Add generic
`WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher
specialization to support arbitrary MX data type combinations (fp4, fp6,
fp8) with the 32x32x64 MFMA, consolidating the existing type-specific
aliases.

- **`gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`**: Simplify
`wg_attr_num_access` determination — `Double` for fp8, `Single`
otherwise.
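
To make the `kDefaultScale` value concrete, here is a minimal sketch of how an e8m0 byte decodes, assuming the OCP MX convention (8-bit exponent, bias 127, `0xFF` reserved for NaN) and assuming the 32-bit constant holds four packed scale bytes. The helper names `decode_e8m0` and `scale_byte` are hypothetical and are not part of CK Tile.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one e8m0 scale byte: a biased exponent (bias 127) with no sign
// and no mantissa, so the value is 2^(bits - 127).
// Assumption: 0xFF is the NaN encoding, as in the OCP MX specification.
inline float decode_e8m0(std::uint8_t bits)
{
    assert(bits != 0xFF && "0xFF encodes NaN in e8m0");
    return std::ldexp(1.0f, static_cast<int>(bits) - 127);
}

// Extract byte `i` (0..3) from a 32-bit word.
// Assumption: the word carries four independent e8m0 scales, one per byte.
inline std::uint8_t scale_byte(std::uint32_t packed, int i)
{
    assert(i >= 0 && i < 4);
    return static_cast<std::uint8_t>((packed >> (8 * i)) & 0xFFu);
}
```

Under these assumptions every byte of `0x7F7F7F7F` is `0x7F`, which decodes to 2^0 = 1.0, so the unscaled overloads multiply by one whichever byte the instruction selects.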

**Reference implementation fix:**

- **`reference_gemm.hpp`**: Fix nibble selection for packed 4-bit types
(`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and
`reference_gemm_abquant`. The previous logic used `k % 2` or
`index[K_DIM] & 1` to select which nibble to extract, which assumed K
was always the fast (contiguous) memory dimension. This is only true for
`A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the
flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its
parity to correctly select the nibble regardless of layout.
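
The nibble-selection idea can be sketched in isolation. The snippet below is not the CK reference code: `Desc2D` is a hypothetical stand-in for a tensor descriptor, and the convention that even flat offsets map to the low nibble is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2D descriptor; strides are counted in logical 4-bit elements.
struct Desc2D
{
    std::size_t stride0;
    std::size_t stride1;
    std::size_t offset(std::size_t i0, std::size_t i1) const
    {
        return i0 * stride0 + i1 * stride1;
    }
};

// Return the 4-bit value stored at logical index (i0, i1).
// The nibble is chosen from the parity of the flat memory offset, not from
// the parity of any single index, so the result is layout-independent.
// Assumption: even offsets sit in the low nibble, odd offsets in the high one.
inline std::uint8_t load_nibble(const std::vector<std::uint8_t>& bytes,
                                const Desc2D& desc,
                                std::size_t i0,
                                std::size_t i1)
{
    const std::size_t off = desc.offset(i0, i1);
    assert(off / 2 < bytes.size());
    const std::uint8_t byte = bytes[off / 2];
    return static_cast<std::uint8_t>((off & 1) ? (byte >> 4) & 0xF : byte & 0xF);
}
```

For a 2x2 column-major A (`Desc2D{1, 2}`) stored as bytes `{0x21, 0x43}`, element `(m=1, k=0)` has flat offset 1 and therefore lives in the high nibble of byte 0. Selecting by `k % 2` would read the low nibble instead, which is the kind of mismatch the fix addresses.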

**Test infrastructure:**

- **`test_mx_gemm_config.hpp`**: Add `MxGemmConfig32` base and
`MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64
warp tile.
- **`test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`**: Add `Config32`
test suites covering all four layout combinations. Restrict `Config16`
(16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds
`ds_read_tr` limits.
- **`test_mx_gemm_util.hpp`**: Fix the scale tensor layout so that A scales
are always row-major `[M, K/32]` and B scales are always column-major
`[K/32, N]`, independent of the A/B data layout.
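
As an illustration of the scale layout described above, the following sketch computes flat offsets into the two scale tensors. It assumes one scale per 32 consecutive K elements (taken from the `K/32` shapes) and densely packed tensors; the function names are hypothetical and do not come from `test_mx_gemm_util.hpp`.

```cpp
#include <cassert>
#include <cstddef>

// K elements covered by one scale, taken from the `K/32` shapes above.
constexpr std::size_t kScaleBlock = 32;

// A scales: row-major [M, K/32], so the K/32 index is contiguous.
inline std::size_t a_scale_offset(std::size_t m, std::size_t k, std::size_t K)
{
    assert(K % kScaleBlock == 0 && k < K);
    return m * (K / kScaleBlock) + k / kScaleBlock;
}

// B scales: column-major [K/32, N], so the K/32 index is contiguous here too.
inline std::size_t b_scale_offset(std::size_t k, std::size_t n, std::size_t K)
{
    assert(K % kScaleBlock == 0 && k < K);
    return n * (K / kScaleBlock) + k / kScaleBlock;
}
```

In both tensors the scale-block index along K is the fastest-varying one, which matches the statement that the scale layout is fixed regardless of whether the A and B data are row-major or column-major.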

### Test plan

- [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col +
32x32x64 all 4 layouts)
- [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)

2026-06-18 17:05:09 +00:00

add_rmsnorm2d_rdquant

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

atomic_add_op

Shuffle fix for gfx950 (#3491)

2026-01-13 09:21:29 -08:00

batched_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

batched_transpose

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

container

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

core

[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)

2026-06-18 17:05:09 +00:00

data_type

[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)

2026-06-05 12:27:41 +00:00

elementwise

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

epilogue

[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a)

2026-03-19 10:17:20 +01:00

flatmm

[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6)

2026-06-15 16:12:33 +00:00

fmha

[rocm-libraries] ROCm/rocm-libraries#8241 (commit cd183df)

2026-06-10 15:37:44 +00:00

gemm

[rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb)

2026-06-18 12:59:59 +00:00

gemm_block_scale

[rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236)

2026-06-12 21:11:59 +00:00

gemm_multi_abd

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

gemm_multi_d

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

gemm_mx

[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)

2026-06-18 17:05:09 +00:00

gemm_persistent_async_input

[rocm-libraries] ROCm/rocm-libraries#5863 (commit 31d9247)

2026-04-14 20:22:18 +00:00

gemm_streamk

[rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958)

2026-06-08 22:48:10 +00:00

gemm_tile_engine

[rocm-libraries] ROCm/rocm-libraries#4769 (commit 72ae66e)

2026-04-14 10:50:24 -07:00

gemm_weight_preshuffle

[rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d)

2026-06-05 07:17:09 +00:00

grouped_conv

[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)

2026-06-04 15:00:17 +00:00

grouped_gemm

[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)

2026-05-22 02:43:50 +00:00

grouped_gemm_abquant

[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)

2026-01-28 19:45:09 -08:00

grouped_gemm_multi_d

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

grouped_gemm_mx

[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6)

2026-06-15 16:12:33 +00:00

grouped_gemm_preshuffle

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

grouped_gemm_quant

[rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320)

2026-06-05 11:41:49 +00:00

image_to_column

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

layernorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

load_and_convert_tile

[rocm-libraries] ROCm/rocm-libraries#8487 (commit 06a73ba)

2026-06-17 11:07:22 +00:00

memory_copy

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

moe_smoothquant

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 11:17:11 -04:00

moe_sorting

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

multicast_load

[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)

2026-06-04 15:00:17 +00:00

permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

pooling

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

pooling_tile_engine

[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)

2026-04-01 07:31:46 +00:00

reduce

[rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)

2026-03-03 07:39:32 -08:00

rmsnorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

slice_tile

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

smoothquant

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 11:17:11 -04:00

tdm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

topk_softmax

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

utility

[rocm-libraries] ROCm/rocm-libraries#7724 (commit 4cb149a)

2026-05-29 18:45:13 +00:00

warp_gemm

[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf)

2026-06-18 17:05:09 +00:00

CMakeLists.txt

[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c)

2026-06-15 06:42:28 +00:00

README.md

[rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)

2026-03-03 07:39:32 -08:00

README.md

CK Tile Testing Guide

This document describes the test organization and available test targets for CK Tile operations.

Overview

CK Tile tests are organized with multiple levels of granularity to support different development workflows:

Global test labels - Run tests across all operations
Operation-specific umbrella targets - Run all tests for a specific operation
Individual test executables - Run specific tests

Global Test Labels

These targets run tests across all CK operations (not just CK Tile):

`ninja smoke`

Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).

ninja smoke

`ninja regression`

Run slower, more comprehensive regression tests.

ninja regression

`ninja check`

Run ALL available tests in the entire codebase.

ninja check

Operation-Specific Umbrella Targets

These targets run all tests for a specific CK Tile operation. Use them after changing a particular operation to validate every related test without running the entire test suite.

GEMM Operations

`ck_tile_gemm_tests`

Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)

ninja ck_tile_gemm_tests

Test executables included:

test_ck_tile_gemm_pipeline_mem
test_ck_tile_gemm_pipeline_compv3
test_ck_tile_gemm_pipeline_compv4
test_ck_tile_gemm_pipeline_persistent
test_ck_tile_gemm_pipeline_compv6
test_ck_tile_gemm_pipeline_comp_async (gfx95 only)
test_ck_tile_gemm_pipeline_*_wmma variants (gfx11/gfx12 only)

`ck_tile_gemm_block_scale_tests`

Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)

ninja ck_tile_gemm_block_scale_tests

Test executables included (29 in total), covering:

AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
ABQuant tests (base, padding, preshuffle)
BQuant tests (1D/2D variants, transpose)
BQuant with PreshuffleB (decode/prefill, 1D/2D)
BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
RowColQuant and TensorQuant tests

`ck_tile_gemm_streamk_tests`

Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)

ninja ck_tile_gemm_streamk_tests

Test executables included:

test_ck_tile_streamk_tile_partitioner
test_ck_tile_streamk_reduction
test_ck_tile_streamk_smoke
test_ck_tile_streamk_extended

`ck_tile_grouped_gemm_quant_tests`

Run all grouped GEMM quantization tests

ninja ck_tile_grouped_gemm_quant_tests

Test executables included:

test_ck_tile_grouped_gemm_quant_rowcol
test_ck_tile_grouped_gemm_quant_tensor
test_ck_tile_grouped_gemm_quant_aquant
test_ck_tile_grouped_gemm_quant_bquant
test_ck_tile_grouped_gemm_quant_bquant_preshuffleb

Other Operations

`ck_tile_fmha_tests`

Run all FMHA (Fused Multi-Head Attention) tests

ninja ck_tile_fmha_tests

Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32

`ck_tile_reduce_tests`

Run all reduce operation tests

ninja ck_tile_reduce_tests

Test executables included:

test_ck_tile_reduce2d
test_ck_tile_multi_reduce2d_threadwise
test_ck_tile_multi_reduce2d_multiblock

Individual Test Executables

You can also build and run individual test executables:

Build a specific test

ninja test_ck_tile_gemm_pipeline_mem

Run a specific test directly

./build/bin/test_ck_tile_gemm_pipeline_mem

Run a specific test through ctest

ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure