Aviral Goel 6a17f951ea [rocm-libraries] ROCm/rocm-libraries#7714 (commit 13ae6d6)
[CK_TILE] Restructure naive GEMM tutorial and add tile distribution tutorials (#7714)

## Summary
- Flatten naive GEMM tutorial directory structure (remove `block_level/`, `host_level/`, `warp_level/` subdirs) to match the composable_kernel repo layout
- Add `CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION` macro switch to toggle between standard and transposed WarpGemm variants
- Consolidate 6 verbose markdown files (~2600 lines) into one concise README (~120 lines)
- Add 3 tile distribution encoding tutorials with step-by-step "How to read Ps/Ys" annotations:
  - Tutorial 1: A-matrix DRAM load (256×32), NDimP=2, coalesced K-splitting
  - Tutorial 2: B-matrix DRAM load (128×32), same pattern, fewer iterations
  - Tutorial 3: C-matrix register layout (32×32), MFMA m32n32k8 hardware output mapping, standard vs transposed
- Tile distribution tutorials guarded to build only for gfx942 and gfx950
2026-05-27 14:10:29 -04:00

CK Tile Distribution Encoding Tutorial

Overview

Every load_tile and store_tile call in CK needs to know which thread reads or writes which data element. This mapping is defined by a tile_distribution_encoding, a compile-time struct with six template parameters:

tile_distribution_encoding<Rs, Hs, Ps_major, Ps_minor, Ys_major, Ys_minor>

Every level of Hs (hierarchical dimensions) is assigned to exactly one role:

| Role | Meaning |
|---|---|
| P (parallel) | Thread ID selects which slice; different threads get different data |
| Y (yield) | Each thread owns the entire range in its buffer |
| R (replicate) | Identical data broadcast to multiple thread groups |
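To make the three roles concrete, here is a minimal toy model in plain C++. It is not CK code: the sizes (kR, kP, kY), the function owned_elements, and the choice of which level is most significant are all assumptions made for illustration. One dimension of length 8 is split into a Y level of 2 and a P level of 4, and 8 threads carry an extra replicate factor of 2.

```cpp
#include <array>
#include <cassert>

// Hypothetical toy model, not CK code: a single dimension of length 8,
// split into hierarchical levels {Y = 2, P = 4}, shared by 8 threads.
constexpr int kR = 2; // R: thread groups that receive identical data
constexpr int kP = 4; // P: slices selected by the thread ID
constexpr int kY = 2; // Y: elements each thread keeps in its own buffer

// Elements owned by one thread. The R part of the thread ID is deliberately
// ignored, so threads t and t + kP end up with the same elements.
inline std::array<int, kY> owned_elements(int thread_id)
{
    assert(0 <= thread_id && thread_id < kR * kP);
    const int p = thread_id % kP; // P level: picks the slice
    std::array<int, kY> out{};
    for(int y = 0; y < kY; ++y)   // Y level: iterated inside the thread
        out[y] = y * kP + p;      // assumed order: Y is the outer level
    return out;
}
```

In this model, threads 1 and 5 both own elements 1 and 5 (the R role), threads 0 to 3 own different slices (the P role), and each thread holds two elements of its slice (the Y role).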

Tutorials

These tutorials use the exact tile sizes from the naive GEMM tutorial (01_naive_gemm/): MPerBlock=256, NPerBlock=128, KPerBlock=32, BlockSize=256, fp16.

| # | File | Matrix | Tile | Key Concept |
|---|---|---|---|---|
| 1 | tile_distribution_1.cpp | A (DRAM load) | 256×32 | NDimP=2, warp_id→M1, lane_id→M2×K0 (coalesced) |
| 2 | tile_distribution_2.cpp | B (DRAM load) | 128×32 | Same pattern as A, but N0=2 iterations (vs A's M0=4) due to smaller N |
| 3 | tile_distribution_3.cpp | C (registers) | 256×128 | Warp-level MFMA output + block-level composition, standard vs transposed |
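The A and B mappings in rows 1 and 2 can be sketched in plain C++ without any CK headers. This is an illustrative model, not the tutorial source. The level sizes are inferred from the values above: 256 threads at warp_size=64 give M1=4 warps, M0=4 then implies M2=16, and 64 lanes leave K0=4 and K1=8. The names dram_coord and iterations are hypothetical, and the model assumes that the first hierarchy level is the most significant and that K is the contiguous dimension in memory.

```cpp
#include <cassert>

// Hypothetical plain-C++ model of the Tutorial 1/2 thread mapping; not CK code.
constexpr int kWarpSize  = 64;  // CDNA wavefront size
constexpr int kBlockSize = 256; // threads per block, i.e. 4 warps
constexpr int kKPerBlock = 32;

constexpr int kM1 = kBlockSize / kWarpSize; // 4, selected by warp_id
constexpr int kM2 = 16;                     // selected by lane_id / K0
constexpr int kK0 = kWarpSize / kM2;        // 4, selected by lane_id % K0
constexpr int kK1 = kKPerBlock / kK0;       // 8 contiguous fp16 per thread

struct Coord
{
    int row; // M index for A, N index for B
    int k;
};

// (row, k) of element k1 in row-iteration m0 for the given thread.
// Assumption: the first hierarchy level is the most significant one.
inline Coord dram_coord(int thread_id, int m0, int k1)
{
    assert(0 <= thread_id && thread_id < kBlockSize);
    assert(0 <= k1 && k1 < kK1);
    const int warp_id = thread_id / kWarpSize;
    const int lane_id = thread_id % kWarpSize;
    const int m1 = warp_id;       // P: warp picks a band of 16 rows
    const int m2 = lane_id / kK0; // P: four lanes share one row
    const int k0 = lane_id % kK0; // P: adjacent lanes take adjacent K chunks
    return Coord{(m0 * kM1 + m1) * kM2 + m2, k0 * kK1 + k1};
}

// Number of Y iterations along the row dimension for a given tile height.
constexpr int iterations(int rows_per_block) { return rows_per_block / (kM1 * kM2); }
```

Under these assumptions, lanes 0 to 3 of a warp cover all 32 K elements of a single row, which is what makes the load coalesced. iterations(256) gives 4 for A and iterations(128) gives 2 for B, matching the M0 and N0 values in the table.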

Tutorial 3 responds to CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION — rebuild with =0 or =1 to see both C register layouts.
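As a rough picture of what the switch changes at warp level, the sketch below models the 32×32 accumulator tile in plain C++. It is an assumption-laden illustration, not the tutorial source. The index formula reflects my understanding of the MFMA 32x32x8 output layout (64 lanes, 16 accumulator values per lane), the default value of 1 is assumed, and CCoord, c_coord, and covers_tile_once are hypothetical names.

```cpp
#include <cassert>

// Assumed default; the real default in the tutorial sources may differ.
#ifndef CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION
#define CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION 1
#endif

struct CCoord
{
    int m;
    int n;
};

// Hypothetical model: position inside a 32x32 warp tile of accumulator
// value `reg` (0..15) held by `lane` (0..63).
inline CCoord c_coord(int lane, int reg)
{
    assert(0 <= lane && lane < 64 && 0 <= reg && reg < 16);
    const int along_lanes = lane % 32; // 32 lanes span one dimension
    const int along_regs  = 8 * (reg / 4) + 4 * (lane / 32) + reg % 4;
#if CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION
    return CCoord{along_lanes, along_regs}; // transposed: lanes walk M
#else
    return CCoord{along_regs, along_lanes}; // standard: lanes walk N
#endif
}

// True if 64 lanes x 16 values hit every element of the tile exactly once.
inline bool covers_tile_once()
{
    int hits[32][32] = {};
    for(int lane = 0; lane < 64; ++lane)
        for(int reg = 0; reg < 16; ++reg)
        {
            const CCoord c = c_coord(lane, reg);
            ++hits[c.m][c.n];
        }
    for(int m = 0; m < 32; ++m)
        for(int n = 0; n < 32; ++n)
            if(hits[m][n] != 1)
                return false;
    return true;
}
```

In this model the transposed variant gives each lane groups of four consecutive N positions in a single row of C, while the standard variant gives each lane groups of four consecutive M positions in a single column. Both variants cover the tile exactly once; only the ownership pattern differs.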

Architecture note: All comments and concrete values assume CDNA (warp_size=64). On RDNA (warp_size=32), the thread-to-data mapping will differ.

Building

cd <repo-root>/projects/composablekernel/build

# Build all tutorials:
make tutorials -j
# or: ninja tutorials

# Or build individually:
make tile_tutorial_tile_distribution_1 -j
make tile_tutorial_tile_distribution_2 -j
make tile_tutorial_tile_distribution_3 -j

# Tutorial 3 with standard (non-transposed) C:
cmake -DCMAKE_CXX_FLAGS="-DCK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION=0" ..
make tile_tutorial_tile_distribution_3 -j

Reference

  • Encoding definition: include/ck_tile/core/tensor/tile_distribution_encoding.hpp
  • Thread identity (NDimP): include/ck_tile/core/tensor/tile_distribution.hpp
  • MFMA warp output layout: include/ck_tile/ops/gemm/warp/warp_gemm_attribute_mfma.hpp
  • Production A/B distributions: include/ck_tile/ops/gemm/pipeline/gemm_pipeline_agmem_bgmem_creg_v1_default_policy.hpp
  • Naive GEMM tutorial: tutorial/ck_tile/gemm/01_naive_gemm/