mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-10 16:28:38 +00:00


Aviral Goel 6a17f951ea [rocm-libraries] ROCm/rocm-libraries#7714 (commit 13ae6d6)

[CK_TILE] Restructure naive GEMM tutorial and add tile distribution tutorials (#7714)

## Summary
- Flatten naive GEMM tutorial directory structure (remove `block_level/`, `host_level/`, `warp_level/` subdirs) to match the composable_kernel repo layout
- Add `CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION` macro switch to toggle between standard and transposed WarpGemm variants
- Consolidate 6 verbose markdown files (~2600 lines) into one concise README (~120 lines)
- Add 3 tile distribution encoding tutorials with step-by-step "How to read Ps/Ys" annotations:
  - Tutorial 1: A-matrix DRAM load (256×32), NDimP=2, coalesced K-splitting
  - Tutorial 2: B-matrix DRAM load (128×32), same pattern, fewer iterations
  - Tutorial 3: C-matrix register layout (32×32), MFMA m32n32k8 hardware output mapping, standard vs transposed
- Tile distribution tutorials guarded to build only for gfx942 and gfx950

2026-05-27 14:10:29 -04:00

2.5 KiB


CK Tile Distribution Encoding Tutorial

Overview

Every load_tile and store_tile in CK needs to know which thread accesses which data element. This mapping is defined by a tile_distribution_encoding, a compile-time struct with six template parameters:

tile_distribution_encoding<Rs, Hs, Ps_major, Ps_minor, Ys_major, Ys_minor>

Every level of Hs (hierarchical dimensions) is assigned to exactly one role:

Role	Meaning
P (parallel)	Thread ID selects which slice — different threads get different data
Y (yield)	Each thread holds every element along this level in its own buffer and iterates over the entire range
R (replicate)	Identical data broadcast to multiple thread groups
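
To make the six parameters concrete, the sketch below writes out one plausible encoding for the 256×32 A tile. It uses stand-in `sequence`, `tuple`, and `tile_distribution_encoding_sketch` types in a hypothetical `sketch` namespace so that it compiles without CK; they are not the library's definitions. The split values (M0=4, M1=4, M2=16, K0=4, K1=8) and the index convention (major 0 refers to Rs, major 1 and 2 refer to the first and second entries of Hs) are assumptions that should be checked against `tile_distribution_1.cpp` and the encoding header.

```cpp
namespace sketch {

// Stand-in types, NOT the ck_tile definitions: just enough to spell the encoding.
template <int... Is> struct sequence {};
template <class... Ts> struct tuple {};

template <class Rs, class Hs, class PsMajor, class PsMinor, class YsMajor, class YsMinor>
struct tile_distribution_encoding_sketch {};

// Assumed hierarchical split of a 256x32 tile: M = M0*M1*M2, K = K0*K1.
constexpr int M0 = 4, M1 = 4, M2 = 16;
constexpr int K0 = 4, K1 = 8;

using a_tile_encoding = tile_distribution_encoding_sketch<
    sequence<>,                                     // Rs: nothing is replicated
    tuple<sequence<M0, M1, M2>, sequence<K0, K1>>,  // Hs: levels of M, then levels of K
    tuple<sequence<1>, sequence<1, 2>>,             // Ps_major: P0 -> M dim; P1 -> M dim, K dim
    tuple<sequence<1>, sequence<2, 0>>,             // Ps_minor: P0 -> M1;    P1 -> M2, K0
    sequence<1, 2>,                                 // Ys_major: Y0 -> M dim, Y1 -> K dim
    sequence<0, 1>>;                                // Ys_minor: Y0 -> M0,    Y1 -> K1

// Levels claimed by P are spread across threads; levels claimed by Y stay per thread.
constexpr int threads_per_block   = M1 * (M2 * K0);  // P0 (warp) times P1 (lane)
constexpr int elements_per_thread = M0 * K1;          // Y0 times Y1

static_assert(M0 * M1 * M2 == 256 && K0 * K1 == 32, "split must cover the 256x32 tile");

} // namespace sketch
```

Under these assumptions every level of Hs appears exactly once across the P and Y index lists, and 256 threads × 32 elements per thread account for all 256×32 elements of the tile.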

Tutorials

These tutorials use the exact tile sizes from the naive GEMM tutorial (01_naive_gemm/): MPerBlock=256, NPerBlock=128, KPerBlock=32, BlockSize=256, fp16.

#	File	Matrix	Tile	Key Concept
1	`tile_distribution_1.cpp`	A (DRAM load)	256×32	NDimP=2, warp_id→M1, lane_id→M2×K0 (coalesced)
2	`tile_distribution_2.cpp`	B (DRAM load)	128×32	Same pattern as A, but N0=2 iterations (vs A's M0=4) due to smaller N
3	`tile_distribution_3.cpp`	C (registers)	256×128	Warp-level MFMA output + block-level composition, standard vs transposed
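
The iteration counts quoted for tutorials 1 and 2 can be reproduced with a few lines of arithmetic. The sketch below is not CK code: `split_tile` and `first_element` are hypothetical helpers, the vector width K1=8 (a 16-byte load of fp16) is an assumption, and so is the lane ordering in which K0 varies fastest within a warp.

```cpp
// Hypothetical helpers illustrating the warp_id -> M1, lane_id -> M2 x K0 pattern.
struct Split { int outer; int warps; int lane_rows; int k0; int k1; };
struct Coord { int row; int col; };

// Split a (rows x k_per_block) tile across block_size threads.
constexpr Split split_tile(int rows, int k_per_block, int block_size, int warp_size, int vec)
{
    int k1        = vec;                          // elements per vector load (assumed)
    int k0        = k_per_block / k1;             // lanes needed to cover one row of K
    int lane_rows = warp_size / k0;               // M2 (or N2): rows covered by one warp
    int warps     = block_size / warp_size;       // M1 (or N1): warps in the block
    int outer     = rows / (warps * lane_rows);   // M0 (or N0): iterations per thread
    return {outer, warps, lane_rows, k0, k1};
}

// Tile coordinate of the first element a thread touches in a given outer iteration.
constexpr Coord first_element(const Split& s, int thread_id, int iter, int warp_size)
{
    int warp_id = thread_id / warp_size;
    int lane_id = thread_id % warp_size;
    int m2      = lane_id / s.k0;                 // assumed: K0 is the fastest lane index
    int k0      = lane_id % s.k0;
    return {(iter * s.warps + warp_id) * s.lane_rows + m2, k0 * s.k1};
}

constexpr Split a_split = split_tile(256, 32, 256, 64, 8);  // A tile: MPerBlock x KPerBlock
constexpr Split b_split = split_tile(128, 32, 256, 64, 8);  // B tile: NPerBlock x KPerBlock
```

With these inputs `a_split.outer` is 4 and `b_split.outer` is 2, matching M0=4 and N0=2 in the table. Under the assumed lane ordering, lanes 0 to 3 start at columns 0, 8, 16, and 24 of the same row, so four neighbouring lanes together read one contiguous 32-element row.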

Tutorial 3 responds to CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION — rebuild with =0 or =1 to see both C register layouts.

Architecture note: All comments and concrete values assume CDNA (warp_size=64). On RDNA (warp_size=32), the thread-to-data mapping will differ.
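
For the warp-level part of Tutorial 3, the difference between the standard and transposed C layouts can be sketched as a lane/register to (m, n) mapping. The formula below assumes the usual m32n32k8 output layout on a 64-lane wavefront (16 accumulator values per lane, grouped 4×2×4 along one dimension and 32 along the other); `c_element` and `covers_tile_once` are hypothetical names, and the mapping should be verified against `warp_gemm_attribute_mfma.hpp` rather than taken as the library's definition.

```cpp
struct CCoord { int m; int n; };

// Which C element does accumulator register `reg` (0..15) of `lane` (0..63) hold?
// Assumed layout: one dimension is split as 4 (reg / 4) x 2 (lane / 32) x 4 (reg % 4),
// the other dimension is indexed by lane % 32.
constexpr CCoord c_element(int lane, int reg, bool transposed)
{
    int split_dim = 8 * (reg / 4) + 4 * (lane / 32) + (reg % 4);
    int lane_dim  = lane % 32;
    // Standard: registers walk along M, lanes along N. Transposed: the roles swap.
    return transposed ? CCoord{lane_dim, split_dim} : CCoord{split_dim, lane_dim};
}

// Sanity check: every element of the 32x32 tile is owned by exactly one (lane, reg) pair.
constexpr bool covers_tile_once(bool transposed)
{
    int hits[32][32] = {};
    for (int lane = 0; lane < 64; ++lane)
        for (int reg = 0; reg < 16; ++reg) {
            CCoord c = c_element(lane, reg, transposed);
            ++hits[c.m][c.n];
        }
    for (int m = 0; m < 32; ++m)
        for (int n = 0; n < 32; ++n)
            if (hits[m][n] != 1) return false;
    return true;
}
```

In the transposed variant each lane owns a single row m and its registers run along n, which is the property that makes it worth comparing both builds of tutorial 3.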

Building

cd <repo-root>/projects/composablekernel/build

# Build all tutorials:
make tutorials -j
# or: ninja tutorials

# Or build individually:
make tile_tutorial_tile_distribution_1 -j
make tile_tutorial_tile_distribution_2 -j
make tile_tutorial_tile_distribution_3 -j

# Tutorial 3 with standard (non-transposed) C:
cmake -DCMAKE_CXX_FLAGS="-DCK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION=0" ..
make tile_tutorial_tile_distribution_3 -j

Reference

Encoding definition: include/ck_tile/core/tensor/tile_distribution_encoding.hpp
Thread identity (NDimP): include/ck_tile/core/tensor/tile_distribution.hpp
MFMA warp output layout: include/ck_tile/ops/gemm/warp/warp_gemm_attribute_mfma.hpp
Production A/B distributions: include/ck_tile/ops/gemm/pipeline/gemm_pipeline_agmem_bgmem_creg_v1_default_policy.hpp
Naive GEMM tutorial: tutorial/ck_tile/gemm/01_naive_gemm/
