mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
[CK_TILE] Restructure naive GEMM tutorial and add tile distribution tutorials (#7714) ## Summary - Flatten naive GEMM tutorial directory structure (remove `block_level/`, `host_level/`, `warp_level/` subdirs) to match the composable_kernel repo layout - Add `CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION` macro switch to toggle between standard and transposed WarpGemm variants - Consolidate 6 verbose markdown files (~2600 lines) into one concise README (~120 lines) - Add 3 tile distribution encoding tutorials with step-by-step "How to read Ps/Ys" annotations: - Tutorial 1: A-matrix DRAM load (256×32) — NDimP=2, coalesced K-splitting - Tutorial 2: B-matrix DRAM load (128×32) — same pattern, fewer iterations - Tutorial 3: C-matrix register layout (32×32) — MFMA m32n32k8 hardware output mapping, standard vs transposed - Tile distribution tutorials guarded to build only for gfx942 and gfx950
2.5 KiB
CK Tile Distribution Encoding Tutorial
Overview
Every load_tile and store_tile in CK needs to know which thread accesses which data element.
This mapping is defined by a tile_distribution_encoding, a compile-time struct with 6 template
parameters:
tile_distribution_encoding<Rs, Hs, Ps_major, Ps_minor, Ys_major, Ys_minor>
Each level in Hs (hierarchical dimensions) is assigned to P or Y, and each level in Rs is replicated (R):
| Role | Meaning |
|---|---|
| P (parallel) | Thread ID selects which slice — different threads get different data |
| Y (yield) | Each thread owns the entire range in its buffer |
| R (replicate) | Identical data broadcast to multiple thread groups |
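As a rough illustration of the three roles, the sketch below uses plain C++ rather than the CK API. The sizes and the names `owned_index`, `kYLen`, `kPLen`, and `kRLen` are hypothetical and exist only for this example: a 1-D dimension of 8 elements is split into a Y level of length 2 (outer) and a P level of length 4 (inner), and the result is replicated across 2 thread groups.

```cpp
#include <cassert>

// Hypothetical toy sizes, chosen only for illustration (not from CK).
constexpr int kYLen = 2; // Y level: each thread walks this range itself
constexpr int kPLen = 4; // P level: the thread ID picks one slice
constexpr int kRLen = 2; // R level: this many thread groups see identical data

// Element index held by thread `tid` in its y-th buffer slot.
// Threads are numbered 0 .. kRLen*kPLen-1; tid / kPLen is the replica group.
inline int owned_index(int tid, int y)
{
    assert(tid >= 0 && tid < kRLen * kPLen);
    assert(y >= 0 && y < kYLen);
    const int p = tid % kPLen; // P: differs between threads of one group
    // The replica group (tid / kPLen) is deliberately ignored:
    // R does not change which data a thread sees.
    return y * kPLen + p;      // Y is the outer level, P the inner level
}
```

In this toy setup, threads 1 and 5 belong to different replica groups and therefore hold the same elements, while threads 1 and 2 hold different ones.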
Tutorials
These tutorials use the exact tile sizes from the naive GEMM tutorial
(01_naive_gemm/): MPerBlock=256, NPerBlock=128, KPerBlock=32, BlockSize=256, fp16.
| # | File | Matrix | Tile | Key Concept |
|---|---|---|---|---|
| 1 | tile_distribution_1.cpp | A (DRAM load) | 256×32 | NDimP=2, warp_id→M1, lane_id→M2×K0 (coalesced) |
| 2 | tile_distribution_2.cpp | B (DRAM load) | 128×32 | Same pattern as A, but N0=2 iterations (vs A's M0=4) due to smaller N |
| 3 | tile_distribution_3.cpp | C (registers) | 256×128 | Warp-level MFMA output + block-level composition, standard vs transposed |
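The A and B rows of the table can be reproduced with a small plain-C++ sketch (not the CK API). It assumes each thread loads K1=8 contiguous fp16 values, i.e. one 16-byte vector. That value is inferred from M0=4 and N0=2 in the table rather than stated by the tutorial, and the order in which lane_id splits into M2 and K0 is also an assumption. The names `dram_coord`, `iterations`, and `Coord` are hypothetical.

```cpp
#include <cassert>

struct Coord { int row; int col; };

constexpr int kWarpSize  = 64;  // CDNA
constexpr int kBlockSize = 256; // from the tutorial
constexpr int kKPerBlock = 32;  // from the tutorial
constexpr int kK1 = 8;                      // ASSUMPTION: 8 fp16 per thread per load
constexpr int kK0 = kKPerBlock / kK1;       // 4 lanes side by side along K
constexpr int kM2 = kWarpSize / kK0;        // 16 rows covered by one warp
constexpr int kM1 = kBlockSize / kWarpSize; // 4 warps

// Number of Y iterations along the row dimension (M0 for A, N0 for B).
constexpr int iterations(int rows) { return rows / (kM1 * kM2); }

// Tile coordinate read by thread `tid` at Y indices (y0, y1), where
// y0 walks M0 (or N0) and y1 walks the K1 elements of one vector.
inline Coord dram_coord(int rows, int tid, int y0, int y1)
{
    assert(tid >= 0 && tid < kBlockSize);
    assert(y0 >= 0 && y0 < iterations(rows));
    assert(y1 >= 0 && y1 < kK1);
    const int warp_id = tid / kWarpSize; // P0 -> M1
    const int lane_id = tid % kWarpSize; // P1 -> M2 x K0
    const int m2 = lane_id / kK0;        // ASSUMPTION: M2 is the major part
    const int k0 = lane_id % kK0;        // adjacent lanes read adjacent vectors
    const int row = (y0 * kM1 + warp_id) * kM2 + m2;
    const int col = k0 * kK1 + y1;
    return {row, col};
}
```

Under these assumptions `iterations(256)` is 4 and `iterations(128)` is 2, matching M0 and N0 in the table, and lanes 0 to 3 of a warp cover all 32 K elements of a single row, which is what makes the load coalesced.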
Tutorial 3 responds to CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION — rebuild with =0 or =1
to see both C register layouts.
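The difference between the two C layouts can be sketched in plain C++ (not the CK API). The row formula below is the commonly documented output layout of the 32×32×8 fp16 MFMA instruction (64 lanes, 16 accumulator registers per lane) and is not copied from the tutorial source, so it is worth checking against the MFMA warp output layout header listed under Reference. The default value of the macro and the names `c_coord_standard`, `c_coord_transposed`, `c_coord`, and `CCoord` are assumptions made for this sketch.

```cpp
#include <cassert>

#ifndef CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION
// ASSUMPTION: transposed is taken as the default in this sketch.
#define CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION 1
#endif

struct CCoord { int m; int n; };

// Standard layout: the lane picks the column, the 16 registers walk rows
// in groups of 4, with lanes 32..63 offset by 4 rows.
inline CCoord c_coord_standard(int lane, int reg)
{
    assert(lane >= 0 && lane < 64);
    assert(reg >= 0 && reg < 16);
    const int m = (reg / 4) * 8 + (lane / 32) * 4 + (reg % 4);
    const int n = lane % 32;
    return {m, n};
}

// Transposed layout: the roles of M and N are swapped, so the lane picks
// the row and each group of 4 registers is contiguous along N.
inline CCoord c_coord_transposed(int lane, int reg)
{
    const CCoord s = c_coord_standard(lane, reg);
    return {s.n, s.m};
}

// Compile-time selection, mirroring how a macro switch would be used.
inline CCoord c_coord(int lane, int reg)
{
#if CK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION
    return c_coord_transposed(lane, reg);
#else
    return c_coord_standard(lane, reg);
#endif
}
```

In the transposed form, the 4 registers of one group hold 4 adjacent N elements of a single row.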
Architecture note: All comments and concrete values assume CDNA (warp_size=64). On RDNA (warp_size=32), the thread-to-data mapping will differ.
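A minimal sketch of why the mapping depends on warp size, in plain C++ (the names `split_thread` and `ThreadId` are hypothetical): with NDimP=2 the flat thread index is split into a warp ID and a lane ID, and that split changes with the warp size.

```cpp
#include <cassert>

struct ThreadId { int warp_id; int lane_id; };

// Split a flat thread index into (warp_id, lane_id) for a given warp size.
inline ThreadId split_thread(int tid, int warp_size)
{
    assert(warp_size == 32 || warp_size == 64);
    assert(tid >= 0);
    return {tid / warp_size, tid % warp_size};
}
```

For BlockSize=256, thread 100 is lane 36 of warp 1 when warp_size=64 but lane 4 of warp 3 when warp_size=32, so any P dimension driven by warp_id or lane_id selects different data.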
Building
cd <repo-root>/projects/composablekernel/build
# Build all tutorials:
make tutorials -j
# or: ninja tutorials
# Or build individually:
make tile_tutorial_tile_distribution_1 -j
make tile_tutorial_tile_distribution_2 -j
make tile_tutorial_tile_distribution_3 -j
# Tutorial 3 with standard (non-transposed) C:
cmake -DCMAKE_CXX_FLAGS="-DCK_TILE_ENABLE_TRANSPOSED_C_DISTRIBUTION=0" ..
make tile_tutorial_tile_distribution_3 -j
Reference
- Encoding definition: include/ck_tile/core/tensor/tile_distribution_encoding.hpp
- Thread identity (NDimP): include/ck_tile/core/tensor/tile_distribution.hpp
- MFMA warp output layout: include/ck_tile/ops/gemm/warp/warp_gemm_attribute_mfma.hpp
- Production A/B distributions: include/ck_tile/ops/gemm/pipeline/gemm_pipeline_agmem_bgmem_creg_v1_default_policy.hpp
- Naive GEMM tutorial: tutorial/ck_tile/gemm/01_naive_gemm/