Add CK Tile Tutorials Folder with GEMM and COPY Kernel (#3038)

* feat: add tutorial folder with gemm tutorial

* chore: move copy kernel from examples folder to tutorial

* Update tutorial/ck_tile/01_naive_gemm/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tutorial/ck_tile/01_naive_gemm/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* chore: remove handdrawn images

* docs: add write ups to explain the gemm kernel

* docs: add about block level pipeline and static distributed tensors

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Aviral Goel
2025-11-11 15:15:49 -05:00
committed by GitHub
parent c54ecd905b
commit b145a5fe80
24 changed files with 3287 additions and 15 deletions


@@ -0,0 +1,315 @@
# CK Tile Framework: Getting Started with Tile Copy Operations
## Overview
### Copy Kernel
A minimal CK Tile memory copy implementation demonstrating the basic setup required to write a kernel in CK Tile.
This experimental kernel is intended for novice CK developers. It introduces the building blocks of CK Tile and provides a sandbox for experimenting with kernel parameters.
## Build
```
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture
# (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
# Make the copy kernel executable
make tile_example_copy -j
```
This produces the executable `build/bin/test_copy_basic`.
## Example
```
args:
-m input matrix rows. (default 64)
-n input matrix cols. (default 8)
-id wave to use for computation. (default 0)
-v validation flag to check device results. (default 1)
-prec datatype precision to use. (default fp16)
-warmup no. of warmup iterations. (default 50)
-repeat no. of iterations for kernel execution time. (default 100)
```
## CK Tile Architecture Components
The CK Tile framework is built around four key architectural components that work together to define and execute GPU kernels: shape, problem, policy, and pipeline.
### **1. Shape**
Defines the **hierarchical tile structure** and **memory layout** of the kernel:
```cpp
using Shape = ck_tile::TileCopyShape<BlockWaves, BlockTile, WaveTile, ThreadTile>;
```
**Components:**
- **BlockWaves**: Number of concurrent waves per block (e.g., `seq<4, 1>` for 4 waves along M, 1 along N)
- **BlockTile**: Total elements processed by one block (e.g., `seq<512, 8>`)
- **WaveTile**: Elements processed by one wave (e.g., `seq<32, 8>`)
- **ThreadTile**: Elements processed by one thread (e.g., `seq<1, 4>` for 4 contiguous elements)
**Purpose**: Defines the **work distribution hierarchy** from threads → waves → blocks.
### **2. Problem**
Defines the **data types** and **kernel configuration**:
```cpp
using Problem = ck_tile::TileCopyProblem<XDataType, Shape>;
```
**Components:**
- **XDataType**: Input/output data type (e.g., `float`, `half`)
- **Shape**: The tile shape defined above
**Purpose**: Encapsulates **what** the kernel operates on and **how** it's configured.
### **3. Policy**
Defines the **memory access patterns** and **distribution strategies**:
```cpp
using Policy = ck_tile::TileCopyPolicy<Problem>;
```
**Key Functions:**
- **MakeDRAMDistribution()**: Defines how threads access DRAM memory.
**Purpose**: Defines **how** data is accessed and distributed across threads.
### **4. Pipeline**
Defines the **execution flow** and **memory movement patterns**:
```cpp
// Example pipeline stages:
// 1. DRAM → Registers (load_tile)
// 2. Registers → LDS (store_tile)
// 3. LDS → Registers (load_tile with distribution)
// 4. Registers → DRAM (store_tile)
```
**Purpose**: Defines the **sequence of operations** and **memory movement strategy**.
### **Component Interaction**
```cpp
// Complete kernel definition
using Shape = ck_tile::TileCopyShape<BlockWaves, BlockTile, WaveTile, ThreadTile>;
using Problem = ck_tile::TileCopyProblem<XDataType, Shape>;
using Policy = ck_tile::TileCopyPolicy<Problem>;
using Kernel = ck_tile::TileCopyKernel<Problem, Policy>;
```
**Flow:**
1. **Shape** defines the tile structure and work distribution
2. **Problem** combines data types with the shape
3. **Policy** defines memory access patterns for the problem
4. **Kernel** implements the actual computation using all components
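The composition above can be imitated in plain C++ to see how each layer only reads what the previous one exposes. This is a minimal sketch with toy stand-in types (`ToyShape`, `ToyProblem`, `ToyPolicy`, `ToyKernel` are hypothetical names, not the ck_tile classes), and the shape is reduced to three integers for brevity:

```cpp
#include <cstdint>

// Toy stand-ins for illustration only; these are NOT the ck_tile classes.
template <int BlockTileM_, int BlockTileN_, int ThreadTileN_>
struct ToyShape
{
    static constexpr int BlockTileM  = BlockTileM_;
    static constexpr int BlockTileN  = BlockTileN_;
    static constexpr int ThreadTileN = ThreadTileN_;
};

// Problem = data type + shape.
template <typename XDataType_, typename Shape_>
struct ToyProblem
{
    using XDataType = XDataType_;
    using Shape     = Shape_;
};

// Policy derives access parameters from the problem.
template <typename Problem>
struct ToyPolicy
{
    // Bytes moved by one vectorized access of one thread.
    static constexpr int VectorBytes()
    {
        return Problem::Shape::ThreadTileN *
               static_cast<int>(sizeof(typename Problem::XDataType));
    }
};

// Kernel consumes both the problem and the policy.
template <typename Problem, typename Policy>
struct ToyKernel
{
    static constexpr int BytesPerBlockTile()
    {
        return Problem::Shape::BlockTileM * Problem::Shape::BlockTileN *
               static_cast<int>(sizeof(typename Problem::XDataType));
    }
    static constexpr int AccessesPerBlockTile()
    {
        return BytesPerBlockTile() / Policy::VectorBytes();
    }
};

using ToyShapeA   = ToyShape<512, 8, 4>;
using ProblemF32  = ToyProblem<float, ToyShapeA>;
using ProblemU16  = ToyProblem<std::uint16_t, ToyShapeA>; // same shape, new type
using KernelF32   = ToyKernel<ProblemF32, ToyPolicy<ProblemF32>>;

static_assert(ToyPolicy<ProblemF32>::VectorBytes() == 16, "4 floats per access");
static_assert(ToyPolicy<ProblemU16>::VectorBytes() == 8, "4 16-bit values per access");
static_assert(KernelF32::AccessesPerBlockTile() == 1024, "512*8 elements / 4 per access");
```

Reusing `ToyShapeA` for two problems mirrors the idea that a shape is independent of the data type it is later paired with.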
### **Why This Architecture?**
#### **Separation of Concerns**
- **Shape**: Focuses on **work distribution** and **tile structure**
- **Problem**: Focuses on **data types** and **configuration**
- **Policy**: Focuses on **memory access** and **optimization**
- **Pipeline**: Focuses on **execution flow** and **synchronization**
#### **Reusability**
- Same **Shape** can be used with different **Problems**
- Same **Policy** can be applied to different **Problems**
- **Pipelines** can be reused across different kernels
#### **Performance Optimization**
- **Shape** enables optimal work distribution
- **Policy** enables optimal memory access patterns
- **Pipeline** enables optimal execution flow
## Core Concepts
### Hierarchical Tile Structure
The CK Tile framework organizes work in a hierarchical manner:
1. **ThreadTile**: Number of contiguous elements processed by a single thread
   - Enables vectorized memory loads/stores.
   - Example: `ThreadTile = seq<1, 4>` means each thread loads 4 contiguous elements along the N dimension
   - A ThreadTile can be thought of as a thread-level tile
2. **WaveTile**: Number of elements covered by a single wave (64 threads on CDNA, 32 threads on RDNA)
   - Must satisfy: `(Wave_Tile_M / ThreadTile_M) * (Wave_Tile_N / ThreadTile_N) == WaveSize`
   - This ensures the number of threads needed equals the wave size
   - Example: `WaveTile = seq<64, 4>` with `ThreadTile = seq<1, 4>` means:
     - Each thread handles 4 elements (ThreadTile_N = 4)
     - The wave covers 64×4 = 256 elements, which takes (64/1) × (4/4) = 64 threads
     - 64 threads matches WaveSize = 64
3. **BlockTile**: Number of elements covered by one block (typically mapped to one CU)
   - Example: `BlockTile = seq<256, 64>` means each block processes 256×64 elements
4. **BlockWaves**: Number of concurrent waves active in a block
   - Typical: 4 waves for heavy workloads (e.g., GEMM)
   - Limit: up to 1024 threads per block → up to 16 waves (CDNA) or 32 waves (RDNA)
   - Example: `BlockWaves = seq<4, 1>` means 4 waves along M, 1 along N
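The WaveTile constraint and the wave-count limit are plain integer arithmetic, so they can be checked at compile time. The sketch below uses hypothetical helper names (`threads_needed`, `max_waves_per_block`) that are not part of CK Tile; the numbers are the ones quoted in the list above:

```cpp
// Hypothetical helpers for illustration; not part of the CK Tile API.

// Threads required so that each thread owns one ThreadTile inside the WaveTile.
constexpr int threads_needed(int wave_tile_m, int wave_tile_n,
                             int thread_tile_m, int thread_tile_n)
{
    return (wave_tile_m / thread_tile_m) * (wave_tile_n / thread_tile_n);
}

// Upper bound on waves per block for a given thread limit and wave size.
constexpr int max_waves_per_block(int max_threads_per_block, int wave_size)
{
    return max_threads_per_block / wave_size;
}

// WaveTile = seq<64, 4>, ThreadTile = seq<1, 4> fills a 64-thread wave.
static_assert(threads_needed(64, 4, 1, 4) == 64, "must equal WaveSize");
// WaveTile = seq<32, 8>, ThreadTile = seq<1, 4> also fills a 64-thread wave.
static_assert(threads_needed(32, 8, 1, 4) == 64, "must equal WaveSize");
// 1024 threads per block: 16 waves of 64 threads, or 32 waves of 32 threads.
static_assert(max_waves_per_block(1024, 64) == 16, "wave64");
static_assert(max_waves_per_block(1024, 32) == 32, "wave32");
```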
### Wave Repetition
In many scenarios, the total work (BlockTile) is larger than what the available waves can cover in a single iteration. This requires **wave repetition**:
```cpp
// Calculate how many times a wave needs to repeat to cover the entire block tile
static constexpr index_t WaveRepetitionPerBlock_M =
    Block_Tile_M / (Waves_Per_Block_M * Wave_Tile_M);
static constexpr index_t WaveRepetitionPerBlock_N =
    Block_Tile_N / (Waves_Per_Block_N * Wave_Tile_N);
```
**Key Insight**: When waves repeat, the effective work per thread becomes `ThreadTile * Repeat`, not just `ThreadTile`.
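As a worked instance of the formulas above, take the example values from the Shape section (`BlockWaves = seq<4, 1>`, `BlockTile = seq<512, 8>`, `WaveTile = seq<32, 8>`, `ThreadTile = seq<1, 4>`). The helper names below are hypothetical and exist only for this sketch:

```cpp
// Hypothetical helpers for illustration; not part of the CK Tile API.

// How many times the waves must repeat along one dimension.
constexpr int wave_repetition(int block_tile, int waves_per_block, int wave_tile)
{
    return block_tile / (waves_per_block * wave_tile);
}

// Effective elements per thread: ThreadTile * Repeat in both dimensions.
constexpr int elements_per_thread(int thread_tile_m, int thread_tile_n,
                                  int repeat_m, int repeat_n)
{
    return (thread_tile_m * repeat_m) * (thread_tile_n * repeat_n);
}

constexpr int kRepeatM = wave_repetition(512, 4, 32); // along M
constexpr int kRepeatN = wave_repetition(8, 1, 8);    // along N

static_assert(kRepeatM == 4, "4 waves of 32 rows cover 128 rows, so 4 repeats for 512");
static_assert(kRepeatN == 1, "one wave tile already spans all 8 columns");
static_assert(elements_per_thread(1, 4, kRepeatM, kRepeatN) == 16, "4 elements x 4 repeats");
// Cross-check: 16 elements per thread x 256 threads covers the 512x8 block tile.
static_assert(16 * 4 * 64 == 512 * 8, "block tile fully covered");
```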
## Tile Distribution Encoding
The tile distribution encoding specifies how work is distributed across threads:
```cpp
constexpr auto outer_encoding =
    tile_distribution_encoding<sequence<1>,                                   // replication (R)
                               tuple<sequence<M0, M1, M2>, sequence<N0, N1>>, // hierarchy (H)
                               tuple<sequence<1>, sequence<1, 2>>,            // parallelism (P): major indices
                               tuple<sequence<1>, sequence<2, 0>>,            // parallelism (P): minor indices
                               sequence<1, 2>,                                // yield (Y): major indices
                               sequence<0, 1>>{};                             // yield (Y): minor indices
```
### Encoding Parameters Explained
- **M0, M1, M2**: Hierarchical distribution along M dimension
  - M0: Number of wave iterations (repeats) along M
  - M1: Number of waves along M
  - M2: Number of threads per wave along M
- **N0, N1**: Distribution along N dimension
  - N0: Number of threads along N
  - N1: ThreadTile size (elements per thread)
- **Order and layout**: The inner-most (rightmost) dimension is the fastest-changing. Choosing `N1 = ThreadTile_N` maps vector width to contiguous addresses, i.e., row-major access in this example.
- **Yield arguments**: The yield sequences select both the repeat dimension (M0) and the ThreadTile dimension (N1), because the effective work per thread is `ThreadTile * Repeat`.
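One way to read the encoding is as a mapping from (wave id, lane id, repeat index, vector index) to a tile coordinate. The sketch below is plain C++, not the ck_tile implementation. It assumes the example sizes M0=4, M1=4, M2=32, N0=2, N1=4 (a 512×8 block tile) and assumes the lane id splits into (M2, N0) with N0 varying fastest; `element_coord` and `covers_tile_exactly_once` are hypothetical names:

```cpp
// Illustrative model of the distribution; sizes and lane split are assumptions.
constexpr int M0 = 4;  // wave repeats along M (yield)
constexpr int M1 = 4;  // waves along M (parallel: wave id)
constexpr int M2 = 32; // threads per wave along M (parallel: lane id)
constexpr int N0 = 2;  // threads along N (parallel: lane id)
constexpr int N1 = 4;  // elements per thread along N (yield)

struct Coord
{
    int m;
    int n;
};

// Map one (wave, lane, repeat, vector element) to its tile coordinate.
constexpr Coord element_coord(int wave_id, int lane_id, int repeat, int vec)
{
    return Coord{(repeat * M1 + wave_id) * M2 + lane_id / N0,
                 (lane_id % N0) * N1 + vec};
}

// Check that every element of the 512x8 tile is touched exactly once.
constexpr bool covers_tile_exactly_once()
{
    int hits[M0 * M1 * M2 * N0 * N1] = {};
    for(int wave = 0; wave < M1; ++wave)
        for(int lane = 0; lane < M2 * N0; ++lane)
            for(int rep = 0; rep < M0; ++rep)
                for(int vec = 0; vec < N1; ++vec)
                {
                    const Coord c = element_coord(wave, lane, rep, vec);
                    ++hits[c.m * (N0 * N1) + c.n];
                }
    for(int i = 0; i < M0 * M1 * M2 * N0 * N1; ++i)
        if(hits[i] != 1)
            return false;
    return true;
}

static_assert(covers_tile_exactly_once(), "no element missed or duplicated");
```

Under this model, adjacent lanes within a row read adjacent 4-element vectors, which is the contiguous access pattern described above.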
## Tensor Abstractions
### Tensor Descriptor
Defines the logical structure of a tensor:
```cpp
auto desc = make_naive_tensor_descriptor(
    make_tuple(M, N),       // tensor dimensions
    make_tuple(N, 1),       // strides
    number<ThreadTile_N>{}, // per-thread vector length
    number<1>{}             // guaranteed last dimension vector stride
);
```
### Tensor View
Combines memory buffer with tensor descriptor:
```cpp
auto x_m_n = make_naive_tensor_view<address_space_enum::global>(
    p_x,                       // memory buffer
    make_tuple(M, N),          // dimensions
    make_tuple(N, 1),          // strides
    number<S::ThreadTile_N>{}, // per-thread vector length
    number<1>{}                // guaranteed last dimension vector stride
);
```
### Tile Window
A view into a specific tile of the tensor with thread distribution:
```cpp
auto x_window = make_tile_window(
    x_m_n,                                  // tensor view
    make_tuple(Block_Tile_M, Block_Tile_N), // tile size
    {iM, 0},                                // tile origin
    tile_distribution                       // how work is distributed among threads
);
```
## The test_copy_basic Kernel
### Kernel Structure
The `TileCopyKernel` implements a basic copy operation from input tensor `x` to output tensor `y`:
```cpp
template <typename Problem_, typename Policy_>
struct TileCopyKernel
{
    CK_TILE_DEVICE void operator()(const XDataType* p_x, XDataType* p_y, index_t M, index_t N) const
    {
        // 1. Create tensor views
        // 2. Create tile windows
        // 3. Iterate over N dimension tiles
        // 4. Load, copy, and store data
    }
};
```
### Step-by-Step Execution
1. **Tensor View Creation**:
```cpp
const auto x_m_n = make_naive_tensor_view<address_space_enum::global>(
p_x, make_tuple(M, N), make_tuple(N, 1), number<S::ThreadTile_N>{}, number<1>{});
```
- Creates views for both input and output tensors
- Specifies vectorized access with `ThreadTile_N` elements per load
2. **Tile Window Creation**:
```cpp
auto x_window = make_tile_window(x_m_n,
make_tuple(number<S::Block_Tile_M>{}, number<S::Block_Tile_N>{}),
{iM, 0},
Policy::template MakeDRAMDistribution<Problem>());
```
- Creates windows into specific tiles of the tensors
- Each block processes one tile starting at `{iM, 0}`
- Tile distribution determines how threads access data
3. **N-Dimension Iteration**:
```cpp
index_t num_n_tile_iteration = __builtin_amdgcn_readfirstlane(integer_divide_ceil(N, S::Block_Tile_N));
for(int iN = __builtin_amdgcn_readfirstlane(0); iN < num_n_tile_iteration; ++iN)
```
- When the tensor's N dimension exceeds `Block_Tile_N`, multiple iterations are needed
- Each iteration processes one tile along N dimension
4. **Load-Store Operations**:
```cpp
dram_reg_tile dram_tile;
load_tile(dram_tile, x_window); // Load from global memory to registers
store_tile(y_window, dram_tile); // Store from registers to global memory
move_tile_window(x_window, {0, S::Block_Tile_N}); // Move to next N tile
move_tile_window(y_window, {0, S::Block_Tile_N});
```
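The four steps can be emulated on the host to see the control flow without any GPU code. The sketch below is a CPU model, not the kernel itself. The matrix and tile sizes are hypothetical, each "block" is assumed to start at row `iM = block * kBlockTileM`, and `integer_divide_ceil` is redefined locally so the example is self-contained:

```cpp
// Host-side model of the tiled copy loop; all sizes are illustrative.
constexpr int kRows = 8, kCols = 12;
constexpr int kBlockTileM = 4, kBlockTileN = 4;

struct Matrix
{
    int data[kRows * kCols];
};

// Local stand-in with the same meaning as the helper used in the kernel.
constexpr int integer_divide_ceil(int a, int b) { return (a + b - 1) / b; }

constexpr Matrix make_input()
{
    Matrix x{};
    for(int i = 0; i < kRows * kCols; ++i)
        x.data[i] = i + 1;
    return x;
}

constexpr Matrix tiled_copy(const Matrix& x)
{
    Matrix y{};
    const int num_blocks  = integer_divide_ceil(kRows, kBlockTileM);
    const int num_n_tiles = integer_divide_ceil(kCols, kBlockTileN);
    for(int block = 0; block < num_blocks; ++block)
    {
        const int iM = block * kBlockTileM; // assumed window origin {iM, 0}
        int window_n = 0;
        for(int iN = 0; iN < num_n_tiles; ++iN)
        {
            // "load_tile" + "store_tile" for one block tile.
            for(int m = iM; m < iM + kBlockTileM && m < kRows; ++m)
                for(int n = window_n; n < window_n + kBlockTileN && n < kCols; ++n)
                    y.data[m * kCols + n] = x.data[m * kCols + n];
            window_n += kBlockTileN; // "move_tile_window" by {0, Block_Tile_N}
        }
    }
    return y;
}

constexpr bool same(const Matrix& a, const Matrix& b)
{
    for(int i = 0; i < kRows * kCols; ++i)
        if(a.data[i] != b.data[i])
            return false;
    return true;
}

static_assert(same(tiled_copy(make_input()), make_input()), "copy must be exact");
```

On the device, the two inner loops are replaced by the tile distribution: all threads of the block move their share of the tile in parallel.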
### How Load/Store Works
1. **Load Tile**:
   - Each thread loads its assigned elements based on tile distribution
   - Vectorized loads enable efficient memory bandwidth utilization
   - Data is distributed to per-thread register buffers
2. **Store Tile**:
   - Each thread writes its assigned elements back to global memory
   - Maintains the same distribution pattern as load
3. **Tile Window Movement**:
   - Moves the window to the next tile along N dimension
   - Enables processing of large tensors that don't fit in one tile
## Memory Access Patterns
### Vectorized Access
- Enabled by specifying vector length in tensor views
- Each thread loads/stores multiple contiguous elements in one operation
- Improves memory bandwidth utilization
### Thread Distribution
- Tile distribution encoding determines which threads access which elements
- Ensures all threads participate and no data is missed
- Enables memory coalescing for optimal performance
### Coordinate Transform (Embed)
- Maps multi-dimensional tensor indices to linear memory addresses
- Handles stride calculations automatically
- Enables efficient access to non-contiguous memory layouts
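For a naive (strided) layout, the index-to-address mapping reduces to a dot product of indices and strides. The sketch below shows that arithmetic together with a simplified vectorization condition. `linear_offset` and `can_vectorize` are hypothetical helpers, not CK Tile functions, and the condition ignores base-pointer alignment:

```cpp
// Hypothetical helpers for illustration; not part of the CK Tile API.

// Offset of element (m, n) for a 2D tensor with the given strides.
constexpr long linear_offset(long m, long n, long stride_m, long stride_n)
{
    return m * stride_m + n * stride_n;
}

// Simplified condition: the last dimension is contiguous and its length
// is a multiple of the vector length (pointer alignment is not modeled).
constexpr bool can_vectorize(long n_length, long stride_n, long vector_length)
{
    return stride_n == 1 && n_length % vector_length == 0;
}

// Packed row-major M x 8 tensor: strides (N, 1) = (8, 1).
static_assert(linear_offset(2, 3, 8, 1) == 19, "2*8 + 3");
// Same logical tensor with rows padded to 16 elements (non-contiguous rows).
static_assert(linear_offset(2, 3, 16, 1) == 35, "2*16 + 3");
// Vector length 4 works for N = 8 with unit stride, but not with stride 2.
static_assert(can_vectorize(8, 1, 4), "contiguous and divisible");
static_assert(!can_vectorize(8, 2, 4), "strided last dimension");
```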