* chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>
CK Tile Framework: Getting Started with Tile Copy Operations
Overview
Copy Kernel
A minimal CK_Tile memory copy implementation demonstrating the basic setup required to write a kernel in CK Tile. This experimental kernel is intended for novice CK developers. It introduces the building blocks of CK Tile and provides a sandbox for experimenting with kernel parameters.
build
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture
# (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
# Make the copy kernel executable
make tile_example_copy -j
This will result in an executable build/bin/test_copy_basic
example
args:
-m input matrix rows. (default 64)
-n input matrix cols. (default 8)
-id wave to use for computation. (default 0)
-v validation flag to check device results. (default 1)
-prec datatype precision to use. (default fp16)
-warmup no. of warmup iterations. (default 50)
-repeat no. of iterations for kernel execution time. (default 100)
CK Tile Architecture Components
The CK Tile framework is built around four key architectural components that work together to define and execute GPU kernels: shape, policy, problem, and pipeline.
1. Shape
Defines the hierarchical tile structure and memory layout of the kernel:
using Shape = ck_tile::TileCopyShape<BlockWaves, BlockTile, WaveTile, ThreadTile>;
Components:
- BlockWaves: Number of concurrent waves per block (e.g.,
seq<4, 1>for 4 waves along M, 1 along N) - BlockTile: Total elements processed by one block (e.g.,
seq<512, 8>) - WaveTile: Elements processed by one wave (e.g.,
seq<32, 8>) - ThreadTile: Elements processed by one thread (e.g.,
seq<1, 4>for 4 contiguous elements)
Purpose: Defines the work distribution hierarchy from threads → waves → blocks.
2. Problem
Defines the data types and kernel configuration:
using Problem = ck_tile::TileCopyProblem<XDataType, Shape>;
Components:
- XDataType: Input/output data type (e.g.,
float,half) - Shape: The tile shape defined above
Purpose: Encapsulates what the kernel operates on and how it's configured.
3. Policy
Defines the memory access patterns and distribution strategies:
using Policy = ck_tile::TileCopyPolicy<Problem>;
Key Functions:
- MakeDRAMDistribution(): Defines how threads access DRAM memory.
Purpose: Defines how data is accessed and distributed across threads.
4. Pipeline
Defines the execution flow and memory movement patterns:
// Example pipeline stages:
// 1. DRAM → Registers (load_tile)
// 2. Registers → LDS (store_tile)
// 3. LDS → Registers (load_tile with distribution)
// 4. Registers → DRAM (store_tile)
Purpose: Defines the sequence of operations and memory movement strategy.
Component Interaction
// Complete kernel definition
using Shape = ck_tile::TileCopyShape<BlockWaves, BlockTile, WaveTile, ThreadTile>;
using Problem = ck_tile::TileCopyProblem<XDataType, Shape>;
using Policy = ck_tile::TileCopyPolicy<Problem>;
using Kernel = ck_tile::TileCopyKernel<Problem, Policy>;
Flow:
- Shape defines the tile structure and work distribution
- Problem combines data types with the shape
- Policy defines memory access patterns for the problem
- Kernel implements the actual computation using all components
Why This Architecture?
Separation of Concerns
- Shape: Focuses on work distribution and tile structure
- Problem: Focuses on data types and configuration
- Policy: Focuses on memory access and optimization
- Pipeline: Focuses on execution flow and synchronization
Reusability
- Same Shape can be used with different Problems
- Same Policy can be applied to different Problems
- Pipelines can be reused across different kernels
Performance Optimization
- Shape enables optimal work distribution
- Policy enables optimal memory access patterns
- Pipeline enables optimal execution flow
Core Concepts
Hierarchical Tile Structure
The CK Tile framework organizes work in a hierarchical manner:
-
ThreadTile: Number of contiguous elements processed by a single thread
- Enables vectorized memory loads/stores.
- Example:
ThreadTile = seq<1, 4>means each thread loads 4 contiguous elements along the N dimension - A ThreadTile can be imagined as a thread-level tile
-
WaveTile: Number of elements covered by a single wave (64 threads on CDNA, 32 threads on RDNA)
- Must satisfy:
Wave_Tile_M / ThreadTile_M * Wave_Tile_N / ThreadTile_N == WaveSize - This ensures the number of threads needed equals the wave size
- Example:
WaveTile = seq<64, 4>withThreadTile = seq<1, 4>means:- Each thread handles 4 elements (ThreadTile_N = 4)
- Wave needs 64×4/4 = 64 threads to cover 64×4 = 256 elements
- Total elements = 256, which requires WaveSize = 64 threads
- Must satisfy:
-
BlockTile: Number of elements covered by one block (typically mapped to one CU)
- Example:
BlockTile = seq<256, 64>means each block processes 256×64 elements
- Example:
-
BlockWaves: Number of concurrent waves active in a block
- Typical: 4 waves for heavy workloads (e.g., GEMM)
- Limit: up to 1024 threads per block → up to 16 waves (CDNA) or 32 waves (RDNA)
- Example:
BlockWaves = seq<4, 1>means 4 waves along M, 1 along N
Wave Repetition
In many scenarios, the total work (BlockTile) is larger than what the available waves can cover in a single iteration. This requires wave repetition:
// Calculate how many times a wave needs to repeat to cover the entire block tile
static constexpr index_t WaveRepetitionPerBlock_M =
Block_Tile_M / (Waves_Per_Block_M * Wave_Tile_M);
static constexpr index_t WaveRepetitionPerBlock_N =
Block_Tile_N / (Waves_Per_Block_N * Wave_Tile_N);
Key Insight: When waves repeat, the effective work per thread becomes ThreadTile * Repeat, not just ThreadTile.
Tile Distribution Encoding
The tile distribution encoding specifies how work is distributed across threads:
constexpr auto outer_encoding =
tile_distribution_encoding<sequence<1>, // replication
tuple<sequence<M0, M1, M2>, sequence<N0, N1>>, // hierarchy
tuple<sequence<1>, sequence<1, 2>>, // parallelism
tuple<sequence<1>, sequence<2, 0>>, // paralleism
sequence<1, 2>, // yield
sequence<0, 1>>{}; // yield
Encoding Parameters Explained
- M0, M1, M2: Hierarchical distribution along M dimension
- M0: Number of wave iterations along M
- M1: Number of waves along M
- M2: Number of threads per wave along M
- N0, N1: Distribution along N dimension
- N0: Number of threads along N
- N1: ThreadTile size (elements per thread)
- Order and layout: The inner-most (rightmost) dimension is the fastest-changing. Choosing
N1 = ThreadTile_Nmaps vector width to contiguous addresses, i.e., row-major access in this example. - YIELD arguments: Both
RepeatandThreadTilebecause effective work per thread isThreadTile * Repeat
Tensor Abstractions
Tensor Descriptor
Defines the logical structure of a tensor:
auto desc = make_naive_tensor_descriptor(
make_tuple(M, N), // tensor dimensions
make_tuple(N, 1), // strides
number<ThreadTile_N>{}, // per-thread vector length
number<1>{} // guaranteed last dimension vector stride
);
Tensor View
Combines memory buffer with tensor descriptor:
auto x_m_n = make_naive_tensor_view<address_space_enum::global>(
p_x, // memory buffer
make_tuple(M, N), // dimensions
make_tuple(N, 1), // strides
number<S::ThreadTile_N>{}, // per-thread vector length
number<1>{} // guaranteed last dimension vector stride
);
Tile Window
A view into a specific tile of the tensor with thread distribution:
auto x_window = make_tile_window(
x_m_n, // tensor view
make_tuple(Block_Tile_M, Block_Tile_N), // tile size
{iM, 0}, // tile origin
tile_distribution // how work is distributed among threads
);
The test_copy_basic Kernel
Kernel Structure
The TileCopyKernel implements a basic copy operation from input tensor x to output tensor y:
template <typename Problem_, typename Policy_>
struct TileCopyKernel
{
CK_TILE_DEVICE void operator()(const XDataType* p_x, XDataType* p_y, index_t M, index_t N) const
{
// 1. Create tensor views
// 2. Create tile windows
// 3. Iterate over N dimension tiles
// 4. Load, copy, and store data
}
};
Step-by-Step Execution
-
Tensor View Creation:
const auto x_m_n = make_naive_tensor_view<address_space_enum::global>( p_x, make_tuple(M, N), make_tuple(N, 1), number<S::ThreadTile_N>{}, number<1>{});- Creates views for both input and output tensors
- Specifies vectorized access with
ThreadTile_Nelements per load
-
Tile Window Creation:
auto x_window = make_tile_window(x_m_n, make_tuple(number<S::Block_Tile_M>{}, number<S::Block_Tile_N>{}), {iM, 0}, Policy::template MakeDRAMDistribution<Problem>());- Creates windows into specific tiles of the tensors
- Each block processes one tile starting at
{iM, 0} - Tile distribution determines how threads access data
-
N-Dimension Iteration:
index_t num_n_tile_iteration = __builtin_amdgcn_readfirstlane(integer_divide_ceil(N, S::Block_Tile_N)); for(int iN = __builtin_amdgcn_readfirstlane(0); iN < num_n_tile_iteration; ++iN)- If tensor N dimension > Block_Tile_N, multiple iterations are needed
- Each iteration processes one tile along N dimension
-
Load-Store Operations:
dram_reg_tile dram_tile; load_tile(dram_tile, x_window); // Load from global memory to registers store_tile(y_window, dram_tile); // Store from registers to global memory move_tile_window(x_window, {0, S::Block_Tile_N}); // Move to next N tile move_tile_window(y_window, {0, S::Block_Tile_N});
How Load/Store Works
-
Load Tile:
- Each thread loads its assigned elements based on tile distribution
- Vectorized loads enable efficient memory bandwidth utilization
- Data is distributed to per-thread register buffers
-
Store Tile:
- Each thread writes its assigned elements back to global memory
- Maintains the same distribution pattern as load
-
Tile Window Movement:
- Moves the window to the next tile along N dimension
- Enables processing of large tensors that don't fit in one tile
Memory Access Patterns
Vectorized Access
- Enabled by specifying vector length in tensor views
- Each thread loads/stores multiple contiguous elements in one operation
- Improves memory bandwidth utilization
Thread Distribution
- Tile distribution encoding determines which threads access which elements
- Ensures all threads participate and no data is missed
- Enables memory coalescing for optimal performance
Coordinate Transform (Embed)
- Maps multi-dimensional tensor indices to linear memory addresses
- Handles stride calculations automatically
- Enables efficient access to non-contiguous memory layouts