Add Operation Support Matrix to Dispatcher README Added an Operation Support Matrix to the Dispatcher README, detailing CK Tile operations with support status for various data types, layouts, and GPU targets. ## Motivation Provide a clear understanding of which operators (and variants) are supported by dispatcher. ## Technical Details Entirely generated by a skill. ## Test Plan N/A. This is a documentation-only change. ## Test Result N/A. This is a documentation-only change. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
CK Tile Dispatcher
A unified kernel dispatch system for AMD GPUs with C++ and Python frontends.
Validated Platform: AMD Instinct MI300 series (gfx942)
Table of Contents
- Quick Start
- Docker Setup
- Prerequisites
- Step-by-Step Build Guide
- Running Examples
- External Integration
- Core Concepts
- Operation Support Matrix
- Troubleshooting
- File Structure
Quick Start
Complete setup from scratch (5 minutes):
# From the composable_kernel root directory
cd dispatcher
# Step 1: Create build directory
mkdir -p build && cd build
# Step 2: Configure CMake
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942" \
-DBUILD_DISPATCHER_EXAMPLES=ON
# Step 3: Generate kernels and build (CMake handles this automatically)
make -j$(nproc)
# Step 4: Run C++ examples
./examples/gemm_01_basic
# Step 5: Build Python libraries (required for Python examples)
make python_libs
# Step 6: Run Python examples (from dispatcher directory)
cd ..
python3 examples/gemm/python/01_basic_gemm.py
Docker Setup (Recommended)
For a reproducible build environment, use the official ROCm Docker image:
Step 1: Pull and Run Container
# Pull the CK Docker image
docker pull rocm/composable_kernel:ck_ub24.04_rocm7.0.1
# Run container with GPU access
docker run \
-it \
--privileged \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--group-add render \
-w /root/workspace \
-v $(pwd):/root/workspace \
rocm/composable_kernel:ck_ub24.04_rocm7.0.1 \
/bin/bash
Note: Omit
--deviceflags if building without GPU access.
Step 2: Clone and Build
# Inside the container
git clone https://github.com/ROCm/composable_kernel.git
cd composable_kernel
git checkout builder-dispatch-tile-gemm
# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install numpy
# Build dispatcher
cd dispatcher
mkdir -p build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942" \
-DBUILD_DISPATCHER_EXAMPLES=ON
make -j$(nproc)
One-Liner Build (inside container)
git clone https://github.com/ROCm/composable_kernel.git && \
cd composable_kernel && git checkout builder-dispatch-tile-gemm && \
python3 -m venv .venv && source .venv/bin/activate && pip install numpy && \
cd dispatcher && mkdir -p build && cd build && \
cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS="gfx942" -DBUILD_DISPATCHER_EXAMPLES=ON && \
make -j$(nproc)
Prerequisites
Required Software
| Software | Minimum Version | Check Command |
|---|---|---|
| ROCm | 6.4+ | rocminfo |
| CMake | 3.16+ | cmake --version |
| Python | 3.8+ | python3 --version |
| NumPy | 1.20+ | pip show numpy |
| hipcc | (from ROCm) | /opt/rocm/bin/hipcc --version |
Note: Newer GPU targets (gfx950, gfx1201) require ROCm 6.3+. For ROCm 6.4+, you can also use
amdclang++instead ofhipcc.
Check Your GPU Architecture
# Find your GPU architecture
rocminfo | grep -i "gfx"
# Example output: "gfx942"
Supported architectures:
- gfx942 - MI300X, MI300A, MI308, MI325 (Instinct MI300 series)
- gfx90a - MI200 series (MI250, MI250X)
- gfx950 - MI350 series
- gfx1101 - RDNA3 series
- gfx1201 - RDNA4 series
Install Python Dependencies
NumPy is required for Python examples and kernel generation. We recommend using a virtual environment:
Option 1: Using standard venv
# Create virtual environment
python3 -m venv .venv
# Activate virtual environment
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install NumPy
pip install numpy
Option 2: Using uv (faster alternative)
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install NumPy
uv pip install numpy
Option 3: System-wide install (not recommended)
pip install numpy
Note: Always activate your virtual environment before running CMake or Python examples.
Supported Data Types
CK Tile supports a wide range of data types for GEMM operations:
| A dtype | B dtype | Acc dtype | Warp Tile Sizes | Notes |
|---|---|---|---|---|
fp32 |
fp32 |
fp32 |
16x16x4, 16x16x16 | Full precision |
fp16 |
fp16 |
fp32 |
32x32x8, 32x32x16, 16x16x16, 16x16x32 | Standard half |
bf16 |
bf16 |
fp32 |
32x32x8, 32x32x16, 16x16x16, 16x16x32 | Brain float 16 |
fp8 |
fp8 |
fp32 |
32x32x16, 32x32x32, 16x16x32, 16x16x64 | FP8 E4M3 |
fp8 |
bf8 |
fp32 |
32x32x16, 16x16x32 | Mixed FP8/BF8 |
bf8 |
fp8 |
fp32 |
32x32x16, 16x16x128 | Mixed BF8/FP8 |
bf8 |
bf8 |
fp32 |
32x32x16, 32x32x32, 16x16x32 | BF8 E5M2 |
int8 |
int8 |
int32 |
32x32x16, 16x16x32, 16x16x16 | Integer GEMM |
pk_fp4 |
pk_fp4 |
fp32 |
16x16x128 | Packed 4-bit float |
Notes:
- Accumulator is always
fp32except forint8which usesint32 - FP8 types:
fp8= E4M3,bf8= E5M2 pk_fp4= Packed 4-bit float (2 values per byte)- Some dtypes require specific GPU architectures (e.g., FP8 requires MI300+)
Step-by-Step Build Guide
Step 1: Navigate to Dispatcher Directory
# From composable_kernel root
cd dispatcher
# Verify you're in the right place
ls CMakeLists.txt # Should exist
Step 2: Create Build Directory
mkdir -p build
cd build
Step 3: Configure CMake
Basic configuration (library only):
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942"
Full configuration (with examples and tests):
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942" \
-DBUILD_DISPATCHER_EXAMPLES=ON \
-DBUILD_DISPATCHER_TESTS=ON
Expected output:
-- Found hip: /opt/rocm (found suitable version "6.x.x")
-- Generating GEMM kernels...
-- Built: gemm_01 through gemm_06, dispatcher_gemm_lib.so
-- Configuring done
Step 4: Build
# Build all targets (generates kernels automatically, then compiles)
make -j$(nproc)
# Or build specific targets
make gemm_01_basic # Single GEMM example
make dispatcher_gemm_lib # GEMM shared library for Python
# Build ONLY Python libraries (faster if you don't need C++ examples)
make python_libs -j$(nproc)
Kernel Generation Targets
Kernels are generated automatically during make, but you can also control generation explicitly:
# Generate all kernels only (no compilation)
make generate_all_kernels
# Generate GEMM kernels only
make generate_gemm_kernels
# Force regenerate (even if kernels exist)
make regenerate_all_kernels
make regenerate_gemm_kernels
# Generate for specific GPU architecture
make generate_kernels_gfx942 # MI300X
make generate_kernels_gfx90a # MI200
make generate_kernels_gfx1100 # RDNA3
Step 5: Verify Build
# Check executables were built
ls examples/gemm_*
# Check shared libraries were built
ls examples/libdispatcher_gemm_lib.so
CMake Options Reference
| Flag | Default | Description |
|---|---|---|
CMAKE_BUILD_TYPE |
Debug | Use Release for performance! |
GPU_TARGETS |
None | Target GPU: "gfx942", "gfx90a", etc. |
BUILD_DISPATCHER_EXAMPLES |
OFF | Build C++ examples and Python libs |
BUILD_DISPATCHER_TESTS |
OFF | Build unit tests |
CMAKE_PREFIX_PATH |
- | ROCm installation path |
CMAKE_CXX_COMPILER |
- | Path to hipcc compiler |
⚠️ Important: Always use -DCMAKE_BUILD_TYPE=Release for benchmarking. Debug builds are slower.
⚠️ Important: Note that the current system provides single GPU target support for architecture-based kernel filtering, please do not use multiple GPU targets at a time (if necessary, please compile into different build directories).
Running Examples
C++ Examples
After building, executables are in build/examples/:
cd build/examples
# GEMM Examples
./gemm_01_basic # Basic GEMM with autofill/autocorrect
./gemm_02_multi_size # Wildcard expansion
./gemm_03_benchmark_validation # Benchmarking + validation
./gemm_04_heuristics # Heuristic kernel selection
./gemm_05_json_export # Registry JSON export
./gemm_06_multi_registry # Multiple registries
Python Examples
Run from the dispatcher directory:
cd /path/to/composable_kernel/dispatcher
# GEMM Examples
python3 examples/gemm/python/01_basic_gemm.py # Basic multi-kernel GEMM
python3 examples/gemm/python/04_validation.py # CPU reference validation
python3 examples/gemm/python/07_stress_test.py # Stress test (48 kernels)
python3 examples/gemm/python/08_heuristics.py # Heuristic selection
Example Output
Expected C++ output (gemm_01_basic):
======================================================================
Example 01: Basic GEMM with Declarative Kernel Definition
======================================================================
Step 1: Declared Kernels
------------------------
Kernel Set: fp16_gemm_kernels
Architecture: gfx942
Configurations: 1
- gemm_fp16_rcr_compv4_cshuffle_intrawave_128x128x32
Step 2: Create Registry and Dispatcher
--------------------------------------
Registered 1 kernels
Step 3: Define Problem
----------------------
M=1024, N=1024, K=1024
Step 4: GPU Execution
---------------------
*** GPU EXECUTION ***
Time: <varies> ms
TFLOPS: <varies>
Note: Timing values vary by GPU model and system configuration.
Benchmark Parameters
The dispatcher supports fine-grained control over benchmarking, matching CK Tile's stream_config:
Available Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
warmup |
int | 5 | Warmup iterations (discarded from timing) |
repeat |
int | 20 | Benchmark iterations (averaged) |
flush_cache |
bool | false | Flush GPU L2 cache between iterations |
rotating_count |
int | 1 | Rotating buffer count (for cache simulation) |
timer |
string | "gpu" | Timer type: "gpu" (HIP events) or "cpu" |
init |
string | "random" | Matrix initialization: "random", "linear", "constant" |
split_k |
int | 1 | Split-K parallelism factor |
Python Usage
from ctypes_utils import DispatcherLib
# Basic usage (default benchmark settings)
lib = DispatcherLib.load()
# Advanced benchmark settings via command line
python3 examples/gemm/python/10_advanced_benchmark.py \
--warmup 10 \
--repeat 100 \
--flush-cache
C++ Usage
// Basic timing
ck_tile::stream_config cfg{nullptr, true};
// Advanced benchmark settings
ck_tile::stream_config cfg{
nullptr, // stream_id (nullptr = default stream)
true, // time_kernel
1, // log_level
10, // cold_niters (warmup)
100, // nrepeat
true, // is_gpu_timer
true, // flush_cache
4 // rotating_count
};
float avg_time = kernel.run(args, cfg);
Command Line (Python Examples)
# Basic run
python3 examples/gemm/python/10_advanced_benchmark.py
# With benchmark parameters
python3 examples/gemm/python/10_advanced_benchmark.py \
--warmup 10 \
--repeat 100 \
--flush-cache \
--rotating-count 4 \
--timer gpu
When to Use Each Parameter
| Use Case | Recommended Settings |
|---|---|
| Quick test | warmup=1, repeat=3 |
| Stable benchmark | warmup=10, repeat=100 |
| Memory-bound analysis | flush_cache=True, rotating_count=4 |
| Compute-bound analysis | flush_cache=False (default) |
| Debug timing | timer="cpu" |
| Production | timer="gpu" (default) |
External Integration
Using Dispatcher in Your Own Project
Option 1: CMake Integration (Recommended)
Add to your CMakeLists.txt:
# Set path to composable_kernel
set(CK_ROOT "/path/to/composable_kernel")
# Add dispatcher subdirectory
add_subdirectory(${CK_ROOT}/dispatcher dispatcher_build)
# Link to your target
target_link_libraries(your_target PRIVATE ck_tile_dispatcher)
target_include_directories(your_target PRIVATE
${CK_ROOT}/dispatcher/include
${CK_ROOT}/include
)
Option 2: Include as Pre-built Library
# Find the pre-built library
find_library(CK_DISPATCHER ck_tile_dispatcher
PATHS /path/to/composable_kernel/dispatcher/build)
# Include directories
set(CK_INCLUDE_DIRS
/path/to/composable_kernel/include
/path/to/composable_kernel/dispatcher/include
)
target_link_libraries(your_target PRIVATE ${CK_DISPATCHER})
target_include_directories(your_target PRIVATE ${CK_INCLUDE_DIRS})
Option 3: Python Integration
import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")
# For GEMM
from ctypes_utils import DispatcherLib, Dispatcher, KernelConfig
Required Include Paths
When integrating, you need these include paths:
/path/to/composable_kernel/include # CK Tile core headers
/path/to/composable_kernel/dispatcher/include # Dispatcher headers
/path/to/composable_kernel/dispatcher/build/generated_kernels # Generated kernels
Required Compile Flags
# Minimum flags for hipcc
-std=c++17
-D__HIP_PLATFORM_AMD__=1
--offload-arch=gfx942 # Your target GPU
# Recommended flags
-O3
-mllvm -enable-noalias-to-md-conversion=0
-Wno-undefined-func-template
-Wno-float-equal
-Wall
-Werror
Python Path Setup
For Python scripts outside the dispatcher directory:
# Option 1: Environment variable
export PYTHONPATH="/path/to/composable_kernel/dispatcher/examples/gemm/python:$PYTHONPATH"
# Option 2: In your Python script
import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")
Library Search Paths
The Python utilities search for the shared library in these locations:
# For GEMM (ctypes_utils.py)
SEARCH_PATHS = [
"build/examples/libdispatcher_gemm_lib.so",
"../build/examples/libdispatcher_gemm_lib.so",
"../../build/examples/libdispatcher_gemm_lib.so",
]
If using from a different location, set the library path explicitly:
# GEMM
from ctypes_utils import DispatcherLib
lib = DispatcherLib.load("/absolute/path/to/libdispatcher_gemm_lib.so")
Core Concepts
Data Flow
KernelConfig → Registry → Dispatcher → GPU Execution
- KernelConfig: Defines kernel parameters (tile sizes, data types, layouts)
- Registry: Stores multiple kernel configurations
- Dispatcher: Selects best kernel for a given problem and executes it
GEMM Layouts
| Layout | A | B | C | Use Case |
|---|---|---|---|---|
| RCR | Row | Col | Row | Most common (PyTorch default) |
| RRR | Row | Row | Row | Both inputs row-major |
| CRR | Col | Row | Row | A transposed |
| CCR | Col | Col | Row | Both inputs column-major |
Split-K Support
Split-K divides the K dimension across multiple thread blocks, useful for large K dimensions.
Usage (C++):
// GEMM with 4-way K split
auto problem = ProblemBuilder()
.m(1024).n(1024).k(8192)
.split_k(4)
.build();
Operation Support Matrix
This matrix shows all CK Tile operations with per-data-type, per-layout, and per-GPU support status. It uses a three-state convention: ✅ = supported by both CK Tile and the dispatcher, ❌ = supported by CK Tile but not yet in the dispatcher, blank = not supported by CK Tile itself.
| Data Types | Layouts | GPU Targets | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Op | CK Tile Kernel | fp16 | fp8 | bf16 | bf8 | int8 | fp4 | fp6 | rcr | rrr | ccr | crr | 90a | 942 | 950 | 1201 |
| GEMM | gemm_multi_d [5] engine: dispatcher/example: 19_gemm_multi_d/ |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| GEMM | gemm_preshuffle [1][2] engine: dispatcher/ |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | |||||
| GEMM | gemm_universal [3][4][7][8] engine: dispatcher/example: 03_gemm/ |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| GEMM | batched_contraction example: 41_batched_contraction/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| GEMM | batched_gemm example: 16_batched_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | block_scale_gemm example: 38_block_scale_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | flatmm example: 18_flatmm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_multi_abd example: 22_gemm_multi_abd/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_quant | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||
| GEMM | grouped_gemm example: 17_grouped_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | grouped_gemm_quant | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||
| GEMM | streamk_gemm example: 40_streamk_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Reduce | multi_reduce2d example: 05_reduce/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Reduce | reduce2d example: 05_reduce/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Attention | fmha example: 01_fmha/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Attention | sparse_attn example: 50_sparse_attn/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||||
| Activation | softmax | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Activation | topk_softmax example: 09_topk_softmax/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||||
| Conv | grouped_conv [6] example: 20_grouped_convolution/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||
| Data Move | batched_transpose example: 35_batched_transpose/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||||
| Data Move | image_to_column example: 04_img2col/ |
❌ | ❌ | ❌ | ❌ | ❌ | ||||||||||
| Data Move | permute example: 06_permute/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||
| Elementwise | elementwise example: 21_elementwise/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||
| MoE | fused_moe example: 15_fused_moe/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Norm | add_rmsnorm2d_rdquant example: 11_add_rmsnorm2d_rdquant/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Norm | layernorm2d example: 02_layernorm2d/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Norm | norm_reduce | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Norm | rmsnorm2d example: 10_rmsnorm2d/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Pooling | pooling example: 36_pooling/ |
❌ | ❌ | ❌ | ❌ | ❌ | ||||||||||
| Quant | smoothquant example: 12_smoothquant/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Notes:
- [1] gemm_preshuffle: Supports only
rcrlayout. Uses fixedpreshufflev2pipeline,Autoscheduler, andcshuffleepilogue. - [2] gemm_preshuffle:
int8preshuffle support is limited to gfx942 and gfx950 (entries inpreshuffle_warp_tile_combos). - [3] gemm_universal:
fp4(pk_fp4) support is only available on gfx950. - [4] gemm_universal:
fp32GEMM is supported by the dispatcher (fp32_fp32_fp32warp tile combos exist) but is omitted from matrix columns for consistency with the tile engine matrix format. - [5] gemm_multi_d: Codegen supports
MultiDAddandMultiDMultiplyelement-wise ops. Preselected kernel sets also testRelu,Gelu,FastGelu. - [6] grouped_conv:
arch_filter.pydefines conv operator types (CONV_FWD,CONV_BWD_DATA,CONV_BWD_WEIGHT,CONV3D_*) but dispatcher infrastructure is incomplete (ctypes bindings are stubs,conv_utils.hppdoes not exist). - [7] (all dispatcher ops): gfx908, gfx1100, and gfx1200 also have
warp_tile_combosinarch_specs.jsonbut are not shown in the matrix's 4 GPU columns. - [8] (all dispatcher ops):
int4,fp32,fp64are valid dispatcher data types (defined inkernel_key.hppDataTypeenum) but have no dedicated matrix columns.
Dispatcher GEMM Configuration Detail
Per-Variant Configuration
| GEMM Variant | Pipelines | Schedulers | Epilogues | Element-wise Ops | Output Dtype |
|---|---|---|---|---|---|
| gemm_universal | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | PassThrough | Same as input (fp8/bf8 -> fp16) |
| gemm_preshuffle | preshufflev2 | Auto | cshuffle | PassThrough | Same as input (fp8/bf8 -> fp16) |
| gemm_multi_d | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | MultiDAdd, MultiDMultiply | Same as input (fp8/bf8 -> fp16) |
Warp Tile Combinations per GPU
| GPU | fp16 | bf16 | fp8 | bf8 | int8 | pk_fp4 |
|---|---|---|---|---|---|---|
| gfx1100 | 16x16x16 | 16x16x16 | -- | -- | 16x16x16 | -- |
| gfx1200 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- |
| gfx1201 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- |
| gfx908 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | -- | -- | 32x32x16, 16x16x32 | -- |
| gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | 32x32x16, 16x16x32 | -- |
| gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 16x16x32 | -- |
| gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 16x16x32 | 16x16x128 |
Preshuffle Warp Tile Combinations
| GPU | fp16 | bf16 | fp8 | bf8 | int8 |
|---|---|---|---|---|---|
| gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | -- |
| gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32 | 16x16x32, 32x32x16 |
| gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32, 16x16x128, 32x32x64 | -- |
Legend:
- CK Tile Kernel column: First line is the kernel name. Lines prefixed with "engine:" show the dispatcher directory. Lines prefixed with "example:" show the CK Tile example directory under
example/ck_tile/. - Green cell (✅): CK Tile implementation exists and the dispatcher supports it.
- Red cell (❌): CK Tile implementation exists but the dispatcher does not support it.
- Grey cell (blank): No CK Tile implementation exists for this combination.
Layout codes: Each 3-character layout code specifies the memory layout for tensors A, B, and C:
r= row-major,c= column-major- Example:
rcrmeans A is row-major, B is column-major, C is row-major gemm_multi_duses 4-character codes internally (e.g.,rcrr) where the 4th character is the D tensor layout (alwaysr). The matrix shows only the 3-character A/B/C portion.
Data type mapping per config label:
| Config Label | A (source) | B (source) | Acc | C (output) |
|---|---|---|---|---|
| fp16 | fp16 | fp16 | fp32 | fp16 |
| bf16 | bf16 | bf16 | fp32 | bf16 |
| int8 | int8 | int8 | int32 | int32 |
| fp8 | fp8 | fp8 | fp32 | fp16 |
| bf8 | bf8 | bf8 | fp32 | fp16 |
| fp6 | fp6 | fp6 | fp32 | fp32 |
| fp4 | fp16 or bf16 | fp4 | fp32 | fp16 or bf16 |
Troubleshooting
Build Issues
| Problem | Solution |
|---|---|
hipcc not found |
Set -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc |
hip not found |
Set -DCMAKE_PREFIX_PATH=/opt/rocm |
| Very slow performance | Use -DCMAKE_BUILD_TYPE=Release |
gfx942 not supported |
Check ROCm version (need 6.0+) |
| Kernel generation fails | Ensure Python 3.8+ with NumPy installed in active venv |
| Build errors | First verify CK builds without dispatcher (see main CK README) |
Runtime Issues
| Problem | Solution |
|---|---|
Library not found |
Build with -DBUILD_DISPATCHER_EXAMPLES=ON |
No kernel found |
Check GPU arch matches build target |
Python ModuleNotFoundError |
Add paths to PYTHONPATH (see above) |
| Wrong results | Verify layout matches your data |
Debug Commands
# Check ROCm installation
rocminfo | head -20
# Check GPU architecture
rocminfo | grep "Name:"
# Verify library exists
ls -la build/examples/libdispatcher_*.so
# Run with verbose output
./build/examples/gemm_01_basic 2>&1
# Python: Check library loading
python3 -c "
import ctypes
lib = ctypes.CDLL('/path/to/libdispatcher_gemm_lib.so')
print('Library loaded successfully')
"
Clean Rebuild
If you encounter issues, try a clean rebuild:
cd dispatcher
rm -rf build
mkdir build && cd build
cmake .. [your options]
make -j$(nproc)
File Structure
dispatcher/
├── README.md # This file
├── CMakeLists.txt # Build configuration
│
├── include/ck_tile/dispatcher/ # C++ headers
│ ├── dispatcher.hpp # GEMM dispatcher
│ ├── registry.hpp # Kernel registry
│ └── kernel_key.hpp # Kernel configuration
│
├── src/ # C++ implementation
│
├── codegen/ # Kernel generation
│ ├── unified_gemm_codegen.py # GEMM kernel generator
│ └── arch_specs.json # GPU specifications
│
├── bindings/ctypes/ # Python ctypes interface
│ └── gemm_ctypes_lib.cpp # GEMM Python library
│
├── examples/ # Examples
│ └── gemm/
│ ├── cpp/ # C++ GEMM examples (01-06)
│ └── python/ # Python GEMM examples (01-11)
│
├── scripts/ # Build scripts
│
└── tests/ # Unit tests
Example Documentation
| Directory | README |
|---|---|
| GEMM C++ | examples/gemm/cpp/README.md |
| GEMM Python | examples/gemm/python/README.md |
| Codegen | codegen/README.md |
Archived Content
Convolution examples and utilities have been archived to ck-2/conv_archive/dispatcher/:
examples/conv/cpp/- 11 C++ convolution examplesexamples/conv/python/- 14 Python convolution examplescodegen/unified_conv_codegen.py- Conv kernel generatorinclude/ck_tile/dispatcher/conv_*.hpp- Conv headerspython/conv_utils.py- Conv Python utilities
License
MIT License - Copyright (c) 2025, Advanced Micro Devices, Inc.