[CK][CK TILE] Dispatcher kernel selection heuristic for grouped conv (#6327) ## Motivation The ML heuristic in dispatcher does not support grouped-conv operator yet. In this PR, the support for fwd, bdw-data, and bwd-weight grouped-conv kernels have been added. A tile_engine utility has also been added to compile and run any selected kernel configuration through dispatcher infrastructure. ## Technical Details 1. Tile engine utility is added to benchmark each shape with all the possible kernel+tile_size combinations here - [https://github.com/ROCm/rocm-libraries/blob/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/tile_engine/ops/grouped_conv/grouped_conv_full_benchmark.py](url) 2. New LGBM regressor models for grouped conv are added to models directory. We have 3 separate models for fwd, bwd-data, and bwd-weights [https://github.com/ROCm/rocm-libraries/tree/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/dispatcher/heuristics/models](url) 3. Implemented lazy GPU initialization (dispatcher/python) - **Issue**: ProcessPoolExecutor fork() + GPU context caused memory access faults - **Solution**: Mirror FMHA pattern - defer GPU initialization until first run() - **Changes**: - setup_multiple_grouped_conv_dispatchers() returns List[Path], not loaded libs - GpuGroupedConvRunner.__init__() no longer calls ctypes.CDLL - Added _ensure_initialized() method for lazy GPU loading - GPU context created only on first run() call - **Benefit**: Parallel compilation now works without GPU conflicts 4. Addressed few miscellaneous issues such as: - Fixed BF16->FP16 naming bug in the dispatcher wrapper - Added new tile sizes, and comp_v5 pipeline to the arch spec to expand the kernel selection - Added automatic padding support for unsupported shapes in dispatcher runner - Created a single source of truth between tile_engine and dispatcher about the architecture and tile_size details - Build a validation scripts to compare oracle_best vs ml_heuristic comparison ## Test Plan 1. Validated fwd, bwd-data, and bwd-weight kernels with both known and unseen data sets with up to 300 problems. 2. Ensured that test cases are added in both dispatcher and tile_engine to validate the heuristic. ## Test Result Results on Unseen shapes validated on gfx950 #### Forward Pass Model - **Training Data**: 48,845 measurements across 1,372 unique problem shapes - **Validation Set**: 300 unseen problems from model crawler - **Validation Performance** (vs. oracle): - Mean Efficiency: **93.05%** - Median Efficiency: **96.8%** - P10 Efficiency: **79.9%** #### Backward Data Gradient (bwd_data) Model - **Training Data**: 18,773 measurements across 891 unique problem shapes - **Validation Set**: 300 unseen problems from model crawler - **Validation Performance** (vs. oracle): - Mean Efficiency: **93.8%** - Median Efficiency: **96.5%** - P10 Efficiency: **82.9%** #### Backward Weight Gradient (bwd_weight) Model - **Training Data**: 34,900 measurements across 1,508 unique problem shapes - **Validation Set**: 300 unseen problems from model crawler - **Validation Performance** (vs. oracle): - Mean Efficiency: **96.1%** - Median Efficiency: **99.2%** - P10 Efficiency: **89.4%** ## Submission Checklist - [ x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
CK Tile Dispatcher
A unified kernel dispatch system for AMD GPUs with C++ and Python frontends, supporting GEMM and Grouped Convolution operations.
Validated Platform: AMD Instinct MI300 series (gfx942)
Table of Contents
- Quick Start
- Docker Setup
- Prerequisites
- Step-by-Step Build Guide
- Running Examples
- External Integration
- Core Concepts
- Operation Support Matrix
- Troubleshooting
- File Structure
Quick Start
Complete setup from scratch (5 minutes):
# From the composable_kernel root directory
cd dispatcher
# Step 1: Create build directory
mkdir -p build && cd build
# Step 2: Configure CMake
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942" \
-DBUILD_DISPATCHER_EXAMPLES=ON
# Step 3: Generate kernels and build (CMake handles this automatically)
make -j$(nproc)
# Step 4: Run C++ examples
./examples/gemm_01_basic
# Step 5: Build Python libraries (required for Python examples)
make python_libs
# Step 6: Run Python examples (from dispatcher directory)
cd ..
python3 examples/gemm/python/01_basic_gemm.py
Docker Setup (Recommended)
For a reproducible build environment, use the official ROCm Docker image:
Step 1: Pull and Run Container
# Pull the CK Docker image
docker pull rocm/composable_kernel:ck_ub24.04_rocm7.0.1
# Run container with GPU access
docker run \
-it \
--privileged \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--group-add render \
-w /root/workspace \
-v $(pwd):/root/workspace \
rocm/composable_kernel:ck_ub24.04_rocm7.0.1 \
/bin/bash
Note: Omit
--deviceflags if building without GPU access.
Step 2: Clone and Build
# Inside the container
git clone https://github.com/ROCm/composable_kernel.git
cd composable_kernel
git checkout builder-dispatch-tile-gemm
# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install numpy
# Build dispatcher
cd dispatcher
mkdir -p build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942" \
-DBUILD_DISPATCHER_EXAMPLES=ON
make -j$(nproc)
One-Liner Build (inside container)
git clone https://github.com/ROCm/composable_kernel.git && \
cd composable_kernel && git checkout builder-dispatch-tile-gemm && \
python3 -m venv .venv && source .venv/bin/activate && pip install numpy && \
cd dispatcher && mkdir -p build && cd build && \
cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS="gfx942" -DBUILD_DISPATCHER_EXAMPLES=ON && \
make -j$(nproc)
Prerequisites
Required Software
| Software | Minimum Version | Check Command |
|---|---|---|
| ROCm | 6.4+ | rocminfo |
| CMake | 3.16+ | cmake --version |
| Python | 3.8+ | python3 --version |
| NumPy | 1.20+ | pip show numpy |
| hipcc | (from ROCm) | /opt/rocm/bin/hipcc --version |
Note: Newer GPU targets (gfx950, gfx1201) require ROCm 6.3+. For ROCm 6.4+, you can also use
amdclang++instead ofhipcc.
Check Your GPU Architecture
# Find your GPU architecture
rocminfo | grep -i "gfx"
# Example output: "gfx942"
Supported architectures:
- gfx942 - MI300X, MI300A, MI308, MI325 (Instinct MI300 series)
- gfx90a - MI200 series (MI250, MI250X)
- gfx950 - MI350 series
- gfx1101 - RDNA3 series
- gfx1201 - RDNA4 series
Install Python Dependencies
Core Dependencies (Required)
NumPy is required for Python examples and kernel generation. We recommend using a virtual environment:
Option 1: Using standard venv
# Create virtual environment
python3 -m venv .venv
# Activate virtual environment
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install core dependencies
pip install -r python/requirements.txt
Option 2: Using uv (faster alternative)
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install core dependencies
uv pip install -r python/requirements.txt
Option 3: System-wide install (not recommended)
pip install -r python/requirements.txt
Note: Always activate your virtual environment before running CMake or Python examples.
ML Heuristics Dependencies (Optional)
For ML-based kernel selection (examples 09-11), install additional dependencies:
# Activate your virtual environment first
source .venv/bin/activate
# Install ML dependencies (LightGBM, pandas, pyarrow, scikit-learn)
pip install -r requirements-ml.txt
Why separate? ML dependencies are large (especially pyarrow) and not needed for basic dispatcher usage. Install only if you need:
- ML-based kernel selection (
examples/gemm/python/09_ml_heuristic.py) - Model training (
heuristics/train.py) - Model evaluation (
heuristics/evaluate.py) - Automated benchmark analysis
Core dependencies: ~50 MB (NumPy only) With ML dependencies: ~500 MB (includes LightGBM, pandas, pyarrow, scikit-learn)
Supported Data Types
CK Tile supports a wide range of data types for GEMM operations:
| A dtype | B dtype | Acc dtype | Warp Tile Sizes | Notes |
|---|---|---|---|---|
fp32 |
fp32 |
fp32 |
16x16x4, 16x16x16 | Full precision |
fp16 |
fp16 |
fp32 |
32x32x8, 32x32x16, 16x16x16, 16x16x32 | Standard half |
bf16 |
bf16 |
fp32 |
32x32x8, 32x32x16, 16x16x16, 16x16x32 | Brain float 16 |
fp8 |
fp8 |
fp32 |
32x32x16, 32x32x32, 16x16x32, 16x16x64 | FP8 E4M3 |
fp8 |
bf8 |
fp32 |
32x32x16, 16x16x32 | Mixed FP8/BF8 |
bf8 |
fp8 |
fp32 |
32x32x16, 16x16x128 | Mixed BF8/FP8 |
bf8 |
bf8 |
fp32 |
32x32x16, 32x32x32, 16x16x32 | BF8 E5M2 |
int8 |
int8 |
int32 |
32x32x16, 16x16x32, 16x16x16 | Integer GEMM |
pk_fp4 |
pk_fp4 |
fp32 |
16x16x128 | Packed 4-bit float |
Notes:
- Accumulator is always
fp32except forint8which usesint32 - FP8 types:
fp8= E4M3,bf8= E5M2 pk_fp4= Packed 4-bit float (2 values per byte)- Some dtypes require specific GPU architectures (e.g., FP8 requires MI300+)
Step-by-Step Build Guide
Step 1: Navigate to Dispatcher Directory
# From composable_kernel root
cd dispatcher
# Verify you're in the right place
ls CMakeLists.txt # Should exist
Step 2: Create Build Directory
mkdir -p build
cd build
Step 3: Configure CMake
Basic configuration (library only):
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942"
Full configuration (with examples and tests):
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_BUILD_TYPE=Release \
-DGPU_TARGETS="gfx942" \
-DBUILD_DISPATCHER_EXAMPLES=ON \
-DBUILD_DISPATCHER_TESTS=ON
Expected output:
-- Found hip: /opt/rocm (found suitable version "6.x.x")
-- Generating GEMM kernels...
-- Built: gemm_01 through gemm_06, dispatcher_gemm_lib.so
-- Configuring done
Step 4: Build
# Build all targets (generates kernels automatically, then compiles)
make -j$(nproc)
# Or build specific targets
make gemm_01_basic # Single GEMM example
make dispatcher_gemm_lib # GEMM shared library for Python
# Build ONLY Python libraries (faster if you don't need C++ examples)
make python_libs -j$(nproc)
Kernel Generation Targets
Kernels are generated automatically during make, but you can also control generation explicitly:
# Generate all kernels only (no compilation)
make generate_all_kernels
# Generate GEMM kernels only
make generate_gemm_kernels
# Force regenerate (even if kernels exist)
make regenerate_all_kernels
make regenerate_gemm_kernels
# Generate for specific GPU architecture
make generate_kernels_gfx942 # MI300X
make generate_kernels_gfx90a # MI200
make generate_kernels_gfx1100 # RDNA3
Step 5: Verify Build
# Check executables were built
ls examples/gemm_*
# Check shared libraries were built
ls examples/libdispatcher_gemm_lib.so
CMake Options Reference
| Flag | Default | Description |
|---|---|---|
CMAKE_BUILD_TYPE |
Debug | Use Release for performance! |
GPU_TARGETS |
None | Target GPU: "gfx942", "gfx90a", etc. |
BUILD_DISPATCHER_EXAMPLES |
OFF | Build C++ examples and Python libs |
BUILD_DISPATCHER_TESTS |
OFF | Build unit tests |
CMAKE_PREFIX_PATH |
- | ROCm installation path |
CMAKE_CXX_COMPILER |
- | Path to hipcc compiler |
WARNING: Important: Always use -DCMAKE_BUILD_TYPE=Release for benchmarking. Debug builds are slower.
WARNING: Important: Note that the current system provides single GPU target support for architecture-based kernel filtering, please do not use multiple GPU targets at a time (if necessary, please compile into different build directories).
Running Examples
C++ Examples
After building, executables are in build/examples/:
cd build/examples
# GEMM Examples
./gemm_01_basic # Basic GEMM with autofill/autocorrect
./gemm_02_multi_size # Wildcard expansion
./gemm_03_benchmark_validation # Benchmarking + validation
./gemm_04_heuristics # Heuristic kernel selection
./gemm_05_json_export # Registry JSON export
./gemm_06_multi_registry # Multiple registries
# Grouped Convolution Examples
./grouped_conv_01_basic # Declaration patterns + GPU execution
./grouped_conv_02_all_dirs # Forward/BwdData/BwdWeight with GPU
./grouped_conv_03_bench_val # Benchmark + CPU reference validation
./grouped_conv_04_registry_json # Heuristic selection + JSON export
./grouped_conv_05_bwd_data # Backward data + CPU validation
./grouped_conv_06_bwd_weight # Backward weight + CPU validation
./grouped_conv_07_benchmark # Multi-tile ResNet benchmark
Python Examples
Run from the dispatcher directory:
cd /path/to/composable_kernel/dispatcher
# GEMM Examples
python3 examples/gemm/python/01_basic_gemm.py # Basic multi-kernel GEMM
python3 examples/gemm/python/04_validation.py # CPU reference validation
python3 examples/gemm/python/07_stress_test.py # Stress test
python3 examples/gemm/python/08_heuristics.py # Heuristic selection
# Grouped Convolution Examples
python3 examples/grouped_conv/python/01_basic_grouped_conv.py # Config patterns + registry + GPU
python3 examples/grouped_conv/python/02_forward.py # Forward 2D/3D + CPU ref
python3 examples/grouped_conv/python/03_bwd_data.py # Backward data + CPU ref
python3 examples/grouped_conv/python/04_bwd_weight.py # Backward weight + CPU ref
python3 examples/grouped_conv/python/05_benchmark.py # Multi-problem benchmark
python3 examples/grouped_conv/python/06_registry_json.py # Heuristic selection + JSON
Example Output
Expected C++ output (gemm_01_basic):
======================================================================
Example 01: Basic GEMM with Declarative Kernel Definition
======================================================================
Step 1: Declared Kernels
------------------------
Kernel Set: fp16_gemm_kernels
Architecture: gfx942
Configurations: 1
- gemm_fp16_rcr_compv4_cshuffle_intrawave_128x128x32
Step 2: Create Registry and Dispatcher
--------------------------------------
Registered 1 kernels
Step 3: Define Problem
----------------------
M=1024, N=1024, K=1024
Step 4: GPU Execution
---------------------
*** GPU EXECUTION ***
Time: <varies> ms
TFLOPS: <varies>
Note: Timing values vary by GPU model and system configuration.
Benchmark Parameters
The dispatcher supports fine-grained control over benchmarking, matching CK Tile's stream_config:
Available Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
warmup |
int | 5 | Warmup iterations (discarded from timing) |
repeat |
int | 20 | Benchmark iterations (averaged) |
flush_cache |
bool | false | Flush GPU L2 cache between iterations |
rotating_count |
int | 1 | Rotating buffer count (for cache simulation) |
timer |
string | "gpu" | Timer type: "gpu" (HIP events) or "cpu" |
init |
string | "random" | Matrix initialization: "random", "linear", "constant" |
split_k |
int | 1 | Split-K parallelism factor |
Python Usage
from ctypes_utils import DispatcherLib
# Basic usage (default benchmark settings)
lib = DispatcherLib.load()
# Advanced benchmark settings via command line
python3 examples/gemm/python/10_advanced_benchmark.py \
--warmup 10 \
--repeat 100 \
--flush-cache
C++ Usage
// Basic timing
ck_tile::stream_config cfg{nullptr, true};
// Advanced benchmark settings
ck_tile::stream_config cfg{
nullptr, // stream_id (nullptr = default stream)
true, // time_kernel
1, // log_level
10, // cold_niters (warmup)
100, // nrepeat
true, // is_gpu_timer
true, // flush_cache
4 // rotating_count
};
float avg_time = kernel.run(args, cfg);
Command Line (Python Examples)
# Basic run
python3 examples/gemm/python/10_advanced_benchmark.py
# With benchmark parameters
python3 examples/gemm/python/10_advanced_benchmark.py \
--warmup 10 \
--repeat 100 \
--flush-cache \
--rotating-count 4 \
--timer gpu
When to Use Each Parameter
| Use Case | Recommended Settings |
|---|---|
| Quick test | warmup=1, repeat=3 |
| Stable benchmark | warmup=10, repeat=100 |
| Memory-bound analysis | flush_cache=True, rotating_count=4 |
| Compute-bound analysis | flush_cache=False (default) |
| Debug timing | timer="cpu" |
| Production | timer="gpu" (default) |
ML-Based Kernel Selection (Optional)
The dispatcher includes ML heuristics for automated kernel selection using trained LightGBM models.
Prerequisites: Install ML dependencies first:
pip install -r requirements-ml.txt # ~500 MB (LightGBM, pandas, pyarrow, scikit-learn)
Documentation: See heuristics/README.md for:
- Training and evaluating models
- Feature engineering (72 features)
- Using pre-trained models
- Python API reference
Examples:
python3 examples/gemm/python/09_ml_heuristic.py # ML-based kernel selection
python3 examples/gemm/python/10_rank_kernels.py # Kernel ranking
Model Compression: Trained models are stored in compressed .lgbm.gz format to save space (~67% size reduction). Python tools automatically decompress models on first use. For C++ examples, decompress manually:
# If you have compressed models
cd heuristics/models/gemm_universal_fp16_gfx950
gunzip model_tflops.lgbm.gz
# Then use in C++ example
cd ../../../build
./gemm_09_ml_heuristic --model ../heuristics/models/gemm_universal_fp16_gfx950/model_tflops.lgbm
External Integration
Using Dispatcher in Your Own Project
Option 1: CMake Integration (Recommended)
Add to your CMakeLists.txt:
# Set path to composable_kernel
set(CK_ROOT "/path/to/composable_kernel")
# Add dispatcher subdirectory
add_subdirectory(${CK_ROOT}/dispatcher dispatcher_build)
# Link to your target
target_link_libraries(your_target PRIVATE ck_tile_dispatcher)
target_include_directories(your_target PRIVATE
${CK_ROOT}/dispatcher/include
${CK_ROOT}/include
)
Option 2: Include as Pre-built Library
# Find the pre-built library
find_library(CK_DISPATCHER ck_tile_dispatcher
PATHS /path/to/composable_kernel/dispatcher/build)
# Include directories
set(CK_INCLUDE_DIRS
/path/to/composable_kernel/include
/path/to/composable_kernel/dispatcher/include
)
target_link_libraries(your_target PRIVATE ${CK_DISPATCHER})
target_include_directories(your_target PRIVATE ${CK_INCLUDE_DIRS})
Option 3: Python Integration
import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")
# For GEMM
from ctypes_utils import DispatcherLib, Dispatcher, KernelConfig
Required Include Paths
When integrating, you need these include paths:
/path/to/composable_kernel/include # CK Tile core headers
/path/to/composable_kernel/dispatcher/include # Dispatcher headers
/path/to/composable_kernel/dispatcher/build/generated_kernels # Generated kernels
Required Compile Flags
# Minimum flags for hipcc
-std=c++17
-D__HIP_PLATFORM_AMD__=1
--offload-arch=gfx942 # Your target GPU
# Recommended flags
-O3
-mllvm -enable-noalias-to-md-conversion=0
-Wno-undefined-func-template
-Wno-float-equal
-Wall
-Werror
Python Path Setup
For Python scripts outside the dispatcher directory:
# Option 1: Environment variable
export PYTHONPATH="/path/to/composable_kernel/dispatcher/examples/gemm/python:$PYTHONPATH"
# Option 2: In your Python script
import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")
Library Search Paths
The Python utilities search for the shared library in these locations:
# For GEMM (ctypes_utils.py)
SEARCH_PATHS = [
"build/examples/libdispatcher_gemm_lib.so",
"../build/examples/libdispatcher_gemm_lib.so",
"../../build/examples/libdispatcher_gemm_lib.so",
]
If using from a different location, set the library path explicitly:
# GEMM
from ctypes_utils import DispatcherLib
lib = DispatcherLib.load("/absolute/path/to/libdispatcher_gemm_lib.so")
Core Concepts
Data Flow
KernelConfig -> Registry -> Dispatcher -> GPU Execution
- KernelConfig: Defines kernel parameters (tile sizes, data types, layouts)
- Registry: Stores multiple kernel configurations
- Dispatcher: Selects best kernel for a given problem and executes it
GEMM Layouts
| Layout | A | B | C | Use Case |
|---|---|---|---|---|
| RCR | Row | Col | Row | Most common (PyTorch default) |
| RRR | Row | Row | Row | Both inputs row-major |
| CRR | Col | Row | Row | A transposed |
| CCR | Col | Col | Row | Both inputs column-major |
Split-K Support
Split-K divides the K dimension across multiple thread blocks, useful for large K dimensions.
Usage (C++):
// GEMM with 4-way K split
auto problem = ProblemBuilder()
.m(1024).n(1024).k(8192)
.split_k(4)
.build();
Operation Support Matrix
This matrix shows all CK Tile operations with per-data-type, per-layout, and per-GPU support status. It uses a three-state convention: ✅ = supported by both CK Tile and the dispatcher, ❌ = supported by CK Tile but not yet in the dispatcher, blank = not supported by CK Tile itself.
| Data Types | Layouts | GPU Targets | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Op | CK Tile Kernel | fp16 | fp8 | bf16 | bf8 | int8 | fp4 | fp6 | rcr | rrr | ccr | crr | 90a | 942 | 950 | 1201 |
| GEMM | gemm_multi_d [5] engine: dispatcher/example: 19_gemm_multi_d/ |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| GEMM | gemm_preshuffle [1][2] engine: dispatcher/ |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | |||||
| GEMM | gemm_universal [3][4][7][8] engine: dispatcher/example: 03_gemm/ |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| GEMM | batched_contraction example: 41_batched_contraction/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| GEMM | batched_gemm example: 16_batched_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | block_scale_gemm example: 38_block_scale_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | flatmm example: 18_flatmm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_multi_abd example: 22_gemm_multi_abd/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_quant | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||
| GEMM | grouped_gemm example: 17_grouped_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | grouped_gemm_quant | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||
| GEMM | streamk_gemm example: 40_streamk_gemm/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Reduce | multi_reduce2d example: 05_reduce/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Reduce | reduce2d example: 05_reduce/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Attention | fmha example: 01_fmha/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Attention | sparse_attn example: 50_sparse_attn/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||||
| Activation | softmax | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Activation | topk_softmax example: 09_topk_softmax/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||||
| Conv | grouped_conv [6] example: 20_grouped_convolution/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||
| Data Move | batched_transpose example: 35_batched_transpose/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||||
| Data Move | image_to_column example: 04_img2col/ |
❌ | ❌ | ❌ | ❌ | ❌ | ||||||||||
| Data Move | permute example: 06_permute/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||
| Elementwise | elementwise example: 21_elementwise/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||
| MoE | fused_moe example: 15_fused_moe/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Norm | add_rmsnorm2d_rdquant example: 11_add_rmsnorm2d_rdquant/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Norm | layernorm2d example: 02_layernorm2d/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Norm | norm_reduce | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | |||||||||
| Norm | rmsnorm2d example: 10_rmsnorm2d/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ||||||
| Pooling | pooling example: 36_pooling/ |
❌ | ❌ | ❌ | ❌ | ❌ | ||||||||||
| Quant | smoothquant example: 12_smoothquant/ |
❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Notes:
- [1] gemm_preshuffle: Supports only
rcrlayout. Uses fixedpreshufflev2pipeline,Autoscheduler, andcshuffleepilogue. - [2] gemm_preshuffle:
int8preshuffle support is limited to gfx942 and gfx950 (entries inpreshuffle_warp_tile_combos). - [3] gemm_universal:
fp4(pk_fp4) support is only available on gfx950. - [4] gemm_universal:
fp32GEMM is supported by the dispatcher (fp32_fp32_fp32warp tile combos exist) but is omitted from matrix columns for consistency with the tile engine matrix format. - [5] gemm_multi_d: Codegen supports
MultiDAddandMultiDMultiplyelement-wise ops. Preselected kernel sets also testRelu,Gelu,FastGelu. - [6] grouped_conv:
arch_filter.pydefines conv operator types (CONV_FWD,CONV_BWD_DATA,CONV_BWD_WEIGHT,CONV3D_*) but dispatcher infrastructure is incomplete (ctypes bindings are stubs,conv_utils.hppdoes not exist). - [7] (all dispatcher ops): gfx908, gfx1100, and gfx1200 also have
warp_tile_combosinarch_specs.jsonbut are not shown in the matrix's 4 GPU columns. - [8] (all dispatcher ops):
int4,fp32,fp64are valid dispatcher data types (defined inkernel_key.hppDataTypeenum) but have no dedicated matrix columns.
Dispatcher GEMM Configuration Detail
Per-Variant Configuration
| GEMM Variant | Pipelines | Schedulers | Epilogues | Element-wise Ops | Output Dtype |
|---|---|---|---|---|---|
| gemm_universal | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | PassThrough | Same as input (fp8/bf8 -> fp16) |
| gemm_preshuffle | preshufflev2 | Auto | cshuffle | PassThrough | Same as input (fp8/bf8 -> fp16) |
| gemm_multi_d | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | MultiDAdd, MultiDMultiply | Same as input (fp8/bf8 -> fp16) |
Warp Tile Combinations per GPU
| GPU | fp16 | bf16 | fp8 | bf8 | int8 | pk_fp4 |
|---|---|---|---|---|---|---|
| gfx1100 | 16x16x16 | 16x16x16 | -- | -- | 16x16x16 | -- |
| gfx1200 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- |
| gfx1201 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- |
| gfx908 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | -- | -- | 32x32x16, 16x16x32 | -- |
| gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | 32x32x16, 16x16x32 | -- |
| gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 16x16x32 | -- |
| gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 16x16x32 | 16x16x128 |
Preshuffle Warp Tile Combinations
| GPU | fp16 | bf16 | fp8 | bf8 | int8 |
|---|---|---|---|---|---|
| gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | -- |
| gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32 | 16x16x32, 32x32x16 |
| gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32, 16x16x128, 32x32x64 | -- |
Legend:
- CK Tile Kernel column: First line is the kernel name. Lines prefixed with "engine:" show the dispatcher directory. Lines prefixed with "example:" show the CK Tile example directory under
example/ck_tile/. - Green cell (✅): CK Tile implementation exists and the dispatcher supports it.
- Red cell (❌): CK Tile implementation exists but the dispatcher does not support it.
- Grey cell (blank): No CK Tile implementation exists for this combination.
Layout codes: Each 3-character layout code specifies the memory layout for tensors A, B, and C:
r= row-major,c= column-major- Example:
rcrmeans A is row-major, B is column-major, C is row-major gemm_multi_duses 4-character codes internally (e.g.,rcrr) where the 4th character is the D tensor layout (alwaysr). The matrix shows only the 3-character A/B/C portion.
Data type mapping per config label:
| Config Label | A (source) | B (source) | Acc | C (output) |
|---|---|---|---|---|
| fp16 | fp16 | fp16 | fp32 | fp16 |
| bf16 | bf16 | bf16 | fp32 | bf16 |
| int8 | int8 | int8 | int32 | int32 |
| fp8 | fp8 | fp8 | fp32 | fp16 |
| bf8 | bf8 | bf8 | fp32 | fp16 |
| fp6 | fp6 | fp6 | fp32 | fp32 |
| fp4 | fp16 or bf16 | fp4 | fp32 | fp16 or bf16 |
Troubleshooting
Build Issues
| Problem | Solution |
|---|---|
hipcc not found |
Set -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc |
hip not found |
Set -DCMAKE_PREFIX_PATH=/opt/rocm |
| Very slow performance | Use -DCMAKE_BUILD_TYPE=Release |
gfx942 not supported |
Check ROCm version (need 6.0+) |
| Kernel generation fails | Ensure Python 3.8+ with NumPy installed in active venv |
| Build errors | First verify CK builds without dispatcher (see main CK README) |
Runtime Issues
| Problem | Solution |
|---|---|
Library not found |
Build with -DBUILD_DISPATCHER_EXAMPLES=ON |
No kernel found |
Check GPU arch matches build target |
Python ModuleNotFoundError |
Add paths to PYTHONPATH (see above) |
| Wrong results | Verify layout matches your data |
Debug Commands
# Check ROCm installation
rocminfo | head -20
# Check GPU architecture
rocminfo | grep "Name:"
# Verify library exists
ls -la build/examples/libdispatcher_*.so
# Run with verbose output
./build/examples/gemm_01_basic 2>&1
# Python: Check library loading
python3 -c "
import ctypes
lib = ctypes.CDLL('/path/to/libdispatcher_gemm_lib.so')
print('Library loaded successfully')
"
Clean Rebuild
If you encounter issues, try a clean rebuild:
cd dispatcher
rm -rf build
mkdir build && cd build
cmake .. [your options]
make -j$(nproc)
File Structure
dispatcher/
|---- README.md # This file
|---- CMakeLists.txt # Build configuration
|
|---- include/ck_tile/dispatcher/ # C++ headers
| |---- dispatcher.hpp # Main dispatcher include
| |---- registry.hpp # GEMM kernel registry
| |---- kernel_key.hpp # Kernel configuration
| |---- grouped_conv_config.hpp # Grouped conv configuration
| |---- grouped_conv_problem.hpp # Grouped conv problem (with builder)
| |---- grouped_conv_kernel_decl.hpp # Grouped conv kernel declarations
| |---- grouped_conv_registry.hpp # Grouped conv registry (thread-safe)
| +---- grouped_conv_utils.hpp # Grouped conv utilities
|
|---- src/ # C++ implementation
|
|---- codegen/ # Kernel generation
| |---- codegen_common.py # Shared: TileConfig, TraitConfigBase, type mappings
| |---- unified_gemm_codegen.py # GEMM kernel generator
| |---- unified_grouped_conv_codegen.py # Grouped conv kernel generator
| +---- arch_specs.json # GPU specifications
|
|---- python/ # Python utilities
| |---- dispatcher_common.py # Shared: paths, validation, Colors, phased output
| |---- ctypes_utils.py # GEMM ctypes utilities
| +---- grouped_conv_utils.py # Grouped conv utilities
|
|---- scripts/ # Build scripts
| |---- compile_gemm_examples.py # GEMM build script
| +---- compile_grouped_conv_examples.py # Grouped conv build script
|
|---- bindings/ctypes/ # Python ctypes interface
| |---- gemm_ctypes_lib.cpp # GEMM Python library
| +---- conv_ctypes_lib.cpp # Grouped conv Python library
|
|---- examples/ # Examples
| |---- gemm/
| | |---- cpp/ # C++ GEMM examples (01-07)
| | +---- python/ # Python GEMM examples (01-11)
| +---- grouped_conv/
| |---- cpp/ # C++ Grouped Conv examples (01-07)
| +---- python/ # Python Grouped Conv examples (01-06)
|
+---- tests/ # Unit tests (C++ and Python)
Example Documentation
| Directory | README |
|---|---|
| GEMM C++ | examples/gemm/cpp/README.md |
| GEMM Python | examples/gemm/python/README.md |
| Codegen | codegen/README.md |
| Python Utils | python/README.md |
| C++ Headers | include/ck_tile/dispatcher/README.md |
Grouped Convolution Support
Grouped convolution is fully supported alongside GEMM, with shared infrastructure to eliminate duplication.
Python
# Generate grouped conv kernels
python3 codegen/unified_grouped_conv_codegen.py \
--output-dir build/generated_kernels \
--datatype fp16 --variant forward --ndim-spatial 2
# Build grouped conv examples
python3 scripts/compile_grouped_conv_examples.py examples/grouped_conv/cpp/01_basic_grouped_conv.cpp
Key Files
| Component | File |
|---|---|
| C++ Headers | include/ck_tile/dispatcher/grouped_conv_*.hpp |
| Python Codegen | codegen/unified_grouped_conv_codegen.py |
| Python Utils | python/grouped_conv_utils.py |
| Build Script | scripts/compile_grouped_conv_examples.py |
| Shared Codegen | codegen/codegen_common.py |
| Shared Utils | python/dispatcher_common.py |
Variants
- Forward (
grouped_conv_fwd) - Standard grouped convolution - Backward Data (
grouped_conv_bwd_data) - Gradient w.r.t. input - Backward Weight (
grouped_conv_bwd_weight) - Gradient w.r.t. weights
Shared Infrastructure
GEMM and grouped convolution share common code to avoid duplication:
codegen/codegen_common.py- TileConfig, TraitConfigBase, type mappings, parallel generation, arch-aware expansionpython/dispatcher_common.py- Path helpers, validation, auto-correction, Colors, phased output
License
MIT License - Copyright (c) 2025, Advanced Micro Devices, Inc.