# CK Tile Dispatcher A unified kernel dispatch system for AMD GPUs with C++ and Python frontends, supporting GEMM and Grouped Convolution operations. **Validated Platform:** AMD Instinct MI300 series (gfx942) --- ## Table of Contents 1. [Quick Start](#quick-start) 2. [Docker Setup](#docker-setup-recommended) 3. [Prerequisites](#prerequisites) 4. [Step-by-Step Build Guide](#step-by-step-build-guide) 5. [Running Examples](#running-examples) 6. [External Integration](#external-integration) 7. [Core Concepts](#core-concepts) 8. [Operation Support Matrix](#operation-support-matrix) 9. [Troubleshooting](#troubleshooting) 10. [File Structure](#file-structure) --- ## Quick Start **Complete setup from scratch (5 minutes):** ```bash # From the composable_kernel root directory cd dispatcher # Step 1: Create build directory mkdir -p build && cd build # Step 2: Configure CMake cmake .. \ -DCMAKE_PREFIX_PATH=/opt/rocm \ -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -DCMAKE_BUILD_TYPE=Release \ -DGPU_TARGETS="gfx942" \ -DBUILD_DISPATCHER_EXAMPLES=ON # Step 3: Generate kernels and build (CMake handles this automatically) make -j$(nproc) # Step 4: Run C++ examples ./examples/gemm_01_basic # Step 5: Build Python libraries (required for Python examples) make python_libs # Step 6: Run Python examples (from dispatcher directory) cd .. python3 examples/gemm/python/01_basic_gemm.py ``` --- ## Docker Setup (Recommended) For a reproducible build environment, use the official ROCm Docker image: ### Step 1: Pull and Run Container ```bash # Pull the CK Docker image docker pull rocm/composable_kernel:ck_ub24.04_rocm7.0.1 # Run container with GPU access docker run \ -it \ --privileged \ --device=/dev/kfd \ --device=/dev/dri \ --group-add video \ --group-add render \ -w /root/workspace \ -v $(pwd):/root/workspace \ rocm/composable_kernel:ck_ub24.04_rocm7.0.1 \ /bin/bash ``` > **Note:** Omit `--device` flags if building without GPU access. 
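Whether you build on bare metal or inside the container, it helps to first confirm that a supported GPU target is actually visible. A minimal illustrative sketch (the helper names are ours, not part of the dispatcher), assuming `rocminfo` is on the `PATH`:

```python
import re
import shutil
import subprocess

# Architectures listed in this README as supported by the dispatcher.
SUPPORTED = {"gfx90a", "gfx942", "gfx950", "gfx1101", "gfx1201"}

def parse_gfx_targets(rocminfo_output: str) -> set:
    """Extract gfx architecture names from rocminfo-style text."""
    return set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output))

def check_gpu() -> set:
    """Run rocminfo (if present) and report which supported targets are visible."""
    if shutil.which("rocminfo") is None:
        print("rocminfo not found - is ROCm installed and on PATH?")
        return set()
    out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
    found = parse_gfx_targets(out) & SUPPORTED
    print(f"Supported targets visible: {sorted(found)}")
    return found

if __name__ == "__main__":
    check_gpu()
```

The set returned by `check_gpu()` tells you what to pass as `-DGPU_TARGETS` when configuring CMake.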
### Step 2: Clone and Build ```bash # Inside the container git clone https://github.com/ROCm/composable_kernel.git cd composable_kernel git checkout builder-dispatch-tile-gemm # Set up Python environment python3 -m venv .venv source .venv/bin/activate pip install numpy # Build dispatcher cd dispatcher mkdir -p build && cd build cmake .. \ -DCMAKE_PREFIX_PATH=/opt/rocm \ -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -DCMAKE_BUILD_TYPE=Release \ -DGPU_TARGETS="gfx942" \ -DBUILD_DISPATCHER_EXAMPLES=ON make -j$(nproc) ``` ### One-Liner Build (inside container) ```bash git clone https://github.com/ROCm/composable_kernel.git && \ cd composable_kernel && git checkout builder-dispatch-tile-gemm && \ python3 -m venv .venv && source .venv/bin/activate && pip install numpy && \ cd dispatcher && mkdir -p build && cd build && \ cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS="gfx942" -DBUILD_DISPATCHER_EXAMPLES=ON && \ make -j$(nproc) ``` --- ## Prerequisites ### Required Software | Software | Minimum Version | Check Command | |----------|-----------------|---------------| | ROCm | 6.4+ | `rocminfo` | | CMake | 3.16+ | `cmake --version` | | Python | 3.8+ | `python3 --version` | | NumPy | 1.20+ | `pip show numpy` | | hipcc | (from ROCm) | `/opt/rocm/bin/hipcc --version` | > **Note:** Newer GPU targets (gfx950, gfx1201) require ROCm 6.3+. For ROCm 6.4+, you can also use `amdclang++` instead of `hipcc`. ### Check Your GPU Architecture ```bash # Find your GPU architecture rocminfo | grep -i "gfx" # Example output: "gfx942" ``` **Supported architectures:** - **gfx942** - MI300X, MI300A, MI308, MI325 (Instinct MI300 series) - **gfx90a** - MI200 series (MI250, MI250X) - **gfx950** - MI350 series - **gfx1101** - RDNA3 series - **gfx1201** - RDNA4 series ### Install Python Dependencies #### Core Dependencies (Required) NumPy is required for Python examples and kernel generation. 
We recommend using a virtual environment: **Option 1: Using standard venv** ```bash # Create virtual environment python3 -m venv .venv # Activate virtual environment source .venv/bin/activate # Linux/macOS # .venv\Scripts\activate # Windows # Install core dependencies pip install -r python/requirements.txt ``` **Option 2: Using uv (faster alternative)** ```bash # Install uv if not already installed curl -LsSf https://astral.sh/uv/install.sh | sh # Create and activate virtual environment uv venv .venv source .venv/bin/activate # Linux/macOS # .venv\Scripts\activate # Windows # Install core dependencies uv pip install -r python/requirements.txt ``` **Option 3: System-wide install (not recommended)** ```bash pip install -r python/requirements.txt ``` > **Note:** Always activate your virtual environment before running CMake or Python examples. #### ML Heuristics Dependencies (Optional) For ML-based kernel selection (examples 09-11), install additional dependencies: ```bash # Activate your virtual environment first source .venv/bin/activate # Install ML dependencies (LightGBM, pandas, pyarrow, scikit-learn) pip install -r requirements-ml.txt ``` **Why separate?** ML dependencies are large (especially pyarrow) and not needed for basic dispatcher usage. 
Install only if you need: - ML-based kernel selection (`examples/gemm/python/09_ml_heuristic.py`) - Model training (`heuristics/train.py`) - Model evaluation (`heuristics/evaluate.py`) - Automated benchmark analysis **Core dependencies:** ~50 MB (NumPy only) **With ML dependencies:** ~500 MB (includes LightGBM, pandas, pyarrow, scikit-learn) ### Supported Data Types CK Tile supports a wide range of data types for GEMM operations: | A dtype | B dtype | Acc dtype | Warp Tile Sizes | Notes | |---------|---------|-----------|-----------------|-------| | `fp32` | `fp32` | `fp32` | 16x16x4, 16x16x16 | Full precision | | `fp16` | `fp16` | `fp32` | 32x32x8, 32x32x16, 16x16x16, 16x16x32 | Standard half | | `bf16` | `bf16` | `fp32` | 32x32x8, 32x32x16, 16x16x16, 16x16x32 | Brain float 16 | | `fp8` | `fp8` | `fp32` | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | FP8 E4M3 | | `fp8` | `bf8` | `fp32` | 32x32x16, 16x16x32 | Mixed FP8/BF8 | | `bf8` | `fp8` | `fp32` | 32x32x16, 16x16x128 | Mixed BF8/FP8 | | `bf8` | `bf8` | `fp32` | 32x32x16, 32x32x32, 16x16x32 | BF8 E5M2 | | `int8` | `int8` | `int32` | 32x32x16, 16x16x32, 16x16x16 | Integer GEMM | | `pk_fp4` | `pk_fp4` | `fp32` | 16x16x128 | Packed 4-bit float | **Notes:** - Accumulator is always `fp32` except for `int8` which uses `int32` - FP8 types: `fp8` = E4M3, `bf8` = E5M2 - `pk_fp4` = Packed 4-bit float (2 values per byte) - Some dtypes require specific GPU architectures (e.g., FP8 requires MI300+) --- ## Step-by-Step Build Guide ### Step 1: Navigate to Dispatcher Directory ```bash # From composable_kernel root cd dispatcher # Verify you're in the right place ls CMakeLists.txt # Should exist ``` ### Step 2: Create Build Directory ```bash mkdir -p build cd build ``` ### Step 3: Configure CMake **Basic configuration (library only):** ```bash cmake .. 
\ -DCMAKE_PREFIX_PATH=/opt/rocm \ -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -DCMAKE_BUILD_TYPE=Release \ -DGPU_TARGETS="gfx942" ``` **Full configuration (with examples and tests):** ```bash cmake .. \ -DCMAKE_PREFIX_PATH=/opt/rocm \ -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -DCMAKE_BUILD_TYPE=Release \ -DGPU_TARGETS="gfx942" \ -DBUILD_DISPATCHER_EXAMPLES=ON \ -DBUILD_DISPATCHER_TESTS=ON ``` **Expected output:** ``` -- Found hip: /opt/rocm (found suitable version "6.x.x") -- Generating GEMM kernels... -- Built: gemm_01 through gemm_06, dispatcher_gemm_lib.so -- Configuring done ``` ### Step 4: Build ```bash # Build all targets (generates kernels automatically, then compiles) make -j$(nproc) # Or build specific targets make gemm_01_basic # Single GEMM example make dispatcher_gemm_lib # GEMM shared library for Python # Build ONLY Python libraries (faster if you don't need C++ examples) make python_libs -j$(nproc) ``` ### Kernel Generation Targets Kernels are generated automatically during `make`, but you can also control generation explicitly: ```bash # Generate all kernels only (no compilation) make generate_all_kernels # Generate GEMM kernels only make generate_gemm_kernels # Force regenerate (even if kernels exist) make regenerate_all_kernels make regenerate_gemm_kernels # Generate for specific GPU architecture make generate_kernels_gfx942 # MI300X make generate_kernels_gfx90a # MI200 make generate_kernels_gfx1100 # RDNA3 ``` ### Step 5: Verify Build ```bash # Check executables were built ls examples/gemm_* # Check shared libraries were built ls examples/libdispatcher_gemm_lib.so ``` ### CMake Options Reference | Flag | Default | Description | |------|---------|-------------| | `CMAKE_BUILD_TYPE` | Debug | **Use `Release` for performance!** | | `GPU_TARGETS` | None | Target GPU: `"gfx942"`, `"gfx90a"`, etc. 
| | `BUILD_DISPATCHER_EXAMPLES` | OFF | Build C++ examples and Python libs |
| `BUILD_DISPATCHER_TESTS` | OFF | Build unit tests |
| `CMAKE_PREFIX_PATH` | - | ROCm installation path |
| `CMAKE_CXX_COMPILER` | - | Path to hipcc compiler |

WARNING: **Important:** Always use `-DCMAKE_BUILD_TYPE=Release` for benchmarking. Debug builds are significantly slower.

WARNING: **Important:** The build system supports only one GPU target at a time, because generated kernels are filtered by architecture. Do not pass multiple targets in `GPU_TARGETS`; if you need several architectures, configure a separate build directory for each.

---

## Running Examples

### C++ Examples

After building, executables are in `build/examples/`:

```bash
cd build/examples

# GEMM Examples
./gemm_01_basic                 # Basic GEMM with autofill/autocorrect
./gemm_02_multi_size            # Wildcard expansion
./gemm_03_benchmark_validation  # Benchmarking + validation
./gemm_04_heuristics            # Heuristic kernel selection
./gemm_05_json_export           # Registry JSON export
./gemm_06_multi_registry        # Multiple registries

# Grouped Convolution Examples
./grouped_conv_01_basic         # Declaration patterns + GPU execution
./grouped_conv_02_all_dirs      # Forward/BwdData/BwdWeight with GPU
./grouped_conv_03_bench_val     # Benchmark + CPU reference validation
./grouped_conv_04_registry_json # Heuristic selection + JSON export
./grouped_conv_05_bwd_data      # Backward data + CPU validation
./grouped_conv_06_bwd_weight    # Backward weight + CPU validation
./grouped_conv_07_benchmark     # Multi-tile ResNet benchmark
```

### Python Examples

Run from the `dispatcher` directory:

```bash
cd /path/to/composable_kernel/dispatcher

# GEMM Examples
python3 examples/gemm/python/01_basic_gemm.py    # Basic multi-kernel GEMM
python3 examples/gemm/python/04_validation.py    # CPU reference validation
python3 examples/gemm/python/07_stress_test.py   # Stress test
python3 examples/gemm/python/08_heuristics.py    # Heuristic selection

# Grouped Convolution Examples
python3 examples/grouped_conv/python/01_basic_grouped_conv.py  # Config patterns + registry + GPU
python3 examples/grouped_conv/python/02_forward.py             # Forward 2D/3D + CPU ref
python3 examples/grouped_conv/python/03_bwd_data.py            # Backward data + CPU ref
python3 examples/grouped_conv/python/04_bwd_weight.py          # Backward weight + CPU ref
python3 examples/grouped_conv/python/05_benchmark.py           # Multi-problem benchmark
python3 examples/grouped_conv/python/06_registry_json.py       # Heuristic selection + JSON
```

### Example Output

**Expected C++ output (`gemm_01_basic`):**

```
======================================================================
Example 01: Basic GEMM with Declarative Kernel Definition
======================================================================

Step 1: Declared Kernels
------------------------
Kernel Set: fp16_gemm_kernels
Architecture: gfx942
Configurations: 1
  - gemm_fp16_rcr_compv4_cshuffle_intrawave_128x128x32

Step 2: Create Registry and Dispatcher
--------------------------------------
Registered 1 kernels

Step 3: Define Problem
----------------------
M=1024, N=1024, K=1024

Step 4: GPU Execution
---------------------
*** GPU EXECUTION ***
Time:   <elapsed> ms
TFLOPS: <throughput>
```

> **Note:** Timing values vary by GPU model and system configuration.
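The TFLOPS figure reported by the examples follows the standard GEMM convention of 2·M·N·K floating-point operations per run. A small sketch of that arithmetic (the function name is illustrative, not the dispatcher's API):

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput in TFLOPS for a GEMM of shape MxNxK timed in milliseconds."""
    flops = 2.0 * m * n * k       # one multiply + one add per MAC
    seconds = time_ms / 1e3
    return flops / seconds / 1e12

# Example: a 1024x1024x1024 GEMM finishing in 0.1 ms
print(round(gemm_tflops(1024, 1024, 1024, 0.1), 2))  # prints 21.47
```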
--- ## Benchmark Parameters The dispatcher supports fine-grained control over benchmarking, matching CK Tile's `stream_config`: ### Available Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `warmup` | int | 5 | Warmup iterations (discarded from timing) | | `repeat` | int | 20 | Benchmark iterations (averaged) | | `flush_cache` | bool | false | Flush GPU L2 cache between iterations | | `rotating_count` | int | 1 | Rotating buffer count (for cache simulation) | | `timer` | string | "gpu" | Timer type: "gpu" (HIP events) or "cpu" | | `init` | string | "random" | Matrix initialization: "random", "linear", "constant" | | `split_k` | int | 1 | Split-K parallelism factor | ### Python Usage ```python from ctypes_utils import DispatcherLib # Basic usage (default benchmark settings) lib = DispatcherLib.load() # Advanced benchmark settings via command line python3 examples/gemm/python/10_advanced_benchmark.py \ --warmup 10 \ --repeat 100 \ --flush-cache ``` ### C++ Usage ```cpp // Basic timing ck_tile::stream_config cfg{nullptr, true}; // Advanced benchmark settings ck_tile::stream_config cfg{ nullptr, // stream_id (nullptr = default stream) true, // time_kernel 1, // log_level 10, // cold_niters (warmup) 100, // nrepeat true, // is_gpu_timer true, // flush_cache 4 // rotating_count }; float avg_time = kernel.run(args, cfg); ``` ### Command Line (Python Examples) ```bash # Basic run python3 examples/gemm/python/10_advanced_benchmark.py # With benchmark parameters python3 examples/gemm/python/10_advanced_benchmark.py \ --warmup 10 \ --repeat 100 \ --flush-cache \ --rotating-count 4 \ --timer gpu ``` ### When to Use Each Parameter | Use Case | Recommended Settings | |----------|---------------------| | Quick test | `warmup=1, repeat=3` | | Stable benchmark | `warmup=10, repeat=100` | | Memory-bound analysis | `flush_cache=True, rotating_count=4` | | Compute-bound analysis | `flush_cache=False` (default) | | Debug timing | 
`timer="cpu"` | | Production | `timer="gpu"` (default) | --- ## ML-Based Kernel Selection (Optional) The dispatcher includes ML heuristics for automated kernel selection using trained LightGBM models. **Prerequisites:** Install ML dependencies first: ```bash pip install -r requirements-ml.txt # ~500 MB (LightGBM, pandas, pyarrow, scikit-learn) ``` **Documentation:** See [heuristics/README.md](heuristics/README.md) for: - Training and evaluating models - Feature engineering (72 features) - Using pre-trained models - Python API reference **Examples:** ```bash python3 examples/gemm/python/09_ml_heuristic.py # ML-based kernel selection python3 examples/gemm/python/10_rank_kernels.py # Kernel ranking ``` **Model Compression:** Trained models are stored in compressed `.lgbm.gz` format to save space (~67% size reduction). Python tools automatically decompress models on first use. For C++ examples, decompress manually: ```bash # If you have compressed models cd heuristics/models/gemm_universal_fp16_gfx950 gunzip model_tflops.lgbm.gz # Then use in C++ example cd ../../../build ./gemm_09_ml_heuristic --model ../heuristics/models/gemm_universal_fp16_gfx950/model_tflops.lgbm ``` --- ## External Integration ### Using Dispatcher in Your Own Project #### Option 1: CMake Integration (Recommended) Add to your `CMakeLists.txt`: ```cmake # Set path to composable_kernel set(CK_ROOT "/path/to/composable_kernel") # Add dispatcher subdirectory add_subdirectory(${CK_ROOT}/dispatcher dispatcher_build) # Link to your target target_link_libraries(your_target PRIVATE ck_tile_dispatcher) target_include_directories(your_target PRIVATE ${CK_ROOT}/dispatcher/include ${CK_ROOT}/include ) ``` #### Option 2: Include as Pre-built Library ```cmake # Find the pre-built library find_library(CK_DISPATCHER ck_tile_dispatcher PATHS /path/to/composable_kernel/dispatcher/build) # Include directories set(CK_INCLUDE_DIRS /path/to/composable_kernel/include /path/to/composable_kernel/dispatcher/include ) 
target_link_libraries(your_target PRIVATE ${CK_DISPATCHER}) target_include_directories(your_target PRIVATE ${CK_INCLUDE_DIRS}) ``` #### Option 3: Python Integration ```python import sys sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python") # For GEMM from ctypes_utils import DispatcherLib, Dispatcher, KernelConfig ``` ### Required Include Paths When integrating, you need these include paths: ``` /path/to/composable_kernel/include # CK Tile core headers /path/to/composable_kernel/dispatcher/include # Dispatcher headers /path/to/composable_kernel/dispatcher/build/generated_kernels # Generated kernels ``` ### Required Compile Flags ```bash # Minimum flags for hipcc -std=c++17 -D__HIP_PLATFORM_AMD__=1 --offload-arch=gfx942 # Your target GPU # Recommended flags -O3 -mllvm -enable-noalias-to-md-conversion=0 -Wno-undefined-func-template -Wno-float-equal -Wall -Werror ``` ### Python Path Setup For Python scripts outside the dispatcher directory: ```bash # Option 1: Environment variable export PYTHONPATH="/path/to/composable_kernel/dispatcher/examples/gemm/python:$PYTHONPATH" # Option 2: In your Python script import sys sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python") ``` ### Library Search Paths The Python utilities search for the shared library in these locations: ```python # For GEMM (ctypes_utils.py) SEARCH_PATHS = [ "build/examples/libdispatcher_gemm_lib.so", "../build/examples/libdispatcher_gemm_lib.so", "../../build/examples/libdispatcher_gemm_lib.so", ] ``` If using from a different location, set the library path explicitly: ```python # GEMM from ctypes_utils import DispatcherLib lib = DispatcherLib.load("/absolute/path/to/libdispatcher_gemm_lib.so") ``` --- ## Core Concepts ### Data Flow ``` KernelConfig -> Registry -> Dispatcher -> GPU Execution ``` 1. **KernelConfig**: Defines kernel parameters (tile sizes, data types, layouts) 2. **Registry**: Stores multiple kernel configurations 3. 
**Dispatcher**: Selects best kernel for a given problem and executes it

### GEMM Layouts

| Layout | A | B | C | Use Case |
|--------|---|---|---|----------|
| RCR | Row | Col | Row | Most common (PyTorch default) |
| RRR | Row | Row | Row | Both inputs row-major |
| CRR | Col | Row | Row | A transposed |
| CCR | Col | Col | Row | Both inputs column-major |

### Split-K Support

Split-K divides the K dimension across multiple thread blocks, useful for large K dimensions.

**Usage (C++):**

```cpp
// GEMM with 4-way K split
auto problem = ProblemBuilder()
    .m(1024).n(1024).k(8192)
    .split_k(4)
    .build();
```

---

## Operation Support Matrix

This matrix shows all CK Tile operations with per-data-type, per-layout, and per-GPU support status. It uses a three-state convention: ✅ = supported by both CK Tile and the dispatcher, ❌ = supported by CK Tile but not yet in the dispatcher, blank = not supported by CK Tile itself.

| | | | | | **Data Types** | | | | | **Layouts** | | | | **GPU Targets** | | |
|:---:|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Op** | **CK Tile Kernel** | **fp16** | **fp8** | **bf16** | **bf8** | **int8** | **fp4** | **fp6** | **rcr** | **rrr** | **ccr** | **crr** | **90a** | **942** | **950** | **1201** |
| GEMM | gemm_multi_d [5]<br>engine: `dispatcher/`<br>example: `19_gemm_multi_d/` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GEMM | gemm_preshuffle [1][2]<br>engine: `dispatcher/` | ✅ | ✅ | ✅ | ✅ | ✅ | | | ✅ | | | | ✅ | ✅ | ✅ | ❌ |
| GEMM | gemm_universal [3][4][7][8]<br>engine: `dispatcher/`<br>example: `03_gemm/` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GEMM | batched_contraction<br>example: `41_batched_contraction/` | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | batched_gemm<br>example: `16_batched_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | block_scale_gemm<br>example: `38_block_scale_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | flatmm<br>example: `18_flatmm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_multi_abd<br>example: `22_gemm_multi_abd/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_quant | | ❌ | ❌ | ❌ | ❌ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | grouped_gemm<br>example: `17_grouped_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | grouped_gemm_quant | | ❌ | ❌ | ❌ | ❌ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | streamk_gemm<br>example: `40_streamk_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Reduce | multi_reduce2d<br>example: `05_reduce/` | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Reduce | reduce2d<br>example: `05_reduce/` | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Attention | fmha<br>example: `01_fmha/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Attention | sparse_attn<br>example: `50_sparse_attn/` | ❌ | | ❌ | | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Activation | softmax | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Activation | topk_softmax<br>example: `09_topk_softmax/` | ❌ | ❌ | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Conv | grouped_conv [6]<br>example: `20_grouped_convolution/` | ❌ | ❌ | ❌ | ❌ | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Data Move | batched_transpose<br>example: `35_batched_transpose/` | ❌ | ❌ | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Data Move | image_to_column<br>example: `04_img2col/` | ❌ | | | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Data Move | permute<br>example: `06_permute/` | ❌ | ❌ | ❌ | | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Elementwise | elementwise<br>example: `21_elementwise/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | ❌ | ❌ | ❌ | ❌ |
| MoE | fused_moe<br>example: `15_fused_moe/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | add_rmsnorm2d_rdquant<br>example: `11_add_rmsnorm2d_rdquant/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | layernorm2d<br>example: `02_layernorm2d/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | norm_reduce | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | rmsnorm2d<br>example: `10_rmsnorm2d/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Pooling | pooling<br>example: `36_pooling/` | ❌ | | | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Quant | smoothquant<br>example: `12_smoothquant/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |

**Notes:**
- [1] **gemm_preshuffle:** Supports only `rcr` layout. Uses fixed `preshufflev2` pipeline, `Auto` scheduler, and `cshuffle` epilogue.
- [2] **gemm_preshuffle:** `int8` preshuffle support is limited to gfx942 and gfx950 (entries in `preshuffle_warp_tile_combos`).
- [3] **gemm_universal:** `fp4` (pk_fp4) support is only available on gfx950.
- [4] **gemm_universal:** `fp32` GEMM is supported by the dispatcher (`fp32_fp32_fp32` warp tile combos exist) but is omitted from matrix columns for consistency with the tile engine matrix format.
- [5] **gemm_multi_d:** Codegen supports `MultiDAdd` and `MultiDMultiply` element-wise ops. Preselected kernel sets also test `Relu`, `Gelu`, `FastGelu`.
- [6] **grouped_conv:** `arch_filter.py` defines conv operator types (`CONV_FWD`, `CONV_BWD_DATA`, `CONV_BWD_WEIGHT`, `CONV3D_*`) but dispatcher infrastructure is incomplete (ctypes bindings are stubs, `conv_utils.hpp` does not exist).
- [7] **(all dispatcher ops):** gfx908, gfx1100, and gfx1200 also have `warp_tile_combos` in `arch_specs.json` but are not shown in the matrix's 4 GPU columns.
- [8] **(all dispatcher ops):** `int4`, `fp32`, `fp64` are valid dispatcher data types (defined in `kernel_key.hpp` `DataType` enum) but have no dedicated matrix columns.
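If you want to query a matrix like this from code, one option is to transcribe the rows you care about into a small table and filter it. A hedged sketch using a hand-copied subset of the dispatcher rows above (illustrative only, not generated from the dispatcher):

```python
# Hand-transcribed subset of the support matrix: kernel -> dtype -> dispatcher support.
SUPPORT = {
    "gemm_universal":  {"fp16": True,  "fp8": True,  "bf16": True,  "int8": True},
    "gemm_preshuffle": {"fp16": True,  "fp8": True,  "bf16": True,  "int8": True},
    "batched_gemm":    {"fp16": False, "fp8": False, "bf16": False, "int8": False},
}

def dispatcher_ops_for(dtype: str) -> list:
    """Return kernels the dispatcher supports for a given data type."""
    return sorted(op for op, dtypes in SUPPORT.items() if dtypes.get(dtype))

print(dispatcher_ops_for("fp16"))  # prints ['gemm_preshuffle', 'gemm_universal']
```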
### Dispatcher GEMM Configuration Detail #### Per-Variant Configuration | GEMM Variant | Pipelines | Schedulers | Epilogues | Element-wise Ops | Output Dtype | |:---:|:---:|:---:|:---:|:---:|:---:| | gemm_universal | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | PassThrough | Same as input (fp8/bf8 -> fp16) | | gemm_preshuffle | preshufflev2 | Auto | cshuffle | PassThrough | Same as input (fp8/bf8 -> fp16) | | gemm_multi_d | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | MultiDAdd, MultiDMultiply | Same as input (fp8/bf8 -> fp16) | #### Warp Tile Combinations per GPU | GPU | fp16 | bf16 | fp8 | bf8 | int8 | pk_fp4 | |:---:|:---:|:---:|:---:|:---:|:---:|:---:| | gfx1100 | 16x16x16 | 16x16x16 | -- | -- | 16x16x16 | -- | | gfx1200 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- | | gfx1201 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- | | gfx908 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | -- | -- | 32x32x16, 16x16x32 | -- | | gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | 32x32x16, 16x16x32 | -- | | gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 16x16x32 | -- | | gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 16x16x32 | 16x16x128 | #### Preshuffle Warp Tile Combinations | GPU | fp16 | bf16 | fp8 | bf8 | int8 | |:---:|:---:|:---:|:---:|:---:|:---:| | gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32 | 
32x32x16, 32x32x32 | -- | | gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32 | 16x16x32, 32x32x16 | | gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32, 16x16x128, 32x32x64 | -- | **Legend:** - **CK Tile Kernel column:** First line is the kernel name. Lines prefixed with "engine:" show the dispatcher directory. Lines prefixed with "example:" show the CK Tile example directory under `example/ck_tile/`. - **Green cell** (✅): CK Tile implementation exists **and** the dispatcher supports it. - **Red cell** (❌): CK Tile implementation exists **but** the dispatcher does **not** support it. - **Grey cell** (blank): No CK Tile implementation exists for this combination. **Layout codes:** Each 3-character layout code specifies the memory layout for tensors A, B, and C: - `r` = row-major, `c` = column-major - Example: `rcr` means A is row-major, B is column-major, C is row-major - `gemm_multi_d` uses 4-character codes internally (e.g., `rcrr`) where the 4th character is the D tensor layout (always `r`). The matrix shows only the 3-character A/B/C portion. 
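The layout codes above can be decoded mechanically; here is a tiny illustrative helper (ours, not part of the dispatcher API):

```python
def decode_layout(code: str) -> dict:
    """Map a GEMM layout code like 'rcr' to per-tensor majorness for A, B, C."""
    names = {"r": "row-major", "c": "column-major"}
    if len(code) != 3 or any(ch not in names for ch in code):
        raise ValueError(f"expected 3 characters of 'r'/'c', got {code!r}")
    return {tensor: names[ch] for tensor, ch in zip("ABC", code)}

print(decode_layout("rcr"))
# {'A': 'row-major', 'B': 'column-major', 'C': 'row-major'}
```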
**Data type mapping per config label:** | Config Label | A (source) | B (source) | Acc | C (output) | |:---:|:---:|:---:|:---:|:---:| | fp16 | fp16 | fp16 | fp32 | fp16 | | bf16 | bf16 | bf16 | fp32 | bf16 | | int8 | int8 | int8 | int32 | int32 | | fp8 | fp8 | fp8 | fp32 | fp16 | | bf8 | bf8 | bf8 | fp32 | fp16 | | fp6 | fp6 | fp6 | fp32 | fp32 | | fp4 | fp16 or bf16 | fp4 | fp32 | fp16 or bf16 | ## Troubleshooting ### Build Issues | Problem | Solution | |---------|----------| | `hipcc not found` | Set `-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc` | | `hip not found` | Set `-DCMAKE_PREFIX_PATH=/opt/rocm` | | Very slow performance | Use `-DCMAKE_BUILD_TYPE=Release` | | `gfx942 not supported` | Check ROCm version (need 6.0+) | | Kernel generation fails | Ensure Python 3.8+ with NumPy installed in active venv | | Build errors | First verify CK builds without dispatcher (see main CK README) | ### Runtime Issues | Problem | Solution | |---------|----------| | `Library not found` | Build with `-DBUILD_DISPATCHER_EXAMPLES=ON` | | `No kernel found` | Check GPU arch matches build target | | Python `ModuleNotFoundError` | Add paths to `PYTHONPATH` (see above) | | Wrong results | Verify layout matches your data | ### Debug Commands ```bash # Check ROCm installation rocminfo | head -20 # Check GPU architecture rocminfo | grep "Name:" # Verify library exists ls -la build/examples/libdispatcher_*.so # Run with verbose output ./build/examples/gemm_01_basic 2>&1 # Python: Check library loading python3 -c " import ctypes lib = ctypes.CDLL('/path/to/libdispatcher_gemm_lib.so') print('Library loaded successfully') " ``` ### Clean Rebuild If you encounter issues, try a clean rebuild: ```bash cd dispatcher rm -rf build mkdir build && cd build cmake .. 
[your options] make -j$(nproc) ``` --- ## File Structure ``` dispatcher/ |---- README.md # This file |---- CMakeLists.txt # Build configuration | |---- include/ck_tile/dispatcher/ # C++ headers | |---- dispatcher.hpp # Main dispatcher include | |---- registry.hpp # GEMM kernel registry | |---- kernel_key.hpp # Kernel configuration | |---- grouped_conv_config.hpp # Grouped conv configuration | |---- grouped_conv_problem.hpp # Grouped conv problem (with builder) | |---- grouped_conv_kernel_decl.hpp # Grouped conv kernel declarations | |---- grouped_conv_registry.hpp # Grouped conv registry (thread-safe) | +---- grouped_conv_utils.hpp # Grouped conv utilities | |---- src/ # C++ implementation | |---- codegen/ # Kernel generation | |---- codegen_common.py # Shared: TileConfig, TraitConfigBase, type mappings | |---- unified_gemm_codegen.py # GEMM kernel generator | |---- unified_grouped_conv_codegen.py # Grouped conv kernel generator | +---- arch_specs.json # GPU specifications | |---- python/ # Python utilities | |---- dispatcher_common.py # Shared: paths, validation, Colors, phased output | |---- ctypes_utils.py # GEMM ctypes utilities | +---- grouped_conv_utils.py # Grouped conv utilities | |---- scripts/ # Build scripts | |---- compile_gemm_examples.py # GEMM build script | +---- compile_grouped_conv_examples.py # Grouped conv build script | |---- bindings/ctypes/ # Python ctypes interface | |---- gemm_ctypes_lib.cpp # GEMM Python library | +---- conv_ctypes_lib.cpp # Grouped conv Python library | |---- examples/ # Examples | |---- gemm/ | | |---- cpp/ # C++ GEMM examples (01-07) | | +---- python/ # Python GEMM examples (01-11) | +---- grouped_conv/ | |---- cpp/ # C++ Grouped Conv examples (01-07) | +---- python/ # Python Grouped Conv examples (01-06) | +---- tests/ # Unit tests (C++ and Python) ``` --- ## Example Documentation | Directory | README | |-----------|--------| | GEMM C++ | [examples/gemm/cpp/README.md](examples/gemm/cpp/README.md) | | GEMM Python | 
[examples/gemm/python/README.md](examples/gemm/python/README.md) | | Codegen | [codegen/README.md](codegen/README.md) | | Python Utils | [python/README.md](python/README.md) | | C++ Headers | [include/ck_tile/dispatcher/README.md](include/ck_tile/dispatcher/README.md) | --- ## Grouped Convolution Support Grouped convolution is fully supported alongside GEMM, with shared infrastructure to eliminate duplication. ### Python ```bash # Generate grouped conv kernels python3 codegen/unified_grouped_conv_codegen.py \ --output-dir build/generated_kernels \ --datatype fp16 --variant forward --ndim-spatial 2 # Build grouped conv examples python3 scripts/compile_grouped_conv_examples.py examples/grouped_conv/cpp/01_basic_grouped_conv.cpp ``` ### Key Files | Component | File | |-----------|------| | C++ Headers | `include/ck_tile/dispatcher/grouped_conv_*.hpp` | | Python Codegen | `codegen/unified_grouped_conv_codegen.py` | | Python Utils | `python/grouped_conv_utils.py` | | Build Script | `scripts/compile_grouped_conv_examples.py` | | Shared Codegen | `codegen/codegen_common.py` | | Shared Utils | `python/dispatcher_common.py` | ### Variants - **Forward** (`grouped_conv_fwd`) - Standard grouped convolution - **Backward Data** (`grouped_conv_bwd_data`) - Gradient w.r.t. input - **Backward Weight** (`grouped_conv_bwd_weight`) - Gradient w.r.t. weights ### Shared Infrastructure GEMM and grouped convolution share common code to avoid duplication: - `codegen/codegen_common.py` - TileConfig, TraitConfigBase, type mappings, parallel generation, arch-aware expansion - `python/dispatcher_common.py` - Path helpers, validation, auto-correction, Colors, phased output --- ## License MIT License - Copyright (c) 2025, Advanced Micro Devices, Inc.