# CK Tile Dispatcher

A unified kernel dispatch system for AMD GPUs with C++ and Python frontends.

**Validated Platform:** AMD Instinct MI300 series (gfx942)

---

## Table of Contents

1. [Quick Start](#quick-start)
2. [Docker Setup](#docker-setup-recommended)
3. [Prerequisites](#prerequisites)
4. [Step-by-Step Build Guide](#step-by-step-build-guide)
5. [Running Examples](#running-examples)
6. [External Integration](#external-integration)
7. [Core Concepts](#core-concepts)
8. [Operation Support Matrix](#operation-support-matrix)
9. [Troubleshooting](#troubleshooting)
10. [File Structure](#file-structure)

---

## Quick Start

**Complete setup from scratch (5 minutes):**

```bash
# From the composable_kernel root directory
cd dispatcher

# Step 1: Create build directory
mkdir -p build && cd build

# Step 2: Configure CMake
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON

# Step 3: Generate kernels and build (CMake handles this automatically)
make -j$(nproc)

# Step 4: Run C++ examples
./examples/gemm_01_basic

# Step 5: Build Python libraries (required for Python examples)
make python_libs

# Step 6: Run Python examples (from dispatcher directory)
cd ..
python3 examples/gemm/python/01_basic_gemm.py
```

---

## Docker Setup (Recommended)

For a reproducible build environment, use the official ROCm Docker image:

### Step 1: Pull and Run Container

```bash
# Pull the CK Docker image
docker pull rocm/composable_kernel:ck_ub24.04_rocm7.0.1

# Run container with GPU access
docker run \
  -it \
  --privileged \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  -w /root/workspace \
  -v $(pwd):/root/workspace \
  rocm/composable_kernel:ck_ub24.04_rocm7.0.1 \
  /bin/bash
```

> **Note:** Omit `--device` flags if building without GPU access.

### Step 2: Clone and Build

```bash
# Inside the container
git clone https://github.com/ROCm/composable_kernel.git
cd composable_kernel
git checkout builder-dispatch-tile-gemm

# Set up Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install numpy

# Build dispatcher
cd dispatcher
mkdir -p build && cd build
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON

make -j$(nproc)
```

### One-Liner Build (inside container)

```bash
git clone https://github.com/ROCm/composable_kernel.git && \
cd composable_kernel && git checkout builder-dispatch-tile-gemm && \
python3 -m venv .venv && source .venv/bin/activate && pip install numpy && \
cd dispatcher && mkdir -p build && cd build && \
cmake .. -DCMAKE_PREFIX_PATH=/opt/rocm -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release -DGPU_TARGETS="gfx942" -DBUILD_DISPATCHER_EXAMPLES=ON && \
make -j$(nproc)
```

---

## Prerequisites

### Required Software

| Software | Minimum Version | Check Command |
|----------|-----------------|---------------|
| ROCm | 6.4+ | `rocminfo` |
| CMake | 3.16+ | `cmake --version` |
| Python | 3.8+ | `python3 --version` |
| NumPy | 1.20+ | `pip show numpy` |
| hipcc | (from ROCm) | `/opt/rocm/bin/hipcc --version` |

> **Note:** Newer GPU targets (gfx950, gfx1201) require ROCm 6.3+. For ROCm 6.4+, you can also use `amdclang++` instead of `hipcc`.

### Check Your GPU Architecture

```bash
# Find your GPU architecture
rocminfo | grep -i "gfx"
# Example output: "gfx942"
```
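
The same lookup can be scripted, for example to feed the result into `-DGPU_TARGETS`. The sketch below only assumes that `rocminfo` output contains tokens such as `gfx942`; the helper name `parse_gfx_targets` and the sample text are made up for illustration.

```python
import re


def parse_gfx_targets(rocminfo_text):
    """Return the unique gfx architecture names in order of first appearance."""
    seen = []
    for match in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical excerpt; real `rocminfo` output is much longer.
sample = """
  Name:                    gfx942
  Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
"""
print(parse_gfx_targets(sample))
```

On a machine with ROCm installed, the text could come from `subprocess.run(["rocminfo"], capture_output=True, text=True).stdout` instead of the hard-coded sample.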

**Supported architectures:**

- **gfx942** - MI300X, MI300A, MI308, MI325 (Instinct MI300 series)
- **gfx90a** - MI200 series (MI250, MI250X)
- **gfx950** - MI350 series
- **gfx1100** - RDNA3 series
- **gfx1201** - RDNA4 series

### Install Python Dependencies

#### Core Dependencies (Required)

NumPy is required for Python examples and kernel generation. We recommend using a virtual environment:

**Option 1: Using standard venv**

```bash
# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install core dependencies
pip install -r python/requirements.txt
```

**Option 2: Using uv (faster alternative)**

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install core dependencies
uv pip install -r python/requirements.txt
```

**Option 3: System-wide install (not recommended)**

```bash
pip install -r python/requirements.txt
```

> **Note:** Always activate your virtual environment before running CMake or Python examples.

#### ML Heuristics Dependencies (Optional)

For ML-based kernel selection (examples 09-11), install additional dependencies:

```bash
# Activate your virtual environment first
source .venv/bin/activate

# Install ML dependencies (LightGBM, pandas, pyarrow, scikit-learn)
pip install -r requirements-ml.txt
```

**Why separate?** ML dependencies are large (especially pyarrow) and not needed for basic dispatcher usage. Install only if you need:

- ML-based kernel selection (`examples/gemm/python/09_ml_heuristic.py`)
- Model training (`heuristics/train.py`)
- Model evaluation (`heuristics/evaluate.py`)
- Automated benchmark analysis

Approximate install sizes:

- **Core dependencies:** ~50 MB (NumPy only)
- **With ML dependencies:** ~500 MB (includes LightGBM, pandas, pyarrow, scikit-learn)

### Supported Data Types

CK Tile supports a wide range of data types for GEMM operations:

| A dtype | B dtype | Acc dtype | Warp Tile Sizes | Notes |
|---------|---------|-----------|-----------------|-------|
| `fp32` | `fp32` | `fp32` | 16x16x4, 16x16x16 | Full precision |
| `fp16` | `fp16` | `fp32` | 32x32x8, 32x32x16, 16x16x16, 16x16x32 | Standard half |
| `bf16` | `bf16` | `fp32` | 32x32x8, 32x32x16, 16x16x16, 16x16x32 | Brain float 16 |
| `fp8` | `fp8` | `fp32` | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | FP8 E4M3 |
| `fp8` | `bf8` | `fp32` | 32x32x16, 16x16x32 | Mixed FP8/BF8 |
| `bf8` | `fp8` | `fp32` | 32x32x16, 16x16x128 | Mixed BF8/FP8 |
| `bf8` | `bf8` | `fp32` | 32x32x16, 32x32x32, 16x16x32 | BF8 E5M2 |
| `int8` | `int8` | `int32` | 32x32x16, 16x16x32, 16x16x16 | Integer GEMM |
| `pk_fp4` | `pk_fp4` | `fp32` | 16x16x128 | Packed 4-bit float |

**Notes:**

- Accumulator is always `fp32` except for `int8` which uses `int32`
- FP8 types: `fp8` = E4M3, `bf8` = E5M2
- `pk_fp4` = Packed 4-bit float (2 values per byte)
- Some dtypes require specific GPU architectures (e.g., FP8 requires MI300+)
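
The accumulator rule and the packed `pk_fp4` storage noted above can be written down in a few lines. This is a minimal sketch: the function names are invented for the example, and the byte widths are the standard sizes of each type rather than values read from the dispatcher.

```python
# Bits per element; pk_fp4 packs two 4-bit values into one byte.
BITS_PER_ELEMENT = {
    "fp32": 32, "fp16": 16, "bf16": 16,
    "fp8": 8, "bf8": 8, "int8": 8, "pk_fp4": 4,
}


def accumulator_dtype(a_dtype, b_dtype):
    """int32 for int8 x int8 inputs, fp32 for every other pair in the table."""
    if a_dtype == "int8" and b_dtype == "int8":
        return "int32"
    return "fp32"


def storage_bytes(dtype, num_elements):
    """Bytes needed to store num_elements values, rounding up to whole bytes."""
    total_bits = BITS_PER_ELEMENT[dtype] * num_elements
    return (total_bits + 7) // 8


print(accumulator_dtype("fp8", "bf8"), storage_bytes("pk_fp4", 128))
```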

---

## Step-by-Step Build Guide

### Step 1: Navigate to Dispatcher Directory

```bash
# From composable_kernel root
cd dispatcher

# Verify you're in the right place
ls CMakeLists.txt  # Should exist
```

### Step 2: Create Build Directory

```bash
mkdir -p build
cd build
```

### Step 3: Configure CMake

**Basic configuration (library only):**

```bash
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942"
```

**Full configuration (with examples and tests):**

```bash
cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON \
  -DBUILD_DISPATCHER_TESTS=ON
```

**Expected output:**

```
-- Found hip: /opt/rocm (found suitable version "6.x.x")
-- Generating GEMM kernels...
-- Built: gemm_01 through gemm_06, dispatcher_gemm_lib.so
-- Configuring done
```

### Step 4: Build

```bash
# Build all targets (generates kernels automatically, then compiles)
make -j$(nproc)

# Or build specific targets
make gemm_01_basic         # Single GEMM example
make dispatcher_gemm_lib   # GEMM shared library for Python

# Build ONLY Python libraries (faster if you don't need C++ examples)
make python_libs -j$(nproc)
```

### Kernel Generation Targets

Kernels are generated automatically during `make`, but you can also control generation explicitly:

```bash
# Generate all kernels only (no compilation)
make generate_all_kernels

# Generate GEMM kernels only
make generate_gemm_kernels

# Force regenerate (even if kernels exist)
make regenerate_all_kernels
make regenerate_gemm_kernels

# Generate for specific GPU architecture
make generate_kernels_gfx942   # MI300X
make generate_kernels_gfx90a   # MI200
make generate_kernels_gfx1100  # RDNA3
```

### Step 5: Verify Build

```bash
# Check executables were built
ls examples/gemm_*

# Check shared libraries were built
ls examples/libdispatcher_gemm_lib.so
```

### CMake Options Reference

| Flag | Default | Description |
|------|---------|-------------|
| `CMAKE_BUILD_TYPE` | Debug | **Use `Release` for performance!** |
| `GPU_TARGETS` | None | Target GPU: `"gfx942"`, `"gfx90a"`, etc. |
| `BUILD_DISPATCHER_EXAMPLES` | OFF | Build C++ examples and Python libs |
| `BUILD_DISPATCHER_TESTS` | OFF | Build unit tests |
| `CMAKE_PREFIX_PATH` | - | ROCm installation path |
| `CMAKE_CXX_COMPILER` | - | Path to hipcc compiler |

⚠️ **Important:** Always use `-DCMAKE_BUILD_TYPE=Release` for benchmarking. Debug builds are slower.

⚠️ **Important:** Architecture-based kernel filtering currently supports only one GPU target per build. Do not pass multiple GPU targets at once; if you need several, configure a separate build directory for each.
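
One way to follow the one-target-per-build rule is to generate a separate configure command for each architecture. The sketch below is illustrative only: the `build_<target>` directory naming is an assumption, and the flags are the ones listed in the table above. It relies on CMake's standard `-S`/`-B` options for choosing source and build directories.

```python
def configure_commands(gpu_targets, rocm_path="/opt/rocm"):
    """Build one cmake argument list per GPU target, each with its own build dir."""
    commands = []
    for target in gpu_targets:
        build_dir = f"build_{target}"  # hypothetical naming scheme
        commands.append([
            "cmake", "-S", ".", "-B", build_dir,
            f"-DCMAKE_PREFIX_PATH={rocm_path}",
            f"-DCMAKE_CXX_COMPILER={rocm_path}/bin/hipcc",
            "-DCMAKE_BUILD_TYPE=Release",
            f"-DGPU_TARGETS={target}",
        ])
    return commands


# Print the commands instead of running them; run from the dispatcher directory.
for cmd in configure_commands(["gfx942", "gfx950"]):
    print(" ".join(cmd))
```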

---

## Running Examples

### C++ Examples

After building, executables are in `build/examples/`:

```bash
cd build/examples

# GEMM Examples
./gemm_01_basic                 # Basic GEMM with autofill/autocorrect
./gemm_02_multi_size            # Wildcard expansion
./gemm_03_benchmark_validation  # Benchmarking + validation
./gemm_04_heuristics            # Heuristic kernel selection
./gemm_05_json_export           # Registry JSON export
./gemm_06_multi_registry        # Multiple registries
```

### Python Examples

Run from the `dispatcher` directory:

```bash
cd /path/to/composable_kernel/dispatcher

# GEMM Examples
python3 examples/gemm/python/01_basic_gemm.py   # Basic multi-kernel GEMM
python3 examples/gemm/python/04_validation.py   # CPU reference validation
python3 examples/gemm/python/07_stress_test.py  # Stress test (48 kernels)
python3 examples/gemm/python/08_heuristics.py   # Heuristic selection
```

### Example Output

**Expected C++ output (`gemm_01_basic`):**

```
======================================================================
Example 01: Basic GEMM with Declarative Kernel Definition
======================================================================

Step 1: Declared Kernels
------------------------
Kernel Set: fp16_gemm_kernels
Architecture: gfx942
Configurations: 1
- gemm_fp16_rcr_compv4_cshuffle_intrawave_128x128x32

Step 2: Create Registry and Dispatcher
--------------------------------------
Registered 1 kernels

Step 3: Define Problem
----------------------
M=1024, N=1024, K=1024

Step 4: GPU Execution
---------------------
*** GPU EXECUTION ***
Time: <varies> ms
TFLOPS: <varies>
```

> **Note:** Timing values vary by GPU model and system configuration.
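
To sanity-check the reported numbers, TFLOPS can be recomputed from the measured time. The sketch below uses the conventional GEMM operation count of 2·M·N·K (one multiply and one add per inner-product term); that the example binary uses exactly this formula is an assumption, and the 0.05 ms timing is a made-up input.

```python
def gemm_tflops(m, n, k, time_ms):
    """TFLOPS for an M x N x K GEMM, counting 2*M*N*K floating-point operations."""
    flops = 2.0 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


# Hypothetical timing for the M=N=K=1024 problem shown above.
print(f"{gemm_tflops(1024, 1024, 1024, 0.05):.2f} TFLOPS")
```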

---

## Benchmark Parameters

The dispatcher supports fine-grained control over benchmarking, matching CK Tile's `stream_config`:

### Available Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `warmup` | int | 5 | Warmup iterations (discarded from timing) |
| `repeat` | int | 20 | Benchmark iterations (averaged) |
| `flush_cache` | bool | false | Flush GPU L2 cache between iterations |
| `rotating_count` | int | 1 | Rotating buffer count (for cache simulation) |
| `timer` | string | "gpu" | Timer type: "gpu" (HIP events) or "cpu" |
| `init` | string | "random" | Matrix initialization: "random", "linear", "constant" |
| `split_k` | int | 1 | Split-K parallelism factor |
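
As a rough illustration, the defaults in this table can be captured in a small settings object and turned into command-line flags. `BenchmarkConfig` is a hypothetical helper, not part of the dispatcher's Python API. It emits only the flags that appear elsewhere in this README (`--warmup`, `--repeat`, `--flush-cache`, `--rotating-count`, `--timer`) and leaves out `init` and `split_k`, whose flag spellings are not shown here.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkConfig:
    """Benchmark settings with the defaults from the table above."""
    warmup: int = 5
    repeat: int = 20
    flush_cache: bool = False
    rotating_count: int = 1
    timer: str = "gpu"

    def to_cli_args(self):
        """Render the settings as a list of command-line arguments."""
        args = [
            "--warmup", str(self.warmup),
            "--repeat", str(self.repeat),
            "--rotating-count", str(self.rotating_count),
            "--timer", self.timer,
        ]
        if self.flush_cache:
            args.append("--flush-cache")  # boolean switch, takes no value
        return args


stable = BenchmarkConfig(warmup=10, repeat=100, flush_cache=True, rotating_count=4)
print(" ".join(stable.to_cli_args()))
```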

### Python Usage

```python
from ctypes_utils import DispatcherLib

# Basic usage (default benchmark settings)
lib = DispatcherLib.load()

# Advanced benchmark settings are passed on the command line (run from a shell):
#   python3 examples/gemm/python/10_advanced_benchmark.py \
#       --warmup 10 \
#       --repeat 100 \
#       --flush-cache
```

### C++ Usage

```cpp
// Basic timing
ck_tile::stream_config basic_cfg{nullptr, true};

// Advanced benchmark settings
ck_tile::stream_config cfg{
    nullptr, // stream_id (nullptr = default stream)
    true,    // time_kernel
    1,       // log_level
    10,      // cold_niters (warmup)
    100,     // nrepeat
    true,    // is_gpu_timer
    true,    // flush_cache
    4        // rotating_count
};

float avg_time = kernel.run(args, cfg);
```

### Command Line (Python Examples)

```bash
# Basic run
python3 examples/gemm/python/10_advanced_benchmark.py

# With benchmark parameters
python3 examples/gemm/python/10_advanced_benchmark.py \
  --warmup 10 \
  --repeat 100 \
  --flush-cache \
  --rotating-count 4 \
  --timer gpu
```

### When to Use Each Parameter

| Use Case | Recommended Settings |
|----------|---------------------|
| Quick test | `warmup=1, repeat=3` |
| Stable benchmark | `warmup=10, repeat=100` |
| Memory-bound analysis | `flush_cache=True, rotating_count=4` |
| Compute-bound analysis | `flush_cache=False` (default) |
| Debug timing | `timer="cpu"` |
| Production | `timer="gpu"` (default) |

---

## ML-Based Kernel Selection (Optional)

The dispatcher includes ML heuristics for automated kernel selection using trained LightGBM models.

**Prerequisites:** Install ML dependencies first:

```bash
pip install -r requirements-ml.txt  # ~500 MB (LightGBM, pandas, pyarrow, scikit-learn)
```

**Documentation:** See [heuristics/README.md](heuristics/README.md) for:

- Training and evaluating models
- Feature engineering (72 features)
- Using pre-trained models
- Python API reference

**Examples:**

```bash
python3 examples/gemm/python/09_ml_heuristic.py  # ML-based kernel selection
python3 examples/gemm/python/10_rank_kernels.py  # Kernel ranking
```

**Model Compression:** Trained models are stored in compressed `.lgbm.gz` format to save space (~67% size reduction). Python tools automatically decompress models on first use. For C++ examples, decompress manually:

```bash
# If you have compressed models
cd heuristics/models/gemm_universal_fp16_gfx950
gunzip model_tflops.lgbm.gz

# Then use in C++ example (executables are in build/examples/)
cd ../../../build
./examples/gemm_09_ml_heuristic --model ../heuristics/models/gemm_universal_fp16_gfx950/model_tflops.lgbm
```
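
The manual `gunzip` step can also be done from Python with the standard library. This is a minimal sketch of the idea, not the code the Python tools actually use; `decompress_model` is a name chosen for the example. Unlike `gunzip`, it keeps the original `.gz` file in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_model(gz_path):
    """Write the decompressed model next to the .gz file and return its path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # model_tflops.lgbm.gz -> model_tflops.lgbm
    if not out_path.exists():  # skip the work if already decompressed
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path
```

A call such as `decompress_model("heuristics/models/gemm_universal_fp16_gfx950/model_tflops.lgbm.gz")` would then produce the `.lgbm` file expected by the C++ example.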

---

## External Integration

### Using Dispatcher in Your Own Project

#### Option 1: CMake Integration (Recommended)

Add to your `CMakeLists.txt`:

```cmake
# Set path to composable_kernel
set(CK_ROOT "/path/to/composable_kernel")

# Add dispatcher subdirectory
add_subdirectory(${CK_ROOT}/dispatcher dispatcher_build)

# Link to your target
target_link_libraries(your_target PRIVATE ck_tile_dispatcher)
target_include_directories(your_target PRIVATE
    ${CK_ROOT}/dispatcher/include
    ${CK_ROOT}/include
)
```

#### Option 2: Include as Pre-built Library

```cmake
# Find the pre-built library
find_library(CK_DISPATCHER ck_tile_dispatcher
    PATHS /path/to/composable_kernel/dispatcher/build)

# Include directories
set(CK_INCLUDE_DIRS
    /path/to/composable_kernel/include
    /path/to/composable_kernel/dispatcher/include
)

target_link_libraries(your_target PRIVATE ${CK_DISPATCHER})
target_include_directories(your_target PRIVATE ${CK_INCLUDE_DIRS})
```

#### Option 3: Python Integration

```python
import sys
sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")

# For GEMM
from ctypes_utils import DispatcherLib, Dispatcher, KernelConfig
```

### Required Include Paths

When integrating, you need these include paths:

```
/path/to/composable_kernel/include                             # CK Tile core headers
/path/to/composable_kernel/dispatcher/include                  # Dispatcher headers
/path/to/composable_kernel/dispatcher/build/generated_kernels  # Generated kernels
```

### Required Compile Flags

```bash
# Minimum flags for hipcc
-std=c++17
-D__HIP_PLATFORM_AMD__=1
--offload-arch=gfx942  # Your target GPU

# Recommended flags
-O3
-mllvm -enable-noalias-to-md-conversion=0
-Wno-undefined-func-template
-Wno-float-equal
-Wall
-Werror
```

### Python Path Setup

For Python scripts outside the dispatcher directory:

```bash
# Option 1: Environment variable
export PYTHONPATH="/path/to/composable_kernel/dispatcher/examples/gemm/python:$PYTHONPATH"

# Option 2: In your Python script (Python code, not shell):
#   import sys
#   sys.path.insert(0, "/path/to/composable_kernel/dispatcher/examples/gemm/python")
```

### Library Search Paths

The Python utilities search for the shared library in these locations:

```python
# For GEMM (ctypes_utils.py)
SEARCH_PATHS = [
    "build/examples/libdispatcher_gemm_lib.so",
    "../build/examples/libdispatcher_gemm_lib.so",
    "../../build/examples/libdispatcher_gemm_lib.so",
]
```

If using from a different location, set the library path explicitly:

```python
# GEMM
from ctypes_utils import DispatcherLib
lib = DispatcherLib.load("/absolute/path/to/libdispatcher_gemm_lib.so")
```
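
To see which candidate would be picked from a given directory, the search can be reproduced in a few lines. This is a sketch under stated assumptions: the real `ctypes_utils.py` may resolve the relative paths against a different base directory, and `find_library` is a name invented for this example.

```python
from pathlib import Path

# Relative candidates copied from the list above.
SEARCH_PATHS = [
    "build/examples/libdispatcher_gemm_lib.so",
    "../build/examples/libdispatcher_gemm_lib.so",
    "../../build/examples/libdispatcher_gemm_lib.so",
]


def find_library(start_dir, candidates=SEARCH_PATHS):
    """Return the first candidate that exists relative to start_dir, else None."""
    for rel in candidates:
        path = (Path(start_dir) / rel).resolve()
        if path.is_file():
            return path
    return None


# Prints an absolute path if a built library is reachable from here, else None.
print(find_library(Path.cwd()))
```

The returned absolute path can be passed straight to `DispatcherLib.load(...)` as shown above.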

---

## Core Concepts

### Data Flow

```
KernelConfig → Registry → Dispatcher → GPU Execution
```

1. **KernelConfig**: Defines kernel parameters (tile sizes, data types, layouts)
2. **Registry**: Stores multiple kernel configurations
3. **Dispatcher**: Selects best kernel for a given problem and executes it
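
The three stages above can be pictured with a toy model. Everything in this sketch is hypothetical: the `Toy*` classes are not the dispatcher's real API, and the selection rule (minimise padded work) is only a stand-in for whatever heuristic the real dispatcher applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyKernelConfig:
    name: str
    dtype: str
    layout: str
    tile_m: int
    tile_n: int
    tile_k: int


class ToyRegistry:
    """Stores kernel configurations and filters them by dtype and layout."""

    def __init__(self):
        self._kernels = []

    def register(self, config):
        self._kernels.append(config)

    def candidates(self, dtype, layout):
        return [k for k in self._kernels if k.dtype == dtype and k.layout == layout]


class ToyDispatcher:
    """Selects a kernel for a problem; a real dispatcher would also launch it."""

    def __init__(self, registry):
        self.registry = registry

    def select(self, m, n, k, dtype, layout):
        def round_up(x, tile):
            return -(-x // tile) * tile

        def padded_work(cfg):
            # Work after padding each dimension up to a multiple of the tile size.
            return (round_up(m, cfg.tile_m) * round_up(n, cfg.tile_n)
                    * round_up(k, cfg.tile_k))

        options = self.registry.candidates(dtype, layout)
        if not options:
            raise LookupError("no kernel registered for this dtype/layout")
        return min(options, key=padded_work)


registry = ToyRegistry()
big = ToyKernelConfig("tile_128x128x32", "fp16", "rcr", 128, 128, 32)
small = ToyKernelConfig("tile_64x64x32", "fp16", "rcr", 64, 64, 32)
registry.register(big)
registry.register(small)
dispatcher = ToyDispatcher(registry)
print(dispatcher.select(192, 192, 1024, "fp16", "rcr").name)
```

The point of the split is that kernel definitions (configs) stay declarative, while the selection policy lives in one place and can be swapped, for example for the ML heuristic described earlier.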

### GEMM Layouts

| Layout | A | B | C | Use Case |
|--------|---|---|---|----------|
| RCR | Row | Col | Row | Most common (PyTorch default) |
| RRR | Row | Row | Row | Both inputs row-major |
| CRR | Col | Row | Row | A transposed |
| CCR | Col | Col | Row | Both inputs column-major |
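
What row-major and column-major mean for addressing can be made concrete with a small sketch. It assumes densely packed matrices with no padding between rows or columns; the function names are illustrative only.

```python
def linear_offset(layout, row, col, num_rows, num_cols):
    """Offset of element (row, col) in a densely packed 2-D matrix."""
    if layout == "r":  # row-major: elements of one row are contiguous
        return row * num_cols + col
    if layout == "c":  # column-major: elements of one column are contiguous
        return col * num_rows + row
    raise ValueError(f"unknown layout {layout!r}")


def leading_dimension(layout, num_rows, num_cols):
    """Stride between consecutive rows (row-major) or columns (column-major)."""
    if layout not in ("r", "c"):
        raise ValueError(f"unknown layout {layout!r}")
    return num_cols if layout == "r" else num_rows


# RCR for an M x N x K problem: A is M x K (row), B is K x N (col), C is M x N (row).
m, n, k = 4, 16, 8
print(leading_dimension("r", m, k), leading_dimension("c", k, n), leading_dimension("r", m, n))
```

In the RCR case both A and B have a leading dimension of K, so each inner product reads contiguous memory from both inputs.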

### Split-K Support

Split-K divides the K dimension across multiple thread blocks, useful for large K dimensions.

**Usage (C++):**

```cpp
// GEMM with 4-way K split
auto problem = ProblemBuilder()
    .m(1024).n(1024).k(8192)
    .split_k(4)
    .build();
```
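
The arithmetic behind Split-K is simple: each split handles a contiguous slice of K and produces a partial result, and the partials are summed. The pure-Python sketch below illustrates only that idea; how the real kernels partition K and combine partial results is not described here, so treat the chunking rule as an assumption.

```python
def split_k_ranges(k, split_k):
    """Partition [0, k) into at most split_k contiguous, near-equal chunks."""
    chunk = -(-k // split_k)  # ceiling division
    return [(start, min(start + chunk, k)) for start in range(0, k, chunk)]


def split_k_dot(a_row, b_col, split_k):
    """Dot product computed as a sum of per-chunk partial results."""
    partials = [
        sum(a_row[i] * b_col[i] for i in range(start, stop))
        for start, stop in split_k_ranges(len(a_row), split_k)
    ]
    return sum(partials)


# K=8192 with a 4-way split, as in the C++ snippet above.
print(split_k_ranges(8192, 4))
```

With floating-point inputs the split changes the order of additions, so results can differ slightly from the unsplit computation.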

---

## Operation Support Matrix

This matrix shows all CK Tile operations with per-data-type, per-layout, and per-GPU support status. It uses a three-state convention: ✅ = supported by both CK Tile and the dispatcher, ❌ = supported by CK Tile but not yet in the dispatcher, blank = not supported by CK Tile itself.

| | | | | | **Data Types** | | | | | **Layouts** | | | | **GPU Targets** | | |
|:---:|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Op** | **CK Tile Kernel** | **fp16** | **fp8** | **bf16** | **bf8** | **int8** | **fp4** | **fp6** | **rcr** | **rrr** | **ccr** | **crr** | **90a** | **942** | **950** | **1201** |
| GEMM | gemm_multi_d [5]<br>engine: `dispatcher/`<br>example: `19_gemm_multi_d/` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GEMM | gemm_preshuffle [1][2]<br>engine: `dispatcher/` | ✅ | ✅ | ✅ | ✅ | ✅ | | | ✅ | | | | ✅ | ✅ | ✅ | ❌ |
| GEMM | gemm_universal [3][4][7][8]<br>engine: `dispatcher/`<br>example: `03_gemm/` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GEMM | batched_contraction<br>example: `41_batched_contraction/` | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | batched_gemm<br>example: `16_batched_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | block_scale_gemm<br>example: `38_block_scale_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | flatmm<br>example: `18_flatmm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_multi_abd<br>example: `22_gemm_multi_abd/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | gemm_quant | | ❌ | ❌ | ❌ | ❌ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | grouped_gemm<br>example: `17_grouped_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | grouped_gemm_quant | | ❌ | ❌ | ❌ | ❌ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GEMM | streamk_gemm<br>example: `40_streamk_gemm/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Reduce | multi_reduce2d<br>example: `05_reduce/` | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Reduce | reduce2d<br>example: `05_reduce/` | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Attention | fmha<br>example: `01_fmha/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Attention | sparse_attn<br>example: `50_sparse_attn/` | ❌ | | ❌ | | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Activation | softmax | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Activation | topk_softmax<br>example: `09_topk_softmax/` | ❌ | ❌ | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Conv | grouped_conv [6]<br>example: `20_grouped_convolution/` | ❌ | ❌ | ❌ | ❌ | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Data Move | batched_transpose<br>example: `35_batched_transpose/` | ❌ | ❌ | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Data Move | image_to_column<br>example: `04_img2col/` | ❌ | | | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Data Move | permute<br>example: `06_permute/` | ❌ | ❌ | ❌ | | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Elementwise | elementwise<br>example: `21_elementwise/` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | ❌ | ❌ | ❌ | ❌ |
| MoE | fused_moe<br>example: `15_fused_moe/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | add_rmsnorm2d_rdquant<br>example: `11_add_rmsnorm2d_rdquant/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | layernorm2d<br>example: `02_layernorm2d/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | norm_reduce | ❌ | | ❌ | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Norm | rmsnorm2d<br>example: `10_rmsnorm2d/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Pooling | pooling<br>example: `36_pooling/` | ❌ | | | | | | | | | | | ❌ | ❌ | ❌ | ❌ |
| Quant | smoothquant<br>example: `12_smoothquant/` | ❌ | ❌ | ❌ | ❌ | ❌ | | | | | | | ❌ | ❌ | ❌ | ❌ |

**Notes:**

- [1] **gemm_preshuffle:** Supports only `rcr` layout. Uses fixed `preshufflev2` pipeline, `Auto` scheduler, and `cshuffle` epilogue.
- [2] **gemm_preshuffle:** `int8` preshuffle support is limited to gfx942 and gfx950 (entries in `preshuffle_warp_tile_combos`).
- [3] **gemm_universal:** `fp4` (pk_fp4) support is only available on gfx950.
- [4] **gemm_universal:** `fp32` GEMM is supported by the dispatcher (`fp32_fp32_fp32` warp tile combos exist) but is omitted from matrix columns for consistency with the tile engine matrix format.
- [5] **gemm_multi_d:** Codegen supports `MultiDAdd` and `MultiDMultiply` element-wise ops. Preselected kernel sets also test `Relu`, `Gelu`, `FastGelu`.
- [6] **grouped_conv:** `arch_filter.py` defines conv operator types (`CONV_FWD`, `CONV_BWD_DATA`, `CONV_BWD_WEIGHT`, `CONV3D_*`) but dispatcher infrastructure is incomplete (ctypes bindings are stubs, `conv_utils.hpp` does not exist).
- [7] **(all dispatcher ops):** gfx908, gfx1100, and gfx1200 also have `warp_tile_combos` in `arch_specs.json` but are not shown in the matrix's 4 GPU columns.
- [8] **(all dispatcher ops):** `int4`, `fp32`, `fp64` are valid dispatcher data types (defined in `kernel_key.hpp` `DataType` enum) but have no dedicated matrix columns.

### Dispatcher GEMM Configuration Detail

#### Per-Variant Configuration

| GEMM Variant | Pipelines | Schedulers | Epilogues | Element-wise Ops | Output Dtype |
|:---:|:---:|:---:|:---:|:---:|:---:|
| gemm_universal | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | PassThrough | Same as input (fp8/bf8 -> fp16) |
| gemm_preshuffle | preshufflev2 | Auto | cshuffle | PassThrough | Same as input (fp8/bf8 -> fp16) |
| gemm_multi_d | mem, compv3, compv4 | intrawave, interwave | cshuffle, default | MultiDAdd, MultiDMultiply | Same as input (fp8/bf8 -> fp16) |

#### Warp Tile Combinations per GPU

| GPU | fp16 | bf16 | fp8 | bf8 | int8 | pk_fp4 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| gfx1100 | 16x16x16 | 16x16x16 | -- | -- | 16x16x16 | -- |
| gfx1200 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- |
| gfx1201 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | 16x16x16 | -- |
| gfx908 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | 32x32x8, 16x16x16, 32x32x16, 16x16x32 | -- | -- | 32x32x16, 16x16x32 | -- |
| gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | 32x32x16, 16x16x32 | -- |
| gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 16x16x32 | -- |
| gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 4x64x16, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 16x16x32 | 16x16x128 |
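
A table like this lends itself to a quick validity check before declaring a kernel. The sketch below hard-codes a small, hand-copied subset of the rows above (fp8 and pk_fp4 on gfx942 and gfx950). The authoritative data lives in `arch_specs.json`, whose schema is not shown here, so the dictionary layout and function name are assumptions.

```python
# Hand-copied subset of the table above; not loaded from arch_specs.json.
WARP_TILES = {
    ("gfx942", "fp8"): {"32x32x16", "32x32x32", "16x16x32", "16x16x64"},
    ("gfx950", "fp8"): {"32x32x16", "32x32x32", "16x16x32", "16x16x64",
                        "16x16x128", "32x32x64"},
    ("gfx950", "pk_fp4"): {"16x16x128"},
}


def is_warp_tile_supported(gpu, dtype, warp_m, warp_n, warp_k):
    """True if the warp tile is listed for this GPU/dtype in the subset above."""
    tile = f"{warp_m}x{warp_n}x{warp_k}"
    return tile in WARP_TILES.get((gpu, dtype), set())


# 32x32x64 fp8 tiles are listed for gfx950 but not for gfx942.
print(is_warp_tile_supported("gfx950", "fp8", 32, 32, 64))
print(is_warp_tile_supported("gfx942", "fp8", 32, 32, 64))
```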

#### Preshuffle Warp Tile Combinations

| GPU | fp16 | bf16 | fp8 | bf8 | int8 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| gfx90a | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32 | 32x32x16, 32x32x32 | -- |
| gfx942 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32 | 16x16x32, 32x32x16 |
| gfx950 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x8, 16x16x16, 32x32x16, 16x16x32, 64x4x16 | 32x32x16, 32x32x32, 16x16x32, 16x16x64, 16x16x128, 32x32x64 | 32x32x16, 32x32x32, 16x16x64, 16x16x32, 16x16x128, 32x32x64 | -- |

**Legend:**

- **CK Tile Kernel column:** First line is the kernel name. Lines prefixed with "engine:" show the dispatcher directory. Lines prefixed with "example:" show the CK Tile example directory under `example/ck_tile/`.
- **Green cell** (✅): CK Tile implementation exists **and** the dispatcher supports it.
- **Red cell** (❌): CK Tile implementation exists **but** the dispatcher does **not** support it.
- **Grey cell** (blank): No CK Tile implementation exists for this combination.

**Layout codes:** Each 3-character layout code specifies the memory layout for tensors A, B, and C:

- `r` = row-major, `c` = column-major
- Example: `rcr` means A is row-major, B is column-major, C is row-major
- `gemm_multi_d` uses 4-character codes internally (e.g., `rcrr`) where the 4th character is the D tensor layout (always `r`). The matrix shows only the 3-character A/B/C portion.
|
|
|
|
**Data type mapping per config label:**
|
|
|
|
| Config Label | A (source) | B (source) | Acc | C (output) |
|
|
|:---:|:---:|:---:|:---:|:---:|
|
|
| fp16 | fp16 | fp16 | fp32 | fp16 |
|
|
| bf16 | bf16 | bf16 | fp32 | bf16 |
|
|
| int8 | int8 | int8 | int32 | int32 |
|
|
| fp8 | fp8 | fp8 | fp32 | fp16 |
|
|
| bf8 | bf8 | bf8 | fp32 | fp16 |
|
|
| fp6 | fp6 | fp6 | fp32 | fp32 |
|
|
| fp4 | fp16 or bf16 | fp4 | fp32 | fp16 or bf16 |
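
To make the layout codes and the data type mapping concrete, here is a minimal sketch that expands a config label plus a layout code into per-tensor types and layouts. The names `decode_layout`, `describe_config`, and `DTYPE_MAP` are hypothetical and exist only for this example. The `fp4` row is left out because its A and C types depend on whether fp16 or bf16 is chosen.

```python
LAYOUT_NAMES = {"r": "row-major", "c": "column-major"}

# Copied from the data type mapping table above: label -> (A, B, Acc, C).
# fp4 is omitted because its A/C type is "fp16 or bf16" rather than one type.
DTYPE_MAP = {
    "fp16": ("fp16", "fp16", "fp32", "fp16"),
    "bf16": ("bf16", "bf16", "fp32", "bf16"),
    "int8": ("int8", "int8", "int32", "int32"),
    "fp8": ("fp8", "fp8", "fp32", "fp16"),
    "bf8": ("bf8", "bf8", "fp32", "fp16"),
    "fp6": ("fp6", "fp6", "fp32", "fp32"),
}


def decode_layout(code):
    """Map a 3-character (A/B/C) or 4-character (A/B/C/D) layout code to names."""
    code = code.lower()
    if len(code) not in (3, 4):
        raise ValueError(f"expected a 3- or 4-character layout code, got {code!r}")
    unknown = set(code) - set(LAYOUT_NAMES)
    if unknown:
        raise ValueError(f"unknown layout character(s): {sorted(unknown)}")
    # zip stops at the shorter sequence, so "D" only appears for 4-character codes.
    return {tensor: LAYOUT_NAMES[ch] for tensor, ch in zip("ABCD", code)}


def describe_config(label, layout_code):
    """Combine the data type and layout of each tensor for one configuration."""
    a_type, b_type, acc_type, c_type = DTYPE_MAP[label]
    layouts = decode_layout(layout_code)
    return {
        "A": (a_type, layouts["A"]),
        "B": (b_type, layouts["B"]),
        "Acc": acc_type,
        "C": (c_type, layouts["C"]),
    }


if __name__ == "__main__":
    print(decode_layout("rcr"))
    print(describe_config("fp8", "rcr"))
```

For `fp8` with layout `rcr`, the sketch reports fp8 inputs, an fp32 accumulator, and a row-major fp16 output, matching the table rows above.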

## Troubleshooting

### Build Issues

| Problem | Solution |
|---------|----------|
| `hipcc not found` | Set `-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc` |
| `hip not found` | Set `-DCMAKE_PREFIX_PATH=/opt/rocm` |
| Very slow performance | Use `-DCMAKE_BUILD_TYPE=Release` |
| `gfx942 not supported` | Check the ROCm version (6.0 or later is required) |
| Kernel generation fails | Ensure Python 3.8+ with NumPy is installed in the active venv |
| Build errors | First verify that CK builds without the dispatcher (see the main CK README) |

### Runtime Issues

| Problem | Solution |
|---------|----------|
| `Library not found` | Build with `-DBUILD_DISPATCHER_EXAMPLES=ON` |
| `No kernel found` | Check that the GPU architecture matches the build target |
| Python `ModuleNotFoundError` | Add paths to `PYTHONPATH` (see above) |
| Wrong results | Verify that the layout code matches how your data is stored |

### Debug Commands

```bash
# Check ROCm installation
rocminfo | head -20

# Check GPU architecture (look for the gfx* agent names)
rocminfo | grep "Name:"

# Verify library exists
ls -la build/examples/libdispatcher_*.so

# Run an example, merging stderr into stdout so error messages are captured
./build/examples/gemm_01_basic 2>&1

# Python: Check library loading
python3 -c "
import ctypes
lib = ctypes.CDLL('/path/to/libdispatcher_gemm_lib.so')
print('Library loaded successfully')
"
```
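
When several dispatcher libraries are built, the one-line load check above can be extended into a small script that tries every matching library and reports the loader error for any that fail. This is a sketch under the assumption that the libraries live at `build/examples/libdispatcher_*.so`, as in the `ls` command above; `check_dispatcher_libs` is a hypothetical helper, not part of the dispatcher.

```python
import ctypes
import glob
import os


def check_dispatcher_libs(pattern="build/examples/libdispatcher_*.so"):
    """Try to load every shared library matching `pattern`.

    Returns a dict mapping each matched path to None on success, or to the
    loader's error message if ctypes could not open it.
    """
    results = {}
    for path in sorted(glob.glob(pattern)):
        try:
            ctypes.CDLL(os.path.abspath(path))
            results[path] = None
        except OSError as exc:
            results[path] = str(exc)
    return results


if __name__ == "__main__":
    report = check_dispatcher_libs()
    if not report:
        print("No libraries matched; was the build configured with "
              "-DBUILD_DISPATCHER_EXAMPLES=ON?")
    for path, error in report.items():
        status = "OK" if error is None else f"FAILED: {error}"
        print(f"{path}: {status}")
```

A failure message that mentions a missing dependent library usually points at the ROCm runtime not being on the loader path rather than at the dispatcher itself.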

### Clean Rebuild

If you encounter issues, try a clean rebuild:

```bash
cd dispatcher
rm -rf build
mkdir build && cd build
cmake ..   # append the same -D options you used for the original configure
make -j$(nproc)
```

---

## File Structure

```
dispatcher/
├── README.md                       # This file
├── CMakeLists.txt                  # Build configuration
│
├── include/ck_tile/dispatcher/     # C++ headers
│   ├── dispatcher.hpp              # GEMM dispatcher
│   ├── registry.hpp                # Kernel registry
│   └── kernel_key.hpp              # Kernel configuration
│
├── src/                            # C++ implementation
│
├── codegen/                        # Kernel generation
│   ├── unified_gemm_codegen.py     # GEMM kernel generator
│   └── arch_specs.json             # GPU specifications
│
├── bindings/ctypes/                # Python ctypes interface
│   └── gemm_ctypes_lib.cpp         # GEMM Python library
│
├── examples/                       # Examples
│   └── gemm/
│       ├── cpp/                    # C++ GEMM examples (01-06)
│       └── python/                 # Python GEMM examples (01-11)
│
├── scripts/                        # Build scripts
│
└── tests/                          # Unit tests
```

---

## Example Documentation

| Directory | README |
|-----------|--------|
| GEMM C++ | [examples/gemm/cpp/README.md](examples/gemm/cpp/README.md) |
| GEMM Python | [examples/gemm/python/README.md](examples/gemm/python/README.md) |
| Codegen | [codegen/README.md](codegen/README.md) |

---

## Archived Content

Convolution examples and utilities have been archived to `ck-2/conv_archive/dispatcher/`:
- `examples/conv/cpp/` - 11 C++ convolution examples
- `examples/conv/python/` - 14 Python convolution examples
- `codegen/unified_conv_codegen.py` - Conv kernel generator
- `include/ck_tile/dispatcher/conv_*.hpp` - Conv headers
- `python/conv_utils.py` - Conv Python utilities

---

## License

MIT License - Copyright (c) 2025, Advanced Micro Devices, Inc.