# CK Tile Dispatcher Examples

Comprehensive examples for GEMM operations with GPU execution.

> **Note**: Convolution examples have been moved to `ck-2/conv_archive/` for reference.

---

## Quick Start

### Step 1: Build

```bash
cd /path/to/composable_kernel/dispatcher
mkdir -p build && cd build

cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_BUILD_TYPE=Release \
  -DGPU_TARGETS="gfx942" \
  -DBUILD_DISPATCHER_EXAMPLES=ON

# Build everything (C++ examples + Python libraries)
make -j$(nproc)

# Or build ONLY Python libraries (faster)
make python_libs -j$(nproc)
```

### Step 2: Run C++ Examples

```bash
cd /path/to/composable_kernel/dispatcher/build/examples

# GEMM
./gemm_01_basic
./gemm_02_multi_size
./gemm_03_benchmark_validation
./gemm_04_heuristics
./gemm_05_json_export
./gemm_06_multi_registry
```

### Step 3: Run Python Examples

```bash
cd /path/to/composable_kernel/dispatcher

# GEMM
python3 examples/gemm/python/01_basic_gemm.py
python3 examples/gemm/python/04_validation.py
python3 examples/gemm/python/07_stress_test.py
python3 examples/gemm/python/08_heuristics.py
```

---

## Directory Structure

```
examples/
├── gemm/
│   ├── cpp/           # 6 C++ GEMM examples
│   └── python/        # 11 Python GEMM examples
│
└── README.md
```

---

## GEMM Examples

### C++ Examples

| # | Example | Description |
|---|---------|-------------|
| 01 | `gemm_01_basic` | Basic GEMM with declarative API, autofill, autocorrect |
| 02 | `gemm_02_multi_size` | Wildcard expansion for multiple configurations |
| 03 | `gemm_03_benchmark_validation` | Performance benchmarking with CPU/GPU validation |
| 04 | `gemm_04_heuristics` | Heuristic-based kernel selection |
| 05 | `gemm_05_json_export` | Registry JSON export for external tools |
| 06 | `gemm_06_multi_registry` | Multiple registries with named kernel sets |

**Details:** [gemm/cpp/README.md](gemm/cpp/README.md)

---

### Python Examples

| # | Example | Description |
|---|---------|-------------|
| 01 | `01_basic_gemm.py` | Basic GEMM with multi-kernel support |
| 02 | `02_batch_gemm.py` | Batched GEMM operations |
| 03 | `03_benchmark.py` | Performance benchmarking |
| 04 | `04_validation.py` | CPU reference validation |
| 05 | `05_numpy_integration.py` | NumPy array integration |
| 06 | `06_json_export.py` | Registry JSON export |
| 07 | `07_stress_test.py` | Multi-kernel stress testing (48 configs) |
| 08 | `08_heuristics.py` | Heuristic-based kernel selection (24 configs) |
| 09 | `09_multi_registry.py` | Multiple registries |
| 10 | `10_advanced_benchmark.py` | Advanced benchmark with full control |
| 11 | `11_json_import.py` | Import kernels from JSON |

**Details:** [gemm/python/README.md](gemm/python/README.md)

---

## Key Features

### Declarative Kernel API

Both C++ and Python examples use a declarative approach:

**C++ (DECL_KERNEL_SET macro):**
```cpp
DECL_KERNEL_SET(my_kernels,
    .add(
        Signature().dtype("fp16").layout("rcr"),
        Algorithm().tile(256, 256, 32).wave(2, 2, 1).warp(32, 32, 16)
                   .pipeline("compv4").scheduler("intrawave"),
        "gfx942"
    )
);
```

**Python (KernelConfig):**
```python
config = KernelConfig(
    tile_m=256, tile_n=256, tile_k=32,
    wave_m=2, wave_n=2, wave_k=1,
    warp_tile_m=32, warp_tile_n=32, warp_tile_k=16,
    pipeline="compv4", scheduler="intrawave"
)
```

### Autofill and Autocorrect

The build system automatically:
- **Autofills** missing parameters with sensible defaults
- **Autocorrects** invalid parameters based on architecture constraints
- **Expands** wildcards (`*`, `-1`, `ANY_INT`) to all valid configurations
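
The autofill and wildcard-expansion behavior listed above can be sketched in a few lines of plain Python. This is an illustration only, not the dispatcher's implementation: `DEFAULTS`, `CANDIDATES`, `WILDCARDS`, and `expand_wildcards` are hypothetical names, the candidate values are placeholders, and `ANY_INT` is modeled here as a plain string marker.

```python
from itertools import product

# Hypothetical wildcard markers, defaults, and candidate values.
# The dispatcher's real tables are not shown in this README.
WILDCARDS = {"*", -1, "ANY_INT"}
DEFAULTS = {"scheduler": "intrawave"}
CANDIDATES = {
    "tile_m": [128, 256],
    "tile_n": [128, 256],
    "pipeline": ["compv3", "compv4"],
}


def expand_wildcards(config):
    """Autofill missing fields, then expand every wildcard field."""
    filled = {**DEFAULTS, **config}  # explicit values override defaults
    keys = list(filled)
    # One list of options per field: all candidates for a wildcard,
    # otherwise just the single value the user supplied.
    options = [
        CANDIDATES[key] if filled[key] in WILDCARDS else [filled[key]]
        for key in keys
    ]
    return [dict(zip(keys, combo)) for combo in product(*options)]


expanded = expand_wildcards({"tile_m": "*", "tile_n": 256, "pipeline": -1})
for cfg in expanded:
    print(cfg)
```

With two candidates each for `tile_m` and `pipeline`, this request expands to four concrete configurations, each with `scheduler` filled in from the defaults.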

### Architecture Filtering

Kernel configurations are validated against GPU architecture constraints:
- Tile divisibility requirements
- Warp tile constraints
- Pipeline compatibility

Invalid configurations are automatically pruned during code generation.
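
A minimal sketch of this kind of filtering is shown below. The exact rules are assumptions made for illustration: the divisibility check (each tile dimension must be a multiple of waves times warp tile along that dimension) and the `SUPPORTED_*` sets are hypothetical and do not reproduce the dispatcher's per-architecture tables.

```python
# Hypothetical per-architecture tables, for illustration only.
SUPPORTED_WARP_TILES = {(32, 32, 16), (16, 16, 16)}
SUPPORTED_PIPELINES = {"compv3", "compv4"}


def is_valid_config(cfg, warp_tiles, pipelines):
    """Check one config against the assumed architecture constraints."""
    # Assumed tile divisibility rule: tile_d % (wave_d * warp_tile_d) == 0.
    for dim in ("m", "n", "k"):
        step = cfg[f"wave_{dim}"] * cfg[f"warp_tile_{dim}"]
        if cfg[f"tile_{dim}"] % step != 0:
            return False
    # Warp tile constraint and pipeline compatibility.
    warp_tile = (cfg["warp_tile_m"], cfg["warp_tile_n"], cfg["warp_tile_k"])
    return warp_tile in warp_tiles and cfg["pipeline"] in pipelines


# The configuration used in the declarative API examples.
readme_cfg = {
    "tile_m": 256, "tile_n": 256, "tile_k": 32,
    "wave_m": 2, "wave_n": 2, "wave_k": 1,
    "warp_tile_m": 32, "warp_tile_n": 32, "warp_tile_k": 16,
    "pipeline": "compv4",
}
bad_cfg = {**readme_cfg, "tile_m": 200}  # 200 is not a multiple of 2 * 32

# Pruning keeps only the configurations that pass every check.
candidates = [readme_cfg, bad_cfg]
pruned = [
    c for c in candidates
    if is_valid_config(c, SUPPORTED_WARP_TILES, SUPPORTED_PIPELINES)
]
print(len(pruned), "of", len(candidates), "configurations kept")
```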

---

## Validation Examples

### C++ Validation

```bash
./gemm_03_benchmark_validation --verify 1  # GEMM with CPU reference
./gemm_03_benchmark_validation --verify 2  # GEMM with GPU reference
```

### Python Validation

```bash
python3 examples/gemm/python/04_validation.py
python3 examples/gemm/python/07_stress_test.py  # Multi-kernel validation
```
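
For readers unfamiliar with reference validation, the idea can be sketched in pure Python: compute the GEMM naively on the CPU and compare it element-wise with the kernel output under a tolerance. The function names and tolerance values below are assumptions chosen for illustration and are not taken from `04_validation.py`.

```python
import math


def reference_gemm(a, b):
    """Naive CPU GEMM: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def matches_reference(result, reference, rel_tol=1e-2, abs_tol=1e-3):
    """Element-wise comparison with loose tolerances (assumed for fp16)."""
    if len(result) != len(reference):
        return False
    for row, ref_row in zip(result, reference):
        if len(row) != len(ref_row):
            return False
        for x, y in zip(row, ref_row):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
reference = reference_gemm(a, b)

# Stand-ins for kernel output: one with small rounding error, one wrong.
close_result = [[19.001, 22.0], [43.0, 49.99]]
wrong_result = [[19.0, 22.0], [43.0, 60.0]]

print("close result passes:", matches_reference(close_result, reference))
print("wrong result passes:", matches_reference(wrong_result, reference))
```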

---

## Troubleshooting

### Python: Library not found

```bash
# Run from the dispatcher directory
cd /path/to/composable_kernel/dispatcher
python3 examples/gemm/python/01_basic_gemm.py
```

### C++: Executables not found

```bash
# Build with examples enabled (run from the build directory)
cmake .. -DBUILD_DISPATCHER_EXAMPLES=ON
make -j$(nproc)

# Run from build/examples
cd examples
./gemm_01_basic
```

### GPU not detected

```bash
rocminfo | grep "Name:"
# Should show: gfx942, gfx90a, etc.
```
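
The same check can be scripted. The sketch below extracts `gfx` architecture names from `rocminfo`-style text; the helper names are hypothetical and the sample text is an abbreviated illustration of typical output, not a captured log.

```python
import re
import subprocess


def parse_gfx_targets(rocminfo_output):
    """Return the unique gfx architecture names found in the text."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output)))


def detect_gfx_targets():
    """Run rocminfo if it is installed; return [] when it is unavailable."""
    try:
        proc = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return []
    return parse_gfx_targets(proc.stdout)


# Abbreviated, illustrative sample of rocminfo output.
sample = """
  Name:                    AMD EPYC 7713 64-Core Processor
  Name:                    gfx942
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
"""

targets = parse_gfx_targets(sample)
print(targets)
```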

---

## Archived Examples

Convolution examples have been archived to `ck-2/conv_archive/dispatcher/`:
- `examples/conv/cpp/` - 11 C++ convolution examples
- `examples/conv/python/` - 14 Python convolution examples

See the archive for convolution functionality reference.