mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-03-25 01:27:40 +00:00
* WIP POC of dispatcher * Dispatcher python workflow setup. * Dispatcher cleanup and updates. Further dispatcher cleanup and updates. Build fixes Improvements and python to CK example Improvements to readme * Fixes to python paths * Cleaning up code * Improving dispatcher support for different arch Fixing typos * Fix formatting errors * Cleaning up examples * Improving codegeneration * Improving and fixing C++ examples * Adding conv functionality (fwd,bwd,bwdw) and examples. * Fixes based on feedback. * Further fixes based on feedback. * Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug. * Another round of improvements based on feedback. * Trimming out unnecessary code. * Fixing the multi-D implementation. * Using gpu verification for gemms and fixing convolutions tflops calculation. * Fix counter usage issue and arch filtering per ops. * Adding changelog and other fixes. * Improve examples and resolve critical bugs. * Reduce build time for python examples. * Fixing minor bug. * Fix compilation error. * Improve installation instructions for dispatcher. * Add docker based installation instructions for dispatcher. * Fixing arch-based filtering to match tile engine. * Remove dead code and fix arch filtering. * Minor bugfix. * Updates after rebase. * Trimming code. * Fix copyright headers. * Consolidate examples, cut down code. * Minor fixes. * Improving python examples. * Update readmes. * Remove conv functionality. * Cleanup following conv removable.
4.7 KiB
4.7 KiB
Adding New GPU Architecture Support
Guide for adding support for a new AMD GPU architecture to the CK Tile Dispatcher.
See also: Main Dispatcher README | Codegen README
Overview
The dispatcher uses arch_specs.json as the single source of truth for GPU specifications:
arch_specs.json → generate_arch_specs.py → arch_specs_generated.py (Python)
→ arch_specs_generated.hpp (C++)
Quick Start
# 1. Edit arch_specs.json
# 2. Run generator
python generate_arch_specs.py
# 3. Rebuild
cd ../build && cmake --build . -j8
# 4. Test
ctest
Step-by-Step Guide
Step 1: Edit arch_specs.json
Add new architecture under "architectures":
{
"architectures": {
"gfx1100": {
"family": "rdna3",
"description": "AMD Radeon RX 7000 series (RDNA3)",
"warp_size": 32,
"lds_capacity_kb": 64,
"warp_configs": [
[2, 4, 1],
[4, 2, 1]
],
"warp_tile_combos": {
"fp16_fp16_fp16": [[16, 16, 16], [32, 32, 16]],
"bf16_bf16_bf16": [[16, 16, 16], [32, 32, 16]]
}
}
}
}
Step 2: Configuration Fields
| Field | Description | Example |
|---|---|---|
family |
GPU family | "cdna3", "rdna4" |
description |
Human-readable name | "AMD Instinct MI300" |
warp_size |
Wave/warp size | 64 (CDNA), 32 (RDNA) |
lds_capacity_kb |
LDS memory in KB | 64 |
warp_configs |
Valid [warp_m, warp_n, warp_k] |
[[2,2,1], [4,4,1]] |
warp_tile_combos |
Warp tiles per dtype | See below |
Step 3: Warp Tile Combinations
Map data type combinations to valid warp tile sizes:
"warp_tile_combos": {
"fp16_fp16_fp16": [[32, 32, 8], [16, 16, 16], [32, 32, 16]],
"bf16_bf16_bf16": [[32, 32, 8], [16, 16, 16]],
"fp8_fp8_fp16": [[32, 32, 16], [32, 32, 32]],
"int8_int8_int32": [[16, 16, 32], [32, 32, 16]]
}
Key format: {A_dtype}_{B_dtype}_{C_dtype}
Step 4: Run Generator
cd dispatcher/codegen
python generate_arch_specs.py
This generates:
arch_specs_generated.py(Python module)../include/ck_tile/dispatcher/arch_specs_generated.hpp(C++ header)
Step 5: Rebuild and Test
cd ../build
cmake --build . -j8
ctest --output-on-failure
Step 6: Verify
from arch_filter import ArchFilter
filter = ArchFilter("gfx1100")
is_valid = filter.is_kernel_valid(
datatype_a="fp16", datatype_b="fp16", datatype_c="fp16",
tile_m=128, tile_n=128, tile_k=32,
warp_m=2, warp_n=2, warp_k=1,
warp_tile_m=16, warp_tile_n=16, warp_tile_k=16
)
print(f"Valid: {is_valid}")
Reference
Supported Data Types
| Key | Description |
|---|---|
fp16 |
Half precision (16-bit) |
bf16 |
Brain float 16 |
fp32 |
Single precision (32-bit) |
fp64 |
Double precision (64-bit) |
fp8 |
8-bit float (E4M3) |
bf8 |
8-bit brain float (E5M2) |
int8 |
8-bit integer |
int4 |
4-bit integer |
GPU Families
| Family | Description |
|---|---|
cdna2 |
MI200 series (gfx90a) |
cdna3 |
MI300 series (gfx942) |
cdna4 |
MI350 series (gfx950) |
rdna3 |
RX 7000 series (gfx1100) |
rdna4 |
RX 9000 series (gfx1201) |
Pipeline LDS Limits
| Pipeline | LDS Limit |
|---|---|
compv4 |
32 KB |
preshufflev2 |
32 KB |
default |
64 KB |
Troubleshooting
"Unknown GPU architecture"
- Check architecture key matches exactly (e.g.,
"gfx942"not"GFX942") - Verify you ran
generate_arch_specs.py - Rebuild C++ code
Kernels being rejected
from arch_filter import ArchFilter, KernelConfig
filter = ArchFilter("gfx942")
result = filter.validate_kernel(config)
print(f"Valid: {result.valid}")
for error in result.errors:
print(f" Error: {error}")
Missing warp tile combination
- Check
warp_tile_combosinarch_specs.json - Ensure
[warp_tile_m, warp_tile_n, warp_tile_k]is in the list - Verify data type key format
File Structure
codegen/
├── arch_specs.json # Single source of truth (EDIT THIS)
├── generate_arch_specs.py # Generator script
├── arch_specs_generated.py # Generated Python module
└── ADDING_NEW_GPU.md # This file
include/ck_tile/dispatcher/
├── arch_specs_generated.hpp # Generated C++ header
└── arch_filter.hpp # C++ filter
Best Practices
- Test thoroughly - Run all tests after adding a new GPU
- Start minimal - Add only validated configurations
- Document sources - Note where warp tile combinations came from
- Keep in sync - If using tile_engine, keep both updated
More info: See ../README.md for full documentation.