# CK Tile GEMM Unified Code Generator

Single source of truth for all GEMM kernel generation.

See also: the main Dispatcher README for installation and core concepts.
## Quick Start

```bash
cd dispatcher/codegen

# Generate standard FP16 kernels
python3 unified_gemm_codegen.py \
    --output-dir ../build/generated_kernels \
    --datatype fp16 \
    --layout rcr \
    --variants standard

# Generate all variants
python3 unified_gemm_codegen.py \
    --output-dir ../build/generated_kernels \
    --variants standard preshuffle multi_d
```
## Using from Python

```python
from ctypes_utils import CodegenRunner, KernelConfig

# Generate from a specific config
config = KernelConfig(tile_m=256, tile_n=256, tile_k=64)
codegen = CodegenRunner()
result = codegen.generate_from_config(config)

# Generate a single variant
result = codegen.generate("preshuffle")

# Generate all variants
results = codegen.generate_all()
```
## Command Line Options

| Option | Values | Description |
|---|---|---|
| `--output-dir` | path | Output directory |
| `--datatype` | `fp16`, `bf16`, `fp32`, `int8` | Data type |
| `--layout` | `rcr`, `rrr`, `crr`, `ccr` | Matrix layouts |
| `--gpu-target` | `gfx942`, `gfx90a`, `gfx950` | Target GPU |
| `--variants` | `standard`, `preshuffle`, `multi_d` | Kernel variants |
| `--preselected` | `fp16_rcr_essential`, etc. | Predefined kernel set |
## Layout Notation

- `R` = row-major, `C` = column-major
- Order: A, B, C (e.g., `rcr` = A row-major, B column-major, C row-major)
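The notation can be decoded mechanically. A minimal sketch; the `decode_layout` helper is ours for illustration and is not part of the codegen API:

```python
# Decode a three-letter layout string such as "rcr" into per-matrix
# storage orders for A, B, and C. Illustrative helper only; not part
# of the unified_gemm_codegen API.
def decode_layout(layout: str) -> dict:
    names = {"r": "row-major", "c": "column-major"}
    if len(layout) != 3 or any(ch not in names for ch in layout):
        raise ValueError(f"expected 3 chars from {{r, c}}, got {layout!r}")
    return dict(zip("ABC", (names[ch] for ch in layout)))

print(decode_layout("rcr"))
# {'A': 'row-major', 'B': 'column-major', 'C': 'row-major'}
```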
## Variants

### Standard

Basic GEMM: C = A × B

### PreShuffle

Optimized weight access with LDS pre-shuffling. Best for large matrices.

### Multi-D

Element-wise fusion: C = op(A × B + D0 + D1 + ...)

Supported ops: `PassThrough`, `MultiDAdd`, `Relu`, `Gelu`, `Sigmoid`, `Tanh`
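As a reference for what the fused epilogue computes, here is a plain-Python sketch of C = Relu(A × B + D0) on tiny matrices. This only illustrates the math; the generated kernels perform the fusion on the GPU:

```python
# Reference semantics of the multi_d variant with the Relu op:
# C = relu(A @ B + D0), on small nested-list matrices.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def fused_relu(a, b, d0):
    prod = matmul(a, b)
    return [[max(0.0, prod[i][j] + d0[i][j])
             for j in range(len(prod[0]))] for i in range(len(prod))]

A = [[1.0, -2.0], [0.0, 3.0]]
B = [[4.0, 0.0], [1.0, 1.0]]
D0 = [[0.5, 0.5], [-10.0, 0.5]]
print(fused_relu(A, B, D0))
# [[2.5, 0.0], [0.0, 3.5]]
```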
## Output Structure

```
generated_kernels/
├── gemm_fp16_rcr_compv4_..._128x128x32_....hpp
├── gemm_fp16_rcr_compv4_..._preshuffle.hpp
├── gemm_fp16_rcr_compv4_..._multid_Relu_d1.hpp
└── ...
```
## Configuration Files

### arch_specs.json

GPU architecture specifications (single source of truth):

```json
{
  "architectures": {
    "gfx942": {
      "family": "cdna3",
      "warp_size": 64,
      "warp_configs": [[2, 2, 1], [4, 4, 1]],
      ...
    }
  }
}
```
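Tooling that consumes these specs can read them with the standard library. A hedged sketch that parses the example fragment above (the elided `...` fields are dropped here so the JSON stays valid; fields beyond those shown are not assumed):

```python
import json

# Parse the arch_specs.json fragment shown above. Only the fields
# from the example are used; anything else in the real file is ignored.
spec_text = """
{
  "architectures": {
    "gfx942": {
      "family": "cdna3",
      "warp_size": 64,
      "warp_configs": [[2, 2, 1], [4, 4, 1]]
    }
  }
}
"""
specs = json.loads(spec_text)
gfx942 = specs["architectures"]["gfx942"]
print(gfx942["family"], gfx942["warp_size"])  # cdna3 64
for m, n, k in gfx942["warp_configs"]:
    print(f"warp config: {m}x{n}x{k}")
```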
### preselected_kernels.py

Curated kernel sets for common use cases.
## Adding New GPU Support

See ADDING_NEW_GPU.md for the complete guide.

Quick steps:

1. Edit `arch_specs.json`
2. Run `python generate_arch_specs.py`
3. Rebuild
## Troubleshooting

| Issue | Solution |
|---|---|
| "Arguments not supported" | Check tile config validity |
| Missing element-wise op | Check `elementwise_ops.hpp` |
| Compilation errors | Verify C++17 and include paths |
**More info:** See `../README.md` for full documentation.