Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-05-14 10:09:41 +00:00)
[ROCm/composable_kernel commit: 9e049a32a1]
# CK Tile GEMM Unified Code Generator

Single source of truth for all GEMM kernel generation.

See also: the main Dispatcher README for installation and core concepts.
## Quick Start

```bash
cd dispatcher/codegen

# Generate standard FP16 kernels
python3 unified_gemm_codegen.py \
    --output-dir ../build/generated_kernels \
    --datatype fp16 \
    --layout rcr \
    --variants standard

# Generate all variants
python3 unified_gemm_codegen.py \
    --output-dir ../build/generated_kernels \
    --variants standard preshuffle multi_d
```
## Using from Python

```python
from ctypes_utils import CodegenRunner, KernelConfig

# Generate from a specific config
config = KernelConfig(tile_m=256, tile_n=256, tile_k=64)
codegen = CodegenRunner()
result = codegen.generate_from_config(config)

# Generate a single variant
result = codegen.generate("preshuffle")

# Generate all variants
results = codegen.generate_all()
```
## Command Line Options

| Option | Values | Description |
|---|---|---|
| `--output-dir` | path | Output directory |
| `--datatype` | `fp16`, `bf16`, `fp32`, `int8` | Data type |
| `--layout` | `rcr`, `rrr`, `crr`, `ccr` | Matrix layouts |
| `--gpu-target` | `gfx942`, `gfx90a`, `gfx950` | Target GPU |
| `--variants` | `standard`, `preshuffle`, `multi_d` | Kernel variants |
| `--preselected` | `fp16_rcr_essential`, etc. | Predefined kernel set |
## Layout Notation

- `R` = Row-major, `C` = Column-major
- Order: A, B, C (e.g., `rcr` = A row-major, B column-major, C row-major)
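The notation above can be restated in a few lines of Python. This is a standalone sketch; the helper name `decode_layout` is illustrative and not part of the codegen API:

```python
# Illustrative helper: map a 3-letter layout string ("rcr", "ccr", ...)
# to per-matrix layouts. Not part of the codegen API -- just a restatement
# of the R/C notation in the order A, B, C.
def decode_layout(layout: str) -> dict:
    names = {"r": "row-major", "c": "column-major"}
    a, b, c = layout.lower()
    return {"A": names[a], "B": names[b], "C": names[c]}

print(decode_layout("rcr"))
# {'A': 'row-major', 'B': 'column-major', 'C': 'row-major'}
```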
## Variants

### Standard

Basic GEMM: C = A × B

### PreShuffle

Optimized weight access with LDS pre-shuffling. Best for large matrices.

### Multi-D

Element-wise fusion: C = op(A × B + D0 + D1 + ...)

Supported ops: `PassThrough`, `MultiDAdd`, `Relu`, `Gelu`, `Sigmoid`, `Tanh`
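As a reference for the Multi-D semantics, here is a minimal pure-Python sketch of C = op(A × B + D0 + D1 + ...), using Relu as the fused op. The generated kernels compute this on the GPU; this only illustrates the math:

```python
# Reference semantics of the Multi-D variant: C = op(A x B + D0 + D1 + ...).
# Pure-Python illustration only, not the generated kernel code.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def multi_d(A, B, ds, op=lambda x: x):
    C = matmul(A, B)
    for D in ds:  # element-wise add of each extra D tensor
        C = [[c + d for c, d in zip(crow, drow)] for crow, drow in zip(C, D)]
    return [[op(c) for c in row] for row in C]  # fused element-wise op

relu = lambda x: max(x, 0.0)
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]   # identity, so A x B == A
D0 = [[-10.0, 1.0], [1.0, -10.0]]
print(multi_d(A, B, [D0], op=relu))
# [[0.0, 3.0], [4.0, 0.0]]
```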
## Output Structure

```
generated_kernels/
├── gemm_fp16_rcr_compv4_..._128x128x32_....hpp
├── gemm_fp16_rcr_compv4_..._preshuffle.hpp
├── gemm_fp16_rcr_compv4_..._multid_Relu_d1.hpp
└── ...
```
## Configuration Files

### arch_specs.json

GPU architecture specifications (single source of truth):

```json
{
  "architectures": {
    "gfx942": {
      "family": "cdna3",
      "warp_size": 64,
      "warp_configs": [[2, 2, 1], [4, 4, 1]],
      ...
    }
  }
}
```

### preselected_kernels.py

Curated kernel sets for common use cases.
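A quick way to inspect the specs from Python. This sketch embeds only the gfx942 fields shown above; in practice you would `json.load` the real `arch_specs.json`:

```python
import json

# Subset of arch_specs.json as shown above; in practice:
#   with open("arch_specs.json") as f: specs = json.load(f)
specs = json.loads("""
{
  "architectures": {
    "gfx942": {
      "family": "cdna3",
      "warp_size": 64,
      "warp_configs": [[2, 2, 1], [4, 4, 1]]
    }
  }
}
""")

# Print each architecture with its family, warp size, and warp configs.
for arch, info in specs["architectures"].items():
    print(arch, info["family"], "warp_size =", info["warp_size"])
    for m, n, k in info["warp_configs"]:
        print("  warp config:", m, "x", n, "x", k)
```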
## Adding New GPU Support

See ADDING_NEW_GPU.md for the complete guide.

Quick steps:

1. Edit `arch_specs.json`
2. Run `python generate_arch_specs.py`
3. Rebuild
## Troubleshooting

| Issue | Solution |
|---|---|
| "Arguments not supported" | Check tile config validity |
| Missing element-wise op | Check `elementwise_ops.hpp` |
| Compilation errors | Verify C++17 and include paths |
More info: see `../README.md` for full documentation.