Files
composable_kernel/dispatcher/examples/gemm/cpp/README.md
Vidyasagar Ananthan 9e049a32a1 Adding dispatcher architecture (#3300)
* WIP POC of dispatcher

* Dispatcher python workflow setup.

* Dispatcher cleanup and updates.

Further dispatcher cleanup and updates.

Build fixes

Improvements and python to CK example

Improvements to readme

* Fixes to python paths

* Cleaning up code

* Improving dispatcher support for different arch

Fixing typos

* Fix formatting errors

* Cleaning up examples

* Improving codegeneration

* Improving and fixing C++ examples

* Adding conv functionality (fwd,bwd,bwdw) and examples.

* Fixes based on feedback.

* Further fixes based on feedback.

* Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug.

* Another round of improvements  based on feedback.

* Trimming out unnecessary code.

* Fixing the multi-D implementation.

* Using gpu verification for gemms and fixing convolutions tflops calculation.

* Fix counter usage issue and arch filtering per ops.

* Adding changelog and other fixes.

* Improve examples and resolve critical bugs.

* Reduce build time for python examples.

* Fixing minor bug.

* Fix compilation error.

* Improve installation instructions for dispatcher.

* Add docker based  installation instructions for dispatcher.

* Fixing arch-based filtering to match tile engine.

* Remove dead code and fix arch filtering.

* Minor bugfix.

* Updates after rebase.

* Trimming code.

* Fix copyright headers.

* Consolidate examples, cut down code.

* Minor fixes.

* Improving python examples.

* Update readmes.

* Remove conv functionality.

* Cleanup following conv removable.
2026-01-22 09:34:33 -08:00

230 lines
7.4 KiB
Markdown

# GEMM C++ Examples
CK Tile Dispatcher C++ examples for GEMM (General Matrix Multiplication) operations.
> **Main Documentation**: [Dispatcher README](../../../README.md) | [Examples Overview](../../README.md)
## Quick Start
### Build and Run
```bash
cd /path/to/composable_kernel/dispatcher
mkdir -p build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DBUILD_DISPATCHER_EXAMPLES=ON
# Build (kernels generated automatically by CMake)
make -j$(nproc)
# Run examples
cd examples
./gemm_01_basic
./gemm_03_benchmark_validation
./gemm_04_heuristics
```
## Examples
| Example | Description | Complexity |
|---------|-------------|------------|
| [01_basic_gemm.cpp](01_basic_gemm.cpp) | Basic GEMM with declarative API, autofill, autocorrect | ★☆☆☆☆ |
| [02_multi_size.cpp](02_multi_size.cpp) | Wildcard expansion for multiple configurations | ★★☆☆☆ |
| [03_benchmark_validation.cpp](03_benchmark_validation.cpp) | Performance benchmarking with CPU reference validation | ★★☆☆☆ |
| [04_heuristics.cpp](04_heuristics.cpp) | Heuristic-based kernel selection | ★★★☆☆ |
| [05_json_export.cpp](05_json_export.cpp) | Registry JSON export for external tools | ★★☆☆☆ |
| [06_multi_registry.cpp](06_multi_registry.cpp) | Multiple registries with named kernel sets | ★★★☆☆ |
## Example Details
### 01_basic_gemm.cpp - Basic GEMM
Demonstrates the declarative kernel API with three patterns:
1. **Autofill Pattern** - Minimal specification, defaults filled automatically
2. **Autocorrect Pattern** - Invalid parameters corrected at build time
3. **Full Specification Pattern** - Complete kernel configuration
```cpp
DECL_KERNEL_SET(basic_kernels,
// Pattern 1: Autofill - minimal specification
.add(
Signature().dtype("fp16").layout("rcr"),
Algorithm(), // Defaults filled by autofill
"gfx942"
)
// Pattern 2: Full specification
.add(
Signature().dtype("fp16").layout("rcr"),
Algorithm().tile(256, 256, 32).wave(2, 2, 1).warp(32, 32, 16)
.pipeline("compv4").scheduler("intrawave"),
"gfx942"
)
);
```
**Features:**
- Uses generic `REGISTER_GENERATED_KERNELS` macro
- `print_registered_kernels()` utility for debugging
- Demonstrates autofill messages during build
### 02_multi_size.cpp - Wildcard Expansion
Demonstrates automatic generation of multiple kernel configurations:
```cpp
DECL_KERNEL_SET(multi_kernels,
.add(
Signature().dtype("fp16").layout("rcr"),
Algorithm().tile(*, *, 32) // Wildcard tile M and N
.wave(2, 2, 1)
.warp(32, 32, 16)
.pipeline("compv4")
.scheduler("intrawave"),
"gfx942"
)
);
```
**Wildcard Values:**
- `*`, `-1`, or `ANY_INT` expand to all valid configurations
- Architecture filter prunes invalid combinations automatically
- Example generates 5 valid kernels after arch filtering (from 7 expansions)
### 03_benchmark_validation.cpp - Benchmark + Validation
Consolidated example combining performance benchmarking with correctness validation:
```bash
# Benchmark only
./gemm_03_benchmark_validation --warmup 10 --iterations 100
# With CPU validation
./gemm_03_benchmark_validation --verify 1 --rtol 1e-3 --atol 1e-3
# With GPU reference validation (faster for large matrices)
./gemm_03_benchmark_validation --verify 2
```
**Features:**
- Warmup iterations (discarded from timing)
- Benchmark iterations with statistics (min/max/mean/median)
- CPU reference validation using `ck_tile::reference_gemm`
- GPU reference validation using `ck_tile::reference_gemm_gpu`
- Configurable tolerances
### 04_heuristics.cpp - Heuristic Selection
Demonstrates custom kernel selection based on problem characteristics:
```cpp
// Problem size analysis
auto heuristic = [](const Problem& p) -> std::optional<KernelKey> {
if (p.M() * p.N() < 256 * 256) {
return small_kernel_key; // Memory-bound heuristic
} else {
return large_kernel_key; // Compute-bound heuristic
}
};
dispatcher.set_heuristic(heuristic);
```
**Features:**
- Problem size analysis (small vs large matrices)
- Compute-bound vs memory-bound selection
- Custom heuristic function registration
### 05_json_export.cpp - JSON Export
Exports registry information to JSON for external tool integration:
```cpp
auto json = registry.to_json();
std::ofstream file("kernels.json");
file << json;
```
**Use Cases:**
- Kernel metadata serialization
- External analysis tools
- Configuration management
### 06_multi_registry.cpp - Multiple Registries
Demonstrates using multiple registries with named kernel sets:
```cpp
// Define separate kernel sets
DECL_KERNEL_SET(compute_optimized, ...);
DECL_KERNEL_SET(latency_optimized, ...);
// Register to specific registries
Registry compute_registry, latency_registry;
REGISTER_KERNEL_SET(compute_optimized, compute_registry);
REGISTER_KERNEL_SET(latency_optimized, latency_registry);
// Use appropriate registry based on workload
Dispatcher compute_dispatcher(compute_registry);
Dispatcher latency_dispatcher(latency_registry);
```
**Features:**
- Named kernel set registration with `REGISTER_KERNEL_SET` macro
- Separate registries for different optimization goals
- Dynamic kernel set selection by name
## Benchmark Parameters (stream_config)
CK Tile uses `stream_config` for benchmark control:
```cpp
ck_tile::stream_config cfg{
nullptr, // stream_id - HIP stream (nullptr = default)
true, // time_kernel - Enable timing
1, // log_level - Verbosity (0=quiet, 1=normal)
5, // cold_niters - Warmup iterations
20, // nrepeat - Benchmark iterations
true, // is_gpu_timer - Use GPU events vs CPU chrono
false, // flush_cache - Flush L2 cache between iterations
1 // rotating_count - Rotating buffers for cache simulation
};
```
| Parameter | CLI Option | Default | Description |
|-----------|------------|---------|-------------|
| `cold_niters_` | `--warmup` | 5 | Warmup iterations |
| `nrepeat_` | `--iterations` | 100 | Benchmark iterations |
| `flush_cache_` | - | false | Flush L2 cache |
| `rotating_count_` | - | 1 | Rotating buffers |
| `is_gpu_timer_` | - | true | GPU timer vs CPU |
## Declarative Kernel Pattern
All examples use the declarative `DECL_KERNEL_SET` macro:
```cpp
DECL_KERNEL_SET(my_kernels,
.add(
Signature() // WHAT: operation signature
.dtype("fp16") // Data type
.layout("rcr"), // Matrix layouts (A=row, B=col, C=row)
Algorithm() // HOW: implementation details
.tile(256, 256, 32) // Tile sizes (M, N, K)
.wave(2, 2, 1) // Wave configuration
.warp(32, 32, 16) // Warp tile sizes
.pipeline("compv4") // Pipeline type
.scheduler("intrawave"), // Scheduler type
"gfx942" // WHERE: target architecture
)
);
```
**Key Macros:**
- `DECL_KERNEL_SET(name, ...)` - Declare a kernel set
- `REGISTER_GENERATED_KERNELS` - Register all kernels from this example
- `REGISTER_KERNEL_SET(name, registry)` - Register specific kernel set to a registry
## Related Documentation
- [Python GEMM Examples](../python/README.md)
- [Convolution Examples](../../conv/cpp/README.md)
- [Main Dispatcher README](../../../README.md)