mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-03-22 16:17:37 +00:00
* WIP POC of dispatcher * Dispatcher python workflow setup. * Dispatcher cleanup and updates. Further dispatcher cleanup and updates. Build fixes Improvements and python to CK example Improvements to readme * Fixes to python paths * Cleaning up code * Improving dispatcher support for different arch Fixing typos * Fix formatting errors * Cleaning up examples * Improving codegeneration * Improving and fixing C++ examples * Adding conv functionality (fwd,bwd,bwdw) and examples. * Fixes based on feedback. * Further fixes based on feedback. * Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug. * Another round of improvements based on feedback. * Trimming out unnecessary code. * Fixing the multi-D implementation. * Using gpu verification for gemms and fixing convolutions tflops calculation. * Fix counter usage issue and arch filtering per ops. * Adding changelog and other fixes. * Improve examples and resolve critical bugs. * Reduce build time for python examples. * Fixing minor bug. * Fix compilation error. * Improve installation instructions for dispatcher. * Add docker based installation instructions for dispatcher. * Fixing arch-based filtering to match tile engine. * Remove dead code and fix arch filtering. * Minor bugfix. * Updates after rebase. * Trimming code. * Fix copyright headers. * Consolidate examples, cut down code. * Minor fixes. * Improving python examples. * Update readmes. * Remove conv functionality. * Cleanup following conv removable.
7.4 KiB
7.4 KiB
GEMM C++ Examples
CK Tile Dispatcher C++ examples for GEMM (General Matrix Multiplication) operations.
Main Documentation: Dispatcher README | Examples Overview
Quick Start
Build and Run
cd /path/to/composable_kernel/dispatcher
mkdir -p build && cd build
cmake .. \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DBUILD_DISPATCHER_EXAMPLES=ON
# Build (kernels generated automatically by CMake)
make -j$(nproc)
# Run examples
cd examples
./gemm_01_basic
./gemm_03_benchmark_validation
./gemm_04_heuristics
Examples
| Example | Description | Complexity |
|---|---|---|
| 01_basic_gemm.cpp | Basic GEMM with declarative API, autofill, autocorrect | ★☆☆☆☆ |
| 02_multi_size.cpp | Wildcard expansion for multiple configurations | ★★☆☆☆ |
| 03_benchmark_validation.cpp | Performance benchmarking with CPU reference validation | ★★☆☆☆ |
| 04_heuristics.cpp | Heuristic-based kernel selection | ★★★☆☆ |
| 05_json_export.cpp | Registry JSON export for external tools | ★★☆☆☆ |
| 06_multi_registry.cpp | Multiple registries with named kernel sets | ★★★☆☆ |
Example Details
01_basic_gemm.cpp - Basic GEMM
Demonstrates the declarative kernel API with three patterns:
- Autofill Pattern - Minimal specification, defaults filled automatically
- Autocorrect Pattern - Invalid parameters corrected at build time
- Full Specification Pattern - Complete kernel configuration
DECL_KERNEL_SET(basic_kernels,
// Pattern 1: Autofill - minimal specification
.add(
Signature().dtype("fp16").layout("rcr"),
Algorithm(), // Defaults filled by autofill
"gfx942"
)
// Pattern 2: Full specification
.add(
Signature().dtype("fp16").layout("rcr"),
Algorithm().tile(256, 256, 32).wave(2, 2, 1).warp(32, 32, 16)
.pipeline("compv4").scheduler("intrawave"),
"gfx942"
)
);
Features:
- Uses generic
REGISTER_GENERATED_KERNELSmacro print_registered_kernels()utility for debugging- Demonstrates autofill messages during build
02_multi_size.cpp - Wildcard Expansion
Demonstrates automatic generation of multiple kernel configurations:
DECL_KERNEL_SET(multi_kernels,
.add(
Signature().dtype("fp16").layout("rcr"),
Algorithm().tile(*, *, 32) // Wildcard tile M and N
.wave(2, 2, 1)
.warp(32, 32, 16)
.pipeline("compv4")
.scheduler("intrawave"),
"gfx942"
)
);
Wildcard Values:
*,-1, orANY_INTexpand to all valid configurations- Architecture filter prunes invalid combinations automatically
- Example generates 5 valid kernels after arch filtering (from 7 expansions)
03_benchmark_validation.cpp - Benchmark + Validation
Consolidated example combining performance benchmarking with correctness validation:
# Benchmark only
./gemm_03_benchmark_validation --warmup 10 --iterations 100
# With CPU validation
./gemm_03_benchmark_validation --verify 1 --rtol 1e-3 --atol 1e-3
# With GPU reference validation (faster for large matrices)
./gemm_03_benchmark_validation --verify 2
Features:
- Warmup iterations (discarded from timing)
- Benchmark iterations with statistics (min/max/mean/median)
- CPU reference validation using
ck_tile::reference_gemm - GPU reference validation using
ck_tile::reference_gemm_gpu - Configurable tolerances
04_heuristics.cpp - Heuristic Selection
Demonstrates custom kernel selection based on problem characteristics:
// Problem size analysis
auto heuristic = [](const Problem& p) -> std::optional<KernelKey> {
if (p.M() * p.N() < 256 * 256) {
return small_kernel_key; // Memory-bound heuristic
} else {
return large_kernel_key; // Compute-bound heuristic
}
};
dispatcher.set_heuristic(heuristic);
Features:
- Problem size analysis (small vs large matrices)
- Compute-bound vs memory-bound selection
- Custom heuristic function registration
05_json_export.cpp - JSON Export
Exports registry information to JSON for external tool integration:
auto json = registry.to_json();
std::ofstream file("kernels.json");
file << json;
Use Cases:
- Kernel metadata serialization
- External analysis tools
- Configuration management
06_multi_registry.cpp - Multiple Registries
Demonstrates using multiple registries with named kernel sets:
// Define separate kernel sets
DECL_KERNEL_SET(compute_optimized, ...);
DECL_KERNEL_SET(latency_optimized, ...);
// Register to specific registries
Registry compute_registry, latency_registry;
REGISTER_KERNEL_SET(compute_optimized, compute_registry);
REGISTER_KERNEL_SET(latency_optimized, latency_registry);
// Use appropriate registry based on workload
Dispatcher compute_dispatcher(compute_registry);
Dispatcher latency_dispatcher(latency_registry);
Features:
- Named kernel set registration with
REGISTER_KERNEL_SETmacro - Separate registries for different optimization goals
- Dynamic kernel set selection by name
Benchmark Parameters (stream_config)
CK Tile uses stream_config for benchmark control:
ck_tile::stream_config cfg{
nullptr, // stream_id - HIP stream (nullptr = default)
true, // time_kernel - Enable timing
1, // log_level - Verbosity (0=quiet, 1=normal)
5, // cold_niters - Warmup iterations
20, // nrepeat - Benchmark iterations
true, // is_gpu_timer - Use GPU events vs CPU chrono
false, // flush_cache - Flush L2 cache between iterations
1 // rotating_count - Rotating buffers for cache simulation
};
| Parameter | CLI Option | Default | Description |
|---|---|---|---|
cold_niters_ |
--warmup |
5 | Warmup iterations |
nrepeat_ |
--iterations |
100 | Benchmark iterations |
flush_cache_ |
- | false | Flush L2 cache |
rotating_count_ |
- | 1 | Rotating buffers |
is_gpu_timer_ |
- | true | GPU timer vs CPU |
Declarative Kernel Pattern
All examples use the declarative DECL_KERNEL_SET macro:
DECL_KERNEL_SET(my_kernels,
.add(
Signature() // WHAT: operation signature
.dtype("fp16") // Data type
.layout("rcr"), // Matrix layouts (A=row, B=col, C=row)
Algorithm() // HOW: implementation details
.tile(256, 256, 32) // Tile sizes (M, N, K)
.wave(2, 2, 1) // Wave configuration
.warp(32, 32, 16) // Warp tile sizes
.pipeline("compv4") // Pipeline type
.scheduler("intrawave"), // Scheduler type
"gfx942" // WHERE: target architecture
)
);
Key Macros:
DECL_KERNEL_SET(name, ...)- Declare a kernel setREGISTER_GENERATED_KERNELS- Register all kernels from this exampleREGISTER_KERNEL_SET(name, registry)- Register specific kernel set to a registry