Files
Vidyasagar Ananthan 9e049a32a1 Adding dispatcher architecture (#3300)
* WIP POC of dispatcher

* Dispatcher python workflow setup.

* Dispatcher cleanup and updates.

Further dispatcher cleanup and updates.

Build fixes

Improvements and python to CK example

Improvements to readme

* Fixes to python paths

* Cleaning up code

* Improving dispatcher support for different arch

Fixing typos

* Fix formatting errors

* Cleaning up examples

* Improving codegeneration

* Improving and fixing C++ examples

* Adding conv functionality (fwd,bwd,bwdw) and examples.

* Fixes based on feedback.

* Further fixes based on feedback.

* Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug.

* Another round of improvements  based on feedback.

* Trimming out unnecessary code.

* Fixing the multi-D implementation.

* Using gpu verification for gemms and fixing convolutions tflops calculation.

* Fix counter usage issue and arch filtering per ops.

* Adding changelog and other fixes.

* Improve examples and resolve critical bugs.

* Reduce build time for python examples.

* Fixing minor bug.

* Fix compilation error.

* Improve installation instructions for dispatcher.

* Add docker based  installation instructions for dispatcher.

* Fixing arch-based filtering to match tile engine.

* Remove dead code and fix arch filtering.

* Minor bugfix.

* Updates after rebase.

* Trimming code.

* Fix copyright headers.

* Consolidate examples, cut down code.

* Minor fixes.

* Improving python examples.

* Update readmes.

* Remove conv functionality.

* Cleanup following conv removable.
2026-01-22 09:34:33 -08:00

7.4 KiB

GEMM C++ Examples

CK Tile Dispatcher C++ examples for GEMM (General Matrix Multiplication) operations.

Main Documentation: Dispatcher README | Examples Overview

Quick Start

Build and Run

cd /path/to/composable_kernel/dispatcher
mkdir -p build && cd build

cmake .. \
  -DCMAKE_PREFIX_PATH=/opt/rocm \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DBUILD_DISPATCHER_EXAMPLES=ON

# Build (kernels generated automatically by CMake)
make -j$(nproc)

# Run examples
cd examples
./gemm_01_basic
./gemm_03_benchmark_validation
./gemm_04_heuristics

Examples

Example Description Complexity
01_basic_gemm.cpp Basic GEMM with declarative API, autofill, autocorrect ★☆☆☆☆
02_multi_size.cpp Wildcard expansion for multiple configurations ★★☆☆☆
03_benchmark_validation.cpp Performance benchmarking with CPU reference validation ★★☆☆☆
04_heuristics.cpp Heuristic-based kernel selection ★★★☆☆
05_json_export.cpp Registry JSON export for external tools ★★☆☆☆
06_multi_registry.cpp Multiple registries with named kernel sets ★★★☆☆

Example Details

01_basic_gemm.cpp - Basic GEMM

Demonstrates the declarative kernel API with three patterns:

  1. Autofill Pattern - Minimal specification, defaults filled automatically
  2. Autocorrect Pattern - Invalid parameters corrected at build time
  3. Full Specification Pattern - Complete kernel configuration
DECL_KERNEL_SET(basic_kernels,
    // Pattern 1: Autofill - minimal specification
    .add(
        Signature().dtype("fp16").layout("rcr"),
        Algorithm(),  // Defaults filled by autofill
        "gfx942"
    )
    // Pattern 2: Full specification
    .add(
        Signature().dtype("fp16").layout("rcr"),
        Algorithm().tile(256, 256, 32).wave(2, 2, 1).warp(32, 32, 16)
                   .pipeline("compv4").scheduler("intrawave"),
        "gfx942"
    )
);

Features:

  • Uses generic REGISTER_GENERATED_KERNELS macro
  • print_registered_kernels() utility for debugging
  • Demonstrates autofill messages during build

02_multi_size.cpp - Wildcard Expansion

Demonstrates automatic generation of multiple kernel configurations:

DECL_KERNEL_SET(multi_kernels,
    .add(
        Signature().dtype("fp16").layout("rcr"),
        Algorithm().tile(*, *, 32)     // Wildcard tile M and N
                   .wave(2, 2, 1)
                   .warp(32, 32, 16)
                   .pipeline("compv4")
                   .scheduler("intrawave"),
        "gfx942"
    )
);

Wildcard Values:

  • *, -1, or ANY_INT expand to all valid configurations
  • Architecture filter prunes invalid combinations automatically
  • Example generates 5 valid kernels after arch filtering (from 7 expansions)

03_benchmark_validation.cpp - Benchmark + Validation

Consolidated example combining performance benchmarking with correctness validation:

# Benchmark only
./gemm_03_benchmark_validation --warmup 10 --iterations 100

# With CPU validation
./gemm_03_benchmark_validation --verify 1 --rtol 1e-3 --atol 1e-3

# With GPU reference validation (faster for large matrices)
./gemm_03_benchmark_validation --verify 2

Features:

  • Warmup iterations (discarded from timing)
  • Benchmark iterations with statistics (min/max/mean/median)
  • CPU reference validation using ck_tile::reference_gemm
  • GPU reference validation using ck_tile::reference_gemm_gpu
  • Configurable tolerances

04_heuristics.cpp - Heuristic Selection

Demonstrates custom kernel selection based on problem characteristics:

// Problem size analysis
auto heuristic = [](const Problem& p) -> std::optional<KernelKey> {
    if (p.M() * p.N() < 256 * 256) {
        return small_kernel_key;   // Memory-bound heuristic
    } else {
        return large_kernel_key;   // Compute-bound heuristic
    }
};

dispatcher.set_heuristic(heuristic);

Features:

  • Problem size analysis (small vs large matrices)
  • Compute-bound vs memory-bound selection
  • Custom heuristic function registration

05_json_export.cpp - JSON Export

Exports registry information to JSON for external tool integration:

auto json = registry.to_json();
std::ofstream file("kernels.json");
file << json;

Use Cases:

  • Kernel metadata serialization
  • External analysis tools
  • Configuration management

06_multi_registry.cpp - Multiple Registries

Demonstrates using multiple registries with named kernel sets:

// Define separate kernel sets
DECL_KERNEL_SET(compute_optimized, ...);
DECL_KERNEL_SET(latency_optimized, ...);

// Register to specific registries
Registry compute_registry, latency_registry;
REGISTER_KERNEL_SET(compute_optimized, compute_registry);
REGISTER_KERNEL_SET(latency_optimized, latency_registry);

// Use appropriate registry based on workload
Dispatcher compute_dispatcher(compute_registry);
Dispatcher latency_dispatcher(latency_registry);

Features:

  • Named kernel set registration with REGISTER_KERNEL_SET macro
  • Separate registries for different optimization goals
  • Dynamic kernel set selection by name

Benchmark Parameters (stream_config)

CK Tile uses stream_config for benchmark control:

ck_tile::stream_config cfg{
    nullptr,    // stream_id       - HIP stream (nullptr = default)
    true,       // time_kernel     - Enable timing
    1,          // log_level       - Verbosity (0=quiet, 1=normal)
    5,          // cold_niters     - Warmup iterations
    20,         // nrepeat         - Benchmark iterations
    true,       // is_gpu_timer    - Use GPU events vs CPU chrono
    false,      // flush_cache     - Flush L2 cache between iterations
    1           // rotating_count  - Rotating buffers for cache simulation
};
Parameter CLI Option Default Description
cold_niters_ --warmup 5 Warmup iterations
nrepeat_ --iterations 100 Benchmark iterations
flush_cache_ - false Flush L2 cache
rotating_count_ - 1 Rotating buffers
is_gpu_timer_ - true GPU timer vs CPU

Declarative Kernel Pattern

All examples use the declarative DECL_KERNEL_SET macro:

DECL_KERNEL_SET(my_kernels,
    .add(
        Signature()               // WHAT: operation signature
            .dtype("fp16")        // Data type
            .layout("rcr"),       // Matrix layouts (A=row, B=col, C=row)
        Algorithm()               // HOW: implementation details  
            .tile(256, 256, 32)   // Tile sizes (M, N, K)
            .wave(2, 2, 1)        // Wave configuration
            .warp(32, 32, 16)     // Warp tile sizes
            .pipeline("compv4")   // Pipeline type
            .scheduler("intrawave"), // Scheduler type
        "gfx942"                  // WHERE: target architecture
    )
);

Key Macros:

  • DECL_KERNEL_SET(name, ...) - Declare a kernel set
  • REGISTER_GENERATED_KERNELS - Register all kernels from this example
  • REGISTER_KERNEL_SET(name, registry) - Register specific kernel set to a registry