The CK Tile Engine GEMM module provides a comprehensive system for generating, building, and benchmarking GEMM (General Matrix Multiplication) kernels with various configurations. It supports multiple data types, layouts, and optimization strategies. The system has evolved from a monolithic build approach (where all kernels compile into a single executable) to a more flexible individual kernel compilation system, providing better build parallelism and targeted testing capabilities.

Build System Architecture
Build Instructions
Running Benchmarks
Configuration System
Scripts and Tools
Command Line Options
Understanding Kernel Names
Troubleshooting
Performance Tips

Build System Architecture

Individual Kernel Compilation (New Approach)

The new tile engine benchmark system compiles each kernel configuration into a separate executable. This provides:

Better build parallelism
Faster incremental builds
More targeted testing
Easier debugging of specific configurations

Each benchmark executable follows the naming pattern:

benchmark_gemm_<dtype>_<layout>_<config>_<tile_sizes>

Monolithic Build (Legacy Approach)

The original system compiles all kernels into a single executable (benchmark_gemm_[Datatype]_[Layout]), which can then be filtered at runtime using command-line arguments.

Build Instructions

Prerequisites

ROCm installation
CMake 3.16 or higher
C++17 compatible compiler

Basic Build

# In the root of composable kernel, create build directory
mkdir build && cd build

# Configure with specific datatypes and layouts
# Replace [Arch] with your GPU architecture (e.g., gfx90a, gfx942)
# Replace [Datatype1;Datatype2;...] with datatypes (fp8, bf8, int8, fp16, bf16, fp32, fp64)
# Replace [Layout1;Layout2;...] with layouts (rcr, rrr, crr, ccr)
../script/cmake-ck-dev.sh ../ [Arch] -DGEMM_DATATYPE="[Datatype1;Datatype2]" -DGEMM_LAYOUT="[Layout1;Layout2]"

# Build specific benchmarks
make benchmark_gemm_[Datatype1]_[Layout1] -j

Configuration Options

The build system supports several configuration options:

Using Custom Config Files

# Method 1: CMake variable (config file must be in configs/ directory)
cmake -DGEMM_CONFIG_FILE=my_custom_config.json ...

# Method 2: Environment variable (takes precedence over CMake variable)
export GEMM_CONFIG_FILE=my_custom_config.json
cmake ...

Config File Priority Order

Environment variable GEMM_CONFIG_FILE (highest priority)
CMake variable GEMM_CONFIG_FILE
Default config (default_config.json for all layouts)

Note: All custom config files must be placed in the tile_engine/ops/gemm/configs/ directory.

Example Build Commands

# Build for gfx942 with fp8 and fp16 datatypes, rcr layout
mkdir build && cd build
../script/cmake-ck-dev.sh ../ gfx942 -DGEMM_DATATYPE="fp8;fp16" -DGEMM_LAYOUT="rcr;ccr;rrr;crr"
make benchmark_gemm_fp8_rcr -j
make benchmark_gemm_fp16_rcr -j

Building Individual Kernels

# Build a specific kernel configuration
make benchmark_gemm_fp8_rcr_compv4_default_intrawave_False_False_False_False_256x256x32_1x4x1_32x32x32

# Build all fp16 benchmarks in parallel
make -j$(nproc) $(make help | grep benchmark_gemm_fp16 | awk '{print $2}')

Rebuilding After Configuration Changes

If you modify the configuration file, you must rebuild:

rm -rf tile_engine/ && make benchmark_gemm_[Datatype]_[Layout] -j

Running Benchmarks

Individual Kernel Execution

cd /path/to/build/directory
./bin/benchmark_gemm_fp16_rcr_compv3_default_intrawave_False_False_False_False_256x128x32_4x1x1_32x32x16 \
    -m=512 -n=512 -k=512 -verify=1

Monolithic Executable (Legacy)

# Run specific pipeline/scheduler/epilogue combination
./bin/benchmark_gemm_[Datatype]_[Layout] -pipeline=compv3 -scheduler=intrawave -epilogue=default

Automated Testing

Use the provided test script to run multiple benchmarks:

cd /path/to/composable_kernel/tile_engine/ops/gemm
./test_benchmark.sh [build_directory]

Configuration System

Configuration Files

The system uses JSON configuration files to specify kernel parameters:

configs/default_config.json - Default configurations for various datatypes
configs/user_provided_config.json - User-customizable configurations

Configuration Structure

{
    "tile_config": {
        "tile_m": {"values": [256, 128]},
        "tile_n": {"values": [256, 128]},
        "tile_k": {"values": [64, 32]},
        "warp_m": {"values": [2, 4]},
        "warp_n": {"values": [2, 1]},
        "warp_k": {"values": [1]},
        "warp_tile_m": {"values": [32, 16]},
        "warp_tile_n": {"values": [32, 16]},
        "warp_tile_k": {"values": [16, 32]}
    },
    "trait_config": {
        "pipeline": {"values": ["compv3", "compv4", "mem"]},
        "scheduler": {"values": ["intrawave", "interwave"]},
        "epilogue": {"values": ["default", "cshuffle"]},
        "pad_m": {"values": [false]},
        "pad_n": {"values": [false]},
        "pad_k": {"values": [false]},
        "persistent": {"values": [false]}
    }
}

Scripts and Tools

Python Scripts

gemm_instance_builder.py

Purpose: Main kernel instance generation script that creates C++ kernel implementations based on configuration files.

Key Features:

Generates individual kernel header files for separate compilation
Supports multiple data types (fp16, fp8, bf16, fp32, fp64)
Validates tile configurations for correctness
Creates CMake integration files

Usage:

python gemm_instance_builder.py \
    --working_path ./generated \
    --datatype fp16 \
    --layout rcr \
    --config_json configs/user_provided_config.json \
    --gen_all_individual \
    --gpu_target gfx942

gemm_instance_builder_parallel.py

Purpose: Parallel version of the instance builder for faster generation of multiple kernel configurations.

Features:

Multi-threaded kernel generation
Improved performance for large configuration spaces

validation_utils.py

Purpose: Provides comprehensive validation functions for kernel configurations.

Key Functions:

is_tile_config_valid() - Validates tile dimensions and alignments
is_trait_combination_valid() - Checks if pipeline/epilogue/scheduler combinations are supported
validate_warp_tile_combination() - GPU-specific warp tile validation
validate_lds_capacity() - Ensures configurations fit in LDS memory

Validation Checks:

Dimension alignment (tile dimensions must be divisible by warp dimensions)
LDS capacity constraints
GPU-specific warp tile support
Unsupported trait combinations

test_validation.py

Purpose: Test suite for the validation logic to ensure correctness.

Usage:

python test_validation.py

Tests:

Warp tile combination validation
Trait combination validation
Full tile configuration validation

gemm_benchmark.py

Purpose: Python script for running and analyzing GEMM benchmarks.

Features:

Automated benchmark execution
Performance data collection
Result analysis and reporting

json_config.py

Purpose: Configuration file parsing and management.

Features:

JSON configuration loading
Default configuration handling
Configuration validation

codegen_utils.py

Purpose: Utility functions for code generation.

Features:

Template processing
Code formatting utilities
File generation helpers

Shell Scripts

test_benchmark.sh

Purpose: Automated benchmark testing script that finds and runs all built benchmark executables.

Features:

Automatic build directory detection
Batch execution of multiple benchmarks
CSV result collection
Colored output for easy reading
Example command generation

Usage:

# Auto-detect build directory
./test_benchmark.sh

# Specify build directory
./test_benchmark.sh /path/to/build/directory

What it does:

Finds all benchmark executables in the build directory
Runs each with multiple problem sizes (512, 1024, 2048)
Performs GPU verification
Saves results to timestamped CSV file
Provides summary statistics

Command Line Options

All benchmark executables support the following options:

Matrix Dimensions

-m=<value> - M dimension (default: 3840)
-n=<value> - N dimension (default: 4096)
-k=<value> - K dimension (default: 2048)

Strides

-stride_a=<value> - Stride for matrix A (default: 0, auto-calculated)
-stride_b=<value> - Stride for matrix B (default: 0, auto-calculated)
-stride_c=<value> - Stride for matrix C (default: 0, auto-calculated)

Verification

-verify=<0|1|2> - Verification mode
- 0: No verification (default)
- 1: CPU verification
- 2: GPU verification

Performance Testing

-warmup=<value> - Warmup iterations (default: 50)
-repeat=<value> - Benchmark iterations (default: 100)
-timer=<true|false> - Use GPU timer (default: true)
-flush_cache=<true|false> - Flush cache between runs (default: true)
-rotating_count=<value> - Cache rotation count (default: 1000)

Initialization

-init=<0|1|2> - Tensor initialization method
- 0: Random values [-1, 1] (default)
- 1: Linear sequence (i % 17)
- 2: Constant value (1.0)

Output Options

-log=<true|false> - Enable verbose logging (default: false)
-metric=<0|1|2> - Performance metric
- 0: Latency in ms (default)
- 1: TFLOPS
- 2: Bandwidth in GB/s
-json_output=<true|false> - JSON format output (default: false)
-csv_filename=<filename> - Save results to CSV
-csv_format=<simple|comprehensive> - CSV format (default: comprehensive)

Advanced Options

-split_k=<value> - Split-K factor (default: 1)
-structured_sparsity=<true|false> - Enable structured sparsity (default: false)
-pipeline=<compv3|compv4|mem> - Pipeline type (default: compv3)
-scheduler=<intrawave|interwave> - Scheduler type (default: intrawave)
-epilogue=<cshuffle|default> - Epilogue type (default: cshuffle)
-pad_m=<true|false> - Pad M dimension (default: false)
-pad_n=<true|false> - Pad N dimension (default: false)
-pad_k=<true|false> - Pad K dimension (default: false)
-persistent=<true|false> - Use persistent kernel (default: false)

Understanding Kernel Names

The kernel naming convention encodes the configuration:

benchmark_gemm_fp16_rcr_compv3_default_intrawave_False_False_False_False_256x128x32_4x1x1_32x32x16
               ^^^^  ^^^ ^^^^^^ ^^^^^^^ ^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^ ^^^^^^^ ^^^^^^^^^
               |     |   |      |       |         |                       |         |       |
               |     |   |      |       |         Padding & flags         |         |       Warp tile
               |     |   |      |       Scheduler                         |         Thread tile
               |     |   |      Epilogue                                  Block tile
               |     |   Pipeline
               |     Layout (Row-Column-Row)
               Data type

Components:

Data type: fp16, fp32, bf16, fp8, bf8, int8
Layout: rcr (Row-Column-Row), rrr, crr, ccr
Pipeline: mem, compv3, compv4
Epilogue: default, cshuffle
Scheduler: intrawave, interwave
Flags: pad_m, pad_n, pad_k, persistent (4 boolean flags)
Tile sizes: BlockTile x ThreadTile x WarpTile

Troubleshooting

Common Issues

Kernel not found
- Ensure the specific benchmark executable is built
- Check the build directory bin/ folder
Verification failures
- Try GPU verification (-verify=2) which may be more accurate
- Check data type compatibility
- Verify stride calculations
Build failures
- Check GPU architecture compatibility
- Ensure ROCm is properly installed
- Verify configuration file syntax
Performance variations
- Increase warmup iterations
- Disable CPU frequency scaling
- Use GPU timer for accurate measurements

Debug Options

Enable verbose logging:

./bin/benchmark_gemm_... -log=true -verify=1

Test validation logic:

python test_validation.py

Performance Tips

Optimal Problem Sizes: Use sizes that are multiples of tile dimensions
Warmup: Use at least 50-100 warmup iterations
GPU Timer: Always use -timer=true for accurate measurements
Cache Management: Enable cache flushing for consistent results
Thread Affinity: Set CPU affinity to reduce variation

Integration Examples

Python Integration

import subprocess
import json

# Run benchmark with JSON output
result = subprocess.run([
    './bin/benchmark_gemm_fp16_rcr_...', 
    '-m=1024', '-n=1024', '-k=1024',
    '-json_output=true'
], capture_output=True, text=True)

# Parse results
data = json.loads(result.stdout)
print(f"Performance: {data['tflops']} TFLOPS")

Batch Testing Script

#!/bin/bash
SIZES="512 1024 2048 4096"
for size in $SIZES; do
    echo "Testing ${size}x${size}x${size}"
    ./bin/benchmark_gemm_... -m=$size -n=$size -k=$size \
        -verify=2 -csv_filename=results.csv
done

Contributing

When adding new features or configurations:

Update validation logic in validation_utils.py
Add tests to test_validation.py
Update configuration examples
Document new command-line options

For more information about the Composable Kernel project, visit the main repository documentation.

README.md

CK Tile Engine GEMM Operations

Overview

Table of Contents

Build System Architecture

Individual Kernel Compilation (New Approach)

Monolithic Build (Legacy Approach)

Build Instructions

Prerequisites

Basic Build

Configuration Options

Using Custom Config Files

Config File Priority Order

Example Build Commands

Building Individual Kernels

Rebuilding After Configuration Changes

Running Benchmarks

Individual Kernel Execution

Monolithic Executable (Legacy)

Automated Testing

Configuration System

Configuration Files

Configuration Structure

Scripts and Tools

Python Scripts

gemm_instance_builder.py

gemm_instance_builder_parallel.py

validation_utils.py

test_validation.py

gemm_benchmark.py

json_config.py

codegen_utils.py

Shell Scripts

test_benchmark.sh

Command Line Options

Matrix Dimensions

Strides

Verification

Performance Testing

Initialization

Output Options

Advanced Options

Understanding Kernel Names

Components:

Troubleshooting

Common Issues

Debug Options

Performance Tips

Integration Examples

Python Integration

Batch Testing Script

Contributing