[CK Tile] Support multi-vector reads in static encoding patterns (#7528)

## Motivation

The thread-raked / warp-raked / block-raked static tile distribution patterns in `ck_tile` silently produce wrong results when the contiguous tile dimension is larger than `warp_size * vector_size`, because the encoding has no per-thread iteration dimension along X. Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in `ConvConfigComputeV3`, the grouped convolution backward-weight example reports about 50 percent wrong values, with errors starting exactly at the `X0*X1 = 64` boundary. The second pass over the contiguous dim is never performed.

This PR extends the encoding so multi-vector reads in the contiguous tile dimension are supported, while keeping every existing call site bit-for-bit identical.

## Technical Details

Three files changed.

### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`

Add a per-thread X iteration dimension in all three raked specializations:

- `X0 = min(warp_size, XPerTile / X1)`: threads in X dim
- `X1 = min(LargestVec, VecSize)`: vector size per access
- `X2 = XPerTile / (X0 * X1)`: number of X-iters per thread (new)

`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both `make_2d_static_tile_distribution()` and `make_shuffled_2d_static_tile_distribution()`. The new encoding places `X2` in the middle of the Ys iteration list, which preserves reverse symmetry between the regular `<..., X2, X1>` and shuffled `<X1, X2, ...>` encodings.

Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.

### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`

Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch alongside the existing `NDimY == 2` branch. The original branch is byte-for-byte unchanged. Both branches dispatch to the same `transpose_tile2d_impl_in_thread`, whose body has always been NDimY-generic (iterates with `static_for<0, NDimY, 1>` and `number<NDimY>{}`).

### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py`

Removed the two now-obsolete skip guards in `parse_bwd_weight_instances` and `parse_bwd_data_instances`:

```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```

Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating, irregular-load shapes, scalar-per-vector > tile size) are kept untouched.

### Compatibility

Strict. Every existing caller has `X2 == 1` and therefore hits the original encoding path verbatim. No upstream config or pipeline behavior changes.

## Test Plan

The grouped convolution example is the natural exerciser since `GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for both A and B tiles, and all three conv directions share the same `ConvConfigComputeV3`.

For each test below we ran:

```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd        [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data   [-prec={fp16,bf16}]
```

with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both code paths:

| Test | M / N / K   | VecA/B/C | A path     | B path     | dtype               |
|------|-------------|----------|------------|------------|---------------------|
| T1   | 16/64/32    | 4/8/4    | old (X2=1) | old (X2=1) | fp16                |
| T2   | 128/128/64  | 2/2/2    | old (X2=1) | old (X2=1) | fp16                |
| T3   | 256/256/64  | 1/1/1    | old (X2=1) | new (X2=4) | fp16                |
| T5   | 256/256/64  | 1/1/1    | old (X2=1) | new (X2=4) | fp16 (3 dir)        |
| T4b  | 128/128/128 | 1/1/1    | new (X2=2) | new (X2=2) | fp16 + bf16 (3 dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128 KB > 64 KB limit), independent of this PR.
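The `A path` / `B path` columns in the table can be reproduced from the `X0` / `X1` / `X2` formulas. The following is a minimal Python sketch of that arithmetic, not code from this PR. It assumes `warp_size = 64` (consistent with the `X0*X1 = 64` boundary mentioned in the Motivation), a hypothetical `largest_vec` cap of 8 that does not bind for any vector size tested here, and, as an inference from the table rather than something stated in the PR, that the contiguous dimension is K for the A tile and N for the B tile.

```python
def raked_x_split(x_per_tile, vec_size, warp_size=64, largest_vec=8):
    """Split the contiguous tile dimension into (X0, X1, X2).

    warp_size and largest_vec are illustrative assumptions, not values
    read from ck_tile.
    """
    x1 = min(largest_vec, vec_size)          # vector size per access
    x0 = min(warp_size, x_per_tile // x1)    # threads along X
    x2 = x_per_tile // (x0 * x1)             # X iterations per thread (new)
    return x0, x1, x2


# (test, N, K, VecA, VecB) taken from the table; A is assumed contiguous
# along K and B along N.
CASES = [
    ("T1", 64, 32, 4, 8),
    ("T2", 128, 64, 2, 2),
    ("T3", 256, 64, 1, 1),
    ("T4b", 128, 128, 1, 1),
]

for name, n, k, vec_a, vec_b in CASES:
    x2_a = raked_x_split(k, vec_a)[2]
    x2_b = raked_x_split(n, vec_b)[2]
    print(f"{name}: A X2={x2_a}, B X2={x2_b}")
```

Under these assumptions only T3 (B side) and T4b (both sides) give `X2 > 1`, which matches the rows marked `new` in the table.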
For the generator change we ran:

```
python3 generate_instances.py --mode profiler --direction all
```

and verified `Skipping instance ... with multiple warps per continous tile dim` no longer appears (count went from non-zero to 0); other skip categories are unchanged.

`clang-format-18` was applied to both modified `.hpp` files (matches the repo's `.clang-format`).

## Test Result

- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`. Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3 encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3 `correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs `correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining; other skips unchanged.

Sample run output (T4b, bf16, bwd_data):

```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
CK Tile Instance Generation and Integration
Table of Contents
- Overview
- Architecture
- Instance Generation Workflow
- Configuration Files
- Python Generation Script
- Generated Artifacts
- Integration with CK Profiler
- Directory Structure
- Usage
Overview
The CK Tile instance generation system provides an automated way to create optimized convolution kernel instances using the CK Builder pattern. These instances are:
- Generated from configuration files containing instance parameter strings
- Integrated with the CK Profiler for benchmarking and validation
Key Components
- CK Builder (/projects/composablekernel/experimental/builder)
  - High-level C++20 interface for constructing composable kernel operations
  - Provides compile-time dispatch from builder descriptors to specialized kernel implementations
- Instance Generator (/projects/composablekernel/experimental/grouped_convolution_tile_instances)
  - Python-based code generation system
  - Parses configuration files with instance strings
  - Generates C++ wrapper files using templates
- CK Profiler Integration (projects/composablekernel/profiler)
  - Benchmarks generated instances
  - Validates correctness against reference implementations
  - Selects best-performing kernels
Architecture
CK Builder Design
The CK Builder uses a builder pattern that separates:
- Signature - Defines the operation (data type, layout, direction)
- Algorithm - Specifies tile parameters and optimizations
- Instance - The compiled kernel that the builder produces from Signature + Algorithm
// Example: Building a convolution instance
using Builder = ckb::ConvBuilder<SIGNATURE, ALGORITHM>;
using Instance = Builder::Instance;
auto conv = Instance{};
ckt::RunResult result = ckt::run(conv, args, inputs, outputs, stream_config);
Convolution Signatures
Signatures are compile-time constants that define the operation:
constexpr auto SIGNATURE_NHWGC_FP16_FWD = ckt::ConvSignature{
    .spatial_dim = 2, // 2D convolution
    .direction = ckb::ConvDirection::FORWARD,
    .data_type = ckb::DataType::FP16,
    .accumulation_data_type = ckb::DataType::FP32,
    .input = {.config = {.layout = ckb::TensorLayout::NHWGC}},
    .weight = {.config = {.layout = ckb::TensorLayout::GKYXC}},
    .output = {.config = {.layout = ckb::TensorLayout::NHWGK}}
};
Tile Algorithm Configuration
Algorithms specify tile sizes, GEMM parameters, and optimizations:
constexpr auto ALGORITHM = cku::ConvAlgorithm_Tile_GroupedConvolutionKernel{}
    .with_tile_specializations(ckb::TileConvSpecialization::DEFAULT)
    .with_tile_thread_block(ckt::TileThreadBlock{
        .tile_size = {.m = 128, .n = 128, .k = 32}
    })
    .with_tile_block_gemm(ckt::TileBlockGemm{
        .warps = {.m = 2, .n = 2, .k = 1},
        .warp_tile = {.m = 32, .n = 32, .k = 16},
        .double_smem_buffer = false,
        .num_wave_groups = 1,
        .pipeline_version = ckb::PipelineVersion::V1,
        .scheduler = ckb::PipelineScheduler::INTRAWAVE
    })
    .with_tile_transfer(ckt::TileTransfer{
        .a_scalar_per_vector = 8,
        .b_scalar_per_vector = 8,
        .c_scalar_per_vector = 8
    })
    .with_tile_optimizations(ckt::TileOptimizations{
        .num_groups_to_merge = 1,
        .split_image = false,
        .explicit_gemm = false
    });
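The tile, warp, and warp-tile sizes in the configuration above are related by simple arithmetic. The sketch below is a Python illustration of that relationship, not part of the builder. It assumes a 64-lane wavefront (as on gfx942) and that each tile dimension is a whole multiple of `warps * warp_tile`; `block_layout` is a hypothetical helper name.

```python
WARP_SIZE = 64  # assumption: 64-lane wavefront (e.g. gfx942)


def block_layout(tile, warps, warp_tile, warp_size=WARP_SIZE):
    """Return (threads_per_block, repeats).

    repeats[d] is how many warp tiles each warp covers along dimension d.
    """
    threads = warps["m"] * warps["n"] * warps["k"] * warp_size
    repeats = {}
    for dim in ("m", "n", "k"):
        span = warps[dim] * warp_tile[dim]  # extent covered by one pass of all warps
        if tile[dim] % span != 0:
            raise ValueError(f"tile.{dim}={tile[dim]} is not a multiple of {span}")
        repeats[dim] = tile[dim] // span
    return threads, repeats


# Values copied from the ALGORITHM example above.
threads, repeats = block_layout(
    tile={"m": 128, "n": 128, "k": 32},
    warps={"m": 2, "n": 2, "k": 1},
    warp_tile={"m": 32, "n": 32, "k": 16},
)
print(threads, repeats)
```

The same arithmetic appears to hold for the V3 instance string shown later in this document: with `m_per_block = 128`, `m_per_xdl = 32`, and `m_xdl_per_wave = 2` there are 128 / (32 * 2) = 2 warps along M, likewise 2 along N, and 2 * 2 * 64 = 256 threads, which equals the leading `block_size` value.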
Instance Generation Workflow
Step 1: Configuration Files
Instance strings are defined in configuration files organized by:
- Direction: forward, backward_weight, backward_data
- Purpose: profiler (all instances), tests (limited set), compilation (empty)
- Layout & Data Type: e.g., nhwgc_fp16.conf, ndhwgc_bf16.conf
Location: configs/{direction}/{purpose}/{layout_dtype}.conf
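As an illustration of the layout above, the following Python sketch builds a config path from its three components. `config_path` is a hypothetical helper, not a function from `generate_instances.py`, and treating every purpose as a directory name is an assumption.

```python
from pathlib import PurePosixPath

DIRECTIONS = ("forward", "backward_weight", "backward_data")
PURPOSES = ("profiler", "tests", "compilation")


def config_path(direction, purpose, layout_dtype, root="configs"):
    """Build configs/{direction}/{purpose}/{layout_dtype}.conf."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    if purpose not in PURPOSES:
        raise ValueError(f"unknown purpose: {purpose}")
    return PurePosixPath(root) / direction / purpose / f"{layout_dtype}.conf"


print(config_path("forward", "profiler", "nhwgc_fp16"))
```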
Step 2: Python Generation
Run generate_instances.py to parse configs and generate C++ files:
python generate_instances.py \
--mode profiler \
--direction all \
--filter_pattern convolution
Step 3: Generated Files
For each instance, the script generates:
- Individual C++ files (one per instance)
  - Location: instances/{direction}/{config}/{instance_name}.cpp
  - Contains instance-specific kernel wrapper
- Include files (.inc headers)
  - {problem_name}.inc - Function declarations
  - {problem_name}_calls.inc - Function call invocations
- CMake integration (via CMakeLists.txt)
  - Compiles all generated instances
  - Links with profiler
Step 4: Compilation
CMake compiles the generated instances with:
- GPU-specific optimizations
- Target architecture (e.g., gfx942)
- C++20 standard required
Step 5: Profiler Integration
Generated instances are integrated via include files in profiler headers.
Configuration Files
Instance String Format
Configuration files contain instance strings that define kernel parameters. The format varies by device operation type.
Forward Convolution Example
DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle<64, 64, 64, 32, Default, 32, 32, 2, 2, 1, 1, 1, 1, 1, 1>
Parameters (parsed by parse_fwd_instances):
1. block_size = 64 (total threads per block)
2. m_per_block = 64 (M dimension of tile)
3. n_per_block = 64 (N dimension of tile)
4. k_per_block = 32 (K dimension of tile)
5. spec = Default (specialization: Default, Filter1x1Pad0, Filter1x1Stride1Pad0, OddC, Filter3x3)
6. m_per_xdl = 32 (M dimension per XDL instruction)
7. n_per_xdl = 32 (N dimension per XDL instruction)
8. m_xdl_per_wave = 2 (XDL tiles in M per wave)
9. n_xdl_per_wave = 2 (XDL tiles in N per wave)
10. a_scalar_per_vector = 1 (vectorization for input)
11. b_scalar_per_vector = 1 (vectorization for weight)
12. c_scalar_per_vector = 1 (vectorization for output)
13-14. Optional pipeline parameters
15. Optional num_groups_to_merge
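The positional layout described above can be illustrated with a short Python sketch. This is not the actual `parse_fwd_instances` implementation; `split_instance`, `parse_fwd_fields`, and `FWD_FIELDS` are hypothetical names, and any trailing optional values are left as unparsed strings.

```python
FWD_FIELDS = (
    "block_size", "m_per_block", "n_per_block", "k_per_block", "spec",
    "m_per_xdl", "n_per_xdl", "m_xdl_per_wave", "n_xdl_per_wave",
    "a_scalar_per_vector", "b_scalar_per_vector", "c_scalar_per_vector",
)


def split_instance(instance):
    """Split 'Name<a, b, c>' into ('Name', ['a', 'b', 'c'])."""
    name, sep, rest = instance.strip().partition("<")
    if not sep or not rest.endswith(">"):
        raise ValueError(f"not an instance string: {instance!r}")
    return name, [arg.strip() for arg in rest[:-1].split(",")]


def parse_fwd_fields(instance):
    """Map the first 12 positional arguments to the field names listed above."""
    name, args = split_instance(instance)
    if len(args) < len(FWD_FIELDS):
        raise ValueError(f"expected at least {len(FWD_FIELDS)} arguments")
    params = {}
    for field, raw in zip(FWD_FIELDS, args):
        # 'spec' is a specialization name; every other field is an integer.
        params[field] = raw if field == "spec" else int(raw)
    params["extra"] = args[len(FWD_FIELDS):]  # optional trailing parameters
    return name, params


example = ("DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle"
           "<64, 64, 64, 32, Default, 32, 32, 2, 2, 1, 1, 1, 1, 1, 1>")
name, params = parse_fwd_fields(example)
print(name, params["m_per_block"], params["spec"], params["extra"])
```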
Backward Weight Convolution Example (V3 Instance)
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v4>
Additional Parameters (V3 instances):
- BlkGemmPipelineScheduler - Intrawave or Interwave
- BlkGemmPipelineVersion - v1, v2, v3, v4, or v5
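V3 instance strings mix positional values with `Key: value` entries. One way to separate the two is sketched below in Python; `split_named_args` is a hypothetical helper shown for illustration and is not taken from `generate_instances.py`.

```python
def split_named_args(instance):
    """Return (positional, named) for an instance string.

    Entries containing ':' are treated as 'Key: value' pairs; all other
    entries are kept, in order, as positional strings.
    """
    body = instance[instance.index("<") + 1 : instance.rindex(">")]
    positional, named = [], {}
    for token in (t.strip() for t in body.split(",")):
        key, sep, value = token.partition(":")
        if sep:
            named[key.strip()] = value.strip()
        else:
            positional.append(token)
    return positional, named


example = ("DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<256, 128, 128, 64, Default, "
           "32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, "
           "BlkGemmPipelineVersion: v4>")
positional, named = split_named_args(example)
print(len(positional), named)
```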
Specializations
- DEFAULT - General purpose convolution
- FILTER_1X1_PAD0 - Optimized for 1x1 filters with no padding
- FILTER_1X1_STRIDE1_PAD0 - Optimized for 1x1 filters, stride 1, no padding
- FILTER_3x3 - Optimized for 3x3 filters
- OddC - Optimized for odd channel counts
Pipeline Versions
- v1 - Basic pipeline
- v2 - Enhanced pipeline with better scheduling
- v3 - Advanced pipeline optimizations
- v4 - Double shared memory buffering
- v5 - Two wave groups (2x parallelism)
Python Generation Script
Script: generate_instances.py
Key Functions
- parse_fwd_instances(instances, problem_name)
  - Parses forward convolution instance strings
  - Extracts tile sizes, GEMM parameters, specializations
  - Returns list of ConvInstanceTemplateParams objects
- parse_bwd_weight_instances(instances, problem_name)
  - Parses backward weight convolution instance strings
  - Handles V1, V3, and TwoStage variants
  - Extracts pipeline scheduler and version parameters
- parse_bwd_data_instances(instances, problem_name)
  - Placeholder for backward data parsing (not yet implemented)
- generate_conv_cpp(instances, problem_name, config, direction, signature_name, filter_pattern)
  - Generates individual C++ wrapper files from template
  - One file per instance
- generate_defs_inc(instances, problem_name, signature, direction, filter_pattern)
  - Generates function declarations (.inc file)
  - Used by profiler to call instances
- generate_calls_inc(instances, problem_name, direction, filter_pattern)
  - Generates function call statements (.inc file)
  - Invokes each instance in profiler benchmark loop
Template System
Template: instances/grouped_convolution_tile.cpp.in
Placeholders:
- gen_signature → Signature constant name
- gen_instance_name → Unique instance function name
- gen_specialization → Tile specialization enum
- gen_thread_block → Thread block configuration
- gen_block_gemm_desc → Block GEMM descriptor
- gen_block_transfer → Transfer parameters
- gen_optimizations → Optimization settings
Generated Output: instances/{direction}/{config}/{instance_name}.cpp
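The placeholder substitution can be pictured as a single-pass text replacement. The Python sketch below is an assumption about how such a step might work: the placeholder names come from the list above, but the template text, the `render` helper, and the idea that placeholders appear as bare tokens are all hypothetical.

```python
import re

PLACEHOLDERS = (
    "gen_signature", "gen_instance_name", "gen_specialization",
    "gen_thread_block", "gen_block_gemm_desc", "gen_block_transfer",
    "gen_optimizations",
)

# One alternation pattern so every placeholder is replaced in a single pass
# and substituted text is never scanned again.
_PATTERN = re.compile("|".join(re.escape(p) for p in PLACEHOLDERS))


def render(template, values):
    """Replace each placeholder found in template with values[placeholder]."""
    missing = [p for p in PLACEHOLDERS if p in template and p not in values]
    if missing:
        raise KeyError(f"no value for placeholders: {missing}")
    return _PATTERN.sub(lambda match: values[match.group(0)], template)


# Hypothetical two-line template, for illustration only.
template = "constexpr auto SIGNATURE = gen_signature;\nauto gen_instance_name();\n"
print(render(template, {
    "gen_signature": "SIGNATURE_NHWGC_FP16_FWD",
    "gen_instance_name": "run_grouped_convolution_forward_tile_nhwgc_fp16_0",
}))
```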
Command-Line Arguments
python generate_instances.py \
--mode {compilation|tests|profiler} \
--direction {forward|backward_weight|backward_data|all} \
--filter_pattern {pattern}
Modes:
- compilation - Empty instance list (compile-time check only)
- tests - Limited instances for testing
- profiler - All instances for benchmarking
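A command-line interface with these three options could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: the flag names and choices come from the text above, while marking `--mode` and `--direction` as required and defaulting `--filter_pattern` to `None` are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Generate CK Tile convolution instances")
    parser.add_argument("--mode", choices=["compilation", "tests", "profiler"],
                        required=True)  # assumption: no default mode
    parser.add_argument("--direction",
                        choices=["forward", "backward_weight", "backward_data", "all"],
                        required=True)  # assumption: no default direction
    parser.add_argument("--filter_pattern", default=None)  # assumption: optional
    return parser


args = build_parser().parse_args(["--mode", "profiler", "--direction", "all"])
print(args.mode, args.direction, args.filter_pattern)
```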
Generated Artifacts
Directory Structure
instances/
├── forward/
│   ├── nhwgc_fp16/
│   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
│   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_1.cpp
│   │   └── ...
│   ├── grouped_convolution_forward_tile_nhwgc_fp16.inc
│   └── grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
├── backward_weight/
│   ├── nhwgc_bf16/
│   │   └── ...
│   └── ...
├── instance_includes.inc    # Shared headers and signatures
└── instance_run.inc         # Shared instance execution logic
File Types
- Instance Implementation (.cpp)

    // grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
    #include "../../instance_includes.inc"

    namespace ck_tile::builder::profiling {

    constexpr auto SIGNATURE = SIGNATURE_NHWGC_FP16_FWD;

    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_0(
        const ckt::Args<SIGNATURE>& args,
        const ckt::Inputs<SIGNATURE>& inputs,
        const ckt::Outputs<SIGNATURE>& outputs,
        const ck_tile::stream_config& s_conf)
    {
        constexpr auto ALGORITHM = /* ... */;
    #include "../../instance_run.inc"
    }

    }

- Function Declarations (.inc)

    // grouped_convolution_forward_tile_nhwgc_fp16.inc
    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_0(...);
    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_1(...);
    // ...

- Function Calls (_calls.inc)

    // grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
    run_alg(run_grouped_convolution_forward_tile_nhwgc_fp16_0);
    run_alg(run_grouped_convolution_forward_tile_nhwgc_fp16_1);
    // ...
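The two `.inc` files follow a regular pattern, so their contents can be derived from the instance names alone. The Python sketch below illustrates that pattern; it is not the code of `generate_defs_inc` or `generate_calls_inc`. The helper names are hypothetical, and the parameter list is left as `...` because the declarations shown above elide it.

```python
RETURN_TYPE = "std::tuple<bool, float, std::string>"


def instance_names(problem_name, count):
    """Names follow the run_{problem_name}_{index} pattern shown above."""
    return [f"run_{problem_name}_{index}" for index in range(count)]


def render_defs(names, params="..."):
    """One function declaration per instance (contents of {problem_name}.inc)."""
    return "".join(f"{RETURN_TYPE} {name}({params});\n" for name in names)


def render_calls(names):
    """One run_alg(...) statement per instance (contents of {problem_name}_calls.inc)."""
    return "".join(f"run_alg({name});\n" for name in names)


names = instance_names("grouped_convolution_forward_tile_nhwgc_fp16", 2)
print(render_defs(names), end="")
print(render_calls(names), end="")
```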
Integration with CK Profiler
Profiler Header: grouped_convolution_forward_tile_algs.hpp
This file orchestrates the benchmarking of all CK Tile instances.
Key Components
- Include Generated Instances

    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp32.inc"
    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_bf16.inc"
    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp16.inc"
    // ... more includes

- Benchmark Loop (run_grouped_conv_forward_tile_algs)

    template <auto SIGNATURE>
    std::tuple<bool, float, std::string> run_grouped_conv_forward_tile_algs(
        const ckt::Args<SIGNATURE>& args,
        const ckt::Inputs<SIGNATURE>& inputs,
        const ckt::Outputs<SIGNATURE>& outputs,
        const ck_tile::stream_config& s_conf)
    {
        float best_avg_time = std::numeric_limits<float>::max();
        std::string best_op_name;
        bool valid = true;

        // Generate reference output
        auto reference = ckt::alloc_outputs(args);
        using ReferenceInstance = /* ... */;
        auto ref_conv = ReferenceInstance{};
        auto ref_result = ckt::run(ref_conv, args, inputs, reference.get());

        // Lambda to run and validate each instance
        auto run_alg = [&](auto&& run_alg_func) {
            auto [is_supported, avg_time, op_name] = run_alg_func(args, inputs, outputs, s_conf);
            if(is_supported)
            {
                best_avg_time = std::min(best_avg_time, avg_time);
                best_op_name = (best_avg_time < avg_time) ? best_op_name : op_name;

                // Validate correctness
                valid = ck_tile::check_err(outputs, reference, rtol, atol);
                std::cout << "Perf: " << avg_time << " ms, " << op_name << std::endl;
            }
        };

        // Run all instances based on signature
        if constexpr(SIGNATURE == SIGNATURE_NHWGC_FP16_FWD)
        {
    #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp16_calls.inc"
        }
        // ... more signature branches

        return std::make_tuple(valid, best_avg_time, best_op_name);
    }
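The best-kernel bookkeeping inside `run_alg` reduces to a running minimum over the supported instances. The Python sketch below mirrors only that selection logic, under the assumption that each instance reports `(is_supported, avg_time, op_name)` as in the C++ above; `select_best` is a hypothetical name and correctness validation is left out.

```python
def select_best(results):
    """Pick the fastest supported instance.

    results: iterable of (is_supported, avg_time_ms, op_name) tuples.
    On a tie the later instance wins, matching the C++ ternary, which
    keeps the old name only when the old best time is strictly smaller.
    """
    best_time = float("inf")
    best_name = ""
    for is_supported, avg_time, op_name in results:
        if not is_supported:
            continue  # unsupported instances never affect the result
        if avg_time <= best_time:
            best_time, best_name = avg_time, op_name
    return best_time, best_name


# Illustrative timings, not measured data.
print(select_best([
    (True, 0.5, "instance_0"),
    (False, 0.1, "instance_1"),
    (True, 0.3, "instance_2"),
]))
```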
Directory Structure
projects/composablekernel/
├── experimental/
│   ├── builder/                                  # CK Builder framework
│   │   ├── include/ck_tile/builder/              # Builder API
│   │   │   ├── conv_builder.hpp                  # Main builder interface
│   │   │   ├── factory/                          # Dispatch to kernel implementations
│   │   │   └── reflect/                          # Instance traits and reflection
│   │   ├── test/                                 # Builder tests and utilities
│   │   └── README.md                             # Builder documentation
│   │
│   └── grouped_convolution_tile_instances/       # Instance generation system
│       ├── generate_instances.py                 # Main generation script
│       ├── CMakeLists.txt                        # Build configuration
│       ├── README.md                             # Brief overview
│       │
│       ├── configs/                              # Configuration files
│       │   ├── forward/
│       │   │   ├── profiler/                     # All instances for profiling
│       │   │   │   ├── nhwgc_fp16.conf
│       │   │   │   ├── nhwgc_fp32.conf
│       │   │   │   ├── nhwgc_bf16.conf
│       │   │   │   ├── ndhwgc_fp16.conf
│       │   │   │   ├── ndhwgc_fp32.conf
│       │   │   │   └── ndhwgc_bf16.conf
│       │   │   └── tests/                        # Limited instances for testing
│       │   ├── backward_weight/
│       │   │   └── profiler/
│       │   └── backward_data/
│       │       └── profiler/
│       │
│       └── instances/                            # Generated C++ files
│           ├── instance_includes.inc             # Shared headers and signatures
│           ├── instance_run.inc                  # Shared execution logic
│           ├── grouped_convolution_tile.cpp.in   # Template file
│           │
│           ├── forward/                          # Forward instances
│           │   ├── nhwgc_fp16/
│           │   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
│           │   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_1.cpp
│           │   │   └── ...
│           │   ├── grouped_convolution_forward_tile_nhwgc_fp16.inc
│           │   ├── grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
│           │   └── ...
│           │
│           └── backward_weight/                  # Backward weight instances
│               └── ...
│
└── profiler/
    ├── include/profiler/
    │   ├── grouped_convolution_forward_tile_algs.hpp   # Profiler integration
    │   └── ...
    └── src/
        └── profile_grouped_conv_fwd.cpp          # Main profiler entry point
Usage
1: Generate All Instances for Profiling
cd projects/composablekernel/experimental/grouped_convolution_tile_instances
# Generate all forward, backward_weight, and backward_data instances
python generate_instances.py --mode profiler --direction all
Output:
- Generates .cpp files for all instances
- Creates .inc declaration and call files
- Ready to compile with CMake
1.1: Generate Only Forward Instances for Testing
# Generate limited forward instances from test configs
python generate_instances.py --mode tests --direction forward
1.2: Filter Specific Instances
# Only generate instances matching "fp16"
python generate_instances.py \
--mode profiler \
--direction forward \
--filter_pattern fp16
2: Compile the Generated Instances
cd build
cmake -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx942" \
-D CK_EXPERIMENTAL_BUILDER=ON \
-D CMAKE_CXX_STANDARD=20 \
-G Ninja \
..
ninja device_grouped_conv_fwd_tile_instances
ninja device_grouped_conv_bwd_weight_tile_instances
3: Run the Profiler
# Profile 2D convolution (NHWGC layout, FP16 data type).
# Positional arguments, in order (a comment after a trailing backslash
# would break the line continuation, so they are listed here instead):
#   1 0 2          data_type=FP16, layout=NHWGC, spatial_dim=2
#   1 128 128 64   G=1, N=128, K=128, C=64
#   3 3            filter: 3x3
#   28 28          input: 28x28
#   1 1            stride: 1x1
#   1 1            dilation: 1x1
#   1 1 1 1        padding: 1,1,1,1
#   1 0 1          verification, initialization, profiling
./bin/ckProfiler conv fwd \
    1 0 2 \
    1 128 128 64 \
    3 3 \
    28 28 \
    1 1 \
    1 1 \
    1 1 1 1 \
    1 0 1
# Output shows performance of each instance and selects best kernel
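Because the profiler takes a long run of positional integers, it can help to assemble the command from named values. The Python sketch below does this for the example above. `ck_profiler_command` and its parameter names are hypothetical, and the argument order is copied from that example rather than from the profiler's own documentation.

```python
def ck_profiler_command(data_type, layout, spatial_dim, g, n, k, c,
                        filter_hw, input_hw, stride, dilation, padding,
                        verify=1, init=0, profile=1,
                        binary="./bin/ckProfiler"):
    """Return the argv list for a 'conv fwd' profiler run.

    The order of the groups follows the example command in this section.
    """
    groups = [
        (data_type, layout, spatial_dim),
        (g, n, k, c),
        filter_hw, input_hw, stride, dilation, padding,
        (verify, init, profile),
    ]
    argv = [binary, "conv", "fwd"]
    for group in groups:
        argv.extend(str(value) for value in group)
    return argv


argv = ck_profiler_command(
    data_type=1, layout=0, spatial_dim=2,
    g=1, n=128, k=128, c=64,
    filter_hw=(3, 3), input_hw=(28, 28),
    stride=(1, 1), dilation=(1, 1), padding=(1, 1, 1, 1),
)
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from a directory where the binary exists.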
Adding a New Instance Configuration
- Edit config file: configs/forward/profiler/nhwgc_fp16.conf

    DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 256, 256, 64, Default, 32, 32, 4, 4, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v5>

- Regenerate instances:

    python generate_instances.py --mode profiler --direction forward --filter_pattern fp16

- Rebuild:

    ninja grouped_convolution_tile_instances

- Profile:

    ./bin/ckProfiler conv fwd ...