mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 16:59:10 +00:00

Files

JH-Leon-KIM-AMD 720ceb6500 [rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)

[CK Tile] Support multi-vector reads in static encoding patterns  (#7528)

## Motivation

The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.

Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.

This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.

## Technical Details

Three files changed.

### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`

Add a per-thread X iteration dimension in all three raked
specializations:

- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)

`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.
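
As a rough illustration of the factorization above, the following Python sketch (not the library's C++ code) evaluates the three formulas for a given contiguous tile length. The helper names are hypothetical, `warp_size=64` is an assumption taken from the `X0*X1 = 64` boundary mentioned in the motivation, and `largest_vec` is passed in directly because this description does not show how `LargestVec` is derived.

```python
def raked_x_factors(x_per_tile, vec_size, largest_vec, warp_size=64):
    """Split the contiguous tile dimension into (X0, X1, X2).

    Hypothetical helper that mirrors the three formulas listed above.
    """
    x1 = min(largest_vec, vec_size)        # vector size per access
    x0 = min(warp_size, x_per_tile // x1)  # threads along X
    x2 = x_per_tile // (x0 * x1)           # X iterations per thread
    return x0, x1, x2


def uses_new_encoding(x_per_tile, vec_size, largest_vec, warp_size=64):
    """True when X2 != 1, i.e. the `else` branch of the `if constexpr` gate."""
    _, _, x2 = raked_x_factors(x_per_tile, vec_size, largest_vec, warp_size)
    return x2 != 1


# 128-wide contiguous dim with scalar loads: X0 * X1 = 64, so X2 = 2.
print(raked_x_factors(128, 1, largest_vec=8))
# Same tile with 2-element vectors: one pass covers the tile, so X2 = 1.
print(raked_x_factors(128, 2, largest_vec=8))
```

With these inputs, the scalar-load case needs a second pass per thread, while the 2-element-vector case stays on the original path.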

The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.

Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.

### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`

Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.

Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).

### 3. `experimental/grouped_convolution_tile_instances/generate_instances.py`

Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:

```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```

Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.

### Compatibility

Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.

## Test Plan

The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.

For each test below we ran:

```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd        [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data   [-prec={fp16,bf16}]
```

with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:

| Test | M / N / K | VecA/B/C | A path | B path | dtype |
|------|-------------|----------|------------|------------|---------------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir) |
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3 dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.

For the generator change we ran:

```
python3 generate_instances.py --mode profiler --direction all
```

and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.

`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).

## Test Result

- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.

Sample run output (T4b, bf16, bwd_data):

```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-20 17:25:22 +03:00

configs

[rocm-libraries] ROCm/rocm-libraries#5904 (commit f4e261a)

2026-05-16 10:49:18 +02:00

include

[rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665)

2026-05-15 15:47:55 +02:00

CMakeLists.txt

[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda)

2026-03-25 14:34:13 +00:00

generate_instances.py

[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)

2026-05-20 17:25:22 +03:00

README.md

[rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400)

2026-03-04 21:49:42 +00:00

test-instances.py

[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda)

2026-03-25 14:34:13 +00:00

README.md

CK Tile Instance Generation and Integration

Overview
Architecture
Instance Generation Workflow
Configuration Files
Python Generation Script
Generated Artifacts
Integration with CK Profiler
Directory Structure
Usage

Overview

The CK Tile instance generation system provides an automated way to create optimized convolution kernel instances using the CK Builder pattern. These instances are:

Generated from configuration files containing instance parameter strings
Integrated with the CK Profiler for benchmarking and validation

Key Components

CK Builder (/projects/composablekernel/experimental/builder)
- High-level C++20 interface for constructing composable kernel operations
- Provides compile-time dispatch from builder descriptors to specialized kernel implementations
Instance Generator (/projects/composablekernel/experimental/grouped_convolution_tile_instances)
- Python-based code generation system
- Parses configuration files with instance strings
- Generates C++ wrapper files using templates
CK Profiler Integration (projects/composablekernel/profiler)
- Benchmarks generated instances
- Validates correctness against reference implementations
- Selects best-performing kernels

Architecture

CK Builder Design

The CK Builder uses a builder pattern that separates:

Signature - Defines the operation (data type, layout, direction)
Algorithm - Specifies tile parameters and optimizations
Instance - The compiled kernel from Builder + Algorithm

// Example: Building a convolution instance
using Builder = ckb::ConvBuilder<SIGNATURE, ALGORITHM>;
using Instance = Builder::Instance;

auto conv = Instance{};
ckt::RunResult result = ckt::run(conv, args, inputs, outputs, stream_config);

Convolution Signatures

Signatures are compile-time constants that define the operation:

constexpr auto SIGNATURE_NHWGC_FP16_FWD = ckt::ConvSignature{
    .spatial_dim            = 2,              // 2D convolution
    .direction              = ckb::ConvDirection::FORWARD,
    .data_type              = ckb::DataType::FP16,
    .accumulation_data_type = ckb::DataType::FP32,
    .input                  = {.config = {.layout = ckb::TensorLayout::NHWGC}},
    .weight                 = {.config = {.layout = ckb::TensorLayout::GKYXC}},
    .output                 = {.config = {.layout = ckb::TensorLayout::NHWGK}}
};

Tile Algorithm Configuration

Algorithms specify tile sizes, GEMM parameters, and optimizations:

constexpr auto ALGORITHM = cku::ConvAlgorithm_Tile_GroupedConvolutionKernel{}
    .with_tile_specializations(ckb::TileConvSpecialization::DEFAULT)
    .with_tile_thread_block(ckt::TileThreadBlock{
        .tile_size = {.m = 128, .n = 128, .k = 32}
    })
    .with_tile_block_gemm(ckt::TileBlockGemm{
        .warps              = {.m = 2, .n = 2, .k = 1},
        .warp_tile          = {.m = 32, .n = 32, .k = 16},
        .double_smem_buffer = false,
        .num_wave_groups    = 1,
        .pipeline_version   = ckb::PipelineVersion::V1,
        .scheduler          = ckb::PipelineScheduler::INTRAWAVE
    })
    .with_tile_transfer(ckt::TileTransfer{
        .a_scalar_per_vector = 8,
        .b_scalar_per_vector = 8,
        .c_scalar_per_vector = 8
    })
    .with_tile_optimizations(ckt::TileOptimizations{
        .num_groups_to_merge = 1,
        .split_image         = false,
        .explicit_gemm       = false
    });

Instance Generation Workflow

Step 1: Configuration Files

Instance strings are defined in configuration files organized by:

Direction: forward, backward_weight, backward_data
Purpose: profiler (all instances), tests (limited set), compilation (empty)
Layout & Data Type: e.g., nhwgc_fp16.conf, ndhwgc_bf16.conf

Location: configs/{direction}/{purpose}/{layout_dtype}.conf
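
As a small illustration of this layout, the sketch below assembles a config path from its three components. The helper name `config_path` and the `root` argument are hypothetical and are not taken from `generate_instances.py`.

```python
from pathlib import Path


def config_path(direction, purpose, layout_dtype, root="configs"):
    """Build configs/{direction}/{purpose}/{layout_dtype}.conf (hypothetical helper)."""
    return Path(root) / direction / purpose / f"{layout_dtype}.conf"


# Forward-direction profiler config for the NHWGC layout with FP16 data.
print(config_path("forward", "profiler", "nhwgc_fp16").as_posix())
```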

Step 2: Python Generation

Run generate_instances.py to parse configs and generate C++ files:

python generate_instances.py \
    --mode profiler \
    --direction all \
    --filter_pattern convolution

Step 3: Generated Files

For each instance, the script generates:

Individual C++ files (one per instance)
- Location: instances/{direction}/{config}/{instance_name}.cpp
- Contains instance-specific kernel wrapper
Include files (.inc headers)
- {problem_name}.inc - Function declarations
- {problem_name}_calls.inc - Function call invocations
CMake integration (via CMakeLists.txt)
- Compiles all generated instances
- Links with profiler

Step 4: Compilation

CMake compiles the generated instances with:

GPU-specific optimizations
Target architecture (e.g., gfx942)
C++20 standard required

Step 5: Profiler Integration

Generated instances are integrated via include files in profiler headers.

Configuration Files

Instance String Format

Configuration files contain instance strings that define kernel parameters. The format varies by device operation type.

Forward Convolution Example

DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle<64, 64, 64, 32, Default, 32, 32, 2, 2, 1, 1, 1, 1, 1, 1>

Parameters (parsed by parse_fwd_instances):

block_size = 64 (total threads per block)
m_per_block = 64 (M dimension of tile)
n_per_block = 64 (N dimension of tile)
k_per_block = 32 (K dimension of tile)
spec = Default (specialization: Default, Filter1x1Pad0, Filter1x1Stride1Pad0, OddC, Filter3x3)
m_per_xdl = 32 (M dimension per XDL instruction)
n_per_xdl = 32 (N dimension per XDL instruction)
m_xdl_per_wave = 2 (XDL tiles in M per wave)
n_xdl_per_wave = 2 (XDL tiles in N per wave)
a_scalar_per_vector = 1 (vectorization for input)
b_scalar_per_vector = 1 (vectorization for weight)
c_scalar_per_vector = 1 (vectorization for output)
Optional pipeline parameters (positions 13-14)
Optional num_groups_to_merge
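
To make the positional mapping concrete, here is a minimal Python sketch that splits the forward example string into the twelve named fields listed above. It is an illustration only, not the real `parse_fwd_instances`; the function name, the `FWD_FIELDS` list, and the `extra` key that collects the optional trailing values are assumptions.

```python
# Field names in positional order, as listed in the README.
FWD_FIELDS = [
    "block_size", "m_per_block", "n_per_block", "k_per_block", "spec",
    "m_per_xdl", "n_per_xdl", "m_xdl_per_wave", "n_xdl_per_wave",
    "a_scalar_per_vector", "b_scalar_per_vector", "c_scalar_per_vector",
]


def parse_fwd_instance_string(instance):
    """Illustrative parser for one forward instance string."""
    name, _, rest = instance.partition("<")
    args = [a.strip() for a in rest.rstrip().rstrip(">").split(",")]
    params = {"op": name.strip()}
    for field, raw in zip(FWD_FIELDS, args):
        # Numeric fields become ints; names such as "Default" stay strings.
        params[field] = int(raw) if raw.isdigit() else raw
    # Anything past the twelve named fields is kept unparsed.
    params["extra"] = args[len(FWD_FIELDS):]
    return params


example = (
    "DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle"
    "<64, 64, 64, 32, Default, 32, 32, 2, 2, 1, 1, 1, 1, 1, 1>"
)
parsed = parse_fwd_instance_string(example)
print(parsed["m_per_block"], parsed["spec"], parsed["extra"])
```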

Backward Weight Convolution Example (V3 Instance)

DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<256, 128, 128, 64, Default, 32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v4>

Additional Parameters (V3 instances):

BlkGemmPipelineScheduler - Intrawave or Interwave
BlkGemmPipelineVersion - v1, v2, v3, v4, or v5
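
The V3 strings mix positional values with `Key: value` pairs. A hedged sketch of how the two kinds of argument could be separated is shown below; `split_named_params` is a hypothetical helper and does not reflect how `parse_bwd_weight_instances` is actually written.

```python
def split_named_params(instance):
    """Separate positional arguments from `Key: value` arguments (illustrative only)."""
    inner = instance[instance.index("<") + 1 : instance.rindex(">")]
    positional, named = [], {}
    for token in (t.strip() for t in inner.split(",")):
        if ":" in token:
            key, value = token.split(":", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(token)
    return positional, named


example = (
    "DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<256, 128, 128, 64, Default, "
    "32, 32, 2, 2, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, "
    "BlkGemmPipelineVersion: v4>"
)
positional, named = split_named_params(example)
print(len(positional), named)
```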

Specializations

DEFAULT - General purpose convolution
FILTER_1X1_PAD0 - Optimized for 1x1 filters with no padding
FILTER_1X1_STRIDE1_PAD0 - Optimized for 1x1 filters, stride 1, no padding
FILTER_3x3 - Optimized for 3x3 filters
OddC - Optimized for odd channel counts

Pipeline Versions

v1 - Basic pipeline
v2 - Enhanced pipeline with better scheduling
v3 - Advanced pipeline optimizations
v4 - Double shared memory buffering
v5 - Two wave groups (2x parallelism)

Python Generation Script

Script: `generate_instances.py`

Key Functions

parse_fwd_instances(instances, problem_name)
- Parses forward convolution instance strings
- Extracts tile sizes, GEMM parameters, specializations
- Returns list of ConvInstanceTemplateParams objects
parse_bwd_weight_instances(instances, problem_name)
- Parses backward weight convolution instance strings
- Handles V1, V3, and TwoStage variants
- Extracts pipeline scheduler and version parameters
parse_bwd_data_instances(instances, problem_name)
- Placeholder for backward data parsing (not yet implemented)
generate_conv_cpp(instances, problem_name, config, direction, signature_name, filter_pattern)
- Generates individual C++ wrapper files from template
- One file per instance
generate_defs_inc(instances, problem_name, signature, direction, filter_pattern)
- Generates function declarations (.inc file)
- Used by profiler to call instances
generate_calls_inc(instances, problem_name, direction, filter_pattern)
- Generates function call statements (.inc file)
- Invokes each instance in profiler benchmark loop

Template System

Template: instances/grouped_convolution_tile.cpp.in

Placeholders:

gen_signature → Signature constant name
gen_instance_name → Unique instance function name
gen_specialization → Tile specialization enum
gen_thread_block → Thread block configuration
gen_block_gemm_desc → Block GEMM descriptor
gen_block_transfer → Transfer parameters
gen_optimizations → Optimization settings

Generated Output: instances/{direction}/{config}/{instance_name}.cpp
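
The placeholder mechanism can be sketched as plain string substitution. The template text below is a made-up two-line stand-in, since the real `grouped_convolution_tile.cpp.in` is not reproduced here, and `render_template` is a hypothetical name; the actual script may substitute placeholders differently.

```python
# Hypothetical template text standing in for grouped_convolution_tile.cpp.in.
TEMPLATE = (
    "constexpr auto SIGNATURE = gen_signature;\n"
    "// instance: gen_instance_name\n"
)


def render_template(template, values):
    """Replace each placeholder name with its value using str.replace."""
    # Substitute longer names first so a short placeholder cannot clobber
    # a longer one that happens to contain it.
    for name in sorted(values, key=len, reverse=True):
        template = template.replace(name, values[name])
    return template


rendered = render_template(TEMPLATE, {
    "gen_signature": "SIGNATURE_NHWGC_FP16_FWD",
    "gen_instance_name": "run_grouped_convolution_forward_tile_nhwgc_fp16_0",
})
print(rendered)
```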

Command-Line Arguments

python generate_instances.py \
    --mode {compilation|tests|profiler} \
    --direction {forward|backward_weight|backward_data|all} \
    --filter_pattern {pattern}

Modes:

compilation - Empty instance list (compile-time check only)
tests - Limited instances for testing
profiler - All instances for benchmarking
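
For readers who want to see how these three options fit together, the following is a minimal `argparse` sketch that accepts the documented flags and choices. It is not copied from `generate_instances.py`; in particular, making `--mode` and `--direction` required and defaulting `--filter_pattern` to `None` are assumptions.

```python
import argparse


def build_arg_parser():
    """Sketch of a parser for the three documented options."""
    parser = argparse.ArgumentParser(description="Generate CK Tile instances")
    parser.add_argument(
        "--mode",
        choices=["compilation", "tests", "profiler"],
        required=True,  # assumption: the real script may provide a default
    )
    parser.add_argument(
        "--direction",
        choices=["forward", "backward_weight", "backward_data", "all"],
        required=True,  # assumption, as above
    )
    parser.add_argument("--filter_pattern", default=None)  # assumed default
    return parser


args = build_arg_parser().parse_args(["--mode", "tests", "--direction", "forward"])
print(args.mode, args.direction, args.filter_pattern)
```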

Generated Artifacts

Directory Structure

instances/
├── forward/
│   ├── nhwgc_fp16/
│   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
│   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_1.cpp
│   │   └── ...
│   ├── grouped_convolution_forward_tile_nhwgc_fp16.inc
│   └── grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
├── backward_weight/
│   ├── nhwgc_bf16/
│   │   └── ...
│   └── ...
├── instance_includes.inc     # Shared headers and signatures
└── instance_run.inc          # Shared instance execution logic

File Types

Instance Implementation (.cpp)

// grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
#include "../../instance_includes.inc"
namespace ck_tile::builder::profiling {
    constexpr auto SIGNATURE = SIGNATURE_NHWGC_FP16_FWD;
    std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_0(
        const ckt::Args<SIGNATURE>& args,
        const ckt::Inputs<SIGNATURE>& inputs,
        const ckt::Outputs<SIGNATURE>& outputs,
        const ck_tile::stream_config& s_conf)
    {
        constexpr auto ALGORITHM = /* ... */;
        #include "../../instance_run.inc"
    }
}

Function Declarations (.inc)

// grouped_convolution_forward_tile_nhwgc_fp16.inc
std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_0(...);
std::tuple<bool, float, std::string> run_grouped_convolution_forward_tile_nhwgc_fp16_1(...);
// ...

Function Calls (_calls.inc)

// grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
run_alg(run_grouped_convolution_forward_tile_nhwgc_fp16_0);
run_alg(run_grouped_convolution_forward_tile_nhwgc_fp16_1);
// ...
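
The two `.inc` files follow a regular naming pattern, which the sketch below reproduces for a given problem name and instance count. Both helper names are hypothetical, and the declaration keeps the elided `(...)` argument list exactly as shown above rather than guessing the full signature.

```python
def declaration_lines(problem_name, count):
    """Lines for `{problem_name}.inc` (argument list left elided)."""
    return [
        f"std::tuple<bool, float, std::string> run_{problem_name}_{i}(...);"
        for i in range(count)
    ]


def call_lines(problem_name, count):
    """Lines for `{problem_name}_calls.inc`, one run_alg call per instance."""
    return [f"run_alg(run_{problem_name}_{i});" for i in range(count)]


name = "grouped_convolution_forward_tile_nhwgc_fp16"
print("\n".join(declaration_lines(name, 2)))
print("\n".join(call_lines(name, 2)))
```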

Integration with CK Profiler

Profiler Header: `grouped_convolution_forward_tile_algs.hpp`

This file orchestrates the benchmarking of all CK Tile instances.

Key Components

Include Generated Instances

#include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp32.inc"
#include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_bf16.inc"
#include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp16.inc"
// ... more includes

Benchmark Loop (run_grouped_conv_forward_tile_algs)

template <auto SIGNATURE>
std::tuple<bool, float, std::string> run_grouped_conv_forward_tile_algs(
    const ckt::Args<SIGNATURE>& args,
    const ckt::Inputs<SIGNATURE>& inputs,
    const ckt::Outputs<SIGNATURE>& outputs,
    const ck_tile::stream_config& s_conf)
{
    float best_avg_time = std::numeric_limits<float>::max();
    std::string best_op_name;
    bool valid = true;

    // Generate reference output
    auto reference = ckt::alloc_outputs(args);
    using ReferenceInstance = /* ... */;
    auto ref_conv = ReferenceInstance{};
    auto ref_result = ckt::run(ref_conv, args, inputs, reference.get());

    // Lambda to run and validate each instance
    auto run_alg = [&](auto&& run_alg_func) {
        auto [is_supported, avg_time, op_name] = run_alg_func(args, inputs, outputs, s_conf);
        if(is_supported) {
            best_avg_time = std::min(best_avg_time, avg_time);
            best_op_name = (best_avg_time < avg_time) ? best_op_name : op_name;

            // Validate correctness; a failure in any instance keeps `valid` false
            valid = ck_tile::check_err(outputs, reference, rtol, atol) && valid;

            std::cout << "Perf: " << avg_time << " ms, " << op_name << std::endl;
        }
    };

    // Run all instances based on signature
    if constexpr(SIGNATURE == SIGNATURE_NHWGC_FP16_FWD) {
        #include "../../experimental/grouped_convolution_tile_instances/grouped_convolution_forward_tile_nhwgc_fp16_calls.inc"
    }
    // ... more signature branches

    return std::make_tuple(valid, best_avg_time, best_op_name);
}

Directory Structure

projects/composablekernel/
├── experimental/
│   ├── builder/                           # CK Builder framework
│   │   ├── include/ck_tile/builder/       # Builder API
│   │   │   ├── conv_builder.hpp           # Main builder interface
│   │   │   ├── factory/                   # Dispatch to kernel implementations
│   │   │   └── reflect/                   # Instance traits and reflection
│   │   ├── test/                          # Builder tests and utilities
│   │   └── README.md                      # Builder documentation
│   │
│   └── grouped_convolution_tile_instances/ # Instance generation system
│       ├── generate_instances.py           # Main generation script
│       ├── CMakeLists.txt                  # Build configuration
│       ├── README.md                       # Brief overview
│       │
│       ├── configs/                        # Configuration files
│       │   ├── forward/
│       │   │   ├── profiler/               # All instances for profiling
│       │   │   │   ├── nhwgc_fp16.conf
│       │   │   │   ├── nhwgc_fp32.conf
│       │   │   │   ├── nhwgc_bf16.conf
│       │   │   │   ├── ndhwgc_fp16.conf
│       │   │   │   ├── ndhwgc_fp32.conf
│       │   │   │   └── ndhwgc_bf16.conf
│       │   │   └── tests/                  # Limited instances for testing
│       │   ├── backward_weight/
│       │   │   └── profiler/
│       │   └── backward_data/
│       │       └── profiler/
│       │
│       └── instances/                      # Generated C++ files
│           ├── instance_includes.inc       # Shared headers and signatures
│           ├── instance_run.inc            # Shared execution logic
│           ├── grouped_convolution_tile.cpp.in  # Template file
│           │
│           ├── forward/                    # Forward instances
│           │   ├── nhwgc_fp16/
│           │   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_0.cpp
│           │   │   ├── grouped_convolution_forward_tile_nhwgc_fp16_1.cpp
│           │   │   └── ...
│           │   ├── grouped_convolution_forward_tile_nhwgc_fp16.inc
│           │   ├── grouped_convolution_forward_tile_nhwgc_fp16_calls.inc
│           │   └── ...
│           │
│           └── backward_weight/            # Backward weight instances
│               └── ...
│
└── profiler/
    ├── include/profiler/
    │   ├── grouped_convolution_forward_tile_algs.hpp  # Profiler integration
    │   └── ...
    └── src/
        └── profile_grouped_conv_fwd.cpp    # Main profiler entry point

Usage

1: Generate All Instances for Profiling

cd projects/composablekernel/experimental/grouped_convolution_tile_instances

# Generate all forward, backward_weight, and backward_data instances
python generate_instances.py --mode profiler --direction all

Output:

Generates .cpp files for all instances
Creates .inc declaration and call files
Ready to compile with CMake

1.1: Generate Only Forward Instances for Testing

# Generate limited forward instances from test configs
python generate_instances.py --mode tests --direction forward

1.2: Filter Specific Instances

# Only generate instances matching "fp16"
python generate_instances.py \
    --mode profiler \
    --direction forward \
    --filter_pattern fp16

2: Compile the Generated Instances

cd build
cmake -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
      -D CMAKE_BUILD_TYPE=Release \
      -D GPU_TARGETS="gfx942" \
      -D CK_EXPERIMENTAL_BUILDER=ON \
      -D CMAKE_CXX_STANDARD=20 \
      -G Ninja \
      ..
      
ninja device_grouped_conv_fwd_tile_instances
ninja device_grouped_conv_bwd_weight_tile_instances

3: Run the Profiler

# Profile 2D convolution (NHWGC layout, FP16 data type)
# Positional arguments, in order (comments cannot follow a trailing backslash):
#   1 0 2          data_type=FP16, layout=NHWGC, spatial_dim=2
#   1 128 128 64   G=1, N=128, K=128, C=64
#   3 3            filter: 3x3
#   28 28          input: 28x28
#   1 1            stride: 1x1
#   1 1            dilation: 1x1
#   1 1 1 1        padding: 1,1,1,1
#   1 0 1          verification, initialization, profiling
./bin/ckProfiler conv fwd \
    1 0 2 \
    1 128 128 64 \
    3 3 \
    28 28 \
    1 1 \
    1 1 \
    1 1 1 1 \
    1 0 1

# Output shows performance of each instance and selects best kernel

Adding a New Instance Configuration

Edit config file: configs/forward/profiler/nhwgc_fp16.conf

DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3<256, 256, 256, 64, Default, 32, 32, 4, 4, 8, 8, 8, 1, 1, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v5>

Regenerate instances:

python generate_instances.py --mode profiler --direction forward --filter_pattern fp16

Rebuild:

ninja grouped_convolution_tile_instances

Profile:
```
./bin/ckProfiler conv fwd ...
```
