Ville Pietilä 60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK Dispatcher codegen (#8157)

## Motivation

The CK Tile Dispatcher code generation for the CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems:

- The JSON files are verbose.
- The JSON files easily get out of sync with the CK Builder .config
files from which they were generated.
- The JSON-file-based configuration makes it hard to explicitly list the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule-based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up-to-date configurations.

This PR introduces six different rule sets for the CK Tile Dispatcher
code generation:

1. `profiler`: matches the old JSON set of profiler configurations.
2. `tests`: matches the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection.
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which correspond to the existing
heuristic rules for configuration selection. This ensures that ML-based
kernel selection does not break.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that are allowed by the device architecture. The `full`
rule set defines the relevant tile sizes for each convolution direction.
From the tile size we have a curated mapping to the number of waves over
the different GEMM axes, i.e., we describe how many waves each GEMM
dimension corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This is mainly defined by the data type and the hardware since, in
general, we want to use the maximum possible load widths. The maximum
sizes for each convolution direction variant are defined by the implicit
GEMM matrix dimensions. For cases where we have a low number of channels
per convolution group, we need smaller vector load sizes. These are
captured by the `VecStrategy` enumeration in the codegen rules.
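
The fallback to smaller vector loads can be illustrated with a short sketch. This is not the `VecStrategy` logic from the codegen rules: the function name, the 16-byte maximum load width, and the divisibility rule are all assumptions made for illustration.

```python
def pick_vector_width(channels_per_group: int, elem_bytes: int,
                      max_load_bytes: int = 16) -> int:
    """Return the widest power-of-two vector (in elements) that divides the channel count.

    Hypothetical helper: the 16-byte default and the divisibility rule are
    illustrative assumptions, not values taken from the Dispatcher codegen.
    """
    width = max(1, max_load_bytes // elem_bytes)  # widest load for this data type
    while width > 1 and channels_per_group % width != 0:
        width //= 2  # fall back to the next smaller vector width
    return width


print(pick_vector_width(64, 2))  # 2-byte elements, 64 channels per group -> 8
print(pick_vector_width(4, 2))   # few channels per group -> 4
print(pick_vector_width(3, 2))   # odd channel count -> 1 (scalar loads)
```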

The problem with the rule-based configuration selection is that we
"over-generate" configurations. The old JSON configurations make up
approximately 25% of all configurations that the `full` rule set creates.
The additional configurations are valid, but they may not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule sets
for now to avoid building an excessive number of configurations by
default. The `full` rule set can be enabled by specifying the CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing behaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels; only the way the different instances are generated
changes.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in the CI/CD pipeline
pass, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00

Grouped Convolution Tile Engine

Benchmarking harness for grouped convolution kernels via the CK dispatcher's pipelined JIT compilation.

Covers all three variants -- forward, backward-data, backward-weight -- across the suffix-aware pipeline pool (compv3 / compv4 / compv5 / mem, intrawave / interwave, optional dsb / si suffixes) for 2D and 3D shapes.

This directory is purely a benchmarking and sweep tool. ML kernel-selection heuristics, training, and validation live in dispatcher/heuristics/ (see Related Documentation).

Directory Layout

grouped_conv/
  grouped_conv_full_benchmark.py     Orchestrator: enumerate kernels x problems, JIT compile, benchmark
  grouped_conv_instance_builder.py   Kernel enumeration from JSON trait config
  run_one_grouped_conv_kernel.py     Subprocess worker (one kernel, fresh GPU context)
  README.md                          This file
  configs/                           Kernel trait configurations
    forward_bf16.json                  Forward bf16 (compv3/v4/v5)
    bwd_data.json                      Backward data (compv3 / mem)
    bwd_weight.json                    Backward weight (compv3 / mem)
  problems/                          Problem datasets (registry keys consumed by --problems)
    forward_2d.py / forward_3d.py
    bwd_data_2d.py / bwd_data_3d.py
    bwd_weight_2d.py / bwd_weight_3d.py
    *_test_validation.py               Small unseen-shape subsets
    validation_holdout.py              VALIDATION_PROBLEMS (300 forward shapes)

Quick Start

# Count kernels matching a trait config without compiling
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --count-only

# List kernel names
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --list

# Smoke benchmark: forward 2D on the validation subset
python grouped_conv_full_benchmark.py \
  --variant forward \
  --problems forward_2d_test_validation \
  --workers 256 \
  --output sweep_forward_smoke.csv

# Full sweep: all forward kernels x all forward-2D problems
python grouped_conv_full_benchmark.py \
  --variant forward \
  --problems forward_2d \
  --workers 256 \
  --output sweep_forward_2d.csv

# Backward data / weight sweeps
python grouped_conv_full_benchmark.py --variant bwd_data   --problems bwd_data_2d   --output sweep_bwd_data.csv
python grouped_conv_full_benchmark.py --variant bwd_weight --problems bwd_weight_2d --output sweep_bwd_weight.csv

The benchmark always starts fresh and overwrites --output. Move or rename the file beforehand if you need to keep prior results.

How It Works

Kernel Enumeration

JSON trait config (variant + allowed pipelines / wave modes / suffixes)
  --> grouped_conv_instance_builder.py
    --> dispatcher/codegen/grouped_config_rules.py (tile + suffix-aware pool)
      --> list of GroupedConvKernelConfig
        --> optional --filter expression

The pipeline rules in dispatcher/codegen/grouped_config_rules.py are the single source of truth for the kernel pool (tile sizes, wave modes, pipeline variants, dsb / si suffixes). The instance builder reads a JSON trait allow-list and produces the cartesian product of legal configurations.
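
A minimal sketch of the allow-list expansion described above, assuming each trait maps to a `{"values": [...]}` entry as in the JSON config format. `KernelConfig` is a hypothetical stand-in for `GroupedConvKernelConfig`, and the legality rule in the example is made up for illustration; the real rules live in dispatcher/codegen/grouped_config_rules.py.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    """Hypothetical stand-in for GroupedConvKernelConfig (only four fields)."""
    data_type: str
    pipeline: str
    wave_mode: str
    ndim_spatial: int


def expand(trait_config, is_legal=lambda c: True):
    """Cartesian product over the allowed trait values, filtered by a legality rule."""
    keys = list(trait_config)
    value_lists = [trait_config[key]["values"] for key in keys]
    candidates = (KernelConfig(**dict(zip(keys, combo)))
                  for combo in itertools.product(*value_lists))
    return [c for c in candidates if is_legal(c)]


trait_config = {
    "data_type":    {"values": ["bf16"]},
    "pipeline":     {"values": ["compv3", "compv4", "compv5"]},
    "wave_mode":    {"values": ["intrawave", "interwave"]},
    "ndim_spatial": {"values": [2, 3]},
}

all_configs = expand(trait_config)
print(len(all_configs))  # 1 * 3 * 2 * 2 = 12

# Made-up rule for illustration only: drop compv4 + interwave combinations.
legal = expand(trait_config,
               lambda c: not (c.pipeline == "compv4" and c.wave_mode == "interwave"))
print(len(legal))  # 10
```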

Benchmark Pipeline

grouped_conv_full_benchmark.py (orchestrator)
  |-- grouped_conv_instance_builder.py    enumerate kernel configs
  |-- Build phase                          codegen -> hipcc -> link .so (serial; avoids fork + GPU init issues)
  '-- Benchmark phase                      one subprocess per kernel batch
        '-- run_one_grouped_conv_kernel.py
              '-- GpuGroupedConvRunner     fresh HIP context per problem

Key design choices:

  1. Subprocess isolation -- a fresh HIP context per kernel batch avoids cumulative driver/device leaks during long sweeps.
  2. Serial GPU access -- accurate timing, no contention.
  3. Path-only build in the main process -- the orchestrator never initializes the GPU runtime, so fork() after codegen is safe.
  4. Batch size ~20 kernels/subprocess -- empirically a good throughput/overhead tradeoff.

The --workers flag controls codegen/compile parallelism for the build phase. Benchmarking itself is serial per device.
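
The batching and subprocess isolation can be sketched as follows. The worker here is a stand-in child interpreter that only echoes kernel names; the command line of run_one_grouped_conv_kernel.py is not shown in this README, so nothing below should be read as its actual interface. The helper names are hypothetical.

```python
import subprocess
import sys


def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batch(kernel_names):
    """Run one batch in a fresh child process and collect its output lines.

    Stand-in worker: it only prints the names it was given. A real worker
    would create its own GPU context, time each kernel, and exit.
    """
    child_code = "import sys; print('\\n'.join(sys.argv[1:]))"
    result = subprocess.run(
        [sys.executable, "-c", child_code, *kernel_names],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


kernels = [f"kernel_{i}" for i in range(45)]
results = []
for batch in batches(kernels):          # batches run one after another (serial)
    results.extend(run_batch(batch))

print([len(b) for b in batches(kernels)])  # [20, 20, 5]
print(len(results))                        # 45
```

Because each batch runs in its own process, any state accumulated by the driver or runtime is discarded when the child exits.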

JSON Config Format

{
  "variant": "forward",
  "trait_config": {
    "data_type":    {"values": ["bf16"]},
    "pipeline":     {"values": ["compv3", "compv4", "compv5"]},
    "wave_mode":    {"values": ["intrawave", "interwave"]},
    "ndim_spatial": {"values": [2, 3]}
  }
}

Allowed keys mirror GroupedConvKernelConfig fields. See dispatcher/codegen/grouped_config_rules.py for the full schema.

Filtering examples

# Only large tiles on compv5
python grouped_conv_instance_builder.py configs/forward_bf16.json \
  --arch gfx950 \
  --filter "c.tile_n >= 128 and c.pipeline == 'compv5'" --list

# Export the resolved kernel list to JSON
python grouped_conv_instance_builder.py configs/forward_bf16.json \
  --arch gfx950 --export-json kernels.json
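
One plausible way to evaluate such a filter expression is to bind each config to the name `c` and evaluate the string as a Python expression. How the instance builder actually implements `--filter` is not shown here, so the sketch below is an assumption; `apply_filter` and the sample configs are hypothetical.

```python
from types import SimpleNamespace


def apply_filter(configs, expression):
    """Keep the configs for which `expression` is truthy with `c` bound to the config."""
    code = compile(expression, "<filter>", "eval")
    # Builtins are emptied to keep the expression simple; this is not a security sandbox.
    return [c for c in configs if eval(code, {"__builtins__": {}}, {"c": c})]


configs = [
    SimpleNamespace(tile_n=64,  pipeline="compv5"),
    SimpleNamespace(tile_n=128, pipeline="compv5"),
    SimpleNamespace(tile_n=256, pipeline="compv3"),
]

selected = apply_filter(configs, "c.tile_n >= 128 and c.pipeline == 'compv5'")
print([(c.tile_n, c.pipeline) for c in selected])  # [(128, 'compv5')]
```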

Problem Registry

--problems accepts only registry keys, not file paths. The keys are wired in grouped_conv_full_benchmark.py. Current keys:

Key                             Direction       Notes
forward_2d / forward_3d         forward         Full training-grade problem sets
bwd_data_2d / bwd_data_3d       backward data   Full training-grade problem sets
bwd_weight_2d / bwd_weight_3d   backward wgt    Full training-grade problem sets
*_test_validation               per direction   Small unseen-shape subsets
validation_holdout              forward         300 shapes (250 2D + 50 3D)

Adding a new subset requires both a problems/<name>.py file and a registry entry in grouped_conv_full_benchmark.py.

Each problem module exposes a list of dataclasses with fields N, C, K, G, Hi, Wi[, Di], Y, X[, Z], stride_h, stride_w[, stride_d], pad_h, pad_w[, pad_d] and optional dilation_*.
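
A sketch of what a 2D problem entry and a registry lookup could look like. The class name `Conv2dProblem`, the `PROBLEM_REGISTRY` dict, and the `tiny_2d` key are hypothetical; only the field names come from the description above. The output-size helper uses the standard convolution output formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv2dProblem:
    """Hypothetical 2D problem record using the field names listed above."""
    N: int
    C: int
    K: int
    G: int
    Hi: int
    Wi: int
    Y: int
    X: int
    stride_h: int = 1
    stride_w: int = 1
    pad_h: int = 0
    pad_w: int = 0
    dilation_h: int = 1
    dilation_w: int = 1

    def output_hw(self):
        """Standard convolution output size for height and width."""
        ho = (self.Hi + 2 * self.pad_h - self.dilation_h * (self.Y - 1) - 1) // self.stride_h + 1
        wo = (self.Wi + 2 * self.pad_w - self.dilation_w * (self.X - 1) - 1) // self.stride_w + 1
        return ho, wo


# Hypothetical registry: key -> list of problems, selected by name rather than file path.
PROBLEM_REGISTRY = {
    "tiny_2d": [
        Conv2dProblem(N=8, C=64, K=64, G=1, Hi=56, Wi=56, Y=3, X=3, pad_h=1, pad_w=1),
        Conv2dProblem(N=8, C=64, K=128, G=1, Hi=56, Wi=56, Y=3, X=3,
                      stride_h=2, stride_w=2, pad_h=1, pad_w=1),
    ],
}

for problem in PROBLEM_REGISTRY["tiny_2d"]:
    print(problem.output_hw())  # (56, 56) then (28, 28)
```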

Output CSV Schema

kernel, problem_idx, N, C, K, G, [Di,] Hi, Wi, [Z,] Y, X,
        [stride_d,] stride_h, stride_w,
        [pad_d,]    pad_h,    pad_w,
        latency_ms, tflops, non_zero

non_zero is a sanity flag (output checksum != 0). Failed launches are written with latency_ms=N/A and tflops=0.
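
A minimal sketch of post-processing such a CSV to find the fastest kernel per problem while skipping failed launches. It assumes the file has a header row with the column names above; the sample rows are invented, the shape columns are omitted for brevity, and the serialization of non_zero is a guess.

```python
import csv
import io

# Invented sample data; real sweeps also carry the problem shape columns.
SAMPLE = """kernel,problem_idx,latency_ms,tflops,non_zero
k_a,0,0.50,100.0,True
k_b,0,0.40,125.0,True
k_c,0,N/A,0,False
k_a,1,1.00,80.0,True
"""


def best_per_problem(rows):
    """Map problem_idx -> (kernel, latency_ms) for the lowest-latency successful run."""
    best = {}
    for row in rows:
        if row["latency_ms"] == "N/A":  # failed launch, no timing available
            continue
        idx = int(row["problem_idx"])
        latency = float(row["latency_ms"])
        if idx not in best or latency < best[idx][1]:
            best[idx] = (row["kernel"], latency)
    return best


best = best_per_problem(csv.DictReader(io.StringIO(SAMPLE)))
print(best)  # {0: ('k_b', 0.4), 1: ('k_a', 1.0)}
```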

Hardware

  • Validated on AMD Instinct MI355X (gfx950).
  • Datatypes: bf16 (primary), fp16, fp32.
  • Pipelines: compv3 / compv4 / compv5 (forward), compv3 / mem (backward).
  • Schedulers: intrawave, interwave (with optional dsb, si suffixes).

GPU access caveat (this host)

On the dev host the device files have non-default GIDs (/dev/kfd GID 506, /dev/dri/renderD144 GID 109). If hipMalloc returns code 100 (hipErrorOutOfMemory) on every allocation, it is a permissions issue, not VRAM exhaustion. Launch the benchmark via sudo -u sshuser bash -lc '...' so the process tree picks up kfdhost, renderhost, and video groups.
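
A quick way to check for the group mismatch before launching a sweep is to compare the device files' GIDs with the groups of the current process. This is a sketch using only the standard library on a Unix host; the function name is hypothetical, and it checks group membership only (not file mode bits or ownership).

```python
import os


def devices_missing_group(paths, process_gids=None, stat=os.stat):
    """Return {path: gid} for device files whose group the process does not belong to."""
    if process_gids is None:
        process_gids = set(os.getgroups()) | {os.getegid()}
    missing = {}
    for path in paths:
        try:
            gid = stat(path).st_gid
        except OSError:
            continue  # device node not present on this machine
        if gid not in process_gids:
            missing[path] = gid
    return missing


# An empty dict means the process is in the owning group of every device file found.
print(devices_missing_group(["/dev/kfd", "/dev/dri/renderD144"]))
```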

Related Documentation

Anything ML-heuristic-related has been moved out of this directory:

  • ML training pipeline & models: dispatcher/heuristics/README.md
  • ML vs oracle comparison & validation: dispatcher/heuristics/validation/grouped_conv/
    • validate_ml_vs_oracle.py -- run trained predictor over a problem set and compare against oracle CSVs produced by this harness.
    • compare_ml_vs_oracle.py -- post-hoc comparison of oracle + ML prediction CSVs (efficiency, top-k, scatter plot).
  • Dispatcher Python API: dispatcher/python/
  • End-to-end examples: dispatcher/examples/grouped_conv/