composable_kernel/dispatcher/examples/grouped_conv/python
Ville Pietilä 60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems:

- The JSON files are verbose.
- The JSON files easily get out of sync with the CK Builder .config
files from which they were generated.
- The JSON-file-based configuration makes it hard to explicitly list the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This PR introduces six different rule sets for the CK Tile Dispatcher
code generation:

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which correspond to the existing
heuristic rules for configuration selection. This ensures that ML-based
kernel selection doesn't break.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that are allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimension corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This is mainly defined by the data type and the hardware since, in general,
we want to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where we have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.
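
The two axes described above (tiling and vectorization) can be pictured as a Cartesian product over curated choices. The following is a minimal sketch under stated assumptions: only the name `VecStrategy` comes from this description, while its members, the tile sizes, the wave mapping, and the `Config` and `expand_rules` names are hypothetical placeholders, not the actual contents of the codegen rules.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class VecStrategy(Enum):
    # Hypothetical members; the real enumeration may differ.
    MAX_WIDTH = "max_width"        # widest load the data type and hardware allow
    SMALL_CHANNELS = "small_c"     # narrower loads for few channels per group


# Hypothetical curated mapping: (tile_m, tile_n) -> (waves_m, waves_n).
TILE_TO_WAVES = {
    (128, 128): (2, 2),
    (256, 128): (4, 1),
    (64, 64): (1, 1),
}

# Hypothetical tile sizes considered relevant for each convolution direction.
TILES_PER_DIRECTION = {
    "forward": [(128, 128), (256, 128)],
    "bwd_data": [(128, 128), (64, 64)],
    "bwd_weight": [(64, 64)],
}


@dataclass(frozen=True)
class Config:
    direction: str
    tile: tuple
    waves: tuple
    vec: VecStrategy


def expand_rules(direction):
    """Expand tiles x vectorization strategies into concrete configurations."""
    tiles = TILES_PER_DIRECTION[direction]
    return [
        Config(direction, tile, TILE_TO_WAVES[tile], vec)
        for tile, vec in product(tiles, VecStrategy)
    ]


configs = expand_rules("forward")
print(len(configs))  # 2 tiles x 2 strategies
```

Because every tile is paired with every strategy, the number of configurations grows multiplicatively with each added choice.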

The problem with the rule based configuration selection is that we "over
generate" configurations. The old JSON configurations make up
approximately 25% of all configurations that the `full` rule set creates.
The additional configurations are valid, but they may not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule sets
for now to avoid building an excessive number of configurations by default.
The `full` rule set can be enabled by specifying the CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing behaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels; only the way the different instances are generated
changes.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in the CI/CD pipeline
pass, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00

Grouped Convolution — Python Examples

Examples and test harnesses for the grouped convolution dispatcher (forward, bwd_data, bwd_weight) using the Python JIT codegen + hipcc workflow.

Run scripts from this directory:

cd dispatcher/examples/grouped_conv/python
python3 -u <script.py>          # use -u for unbuffered logs

GPU arch is auto-detected (detect_gpu_arch()); pass --arch gfx950 to override.
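
The override behaviour can be sketched as follows. This is a minimal illustration, not the scripts' actual argument handling: `detect_gpu_arch()` is replaced by a stand-in that returns a placeholder string, and `resolve_arch` is a hypothetical helper name.

```python
import argparse


def detect_gpu_arch():
    # Stand-in for the repository's detect_gpu_arch(); a real version would
    # query the GPU. The return value here is only a placeholder.
    return "detected-arch"


def resolve_arch(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--arch", default=None, help="override auto-detection")
    args = parser.parse_args(argv)
    # An explicit --arch wins; otherwise fall back to auto-detection.
    return args.arch if args.arch is not None else detect_gpu_arch()


print(resolve_arch(["--arch", "gfx950"]))  # explicit override
print(resolve_arch([]))                    # falls back to the stand-in
```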

Examples

| Script | Purpose |
| --- | --- |
| 01_basic_grouped_conv.py | End-to-end smoke test: build + run forward kernel, verify output. |
| 02_forward.py | Forward variant (NHWGC / GKYXC), small 2D problem. |
| 03_bwd_data.py | Backward-data variant. Runner contract: run(dY, W, prob). |
| 04_bwd_weight.py | Backward-weight variant. Runner contract: run(X, dY, prob). |
| 05_benchmark.py | Multi-kernel sweep + timing (slow; runs many configs). |
| 06_registry_json.py | Build a registry from a JSON config file. |
| 09_ml_heuristic.py | Demo of LightGBM heuristic (requires lightgbm); see ML heuristic below. |
| 10_test_all_pipelines.py | For each variant, test all 8 pipelines with intrawave. |
| 11_test_schedulers.py | For each variant, test all 8 pipelines × {intrawave, interwave}. |
| 12_test_config_options.py | Test the 5 config options (see Config-options harness below). |

Runner argument contract

runner.run(input_np, weight_np, prob) — order matters per variant:

| Variant | input_np | weight_np |
| --- | --- | --- |
| forward | X (NHWGC) | W (GKYXC) |
| bwd_data | dY | W |
| bwd_weight | X | dY |
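
To avoid mixing up the argument order, the mapping above can be encoded in a small helper. This is a sketch only: `runner_args` is a hypothetical name, not part of the dispatcher API, and the tensors are passed through untouched.

```python
def runner_args(variant, X=None, W=None, dY=None):
    """Return (input_np, weight_np) in the order runner.run expects."""
    order = {
        "forward": ("X", "W"),
        "bwd_data": ("dY", "W"),
        "bwd_weight": ("X", "dY"),
    }
    tensors = {"X": X, "W": W, "dY": dY}
    first, second = order[variant]
    if tensors[first] is None or tensors[second] is None:
        raise ValueError(f"{variant} needs {first} and {second}")
    return tensors[first], tensors[second]


# Intended use (runner and prob come from the example scripts):
#   input_np, weight_np = runner_args("bwd_data", W=w, dY=dy)
#   runner.run(input_np, weight_np, prob)
```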

Pipelines & schedulers

All 8 pipelines: basic_v1, mem, compv3, compv4, compv5, compv6, comp_async, basic_async_v1.

  • compv4 and comp_async require double_smem_buffer=True (loud static_assert otherwise).
  • Not every pipeline supports both intrawave and interwave. 11_test_schedulers.py treats a pipeline as supported if at least one scheduler runs successfully.
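
The sweep and the "at least one scheduler" rule can be sketched like this. The pipeline and scheduler names are taken from the list above; `build_cases` and `supported_pipelines` are hypothetical helpers, and setting double_smem_buffer to False for the other pipelines is an assumption made for illustration.

```python
PIPELINES = ["basic_v1", "mem", "compv3", "compv4", "compv5", "compv6",
             "comp_async", "basic_async_v1"]
SCHEDULERS = ["intrawave", "interwave"]
NEEDS_DOUBLE_SMEM = {"compv4", "comp_async"}


def build_cases():
    """Enumerate every pipeline x scheduler combination to try."""
    cases = []
    for pipeline in PIPELINES:
        for scheduler in SCHEDULERS:
            cases.append({
                "pipeline": pipeline,
                "scheduler": scheduler,
                # Required for compv4 / comp_async; False elsewhere is assumed.
                "double_smem_buffer": pipeline in NEEDS_DOUBLE_SMEM,
            })
    return cases


def supported_pipelines(results):
    """results maps (pipeline, scheduler) -> True if that run succeeded."""
    return {pipeline for (pipeline, _sched), ok in results.items() if ok}
```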

Config-options harness (12_test_config_options.py)

Verifies the 5 GroupedConvKernelConfig options:

  1. double_smem_buffer — LDS ping-pong (required for compv4 / comp_async).
  2. num_groups_to_merge — fuse groups into one tile.
  3. split_image — split spatial dims for large tensors.
  4. explicit_gemm — explicit GEMM path (experimental).
  5. two_stage — two-stage bwd_weight with fp32 workspace.

Each test runs in its own subprocess (--single-test '<json>' mode) so that a single GPU page fault doesn't take down the whole sweep; failing combinations are reported as CRASH and the run continues.
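
The isolation pattern can be sketched with the standard library. This is a minimal illustration under assumptions: the PASS / FAIL / TIMEOUT labels, the 300-second timeout, and the `classify`, `run_isolated`, and `sweep` helpers are hypothetical; only the --single-test '<json>' mode and the CRASH label come from the text above. The signal check relies on POSIX behaviour, where `subprocess` reports a child killed by a signal as a negative return code.

```python
import json
import subprocess
import sys


def classify(returncode):
    if returncode == 0:
        return "PASS"
    # On POSIX, a negative return code means the child was killed by a signal
    # (for example SIGSEGV after a GPU page fault).
    if returncode < 0:
        return "CRASH"
    return "FAIL"


def run_isolated(cmd, timeout=300):
    """Run one command in a child process and classify its outcome."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    return classify(proc.returncode)


def sweep(script, tests):
    """Run each test dict in its own subprocess; a crash never stops the loop."""
    results = {}
    for test in tests:
        payload = json.dumps(test, sort_keys=True)
        cmd = [sys.executable, "-u", script, "--single-test", payload]
        results[payload] = run_isolated(cmd)
    return results
```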

Test problem sizes are kept small (e.g. 2D: N=1, G=2, C=K=64, Hi=Wi=8, 3×3) to avoid OOM / aperture violations on the test GPU.

ML heuristic (09_ml_heuristic.py)

LightGBM regression model that predicts kernel TFLOPS and selects a kernel for a given problem. Requires the lightgbm Python package.

  • Models live in dispatcher/heuristics/models/grouped_conv_<variant>_bf16_<arch>/ (forward, bwd_data, bwd_weight all available).
  • Feature engine: dispatcher/heuristics/feature_engine_grouped_conv.py.
  • Training entry point: dispatcher/heuristics/train.py.
  • Prediction: dispatcher/heuristics/predict.py (use Predictor with GroupedConvFeatureEngine; build the candidate kernel pool from a training/holdout parquet via df["kernel_name"].unique()).
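
Selecting a kernel from predicted TFLOPS amounts to an argmax over the candidate pool. The sketch below assumes a plain callable in place of the trained model, since the `Predictor` interface is not shown here; `select_kernel`, `toy_predict`, the kernel names, and the scores are all invented for illustration. In practice the pool could come from df["kernel_name"].unique() as noted above.

```python
def select_kernel(problem, candidates, predict_tflops):
    """Return the candidate with the highest predicted TFLOPS and its score."""
    if not candidates:
        raise ValueError("candidate kernel pool is empty")
    scored = [(predict_tflops(problem, name), name) for name in candidates]
    best_score, best_name = max(scored)
    return best_name, best_score


def toy_predict(problem, kernel_name):
    # Toy stand-in for a trained regression model; scores are made up.
    table = {"kernel_a": 10.0, "kernel_b": 42.5, "kernel_c": 30.0}
    return table[kernel_name]


pool = ["kernel_a", "kernel_b", "kernel_c"]
print(select_kernel({"N": 1, "G": 2}, pool, toy_predict))
```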

Typical training flow:

# 1. Benchmark to CSV (slow)
cd tile_engine/ops/grouped_conv
python3 -u grouped_conv_full_benchmark.py configs/forward_bf16.json \
  --arch gfx950 --problems forward_training \
  --csv benchmark_forward_bf16_gfx950.csv --workers 8

# 2. CSV → Parquet
cd ../../../dispatcher/heuristics
python3 convert_csv_to_parquet.py \
  --input ../../tile_engine/ops/grouped_conv/benchmark_forward_bf16_gfx950.csv \
  --output data/grouped_conv_forward_bf16_gfx950.parquet --arch gfx950

# 3. Train
python3 train.py --data_dir data \
  --out_dir models/grouped_conv_forward_bf16_gfx950 \
  --op grouped_conv --dtype bf16 --arch gfx950 --targets tflops --n_splits 5

To add a new pipeline (e.g. compv6) update: dispatcher/codegen/grouped_config_rules.py (VARIANT_PIPELINES), dispatcher/heuristics/feature_engine_grouped_conv.py (add the is_<name> flag), and the relevant tile_engine/ops/grouped_conv/configs/*.json. Then re-run the benchmark + train flow above.
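
The is_<name> flag is presumably a one-hot indicator of the pipeline; that reading is an assumption, and the actual feature engine may encode pipelines differently. A minimal sketch with a hypothetical `pipeline_flags` helper:

```python
def pipeline_flags(pipeline, known_pipelines):
    """One-hot encode a pipeline name as is_<name> features."""
    if pipeline not in known_pipelines:
        raise ValueError(f"unknown pipeline: {pipeline}")
    return {f"is_{name}": int(name == pipeline) for name in known_pipelines}


print(pipeline_flags("compv6", ["mem", "compv3", "compv6"]))
```

Under this encoding, a pipeline missing from the known list has no flag of its own, which is why the feature engine needs updating before retraining.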

Notes

  • Use python3 -u for any long-running script so logs aren't buffered.
  • Kernels are compiled once and cached under /tmp/dispatcher/; subsequent runs reuse the cached .so.
  • The test machine has a single GPU, so do not run benchmarks in parallel.