[CK Tile] Rule-based configuration generation in CK Dispatcher codegen (#8157)

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on flat JSON files to list the generated configurations. This approach has the following problems:

- The JSON files are verbose.
- The JSON files easily get out of sync with the CK Builder .config files from which they were generated.
- The JSON-based configuration makes it hard to state explicitly the rules that govern instance generation.

## Technical Details

Replaced the JSON files with a rule-based configuration. To preserve the existing functionality, the `profiler` and the `tests` instance sets are generated directly from the CK Builder config files. The JSON config files are removed from source control, and the "on-the-fly" generation guarantees that the Dispatcher codegen uses up-to-date configurations.

This PR introduces six rule sets for the CK Tile Dispatcher code generation:

1. `profiler`: matches the old JSON set of profiler configurations.
2. `tests`: matches the old JSON set of tests configurations.
3. `full`: the full configuration set created from a rule-based config selection.
4. `full-tests`: a subset of `full` for generating configurations for convolution integration tests.
5. `tiny`: a subset of `full-tests` that produces the minimal set of configurations needed to test the Dispatcher codegen.
6. `default`: the default rules, which correspond to the existing heuristic rules for configuration selection. This ensures that ML-based kernel selection does not break.

The main use of the `full` rule set is to define a reasonable solution space for the possible implicit GEMM configurations. We start from the configurations that are allowed by the device architecture. The `full` rule set defines the relevant tile sizes for each convolution direction. From the tile size we have a curated mapping to the number of waves over the different GEMM axes, i.e., we describe how many waves each GEMM dimension corresponds to. The GEMM-K wave tile dimension can be computed from the other parameters and does not need to be listed explicitly.

An axis orthogonal to the tiling strategy is the vectorization strategy. This is mainly defined by the data type and the hardware, since in general we want to use the maximum possible load widths. The maximum sizes for each convolution direction variant are defined by the implicit GEMM matrix dimensions. For cases where we have a low number of channels per convolution group, we need smaller vector load sizes. These are captured by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule-based configuration selection is that we "over-generate" configurations. The old JSON configurations make up approximately 25% of all configurations that the `full` rule set creates. The additional configurations are valid, but they may not provide any performance benefit. Hence, we keep the `profiler` and `tests` rule sets for now to avoid building an excessive number of configurations by default. The `full` rule set can be enabled by specifying the CMake configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the `tests` rule set is used, i.e., we don't change the existing behaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher codegen rules are up to date. Otherwise the functionality is covered by the existing CI/CD tests. There are no functional changes to the convolution kernels, only to how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher codegen is generating valid code. If all tests in the CI/CD pipeline pass, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Grouped Convolution Tile Engine
Benchmarking harness for grouped convolution kernels via the CK dispatcher's pipelined JIT compilation.
Covers all three variants -- forward, backward-data, backward-weight -- across the suffix-aware pipeline pool (compv3 / compv4 / compv5 / mem, intrawave / interwave, optional dsb / si suffixes) for 2D and 3D shapes.
This directory is purely a benchmarking and sweep tool. ML kernel-selection heuristics, training, and validation live in dispatcher/heuristics/ (see Related Documentation).
Directory Layout
grouped_conv/
grouped_conv_full_benchmark.py Orchestrator: enumerate kernels x problems, JIT compile, benchmark
grouped_conv_instance_builder.py Kernel enumeration from JSON trait config
run_one_grouped_conv_kernel.py Subprocess worker (one kernel, fresh GPU context)
README.md This file
configs/ Kernel trait configurations
forward_bf16.json Forward bf16 (compv3/v4/v5)
bwd_data.json Backward data (compv3 / mem)
bwd_weight.json Backward weight (compv3 / mem)
problems/ Problem datasets (registry keys consumed by --problems)
forward_2d.py / forward_3d.py
bwd_data_2d.py / bwd_data_3d.py
bwd_weight_2d.py / bwd_weight_3d.py
*_test_validation.py Small unseen-shape subsets
validation_holdout.py VALIDATION_PROBLEMS (300 forward shapes)
Quick Start
# Count kernels matching a trait config without compiling
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --count-only
# List kernel names
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --list
# Smoke benchmark: forward 2D on the validation subset
python grouped_conv_full_benchmark.py \
--variant forward \
--problems forward_2d_test_validation \
--workers 256 \
--output sweep_forward_smoke.csv
# Full sweep: all forward kernels x all forward-2D problems
python grouped_conv_full_benchmark.py \
--variant forward \
--problems forward_2d \
--workers 256 \
--output sweep_forward_2d.csv
# Backward data / weight sweeps
python grouped_conv_full_benchmark.py --variant bwd_data --problems bwd_data_2d --output sweep_bwd_data.csv
python grouped_conv_full_benchmark.py --variant bwd_weight --problems bwd_weight_2d --output sweep_bwd_weight.csv
The benchmark always starts fresh and overwrites --output. Move or rename the file beforehand if you need to keep prior results.
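Because --output is overwritten, it can help to move an existing results file aside before launching a sweep. The following is a minimal sketch of one way to do that; `preserve_previous` is a hypothetical helper, not part of the harness, and the timestamp-suffix naming is an assumption chosen for illustration.

```python
import os
import tempfile
import time


def preserve_previous(path):
    """Rename an existing file to <name>.<mtime-stamp><ext>; return the new path or None."""
    if not os.path.exists(path):
        return None
    root, ext = os.path.splitext(path)
    # Use the file's last-modified time so the backup name reflects when it was produced.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(os.path.getmtime(path)))
    backup = f"{root}.{stamp}{ext}"
    os.replace(path, backup)
    return backup


# Demo on a throwaway file so the example is safe to run anywhere.
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "sweep_forward_2d.csv")
nothing_to_move = preserve_previous(demo_csv)  # file does not exist yet
with open(demo_csv, "w") as fh:
    fh.write("kernel,latency_ms\n")
backup_path = preserve_previous(demo_csv)
print(backup_path)
```

Calling the helper on the intended --output path before starting the benchmark keeps the earlier results under a dated name.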
How It Works
Kernel Enumeration
JSON trait config (variant + allowed pipelines / wave modes / suffixes)
--> grouped_conv_instance_builder.py
--> dispatcher/codegen/grouped_config_rules.py (tile + suffix-aware pool)
--> list of GroupedConvKernelConfig
--> optional --filter expression
The pipeline rules in dispatcher/codegen/grouped_config_rules.py are the single source of truth for the kernel pool (tile sizes, wave modes, pipeline variants, dsb / si suffixes). The instance builder reads a JSON trait allow-list and produces the cartesian product of legal configurations.
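As a rough illustration of the cartesian-product step, here is a minimal sketch that expands a trait allow-list of the shape this README uses for its JSON configs. `TraitCombo` and `expand_traits` are hypothetical names, not part of the dispatcher. The real builder additionally applies the tile and suffix rules from grouped_config_rules.py, which this sketch leaves out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TraitCombo:
    """Hypothetical stand-in for a few GroupedConvKernelConfig fields."""
    data_type: str
    pipeline: str
    wave_mode: str
    ndim_spatial: int


def expand_traits(trait_config):
    """Return every combination of the allowed values, one TraitCombo each."""
    keys = ("data_type", "pipeline", "wave_mode", "ndim_spatial")
    value_lists = [trait_config[key]["values"] for key in keys]
    return [TraitCombo(*values) for values in product(*value_lists)]


config = {
    "variant": "forward",
    "trait_config": {
        "data_type": {"values": ["bf16"]},
        "pipeline": {"values": ["compv3", "compv4", "compv5"]},
        "wave_mode": {"values": ["intrawave", "interwave"]},
        "ndim_spatial": {"values": [2, 3]},
    },
}

combos = expand_traits(config["trait_config"])
print(len(combos))  # 1 * 3 * 2 * 2 = 12 trait combinations
```

The count printed here covers trait combinations only; the number of kernels the builder reports will differ once tile sizes and legality checks are applied.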
Benchmark Pipeline
grouped_conv_full_benchmark.py (orchestrator)
|-- grouped_conv_instance_builder.py enumerate kernel configs
|-- Build phase codegen -> hipcc -> link .so (serial; avoids fork + GPU init issues)
'-- Benchmark phase one subprocess per kernel batch
'-- run_one_grouped_conv_kernel.py
'-- GpuGroupedConvRunner fresh HIP context per problem
Key design choices:
- Subprocess isolation -- a fresh HIP context per kernel batch avoids cumulative driver/device leaks during long sweeps.
- Serial GPU access -- accurate timing, no contention.
- Path-only build in the main process -- the orchestrator never initializes the GPU runtime, so fork() after codegen is safe.
- Batch size ~20 kernels/subprocess -- empirically a good throughput/overhead tradeoff.
The --workers flag controls codegen/compile parallelism for the build phase. Benchmarking itself is serial per device.
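The batching and subprocess-isolation idea can be sketched as follows. This is an illustration under stated assumptions, not the orchestrator's code: `chunked` and `run_in_batches` are hypothetical helpers, and the worker command is a placeholder that only counts its arguments, since the real command line of run_one_grouped_conv_kernel.py is not described here.

```python
import subprocess
import sys


def chunked(items, size=20):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(kernel_names, worker_cmd, batch_size=20):
    """Run worker_cmd once per batch, one batch at a time.

    Each batch gets its own process, so any state the worker builds up
    (for a real worker: its GPU context) is discarded when the process exits.
    Returns a list of (returncode, stdout) tuples, one per batch.
    """
    results = []
    for batch in chunked(kernel_names, batch_size):
        proc = subprocess.run(worker_cmd + list(batch), capture_output=True, text=True)
        results.append((proc.returncode, proc.stdout.strip()))
    return results


# Placeholder worker (an assumption): prints how many kernel names it received.
PLACEHOLDER_WORKER = [sys.executable, "-c", "import sys; print(len(sys.argv) - 1)"]

names = [f"kernel_{i}" for i in range(45)]
results = run_in_batches(names, PLACEHOLDER_WORKER)
print(results)
```

Running batches strictly one after another mirrors the serial GPU access described above, and a failed batch shows up as a non-zero return code without taking down the parent process.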
JSON Config Format
{
"variant": "forward",
"trait_config": {
"data_type": {"values": ["bf16"]},
"pipeline": {"values": ["compv3", "compv4", "compv5"]},
"wave_mode": {"values": ["intrawave", "interwave"]},
"ndim_spatial": {"values": [2, 3]}
}
}
Allowed keys mirror GroupedConvKernelConfig fields. See dispatcher/codegen/grouped_config_rules.py for the full schema.
Filtering examples
# Only large tiles on compv5
python grouped_conv_instance_builder.py configs/forward_bf16.json \
--arch gfx950 \
--filter "c.tile_n >= 128 and c.pipeline == 'compv5'" --list
# Export the resolved kernel list to JSON
python grouped_conv_instance_builder.py configs/forward_bf16.json \
--arch gfx950 --export-json kernels.json
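The filter strings above read like Python expressions over a config object named `c`. A minimal sketch of how such an expression could be applied is shown below. It assumes the expression is evaluated with Python's `eval`, which may not match the builder's actual implementation. `apply_filter` is a hypothetical helper, and the three sample configs are made up for the demonstration.

```python
from types import SimpleNamespace


def apply_filter(configs, expression):
    """Keep the configs for which `expression` (with `c` bound to the config) is truthy."""
    code = compile(expression, "<filter>", "eval")
    # Builtins are emptied to limit what the expression can reach; this is
    # not a security boundary, so only evaluate expressions you trust.
    return [c for c in configs if eval(code, {"__builtins__": {}}, {"c": c})]


# Made-up configs carrying only the two fields used by the example filter.
sample_configs = [
    SimpleNamespace(tile_n=64, pipeline="compv5"),
    SimpleNamespace(tile_n=128, pipeline="compv5"),
    SimpleNamespace(tile_n=256, pipeline="compv3"),
]

selected = apply_filter(sample_configs, "c.tile_n >= 128 and c.pipeline == 'compv5'")
print([(c.tile_n, c.pipeline) for c in selected])
```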
Problem Registry
--problems accepts only registry keys, not file paths. The keys are wired in grouped_conv_full_benchmark.py. Current keys:
| Key | Direction | Notes |
|---|---|---|
| forward_2d / forward_3d | forward | Full training-grade problem sets |
| bwd_data_2d / bwd_data_3d | backward data | Full training-grade problem sets |
| bwd_weight_2d / bwd_weight_3d | backward weight | Full training-grade problem sets |
| *_test_validation | per direction | Small unseen-shape subsets |
| validation_holdout | forward | 300 shapes (250 2D + 50 3D) |
Adding a new subset requires both a problems/<name>.py file and a registry entry in grouped_conv_full_benchmark.py.
Each problem module exposes a list of dataclasses with fields N, C, K, G, Hi, Wi[, Di], Y, X[, Z], stride_h, stride_w[, stride_d], pad_h, pad_w[, pad_d] and optional dilation_*.
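For orientation, a 2D problem entry with these fields might look like the sketch below. `Conv2dProblem` is a hypothetical name, the default values are assumptions, and `output_hw` uses the standard convolution output-size formula rather than anything taken from the problem modules themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv2dProblem:
    """Hypothetical 2D problem record using the field names listed above."""
    N: int
    C: int
    K: int
    G: int
    Hi: int
    Wi: int
    Y: int
    X: int
    stride_h: int = 1
    stride_w: int = 1
    pad_h: int = 0
    pad_w: int = 0
    dilation_h: int = 1
    dilation_w: int = 1

    def output_hw(self):
        """Output height and width from the standard convolution size formula."""
        ho = (self.Hi + 2 * self.pad_h - self.dilation_h * (self.Y - 1) - 1) // self.stride_h + 1
        wo = (self.Wi + 2 * self.pad_w - self.dilation_w * (self.X - 1) - 1) // self.stride_w + 1
        return ho, wo


same_size = Conv2dProblem(N=1, C=64, K=64, G=1, Hi=56, Wi=56, Y=3, X=3, pad_h=1, pad_w=1)
strided = Conv2dProblem(N=1, C=64, K=64, G=1, Hi=56, Wi=56, Y=3, X=3,
                        stride_h=2, stride_w=2, pad_h=1, pad_w=1)
print(same_size.output_hw())  # (56, 56)
print(strided.output_hw())    # (28, 28)
```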
Output CSV Schema
kernel, problem_idx, N, C, K, G, [Di,] Hi, Wi, [Z,] Y, X,
[stride_d,] stride_h, stride_w,
[pad_d,] pad_h, pad_w,
latency_ms, tflops, non_zero
non_zero is a sanity flag (output checksum != 0). Failed launches are written with latency_ms=N/A and tflops=0.
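A sweep CSV can be post-processed with the standard library alone. The sketch below picks the lowest-latency kernel per problem and skips failed launches. It assumes the file starts with a header row naming the columns above, and `best_kernel_per_problem` is a hypothetical helper. The sample rows are illustrative values, not measured results, and include only a subset of the columns.

```python
import csv
import io


def best_kernel_per_problem(csv_file):
    """Map problem_idx -> (kernel, latency_ms) for the fastest successful launch."""
    best = {}
    for row in csv.DictReader(csv_file):
        if row["latency_ms"] == "N/A":
            continue  # failed launch, no timing available
        idx = int(row["problem_idx"])
        latency = float(row["latency_ms"])
        if idx not in best or latency < best[idx][1]:
            best[idx] = (row["kernel"], latency)
    return best


# Illustrative rows only; kernel names and numbers are made up.
SAMPLE = """kernel,problem_idx,latency_ms,tflops,non_zero
k_a,0,0.50,100.0,True
k_b,0,0.40,125.0,True
k_c,0,N/A,0,False
k_a,1,1.00,80.0,True
"""

best = best_kernel_per_problem(io.StringIO(SAMPLE))
print(best)
```

For a real sweep, pass an open file handle (for example `open("sweep_forward_2d.csv", newline="")`) instead of the in-memory sample.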
Hardware
- Validated on AMD Instinct MI355X (gfx950).
- Datatypes: bf16 (primary), fp16, fp32.
- Pipelines: compv3 / compv4 / compv5 (forward), compv3 / mem (backward).
- Schedulers: intrawave, interwave (with optional dsb, si suffixes).
GPU access caveat (this host)
On the dev host the device files have non-default GIDs (/dev/kfd GID 506, /dev/dri/renderD144 GID 109). If hipMalloc returns code 100 (hipErrorOutOfMemory) on every allocation, it is a permissions issue, not VRAM exhaustion. Launch the benchmark via sudo -u sshuser bash -lc '...' so the process tree picks up kfdhost, renderhost, and video groups.
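Before suspecting the GPU, a quick check of device-file access from the current process can confirm whether permissions are the cause. This is a minimal sketch using only the standard library; `check_device_access` is a hypothetical helper, and the default paths are the ones named above for this host.

```python
import os


def check_device_access(paths=("/dev/kfd", "/dev/dri/renderD144")):
    """Return {path: True/False} for read+write access by the current process."""
    report = {}
    for path in paths:
        # A missing file and an inaccessible file are both reported as False.
        report[path] = os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)
    return report


report = check_device_access()
for path, ok in report.items():
    print(f"{path}: {'ok' if ok else 'no read/write access'}")

# Supplementary group IDs of this process, for comparison with the device GIDs.
print("process groups:", sorted(os.getgroups()))
```

If either path reports no access, compare the printed group IDs with the device-file GIDs listed above.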
Related Documentation
Anything ML-heuristic-related has been moved out of this directory:
- ML training pipeline & models: dispatcher/heuristics/README.md
- ML vs oracle comparison & validation: dispatcher/heuristics/validation/grouped_conv/
  - validate_ml_vs_oracle.py -- run trained predictor over a problem set and compare against oracle CSVs produced by this harness.
  - compare_ml_vs_oracle.py -- post-hoc comparison of oracle + ML prediction CSVs (efficiency, top-k, scatter plot).
- Dispatcher Python API: dispatcher/python/
- End-to-end examples: dispatcher/examples/grouped_conv/