mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 00:39:02 +00:00

Files

Yaswanth Raparti fe085f8a69 [rocm-libraries] ROCm/rocm-libraries#7761 (commit 237b766)

[CK][CK TILE] Clean up tile_engine grouped_conv harness
 (#7761)

## Motivation
Tile_engine grouped_conv contains ML heuristic validation scripts which
cause confusion to new developers. So, this PR is intended to relocate
the scripts into dispatcher/heuristic directory to maintain separation
of concern.

## Technical Details
The grouped_conv tile_engine directory is a benchmarking harness for
grouped convolution kernels; ML-heuristic content does not belong there.

- Move compare_ml_vs_oracle.py and validate_ml_vs_oracle.py from
tile_engine/ops/grouped_conv/ to
dispatcher/heuristics/validation/grouped_conv/, and rebase their
sys.path / oracle CSV / model dir lookups for the new location (CSV path
is now an --oracle-csv flag instead of a hard-coded sibling).
- Move GROUPED_CONV_HEURISTIC_REPORT.md (system-level ML report) into
dispatcher/heuristics/ where the rest of the heuristic docs live.
- Rewrite tile_engine/ops/grouped_conv/README.md as a pure benchmarking
/ dispatcher-sweep doc (kernel enumeration, JIT pipeline, CSV schema,
problem registry), in the style of tile_engine/ops/fmha/README.md. All
ML training / model-efficiency content is removed and replaced with a
pointer to dispatcher/heuristics/.

## Test Plan

Validation scripts are re-wired and tested locally

## Test Result

Tests passed on local machine.

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-05-29 17:09:29 +00:00

configs

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

2026-05-08 13:47:13 -07:00

problems

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

2026-05-08 13:47:13 -07:00

.gitignore

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

2026-05-08 13:47:13 -07:00

grouped_conv_full_benchmark.py

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

2026-05-08 13:47:13 -07:00

grouped_conv_instance_builder.py

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

2026-05-08 13:47:13 -07:00

README.md

[rocm-libraries] ROCm/rocm-libraries#7761 (commit 237b766)

2026-05-29 17:09:29 +00:00

run_one_grouped_conv_kernel.py

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

2026-05-08 13:47:13 -07:00

README.md

Grouped Convolution Tile Engine

Benchmarking harness for grouped convolution kernels via the CK dispatcher's pipelined JIT compilation.

Covers all three variants -- forward, backward-data, backward-weight -- across the suffix-aware pipeline pool (compv3 / compv4 / compv5 / mem, intrawave / interwave, optional dsb / si suffixes) for 2D and 3D shapes.

This directory is purely a benchmarking and sweep tool. ML kernel-selection heuristics, training, and validation live in dispatcher/heuristics/ (see Related Documentation).

Directory Layout

grouped_conv/
  grouped_conv_full_benchmark.py     Orchestrator: enumerate kernels x problems, JIT compile, benchmark
  grouped_conv_instance_builder.py   Kernel enumeration from JSON trait config
  run_one_grouped_conv_kernel.py     Subprocess worker (one kernel, fresh GPU context)
  README.md                          This file
  configs/                           Kernel trait configurations
    forward_bf16.json                  Forward bf16 (compv3/v4/v5)
    bwd_data.json                      Backward data (compv3 / mem)
    bwd_weight.json                    Backward weight (compv3 / mem)
  problems/                          Problem datasets (registry keys consumed by --problems)
    forward_2d.py / forward_3d.py
    bwd_data_2d.py / bwd_data_3d.py
    bwd_weight_2d.py / bwd_weight_3d.py
    *_test_validation.py               Small unseen-shape subsets
    validation_holdout.py              VALIDATION_PROBLEMS (300 forward shapes)

Quick Start

# Count kernels matching a trait config without compiling
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --count-only

# List kernel names
python grouped_conv_instance_builder.py configs/forward_bf16.json --arch gfx950 --list

# Smoke benchmark: forward 2D on the validation subset
python grouped_conv_full_benchmark.py \
  --variant forward \
  --problems forward_2d_test_validation \
  --workers 256 \
  --output sweep_forward_smoke.csv

# Full sweep: all forward kernels x all forward-2D problems
python grouped_conv_full_benchmark.py \
  --variant forward \
  --problems forward_2d \
  --workers 256 \
  --output sweep_forward_2d.csv

# Backward data / weight sweeps
python grouped_conv_full_benchmark.py --variant bwd_data   --problems bwd_data_2d   --output sweep_bwd_data.csv
python grouped_conv_full_benchmark.py --variant bwd_weight --problems bwd_weight_2d --output sweep_bwd_weight.csv

The benchmark always starts fresh and overwrites --output. Move or rename the file beforehand if you need to keep prior results.

How It Works

Kernel Enumeration

JSON trait config (variant + allowed pipelines / wave modes / suffixes)
  --> grouped_conv_instance_builder.py
    --> dispatcher/codegen/grouped_config_rules.py (tile + suffix-aware pool)
      --> list of GroupedConvKernelConfig
        --> optional --filter expression

The pipeline rules in dispatcher/codegen/grouped_config_rules.py are the single source of truth for the kernel pool (tile sizes, wave modes, pipeline variants, dsb / si suffixes). The instance builder reads a JSON trait allow-list and produces the cartesian product of legal configurations.

Benchmark Pipeline

grouped_conv_full_benchmark.py (orchestrator)
  |-- grouped_conv_instance_builder.py    enumerate kernel configs
  |-- Build phase                          codegen -> hipcc -> link .so (serial; avoids fork + GPU init issues)
  '-- Benchmark phase                      one subprocess per kernel batch
        '-- run_one_grouped_conv_kernel.py
              '-- GpuGroupedConvRunner     fresh HIP context per problem

Key design choices:

Subprocess isolation -- a fresh HIP context per kernel batch avoids cumulative driver/device leaks during long sweeps.
Serial GPU access -- accurate timing, no contention.
Path-only build in the main process -- the orchestrator never initializes the GPU runtime, so fork() after codegen is safe.
Batch size ~20 kernels/subprocess -- empirically a good throughput/overhead tradeoff.

The --workers flag controls codegen/compile parallelism for the build phase. Benchmarking itself is serial per device.

JSON Config Format

{
  "variant": "forward",
  "trait_config": {
    "data_type":    {"values": ["bf16"]},
    "pipeline":     {"values": ["compv3", "compv4", "compv5"]},
    "wave_mode":    {"values": ["intrawave", "interwave"]},
    "ndim_spatial": {"values": [2, 3]}
  }
}

Allowed keys mirror GroupedConvKernelConfig fields. See dispatcher/codegen/grouped_config_rules.py for the full schema.

Filtering examples

# Only large tiles on compv5
python grouped_conv_instance_builder.py configs/forward_bf16.json \
  --arch gfx950 \
  --filter "c.tile_n >= 128 and c.pipeline == 'compv5'" --list

# Export the resolved kernel list to JSON
python grouped_conv_instance_builder.py configs/forward_bf16.json \
  --arch gfx950 --export-json kernels.json

Problem Registry

--problems accepts only registry keys, not file paths. The keys are wired in grouped_conv_full_benchmark.py. Current keys:

Key	Direction	Notes
`forward_2d` / `forward_3d`	forward	Full training-grade problem sets
`bwd_data_2d` / `bwd_data_3d`	backward data	Full training-grade problem sets
`bwd_weight_2d` / `bwd_weight_3d`	backward wgt	Full training-grade problem sets
`*_test_validation`	per direction	Small unseen-shape subsets
`validation_holdout`	forward	300 shapes (250 2D + 50 3D)

Adding a new subset requires both a problems/<name>.py file and a registry entry in grouped_conv_full_benchmark.py.

Each problem module exposes a list of dataclasses with fields N, C, K, G, Hi, Wi[, Di], Y, X[, Z], stride_h, stride_w[, stride_d], pad_h, pad_w[, pad_d] and optional dilation_*.

Output CSV Schema

kernel, problem_idx, N, C, K, G, [Di,] Hi, Wi, [Z,] Y, X,
        [stride_d,] stride_h, stride_w,
        [pad_d,]    pad_h,    pad_w,
        latency_ms, tflops, non_zero

non_zero is a sanity flag (output checksum != 0). Failed launches are written with latency_ms=N/A and tflops=0.

Hardware

Validated on AMD Instinct MI355X (gfx950).
Datatypes: bf16 (primary), fp16, fp32.
Pipelines: compv3 / compv4 / compv5 (forward), compv3 / mem (backward).
Schedulers: intrawave, interwave (with optional dsb, si suffixes).

GPU access caveat (this host)

On the dev host the device files have non-default GIDs (/dev/kfd GID 506, /dev/dri/renderD144 GID 109). If hipMalloc returns code 100 (hipErrorOutOfMemory) on every allocation, it is a permissions issue, not VRAM exhaustion. Launch the benchmark via sudo -u sshuser bash -lc '...' so the process tree picks up kfdhost, renderhost, and video groups.

Anything ML-heuristic-related has been moved out of this directory:

ML training pipeline & models: dispatcher/heuristics/README.md
ML vs oracle comparison & validation: dispatcher/heuristics/validation/grouped_conv/
- validate_ml_vs_oracle.py -- run trained predictor over a problem set and compare against oracle CSVs produced by this harness.
- compare_ml_vs_oracle.py -- post-hoc comparison of oracle + ML prediction CSVs (efficiency, top-k, scatter plot).
Dispatcher Python API: dispatcher/python/
End-to-end examples: dispatcher/examples/grouped_conv/