# [CK][CK TILE] Dispatcher kernel selection heuristic for grouped conv (#6327)

## Motivation
The ML heuristic in the dispatcher does not yet support the grouped-conv operator.
This PR adds support for fwd, bwd-data, and bwd-weight grouped-conv kernels. A
tile_engine utility has also been added to compile and run any selected kernel
configuration through the dispatcher infrastructure.

## Technical Details

1. A tile_engine utility is added to benchmark each shape with all possible
kernel + tile_size combinations:
   https://github.com/ROCm/rocm-libraries/blob/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/tile_engine/ops/grouped_conv/grouped_conv_full_benchmark.py
2. New LGBM regressor models for grouped conv are added to the models
directory; there are three separate models for fwd, bwd-data, and bwd-weights:
   https://github.com/ROCm/rocm-libraries/tree/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/dispatcher/heuristics/models
3. Implemented lazy GPU initialization (dispatcher/python); see the sketch after this list.
   - **Issue**: ProcessPoolExecutor fork() + GPU context caused memory access faults
   - **Solution**: Mirror the FMHA pattern and defer GPU initialization until the first run()
   - **Changes**:
     - setup_multiple_grouped_conv_dispatchers() returns List[Path] instead of loaded libs
     - GpuGroupedConvRunner.__init__() no longer calls ctypes.CDLL
     - Added _ensure_initialized() method for lazy GPU loading
     - GPU context created only on first run() call
   - **Benefit**: Parallel compilation now works without GPU conflicts
4. Addressed a few miscellaneous issues:
   - Fixed a BF16->FP16 naming bug in the dispatcher wrapper
   - Added new tile sizes and the comp_v5 pipeline to the arch spec to expand kernel selection
   - Added automatic padding support for unsupported shapes in the dispatcher runner
   - Created a single source of truth between tile_engine and dispatcher for architecture and tile_size details
   - Built validation scripts that compare oracle_best vs. ml_heuristic selections
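
A minimal sketch of the lazy-initialization pattern from item 3. The class and method names are the ones mentioned above; the bodies are illustrative, not the actual dispatcher code:

```python
import ctypes
from pathlib import Path

class GpuGroupedConvRunner:
    """Illustrative runner: hold only the library path until first use."""

    def __init__(self, lib_path: Path):
        # No ctypes.CDLL here, which keeps the object fork-safe for
        # ProcessPoolExecutor workers that never touch the GPU.
        self._lib_path = lib_path
        self._lib = None

    def _ensure_initialized(self):
        # Load the compiled .so (and implicitly create the GPU context)
        # only when a kernel is actually about to run.
        if self._lib is None:
            self._lib = ctypes.CDLL(str(self._lib_path))

    def run(self, input_np, weight_np, prob):
        self._ensure_initialized()
        # ... launch the kernel through self._lib (details omitted) ...
```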

## Test Plan

1. Validated fwd, bwd-data, and bwd-weight kernels with both known and
unseen data sets of up to 300 problems.
2. Ensured that test cases are added in both dispatcher and tile_engine
to validate the heuristic.

## Test Result
Results on unseen shapes, validated on gfx950.
#### Forward Pass Model
- **Training Data**: 48,845 measurements across 1,372 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.05%**
  - Median Efficiency: **96.8%**
  - P10 Efficiency: **79.9%**

#### Backward Data Gradient (bwd_data) Model
- **Training Data**: 18,773 measurements across 891 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.8%**
  - Median Efficiency: **96.5%**
  - P10 Efficiency: **82.9%**

#### Backward Weight Gradient (bwd_weight) Model
- **Training Data**: 34,900 measurements across 1,508 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **96.1%**
  - Median Efficiency: **99.2%**
  - P10 Efficiency: **89.4%**
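
Efficiency here is assumed to be the ML-selected kernel's measured TFLOPS divided by the oracle-best kernel's TFLOPS for the same problem; the file and column names below are illustrative:

```python
import pandas as pd

# One row per validation problem: measured TFLOPS of the kernel picked by the
# ML heuristic and of the oracle-best kernel for that problem.
df = pd.read_csv("validation_results.csv")  # hypothetical results file
eff = df["ml_heuristic_tflops"] / df["oracle_best_tflops"]

print(f"Mean efficiency:   {eff.mean():.1%}")
print(f"Median efficiency: {eff.median():.1%}")
print(f"P10 efficiency:    {eff.quantile(0.10):.1%}")
```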

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
# Grouped Convolution — Python Examples
Examples and test harnesses for the grouped convolution dispatcher (forward,
bwd_data, bwd_weight) using the Python JIT codegen + hipcc workflow.
Run scripts from this directory:
```bash
cd dispatcher/examples/grouped_conv/python
python3 -u <script.py> # use -u for unbuffered logs
```
GPU arch is auto-detected (`detect_gpu_arch()`); pass `--arch gfx950` to override.
## Examples
| Script | Purpose |
|---|---|
| `01_basic_grouped_conv.py` | End-to-end smoke test: build + run forward kernel, verify output. |
| `02_forward.py` | Forward variant (NHWGC / GKYXC), small 2D problem. |
| `03_bwd_data.py` | Backward-data variant. Runner contract: `run(dY, W, prob)`. |
| `04_bwd_weight.py` | Backward-weight variant. Runner contract: `run(X, dY, prob)`. |
| `05_benchmark.py` | Multi-kernel sweep + timing (slow; runs many configs). |
| `06_registry_json.py` | Build a registry from a JSON config file. |
| `09_ml_heuristic.py` | Demo of LightGBM heuristic (requires `lightgbm`); see *ML heuristic* below. |
| `10_test_all_pipelines.py` | For each variant, test all 8 pipelines with `intrawave`. |
| `11_test_schedulers.py` | For each variant, test all 8 pipelines × {intrawave, interwave}. |
| `12_test_config_options.py` | Test the 5 config options (see *Config-options harness* below). |
## Runner argument contract
`runner.run(input_np, weight_np, prob)` — order matters per variant:
| Variant | `input_np` | `weight_np` |
|---|---|---|
| `forward` | `X` (NHWGC) | `W` (GKYXC) |
| `bwd_data` | `dY` | `W` |
| `bwd_weight` | `X` | `dY` |
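A minimal setup sketch for the contract above. Array layouts follow the table; the `dY` layout and the omitted runner/`prob` construction are assumptions, so see the example scripts for the real setup:

```python
import numpy as np

# Small 2D problem used in these examples: N=1, G=2, C=K=64, Hi=Wi=8, 3x3 filter.
N, G, C, K, Hi, Wi, Y, X = 1, 2, 64, 64, 8, 8, 3, 3

x = np.random.randn(N, Hi, Wi, G, C).astype(np.float32)   # X in NHWGC layout
w = np.random.randn(G, K, Y, X, C).astype(np.float32)     # W in GKYXC layout
# dY shape/layout assumed NHWGK with stride 1, no padding (Ho = Hi - Y + 1).
dy = np.random.randn(N, Hi - Y + 1, Wi - X + 1, G, K).astype(np.float32)

# Argument order per variant (runner and prob construction omitted; see the
# example scripts, e.g. 02_forward.py):
#   forward:    runner.run(x,  w,  prob)
#   bwd_data:   runner.run(dy, w,  prob)
#   bwd_weight: runner.run(x,  dy, prob)
```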
## Pipelines & schedulers
All 8 pipelines: `basic_v1, mem, compv3, compv4, compv5, compv6, comp_async,
basic_async_v1`.
* `compv4` and `comp_async` require `double_smem_buffer=True` (loud
`static_assert` otherwise).
* Not every pipeline supports both `intrawave` and `interwave`. `11_test_schedulers.py`
treats a pipeline as supported if **at least one** scheduler runs successfully.
## Config-options harness (`12_test_config_options.py`)
Verifies the 5 `GroupedConvKernelConfig` options:
1. `double_smem_buffer` — LDS ping-pong (required for compv4 / comp_async).
2. `num_groups_to_merge` — fuse groups into one tile.
3. `split_image` — split spatial dims for large tensors.
4. `explicit_gemm` — explicit GEMM path (experimental).
5. `two_stage` — two-stage bwd_weight with fp32 workspace.
Each test is run in its **own subprocess** (`--single-test '<json>'` mode) so a
single GPU page fault doesn't take down the whole sweep; failing combinations
are reported as `CRASH` and the run continues.
Test problem sizes are kept small (e.g. 2D: `N=1, G=2, C=K=64, Hi=Wi=8, 3×3`)
to avoid OOM / aperture violations on the test GPU.
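A minimal sketch of that isolation pattern, assuming a JSON payload along these lines (the real harness defines its own schema):

```python
import json
import subprocess
import sys

# One candidate combination; the keys mirror the GroupedConvKernelConfig
# options listed above, but the exact JSON schema the harness expects may differ.
test = {"pipeline": "compv4", "double_smem_buffer": True}

# Launch the single test in its own interpreter so a GPU page fault only kills
# the child process, not the whole sweep.
result = subprocess.run(
    [sys.executable, "12_test_config_options.py", "--single-test", json.dumps(test)],
    capture_output=True,
    text=True,
)
print("CRASH" if result.returncode != 0 else "OK", test)
```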
## ML heuristic (`09_ml_heuristic.py`)
LightGBM regression model that predicts kernel TFLOPS and selects a kernel for
a given problem. Requires the `lightgbm` Python package.
* Models live in `dispatcher/heuristics/models/grouped_conv_<variant>_bf16_<arch>/`
(forward, bwd_data, bwd_weight all available).
* Feature engine: `dispatcher/heuristics/feature_engine_grouped_conv.py`.
* Training entry point: `dispatcher/heuristics/train.py`.
* Prediction: `dispatcher/heuristics/predict.py` (use `Predictor` with
`GroupedConvFeatureEngine`; build the candidate kernel pool from a
training/holdout parquet via `df["kernel_name"].unique()`).
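An illustrative prediction sketch. The imports, parquet path, and `df["kernel_name"].unique()` pool follow the notes above, but the `Predictor` constructor arguments, the `predict()` call, and the problem fields are assumptions; check `predict.py` for the actual interface:

```python
import pandas as pd

# These imports assume the working directory is dispatcher/heuristics.
from feature_engine_grouped_conv import GroupedConvFeatureEngine
from predict import Predictor

# Candidate kernel pool taken from a training/holdout parquet (step 2 below).
df = pd.read_parquet("data/grouped_conv_forward_bf16_gfx950.parquet")
kernel_pool = df["kernel_name"].unique().tolist()

# Hypothetical problem description; the real field names are whatever the
# feature engine expects.
problem = {"N": 1, "G": 2, "C": 64, "K": 64, "Hi": 8, "Wi": 8, "Y": 3, "X": 3}

# Hypothetical wiring; not the verified interface.
predictor = Predictor(
    model_dir="models/grouped_conv_forward_bf16_gfx950",
    feature_engine=GroupedConvFeatureEngine(),
    kernels=kernel_pool,
)
best_kernel = predictor.predict(problem)
print(best_kernel)
```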
Typical training flow:
```bash
# 1. Benchmark to CSV (slow)
cd tile_engine/ops/grouped_conv
python3 -u grouped_conv_full_benchmark.py configs/forward_bf16.json \
--arch gfx950 --problems forward_training \
--csv benchmark_forward_bf16_gfx950.csv --workers 8
# 2. CSV → Parquet
cd ../../../dispatcher/heuristics
python3 convert_csv_to_parquet.py \
--input ../../tile_engine/ops/grouped_conv/benchmark_forward_bf16_gfx950.csv \
--output data/grouped_conv_forward_bf16_gfx950.parquet --arch gfx950
# 3. Train
python3 train.py --data_dir data \
--out_dir models/grouped_conv_forward_bf16_gfx950 \
--op grouped_conv --dtype bf16 --arch gfx950 --targets tflops --n_splits 5
```
To add a new pipeline (e.g. `compv6`), update:
`dispatcher/codegen/grouped_config_rules.py` (`VARIANT_PIPELINES`),
`dispatcher/heuristics/feature_engine_grouped_conv.py` (add the `is_<name>`
flag), and the relevant `tile_engine/ops/grouped_conv/configs/*.json`. Then
re-run the benchmark + train flow above.
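A hypothetical illustration of what the `is_<name>` flag amounts to in the feature engine (the real feature layout in `feature_engine_grouped_conv.py` may differ):

```python
# The 8 pipeline names come from the list above; the one-hot encoding below is
# an assumption about how is_<name> flags are fed to the model.
PIPELINES = [
    "basic_v1", "mem", "compv3", "compv4", "compv5", "compv6",
    "comp_async", "basic_async_v1",
]

def pipeline_features(pipeline: str) -> dict:
    """One-hot encode the pipeline choice as is_<name> features."""
    return {f"is_{p}": int(p == pipeline) for p in PIPELINES}

print(pipeline_features("compv6"))
```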
## Notes
* Use `python3 -u` for any long-running script so logs aren't buffered.
* Kernels are compiled once and cached under `/tmp/dispatcher/`; subsequent
runs reuse the cached `.so`.
* The test machine has a single GPU — do not run benchmarks in parallel.