composable_kernel

ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Commit Graph

Author

SHA1

Message

Date

Yaswanth Raparti

fe085f8a69

[rocm-libraries] ROCm/rocm-libraries#7761 (commit 237b766)

[CK][CK TILE] Clean up tile_engine grouped_conv harness
 (#7761)

## Motivation
Tile_engine grouped_conv contains ML heuristic validation scripts that
confuse new developers. This PR relocates the scripts into the
dispatcher/heuristics directory to maintain separation of concerns.

## Technical Details
The grouped_conv tile_engine directory is a benchmarking harness for
grouped convolution kernels; ML-heuristic content does not belong there.

- Move compare_ml_vs_oracle.py and validate_ml_vs_oracle.py from
tile_engine/ops/grouped_conv/ to
dispatcher/heuristics/validation/grouped_conv/, and rebase their
sys.path / oracle CSV / model dir lookups for the new location (CSV path
is now an --oracle-csv flag instead of a hard-coded sibling).
- Move GROUPED_CONV_HEURISTIC_REPORT.md (system-level ML report) into
dispatcher/heuristics/ where the rest of the heuristic docs live.
- Rewrite tile_engine/ops/grouped_conv/README.md as a pure benchmarking
/ dispatcher-sweep doc (kernel enumeration, JIT pipeline, CSV schema,
problem registry), in the style of tile_engine/ops/fmha/README.md. All
ML training / model-efficiency content is removed and replaced with a
pointer to dispatcher/heuristics/.
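
The switch from a hard-coded sibling CSV to an `--oracle-csv` flag can be sketched as below. This is a minimal illustration, not the code of the relocated scripts: only the flag name comes from the description above, while `parse_args`, the `required=True` choice, and the example path are assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse options for a hypothetical validation script."""
    parser = argparse.ArgumentParser(
        description="Compare ML-heuristic picks against oracle results (illustrative)."
    )
    # The flag name comes from the PR text; making it required is an
    # assumption of this sketch, not a documented behavior of the scripts.
    parser.add_argument(
        "--oracle-csv",
        type=Path,
        required=True,
        help="Path to the oracle benchmark CSV.",
    )
    return parser.parse_args(argv)


# Passing argv explicitly keeps the example independent of the real command line.
args = parse_args(["--oracle-csv", "results/oracle.csv"])
print(f"Reading oracle results from {args.oracle_csv}")
```

Taking the path as an argument means the script no longer depends on where it sits relative to the benchmark output.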

## Test Plan

Validation scripts are re-wired and tested locally.

## Test Result

Tests passed on local machine.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-05-29 17:09:29 +00:00

Yaswanth Raparti

017dca1b9d

[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)

[CK][CK TILE] Dispatcher kernel selection heuristic for grouped conv (#6327)

## Motivation
The ML heuristic in the dispatcher does not yet support the grouped-conv
operator. This PR adds support for fwd, bwd-data, and bwd-weight
grouped-conv kernels. It also adds a tile_engine utility to compile and
run any selected kernel configuration through the dispatcher
infrastructure.

## Technical Details

1. Added a tile_engine utility that benchmarks each shape with all
possible kernel+tile_size combinations:
https://github.com/ROCm/rocm-libraries/blob/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/tile_engine/ops/grouped_conv/grouped_conv_full_benchmark.py
2. Added new LGBM regressor models for grouped conv to the models
directory, with 3 separate models for fwd, bwd-data, and bwd-weight:
https://github.com/ROCm/rocm-libraries/tree/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/dispatcher/heuristics/models
3. Implemented lazy GPU initialization (dispatcher/python)
  - **Issue**: ProcessPoolExecutor fork() + GPU context caused memory
    access faults
  - **Solution**: Mirror FMHA pattern - defer GPU initialization until
    first run()
  - **Changes**:
    - setup_multiple_grouped_conv_dispatchers() returns List[Path], not
      loaded libs
    - GpuGroupedConvRunner.__init__() no longer calls ctypes.CDLL
    - Added _ensure_initialized() method for lazy GPU loading
    - GPU context created only on first run() call
  - **Benefit**: Parallel compilation now works without GPU conflicts
4. Addressed a few miscellaneous issues:
  - Fixed BF16->FP16 naming bug in the dispatcher wrapper
  - Added new tile sizes and the comp_v5 pipeline to the arch spec to
    expand the kernel selection
  - Added automatic padding support for unsupported shapes in the
    dispatcher runner
  - Created a single source of truth between tile_engine and dispatcher
    for the architecture and tile_size details
  - Built validation scripts to compare oracle_best against ml_heuristic
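
The lazy-initialization pattern from item 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `GpuGroupedConvRunner`: the class `LazyRunner`, the injected `loader` callable, and `fake_loader` are hypothetical names, and the loader stands in for a `ctypes.CDLL` call so the example runs without a GPU.

```python
from pathlib import Path
from typing import Callable, Optional


class LazyRunner:
    """Hypothetical runner that defers loading its library until first use."""

    def __init__(self, lib_path: Path, loader: Callable[[Path], object]):
        # Only record what is needed. Nothing is loaded here, so the object
        # can be constructed before worker processes are forked.
        self._lib_path = lib_path
        self._loader = loader
        self._lib: Optional[object] = None

    def _ensure_initialized(self) -> None:
        # Load on first use only; later calls reuse the cached handle.
        if self._lib is None:
            self._lib = self._loader(self._lib_path)

    def run(self) -> object:
        self._ensure_initialized()
        return self._lib


load_calls = []


def fake_loader(path: Path) -> str:
    # Stand-in for ctypes.CDLL(str(path)); records each call for inspection.
    load_calls.append(path)
    return f"loaded:{path.name}"


runner = LazyRunner(Path("libconv.so"), fake_loader)
print(len(load_calls))  # no load has happened at construction time
runner.run()
runner.run()
print(len(load_calls))  # the loader ran once, on the first run() call
```

Because construction has no side effects, a parent process can build many runners, hand compilation to a process pool, and leave each process to create its own GPU context the first time it calls `run()`.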

## Test Plan

1. Validated fwd, bwd-data, and bwd-weight kernels with both known and
unseen data sets with up to 300 problems.
2. Ensured that test cases are added in both dispatcher and tile_engine
to validate the heuristic.

## Test Result
Results on unseen shapes, validated on gfx950:
#### Forward Pass Model
- **Training Data**: 48,845 measurements across 1,372 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.05%**
  - Median Efficiency: **96.8%**
  - P10 Efficiency: **79.9%**

#### Backward Data Gradient (bwd_data) Model
- **Training Data**: 18,773 measurements across 891 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.8%**
  - Median Efficiency: **96.5%**
  - P10 Efficiency: **82.9%**

#### Backward Weight Gradient (bwd_weight) Model
- **Training Data**: 34,900 measurements across 1,508 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **96.1%**
  - Median Efficiency: **99.2%**
  - P10 Efficiency: **89.4%**
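
One way such summary statistics could be computed is sketched below. The definition of efficiency used here (oracle-best time divided by the time of the ML-selected kernel), the nearest-rank percentile, and all names and sample numbers are assumptions for illustration; the PR text does not spell out the exact formula.

```python
import math
import statistics


def efficiency(oracle_time_ms: float, selected_time_ms: float) -> float:
    """Assumed metric: 1.0 means the heuristic picked the oracle-best kernel."""
    return oracle_time_ms / selected_time_ms


def summarize(efficiencies):
    """Return mean, median, and nearest-rank 10th percentile."""
    ordered = sorted(efficiencies)
    # Nearest-rank P10: the smallest value with at least 10% of samples
    # at or below it.
    rank = max(1, math.ceil(0.10 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p10": ordered[rank - 1],
    }


# Made-up (oracle_time_ms, selected_time_ms) pairs, not measured data.
samples = [(1.0, 1.0), (2.0, 2.5), (1.0, 2.0), (3.0, 3.0)]
stats = summarize(efficiency(o, s) for o, s in samples)
print({name: round(value, 3) for name, value in stats.items()})
```

P10 is a useful companion to the mean because it shows how the heuristic behaves on its worst tenth of problems rather than on the typical case.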

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Jan Patrick Lehr <JanPatrick.Lehr@amd.com>

2026-05-08 13:47:13 -07:00

2 Commits