3 Commits

Author SHA1 Message Date
Thrupti Raj Lakshmana Gowda
d7609923b6 [rocm-libraries] ROCm/rocm-libraries#7919 (commit 061001d)
Users/tlakshma/ck/tile engine develop

## Motivation

This PR adds multiple new GPU kernel benchmarking operations to the CK
Tile Engine, expanding its coverage of GEMM-family operations:

- **gemm_multi_abd**: GEMM with multiple A, B, and D tensors, enabling
epilogue patterns such as scale/bias fusion.
- **batched_contraction**: Batched tensor contraction supporting
multi-dimensional batch (G), M, N, and K dimensions, targeting workloads
where the contraction indices span more than one logical axis.
- **mx_gemm**: MX-format GEMM with microscaling (e8m0) scale tensors.
- **gemm_rowcolquant**: Block-scale GEMM with row/column quantization.
- **gemm_tensor_quant**: Block-scale GEMM with tensor quantization.
- **grouped_gemm_rowcolquant**: Grouped GEMM with row/column
quantization.
- **grouped_gemm_tensorquant**: Grouped GEMM with tensor quantization.
- **batched_gemm**: Batched GEMM benchmarking support.

## Technical Details

### gemm_multi_abd

  - New subdirectory: tile_engine/ops/gemm/gemm_multi_abd/
- CMakeLists.txt follows the same individual-target pattern as
gemm_universal / gemm_multi_d.
- gemm_multi_abd_instance_builder.py subclasses GemmKernelBuilder from
the shared gemm_instance_builder.py.
- gemm_multi_abd_benchmark.py delegates to the shared GemmBenchmark
parent class.
- Configs: default_config.json, default_ci_config.json,
user_provided_config.json.
  - Supported GPU targets: gfx90a, gfx942, gfx950, gfx1201.

### batched_contraction

  - New subdirectory: tile_engine/ops/gemm/batched_contraction/
- Extends GemmKernelBuilder via BatchedContractionKernelBuilder, adding
num_dim_g, num_dim_m, num_dim_n, num_dim_k, num_d_tensors, and
elementwise_function parameters.
  - Layout string uses 3-character encoding (A+B+E), e.g. rcr.
- Self-contained benchmark sweep driver
(batched_contraction_benchmark.py) with JSON/CSV export and best-kernel
selection.
  - Supported GPU targets: gfx90a, gfx942, gfx950.

### mx_gemm

  - New subdirectory: tile_engine/ops/gemm/mx_gemm/
  - Supports MX-format (e8m0) microscaling for A and B scale tensors.

### block_scale_gemm (gemm_rowcolquant, gemm_tensor_quant)

  - New subdirectory: tile_engine/ops/gemm/block_scale_gemm/
  - gemm_rowcolquant: row/column quantization epilogue.
  - gemm_tensor_quant: tensor-level quantization epilogue.

### grouped_gemm_quant (grouped_gemm_rowcolquant,
grouped_gemm_tensorquant)

  - New subdirectory: tile_engine/ops/gemm/grouped_gemm_quant/
  - grouped_gemm_rowcolquant: grouped GEMM with row/column quantization.
  - grouped_gemm_tensorquant: grouped GEMM with tensor quantization.

### batched_gemm

  - New subdirectory: tile_engine/ops/gemm/batched_gemm/
- Batched GEMM benchmark support wired into the sampling/active-op
lists.

All new ops are registered in op_weights.json for budget allocation and
wired into the active-op sampling lists in CMakeLists.txt.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-11 20:38:38 +00:00
Thrupti Raj Lakshmana Gowda
054436ca4a [rocm-libraries] ROCm/rocm-libraries#8079 (commit cf1e8f2)
[tile_engine] Integrate gemm_streamk into budget-based
 sampling system (#8079)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

`gemm_streamk` was the only GEMM op not participating in the tile
engine's budget-based sampling system. Without a budget cap, it would
always generate its full feasible set, making build times unpredictable
and inconsistent with the other ops.

## Technical Details

- **CMake budget propagation** (`ops/gemm/CMakeLists.txt`): Added
`gemm_streamk` to the active-ops detection loop so it receives a share
of the sampling budget. Because `gemm_streamk` lives in a sibling
subdirectory (`ops/gemm_streamk/`), its allocation is written via `CACHE
STRING "" FORCE` to make the variable visible across the CMake directory
boundary.

- **Per-combo budget division** (`ops/gemm_streamk/CMakeLists.txt`,
`ops/gemm/grouped_gemm/CMakeLists.txt`): Added the same per-combo
`MAX_INSTANCES` division that exists in `gemm_universal` and
`gemm_preshuffle`. The total budget is divided by `n_datatypes ×
n_layouts` before the inner `foreach` loop so that sampling fires
independently per `(dtype, layout)` combo rather than acting as a single
global cap.

- **Sampling integration** (`gemm_streamk_instance_builder.py`): Added
`_apply_sampling()` method to `GemmKernelBuilder`, mirroring the
Sobol+LHS+maximin sampling used by other ops. New constructor
parameters: `gpu_target`, `max_instances`, `seed`, `tier`,
`manifest_path`. New CLI arguments: `--gpu_target`, `--max-instances`,
`--seed`, `--tier`, `--manifest-path`. The `--gpu_target` argument is
now also forwarded on the `--list_kernels` invocation.

- **`GEMM_STREAMK_AXES`** (`sampling/feasible_set.py`): Defined as
`GEMM_AXES + ["reduction_strategy"]` to account for the extra axis
unique to stream-K. Added `reduction_strategy` to `CATEGORICAL_AXES`.

- **Weight rebalancing** (`sampling/op_weights.json`): Allocated 10%
weight to `gemm_streamk` by proportionally reducing `gemm_universal`
(0.35 → 0.30) and `gemm_preshuffle` (0.30 → 0.25). Total remains 1.00.

## Test Plan

- Configure with `TILE_ENGINE_SAMPLING_TIER=daily` and verify that
`gemm_streamk` receives a non-zero budget allocation and that
`GEMM_STREAMK_MAX_INSTANCES` is set correctly.
- Configure with `TILE_ENGINE_SAMPLING_TIER=daily` across multiple
`(dtype, layout)` combos and confirm per-combo budget = total /
n_combos.
- Configure with `-DGEMM_STREAMK_MAX_INSTANCES=50` explicit override and
verify the override is respected (budget allocation skipped).
- Verify `chosen_instances.json` manifest is written to the working path
when tier is active.
- Confirm `op_weights.json` weights still sum to 1.00.

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-05 17:06:11 +00:00
Thrupti Raj Lakshmana Gowda
c31fc4df52 [rocm-libraries] ROCm/rocm-libraries#7311 (commit 79d8cae)
[CK Tile Engine] Daily tier sampling for tile engine GEMM  (#7311)

Summary
- Replace uniform random instance sampling (random.shuffle) with
scrambled Sobol + Latin Hypercube + maximin space-filling
sampling, per the Tile Engine Benchmark Sampling RFC
- Add op-weighted budget allocation via new
TILE_ENGINE_SAMPLING_TIER=daily CMake knob that auto-distributes 8,000
instances across
ops proportional to registered weights in op_weights.json
  - Emit chosen_instances.json manifests for reproducibility tracking
- Consolidate 5 copies of sampling logic into single _apply_sampling()
method on the base class
Jenkinsfile changes
Replace per-op -D *_MAX_INSTANCES=250 with single -D
TILE_ENGINE_SAMPLING_TIER=daily in gfx942/gfx950/gfx1201 stages. Budget
  auto-distributes (8000 total per GPU target).

---------

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
2026-05-21 02:17:42 -05:00