composable_kernel

ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-02 13:17:36 +00:00

Commit Graph

Author	SHA1	Message	Date

Thrupti Raj Lakshmana Gowda

d7609923b6

[rocm-libraries] ROCm/rocm-libraries#7919 (commit 061001d)

Users/tlakshma/ck/tile engine develop

## Motivation

This PR adds multiple new GPU kernel benchmarking operations to the CK
Tile Engine, expanding its coverage of GEMM-family operations:

- **gemm_multi_abd**: GEMM with multiple A, B, and D tensors, enabling
epilogue patterns such as scale/bias fusion.
- **batched_contraction**: Batched tensor contraction supporting
multi-dimensional batch (G), M, N, and K dimensions, targeting workloads
where the contraction indices span more than one logical axis.
- **mx_gemm**: MX-format GEMM with microscaling (e8m0) scale tensors.
- **gemm_rowcolquant**: Block-scale GEMM with row/column quantization.
- **gemm_tensor_quant**: Block-scale GEMM with tensor quantization.
- **grouped_gemm_rowcolquant**: Grouped GEMM with row/column
quantization.
- **grouped_gemm_tensorquant**: Grouped GEMM with tensor quantization.
- **batched_gemm**: Batched GEMM benchmarking support.

## Technical Details

### gemm_multi_abd

- New subdirectory: tile_engine/ops/gemm/gemm_multi_abd/
- CMakeLists.txt follows the same individual-target pattern as
gemm_universal / gemm_multi_d.
- gemm_multi_abd_instance_builder.py subclasses GemmKernelBuilder from
the shared gemm_instance_builder.py.
- gemm_multi_abd_benchmark.py delegates to the shared GemmBenchmark
parent class.
- Configs: default_config.json, default_ci_config.json,
user_provided_config.json.
- Supported GPU targets: gfx90a, gfx942, gfx950, gfx1201.
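
To make the "multiple A, B, and D tensors" idea concrete, here is a minimal reference sketch in plain Python. It is not the Tile Engine kernel or its API. The function names are hypothetical, and the combination rules (elementwise multiply for the A and B inputs, elementwise add for the D inputs) are assumptions chosen to illustrate a scale/bias fusion pattern.

```python
# Reference semantics sketch for a multi-A/B/D GEMM on small Python lists.
# Assumptions (not from the PR): A tensors and B tensors are each combined by
# elementwise multiplication, and D tensors are added to the product as bias.

def elementwise(op, tensors):
    """Fold same-shaped 2-D lists together elementwise with a binary op."""
    result = [row[:] for row in tensors[0]]
    for t in tensors[1:]:
        result = [[op(x, y) for x, y in zip(r1, r2)]
                  for r1, r2 in zip(result, t)]
    return result


def matmul(a, b):
    """Plain (M x K) by (K x N) matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gemm_multi_abd_reference(a_list, b_list, d_list):
    """E = (A0 * A1 * ...) @ (B0 * B1 * ...) + D0 + D1 + ... (hypothetical)."""
    a = elementwise(lambda x, y: x * y, a_list)
    b = elementwise(lambda x, y: x * y, b_list)
    c = matmul(a, b)
    return elementwise(lambda x, y: x + y, [c] + d_list)


a0 = [[1.0, 2.0], [3.0, 4.0]]
a_scale = [[2.0, 2.0], [2.0, 2.0]]   # second A tensor acting as a scale
b0 = [[1.0, 0.0], [0.0, 1.0]]        # identity, so the product equals scaled A
bias = [[0.5, 0.5], [0.5, 0.5]]      # single D tensor acting as a bias
e = gemm_multi_abd_reference([a0, a_scale], [b0], [bias])
print(e)
```

A reference like this is mainly useful for checking a fused kernel's output on small shapes.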

### batched_contraction

- New subdirectory: tile_engine/ops/gemm/batched_contraction/
- Extends GemmKernelBuilder via BatchedContractionKernelBuilder, adding
num_dim_g, num_dim_m, num_dim_n, num_dim_k, num_d_tensors, and
elementwise_function parameters.
- Layout string uses 3-character encoding (A+B+E), e.g. rcr.
- Self-contained benchmark sweep driver
(batched_contraction_benchmark.py) with JSON/CSV export and best-kernel
selection.
- Supported GPU targets: gfx90a, gfx942, gfx950.
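
The layout string and the num_dim_* parameters can be illustrated with a short sketch. The helper names below are hypothetical and are not taken from BatchedContractionKernelBuilder. The sketch assumes that `r` means row-major and `c` means column-major, and that each group of G, M, N, or K dimensions can be treated as one flattened extent.

```python
from math import prod

# Assumed meaning of the layout characters (not confirmed by the PR text).
LAYOUT_NAMES = {"r": "row_major", "c": "col_major"}


def parse_layout(layout):
    """Split a 3-character layout string into per-tensor layouts (A, B, E)."""
    if len(layout) != 3 or any(ch not in LAYOUT_NAMES for ch in layout):
        raise ValueError(f"expected 3 characters from 'r'/'c', got {layout!r}")
    return dict(zip(("A", "B", "E"), (LAYOUT_NAMES[ch] for ch in layout)))


def flatten_dims(g_dims, m_dims, n_dims, k_dims):
    """Collapse multi-dimensional G/M/N/K index groups into four extents."""
    return {"G": prod(g_dims), "M": prod(m_dims),
            "N": prod(n_dims), "K": prod(k_dims)}


layouts = parse_layout("rcr")
# Here num_dim_g = 2, num_dim_m = 2, num_dim_n = 1, num_dim_k = 2.
sizes = flatten_dims(g_dims=(2, 3), m_dims=(4, 8), n_dims=(16,), k_dims=(8, 8))
print(layouts)
print(sizes)
```

Under these assumptions, a contraction with several batch and contraction axes has the same arithmetic cost as a batched GEMM with the flattened G, M, N, and K sizes.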

### mx_gemm

- New subdirectory: tile_engine/ops/gemm/mx_gemm/
- Supports MX-format (e8m0) microscaling for A and B scale tensors.
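
For readers unfamiliar with e8m0, the sketch below decodes one scale byte as defined by the OCP Microscaling (MX) format: an 8-bit biased exponent with no sign or mantissa bits, where the value is 2^(byte - 127) and 0xFF encodes NaN. The function names are hypothetical and this is not the mx_gemm kernel code. It only shows how a shared scale applies to a block of elements.

```python
import math

E8M0_BIAS = 127    # exponent bias of the e8m0 scale format
E8M0_NAN = 0xFF    # the all-ones encoding is reserved for NaN


def decode_e8m0(byte):
    """Decode one e8m0 scale byte into a float power of two."""
    if not 0 <= byte <= 0xFF:
        raise ValueError(f"e8m0 value must fit in one byte, got {byte}")
    if byte == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, byte - E8M0_BIAS)   # 1.0 * 2**(byte - 127)


def apply_block_scale(values, scale_byte):
    """Multiply every element of one block by its shared e8m0 scale."""
    scale = decode_e8m0(scale_byte)
    return [v * scale for v in values]


print(decode_e8m0(127))                      # scale of exactly 1.0
print(apply_block_scale([1.0, -3.0], 129))   # block scaled by 2**2
```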

### block_scale_gemm (gemm_rowcolquant, gemm_tensor_quant)

- New subdirectory: tile_engine/ops/gemm/block_scale_gemm/
- gemm_rowcolquant: row/column quantization epilogue.
- gemm_tensor_quant: tensor-level quantization epilogue.
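
The difference between the two epilogues can be sketched as follows. This sketch rests on an assumption that the PR text does not spell out: row/column quantization uses one scale per row of A and one per column of B, while tensor quantization uses a single scale per input tensor. The function names are hypothetical.

```python
def dequant_rowcol(acc, row_scales, col_scales):
    """Scale accumulator element (i, j) by row_scales[i] * col_scales[j]."""
    return [[acc[i][j] * row_scales[i] * col_scales[j]
             for j in range(len(col_scales))]
            for i in range(len(row_scales))]


def dequant_tensor(acc, a_scale, b_scale):
    """Scale every accumulator element by one combined per-tensor scale."""
    scale = a_scale * b_scale
    return [[v * scale for v in row] for row in acc]


acc = [[10, 20], [30, 40]]   # integer accumulators from a quantized GEMM
rowcol = dequant_rowcol(acc, row_scales=[0.5, 0.25], col_scales=[2.0, 4.0])
tensor = dequant_tensor(acc, a_scale=0.5, b_scale=2.0)
print(rowcol)
print(tensor)
```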

### grouped_gemm_quant (grouped_gemm_rowcolquant, grouped_gemm_tensorquant)

- New subdirectory: tile_engine/ops/gemm/grouped_gemm_quant/
- grouped_gemm_rowcolquant: grouped GEMM with row/column quantization.
- grouped_gemm_tensorquant: grouped GEMM with tensor quantization.

### batched_gemm

- New subdirectory: tile_engine/ops/gemm/batched_gemm/
- Batched GEMM benchmark support wired into the sampling/active-op
lists.

All new ops are registered in op_weights.json for budget allocation and
wired into the active-op sampling lists in CMakeLists.txt.
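
As an illustration of weight-based budget allocation, the sketch below splits a fixed kernel budget across ops in proportion to their weights. The JSON schema (a flat op-to-weight mapping), the weight values, and the largest-remainder rounding are all assumptions made for this example. The actual op_weights.json format and the allocation logic in CMakeLists.txt may differ.

```python
import json


def allocate_budget(weights, total):
    """Split `total` across ops proportionally to weight (largest remainder)."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {op: total * w / weight_sum for op, w in weights.items()}
    alloc = {op: int(share) for op, share in exact.items()}   # floor each share
    leftover = total - sum(alloc.values())
    # Hand the remaining units to the ops with the largest fractional parts.
    by_remainder = sorted(weights, key=lambda op: exact[op] - alloc[op],
                          reverse=True)
    for op in by_remainder[:leftover]:
        alloc[op] += 1
    return alloc


# Hypothetical weights; the op names come from this PR, the numbers do not.
weights = json.loads(
    '{"gemm_multi_abd": 2, "batched_contraction": 1, "mx_gemm": 1}'
)
budget = allocate_budget(weights, total=10)
print(budget)
```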

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-06-11 20:38:38 +00:00
