mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 13:11:25 +00:00


Grouped CShuffle GEMM

This folder contains an example of Grouped GEMM using the ck_tile tile-programming implementation. It currently supports only the basic features of CK Tile GEMM, but it includes placeholders for future support of other GEMM pipelines and GEMM modules. The remaining GEMM features will be migrated gradually from the old CK to CK Tile.

build

# in the root of ck_tile
mkdir build && cd build
# replace <arch> with the target architecture (for example gfx90a or gfx942), or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# build the grouped GEMM example, which uses the basic GEMM pipeline
make tile_example_grouped_gemm -j

This produces the executable build/bin/tile_example_grouped_gemm.

example

args:
   -a_layout    Tensor A layout (default:R)
   -b_layout    Tensor B layout (default:R)
   -c_layout    Tensor C layout (default:R)
          -v    0. No validation, 1. Validation on CPU
     -warmup    number of iterations before benchmark the kernel (default:10)
     -repeat    number of iterations to benchmark the kernel (default:100)
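
A sample invocation using the flags listed above may look like the sketch below. It assumes the command is run from the build directory and that arguments are passed in the -name=value form used by other ck_tile examples; the values shown are the listed defaults plus CPU validation. The BIN and CMD variables are illustrative helpers, not part of the example itself.

```shell
#!/bin/sh
# Path to the executable produced by the build step (relative to build/).
BIN=./bin/tile_example_grouped_gemm

# Assemble the command line from the documented flags.
# Assumption: flags use the -name=value form.
CMD="$BIN -a_layout=R -b_layout=R -c_layout=R -v=1 -warmup=10 -repeat=100"

# Show the command, then run it only if the binary has been built.
echo "$CMD"
if [ -x "$BIN" ]; then
    $CMD
else
    echo "binary not found: $BIN (build it first)" >&2
fi
```

Setting -v=0 skips the CPU validation, which can be useful when only the benchmark timing is of interest.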