mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 09:16:52 +00:00

linqunAMD 9fcc1ee9fd Support Wave32 in CK_TILE - Part 1 (#2594)

* Support wave32/wave64 in CK_TILE - Part 1

* remove blocksize in kernel launch

* fix build error

* fix clang format

* fix clang format 2

* fix clang format 3

* fix fmha build error

* fix fmha build 2

* fix fmha build 3

* fix build error 4

* address review comment

* update change log

* replace KernelBlockSize with kBlockSize

* fix CI fail

* fix clang format

* address review comment and rebase code.

* fix universal test fail

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

2025-08-18 10:08:31 -07:00

| File | Last commit | Date |
| --- | --- | --- |
| batched_gemm.cpp | Support Wave32 in CK_TILE - Part 1 (#2594) | 2025-08-18 10:08:31 -07:00 |
| batched_gemm.hpp | [CK_TILE] Multiple-D GEMM example (#2219) | 2025-06-13 19:39:11 +02:00 |
| CMakeLists.txt | Revert "Add ck tile examples to package (#1880)" (#2150) | 2025-04-30 10:20:16 -07:00 |
| README.md | Ck tile batched gemm example (#1615) | 2024-11-29 11:52:18 +01:00 |
| run_batched_gemm_example.inc | [CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520) | 2025-07-24 20:39:56 +02:00 |

README.md

Batched GEMM

This folder contains an example of batched GEMM built on the ck_tile tile-programming implementation.

build

# in the root of the composable_kernel repository (script/cmake-ck-dev.sh is resolved relative to it)
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh  ../ <arch>
make tile_example_batched_gemm -j

This produces the executable build/bin/tile_example_batched_gemm.

example

args:
              -m     m dimension (default:256)
              -n     n dimension (default:128)
              -k     k dimension (default:128)
       -a_layout     A tensor data layout (default:R) (R for Row, C for Col)
       -b_layout     B tensor data layout (default:R) (R for Row, C for Col)
       -c_layout     C tensor data layout (default:R) (R for Row, C for Col)
       -stride_a     Tensor A stride (default:128)
       -stride_b     Tensor B stride (default:128)
       -stride_c     Tensor C stride (default:128)
 -batch_stride_a     Batch A stride (default:32768)
 -batch_stride_b     Batch B stride (default:16384)
 -batch_stride_c     Batch C stride (default:32768)
    -batch_count     Batch count (default:16)
              -v     0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
              -e     Absolute error tolerance (default:1e-5)
           -prec     data type. fp16/bf16/fp8/bf8 (default:fp16)
         -warmup     number of iterations before benchmark the kernel (default:10)
         -repeat     number of iterations to benchmark the kernel (default:100)
          -timer     gpu:gpu timer, cpu:cpu timer (default:gpu)
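
The default strides listed above are consistent with densely packed row-major tensors: with m=256, n=128, k=128, the A tensor (m x k) has stride 128 and batch stride 256 * 128 = 32768, and the B tensor (k x n) has batch stride 128 * 128 = 16384. The following is a minimal host-side sketch of that relationship together with a naive CPU reference for batched GEMM. It does not use the ck_tile API; `Layout`, `TensorDesc`, `make_packed_desc`, `offset`, and `reference_batched_gemm` are hypothetical names introduced only for illustration, and `float` stands in for the fp16/bf16/fp8/bf8 types the example supports.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; not part of ck_tile.
enum class Layout { Row, Col };

struct TensorDesc
{
    Layout layout;
    std::size_t stride;       // leading-dimension stride, in elements
    std::size_t batch_stride; // distance between consecutive matrices, in elements
};

// Descriptor for a batch of densely packed rows x cols matrices.
constexpr TensorDesc make_packed_desc(std::size_t rows, std::size_t cols, Layout layout)
{
    return TensorDesc{layout, layout == Layout::Row ? cols : rows, rows * cols};
}

// Linear element offset of (row, col) inside matrix number `batch`.
constexpr std::size_t
offset(const TensorDesc& d, std::size_t batch, std::size_t row, std::size_t col)
{
    return batch * d.batch_stride +
           (d.layout == Layout::Row ? row * d.stride + col : col * d.stride + row);
}

// Naive reference: C[b] = A[b] * B[b] for every batch index b.
void reference_batched_gemm(const std::vector<float>& a, const TensorDesc& a_desc,
                            const std::vector<float>& b, const TensorDesc& b_desc,
                            std::vector<float>& c, const TensorDesc& c_desc,
                            std::size_t m, std::size_t n, std::size_t k,
                            std::size_t batch_count)
{
    assert(batch_count > 0 && m > 0 && n > 0 && k > 0);
    // The last addressed element of each tensor must lie inside its buffer.
    assert(a.size() > offset(a_desc, batch_count - 1, m - 1, k - 1));
    assert(b.size() > offset(b_desc, batch_count - 1, k - 1, n - 1));
    assert(c.size() > offset(c_desc, batch_count - 1, m - 1, n - 1));

    for(std::size_t batch = 0; batch < batch_count; ++batch)
        for(std::size_t i = 0; i < m; ++i)
            for(std::size_t j = 0; j < n; ++j)
            {
                float acc = 0.0f;
                for(std::size_t p = 0; p < k; ++p)
                    acc += a[offset(a_desc, batch, i, p)] * b[offset(b_desc, batch, p, j)];
                c[offset(c_desc, batch, i, j)] = acc;
            }
}
```

Calling `make_packed_desc(256, 128, Layout::Row)` yields stride 128 and batch stride 32768, matching the `-stride_a` and `-batch_stride_a` defaults. Passing a larger stride or batch stride than the packed value would describe padded storage, which the same `offset` arithmetic handles.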