Ck tile batched gemm example (#1615)

* [CK Tile] Batched GEMM Example * [CK Tile] Batched GEMM Example - minor refactor * [CK Tile] Batched GEMM Example - README update * [CK Tile] Batched Gemm Example - review changes - Added tensor data layours as input parameters - Changed structure of Host and Kernel args - Removed bug with invalid vector read on non-contiguous memory * [CK Tile] Batched Gemm Example - remove comment * [CK Tile] Batched Gemm Example - Add GTests part1 * [CK Tile] Batched Gemm Example - GTests part2 + review changes * [CK TILE] Batched GEMM post merge fixes * [CK Tile] Batched GEMM Example - fix pad views
2026-05-02 12:41:26 +00:00 · 2024-11-29 11:52:18 +01:00
parent bb652696e7
commit 78f0fea08e
15 changed files with 1098 additions and 2 deletions
--- a/example/ck_tile/16_batched_gemm/README.md
+++ b/example/ck_tile/16_batched_gemm/README.md
@@ -0,0 +1,37 @@
+# Batched GEMM
+
+This folder contains example for batched GEMM using ck_tile tile-programming implementation.
+
+## build
+```
+# in the root of ck_tile
+mkdir build && cd build
+# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
+sh ../script/cmake-ck-dev.sh  ../ <arch>
+make tile_example_batched_gemm -j
+```
+This will result in an executable `build/bin/tile_example_batched_gemm`
+
+## example
+```
+args:
+              -m     m dimension (default:256)
+              -n     n dimension (default:128)
+              -k     k dimension (default:128)
+       -a_layout     A tensor data layout (default:R) (R for Row, C for Col)
+       -b_layout     B tensor data layout (default:R) (R for Row, C for Col)
+       -c_layout     C tensor data layout (default:R) (R for Row, C for Col)
+       -stride_a     Tensor A stride (default:128)
+       -stride_b     Tensor B stride (default:128)
+       -stride_c     Tensor C stride (default:128)
+ -batch_stride_a     Batch A stride (default:32768)
+ -batch_stride_b     Batch B stride (default:16384)
+ -batch_stride_c     Batch C stride (default:32768)
+    -batch_count     Batch count (default:16)
+              -v     0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
+              -e     Absolute error tolerance (default:1e-5)
+           -prec     data type. fp16/bf16/fp8/bf8 (default:fp16)
+         -warmup     number of iterations before benchmark the kernel (default:10)
+         -repeat     number of iterations to benchmark the kernel (default:100)
+          -timer     gpu:gpu timer, cpu:cpu timer (default:gpu)
+```