mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-02 12:41:26 +00:00
Code style clean-up and documentation
The following changes were made: - Clean-up of variable namings - Addition of README - Removal of num_cu and occupancy args; such options are meant for testing purposes and should not be exposed to the user - Removal of CK_TILE_PIPELINE_MEMORY macro and PipelineTypeTraits class since we only support one pipeline at the moment.
This commit is contained in:
committed by
Emily Martins
parent
a3499e38b2
commit
243118c275
37
example/ck_tile/40_streamk_gemm/README.md
Normal file
37
example/ck_tile/40_streamk_gemm/README.md
Normal file
@@ -0,0 +1,37 @@
|
||||
# Stream-K GEMM
|
||||
|
||||
This folder contains examples of Stream-K GEMMs using the ck_tile tile-programming implementation.
|
||||
|
||||
## build
|
||||
```
|
||||
# in the root of ck_tile
|
||||
mkdir build && cd build
|
||||
# you can replace <arch> with the appropriate architecture (for example gfx942) or leave it blank
|
||||
../script/cmake-ck-dev.sh ../ <arch>
|
||||
# Compile the Stream-K kernels
|
||||
make tile_example_streamk_gemm_basic -j
|
||||
```
|
||||
This will result in an executable `build/bin/tile_example_streamk_gemm_basic`
|
||||
|
||||
## example
|
||||
```
|
||||
args:
|
||||
-m m dimension (default:512)
|
||||
-n n dimension (default:512)
|
||||
-k k dimension (default:512)
|
||||
-a_layout tensor A data layout (default: R)
|
||||
-b_layout tensor B data layout (default: C)
|
||||
-c_layout tensor C data layout (default: R)
|
||||
-num_sk_blocks number of Stream-K blocks. -1: chosen by algorithm, or user selected (default:-1)
|
||||
-reduction_strategy strategy for storing results in C tensor. atomic/reduction (default:atomic)
|
||||
-stride_a tensor A stride (default:0)
|
||||
-stride_b tensor B stride (default:0)
|
||||
-stride_c tensor C stride (default:0)
|
||||
-v validation strategy. 0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:1)
|
||||
-prec data type. fp16/bf16 (default:fp16)
|
||||
-warmup number of iterations before benchmarking the kernel (default:50)
|
||||
-repeat number of iterations to benchmark the kernel (default:100)
|
||||
-timer timing mode. gpu:gpu timer, cpu:cpu timer (default:gpu)
|
||||
-init data initialization strategy. 0:random, 1:linear, 2:constant(1) (default:0)
|
||||
-flush_cache flush the cache before running the kernel (default:true)
|
||||
```
|
||||
Reference in New Issue
Block a user