Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-06-11 08:48:45 +00:00)
Stream-K GEMM
This folder contains examples of Stream-K GEMMs using the ck_tile tile-programming implementation.
build
# in the root of the composable_kernel repository
mkdir build && cd build
# replace <arch> with the target architecture (for example gfx942), or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
# Compile the Stream-K kernels
make tile_example_streamk_gemm_basic -j
This produces the executable build/bin/tile_example_streamk_gemm_basic.
example
args:
-m m dimension (default:512)
-n n dimension (default:512)
-k k dimension (default:512)
-a_layout tensor A data layout (default: R)
-b_layout tensor B data layout (default: C)
-c_layout tensor C data layout (default: R)
-reduction_strategy strategy for storing results in C tensor. atomic/linear (default:atomic)
-persistent_dp persistent strategy for data-parallel section. Set to 0 for non-persistent or to 1 for persistent. (default:0)
-stride_a tensor A stride (default:0)
-stride_b tensor B stride (default:0)
-stride_c tensor C stride (default:0)
-v validation strategy. 0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:1)
-prec data type. fp16/bf16/fp8/bf8 (default:fp16)
-warmup number of iterations before benchmarking the kernel (default:50)
-repeat number of iterations to benchmark the kernel (default:100)
-timer timing mode. gpu:gpu timer, cpu:cpu timer (default:gpu)
-init data initialization strategy. 0:random, 1:linear, 2:constant(1) (default:0)
-flush_cache flush the cache before running the kernel (default:true)
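As a rough usage sketch, the options above can be combined in a single run. It assumes the build steps completed, that the command is issued from the build directory, and that options are passed in the -name=value form used by other ck_tile examples. The problem sizes are arbitrary illustrative values, not recommendations.

```shell
# Path to the example binary, relative to the build directory (assumption:
# the build steps above were run and produced this file).
BIN=./bin/tile_example_streamk_gemm_basic

# Illustrative settings only: a bf16 GEMM with a long K dimension, the
# linear reduction strategy, a persistent data-parallel section, and
# validation on the CPU.
ARGS="-m=1024 -n=1024 -k=4096 -prec=bf16 -reduction_strategy=linear -persistent_dp=1 -v=1"

if [ -x "$BIN" ]; then
    # $ARGS is left unquoted on purpose so each option becomes its own argument.
    "$BIN" $ARGS
else
    echo "not built yet: $BIN"
fi
```

Any option left out keeps the default shown in the list above, so dropping -reduction_strategy=linear from ARGS falls back to atomic.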