[CK_TILE] Stream-K GEMM Implementation (#2781)

* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function

Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset. However, the only member
of the SplitKBatchOffset class used inside MakeGemmTensorViews was
splitted_k (an int32_t), which defines part of the tensor view
dimensions. Stream-K does not use Split-K, so it has no need for the
SplitKBatchOffset class. This commit therefore changes the
splitk_batch_offset parameter to an int32_t called k_size, which removes
the constraint that callers of MakeGemmTensorViews must construct a
SplitKBatchOffset object while providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.
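
For illustration, the call-site change looks like the following (the full
context is in the diff below); callers that still carry a SplitKBatchOffset
simply pass its splitted_k member:

    // Before: the whole SplitKBatchOffset object was required.
    Base::template MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
        {a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, kargs, splitk_batch_offset);

    // After: only the K extent (an int32_t) is passed.
    Base::template MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
        {a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, kargs, splitk_batch_offset.splitted_k);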

* StreamK Kernel RunGemm Implementation

Stream K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:

1. UniversalGemmKernel::RunGemm computes num_loop with a static function of
   the TilePartitioner. For Stream-K, however, num_loop must be computed with
   a member function (namely GetCurrentIterLength from PR #2708).
2. UniversalGemmKernel::RunGemm requires a SplitKBatchOffset object, which
   Stream-K does not need since it does not use Split-K.

Thus, this change adds a RunGemm function in the StreamKKernel class.
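
For reference, the Stream-K operator() (shown in full in the diff below)
calls the new RunGemm once per iteration of the Stream-K loop, passing the
per-WG iteration count from the partitioner and a plain k_size instead of a
SplitKBatchOffset:

    // current_iter_length comes from
    // kargs.tile_partitioner.GetCurrentIterLength(iter_start, iter_end).
    RunGemm({a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, smem_ptr_0, kargs,
            current_iter_length, // num_loop for this WG in this iteration
            i_m, i_n, k_size);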

* Initial implementation of operator() for StreamKKernel: add the Stream-K algorithm and calls to RunGemm

* Fix indexing and offset issues for StreamK

These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by MPerBlock or
  NPerBlock, respectively, so that tile window origins land at the correct
  locations.
- Fix a bug in the tile partitioner's GetTileIdxWithOffset: divmod is now
  applied to the given references so that the correct values reach the caller
  (see the sketch after this list).
- Add documentation to the Stream-K operator().
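
A standalone sketch of the intended indexing (plain C++ with example block
sizes and a simple row-major tile mapping, not the actual partitioner code):

    #include <cstdint>
    #include <cstdio>

    constexpr uint32_t MPerBlock = 128, NPerBlock = 128;

    // divmod now writes directly into the caller's references (previously the
    // results landed in local copies and never reached the caller).
    void GetTileIdxWithOffset(uint32_t iter, uint32_t k_iters_per_tile,
                              uint32_t& tile_idx, uint32_t& iter_offset)
    {
        tile_idx    = iter / k_iters_per_tile;
        iter_offset = iter % k_iters_per_tile;
    }

    int main()
    {
        uint32_t tile_idx = 0, iter_offset = 0;
        GetTileIdxWithOffset(/*iter=*/10, /*k_iters_per_tile=*/4, tile_idx, iter_offset);

        // Scale the 2D tile index by MPerBlock/NPerBlock to get the tile window origin.
        const uint32_t tiles_per_row = 8; // example N-direction tile count
        const uint32_t i_m = (tile_idx / tiles_per_row) * MPerBlock;
        const uint32_t i_n = (tile_idx % tiles_per_row) * NPerBlock;
        std::printf("tile %u -> origin (%u, %u), k offset %u\n", tile_idx, i_m, i_n, iter_offset);
        return 0;
    }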

* Initial gtests for Stream-K

These changes add an initial gtest suite for the CK Tile Stream-K kernel.
Due to bugs in the StreamKTilePartitioner (which will be handled in a future
PR), certain cases have validation issues that may differ across
architectures. We therefore only run cases that are fully data-parallel and
skip the others. A guard was added to Stream-K's IsSupportedArgument method
so that callers are aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and the occupancy were
added to MakeKernelArgs.
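
In a test, the partition can be pinned like this (a hedged usage sketch:
Kernel and host_args stand in for whatever the test fixture defines, while
the MakeKernelArgs/IsSupportedArgument signatures follow the diff below):

    // Fixing num_cu and occupancy makes the Stream-K tile partition identical on every device.
    const auto kargs = Kernel::MakeKernelArgs(host_args, /*num_cu=*/80, /*occupancy=*/2);
    ASSERT_TRUE(Kernel::IsSupportedArgument(kargs));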

* Use GemmPipeline operator() variant that takes hot loop and tail num

In Stream-K, the num_loop value varies per WG and per iteration of the
Stream-K loop, so we use the overload of the GemmPipeline's operator() that
takes has_hot_loop and tail_num. This is similar to what is done in Grouped
GEMM.
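
The dispatch inside the new RunGemm therefore looks like the following
(window names abbreviated; the full code is in the diff below):

    // num_loop can differ per WG and per Stream-K iteration, so derive the
    // dispatch parameters here rather than at compile time.
    const bool has_hot_loop   = GemmPipeline::BlockHasHotloop(num_loop);
    const TailNumber tail_num = GemmPipeline::GetBlockLoopTailNum(num_loop);
    const auto& c_block_tile  = GemmPipeline{}(a_block_window, b_block_window,
                                               num_loop, has_hot_loop, tail_num, smem_ptr_0);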

* Changes from review: comments, move readfirstlane, remove ifndef

* Switch direction of C tensor traversal & add padding guard

Prior to this change, WGs travelled backwards through their assigned macro
tiles in the C tensor. For instance, if WG0 is responsible for C tiles 0 and
1, it would first visit tile 1 and then tile 0. This means that iter_end
decrements in each iteration of the Stream-K while loop.

Since we are working with unsigned integers, that subtraction may not be
safe. Thus, this change makes it such that WGs travel forward: their
iter_start is incremented and their iter_end remains fixed.

Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.

Together, these changes make the algorithm correct when sk_blocks is greater
than zero.
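
A standalone sketch (plain C++ with simplified partitioner math and example
values) of why the forward traversal is safe with unsigned indices:
iter_start only grows toward a fixed iter_end, and each step stops at the
next C-tile boundary or at iter_end, whichever comes first.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const uint32_t k_iters_per_tile = 4;    // K iterations per C macro tile (example)
        uint32_t iter_start = 5, iter_end = 13; // example WG range spanning three C tiles

        while(true)
        {
            // Stop at the next C-tile boundary or at iter_end, whichever comes first.
            const uint32_t to_boundary = k_iters_per_tile - (iter_start % k_iters_per_tile);
            const uint32_t current_len = std::min(iter_start + to_boundary, iter_end) - iter_start;

            const uint32_t tile_idx    = iter_start / k_iters_per_tile; // which C macro tile
            const uint32_t iter_offset = iter_start % k_iters_per_tile; // K offset inside it
            std::printf("tile %u: k-iters [%u, %u)\n", tile_idx, iter_offset, iter_offset + current_len);

            iter_start += current_len; // move forward; iter_end never changes
            if(iter_end <= iter_start)
                break;
        }
        return 0;
    }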

* Disable StreamK_M256_N256_K256_SKBlocks12 test case

This instance involves three or more WGs contributing to each macro tile in
C. Due to the use of atomics, this results in precision errors. These errors
will not persist once the reduction strategy is implemented; we will
re-enable the test then.

---------

Co-authored-by: Astha Rai <astha.rai713@gmail.com>
Author: Emily Martins
Committed: 2025-09-16 16:21:47 -06:00 (by GitHub)
Parent: b7a806f244
Commit: dee185d80c
10 changed files with 612 additions and 35 deletions


@@ -646,16 +646,13 @@ struct StreamKTilePartitioner
* @brief Get length of loop iterations for stream-k loop
*/
CK_TILE_DEVICE uint32_t GetCurrentIterLength(uint32_t iter_start,
uint32_t iter_end,
uint32_t total_iter_length) const noexcept
uint32_t iter_end) const noexcept
{
uint32_t iter_length_mod, iter_length_quo /*unused*/;
k_iters_per_tile.divmod(iter_end, iter_length_quo, iter_length_mod);
uint32_t total_iter_length_val = static_cast<uint32_t>(total_iter_length);
uint32_t current_iter_length =
min(iter_length_mod == 0 ? (iter_end - iter_start) : iter_length_mod,
total_iter_length_val);
return current_iter_length;
// A WG's iter_end is either in the current C macro tile or not.
// If it is not, then the macro tile boundary is where the WG must stop.
uint32_t distance_to_tile_boundary =
k_iters_per_tile.get() - (iter_start % k_iters_per_tile.get());
return min(iter_start + distance_to_tile_boundary, iter_end) - iter_start;
}
/**
@@ -672,9 +669,7 @@ struct StreamKTilePartitioner
CK_TILE_DEVICE void
GetTileIdxWithOffset(uint32_t iter, uint32_t& tile_idx, uint32_t& iter_offset) const noexcept
{
uint32_t tile_idx_val = static_cast<uint32_t>(tile_idx);
uint32_t iter_offset_val = static_cast<uint32_t>(iter_offset);
k_iters_per_tile.divmod(iter, tile_idx_val, iter_offset_val);
k_iters_per_tile.divmod(iter, tile_idx, iter_offset);
}
/**


@@ -374,7 +374,7 @@ struct GroupedGemmKernel
// Create Gemm tensor views, pad views and tile windows
const auto& gemm_tensor_views_tuple =
Base::template MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
{a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, kargs, splitk_batch_offset);
{a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, kargs, splitk_batch_offset.splitted_k);
const auto& gemm_pad_views = Base::MakeGemmPadViews(gemm_tensor_views_tuple);
auto gemm_tile_windows =
@@ -436,7 +436,7 @@ struct GroupedGemmKernel
// Create Gemm tensor views, pad views and tile windows
const auto& gemm_tensor_views_tuple =
Base::template MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
{a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, kargs, splitk_batch_offset);
{a_ptr}, {b_ptr}, {/*ds_ptr*/}, c_ptr, kargs, splitk_batch_offset.splitted_k);
const auto& gemm_pad_views = Base::MakeGemmPadViews(gemm_tensor_views_tuple);
auto gemm_tile_windows =


@@ -141,11 +141,17 @@ struct StreamKKernel
return UniversalGemmKernel::BlockSize();
}
CK_TILE_HOST static StreamKKernelArgs MakeKernelArgs(const StreamKHostArgs& host_args)
/// @brief Constructs kernel arguments for the Stream-K kernel.
/// @param host_args Stream-K host arguments.
/// @param num_cu Number of compute units (CUs). The default is the number of CUs on the device.
/// The caller may select their own to assist with test reproducibility, etc.
/// @param occupancy The maximum number of active blocks per CU for this kernel. The caller may
/// select their own to assist with test reproducibility, etc.
/// @return The kernel arguments for Stream-K.
CK_TILE_HOST static StreamKKernelArgs MakeKernelArgs(const StreamKHostArgs& host_args,
int num_cu = NumCU(),
int occupancy = Occupancy())
{
uint32_t occupancy = static_cast<uint32_t>(Occupancy());
uint32_t num_cu = static_cast<uint32_t>(NumCU());
return StreamKKernelArgs{{host_args.as_ptr,
host_args.bs_ptr,
host_args.ds_ptr,
@@ -166,14 +172,71 @@ struct StreamKKernel
TilePartitioner{static_cast<uint32_t>(host_args.M),
static_cast<uint32_t>(host_args.N),
static_cast<uint32_t>(host_args.K),
num_cu,
occupancy,
static_cast<uint32_t>(num_cu),
static_cast<uint32_t>(occupancy),
host_args.num_sk_blocks}};
}
CK_TILE_HOST static bool
IsSupportedArgument(const typename UniversalGemmKernel::KernelArgs& kargs)
template <bool UseDefaultScheduler = true>
CK_TILE_DEVICE static void
RunGemm(const std::array<const ADataType*, UniversalGemmKernel::NumATensor>& as_ptr,
const std::array<const BDataType*, UniversalGemmKernel::NumBTensor>& bs_ptr,
const std::array<const void*, UniversalGemmKernel::NumDTensor>& ds_ptr,
CDataType* c_ptr,
void* smem_ptr_0,
const typename UniversalGemmKernel::KernelArgs& kargs,
const index_t num_loop,
const index_t block_idx_m,
const index_t block_idx_n,
const index_t k_size)
{
// Create Gemm tensor views, pad views and tile windows
const auto& gemm_tensor_views_tuple =
UniversalGemmKernel::template MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
as_ptr, bs_ptr, ds_ptr, c_ptr, kargs, k_size);
const auto& gemm_pad_views = UniversalGemmKernel::MakeGemmPadViews(gemm_tensor_views_tuple);
auto gemm_tile_windows =
UniversalGemmKernel::MakeGemmTileWindows(gemm_pad_views, block_idx_m, block_idx_n);
// Run GEMM cooperatively by whole workgroup.
const auto& as_block_window = gemm_tile_windows.at(UniversalGemmKernel::I0);
const auto& bs_block_window = gemm_tile_windows.at(UniversalGemmKernel::I1);
const auto& ds_block_window = gemm_tile_windows.at(UniversalGemmKernel::I2);
// Since num_loop can vary per WG and per iteration of the Stream-K while loop, we compute
// has_hot_loop and tail_num here. This is a similar pattern used by grouped GEMM. In this
// case, we call the GemmPipeline's operator() function that takes both has_hot_loop and
// tail_num.
const bool has_hot_loop = GemmPipeline::BlockHasHotloop(num_loop);
const TailNumber tail_num = GemmPipeline::GetBlockLoopTailNum(num_loop);
const auto& c_block_tile = GemmPipeline{}(as_block_window[UniversalGemmKernel::I0],
bs_block_window[UniversalGemmKernel::I0],
num_loop,
has_hot_loop,
tail_num,
smem_ptr_0);
if(UseDefaultScheduler || (get_warp_id() == 0))
{
// Run Epilogue Pipeline
auto& c_block_window = gemm_tile_windows.at(UniversalGemmKernel::I3);
EpiloguePipeline{}(c_block_window, c_block_tile, ds_block_window, smem_ptr_0);
}
}
CK_TILE_HOST static bool IsSupportedArgument(const StreamKKernelArgs& kargs)
{
if(kargs.reduction_strategy == StreamKReductionStrategy::Reduction)
{
if(ck_tile::EnvIsEnabled(CK_TILE_ENV(CK_TILE_LOGGING)))
{
CK_TILE_ERROR("CK Tile Stream-K only supports the atomic reduction strategy.");
}
return false;
}
return UniversalGemmKernel::IsSupportedArgument(kargs);
}
@@ -199,9 +262,81 @@ struct StreamKKernel
kargs.workspace_ptr = workspace_ptr;
}
// Temporary placeholder to support the Occupancy() static function.
// Since the Occupancy function uses kentry, this class must have an operator() function
CK_TILE_DEVICE void operator()(StreamKKernelArgs /*kargs*/) const {}
/// @brief Entry point for the Stream-K Kernel, performing the main Stream-K loop.
CK_TILE_DEVICE void operator()(StreamKKernelArgs kargs) const
{
// Allocate LDS
__shared__ char smem_ptr_0[UniversalGemmKernel::GetSmemSize()];
uint32_t block_idx = ck_tile::get_block_1d_id();
bool is_padding_block =
__builtin_amdgcn_readfirstlane(block_idx >= kargs.tile_partitioner.sk_num_blocks &&
block_idx < kargs.tile_partitioner.dp_start_block_idx);
// Padding blocks make it such that the DP blocks are aligned with the number of CUs; they
// should not partake in the GEMM
if(is_padding_block)
return;
// Determine the K offset of the first and final macro tile in the A and B tensors along the
// K dimension.
uint32_t iter_start, iter_end;
kargs.tile_partitioner.GetBlockItr(block_idx, iter_start, iter_end);
// Main Stream-K loop
while(true)
{
// Determine the number of macro tiles in A and B this WG is responsible for in the
// current C macro tile.
uint32_t current_iter_length = __builtin_amdgcn_readfirstlane(
kargs.tile_partitioner.GetCurrentIterLength(iter_start, iter_end));
// Determine the 1D tile_idx and the iter_offset for this WG.
// The tile_idx is the 1D macro tile index in the C tensor.
// The iter_offset is the starting macro tile index in the K dimension for the WG in the
// current iteration of the while loop.
uint32_t tile_idx, iter_offset;
kargs.tile_partitioner.GetTileIdxWithOffset(iter_start, tile_idx, iter_offset);
// Get the 2D tile index in the C tensor for this WG using the 1D index (i.e. tile_idx)
auto spatial_idx = kargs.tile_partitioner.GetOutputTileIndex(tile_idx);
// Get the offsets in A, B, C tensors.
index_t i_m = static_cast<index_t>(spatial_idx[UniversalGemmKernel::I0] *
TilePartitioner::MPerBlock);
index_t i_n = static_cast<index_t>(spatial_idx[UniversalGemmKernel::I1] *
TilePartitioner::NPerBlock);
index_t i_k = static_cast<index_t>(iter_offset) * TilePartitioner::KPerBlock;
// Determine the total size along the K dimension the WG is using in this iteration
// (used to construct tensor views).
index_t k_size = static_cast<index_t>(current_iter_length * TilePartitioner::KPerBlock);
// Update pointer offsets for A, B, and C.
const ADataType* a_ptr = static_cast<const ADataType*>(kargs.as_ptr[0]) + i_k;
const BDataType* b_ptr = static_cast<const BDataType*>(kargs.bs_ptr[0]) + i_k;
CDataType* c_ptr = static_cast<CDataType*>(kargs.e_ptr);
// Run the GEMM pipeline and Epilogue.
RunGemm({a_ptr},
{b_ptr},
{/*ds_ptr*/},
c_ptr,
smem_ptr_0,
kargs,
current_iter_length,
i_m,
i_n,
k_size);
// Prepare for next Stream-K loop iteration.
iter_start += current_iter_length;
if(iter_end <= iter_start)
break;
block_sync_lds();
}
}
private:
CK_TILE_HOST static int NumCU()


@@ -579,7 +579,7 @@ struct UniversalGemmKernel
const std::array<const void*, NumDTensor>& ds_ptr,
EDataType* e_ptr,
const KernelArgs& kargs,
const SplitKBatchOffset& splitk_batch_offset)
const index_t k_size)
{
static_assert(!TilePartitioner::BlockGemmShape::PermuteA, "Not implemented!");
@@ -591,7 +591,7 @@ struct UniversalGemmKernel
{
return make_naive_tensor_view<address_space_enum::global>(
static_cast<const AiDataType*>(as_ptr[i]),
make_tuple(kargs.M, splitk_batch_offset.splitted_k),
make_tuple(kargs.M, k_size),
make_tuple(kargs.stride_As[i], 1),
number<GemmPipeline::GetVectorSizeA()>{},
number<1>{});
@@ -600,7 +600,7 @@ struct UniversalGemmKernel
{
return make_naive_tensor_view<address_space_enum::global>(
static_cast<const AiDataType*>(as_ptr[i]),
make_tuple(splitk_batch_offset.splitted_k, kargs.M),
make_tuple(k_size, kargs.M),
make_tuple(kargs.stride_As[i], 1),
number<GemmPipeline::GetVectorSizeA()>{},
number<1>{});
@@ -617,7 +617,7 @@ struct UniversalGemmKernel
if constexpr(TilePartitioner::BlockGemmShape::PermuteB)
{
constexpr index_t K1 = GemmPipeline::GetSmemPackB();
const index_t K0 = splitk_batch_offset.splitted_k / K1;
const index_t K0 = k_size / K1;
constexpr index_t VectorSizeB =
std::min(K1, GemmPipeline::GetVectorSizeB());
const auto b_k0_n_k1_desc =
@@ -638,7 +638,7 @@ struct UniversalGemmKernel
{
return make_naive_tensor_view<address_space_enum::global>(
bs_ptr[i],
make_tuple(splitk_batch_offset.splitted_k, kargs.N),
make_tuple(k_size, kargs.N),
make_tuple(kargs.stride_Bs[i], 1),
number<GemmPipeline::GetVectorSizeB()>{},
number<1>{});
@@ -649,7 +649,7 @@ struct UniversalGemmKernel
if constexpr(TilePartitioner::BlockGemmShape::PermuteB)
{
constexpr index_t K1 = GemmPipeline::GetSmemPackB();
const index_t K0 = splitk_batch_offset.splitted_k / K1;
const index_t K0 = k_size / K1;
constexpr index_t VectorSizeB =
std::min(K1, GemmPipeline::GetVectorSizeB());
const auto b_k0_n_k1_desc =
@@ -672,7 +672,7 @@ struct UniversalGemmKernel
{
index_t kFlatK =
GemmPipeline::BlockGemmShape::flatKPerWarp *
(splitk_batch_offset.splitted_k /
(k_size /
TilePartitioner::BlockGemmShape::WarpTile::at(number<2>{}));
index_t kFlatN = kargs.N * kargs.K / kFlatK;
@@ -687,7 +687,7 @@ struct UniversalGemmKernel
{
return make_naive_tensor_view<address_space_enum::global>(
bs_ptr[i],
make_tuple(kargs.N, splitk_batch_offset.splitted_k),
make_tuple(kargs.N, k_size),
make_tuple(kargs.stride_Bs[i], 1),
number<GemmPipeline::GetVectorSizeB()>{},
number<1>{});
@@ -962,7 +962,7 @@ struct UniversalGemmKernel
// Create Gemm tensor views, pad views and tile windows
const auto& gemm_tensor_views_tuple =
MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
as_ptr, bs_ptr, ds_ptr, e_ptr, kargs, splitk_batch_offset);
as_ptr, bs_ptr, ds_ptr, e_ptr, kargs, splitk_batch_offset.splitted_k);
const auto& gemm_pad_views = MakeGemmPadViews(gemm_tensor_views_tuple);
auto gemm_tile_windows = MakeGemmTileWindows(gemm_pad_views, block_idx_m, block_idx_n);
@@ -1018,7 +1018,7 @@ struct UniversalGemmKernel
// Create Gemm tensor views, pad views and tile windows
const auto& gemm_tensor_views_tuple =
MakeGemmTensorViews<EpiloguePipeline::MemoryOperation>(
as_ptr, bs_ptr, ds_ptr, e_ptr, kargs, splitk_batch_offset);
as_ptr, bs_ptr, ds_ptr, e_ptr, kargs, splitk_batch_offset.splitted_k);
const auto& gemm_pad_views = MakeGemmPadViews(gemm_tensor_views_tuple);
auto gemm_tile_windows = MakeGemmTileWindows(gemm_pad_views, block_idx_m, block_idx_n);