Files
composable_kernel/include
Emily Martins 95c916369c [rocm-libraries] ROCm/rocm-libraries#7584 (commit 060bad5)
[CK_TILE] Fix Stream-K k_size calculation

## Motivation

In a recent benchmarking task for CK Tile Stream-K algorithm, we
identified that certain instances segfault. This change works to fix the
bug and adds necessary regression tests.

## Technical Details

The StreamK kernel constructs tensor views using a `k_size` parameter
that determines how much of the K dimension to process in each
iteration. Previously, this was calculated as:
 ```cpp
index_t k_size = num_loop_sk * TilePartitioner::KPerBlock;
```
This calculation assumes all macro tiles along K are exactly `KPerBlock` in size. However, when `K % KPerBlock != 0`, the final macro tile along K has a remainder size of `K % KPerBlock`, not a full `KPerBlock` (see the figure below):
<img width="961" height="488" alt="image" src="https://github.com/user-attachments/assets/3e1cceed-5dcd-4980-8b02-cee24eecf262" />
With the old code, a workgroup working with the `MPerBlock x (K % KPerBlock)` tile in A and B risk accessing illegal memory.

Hence, this change ensures that when `K % KPerBlock != 0`, workgroups processing iterations that include the final macro-tile along K calculate the correct `k_size` based on the remainder rather than assuming a full `KPerBlock`.

## Test Plan
I added the following tests:
1. Unit tests added for the Stream-K Tile Partitioner:
- `StreamKTilePartitionerBaseGetKSize/NoRemainderTiles` - validates full tiles
- `StreamKTilePartitionerBaseGetKSize/RemainderTiles` - validates remainder handling
2. Regression tests that test a case where `K % KPerBlock != 0`

## Test Result

Tests passed locally on gfx90a, gfx942, and gfx950.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-29 21:36:49 +00:00
..