[CK Tile] Stream-K RDNA Support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Currently, CK Tile Stream-K only supports CDNA architectures. This
change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures.
## Technical Details
Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear,
and 3) tree. The linear and tree reductions require inter-workgroup
communication to a global flags buffer and a global partials buffer. To
ensure cache coherency, we use cache modifiers to skip cache levels that
are not visible to all workgroups. On CDNA architectures, scalar load
and scalar store instructions are available, which we use to read and
write to the flags buffer with appropriate cache skipping modifiers.
However, RDNA architectures do not support scalar store instructions, so
workgroups must use a buffer store instruction to write to flags.
Additionally, cache modifiers differ between CDNA and RDNA; they also
differ between RDNA3 and RDNA4. Given this information, the main changes
are as follows:
- Added RDNA flag signaling: Use buffer store instructions for writing
to global flags buffer
- Add appropriate cache modifiers for reading and writing to flags and
partials:
- RDNA3 (gfx11): Use `glc | dlc` coherence flags
- RDNA4 (gfx12): Use `DEVICE` coherence scope
- SFINAE-guarded overloads: Added compile-time dispatch for
`SignalStorePartialDone()` and `WaitStorePartialDone()` based on target
architecture
- RDNA alignment requirements: Increased flags buffer alignment from
128B to 256B due to RDNA cache line size
**A note about the `amd_buffer_coherence_enum`:**
- **Problem:** The `amd_buffer_coherence_enum` uses preprocessor
conditionals (`#if defined(__gfx12__)`) to define architecture-specific
values. Template specializations reference enum values from different
architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name
lookup, non-dependent names are resolved during template parsing
regardless of which architecture is being compiled, causing compilation
failures when referenced values do not exist in the active preprocessor
branch.
- **Temporary Solution**: Added compatibility enum values to each
architecture block. For example, I added `glc_dlc` in the `__gfx12__`
block. I will create a ticket to refactor this enum with a design that
has better scalability and tries to avoid the use of preprocessor
conditionals.
## Test Plan
### Summary
gtests were added to test wmma variants of Stream-K. These tests were
stressed tested locally on gfx11 and gfx12.
### More details
This PR makes the following changes/additions to the Stream-K gtests:
- Split tests into MFMA (CDNA) and WMMA (RDNA) variants
- Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction
- WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size
supported on RDNA)
- Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU
count
- As described in [HIP
documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898),
when in Workgroup Processor (WGP) mode, the value of
`hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single
WGP contains two CUs. The default mode on RDNA is WGP mode, so when
creating (M, N, K) instances for gtests using the CU count, we need to
multiply the CU count by 2 to get the correct value. This is not needed
in the kernel host code, because the occupancy ensures that overall
`max_active_wgs` is correct.
## Test Result
All tests pass locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Fix Stream-K flag store: wave-uniform SGPR address
for scalar s_store/s_load (#8099)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Stream-K grouped-conv (and GEMM) kernels fail to assemble for some
instances: the inline scalar flag store/load gets a VGPR address
operand, which scalar-memory instructions reject (`invalid operand for
instruction`). This blocks Stream-K instances from building.
## Technical Details
- `StreamKReductionOps::{Signal,Wait}StorePartialDone` (shared by GEMM
and conv, added in #5393) take `kargs` by `const&` and feed
`kargs.workspace_ptr` / `cta_idx` into inline
`s_store_dword`/`s_load_dword` with `"s"` constraints. For some
instantiations the compiler can't keep the pointer wave-uniform and
emits a VGPR address.
- Fix: route the pointer and offset through `amd_wave_read_first_lane`
so the scalar-memory address is a wave-uniform SGPR before the asm. Same
instructions, no algorithm change.
- Not arch-specific: the affected instance fails on
gfx908/gfx90a/gfx942/gfx950 without the fix; whether the compiler spills
to a VGPR depends on the instantiation (tile/warp/pipeline), not the
target.
## Test Plan
- Compile the previously-failing dispatcher instance for
gfx908/gfx90a/gfx942/gfx950.
- `test_ck_tile_grouped_conv_bwd_weight_streamk` on gfx942, gfx90a,
gfx950 hardware.
- gfx950 perf A/B (example, bf16/tree, 10 runs each) with vs without the
change.
## Test Result
- Failing instance now assembles on all four archs; previously failed on
every one.
- 30/30 conv Stream-K tests pass on gfx942, gfx90a, gfx950.
- gfx950 perf delta -0.13% (within run-to-run noise) — no regression
from the added readfirstlane on the cold flag path.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
[CK_TILE] Stream-K XCD remapping (#4279)
## Proposed changes
This PR adds support for XCD remapping as detailed in this
[document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A).
On gfx942, workgroups are typically scheduled round-robin across XCDs,
which can lead to poor locality. We will use a remapping to assign
workgroups to contiguous tiles in the XCDs improving the locality and
the cache hit rate. This is done through a function that computes this
contiguous mapping from this
[PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have
added to the StreamKTilePartitioner. This will require minimal changes
to the Stream-K algorithm, only requiring a remap at the time the
workgroups are partitioned. Through this approach we can improve the
data locality by improving cache hits therefore closing performance gaps
that are seen with the default scheduling. There have been unit tests
added to verify the function in isolation. This is an optimization that
is not specialized to just Stream-K GEMM and can be applied across GEMM.
Note: This only applies to the gfx942 as they introduce the XCDs.
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged
---
🔁 Imported from
[ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652)
🧑💻 Originally authored by @arai713
---------
Co-authored-by: Astha <astha.rai713@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
[CK] upgrade CI to rocm7.13 as default compiler (#7612)
## Motivation
Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
[CK_TILE] Update Stream-K Reduction Strategy Enum (#4756)
## Motivation
Currently, Stream-K has 3 reduction options: 1) atomics, 2) The
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulating partial
results of other contributing workgroups in the tile, which requires a
linear number of steps. Hence, for clarity, this works updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.
## Technical Details
Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Reduction = 1u,
TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
Atomic = 0u,
Linear = 1u,
Tree = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.
## Test Plan
No new functionality was added, so no new tests were added; I just
validated existing tests and examples.
## Test Result
All tests passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* CK Tile Stream-K Tree Reduction
This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the the number of
steps for accumulating results for a tile from O(N) to O(logN) where N
is the number of workgroups contributing to a C tile.
Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
Howeover, through investigation with the tree reduciton, atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.
Prelimiary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.
* Move Stream-K kernel files to a subdirectory
* Cleanup Code Style & Handle Unsupported Reductions
This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable
* Run new copyright script on new files
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.
This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.
* initial commit for skeleton code
* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code
* fixed up code to match CK Tile convention: data type changes, naming changes, etc.
* change for num_sk_blocks data type
* formatting fix
* minor fixes
* moved reduction argument to template
* resolved comments from PR review: standardizing naming, pruning unneeded code
* resolve errors from merge of device op PR: moved enum to common file
* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems
* unsigned type fix
* add const qualifier
* added documentation for template parameters
* documentation edit