Files
composable_kernel/include/ck_tile/ops/common
assistant-librarian[bot] 6181eb2adf [rocm-libraries] ROCm/rocm-libraries#4279 (commit 5b3f4b7)
[CK_TILE] Stream-K XCD remapping (#4279)

## Proposed changes

This PR adds support for XCD remapping as detailed in this
[document](https://amdcloud.sharepoint.com/:w:/r/sites/ComposableKernels/Shared%20Documents/Stream-K/Design%20Docs/XCD%20Mapping.docx?d=w2df1b0737dc54614970d99a2e26022d1&csf=1&web=1&e=mLVN4A).
On gfx942, workgroups are typically scheduled round-robin across XCDs,
which can lead to poor locality. We will use a remapping to assign
workgroups to contiguous tiles in the XCDs improving the locality and
the cache hit rate. This is done through a function that computes this
contiguous mapping from this
[PR](https://github.com/ROCm/composable_kernel/pull/3161), which we have
added to the StreamKTilePartitioner. This will require minimal changes
to the Stream-K algorithm, only requiring a remap at the time the
workgroups are partitioned. Through this approach we can improve the
data locality by improving cache hits therefore closing performance gaps
that are seen with the default scheduling. There have been unit tests
added to verify the function in isolation. This is an optimization that
is not specialized to just Stream-K GEMM and can be applied across GEMM.

Note: This only applies to the gfx942 as they introduce the XCDs.

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

---
🔁 Imported from
[ROCm/composable_kernel#3652](https://github.com/ROCm/composable_kernel/pull/3652)
🧑‍💻 Originally authored by @arai713

---------

Co-authored-by: Astha <astha.rai713@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
2026-05-26 09:43:03 -07:00
..
2024-04-15 19:27:12 -05:00

common

this folder is designed not to be included directly by use, e.g. if use include ck_tile/ops/fmha.hpp, then everything under common should also be included.

to achieve this we will duplicate the header include path under common to other module under ops/* inside remod.py. for internal developer, you can also include ck_tile/ops/common.hpp for convenience. (and so does external users...)