[rocm-libraries] ROCm/rocm-libraries#4274 (commit 7c380df)

Add padding to cshuffle epilogue to avoid bank conflict
 (#4274)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Added the padding to CShuffle Epilogue to avoid the bank conflicts of
64. Synced up and learned from the internal repo.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
This commit is contained in:
Thomas Ning
2026-02-11 05:52:42 +00:00
committed by assistant-librarian[bot]
parent 6d6ee8f023
commit 807efa703a
2 changed files with 109 additions and 8 deletions

View File

@@ -40,7 +40,8 @@ struct BlockSoftmax2D
#endif
// compute row max
auto reduce_row_max = BlockReduce2D<decltype(x)>{x, -numeric<DataType>::infinity()};
using X = remove_cvref_t<decltype(x)>;
BlockReduce2D<X> reduce_row_max{x, -numeric<DataType>::infinity()};
#if _BLOCK_SOFTMAX_USE_UNPACK2
auto row_max = reduce_row_max(f_max3, f_max, sequence<1, 2>{});
#else
@@ -52,7 +53,8 @@ struct BlockSoftmax2D
});
// compute row sum
auto reduce_row_sum = BlockReduce2D<decltype(y)>{y, DataType{0}};
using Y = remove_cvref_t<decltype(y)>;
BlockReduce2D<Y> reduce_row_sum{y, DataType{0}};
#if _BLOCK_SOFTMAX_USE_UNPACK2
auto row_sum = reduce_row_sum(f_sum3, f_sum, sequence<1, 2>{});
#else