* Rangify check_err()
By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have same value
type.
* Re-format example code
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Add pooling in ck tile engine
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
## Motivation
After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.
## Technical Details
- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
this issue.
## Test Plan
## Test Result
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Consumers that identify kernels by their `GetTypeString()` (such as
hipTensor's actor-critic kernel selection, which hashes the string into
a
stable cross-platform UID) were silently dropping one of two colliding
variants during registry insertion.
`GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle`
previously
printed 13 template parameters, omitting
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
These four parameters determine the block-transfer access width and LDS
padding strategy, and are precisely what differentiates the `kk`, `kn`,
`mk`, and `mn` layout variants from one another when all other geometry
parameters are equal. Two instantiations with identical 13-parameter
strings
are distinct C++ types that accept different stride layouts and reject
each
other's arguments via `IsSupportedArgument`.
This patch extends the output to 17 parameters so that every distinct
template instantiation of this class produces a unique
`GetTypeString()`.
## Technical Details
`include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`:
- extend `GetTypeString()` from 13 to 17 parameters including
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
## Test Plan
Build CK and hipTensor with these changes, and verify hipTensor can
differentiate and select the
correct kernels with layout variations.
## Test Result
CK is building correctly and hipTensor is selecting the kernels
correctly.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Force padding for atomic_add bf16 C tensor to avoid memfaults.
## Technical Details
- add global atomic add for bf16 and enable them
- add padding for atomic add bf16 due to the lack of oob
- remove padding for not continous dims in conv for other cases
- minor bwd data conv fixes
## Test Plan
test_grouped_conv_*_tile
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This PR changes gemm/convolution cshuffle layout into plain one. to
improve cshuffle operation performance.
## Technical Details
The purpose is that before this change the cshuffle layout was having
some descriptor transformations that were probably aimed at reducing LDS
bank conflicts, but the transformations itself were terribly slow, which
negatively impacted the performance.
## Test Plan
There is no need for additional tests, since current tests cover this
functionality.
## Motivation
The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.
## Technical Details
Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:
**`ck::` namespace:**
- `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
- `index_array` -- `operator[]`
- `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
- `IndexLookupTable` -- `operator[]`
**`ck_tile::` namespace:**
- `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `static_array` -- `operator[]`
- `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `make_kernel()` -- parameter pack
Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
accordingly.
## Test Plan
- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This is a follow-up fix for [PR
#5018](https://github.com/ROCm/rocm-libraries/pull/5018).
[PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added
LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it
also introduced Linux-only code paths, including `<dirent.h>`, which
break Windows builds. This change fixes that by guarding the
Linux-specific LLC probing logic so non-Linux platforms can still build
correctly.
## Technical Details
- Guard `<dirent.h>` with `#ifdef __linux__`
- Guard KFD sysfs traversal logic with `#if defined(__linux__)`
- On non-Linux platforms, return `0` from
`get_kfd_sysfs_llc_cache_bytes()`
- Preserve existing fallback behavior through:
- `CK_TILE_FMHA_LLC_CACHE_MB`
- arch-based default LLC sizes
- no head grouping when no LLC size can be resolved
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.
## Technical Details
Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`
This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.
## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Avoid division by 0 and remove not needed "-1".
## Technical Details
Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.
## Test Plan
test_grouped_conv_bwd_weight
## Test Result
Passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1019
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.