[CK] Replace daily CI builds with mainline compiler with
TheRock compiler. (#6147)
## Motivation
Since the compiler team has deprecated the amd-mainline branch and
switched to TheRock, we'll start building a docker image with TheRock
artifacts and building/testing Ck with that.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Fix architecture-dependent EightWave assignment in
cshuffle_epilogue (#6102)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
Tile engine CI build on the develop branch started failing after a
recent change(https://github.com/ROCm/rocm-libraries/pull/5218) in
`cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally
computed as `(MWave * NWave == 8)` for all architectures, but this logic
is only valid for gfx9*. On other architectures (e.g., gfx1201),
`EightWave` must always be `false`, otherwise it leads to incorrect
`BlockedXDLN_PerWarp` computation and build failures.
## Technical Details
In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as:
```cpp
static constexpr bool EightWave = (MWave * NWave == 8);
```
This was applied regardless of the target GPU architecture. The fix uses
a preprocessor guard to make this architecture-aware:
- **gfx9* (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave ==
8)` — true or false depending on the wave configuration
- **All other architectures:** `EightWave` defaults to `false`
## Test Plan
- Tile engine CI build on develop branch
## Test Result
- *Pending CI*
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Restrict FMHA codegen to the kernel subset used by
FlashAttention (#6038)
## Motivation
Currently, the CK FlashAttention integration generates a broader FMHA
kernel set than the FlashAttention wrappers can actually dispatch, which
increases compile time without improving runtime coverage.
## Technical Details
The FlashAttention CK wrappers do not use all logits/LSE variants
emitted by the default FMHA codegen. The direct `fmha_fwd` path always
uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv`
path only uses softcap-disabled kernels. This change trims codegen to
that subset and stops generating the unused logits/LSE variants.
This reduces the generated forward kernel set without changing
`fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated
by building and running the
[FlashAttention](https://github.com/Dao-AILab/flash-attention) CK
backend.
Across targets, the total generated FMHA kernel count is reduced by:
- `gfx942`: 29.3%
- `gfx1100`: 33.7%
- `gfx1201`: 31.3%
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
pytest test/test_flash_attn_ck.py from
https://github.com/Dao-AILab/flash-attention
## Test Result
all tests passed
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Stream-K gtest Code Gen
## Motivation
Stream-K was using the tile engine infrastructure for smoke tests.
However, tile engine creates a different target per kernel instance,
which has resulted in scalability issues when used in the context of
unit tests. To avoid burdens on cmake configuration and build time, we
have opted to remove our Stream-K tile engine tests. Instead, we use
pure gtests with code gen to generate repetitive .cpp files.
**Note: This appears to change a lot of files because many files are
removed since they are now generated at build time.**
## Technical Details
We originally used Tile Engine to facilitate code gen for unit tests
since we found that pure gtests required the addition of many repetitive
.cpp files of the following form:
```cpp
#include "test_gemm_streamk_common_includes.hpp"
template <typename Tuple>
class TestCkTileStreamKBf8 : public TestCkTileStreamK<Tuple>
{
};
#define TEST_SUITE_NAME TestCkTileStreamKBf8
TYPED_TEST_SUITE(TestCkTileStreamKBf8, KernelTypesStreamKBf8);
#include "test_gemm_streamk_atomic_cases.inc"
#undef TEST_SUITE_NAME
```
Due to issues encountered with tile engine, we instead use pure gtests
to generate the repetitive .cpp files. The code generator parses
`KernelTypesStreamK*` type aliases from the types header using a
two-phase approach:
1. At **configure time**, CMake runs the Python script with
`--list_files` to extract the type alias names from the header
(test_gemm_streamk_types.hpp) and compute the list of .cpp file paths
that will be generated. This lets CMake know the exact set of source
files for each target.
2. At **build time**, `add_custom_command` runs the script again with
`--gen_files` to actually emit the .cpp files into the build directory,
triggered only when the types header or generator script changes.
With these changes, we've removed all Stream-K tile engine tests. There
are now 5 targets for Stream-K GEMM tests:
1. test_ck_tile_streamk_atomic_smoke: smoke tests for Atomic reduction
strategy (pipeline: compv3)
2. test_ck_tile_streamk_linear_smoke: smoke tests for Linear reduction
strategy (pipeline: compv3)
3. test_ck_tile_streamk_tree_smoke: smoke tests for Tree reduction
strategy (pipeline: compv3)
4. test_ck_tile_streamk_pipelines_smoke: smoke tests (smaller set) for
pipelines other than compv3
- Since Stream-K can be thought of as a wrapper around universal GEMM,
we don't need to extensively test each pipeline. So, we opt to run a few
tests for different pipelines. Currently, this just consists of the mem
pipeline, but compv4 is coming soon.
5. test_ck_tile_streamk_extended: extended tests
## Test Plan
I have tests the gtests locally on gfx90a, gfx942, and gfx950.
## Test Result
All local tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Stream-K Tile Engine Fixes
## Motivation
Stream-K GEMM in Tile Engine was unable to support instances where the
matrix dimensions were not perfectly aligned due to bugs with padding
support. This PR implements support for padding back into the Stream-K
implementation in Tile Engine along with other minor fixes.
Additionally, this PR introduces a benchmarking script that is standard
for Tile Engine to run all compiled instances with user specified matrix
dimensions.
## Technical Details
- Fixed padding boolean comparison and parsing in gen_single so that
padding flags from the config files are correctly propagated into the
Stream-K template
- Updated trait combo parsing to have the reduction_strategy for
Stream-K in the correct order
- Addition of gemm_streamk_benchmark.py to run all compiled Stream-K
instances
## Test Plan
Tested using the benchmark scripts to run instances.
## Test Result
All instances passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Use ck_pytorch docker from private repo.
## Motivation
Move the pytorch docker image used for CK testing into private repo.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK Tile] Add sink token gradient support in FMHA backward
pass (#5504)
## Motivation
Adds sink token support to the FMHA backward kernel (dot_do_o pipeline):
## Technical Details
- Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType
- Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient
## Test Plan
Add new test case
## Test Result
WIP
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix async pivot mismatch in persistent GEMM kernel
scheduler (#5776)
## Motivation
Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.
PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.
Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.
## Technical Details
In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:
* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.
This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.
**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.
*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*
## Test Plan
1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`
## Test Result
Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add pooling in tile_engine
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Add pooling in ck tile engine
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR
misallocation (#5991)
## Motivation
After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.
## Technical Details
- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
this issue.
## Test Plan
## Test Result
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] contraction: extend GetTypeString() to include
layout-differentiating params (#6022)
## Motivation
Consumers that identify kernels by their `GetTypeString()` (such as
hipTensor's actor-critic kernel selection, which hashes the string into
a
stable cross-platform UID) were silently dropping one of two colliding
variants during registry insertion.
`GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle`
previously
printed 13 template parameters, omitting
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
These four parameters determine the block-transfer access width and LDS
padding strategy, and are precisely what differentiates the `kk`, `kn`,
`mk`, and `mn` layout variants from one another when all other geometry
parameters are equal. Two instantiations with identical 13-parameter
strings
are distinct C++ types that accept different stride layouts and reject
each
other's arguments via `IsSupportedArgument`.
This patch extends the output to 17 parameters so that every distinct
template instantiation of this class produces a unique
`GetTypeString()`.
## Technical Details
`include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`:
- extend `GetTypeString()` from 13 to 17 parameters including
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.
## Test Plan
Build CK and hipTensor with these changes, and verify hipTensor can
differentiate and select the
correct kernels with layout variations.
## Test Result
CK is building correctly and hipTensor is selecting the kernels
correctly.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Force padding for atomic_add bf16 C tensor
(#5842)
## Motivation
Force padding for atomic_add bf16 C tensor to avoid memfaults.
## Technical Details
- add global atomic add for bf16 and enable them
- add padding for atomic add bf16 due to the lack of oob
- remove padding for not continous dims in conv for other cases
- minor bwd data conv fixes
## Test Plan
test_grouped_conv_*_tile
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Changed cshuffle LDS descriptor to naive layout
(#5729)
## Motivation
This PR changes gemm/convolution cshuffle layout into plain one. to
improve cshuffle operation performance.
## Technical Details
The purpose is that before this change the cshuffle layout was having
some descriptor transformations that were probably aimed at reducing LDS
bank conflicts, but the transformations itself were terribly slow, which
negatively impacted the performance.
## Test Plan
There is no need for additional tests, since current tests cover this
functionality.
[CK] fix clang lifetimebound errors with staging compiler
(#5921)
## Motivation
The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.
## Technical Details
Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:
**`ck::` namespace:**
- `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
- `index_array` -- `operator[]`
- `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
- `IndexLookupTable` -- `operator[]`
**`ck_tile::` namespace:**
- `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `static_array` -- `operator[]`
- `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
- `make_kernel()` -- parameter pack
Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
accordingly.
## Test Plan
- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Fix Windows build in FMHA head grouping
## Motivation
This is a follow-up fix for [PR
#5018](https://github.com/ROCm/rocm-libraries/pull/5018).
[PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added
LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it
also introduced Linux-only code paths, including `<dirent.h>`, which
break Windows builds. This change fixes that by guarding the
Linux-specific LLC probing logic so non-Linux platforms can still build
correctly.
## Technical Details
- Guard `<dirent.h>` with `#ifdef __linux__`
- Guard KFD sysfs traversal logic with `#if defined(__linux__)`
- On non-Linux platforms, return `0` from
`get_kfd_sysfs_llc_cache_bytes()`
- Preserve existing fallback behavior through:
- `CK_TILE_FMHA_LLC_CACHE_MB`
- arch-based default LLC sizes
- no head grouping when no LLC size can be resolved
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Fix FMHA BWD register pressure by wrapping
num_total_loop with amd_wave_read_first_lane (#5915)
## Motivation
In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.
## Technical Details
Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`
This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] More lifetime-warning suppression
## Motivation
The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.
## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op
(#5095)" (#5849)
This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix min k_batch calculation in conv kernels
## Motivation
Avoid division by 0 and remove not needed "-1".
## Technical Details
Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.
## Test Plan
test_grouped_conv_bwd_weight
## Test Result
Passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1019
[CK] Fix error in dockerfile when building staging compiler.
(#5829)
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Support for CompV4 pipeline in Stream-K GEMM
(#5445)
## Motivation
This PR is extending the pipeline support for Stream-K GEMM by adding
the CompV4 pipeline. Additional pipelines will be added in subsequent
PRs.
## Technical Details
- Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer
to true if the CompV4 pipeline has been selected as it requires double
buffered shared memory
- Addition of CompV4 pipeline into the extended tests: kernel instances
mirror the existing CompV3/Mem configurations (same layout permutations,
data types, and tile sizes) with the pipeline type set to CompV4.
- Addition of CompV4 pipeline into smoke tests (generated using Tile
Engine)
## Test Plan
These were tested using the existing smoke and extended tests.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Fix NaN for FMHA BWD When seq_q=0
## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.
## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK_TILE] Fix CTest parsing to handle all test number
formats (#5880)
## Motivation
Fix a bug in the smart-build --ctest-only filter that was incorrectly
excluding tests with numbers less than 100.
## Technical Details
The issue was caused by CTest formatting test numbers with variable
spacing based on the number of digits:
- "Test `#1`: name (3 spaces for tests 1-9)"
- "Test `#79`: name (2 spaces for tests 10-99)"
- "Test `#100`: name (1 space for tests 100+)"
The previous code used `line.strip().startswith("Test #")` which only
matched tests with a single space (i.e., test numbers >= 100).
This caused tests like ck_tile_unit_sequence (Test #79) to be excluded
from smart-build test selection, resulting in CTest failures when the
binary wasn't built.
Solution: Replace string matching with a regex pattern that handles
all spacing variations: r'^\s*Test\s+#\d+:\s*(.+)$'
## Test Plan
Tested with test numbers from 1 to 12345.
## Test Result
- Before: 48 tests selected (only tests #100+)
- After: 146 tests selected (all CTest-registered tests)
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
fix AITER docker setup
## Motivation
Add a new python package required to build AITER.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Fix unused param mask
## Motivation
Compiler error caused by unused param mask.
## Technical Details
Skip tests using param mask in test loop.
## Test Plan
Current test improvements.
## Test Result
Passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK_TILE] Revert addional oob check in gemm IsSupported
function (#5789)
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used
in addition to MIOPEN_REQ_LIBS_ONLY (#5675)
## Motivation
When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and
-DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly
installed. This change is necessary to allow hipTensor to build with
TheRock without the need to rebuild CK from source.
## Technical Details
The solutions consists in considering both HIPTENSOR_REQ_LIBS_ONLY and
MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake,
following the same approach used to the conv target (for MIOpen).
## Test Plan
Manually test the build and installation with
`-DHIPTENSOR_REQ_LIBS_ONLY=ON` and both `-DHIPTENSOR_REQ_LIBS_ONLY=ON
-DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files as
installed.
## Test Result
The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the
targets contraction, reduction and other, while
`-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv,
contraction, reduction and other.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] fix clang lifetime bound error in ck_builder.
## Motivation
This resolves the compilation error with latest develop compiler branch.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Re-enable daily builds with staging compiler
## Motivation
This should help us catch and fix any new compilation issues early on.
## Technical Details
We now have three compiler profiles:
* **develop**: slightly stabilized version of amd-staging with some of
the obvious offending PRs reverted, 1-2 weeks behind amd-staging;
* **amd-mainline**: more stable version of compiler, the baseline for
all other branches, e.g., release, npi, etc. 2-4 weeks behind
amd-staging.
* **amd-staging**: latest compiler version where all new PRs land, often
broken;
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: kensclin <lshyhchy@amd.com>
[CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler
(#5516)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things
- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.
## Technical Details
Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.
Revert "Ck/joye/revert oob check (#5640)"
This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Adding New Notification Detection
## Motivation
Restricting one of the notification failure patterns to match a specific
missing drivers log pattern. This will help reduce the noise of
erroneous logs. Also adding a new failure pattern to notify us of Github
access issues.
## Technical Details
- Set the failure pattern to match the exact failure observed in the
logs.
- Switching to a plain substring search so special characters are
handled literally.
- Added a new failure pattern for Github access errors.
## Test Plan
- Force a failure using the known failure patterns.
## Test Result
The forced failures were triggered and caught by the notification
system.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Fix kbatch check in grouped conv and gemm
kernels (#5555)
## Motivation
Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.
## Technical Details
Round up K / Kperxdl and divide it by Kbatch to allow tail for K.
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Prune Stream-K Tile Engine Tests
## Motivation
Stream-K tile engine tests are causing issues for build time. While we
work on a more permanent solution, these changes prune the Stream-K test
instances to help reduce the build time burden.
## Technical Details
The Stream-K team recently transitioned to using CK Tile's tile engine
infrastructure for our smoke tests. However, since tile engine creates
an individual target per kernel instance, we've found that the tile
engine tests are increasing build times. Our team is currently working
to convert our existing tile engine tests back to basic gtests. While
this work takes place, we are temporarily pruning the existing Stream-K
tile engine test instances to help reduce the build time burden.
## Test Plan
Ran the pruned test set on all gfx90a, gfx942, and gfx950.
## Test Result
All tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Improved CI infrastructure failure detection
## Motivation
This PR re-enables CI infrastructure failure detection and notification,
which had been disabled due to performance issues caused by loading
large build logs (~80k lines) into memory for pattern scanning. The goal
is to reliably detect known infrastructure failures (GPU errors, Docker
authentication issues, disk space errors, etc.) and send actionable
Teams notifications without hanging on large logs.
## Technical Details
- Replaced full build log loading and Groovy-based pattern scanning with
a streaming wget | grep -E pipe. grep scans natively so the full log is
never loaded into Groovy, resolving the hang on large logs.
- Combined all failure patterns into a single grep -E call to avoid
multiple log fetches.
- The node name is now tracked with the observed failure.
- Added a new failure pattern for device's running out of space.
## Test Plan
- Forced failures in the "Determine CI Execution" stage with all 9
failure patterns echoed to the build log.
- Simulated large log sizes (~80k lines of dummy output) to validate
pattern detection and node name extraction at realistic log scales,
including patterns placed both before and after large blocks of dummy
output.
## Test Result
All 9 failure patterns detected correctly. Teams notifications sent with
accurate log context, node name, and job links. No hangs observed on 80k
line simulated logs.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
add self healing to ref repo
## Motivation
Check for when mirror repo gets corrupted in CI
## Technical Details
We detect broken ref objects and rebuild the local mirror in that case
of corruption
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Improve access for merged groups and remove
modulo from xor (#5334)
## Motivation
[CK][CK Tile] Improve access for merged groups and remove modulo from
xor
## Technical Details
- add template parameter to xor if modulo is needed. We don't need
modulo for merged groups
- use access by m for merged groups for a tensor
-
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Move grouped conv cpp instances to build dir
(#5609)
## Motivation
Move grouped conv .cpp instances to build dir. Fix generate instances
script.
## Technical Details
Avoid CI problem when instances in experimental directory are not
removed
## Test Plan
test_grouped_convnd_*_tile
## Test Result
Pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Ck/joye/revert oob check
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Rename Stream-K grid function
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().
## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.
## Test Plan
Validated using the test suite that is already present.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.