## Motivation
In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.
## Technical Details
Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`
This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.
## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Avoid division by 0 and remove not needed "-1".
## Technical Details
Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.
## Test Plan
test_grouped_conv_bwd_weight
## Test Result
Passed locally.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1019
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This PR is extending the pipeline support for Stream-K GEMM by adding
the CompV4 pipeline. Additional pipelines will be added in subsequent
PRs.
## Technical Details
- Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer
to true if the CompV4 pipeline has been selected as it requires double
buffered shared memory
- Addition of CompV4 pipeline into the extended tests: kernel instances
mirror the existing CompV3/Mem configurations (same layout permutations,
data types, and tile sizes) with the pipeline type set to CompV4.
- Addition of CompV4 pipeline into smoke tests (generated using Tile
Engine)
## Test Plan
These were tested using the existing smoke and extended tests.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.
## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
## Motivation
Fix a bug in the smart-build --ctest-only filter that was incorrectly
excluding tests with numbers less than 100.
## Technical Details
The issue was caused by CTest formatting test numbers with variable
spacing based on the number of digits:
- "Test `#1`: name (3 spaces for tests 1-9)"
- "Test `#79`: name (2 spaces for tests 10-99)"
- "Test `#100`: name (1 space for tests 100+)"
The previous code used `line.strip().startswith("Test #")` which only
matched tests with a single space (i.e., test numbers >= 100).
This caused tests like ck_tile_unit_sequence (Test #79) to be excluded
from smart-build test selection, resulting in CTest failures when the
binary wasn't built.
Solution: Replace string matching with a regex pattern that handles
all spacing variations: r'^\s*Test\s+#\d+:\s*(.+)$'
## Test Plan
Tested with test numbers from 1 to 12345.
## Test Result
- Before: 48 tests selected (only tests #100+)
- After: 146 tests selected (all CTest-registered tests)
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
## Motivation
Add a new python package required to build AITER.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Compiler error caused by unused param mask.
## Technical Details
Skip tests using param mask in test loop.
## Test Plan
Current test improvements.
## Test Result
Passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/releases">requests's
releases</a>.</em></p>
<blockquote>
<h2>v2.33.0</h2>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report any gaps or feedback you may have in the
issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts contents to a non-deterministic location to prevent malicious
file replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause malformed
authentication to be applied to Requests on Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/M0d3v1"><code>@M0d3v1</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6865">psf/requests#6865</a></li>
<li><a href="https://github.com/aminvakil"><code>@aminvakil</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7220">psf/requests#7220</a></li>
<li><a href="https://github.com/E8Price"><code>@E8Price</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6960">psf/requests#6960</a></li>
<li><a href="https://github.com/mitre88"><code>@mitre88</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7244">psf/requests#7244</a></li>
<li><a href="https://github.com/magsen"><code>@magsen</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6553">psf/requests#6553</a></li>
<li><a
href="https://github.com/Rohan5commit"><code>@Rohan5commit</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7227">psf/requests#7227</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25">https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's
changelog</a>.</em></p>
<blockquote>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that
uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report
any gaps or feedback you may have in the issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts
contents to a non-deterministic location to prevent malicious file
replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause
malformed authentication to be applied to Requests on
Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="bc04dfd6da"><code>bc04dfd</code></a>
v2.33.0</li>
<li><a
href="66d21cb07b"><code>66d21cb</code></a>
Merge commit from fork</li>
<li><a
href="8b9bc8fc0f"><code>8b9bc8f</code></a>
Move badges to top of README (<a
href="https://redirect.github.com/psf/requests/issues/7293">#7293</a>)</li>
<li><a
href="e331a288f3"><code>e331a28</code></a>
Remove unused extraction call (<a
href="https://redirect.github.com/psf/requests/issues/7292">#7292</a>)</li>
<li><a
href="753fd08c5e"><code>753fd08</code></a>
docs: fix FAQ grammar in httplib2 example</li>
<li><a
href="774a0b837a"><code>774a0b8</code></a>
docs(socks): same block as other sections</li>
<li><a
href="9c72a41bec"><code>9c72a41</code></a>
Bump github/codeql-action from 4.33.0 to 4.34.1</li>
<li><a
href="ebf7190679"><code>ebf7190</code></a>
Bump github/codeql-action from 4.32.0 to 4.33.0</li>
<li><a
href="0e4ae38f0c"><code>0e4ae38</code></a>
docs: exclude Response.is_permanent_redirect from API docs (<a
href="https://redirect.github.com/psf/requests/issues/7244">#7244</a>)</li>
<li><a
href="d568f47278"><code>d568f47</code></a>
docs: clarify Quickstart POST example (<a
href="https://redirect.github.com/psf/requests/issues/6960">#6960</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/requests/compare/v2.32.5...v2.33.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/ROCm/rocm-libraries/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and
-DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly
installed. This change is necessary to allow hipTensor to build with
TheRock without the need to rebuild CK from source.
## Technical Details
The solutions consists in considering both HIPTENSOR_REQ_LIBS_ONLY and
MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake,
following the same approach used to the conv target (for MIOpen).
## Test Plan
Manually test the build and installation with
`-DHIPTENSOR_REQ_LIBS_ONLY=ON` and both `-DHIPTENSOR_REQ_LIBS_ONLY=ON
-DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files as
installed.
## Test Result
The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the
targets contraction, reduction and other, while
`-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv,
contraction, reduction and other.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This resolves the compilation error with latest develop compiler branch.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This should help us catch and fix any new compilation issues early on.
## Technical Details
We now have three compiler profiles:
* **develop**: slightly stabilized version of amd-staging with some of
the obvious offending PRs reverted, 1-2 weeks behind amd-staging;
* **amd-mainline**: more stable version of compiler, the baseline for
all other branches, e.g., release, npi, etc. 2-4 weeks behind
amd-staging.
* **amd-staging**: latest compiler version where all new PRs land, often
broken;
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: kensclin <lshyhchy@amd.com>
## Motivation
We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things
- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.
## Technical Details
Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.
---------
Co-authored-by: Ville Pietilä <>
This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419.
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Restricting one of the notification failure patterns to match a specific
missing drivers log pattern. This will help reduce the noise of
erroneous logs. Also adding a new failure pattern to notify us of Github
access issues.
## Technical Details
- Set the failure pattern to match the exact failure observed in the
logs.
- Switching to a plain substring search so special characters are
handled literally.
- Added a new failure pattern for Github access errors.
## Test Plan
- Force a failure using the known failure patterns.
## Test Result
The forced failures were triggered and caught by the notification
system.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Fix kbatch check in grouped conv and gemm kernels, allow tails for
kbatch.
## Technical Details
Round up K / Kperxdl and divide it by Kbatch to allow tail for K.
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Stream-K tile engine tests are causing issues for build time. While we
work on a more permanent solution, these changes prune the Stream-K test
instances to help reduce the build time burden.
## Technical Details
The Stream-K team recently transitioned to using CK Tile's tile engine
infrastructure for our smoke tests. However, since tile engine creates
an individual target per kernel instance, we've found that the tile
engine tests are increasing build times. Our team is currently working
to convert our existing tile engine tests back to basic gtests. While
this work takes place, we are temporarily pruning the existing Stream-K
tile engine test instances to help reduce the build time burden.
## Test Plan
Ran the pruned test set on all gfx90a, gfx942, and gfx950.
## Test Result
All tests pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This PR re-enables CI infrastructure failure detection and notification,
which had been disabled due to performance issues caused by loading
large build logs (~80k lines) into memory for pattern scanning. The goal
is to reliably detect known infrastructure failures (GPU errors, Docker
authentication issues, disk space errors, etc.) and send actionable
Teams notifications without hanging on large logs.
## Technical Details
- Replaced full build log loading and Groovy-based pattern scanning with
a streaming wget | grep -E pipe. grep scans natively so the full log is
never loaded into Groovy, resolving the hang on large logs.
- Combined all failure patterns into a single grep -E call to avoid
multiple log fetches.
- The node name is now tracked with the observed failure.
- Added a new failure pattern for device's running out of space.
## Test Plan
- Forced failures in the "Determine CI Execution" stage with all 9
failure patterns echoed to the build log.
- Simulated large log sizes (~80k lines of dummy output) to validate
pattern detection and node name extraction at realistic log scales,
including patterns placed both before and after large blocks of dummy
output.
## Test Result
All 9 failure patterns detected correctly. Teams notifications sent with
accurate log context, node name, and job links. No hangs observed on 80k
line simulated logs.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Check for when mirror repo gets corrupted in CI
## Technical Details
We detect broken ref objects and rebuild the local mirror in that case
of corruption
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
[CK][CK Tile] Improve access for merged groups and remove modulo from
xor
## Technical Details
- add template parameter to xor if modulo is needed. We don't need
modulo for merged groups
- use access by m for merged groups for a tensor
-
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
The point of this MR is to update the intrinsic layout parameters to
simplify them and make them more clear and flexible. Also, a number of
simple refactors were performed to reduce boilerplate and code
duplication.
## Technical Details
In CK Tile and old CK, the full set of information available in the
intrinsic wrappers, for WMMA and MFMA combined, would be something like:
```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;
// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;
// Layout parameters
static constexpr index_t kAMBlock;
static constexpr index_t kBNBlock;
static constexpr index_t kRepeat;
static constexpr index_t kAMLane;
static constexpr index_t kBNLane;
static constexpr index_t kABK0PerLane;
static constexpr index_t kABKLane;
static constexpr index_t kABK1PerLane;
static constexpr index_t kCMLane;
static constexpr index_t kCNLane;
static constexpr index_t kCM0PerLane;
static constexpr index_t kCM1PerLane;
using kABPs2RHssMajor = sequence<2, 1>;
using kABPs2RHssMinor = sequence<1, 0>;
using kABYs2RHsMajor = sequence<2, 2>;
using kABYs2RHsMinor = sequence<0, 2>;
using kCPs2RHssMajor = sequence<1, 2>;
using kCPs2RHssMinor = sequence<1, 0>;
using kCYs2RHsMajor = sequence<1, 1>;
using kCYs2RHsMinor = sequence<0, 2>;
using kCTPs2RHssMajor = sequence<2, 1>;
using kCTPs2RHssMinor = sequence<1, 0>;
using kCTYs2RHsMajor = sequence<2, 2>;
using kCTYs2RHsMinor = sequence<0, 2>;
```
Note that on top of the intrinsic sizes, we have 12 layout parameters. I have reduced this in the new design to:
```
// Basic info
using ADataType = void;
using BDataType = void;
using CDataType = void;
// Fragment sizes
static constexpr index_t kM;
static constexpr index_t kN;
static constexpr index_t kK;
// Layout parameters
static constexpr index_t kABKPerLane; // K2 * K0, Always the same, even
for diff A / B layouts
static constexpr index_t kAKNumAccess; // K2
static constexpr index_t kARepeat; // Used for RDNA3 repeated inputs and
CDNA block hiding.
static constexpr index_t kBKNumAccess; // K2
static constexpr index_t kBRepeat; // Used for RDNA3 repeated inputs and
CDNA block hiding.
static constexpr index_t kCMPerLane; // M2 * M0
static constexpr index_t kCMNumAccess; // M2
// Derived properties
using AVecType = ext_vector_t<ADataType, 0>;
using BVecType = ext_vector_t<BDataType, 0>;
using CVecType = ext_vector_t<CDataType, 0>;
```
Note that there are now only 7 layout parameters and no more dimensionality orderings. Believe it or not these 7 parameters are more general than the original 12, and can handle intrinsic and mid-level features that are currently awkward in CK Tile, like dealing with AttrNumAccess, different A / B layouts, more general block-hiding (currently very limited in CK tile), and future arch features.
Furthermore, the A, B and C vec types are now derived directly from the layout parameters to ensure internal consistency.
I added a detailed explanation of the new params in terms of register mappings at the top of amgcn_mma.hpp
Other refactorings I did in this MR:
- Make an amdgcn_mma_base struct to drastically reduce code duplication and potential bugs. Should also make auto-generating the amd_gcn specializations much easier.
- Simplify the MmaOpTraits significantly by only including those parameters that are not directly gettable from the MmaOp itself. This removes duplicated variables and simplifies higher level code.
- Remove overloaded "Block" term for intrinsic dimensions, and replace by "Frag" instead. Some spots were already using the term "Frag" for combined intrinsics, in which case I changed that term to "Chunk" instead.
- Remove some tests that had become somewhat pointless (setting variables and then checking their values immediately).
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Move grouped conv .cpp instances to build dir. Fix generate instances
script.
## Technical Details
Avoid CI problem when instances in experimental directory are not
removed
## Test Plan
test_grouped_convnd_*_tile
## Test Result
Pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
fix ck_tile's oob check.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This PR introduces a change in the name of the get_grid function in the
Stream-K TilePartitioner to avoid confusion with a similarly named
method. In the Stream-K TilePartitioner, there is get_grid() which
returns num_cu*occupancy and there is grid_size() which returns the grid
size used to launch the kernel. In this PR, we change get_grid() to be
get_max_active_wgs() to better reflect what the function returns and not
confuse it with grid_size().
## Technical Details
Initially in the Stream-K TilePartitioner we had get_grid() which
returned grid_. We are renaming get_grid() to get_max_active_wgs() and
grid_ to max_active_wgs_ internally, while keeping grid_size() the same.
The parameter, grid, for the Stream-K TilePartitioner remains the same
to maintain consistency with the rest of the Stream-K API.
## Test Plan
Validated using the test suite that is already present.
## Test Result
All tests passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Add support for MXFP6 in the MX GEMM op in CK-Tile.
Depends on https://github.com/ROCm/rocm-libraries/pull/4594
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
How ninja install works:
- Builds library dependencies (device_operations, etc.)
- Installs them to CMAKE_INSTALL_PREFIX
- Skips building test executables (not install dependencies)
Affected stages (8):
- gfx942/gfx950/gfx908/gfx90a CK Client Examples
- gfx10-1/gfx10-3/gfx11/gfx12 CK Client Examples
## Motivation
Problem: When smart-build is enabled (runAllUnitTests=false), the build
step is skipped entirely. This causes client example stages to fail
because they depend on the CK library being installed to ../install.
Error seen:
Target "client_gemm" links to:
composable_kernel::device_other_operations
but the target was not found.
## Technical Details
Root cause: Line 712 only checked runAllUnitTests, so when building with
config_targets="install", the install target was never built, leaving
the install directory empty.
Fix: Added condition to always build when config_targets contains
'install'. The install target automatically builds its dependencies (the
CK libraries) but skips building tests, which aligns with smart-build
philosophy.
## Test Plan
Should be tested on CI
## Test Result
Should be tested on CI
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
## Motivation
Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile
window.
## Technical Details
- Fix dram step for KM/KN layouts in V1 pipeline
- Disable instances which use more threads than warp size in continous
dim (not supported in ck tile yet)
- Use 1x1 specialization for explicit gemm
- Use two stage for vectorsize =1 and sizeof(datatype) ==2
- remove not needed check sinze GetVectorSizeA/B check if vector size is
fixed
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-966
## Proposed changes
TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in
CK_TILE on gfx942 and gfx950.
## Checklist
Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run on all changed files
- [ ] Any dependent changes have been merged
## Discussion
---
🔁 Imported from
[ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538)
🧑💻 Originally authored by @yingluAMD
---------
Co-authored-by: yingluAMD <Yingmao.Lu@amd.com>
Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
## Motivation
Existing dependency parser needs full build of tests to determine which
tests are affected by code changes in a PR. This still takes 2-4 hours
for building the tests which slows down the CI as the number of tests
grow. To resolve this issue we implemented a smart dependency parser
which uses CMake Configure to parse dependencies and build only the
affected test cases. We have ensured that two approaches are available
1) CMake pre-build analysis for each PR to ensure fast build and test.
2) Ninja post-build analysis to enable full build for nightly tests.
## Technical Details
```bash
### 1. Configure the project with CMake
cmake -G Ninja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
### 2. Analyze dependencies (no build required!)
python3 ../script/dependency-parser/main.py cmake-parse compile_commands.json build.ninja \
--workspace-root .. --output cmake_dependency_mapping.json --parallel 8
### 3. Find tests affected by changes
python3 ../script/dependency-parser/main.py select cmake_dependency_mapping.json origin/develop \
HEAD --test-prefix --output tests_to_run.json
### 4. Build only affected tests
ninja $(jq -r '.executables[]' tests_to_run.json | tr '\n' ' ')
### 5. Run affected tests
ctest -R "$(jq -r '.regex' tests_to_run.json)"
```
### Jenkins Integration
- Added `buildMode` to jenkinsfile to integrate both `selective` and
`full` build methods
### Known Limitations
### 1. Build-Time Generated Headers (HIGH RISK)
**Problem:** Files generated during the build process (e.g., via
`add_custom_command`) cannot be analyzed before building.
**Example:**
```cmake
add_custom_command(
OUTPUT ${CMAKE_BINARY_DIR}/generated/config.hpp
COMMAND generate_config.sh
DEPENDS template.hpp.in
)
```
**Impact:** If a source file includes `generated/config.hpp`, the
dependency won't be detected until after building.
**Mitigation:**
- CK analysis shows **no generated headers** currently used
- If generated headers are added in the future, they must be built first
- Recommendation: Generate headers in CMake configure phase (not build
phase) when possible
## Test Plan
**1. Modified Files:**
```
include/ck_tile/ops/common.hpp
include/ck_tile/ops/gemm.hpp
include/ck_tile/ops/gemm/warp/warp_gemm.hpp
```
**2. Compare tests selected between `build.ninja` and `cmake-parse`
methods**
## Test Result
- 1. The test completed in 5-6 minutes finding about 8000+ executables
that should be built.
- 2. We selected a commit 5ccc1387ea which resulted in same 7 tests with
both legacy and new methods.
-
PR | Legacy tests | Smart tests | Notes
-- | -- | -- | --
5261 | 453 | 455 | Only 2 tests (test_amdgcn_mma and
test_amdgcn_sparse_mma)
5168 | 0 | 0 | Changes in dispatcher only. No CK tests invoked.
5249 | 0 | 0 | Changes to dependency parser. No CK tests invoked
5260 | 0 | 0 | Changes in dispatcher only. No CK tests invoked.
5174 | 1 | 1 | One test from FMHA affected by this PR in both cases
5383 | 0 | 0 | Changes are only in benchmark files. Did not trigger any
tests
5445 | 1 | 1 | Changes are only to tests/ck_tile/gemm_streamk. Only
triggered one streamk test in both cases.
5454 | 3 | 3 | Both methods identified same test_grouped_conv_bwd tests
5427 | 234 | 234 | Core infrastructure header changes. Detected exactly
same tests
5388 | 85 | 85 | modifies warp-level GEMM operations (warp_gemm.hpp,
warp_gemm_dispatcher.hpp). Correctly identified all the streamK gemm
tests
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base
pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces
are still based on full K dimension. This causes hardware buffer
resource descriptors to extend beyond the actual tensor allocation,
leading to GPU memory access faults when the tensor happens to be placed
at the end of an allocated memory pool region.
Fix by subtracting the split offset from each buffer's element space in
both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer
descriptor range [shifted_base, shifted_base + reduced_space) exactly
covers the valid allocation.
Also refactor SplitKBatchOffset to accept const Problem& (instead of
Argument&) and add a default constructor, enabling direct reuse in
Run/Run_2Lds without duplicating offset calculation logic.
Made-with: Cursor
## Motivation
<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yi DING <yi.ding@amd.com>
## Motivation
Reduce the scale loading size and also has better utilization of MFMA
scale selection.
## Technical Details
Add up the packing of mx scales.
## Test Plan
Use the existing test cases.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
## Motivation
Long-sequence FMHA can become memory-bound when K/V working sets exceed
Infinity Cache (LLC), causing repeated DRAM traffic across heads.
This PR introduces LLC-aware launch ordering improvements for FMHA
forward, and it is currently enabled only on gfx11 and gfx12. The
approach is inspired by
[`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217),
adapted to CK’s kernel/runner structure and layout handling.
In this context, `bshd` is the layout used in Flash-Attention, while
`bhsd` is the default layout used by the CK Tile FMHA example.
## Technical Details
This PR adds two complementary strategies:
- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware
head grouping:
- Estimate LLC size (env override, KFD sysfs, or arch default).
- Compute group size from K/V bytes per head vs LLC target.
- Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and
related tensors).
- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit
launch-order adjustment:
- Keep a single kernel launch.
- Reinterpret block linearization in `GetTileIndex` to make execution
head-major,
improving temporal locality of per-head K/V reuse.
Additional integration updates:
- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped
launches keep
deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or
disabled.
## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`
## Test Result
- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201,
and all of them show higher performance compared to the baseline. The
improvement is consistent, and performance is well maintained even at
long sequence lengths.
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128
-s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
- TFLOPs by sequence length target: gfx1100 layout: bhsd
SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x
- TFLOPs by sequence length target: gfx1201 layout: bshd
SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Flush cache to get more stable results during profiling old ck and ck
tile.
## Technical Details
Flush cache before each kernel call and one more first run.
## Test Plan
test_grouped_conv_bwd_weight_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-966
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Eight waves pipeline was added for ABQuant. The goal of this PR is to
enable it also for GEMM
## Technical Details
Summary:
- Block:
- Create block struct for GEMM using eight warps specific distribution
encodings
- Use this block struct in ABQuant for encodings
- Pipeline:
- Create impl pipeline for eight waves which can be used by GEMM and
ABQuant as base (and for AQuant and BQuant in the future)
- Create eight waves pipeline for GEMM (this can not be easily
integrated in the existing async pipeline)
- Pipeline policy:
- Extract GEMM specific parts in the ABQuant policy to define GEMM
policy (then ABQuant use it as base and add Quant specific methods)
- Minor: naming was inconsistent between warp/wave, everything is now
referred to as eight waves
So overall we have:
- block struct directly used by GEMM -> ABQuant derived struct to
implement operator
- Impl base pipeline with general implementation -> GEMM and ABQuant
pipelines use it to avoid code duplication but still define their own
pipelines
- pipeline policy struct directly used by GEMM -> ABQuant derived policy
struct for Quant specific parts
## Test Plan
Added new tests for GEMM pipeline:
`test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950
supports it).
Note: K padding test is disabled for this pipeline because it's not
implemented yet
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Grouped Convolution Backward Weight split k fixes for CK tile kernels
## Technical Details
- get k batch from kargs to get deduced k batch
- multiply zeroing size by data type size
- disable v6 (producing a incorrect results)
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
Pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Ville Pietilä <>
## Motivation
PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.
## Technical Details
Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.
## Test Plan
Added units tests for CK Builder that verify the two-stage kernel
construction.
## Test Result
If CI passes, the added unit tests are passing.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Ville Pietilä <>
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.4 to
6.5.5.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst">tornado's
changelog</a>.</em></p>
<blockquote>
<h1>Release notes</h1>
<p>.. toctree::
:maxdepth: 2</p>
<p>releases/v6.5.5
releases/v6.5.4
releases/v6.5.3
releases/v6.5.2
releases/v6.5.1
releases/v6.5.0
releases/v6.4.2
releases/v6.4.1
releases/v6.4.0
releases/v6.3.3
releases/v6.3.2
releases/v6.3.1
releases/v6.3.0
releases/v6.2.0
releases/v6.1.0
releases/v6.0.4
releases/v6.0.3
releases/v6.0.2
releases/v6.0.1
releases/v6.0.0
releases/v5.1.1
releases/v5.1.0
releases/v5.0.2
releases/v5.0.1
releases/v5.0.0
releases/v4.5.3
releases/v4.5.2
releases/v4.5.1
releases/v4.5.0
releases/v4.4.3
releases/v4.4.2
releases/v4.4.1
releases/v4.4.0
releases/v4.3.0
releases/v4.2.1
releases/v4.2.0
releases/v4.1.0
releases/v4.0.2
releases/v4.0.1
releases/v4.0.0
releases/v3.2.2
releases/v3.2.1
releases/v3.2.0
releases/v3.1.1</p>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="7d6465056c"><code>7d64650</code></a>
Merge pull request <a
href="https://redirect.github.com/tornadoweb/tornado/issues/3586">#3586</a>
from bdarnell/update-cibw</li>
<li><a
href="d05d59b808"><code>d05d59b</code></a>
build: Bump cibuildwheel to 3.4.0</li>
<li><a
href="c2f46732b0"><code>c2f4673</code></a>
Merge pull request <a
href="https://redirect.github.com/tornadoweb/tornado/issues/3585">#3585</a>
from bdarnell/release-655</li>
<li><a
href="e5f1aa4b6f"><code>e5f1aa4</code></a>
Release notes and version bump for v6.5.5</li>
<li><a
href="78a046f99f"><code>78a046f</code></a>
httputil: Add CRLF to _FORBIDDEN_HEADER_CHARS_RE</li>
<li><a
href="24a2d96ea1"><code>24a2d96</code></a>
web: Validate characters in all cookie attributes.</li>
<li><a
href="119a195e29"><code>119a195</code></a>
httputil: Add limits on multipart form data parsing</li>
<li>See full diff in <a
href="https://github.com/tornadoweb/tornado/compare/v6.5.4...v6.5.5">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/ROCm/rocm-libraries/network/alerts).
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Motivation
It's common in MoE workloads that some experts receive zero tokens,
which would result in some of the dimensions equal to zero. Currently we
handle such case only for non-persistent kernels where we have all GEMMs
information beforehand on host - we validate this during creation of
kernel arguments. However for the "dynamic" input path (persistent
kernel) this information is not available before kernel launch. Thus we
have to validate this during kernel execution. The goal is to add this
validation.
## Technical Details
Skip work if any of Grouped GEMM groups M/N/K are zero for persistent
kernel path.
## Test Plan
Add unit-tests which cover "dynamic" inputs with zero dims for
persistent kernel execution path.
## Test Result
All tests pass.
## Submission Checklist
- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>