3233 Commits

Author SHA1 Message Date
Illia Silin
17e4c8eac9 [rocm-libraries] ROCm/rocm-libraries#4883 (commit 56347bb)
[CK] Disable test_fmha_fwd_fp8fp16 on gfx90a by default. (#4883)

## Motivation

Since gfx90a has no native support for FP8 datatype, all FP8 tests
should be disabled there by default.

## Technical Details

`test_fmha_fwd_fp8fp16` is the last failing CK test on gfx90a with the
staging compiler.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 02:09:06 +00:00
Yung-sheng Tu
75aea70c2c [rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 00:28:58 +00:00
JP-Fernando
9a32f0ea19 [rocm-libraries] ROCm/rocm-libraries#4415 (commit b3b4af7)
[CK] Remove duplicated XDL/WMMA tests

## Motivation

When we started RDNA4 support, the XDL instances did not support
WMMA instructions, so we duplicated some tests.

In this issue, we simplified most of the duplicated test files into
common test files.

## Technical Details

The following tests were unified:

- `batched_gemm`

- `batched_gemm_gemm`

- `gemm_add`

- `gemm_universal`

- `grouped_convnd_bwd_data`

The following tests were exact duplicates, copied into two files
with `_xdl` and `_wmma` suffixes. They are now unified in a single
file without a suffix:

- `gemm_multi_abd`

- `gemm_b_scale`

There is still an apparent duplication which is a special case, namely
`test_grouped_convnd_bwd_weight_interface_{suffix}` where `{suffix}` is
`xdl` or `wmma`.
However, the WMMA code relies on an old implementation, and is expected
to be removed in the future. In addition, it differs from the XDL
implementation significantly.
Therefore, it was decided to keep both files separate instead of
attempting any unification.

## Test Plan

`CMakeLists.txt` files were modified to support the new, unified tests.
In particular, testing was done for `gfx90a`, `gfx1201` and `gfx11`
architectures.

## Test Result

All tests passed successfully on all three tested architectures.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 23:23:02 +00:00
John Afaganis
6bf2423685 [rocm-libraries] ROCm/rocm-libraries#4898 (commit 753f2ac)
Create operation support matrix for CK Tile Engine

Introduce operation support matrix for CK Tile kernels detailing data
types, layouts, and GPU targets.

## Motivation

The tile engine currently supports a subset of CK Tile operations, but
there is no in-repo reference that maps which operations, data types,
layouts, and GPU targets are covered by the tile engine versus only
available through hand-written examples or tests. This makes it
difficult for developers to know what the tile engine already handles,
what requires manual integration, and where coverage gaps exist. This PR
introduces an operation support matrix as a markdown file in
tile_engine/, intended to be maintained as a living document alongside
the code. Because it lives in the repository rather than an external
wiki or PDF, it can be reviewed and updated in the same pull requests
that add or extend tile engine operations, keeping it accurate as
coverage evolves.

## Technical Details

Documentation only change.

## Test Plan

N/A

## Test Result

N/A

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 23:07:25 +00:00
Bartłomiej Kocot
eede24de0d [rocm-libraries] ROCm/rocm-libraries#4872 (commit ca623f7)
[CK] Small improvements for grouped conv backward weight (#4872)

## Motivation

Improvements for CK Tile convolution builder run function and atol/rtol
calculations.

## Technical Details

- Add a preprocessing function for wrw to the builder run function,
used when k_batch is larger than 1
- Divide the number of accumulations by the number of groups to get the
actual number of accumulations

## Test Plan

CI wrw tests

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-783
2026-02-25 20:11:01 +00:00
Brock Hargreaves
c90a363589 [rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd)
[CK] Use as_posix() instead of str() for paths in fmha_fwd_appendkv.py (#4812)

## Motivation

This is causing a failing PR for Windows:
https://github.com/ROCm/TheRock/pull/3382
```

[composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```

## Technical Details

The file
[fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78)
writes a list of paths to a text file which is later parsed by CMake.
Passing a `pathlib.Path` to `str()` produces a native path, which on
Windows uses backslashes instead of forward slashes. CMake then treats
sequences such as `\b` as character escapes and fails to parse the
string. We need to write paths with forward slashes instead and pass
those onward to CMake.
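
The difference between the two conversions can be reproduced on any platform with `PureWindowsPath`. This is a minimal sketch, not the code from `fmha_fwd_appendkv.py`; the path below is a hypothetical, shortened stand-in for the generated file names in the log above.

```python
from pathlib import PureWindowsPath

# Hypothetical path, shortened from the build log; PureWindowsPath applies
# Windows path semantics regardless of the host operating system.
path = PureWindowsPath(r"B:\build\example\fmha_jenga_fwd_api.cpp")

native = str(path)       # native separators: backslashes on Windows
posix = path.as_posix()  # always forward slashes

print(native)  # B:\build\example\fmha_jenga_fwd_api.cpp
print(posix)   # B:/build/example/fmha_jenga_fwd_api.cpp
```

In the native form, `\b` in `B:\build` is exactly the sequence CMake rejects as an invalid character escape; the `as_posix()` form contains no backslashes at all.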

## Test Plan

1. Ensure this doesn't impact existing CI.
2. Ensure compilation on Windows passes locally.

## Test Result

1. Passes existing CI
2. This fixes the compilation error locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 16:13:56 +00:00
Brock Hargreaves
abf13bdec1 [rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b)
[CK] Fix windows build issues

## Motivation

The full build on Windows is currently broken due to compiler errors;
this PR should help fix that. It is also holding up the following PR in
TheRock: https://github.com/ROCm/TheRock/pull/3382

## Technical Details

1. I don't see a good reason to be nesting a windows include inside the
ck_tile namespace. It was causing compiler errors too: Windows.h comes
with min and max, which was conflicting with ck_tile::min and
ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this
inclusion in the future.
2. The TRUE/FALSE macros are already used by Windows.h, which causes an
error. So I've opted for True/False. You can see this pattern in other
rocm-libraries.
3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN
context, from \<cmath\> on Windows. We'll be able to use
std::numbers::pi_v\<float\> when we have C++20 support.
4. There was a missing \<chrono\> include.

## Test Plan

Test locally and make sure this doesn't impact existing CI.

## Test Result

Compiles locally and passes existing CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 16:13:13 +00:00
Zoltán Lakatos
a32d704d89 [rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425)

## Motivation

Add support for grouped gemm multi ABD fixed NK.

## Technical Details

Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned, this feature apparently had no tests for xdl,
so our assumption is that either there is an implementation bug or these
instances were not set up correctly. This could be tracked in a
follow-up issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.

Fix epilogue selection in device implementation that caused unit test
failures

## Test Plan

Covered by added unit tests

## Test Result

CI successfully passing

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 05:17:08 +00:00
Bartłomiej Kocot
1a2c0d835a [rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6)
[CK][CK TILE] Improve oob check

## Motivation

Improve OOB checks. Remove permutes which were generated by the thread
buffer zero clear. The assembly now contains only condmask instead of
permute + condmask.

Change number of KPack for generated instances

## Technical Details

Remove permute instructions from assembly

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-24 21:41:44 +00:00
Max Podkorytov
f3f4d7d842 [rocm-libraries] ROCm/rocm-libraries#4430 (commit 3bcf68c)
[CK] Add project root marker for monorepo compatibility (#4430)

## Summary
- Add `.ck-project-root` marker file at the composablekernel project
root
- Update `find_project_root()` in `script/tools/common.sh` to look for
this marker instead of `.git`
- Fixes project root detection when CK is part of the rocm-libraries
monorepo

## Background
Since the project was moved into the monorepo, the `.git` directory is
at the monorepo root rather than the CK project root. This caused
`find_project_root()` to return the wrong path, breaking tools in
`script/tools/`.
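
The marker-based lookup can be sketched as follows. The real `find_project_root()` lives in the shell script `script/tools/common.sh`; this Python version is only an illustration of the idea, and everything apart from the `.ck-project-root` file name and the function name is an assumption.

```python
from pathlib import Path
from typing import Optional

MARKER = ".ck-project-root"  # marker file name taken from the summary above


def find_project_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` and return the first directory holding MARKER.

    Illustrative re-implementation; returns None when no marker is found.
    """
    current = start.resolve()
    for directory in (current, *current.parents):
        if (directory / MARKER).exists():
            return directory
    return None
```

Because the search stops at the first directory containing the marker, it no longer matters that `.git` sits several levels higher at the monorepo root.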

## Test plan
- [x] Verify `find_project_root` returns correct path from any CK
subdirectory
- [x] Verify `ck-build --help` works
- [x] Verify `ck-configure --help` works

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-02-24 21:00:34 +00:00
Matti Eskelinen
cd12e8e31f [rocm-libraries] ROCm/rocm-libraries#4295 (commit fa2cfc8)
[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to allow caller to pass descriptors directly (#4295)

## Proposed changes

Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create
tensor views from strides and sizes. This refactors the descriptor
creation out and adds overloaded definitions, allowing descriptors to be
created separately by the caller instead of passing explicit strides,
with no functional changes.

This will enable further refactoring of `RunGemm` to do likewise,
enabling derived kernels like BatchedContractionKernel to avoid creating
separate versions (PR
[#3457](https://github.com/ROCm/composable_kernel/pull/3457)).

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

Since the logic within `MakeXBlockViews` chains together operations on
tuples, and thus the descriptors are also passed as such, adding a
template parameter for the type of the input tuple was the simplest
option to enable the overload without too much verbiage. However, for
`MakeCBlockView` this adds a complication, as the templated definitions
are prone to overlap. For now this is avoided by moving the arguments
around for the descriptor version, which avoids the collision.
It's not a great solution, so feel free to suggest a better one.
2026-02-24 20:44:27 +00:00
Enrico Degregori
4c626aeaa6 [rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant (#4267)

## Proposed changes

Supported type combinations using BQuant=e8m0:
- A=bf16
- B=bf16,bf8,fp4

Summary:
- remove usage of `pk_fp4_raw_t`: this is consistent with other
  implementations and avoids having to account for the packed size
  explicitly. In general, the raw type should not be used because CK
  Tile internally takes care of the PackedSize, so using the raw type
  adds unnecessary complexity to the implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
  implementation was inconsistent)
- add support for scaling instructions in `DequantPack8`
- mx pipeline:
  - extend existing pipeline to support different B types
  - add support to scale and cast before writing to LDS or after reading
    from LDS (this can be defined in the `Problem` by the user)
- block gemm:
  - mx pipeline is now using block gemm BQuant
  - block gemm BQuant can now load from LDS and apply scale and then
    call block gemm universal operator. This adds new functionality and
    removes code duplication
- warp gemm:
  - add case to support 128bit ds_read/write for both A and B when
    A=16bit and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
  existed but were removed during previous tests refactoring. I added
  them again, along with other relevant tests for the new type
  combinations

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-24 17:57:02 +00:00
Cong Ma
3af1a0aafc [rocm-libraries] ROCm/rocm-libraries#4355 (commit e7f6909)
[CK TILE] Refactor sequence_reverse_inclusive_scan

## Proposed changes

Refactor ck tile `sequence_reverse_inclusive_scan` from recursive to
for-loop.

Tracking issue: #4229

This pull request introduces a new lightweight array type,
`static_array`, and refactors the sequence utilities to use it for
improved constexpr support and simplicity. The changes also include
updates to the build system to add container-related tests.

**Core Library Improvements:**

* Added a new header `static_array.hpp` that defines the `static_array`
type, a constexpr-friendly array with basic accessors and no custom
constructors.
* Updated includes in `core.hpp` and `sequence.hpp` to import
`static_array`.

**Refactoring to Use `static_array`:**

* Refactored sequence utilities in `sequence.hpp` to use `static_array`
instead of the previously forward-declared `array` type, including in
histogram and array generation logic.
* Rewrote the implementation of `sequence_reverse_inclusive_scan` to use
`static_array` for intermediate storage, improving constexpr evaluation
and clarity.
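
The recursive-to-loop change can be pictured with a small sketch. This is not the CK Tile implementation, which is a C++ `constexpr` template; the function below is a hypothetical Python analogue that assumes a reverse inclusive scan folds the sequence from the last element to the first and keeps every intermediate result, with a preallocated list standing in for `static_array`.

```python
def reverse_inclusive_scan(seq, reduce, init):
    """Loop-based reverse inclusive scan (illustrative analogue only).

    out[i] holds reduce(seq[i], reduce(seq[i + 1], ... reduce(seq[-1], init))).
    """
    out = [init] * len(seq)  # fixed-size intermediate storage
    acc = init
    for i in range(len(seq) - 1, -1, -1):
        acc = reduce(seq[i], acc)
        out[i] = acc
    return out


# Multiplying from the right, as when accumulating products of lengths.
print(reverse_inclusive_scan([2, 3, 4], lambda a, b: a * b, 1))  # [24, 12, 4]
```

A single loop over a fixed-size buffer replaces one level of recursion per element, which is the kind of simplification the refactor describes.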

**Build System and Testing:**

* Added a new test subdirectory for container tests and a GoogleTest
executable for `unit_sequence.cpp` to the CMake build configuration.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-24 15:52:33 +00:00
Emily Martins
fc3180120e [rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca)
[CK_TILE] Update Stream-K Reduction Strategy Enum

## Motivation

Currently, Stream-K has 3 reduction options: 1) atomics, 2) the
reduction described in the Stream-K paper, and 3) a tree reduction. The
reduction strategy described in the original Stream-K paper has the
starting workgroup of each tile sequentially accumulate the partial
results of the other contributing workgroups in the tile, which requires
a linear number of steps. Hence, for clarity, this work updates the
naming of the `StreamKReductionStrategy` enum members to better describe
the existing reduction strategy options.
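
To make the "linear" versus "tree" naming concrete, the sketch below counts reduction steps on the host. It is an illustration under assumptions, not the CK Tile kernels: both functions are hypothetical, and the tree variant assumes a plain pairwise reduction in which every round halves the number of partial results.

```python
def linear_reduce(partials):
    """One owner accumulates every other partial result, one per step."""
    total, steps = partials[0], 0
    for value in partials[1:]:
        total += value
        steps += 1
    return total, steps


def tree_reduce(partials):
    """Partials are combined pairwise; each round halves the count."""
    values, rounds = list(partials), 0
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds


partials = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(linear_reduce(partials))  # (36.0, 7)
print(tree_reduce(partials))    # (36.0, 3)
```

With eight contributors the sequential variant needs seven dependent steps while the pairwise variant needs three rounds, which is the distinction the `Linear` and `Tree` names are meant to convey.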

## Technical Details

Prior to this change, the enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic        = 0u,
    Reduction     = 1u,
    TreeReduction = 2u
};
```
But, the distinction between `Reduction` and `TreeReduction` is not very
clear and has some redundancy.
Hence, the updated enum is as follows:
```cpp
enum StreamKReductionStrategy : uint32_t
{
    Atomic = 0u,
    Linear = 1u,
    Tree   = 2u
};
```
All references to `StreamKReductionStrategy` were updated to reflect
this change.

## Test Plan

No new functionality was added, so no new tests were added; I just
validated existing tests and examples.

## Test Result

All tests passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-24 06:41:15 +00:00
Aviral Goel
6aa1cd8212 [rocm-libraries] ROCm/rocm-libraries#4704 (commit 17662f9)
[CK_TILE] Fix FP8 MXGEMM numerical error in async load path (#4704)

## Summary

Fixes FP8 MXGEMM producing half the expected result (e.g., 128 instead
of 256 with all 1s input).

**Bug introduced in:** `b7de1e14cea70681a23cd1a136df42910c776e4a` -
"[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)"

## Root Cause

In the `static_move_ys=true` code path in `tile_window.hpp`, the IMM
optimization computes `lds_ys_offset` using a default-constructed tensor
descriptor:

```cpp
make_tensor_coordinate(decltype(tensor_descriptor){}, idx_ys_offset)
```

This default-constructed descriptor has different strides than the
actual DRAM tensor descriptor used for `dram_ys_offset`. When these
offsets are mixed in the address calculation:

```cpp
imm_valid = lds_ys_offset % IMM_RANGE;     // From wrong descriptor
wave_offset = dram_ys_offset - imm_valid;  // From correct descriptor
```

The final address `wave_offset + imm_valid` ≠ `dram_ys_offset`, causing
incorrect memory accesses.

## Fix

Set `imm_valid = 0` to bypass the IMM optimization and ensure the full
offset is passed through `wave_offset`:

```cpp
constexpr auto imm_valid = 0; // Avoids inconsistency between lds_ys_offset and dram_ys_offset
```

This disables the 12-bit immediate field optimization in the
`buffer_load_lds` instruction but guarantees correctness. A proper fix
would require making the DRAM tensor descriptor constexpr, which is not
feasible since tensor strides depend on runtime parameters (LDA, LDB).
2026-02-23 20:30:32 +00:00
chris-tsiaousis-hpc
816abdcf9f [rocm-libraries] ROCm/rocm-libraries#4649 (commit 642e7e3)
[CK] Updated pre-commit entry points

## Motivation

Pre-commit fails after the transition to the monorepo. This fixes it.

## Technical Details

-

## Test Plan

Try to commit on CK with pre-commit enabled.

## Test Result

Pre-commit should pass. (Scripts are correctly found)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-02-23 00:39:22 +00:00
Anton Gorenko
0d92fffedb [rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11

## Motivation

Add support of gfx11 architectures (RDNA3) to FMHA.

## Technical Details

Distributions (matrix elements to lane registers mapping) of gfx11 WMMA
are completely different from distributions of gfx9 MFMA and gfx12 WMMA.
There are two cases in FMHA where this difference matters:
* usage of results (matrix C) of one GEMM as input (matrix A) of another
GEMM.
* random number generation for dropout (implementation for gfx9 MFMA,
gfx12 WMMA and host validation produce the same results).

Both cases are solved by a special remapping implemented using
`__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`.

Additional changes:
* FMHA tests are now built and run only for those types for which
instances exist (gfx11 supports only fp16 and bf16).
* Two fixes for uninitialized values (`mask.sink` and
`do_fp8_static_quant`): they may contain garbage, resulting in incorrect
dispatching logic, so that tests sometimes report that no instances are
available for the current parameters.
* Small fix to remove expcnt(0) from the s_waitcnt instruction on gfx11
when it is not requested (i.e. every time). This likely has no effect on
performance but makes the disassembly a bit clearer.

## Test Plan

```
ninja test_ck_tile_fmha

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16
bin/test_ck_tile_fmha_bwd_fp16
bin/test_ck_tile_fmha_bwd_bf16
```

## Test Result

All tests must pass (some tests may be skipped).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-21 01:15:57 +00:00
Illia Silin
1915cdfcc2 [rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762)

Reverts ROCm/rocm-libraries#4638
Unfortunately, this PR interfered with PR #4299 and caused build
errors for gfx11:

In file included from
/rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from
/rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:

/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21:
error: no matching function for call to 'CheckValidity'
  553 |                 if(!GridwiseGemm::CheckValidity(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-02-20 22:41:34 +00:00
Geo Min
6df27cfad6 [rocm-libraries] ROCm/rocm-libraries#4705 (commit 845bc39)
[ci] Adding composablekernel to TheRock CI

Workflow files under `projects/composablekernel/.github/workflows` do
not get picked up in GitHub workflows. This will allow composable kernel
changes to be built and tested properly.

CI tests will prove functionality
2026-02-20 19:19:47 +00:00
linqunAMD
29781f2ac4 [rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3 (#4638)

1. Add GetEstimateVgprCount to estimate the VGPR usage in
GridwiseGemm_wmma_cshuffle_v3
2. Add IsValidCompilationParameter to disable kernels that use too many
VGPRs.
- Currently, the threshold is AvailableVgprCount * 1.25
3. Modify examples so that tests are not disabled on gfx11

This is a port from the internal repo
PR[#192](https://github.com/ROCm/composable_kernel/issues/192)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-20 15:57:18 +00:00
Aviral Goel
7689090739 [rocm-libraries] ROCm/rocm-libraries#4556 (commit 15730e7)
fix: correct ULP calculation in get_absolute_threshold for BF16 tolerance (#4556)

## Motivation

BF16 grouped GEMM tests were failing on gfx1201 with errors like:
```
Error: Incorrect results! out[5457621] != ref[5457621]: -66 != -65.5
max err: 0.5, number of errors: 1
```

The calculated absolute tolerance (atol ~0.26) was too small to account
for legitimate hardware vs software BF16 conversion differences (0.5
ULP).

## Changes

1. **Discrete exponent calculation**: Changed from continuous `log2()`
to `floor(log2())` to match actual IEEE 754 floating-point exponent
levels
2. **Full ULP for output_error**: Changed from 0.5 to 1.0 ULP to account
for hardware `__bf16` vs software `float_to_bf16()` conversion
differences

## Calculation Example

For the failing case with value ~66:

**Before (incorrect):**
```
expo = log2(66) = 6.044...
atol = 2^(6.044 - 7) * 0.5 = 2^(-0.956) * 0.5 ≈ 0.26
Error 0.5 > 0.26 → Test fails 
```

**After (correct):**
```
discrete_expo = floor(log2(66)) = 6
atol = 2^(6 - 7) * 1.0 = 2^(-1) * 1.0 = 0.5
Error 0.5 ≤ 0.5 → Test passes ✓
```

The ULP for values in [64, 128) is 2^(-1) = 0.5, and the error of 0.5 is
exactly 1 ULP, which is the maximum expected difference between hardware
and software BF16 conversions at tie cases.

## Rationale

Hardware and software BF16 conversions can differ by up to 1 ULP at tie
cases due to different rounding strategies (hardware vs IEEE 754
round-to-nearest-even). The discrete exponent ensures ULP is calculated
correctly for all values within an exponent range.
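
The worked example above can be checked numerically. The sketch below mirrors only the two formulas shown in the calculation example; it is not the full `get_absolute_threshold` logic from `check_err.hpp`, and the function names and the mantissa-bit constant are assumptions made for illustration.

```python
import math

BF16_MANTISSA_BITS = 7  # matches the "- 7" term in the example above


def atol_before(value: float) -> float:
    """Old formula from the example: continuous exponent, 0.5 ULP."""
    expo = math.log2(abs(value))
    return 2.0 ** (expo - BF16_MANTISSA_BITS) * 0.5


def atol_after(value: float) -> float:
    """New formula from the example: discrete exponent, 1.0 ULP."""
    discrete_expo = math.floor(math.log2(abs(value)))
    return 2.0 ** (discrete_expo - BF16_MANTISSA_BITS) * 1.0


error = abs(-66.0 - (-65.5))  # 0.5, the reported mismatch
print(round(atol_before(66.0), 2), error <= atol_before(66.0))  # 0.26 False
print(atol_after(66.0), error <= atol_after(66.0))              # 0.5 True
```

With the discrete exponent, every value in [64, 128) gets the same tolerance of 0.5, whereas the continuous version yields a different tolerance for each value in that range.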

**Modified file**:
`projects/composablekernel/include/ck_tile/host/check_err.hpp`
2026-02-20 09:46:22 +00:00
Márton Bidlek
7b97e197ef [rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4

## Proposed changes

This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk
instance library using WMMA.

The implementation is based on the existing
DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level
structure, but replaces the XDL kernel with a WMMA-based one. It uses
the GridwiseGemm_wmma_cshuffle_v3 kernel.

At this stage, the focus is functional correctness and compatibility,
not performance tuning.

## Technical Details

- Device struct for grouped gemm fixed NK
- Example code for the WMMA version
- Unit tests for both new wmma implementation and the reference XDL code
(previously missing)
- Generic ck profiler interface with the purpose of calling unit tests.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [x] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-19 08:13:46 +00:00
Thrupti Raj Lakshmana Gowda
c5ce5eee5b [rocm-libraries] ROCm/rocm-libraries#4655 (commit f8d76d1)
Update CMakeLists.txt

## Motivation

Tile Engine is an internal benchmarking tool and need not be built
every time, since that adds to the build time. This PR excludes the
Stream-K operator in Tile Engine from the build.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-19 06:30:48 +00:00
Ville Pietilä
b2051812bc [rocm-libraries] ROCm/rocm-libraries#4652 (commit 39a5a53)
Revert "[CK] Add new fwd conv fp16/bf16 instances optimized
 for unit group size." (#4652)

PR ROCm/rocm-libraries#4275 contains CK fwd conv instances optimized for
`gfx950` and they do not compile for other architectures such as
`gfx940`. To ensure that the optimized instances are compiled only for
`gfx950`, compile-time guard `#if defined(CK_USE_GFX950)` was used. This
approach works correctly when we compile for a single architecture, but
when we compile simultaneously for multiple architectures, flag
`CK_USE_GFX950` is set for non-gfx950 archs as well. As a result, the
multi-arch compilation fails. The problem doesn't appear in the ROCm
libraries CI/CD pipeline since only one architecture is compiled at a
time. Hence, the CI/CD passed for the original PR.
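
One way to picture the failure mode: a macro set for the whole build is visible in every per-architecture compilation pass, whereas a compiler-defined target macro such as `__gfx950__` is set only in the pass for that target. The sketch below is a minimal illustration under that assumption. `DEMO_USE_GFX950` and both function names are hypothetical, and this is not presented as the fix adopted in the repository.

```cpp
// Build-wide flag (hypothetical name): in a multi-arch build the same
// definition is seen while compiling for every target.
#define DEMO_USE_GFX950 1

constexpr bool enabled_by_build_flag()
{
#if defined(DEMO_USE_GFX950)
    return true; // true in every compilation pass, gfx950 or not
#else
    return false;
#endif
}

constexpr bool enabled_by_target_macro()
{
#if defined(__gfx950__)
    return true; // only in the device pass that targets gfx950
#else
    return false;
#endif
}
```

In a host-only compile, or in a device pass for another architecture, `enabled_by_build_flag()` is still true while `enabled_by_target_macro()` is false, which mirrors the mismatch described above.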

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-19 00:03:19 +00:00
Tianxing Wu
0a2b6c4bcd [rocm-libraries] ROCm/rocm-libraries#4297 (commit 5ff580c)
moe flatmm xcd remap

co-authors: @Chi-Chu319 @juuso-oskari

Added XCD remapping for flatmm moe

| batch | Mixtral (tflops, wip_355) | Mixtral-7B (tflops, our branch) | perf boost |
| -- | -- | -- | -- |
| 64 | 865.424 | 995.455 | 15.0% |
| 256 | 886.336 | 1020.96 | 15.2% |
| 1024 | 890.808 | 1022.53 | 14.8% |
2026-02-18 19:33:24 +00:00
Thomas Ning
5cb8109535 [rocm-libraries] ROCm/rocm-libraries#4640 (commit 37b8c81)
Fix the Composable Kernel CI and versions incompatibility
 (#4640)

## Motivation

This PR has 4 patches:
1. Fix the CI error in grouped GEMM.
2. Fix the incompatibility with older Linux versions.
3. Fix potential errors in flatmm.
4. Address the previous review comments on the abquant eight-warp pipeline
solution.
2026-02-18 15:00:26 +00:00
John Shumway
1f6768472e [rocm-libraries] ROCm/rocm-libraries#4598 (commit 9ff8af1)
[CK_BUILDER] Fix two staging-compiler errors in CK builder
 code (#4598)

This PR fixes two compiler warnings that report as errors with the
latest compiler:

1. In tensor descriptor, the `operator[]` accessor needs a
`[[clang::lifetimebound]]` attribute.
2. In the unit tests for device buffer, there is a test that explicitly
checks for an error on a pointer that went out of scope, so it needs
to disable `-Wlifetime-safety-permissive` in the test code.
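
As a rough illustration of the first item, the sketch below annotates an `operator[]` accessor with `[[clang::lifetimebound]]`. `DemoExtents` and the `DEMO_LIFETIMEBOUND` macro are hypothetical stand-ins, not the tensor descriptor code from the PR; the macro only keeps the example portable to compilers that do not know the attribute.

```cpp
#include <array>
#include <cstddef>

// Expands to [[clang::lifetimebound]] where the compiler knows the attribute,
// and to nothing elsewhere (DEMO_LIFETIMEBOUND is a hypothetical macro name).
#if defined(__has_cpp_attribute)
#if __has_cpp_attribute(clang::lifetimebound)
#define DEMO_LIFETIMEBOUND [[clang::lifetimebound]]
#endif
#endif
#ifndef DEMO_LIFETIMEBOUND
#define DEMO_LIFETIMEBOUND
#endif

// Hypothetical stand-in for a descriptor type with an element accessor.
struct DemoExtents
{
    std::array<std::size_t, 3> values{};

    // Placed after the declarator, the attribute applies to the implicit
    // `this`: the returned reference must not outlive the object.
    std::size_t& operator[](std::size_t i) DEMO_LIFETIMEBOUND { return values[i]; }
    const std::size_t& operator[](std::size_t i) const DEMO_LIFETIMEBOUND
    {
        return values[i];
    }
};
```

With the annotation in place, Clang can diagnose code that keeps a reference obtained from a temporary `DemoExtents`, while ordinary use such as `e[1] = 7` is unaffected.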

I ran the CK `smoke-builder` tests with the staging compiler to verify.
2026-02-18 01:27:35 +00:00
Ville Pietilä
2b2a39be98 [rocm-libraries] ROCm/rocm-libraries#4275 (commit 2e07a39)
[CK] Add new fwd conv fp16/bf16 instances optimized for unit
 group size. (#4275)

## Proposed changes

Added new FP16/BF16 instances that are optimized for group size = 1. The
new instances use the compute-optimized block GEMM pipeline.

| CK prof command | Baseline (TFLOPs) | New V3 instances (TFLOPs) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1 1 1 1 1 | 858.818 | 962.293 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 256 256 3 3 100 100 1 1 1 1 1 1 1 1 | 979.987 | 1121.11 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 1 32 2376 256 3 3 50 50 1 1 1 1 1 1 1 1 | 945.951 | 1091.66 |
2026-02-18 00:59:15 +00:00
John Shumway
270b1445b1 [rocm-libraries] ROCm/rocm-libraries#4259 (commit 223d90c)
Add multi-file trace parsing and analysis pipeline

Extends build time analysis from ROCm/composable_kernel#3644 to handle
multiple trace files across build directories (see #4229):

- pipeline.py: Generic pipeline framework with fluent interface for
composable data processing. Provides parallel processing, progress
tracking, and error handling independent of trace-specific code.
Processes thousands of trace files at default resolution in minutes,
aggregating results into in-memory DataFrames for analysis.
- parse_build.py: Parse all trace files in a build directory
- build_analysis_example.ipynb: Demonstrates pipeline aggregation across
all build files

The pipeline design improves capability (composable operations),
performance (parallel processing), and user-friendliness (fluent API) of
the analysis modules. It enables analyzing compilation patterns across
the entire codebase with all trace data available in pandas DataFrames
for interactive exploration.
2026-02-17 21:14:11 +00:00
Aviral Goel
1bf66006c9 [rocm-libraries] ROCm/rocm-libraries#4272 (commit 52def72)
feat: add new optimized tutorial kernels

- Add 01_naive_gemm baseline implementation
- Add 02_padding_k_first with PADDING_K_FIRST + MFMA_32x32x16
- Add 03_mfma_16x16x16 with PADDING_K_FIRST + MFMA_16x16x16
- Share common reference_gemm.hpp in parent gemm/ directory

## Proposed changes

Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-17 20:42:13 +00:00
John Shumway
42973fd546 [rocm-libraries] ROCm/rocm-libraries#4593 (commit a4c2a37)
[CK_BUILDER] Move some smoke tests that require GPU

Since the CK builder is focused on compile time logic, let's keep the
`smoke-builder` target CPU-only so that it can be run without a GPU.
Alternatively, we could define a `smoke-cpu-builder` or some special
subtarget, but it's probably simpler to just stick to CPU for this. (My
thinking is that in general GPU testing will be heavier than the smoke
tests. Further, the GPU testing code will likely move outside of the
builder once builder code is moved out of experimental.)

This PR clarifies that CPU-only intention for `smoke-builder` and moves
some GPU testing code to `smoke-regression`.
2026-02-17 17:32:55 +00:00
Jan Patrick Lehr
9c2dd2941b [rocm-libraries] ROCm/rocm-libraries#4419 (commit e241f8b)
[CK] Work around staging compiler lifetime warning

## Motivation
The staging compiler enables lifetime-safety warnings and we already
worked around a few of them.
This works around a few more instances that came up recently on gfx950
builds.
The initial PR that resolved most issues:
https://github.com/ROCm/composable_kernel/pull/3640

## Technical Details
This follows the pattern to locally ignore the newly added
lifetime-safety warnings that were moved from experimental to production
in upstream LLVM.
As a result, CK turned them on and treats them as errors, which prevents
the staging compiler from building CK.
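
The "locally ignore" pattern can be sketched with Clang's diagnostic pragmas. This is a minimal example, assuming the warning group is the `-Wlifetime-safety-permissive` flag named in the CK builder fix (#4598); `demo_first_element` is a hypothetical function standing in for code that trips the diagnostic.

```cpp
#include <vector>

// Clang-only pragmas are guarded so other compilers see plain code.
#if defined(__clang__)
#pragma clang diagnostic push
// Older Clang versions do not know the flag; silence that complaint first.
#pragma clang diagnostic ignored "-Wunknown-warning-option"
#pragma clang diagnostic ignored "-Wlifetime-safety-permissive"
#endif

// Hypothetical function standing in for code that triggers the new warnings.
inline int demo_first_element(const std::vector<int>& v)
{
    const int& ref = v.front();
    return ref;
}

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```

The push/pop pair limits the suppression to the enclosed code, so the warning remains an error everywhere else in the translation unit.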

## Test Plan

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-12 22:12:57 +00:00
lalala-sh
dae352e8dc [rocm-libraries] ROCm/rocm-libraries#4282 (commit 2050f93)
add memsetasync for ck moe splitk

## Proposed changes

add memsetasync for ck moe splitk to fix

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-12 17:45:52 +00:00
Illia Silin
0f55bbae61 [rocm-libraries] ROCm/rocm-libraries#4514 (commit 5378ee0)
[CK] add check for THEROCK_SANITIZER in cmake

## Motivation

Check whether the THEROCK_SANITIZER flag is set to ASAN or HOST_ASAN.

## Technical Details

In case the THEROCK_SANITIZER flag is set to ASAN or HOST_ASAN and no
GPU_TARGETS is selected, the list of the default targets will be set to
"gfx908:xnack+;gfx90a:xnack+;gfx942:xnack+;gfx950:xnack+".

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-12 17:19:01 +00:00
Illia Silin
47c7c034e9 [rocm-libraries] ROCm/rocm-libraries#4525 (commit 7f34b22)
[CK] Fix the launch_tests script.

## Motivation

Fix the script that filters the tests.

## Technical Details

There were several places where the paths had to be updated for the
launch_tests script to work correctly.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-12 04:43:27 +00:00
Christopher Millette
e1e2f7ac2e [rocm-libraries] ROCm/rocm-libraries#4447 (commit 6d08a99)
[CK] Optimize multi-dimensional static for loop decomposition
 (#4447)

## Motivation
Recursive template implementations might initially seem attractive to
minimize necessary coding.

Unfortunately, this style often hurts readability and requires
significant resources from the compiler to generate instantiation
chains. In "high-traffic" code (e.g., used in many places + compilation
units), this generally does not scale well and can bloat the overall
compile times to unnecessary lengths.

The aim of this PR is to take some of the most high-traffic utility code and
try our best to eliminate recursive templates in favor of fold
expansions and constexpr function helpers.

In local tests with clang build analyzer,
device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp
showed high hit-rates on slow template instantiations in static_for,
dimensional static_for (static_ford), which are subsequently affected by
implementation of the Sequence class and associated transforms.

Example:
**** Templates that took longest to instantiate:
70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1... (372 times, avg 188 ms) // **70 seconds!**

The above is part of the implementation of static_for, which uses
Sequence classes.

## Technical Details

### Summary of Optimization Techniques

| Technique | Used In | Benefit |
|-----------|---------|---------|
| __Constexpr for-loop computation__ | sequence_reverse_inclusive_scan, sequence_map_inverse | Moves O(N) work from template instantiation to constexpr evaluation |
| __Pack expansion with indexing__ | sequence_reverse, Sequence::Modify | Single template instantiation instead of recursive |
| __Flat iteration + decomposition__ | ford, static_ford | O(1) template depth instead of O(N^D) |
| __Pre-computed strides__ | index_decomposer | Enables O(1) linear-to-multi-index conversion |
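
The "flat iteration + decomposition" and "pre-computed strides" rows can be sketched as follows. This is a simplified C++17 illustration with hypothetical names (`demo_make_strides`, `demo_decompose`, `demo_ford`); it is not CK's `index_decomposer` or `static_ford`, and it ignores the `Orders` permutation that the real code supports.

```cpp
#include <array>
#include <cstddef>

// Row-major strides computed with an ordinary constexpr loop instead of
// recursive template instantiation.
template <std::size_t N>
constexpr std::array<std::size_t, N> demo_make_strides(const std::array<std::size_t, N>& lengths)
{
    std::array<std::size_t, N> strides{};
    std::size_t running = 1;
    for(std::size_t i = N; i-- > 0;)
    {
        strides[i] = running;
        running *= lengths[i];
    }
    return strides;
}

// Linear index -> multi-index using the pre-computed strides.
template <std::size_t N>
constexpr std::array<std::size_t, N> demo_decompose(std::size_t linear,
                                                    const std::array<std::size_t, N>& lengths,
                                                    const std::array<std::size_t, N>& strides)
{
    std::array<std::size_t, N> idx{};
    for(std::size_t d = 0; d < N; ++d)
    {
        idx[d] = (linear / strides[d]) % lengths[d];
    }
    return idx;
}

// One flat loop over all points replaces D nested loops.
template <std::size_t N, typename F>
constexpr void demo_ford(const std::array<std::size_t, N>& lengths, F&& f)
{
    const auto strides = demo_make_strides(lengths);
    std::size_t total  = 1;
    for(std::size_t d = 0; d < N; ++d)
    {
        total *= lengths[d];
    }
    for(std::size_t linear = 0; linear < total; ++linear)
    {
        f(demo_decompose(linear, lengths, strides));
    }
}

// Compile-time check: in a 2x3 space, linear index 5 is (1, 2).
constexpr std::array<std::size_t, 2> demo_lengths{2, 3};
static_assert(demo_decompose(5, demo_lengths, demo_make_strides(demo_lengths))[0] == 1, "");
static_assert(demo_decompose(5, demo_lengths, demo_make_strides(demo_lengths))[1] == 2, "");
```

The design point is that the number of template instantiations no longer grows with the size of the iteration space: the per-point work happens in constexpr loops, which the compiler evaluates without building an instantiation chain.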

### Impact on Compile Time

These optimizations reduce template instantiation depth from O(N) or
O(N^D) to O(1), which:

1. Reduces compiler memory usage
2. Reduces compile time exponentially for deep instantiation chains
3. Enables larger iteration spaces without hitting template depth limits

## Test Plan

* Existing tests for Sequence are re-used to affirm correctness
* Unit tests for ford and static_ford are added (dimensional looping)
* 8 new regression tests specifically verify the fixes for the PR
feedback:

  - `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford
  - `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford
  - `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford
    consistency
  - `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford
  - `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford
  - `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford
  - `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford
  - `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with
    non-trivial ordering

## Test Result
### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized)

#### Commits in Range (8 commits)

1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and
sequence_reverse
2. `7a7e3fdeef` - Optimize sequence_map_inverse
3. `92855c9913` - Optimize ford and static_ford calls to eliminate
nested template recursion
4. `88a564032b` - Add unit tests for ford and static_ford
5. `1a0fb22217` - Fix clang-format
6. `8a0d26bddf` - Increase template recursion depth to 1024
7. `dc53bb6e20` - Address copilot feedback and add regression tests
8. `477e06861d` - Increase bracket depth to 1024

#### Build Timing Results

| File | Base (8b72bc8759d9) | HEAD (a0438bd398) | Improvement |
|------|------|------|-------------|
| grouped_conv2d_fwd (f16) -j1 | 313.31s | 272.93s | __12.9% faster__ |
| grouped_conv1d_fwd (bf16) -j1 | 79.33s | 68.61s | __13.5% faster__ |
| grouped_conv1d_bwd_weight (f16) -j1 | 15.77s | 14.31s | __9.2% faster__ |
| device_grouped_conv2d_fwd_instance -j64 | s | s | __% faster__ |

#### Key Optimizations

1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1)
template depth
2. __sequence_map_inverse__: O(N) → O(1) template depth
3. __ford/static_ford__: O(N^D) → O(1) template depth using flat
iteration with index decomposition
4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial
orderings

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-11 22:13:15 +00:00
Bartłomiej Kocot
ea4942cd02 [rocm-libraries] ROCm/rocm-libraries#4506 (commit d9ccef7)
Revert "[CK Conv] Add bwd weight instance for large-k shape"
 (#4506)

Reverts ROCm/rocm-libraries#4266 due to CI failures. Should be
investigated by @johannes-graner
2026-02-11 21:37:50 +00:00
Christopher Millette
04eddbc5ce [rocm-libraries] ROCm/rocm-libraries#4471 (commit 10fa702)
[CK] Optimize vector type build times

**Supersedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI
issues on import**

## Proposed changes

Build times can be affected by many different things and are largely
determined by the way we write and use the code. Two critical areas of
the build are **frontend parsing** and **backend codegen and
compilation**.

### Frontend Parsing
The length of the code, the include header tree and macro expansions all
affect the front-end parsing time.
This PR seeks to reduce the parsing time of the dtype_vector.hpp
vector_type class by reducing redundant code by generalization.
* Partial specializations of vector_type for native and non-native
datatypes have been generalized to one single class, consolidating all
of the data initialization and AsType casting requirements into one
place.
* The nnvb_data_t_selector class (i.e., the non-native vector base dataT
selector) has been removed and replaced with scalar_type
instantiations, as they serve the same purpose: scalar_type already
maps generalized datatypes to native types compatible with
ext_vector_t.

### Backend Codegen
Template instantiation behavior can also affect build times. Recursive
instantiations are very slow versus concrete instantiations. The
compiler must make multiple passes to expand template instantiations so
we need to be careful about how they are used.
* Previous vector_type classes declared a union storage class, which
aliases StaticallyIndexedArray<T,N>.
```
template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
    using d1_t = T;
    typedef T d2_t __attribute__((ext_vector_type(2)));
    typedef T d4_t __attribute__((ext_vector_type(4)));

    using type = d4_t;

    union
    {
        d4_t d4_;
        StaticallyIndexedArray<d1_t, 4> d1x4_;
        StaticallyIndexedArray<d2_t, 2> d2x2_;
        StaticallyIndexedArray<d4_t, 1> d4x1_;
    } data_;
   ...
};
```
* Upon further inspection, StaticallyIndexedArray is built on top of a
recursive Tuple concatenation.
```
template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
    using type =
        typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
                              typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};
```
This union storage has been removed from the vector_type storage class.

* Further references to StaticallyIndexedArray have been replaced with
StaticallyIndexedArray_v2, which is a concrete implementation using
C-style arrays.
```
template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
    ...

    T data_[N];
};
```

### Fixes
* Using the bool datatype with vector_type was previously error-prone. Bool,
as a native datatype, would be stored as bool ext_vector_type(N), which is
a packed datatype. For example, sizeof(bool ext_vector_type(4)) == 1, which
does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4.
The union of these datatypes slices the data incorrectly: the bit locations
of the packed bools do not match the StaticallyIndexedArray members. As such,
vector_type now uses C-style array storage for the bool type instead of
ext_vector_type.
```
template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));

// sizeof(NativeVectorT<bool, 4>) == 1  (1 byte per 4 bool - packed)
//   element0 = bit 0 of byte 0
//   element1 = bit 1 of byte 0
//   element2 = bit 2 of byte 0
//   element3 = bit 3 of byte 0

// sizeof(StaticallyIndexedArray<NativeVectorT<bool, 1>, 4>) == 4  (1 byte per bool)
//   element0 = bit 0 of byte 0
//   element1 = bit 0 of byte 1
//   element2 = bit 0 of byte 2
//   element3 = bit 0 of byte 3

union
{
    NativeVectorT<bool, 4> d1_t;
    ...
    StaticallyIndexedArray<NativeVectorT<bool, 1>, 4> d4x1;
};

// union size == 4 which means invalid slicing!
```
* Math utilities such as next_power_of_two now account for the invalid cases
of X < 2
* Remove redundant implementation of next_pow2

### Additions
* integer_log2_floor to math.hpp
* is_power_of_two_integer to math.hpp
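
For orientation, helpers of this kind are often written along the following lines. The sketch uses hypothetical `demo_` names and is not the code added to math.hpp. In particular, treating `next_power_of_two` as "smallest power of two greater than or equal to X" and returning 1 for X < 2 are assumptions made only for this example.

```cpp
#include <cstdint>

constexpr bool demo_is_power_of_two_integer(std::uint32_t x)
{
    // A power of two has exactly one bit set; zero is excluded explicitly.
    return x != 0 && (x & (x - 1)) == 0;
}

// Precondition: x > 0.
constexpr std::uint32_t demo_integer_log2_floor(std::uint32_t x)
{
    std::uint32_t result = 0;
    while(x > 1)
    {
        x >>= 1;
        ++result;
    }
    return result;
}

// Smallest power of two >= x, valid for x <= 2^31.
// Assumed convention for the X < 2 case: return 1.
constexpr std::uint32_t demo_next_power_of_two(std::uint32_t x)
{
    if(x < 2)
    {
        return 1;
    }
    return demo_is_power_of_two_integer(x)
               ? x
               : (std::uint32_t{1} << (demo_integer_log2_floor(x) + 1));
}

static_assert(demo_is_power_of_two_integer(64) && !demo_is_power_of_two_integer(66), "");
static_assert(demo_integer_log2_floor(66) == 6, "");
static_assert(demo_next_power_of_two(0) == 1 && demo_next_power_of_two(66) == 128, "");
```

Because all three functions are `constexpr`, they can be checked with `static_assert` and used in template arguments without any recursive template machinery.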

### Build Time Analysis

Machine:  banff-cyxtera-s78-2
Target: gfx942

| Build Target | Threads | Frontend Parse Time (s) | Backend Codegen Time (s) | TotalTime (s) | commitId |
|---------------|---------|-------------------------|--------------------------|---------------|----------|
2026-02-11 19:01:05 +00:00
Bartłomiej Kocot
2dd2f114b3 [rocm-libraries] ROCm/rocm-libraries#4407 (commit adde219)
[CK][CK TILE] Add has hot loop check for pipeline v1

## Motivation

Add has hot loop check for pipeline v1 (v1 basic and v1 basic async).
Enable more tests which have been fixed by this change.

## Technical Details

Previously, the hot loop was executed without checking the number of loops.

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-651
AICK-663
2026-02-11 13:43:01 +00:00
Johannes Graner
e88f139c6c [rocm-libraries] ROCm/rocm-libraries#4271 (commit 6fce58e)
[Conv] Add NumGroupsToMerge to BwdWeight type string

## Proposed changes

Add parameter to bwd weight V3 type string showing the number of groups
to merge. This is required for MIOpen to be properly tuned since it uses
type strings for performance database entries.

In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled.
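
A backward-compatible suffix of this kind can be sketched as below. The function name, the base string, and the exact suffix spelling are hypothetical, and "merging enabled" is assumed to mean a value greater than 1; the real type string format is defined by the device op and is not reproduced here.

```cpp
#include <string>

// Hypothetical helper; illustrates the compatibility idea only.
inline std::string demo_type_string(const std::string& base, int num_groups_to_merge)
{
    std::string result = base;
    // Append only when merging is enabled, so that strings for existing
    // (non-merging) instances stay byte-for-byte identical.
    if(num_groups_to_merge > 1)
    {
        result += ", NumGroupsToMerge: " + std::to_string(num_groups_to_merge);
    }
    return result;
}
```

Keeping the default case unchanged means that keys already stored in a performance database continue to match, while merged-group instances get distinct entries.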

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-11 09:08:38 +00:00
Cong Ma
d06f35027a [rocm-libraries] ROCm/rocm-libraries#4354 (commit d41f08a)
[CK TILE] fix numerical errors of preshuffle_b

This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.

# The numerical issue

## Steps to reproduce
```bash
Run
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```

# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.

**Key changes include:**

### Data layout and shuffling logic

* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)

### Test infrastructure enhancements

* Unit tests did not catch this issue since there were no tests for fp8.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes and updated tests to run with
these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-11 07:05:46 +00:00
Thomas Ning
807efa703a [rocm-libraries] ROCm/rocm-libraries#4274 (commit 7c380df)
Add padding to cshuffle epilogue to avoid bank conflict
 (#4274)

## Proposed changes

Added the padding to CShuffle Epilogue to avoid the bank conflicts of
64. Synced up and learned from the internal repo.
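
The general idea behind padding a shared-memory tile can be shown with a toy bank model. The sketch assumes that "64" refers to a bank count, which the description does not spell out, and that word address `a` maps to bank `a % num_banks`; the function name is hypothetical and the real epilogue layout may differ.

```cpp
#include <cstddef>
#include <set>

// Hypothetical model: `num_banks` banks, one word wide each.
// Returns how many distinct banks are hit when `rows` threads each read
// column 0 of their own row. Fewer distinct banks means more conflicts.
inline std::size_t demo_banks_touched_by_column(std::size_t rows,
                                                std::size_t row_stride_words,
                                                std::size_t num_banks)
{
    std::set<std::size_t> banks;
    for(std::size_t r = 0; r < rows; ++r)
    {
        banks.insert((r * row_stride_words) % num_banks);
    }
    return banks.size();
}
```

In this model, a row stride of 64 words with 64 banks sends all 64 rows to the same bank, while padding each row by one word (stride 65) spreads them across all 64 banks.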

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-11 05:52:42 +00:00
Bartłomiej Kocot
6d6ee8f023 [rocm-libraries] ROCm/rocm-libraries#4457 (commit 258a459)
[CK][CK Tile] Temporary disable grouped conv fwd tile comp
 async instances  (#4457)

## Motivation

[CK][CK Tile] Temporarily disable grouped conv fwd tile comp async
instances due to failures.

## Technical Details

Disable the configs so that these instances are not compiled.

## Test Plan

test_grouped_convnd_fwd_Tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-11 01:52:59 +00:00
Joseph Macaranas
9c94c2294a [rocm-libraries] ROCm/rocm-libraries#4460 (commit ba5ef82)
[Azure External CI] Disable Azure CI on rocm-libraries
 (#4460)

- Deleting all pipeline trigger files tied to Azure External CI from
top-level and project-level.
2026-02-10 23:11:31 +00:00
John Shumway
1af75d290e [rocm-libraries] ROCm/rocm-libraries#4277 (commit 4348901)
Add a README.md file to ck/library/util

I'm collecting information about our current testing (#3664). As part of
this work I added a README to the directory to emphasize the GPU-first testing
strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does
not need ROCm documentation review.
2026-02-10 21:27:27 +00:00
Randy Spaulding
d546ec0a53 [rocm-libraries] ROCm/rocm-libraries#4269 (commit 209f62f)
Adapt parser to monorepo

## Proposed changes

Addressing issues found trying to run the dependency parser on MIOpen:
- Ninja is recording the full path, e.g.: [json]
```
  "file_to_executables": {
     "/home/rspauldi/repos/rocm-libraries/projects/miopen/include/miopen/miopen.h": [
```
- Running git in monorepo reports the full _relative_ path, e.g.:
```
    "projects/miopen/include/miopen/miopen.h"
```
Of course, `git diff` also returns all files modified in every other
project's commits. These are filtered out as early as possible.

This solution searches for `rocm-libraries` in the `parsing` step, and
if found extracts the project name and stores it in
`enhanced_dependency_mapping.json`. Leading folders are truncated from
each file path, up to and including the project name. This allows
`_is_project_file` to remain unchanged.

The `selection` step then retrieves the project name from the json if it
is defined, and truncates the project folder from the `git diff` output
so the filenames exactly match the json entries.
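
The two normalization steps can be sketched as plain string handling. The parser itself is a Python tool; C++ is used here only to illustrate the logic. The function names are hypothetical, and the sketch assumes the `rocm-libraries/projects/<project>/` layout visible in the example paths above.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Parsing step: returns {project, path relative to the project}. The project
// is empty when the marker is not found, and the path is returned unchanged.
inline std::pair<std::string, std::string> demo_split_monorepo_path(const std::string& path)
{
    const std::string marker = "rocm-libraries/projects/";
    const std::size_t pos    = path.find(marker);
    if(pos == std::string::npos)
    {
        return {"", path};
    }
    const std::size_t project_begin = pos + marker.size();
    const std::size_t project_end   = path.find('/', project_begin);
    if(project_end == std::string::npos)
    {
        return {"", path};
    }
    return {path.substr(project_begin, project_end - project_begin),
            path.substr(project_end + 1)};
}

// Selection step: strips "projects/<project>/" from a `git diff` path. Paths
// from other projects yield an empty string so that they can be filtered out.
inline std::string demo_strip_project_prefix(const std::string& git_path,
                                             const std::string& project)
{
    const std::string prefix = "projects/" + project + "/";
    if(git_path.compare(0, prefix.size(), prefix) != 0)
    {
        return "";
    }
    return git_path.substr(prefix.size());
}
```

Applied to the two example paths above, both steps produce `include/miopen/miopen.h`, so the `git diff` output and the stored mapping keys line up.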

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

Successfully runs on rocm-libraries MIOpen PRs and produces a list of
tests. I haven't verified the results yet.

This version is not yet applicable to CI: it operates at a per-executable
level, whereas MIOpen CI uses a single gtest binary. I'll be working
towards that in future PRs over the next few weeks.

```
/home/rspauldi/repos/rocm-libraries/projects/miopen# git checkout miopen/sgundabo_enable_ck_bwd_wrw_navi
<run CMake with TEST_DISCRETE=ON>
# ninja tests

# root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py parse build/build.ninja
Parsing ninja dependencies from: build/build.ninja
Parsing ninja build file...
Found 312 executables
Found 820 object-to-source mappings
Found 820 object files
Extracting detailed dependencies for all object files...
Processed 100/820 object files...
Processed 200/820 object files...
Processed 300/820 object files...
Processed 400/820 object files...
Processed 500/820 object files...
Processed 600/820 object files...
Processed 700/820 object files...
Processed 800/820 object files...
Completed dependency extraction for 820 object files
Building file-to-executable mapping...
Found rocm-libraries project: 'miopen'
Built mapping for 608 files
Files used by multiple executables: 216
Sample files with multiple dependencies:
  build/include/miopen/config.h: 306 executables
  build/include/miopen/export.h: 306 executables
  build/include/miopen/export_internals.h: 304 executables
  driver/InputFlags.hpp: 2 executables
  driver/driver.hpp: 2 executables

=== Enhanced Dependency Mapping Summary ===
Total executables: 312
Total files mapped: 608
Total object files processed: 820

File types:
  .cpp files: 310
  .hpp files: 292
  .h files: 6

Files used by multiple executables: 216

Top files with most dependencies:
  build/include/miopen/config.h: 306 executables
  build/include/miopen/export.h: 306 executables
  include/miopen/miopen.h: 304 executables
  src/include/miopen/config.hpp: 304 executables
  build/include/miopen/export_internals.h: 304 executables
  src/include/miopen/rank.hpp: 303 executables
  src/include/miopen/errors.hpp: 302 executables
  src/include/miopen/object.hpp: 302 executables
  src/include/miopen/returns.hpp: 302 executables
  src/include/miopen/sysinfo_utils.hpp: 302 executables
Exporting mapping to build/enhanced_file_executable_mapping.csv
Exporting complete mapping to build/enhanced_dependency_mapping.json

Results exported to:
  CSV: build/enhanced_file_executable_mapping.csv
  JSON: build/enhanced_dependency_mapping.json

root@rjs1:/home/rspauldi/repos/rocm-libraries/projects/miopen# python3 /dep/main.py select  build/enhanced_dependency_mapping.json  1b13d8b72d54e34bdc7ae70dd2b6e809dca8b10e  09e5965d55ebbfacfd1ed18e5092580c2ffae748
Identified 30 files modified in project 'miopen'
Exported 304 tests to run to tests_to_run.json
```

I don't know if clang-format applies to scripts. If so, could someone
show me how to run it in CK?
2026-02-10 18:38:21 +00:00
Johannes Graner
40cec769ce [rocm-libraries] ROCm/rocm-libraries#4266 (commit 1d8094d)
[CK Conv] Add bwd weight instance for large-k shape

## Proposed changes

This instance reduces the time for the shape used in `./bin/ckProfiler
grouped_conv_bwd_weight 1 2 0 2 0 1 2 1 32 2376 256 3 3 100 100 1 1 1 1
1 1 1 1 all` from 10.3 ms to 6.6 ms.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-10 16:58:04 +00:00
Erwin Terpstra
b41bfece83 [rocm-libraries] ROCm/rocm-libraries#4268 (commit d2fca53)
[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant
 pipeline (#4268)

## Proposed changes

Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged
2026-02-10 13:59:03 +00:00
Yi DING
d5acfd8d52 [rocm-libraries] ROCm/rocm-libraries#4451 (commit 091bf0f)
[CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation

## Motivation
This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers
to compile across multiple GPU architectures by introducing compile-time
availability gating and a new attribute tag mechanism for kernel
symbol/attribute specialization.

## Technical Details
- Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline
and propagate availability handling into `QuantGemmKernel`.
- Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling
per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and
unique symbols.
- Update the blockscale GEMM quant example to pass kernel attributes and
adjust gfx950 gating.

## Test Plan
- CI
- Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950'
-GNinja && ninja tile_example_gemm_quant`
- Local test with ROCm/aiter#1954

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-10 12:42:19 +00:00
dependabot[bot]
6a6cd05dbb [rocm-libraries] ROCm/rocm-libraries#3090 (commit 728d3a3)
Bump fonttools from 4.57.0 to 4.61.0 in
 /projects/composablekernel/docs/sphinx (#3090)

Bumps [fonttools](https://github.com/fonttools/fonttools) from 4.57.0 to
4.61.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/fonttools/fonttools/releases">fonttools's
releases</a>.</em></p>
<blockquote>
<h2>4.61.0</h2>
<ul>
<li>[varLib.main]: <strong>SECURITY</strong> Only use
basename(vf.filename) to prevent path traversal attacks when running
<code>fonttools varLib</code> command-line script, or code which invokes
<code>fonttools.varLib.main()</code>. Fixes CVE-2025-66034, see: <a
href="https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv">https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv</a>.</li>
<li>[feaLib] Sort BaseLangSysRecords by tag (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3986">#3986</a>).</li>
<li>Drop support for EOL Python 3.9 (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3982">#3982</a>).</li>
<li>[instancer] Support --remove-overlaps for fonts with CFF2 table (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3975">#3975</a>).</li>
<li>[CFF2ToCFF] Add --remove-overlaps option (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3976">#3976</a>).</li>
<li>[feaLib] Raise an error for rsub with NULL target (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3979">#3979</a>).</li>
<li>[bezierTools] Fix logic bug in curveCurveIntersections (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3963">#3963</a>).</li>
<li>[feaLib] Error when condition sets have the same name (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3958">#3958</a>).</li>
<li>[cu2qu.ufo] skip processing empty glyphs to support sparse kerning
masters (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3956">#3956</a>).</li>
<li>[unicodedata] Update to Unicode 17. Require <code>unicodedata2 &gt;=
17.0.0</code> when installed with 'unicode' extra.</li>
</ul>
<h2>4.60.1</h2>
<ul>
<li>[ufoLib] Reverted accidental method name change in
<code>UFOReader.getKerningGroupConversionRenameMaps</code>
that broke compatibility with downstream projects like defcon (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3948">#3948</a>,
<a
href="https://redirect.github.com/fonttools/fonttools/issues/3947">#3947</a>,
<a
href="https://redirect.github.com/robotools/defcon/issues/478">robotools/defcon#478</a>).</li>
<li>[ufoLib] Added test coverage for
<code>getKerningGroupConversionRenameMaps</code> method (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3950">#3950</a>).</li>
<li>[subset] Don't try to subset BASE table; pass it through by default
instead (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3949">#3949</a>).</li>
<li>[subset] Remove empty BaseRecord entries in MarkBasePos lookups (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3897">#3897</a>,
<a
href="https://redirect.github.com/fonttools/fonttools/issues/3892">#3892</a>).</li>
<li>[subset] Add pruning for MarkLigPos and MarkMarkPos lookups (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3946">#3946</a>).</li>
<li>[subset] Remove duplicate features when subsetting (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3945">#3945</a>).</li>
<li>[Docs] Added documentation for the visitor module (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3944">#3944</a>).</li>
</ul>
<h2>4.60.0</h2>
<ul>
<li>
<p>[pointPen] Allow <code>reverseFlipped</code> parameter of
<code>DecomposingPointPen</code> to take a <code>ReverseFlipped</code>
enum value to control whether/how to reverse contour direction of
flipped components, in addition to the existing True/False. This allows
to set <code>ReverseFlipped.ON_CURVE_FIRST</code> to ensure that the
decomposed outline starts with an on-curve point before being reversed,
for better consistency with other segment-oriented contour
transformations. The change is backward compatible, and the default
behavior hasn't changed (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</p>
</li>
<li>
<p>[filterPen] Added <code>ContourFilterPointPen</code>, base pen for
buffered contour operations, and <code>OnCurveStartPointPen</code>
filter to ensure contours start with an on-curve point (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</p>
</li>
<li>
<p>[cu2qu] Fixed difference in cython vs pure-python complex division by
real number (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3930">#3930</a>).</p>
</li>
<li>
<p>[varLib.avar] Refactored and added some new sub-modules and scripts
(<a
href="https://redirect.github.com/fonttools/fonttools/issues/3926">#3926</a>).</p>
<ul>
<li><code>varLib.avar.build</code> module to build avar (and a missing
fvar) binaries into a possibly empty TTFont,</li>
<li><code>varLib.avar.unbuild</code> module to print a .designspace
snippet that would generate the same avar binary,</li>
<li><code>varLib.avar.map</code> module to take TTFont and do the
mapping, in user/normalized space,</li>
<li><code>varLib.avar.plan</code> module moved from
<code>varLib.avarPlanner</code>.</li>
</ul>
<p>The bare <code>fonttools varLib.avar</code> script is deprecated, in
favour of <code>fonttools varLib.avar.build</code> (or
<code>unbuild</code>).</p>
</li>
<li>
<p>[interpolatable] Clarify <code>linear_sum_assignment</code> backend
options and minimal dependency usage (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3927">#3927</a>).</p>
</li>
<li>
<p>[post] Speed up <code>build_psNameMapping</code> (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3923">#3923</a>).</p>
</li>
<li>
<p>[ufoLib] Added typing annotations to fontTools.ufoLib (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3875">#3875</a>).</p>
</li>
</ul>
<h2>4.59.2</h2>
<ul>
<li>[varLib] Clear <code>USE_MY_METRICS</code> component flags when
inconsistent across masters (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3912">#3912</a>).</li>
<li>[varLib.instancer] Avoid negative advance width/height values when
instantiating HVAR/VVAR (unlikely in well-behaved fonts) (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3918">#3918</a>).</li>
<li>[subset] Fix shaping behaviour when pruning empty mark sets (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3915">#3915</a>,
<a
href="https://redirect.github.com/harfbuzz/harfbuzz/issues/5499">harfbuzz/harfbuzz#5499</a>).</li>
<li>[cu2qu] Fixed <code>dot()</code> product of perpendicular vectors
not always returning exactly 0.0 in all Python implementations (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3911">#3911</a>)</li>
<li>[varLib.instancer] Implemented fully-instantiating
<code>avar2</code> fonts (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3909">#3909</a>).</li>
<li>[feaLib] Allow float values in <code>VariableScalar</code>'s axis
locations (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3906">#3906</a>,
<a
href="https://redirect.github.com/fonttools/fonttools/issues/3907">#3907</a>).</li>
<li>[cu2qu] Handle special case in <code>calc_intersect</code> for
degenerate cubic curves where 3 to 4 control points are equal (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3904">#3904</a>).</li>
</ul>
<h2>4.59.1</h2>
<ul>
<li>[featureVars] Update OS/2.usMaxContext if possible after
addFeatureVariationsRaw (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3894">#3894</a>).</li>
<li>[vhmtx] raise TTLibError('not enough data...') when hmtx/vmtx are
truncated (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3843">#3843</a>,
<a
href="https://redirect.github.com/fonttools/fonttools/issues/3901">#3901</a>).</li>
<li>[feaLib] Combine duplicate features that have the same set of
lookups regardless of the order in which those lookups are added to the
feature (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3895">#3895</a>).</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/fonttools/fonttools/blob/main/NEWS.rst">fonttools's
changelog</a>.</em></p>
<blockquote>
<h2>4.61.0 (released 2025-11-28)</h2>
<ul>
<li>[varLib.main]: <strong>SECURITY</strong> Only use
basename(vf.filename) to prevent path traversal attacks when
running <code>fonttools varLib</code> command, or code which invokes
<code>fonttools.varLib.main()</code>.
Fixes CVE-2025-66034, see:
<a
href="https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv">https://github.com/fonttools/fonttools/security/advisories/GHSA-768j-98cg-p3fv</a>.</li>
<li>[feaLib] Sort BaseLangSysRecords by tag (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3986">#3986</a>).</li>
<li>Drop support for EOL Python 3.9 (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3982">#3982</a>).</li>
<li>[instancer] Support --remove-overlaps for fonts with CFF2 table (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3975">#3975</a>).</li>
<li>[CFF2ToCFF] Add --remove-overlaps option (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3976">#3976</a>).</li>
<li>[feaLib] Raise an error for rsub with NULL target (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3979">#3979</a>).</li>
<li>[bezierTools] Fix logic bug in curveCurveIntersections (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3963">#3963</a>).</li>
<li>[feaLib] Error when condition sets have the same name (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3958">#3958</a>).</li>
<li>[cu2qu.ufo] skip processing empty glyphs to support sparse kerning
masters (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3956">#3956</a>).</li>
<li>[unicodedata] Update to Unicode 17. Require <code>unicodedata2 &gt;=
17.0.0</code> when installed with 'unicode' extra.</li>
</ul>
<h2>4.60.1 (released 2025-09-29)</h2>
<ul>
<li>[ufoLib] Reverted accidental method name change in
<code>UFOReader.getKerningGroupConversionRenameMaps</code>
that broke compatibility with downstream projects like defcon (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3948">#3948</a>,
<a
href="https://redirect.github.com/fonttools/fonttools/issues/3947">#3947</a>,
<a
href="https://redirect.github.com/robotools/defcon/issues/478">robotools/defcon#478</a>).</li>
<li>[ufoLib] Added test coverage for
<code>getKerningGroupConversionRenameMaps</code> method (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3950">#3950</a>).</li>
<li>[subset] Don't try to subset BASE table; pass it through by default
instead (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3949">#3949</a>).</li>
<li>[subset] Remove empty BaseRecord entries in MarkBasePos lookups (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3897">#3897</a>,
<a
href="https://redirect.github.com/fonttools/fonttools/issues/3892">#3892</a>).</li>
<li>[subset] Add pruning for MarkLigPos and MarkMarkPos lookups (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3946">#3946</a>).</li>
<li>[subset] Remove duplicate features when subsetting (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3945">#3945</a>).</li>
<li>[Docs] Added documentation for the visitor module (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3944">#3944</a>).</li>
</ul>
<h2>4.60.0 (released 2025-09-17)</h2>
<ul>
<li>[pointPen] Allow <code>reverseFlipped</code> parameter of
<code>DecomposingPointPen</code> to take a <code>ReverseFlipped</code>
enum value to control whether/how to reverse contour direction of
flipped components, in addition to
the existing True/False. This allows to set
<code>ReverseFlipped.ON_CURVE_FIRST</code> to ensure that
the decomposed outline starts with an on-curve point before being
reversed, for better consistency
with other segment-oriented contour transformations. The change is
backward compatible, and the
default behavior hasn't changed (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</li>
<li>[filterPen] Added <code>ContourFilterPointPen</code>, base pen for
buffered contour operations, and
<code>OnCurveStartPointPen</code> filter to ensure contours start with
an on-curve point (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3934">#3934</a>).</li>
<li>[cu2qu] Fixed difference in cython vs pure-python complex division
by real number (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3930">#3930</a>).</li>
<li>[varLib.avar] Refactored and added some new sub-modules and scripts
(<a
href="https://redirect.github.com/fonttools/fonttools/issues/3926">#3926</a>).
<ul>
<li><code>varLib.avar.build</code> module to build avar (and a missing
fvar) binaries into a possibly empty TTFont,</li>
<li><code>varLib.avar.unbuild</code> module to print a .designspace
snippet that would generate the same avar binary,</li>
<li><code>varLib.avar.map</code> module to take TTFont and do the
mapping, in user/normalized space,</li>
<li><code>varLib.avar.plan</code> module moved from
<code>varLib.avarPlanner</code>.
The bare <code>fonttools varLib.avar</code> script is deprecated, in
favour of <code>fonttools varLib.avar.build</code> (or
<code>unbuild</code>).</li>
</ul>
</li>
<li>[interpolatable] Clarify <code>linear_sum_assignment</code> backend
options and minimal dependency
usage (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3927">#3927</a>).</li>
<li>[post] Speed up <code>build_psNameMapping</code> (<a
href="https://redirect.github.com/fonttools/fonttools/issues/3923">#3923</a>).</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="e691e3bef9"><code>e691e3b</code></a>
Release 4.61.0</li>
<li><a
href="c2d540f4ad"><code>c2d540f</code></a>
Update NEWS.rst</li>
<li><a
href="3859753a05"><code>3859753</code></a>
Update NEWS.rst</li>
<li><a
href="26eb070a55"><code>26eb070</code></a>
black</li>
<li><a
href="5ff73af326"><code>5ff73af</code></a>
Merge commit from fork</li>
<li><a
href="a696d5ba93"><code>a696d5b</code></a>
varLib: only use the basename(vf.filename)</li>
<li><a
href="b00bc459ef"><code>b00bc45</code></a>
varLib_test: test path traversal in variable-font filename</li>
<li><a
href="066512e4f3"><code>066512e</code></a>
Merge pull request <a
href="https://redirect.github.com/fonttools/fonttools/issues/3986">#3986</a>
from cmyr/base-minmax-sorting</li>
<li><a
href="ce78973e97"><code>ce78973</code></a>
[feaLib] Sort BasLangSysRecords by tag</li>
<li><a
href="5bb37dc201"><code>5bb37dc</code></a>
Merge pull request <a
href="https://redirect.github.com/fonttools/fonttools/issues/3983">#3983</a>
from fonttools/dependabot/pip/brotli-1.2.0</li>
<li>Additional commits viewable in <a
href="https://github.com/fonttools/fonttools/compare/4.57.0...4.61.0">compare
view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=fonttools&package-manager=pip&previous-version=4.57.0&new-version=4.61.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

2026-02-10 07:08:05 +00:00