Commit Graph

3121 Commits

Author SHA1 Message Date
John Shumway
a7b894544e [rocm-libraries] ROCm/rocm-libraries#5083 (commit d65061b)
[CK_BUILDER] Simplify the TreeFormatter.

My original design wrote each line streaming, so developers had to keep
track of the indentation depth and remember when to use `writelast` for
the last element at a depth. This was a source of a lot of cosmetic
output errors, and that is likely to get more complicated as we add
optional branches.

We switch to a tree-building interface with a simple `add` method. The
only cost is that we have to defer string building, which is a good
tradeoff for our use case.

Tested with `ninja smoke-builder`.
2026-03-07 18:06:33 +00:00
Christopher Millette
e0d11b969b [rocm-libraries] ROCm/rocm-libraries#5030 (commit 8e02a26)
[CK] Replace tuple value construction with tuple_element_t
 type extraction [1A] (#5030)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

### Rationale
CK's device operation instance registration uses
`add_device_operation_instances` at ~1,850
call sites to register GPU kernel configurations. The existing
implementation constructs
`std::tuple` values just to extract their types via `decltype`, then
copy-constructs each
instance into `make_unique`. This is wasteful — only the types matter,
not the values — and
forces the compiler to instantiate the full `std::tuple` constructor and
`std::get` machinery
at every call site.

### What changed
- Replace `remove_cvref_t<decltype(std::get<i>(tuple_obj))>` with
`std::tuple_element_t<i.value, TupleType>`, which extracts the type
directly without constructing any values
- Replace copy-from-default `make_unique<T>(value)` with direct default
construction `make_unique<T>()` — all CK device operation instances are
stateless structs with configuration encoded in template parameters
- Add `static_assert(std::is_default_constructible_v<NewOpInstance>)` to
enforce this contract at compile time with a clear error message
- Add Doxygen documentation for this high-traffic public API

### Value
- Eliminates unnecessary template instantiation of `std::tuple`
constructors and `std::get` across ~1,850 call sites
- Establishes a cleaner, more intention-revealing pattern for type-only
tuple usage
- The `static_assert` prevents silent breakage if a
non-default-constructible type is ever added
- No runtime behavior change — zero risk

### Files changed (9)
- `add_device_operation_instance.hpp`: Core pattern change
- 3 example files, 3 reduce instance headers, 1 convolution header, 1
profiler header

## Test plan
- [ ] Existing CI tests cover all ~1,850 call sites (GEMM, reduce,
softmax, convolution)
- [ ] `static_assert` provides compile-time validation stronger than
runtime tests
- [ ] No runtime behavior change — stateless struct default construction
is identical to copy-from-default
- [ ] Compatible with both `std::tuple` and `ck::type_list` containers

🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-06 16:28:22 +00:00
Christopher Millette
e2ce0cad54 [rocm-libraries] ROCm/rocm-libraries#4673 (commit ec385da)
Compile-time optimize threadwise slice transfer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Profiling with `-ftime-trace` on representative translation units (e.g.,

`device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`)
revealed
that **92% of frontend time was spent in template instantiation**. The
primary
bottleneck was redundant instantiation of identical helper logic across
multiple
threadwise transfer class variants.

Each `ThreadwiseTensorSliceTransfer_v*` class independently contained
its own
copy of the same helper computations — serpentine traversal, coordinate
stepping, thread scratch descriptors, lambda-like functors, and
compile-time
constants — duplicated across 13 header files. When a typical GEMM or
convolution kernel TU includes blockwise operations (e.g.,
`blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants
simultaneously, causing the compiler to instantiate the same helper
logic
multiple times with the same template arguments.

This was compounded by the helpers being defined as members of the outer
`ThreadwiseTensorSliceTransfer_v*` classes, which carry 14+ template
parameters.
Functions like `ComputeForwardSweep` depend only on their two argument
types,
but as inline members of the outer class, the compiler was forced to
create
separate instantiations for every unique combination of all outer
parameters
(data types, descriptors, vector widths, etc.) — even when most of those
parameters had no effect on the helper's output.

## Technical Details

### The Fix: Shared Helper Struct Hierarchy

Duplicated logic was extracted into a standalone helper hierarchy in
`threadwise_tensor_slice_transfer_util.hpp`:

```
ThreadwiseTransferHelper_Base          (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor,
|                                       ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple)
+-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex,
|                                       ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable)
+-- ThreadwiseTransferHelper_SFC       (ComputeSFCCoordinateResetStep)
```

Each helper method is now parameterized **only by what it actually
uses**:

- `ComputeForwardSweep(idx, lengths)` — parameterized only by the two
argument
  types, not by `SrcData`, `DstData`, `SrcDesc`, etc.
- `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by
the
  descriptor and access sequence types.
- `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector,
DimAccessOrder>()` — parameterized only by the four values it actually
needs.

This reduces template instantiation work in two ways:
1. **Across different transfer variants** (v3r1 vs v3r2 vs v3r1_gather):
the
compiler reuses a single instantiation instead of creating one per
variant.
2. **Across different outer class instantiations** (fp16 vs bf16 vs
int8): the
compiler reuses the helper instantiation because the helper doesn't
depend
   on the data type at all.

### Refactored Headers

**13 headers** now delegate to the shared helpers instead of duplicating
logic:
- Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant
- SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter
- Dead code removed: v4r1, v5r1

### Additional Fixes Found During Refactoring

- Two latent bugs in v3r2 (`forward_sweep` indexing,
`GetDstCoordinateResetStep` extraction)
- Dead `SrcCoordStep` variables in v4r1 and v5r1
- Unused `scale_element_op_` member in v3r1_dequant (restored with note)

### Net Code Change

+1,428 / -2,297 lines (~870 lines removed).

## Test Plan

### Unit Tests

28 host-side gtests in
`test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp`
covering the full helper hierarchy:

| Suite | Tests | What is verified |
|-------|-------|------------------|
| ThreadwiseTransferHelperBase | 6 | Compile-time constants,
inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false
in 2D and 3D |
| ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep`
(even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete),
`ComputeDataIndex`, `ComputeCoordinateResetStep`,
`VectorSizeLookupTable`, `VectorOffsetsLookupTable` |
| ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep` —
single access, 2D row-major, 2D column-major, 3D batch, even/odd inner
access counts |
| ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive
from Base, are not related to each other |
| DetailFunctors | 4 | `lambda_scalar_per_access`,
`lambda_scalar_step_in_vector`,
`lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) |

### Semantic Equivalence

GPU ISA comparison using `--cuda-device-only -S` confirmed identical
assembly
output (modulo `__hip_cuid_*` metadata) between baseline and refactored
code.

## Test Results

All measurements on a 384-core machine, `-j64`, freshly rebooted,
near-idle.

### Targeted Builds (affected targets only)

| Target | Baseline | Refactored | Wall-clock Delta | CPU Delta |
|--------|----------|------------|-----------------|-----------|
| `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU |
6m 53s / 161m CPU | **-9.7%** | **-14.9%** |
| `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU |
6m 42s / 182m CPU | **-31.8%** | **-10.0%** |
| **Combined** | **17m 27s / 392m CPU** | **13m 35s / 344m CPU** |
**-22.2%** | **-12.4%** |

### Full Project Build (8,243 targets)

| Metric | Baseline | Refactored | Delta |
|--------|----------|------------|-------|
| Wall-clock | 103m 38s | 111m 56s | +8.0%* |
| CPU time | 4705m 7s | 4648m 17s | **-1.2%** |

\*Wall-clock inflated by external load spike during refactored build
(load 90 vs 66). CPU time is the reliable metric.

### Context

~15% of all build targets (1,262 / 8,243) transitively include the
modified
headers. These are primarily GEMM and convolution kernel instantiations
— the
core compute workloads. The 12-15% CPU savings on affected targets is
diluted
to 1.2% across the full project because 85% of targets are unaffected.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-06 16:27:59 +00:00
Wojciech Laskowski
b80e41f3bc [rocm-libraries] ROCm/rocm-libraries#4495 (commit 5664eb0)
Adding layout test for amdgcn_mma structs

## Motivation

Currently, the test suite for `amdgcn_mma` focuses on the design (e.g.
choosing the correct specialization based on SFINAE) and a single live
test that checks if selected MmaOp runs. This PR adds a simplified GEMM
test kernel that checks the exact layout of the selected MmaOp.

## Technical Details

The test in `test_amdgcn_mma_layout.cpp` launches MxKxN test cases (one
per block), where each case:
1. Constructs A and B tensors on a device with a single 1 at A(m,k) and
B(k,n) (rest is all 0s)
  2. Executes the MMA intrinsic.
  3.  Checks if C has the "1" on the excpeted position.

For the MMA instrinsic, it pulls a Mma op from amdgcn_mma specialization
based on a given input (tile dimension, data types).

Note 1: As a helper, in `test_amdgcn_mma_layout_util.hpp` we add
register map for a given amdgcn_mma specialization. Register mapping is
currently based on the `tile_distribution_encoding`.

Note 2: Everything is added to the test suite, no additions to the
actual `amdgcn_mma` structs. All the extra information that is needed,
but not yet provided by `amdgcn_mma` structs, is added as a boilerplate
to the header. TODO: Rebase this PR on top of the `amdgcn_mma` refactor
or clean it up after merge.

## Test Plan

This PR solely adds a new test to the existing code.

## Test Result

Tests pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-06 16:22:16 +00:00
lalala-sh
b0c13f3124 [rocm-libraries] ROCm/rocm-libraries#5094 (commit d4548e6)
[CK] use int64 for ptr offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

When the number of experts (E) is large (e.g., E=257 in DeepSeek-V3),
the `expert_id * expert_stride` calculation in MOE GEMM kernels
overflows `int32` (`index_t`), causing the weight matrix (B) pointer to
wrap to an invalid address and triggering a GPU memory access fault.

For example, with `N=1024, K=7168, IsInputGemm=true`:
- `expert_stride = N * K * 2 = 14,680,064`
- `INT32_MAX / expert_stride ≈ 146`
- Any `expert_id >= 147` causes overflow → negative offset → illegal
memory access → GPU crash

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: amd-shiraz <shiraz.ali@amd.com>
2026-03-06 02:01:03 +00:00
John Afaganis
7ee6c44387 [rocm-libraries] ROCm/rocm-libraries#5101 (commit d4754fa)
[CK] Remove log spam for deprecated convolutions

## Motivation

The `CK_BUILD_DEPRECATED` flag guards legacy non-grouped convolution
instances, but both branches of every guard emit a `#pragma` message on
every build, adding noise without actionable information. According to
some recent testing, these non-grouped instances can outperform their
grouped replacements in certain configurations, so their continued
availability behind the flag remains valuable. This change removes only
the warning directives while preserving all guards and guarded code
paths.

## Technical Details

Removed all `#pragma` message lines referencing deprecated instances
from 25 convolution instance source files spanning conv1d_bwd_data,
conv2d_fwd, conv2d_bwd_data, conv2d_fwd_bias_relu,
conv2d_fwd_bias_relu_add, conv3d_bwd_data, grouped_conv3d_fwd,
grouped_conv3d_bwd_data, and grouped_conv3d_bwd_weight. The `#if
CK_BUILD_DEPRECATED` / `#else` / `#endif` preprocessor guards and all
guarded code remain unchanged.

## Test Plan

No functional change. The CK_BUILD_DEPRECATED conditional logic is
unmodified; only #pragma message directives were removed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 00:38:38 +00:00
chris-tsiaousis-hpc
03ce21ddcb [rocm-libraries] ROCm/rocm-libraries#4837 (commit 6316035)
[CK TILE] Unification of sparse MFMA/WMMA policy structs
 (#4837)

## Motivation

The existing unification work supports DENSE intrinsics. In this PR we
enable support for SPARSE as well as SCALE intrinsics and add an example
SPARSE implementation.

## Technical Details

Mostly trivial changes. One framework change is that the desired
`MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant
commit explains, we do not support a fallback family at the moment, but
it is something we can consider.

## Test Plan

Added a new test for the relevant sparse specializations.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-05 19:53:16 +00:00
Jeff Huang
6e558658ea [rocm-libraries] ROCm/rocm-libraries#4999 (commit 45f6624)
[CK] Fix 32-bit overflow in batch prefill kernel for >4GB KV
 cache (#4999)

Use SRD rebasing for page_block_size >= kN0: move SRD base pointer to
page start via 48-bit arithmetic, encode only within-page offset in
voffset. Original code path preserved for ps1/ps16 via constexpr-if.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-05 01:09:12 +00:00
DarylHawkinsAMD
147210ac72 [rocm-libraries] ROCm/rocm-libraries#5026 (commit fd5325b)
[CK_BUILDER] Add BILINEAR and ADD_CLAMP elementwise operation
 mappings to CK builder (#5026)

## Motivation

The CK kernels that MIOpen consumes use the BILINEAR and ADD_CLAMP
operations. The operation mappings in the CK Builder API need to be in
place to be able to instantiate those kernels using the builder.

## Technical Details

Add the BILINEAR and ADD_CLAMP operation mappings to the builder

## Test Plan

* Added builder tests for new helpers

## Test Result

* New tests pass locally, waiting for test run

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 23:29:35 +00:00
Ville Pietilä
ae4e632c7d [rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400)
[CK_TILE] Add CK Tile bwd weight profiler
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.

## Technical Details

- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
2026-03-04 21:50:29 +00:00
John Afaganis
fc1e1a5155 [rocm-libraries] ROCm/rocm-libraries#5071 (commit 444cd13)
Add Operation Support Matrix to Dispatcher README

Added an Operation Support Matrix to the Dispatcher README, detailing CK
Tile operations with support status for various data types, layouts, and
GPU targets.

## Motivation

Provide a clear understanding of which operators (and variants) are
supported by dispatcher.

## Technical Details

Entirely generated by a skill.

## Test Plan

N/A.  This is a documentation-only change.

## Test Result

N/A.  This is a documentation-only change.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 20:16:22 +00:00
Anton Gorenko
782f0a9eed [rocm-libraries] ROCm/rocm-libraries#4957 (commit d2e4eb8)
[CK_TILE][FMHA] Extend pipelines with pssk for gfx11/12
 (#4957)

## Motivation

Build pipelines with seqlen padding only to support vectorized loads in
the hdim dimension.
The existing pipelines have either all dims padded or all dims not
padded.
These pipelines can be used in ComfyUI for slightly better performance.

## Technical Details

Also a fix included for correct FLOPS calculation in
`tile_example_fmha_fwd` when `seqlen_q * seqlen_k` overflows index_t
capacity (signed int32).

## Test Plan

The existing test cases will use the new pipelines when parameters allow
(seqlens - padded, hdims - not padded):
```
ninja test_ck_tile_fmha_fwd

bin/test_ck_tile_fmha_fwd_fp16
bin/test_ck_tile_fmha_fwd_bf16

bin/test_ck_tile_fmha_fwd_fp8bf16 # for gfx12
```

## Test Result

All tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 04:50:44 +00:00
kensclin
30702c9cbc [rocm-libraries] ROCm/rocm-libraries#4834 (commit e75e6cb)
[CK_TILE][GEMM] Fix eightwarp error & Add eightwarp unit test
 (#4834)

## Motivation

The primary goal of this PR is to fix a critical issue in the EightWarps
implementation within ck_tile. Additionally, unit tests were added to
ensure that CI can detect errors.

## Test Plan

ninja test_tile_gemm_quant_abquant_eightwarps
./bin/test_tile_gemm_quant_abquant_eightwarps

## Test Result

All EightWarps related test cases in TestCkTileGemmABQuant completed
successfully without linker errors or validation mismatches.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 04:11:27 +00:00
Yi DING
b09112bbad [rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface

## Motivation
Reduce memory usage; Be prepared to implement optimizations of reducing
nsplits in deterministic cases.

## Technical Details
This PR introduces a new launcher interface for the FMHA backward
operation, replacing direct function calls with a more structured
approach. The launcher encapsulates kernel dispatch logic and provides
access to computed metadata like the number of dQ acc splits.

**Changes:**
- Added `fmha_bwd_launcher` class that wraps kernel execution and
exposes `dq_acc_splits`
- Moved `fmha_bwd_traits` construction earlier in the execution flow to
support launcher initialization
- Refactored code generation to produce both legacy API and new launcher
constructor

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-04 01:21:07 +00:00
Thrupti Raj Lakshmana Gowda
08b6de62f8 [rocm-libraries] ROCm/rocm-libraries#4958 (commit 713881f)
bf8 and bf16 support for Universal GEMM in Tile Engine
 (#4958)

## Motivation

Currently we have only fp8 and fp16 datatype support for universal GEMM
in Tile Engine with this PR support for bf8 and bf16 datatype will be
added during the CI phase

## Technical Details

Adding bf8 and bf16 support

## Test Plan

NA

## Test Result

NA

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-03 23:30:08 +00:00
Brock Hargreaves
2a16d53cce [rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)
[CK] Address a bunch of errors associated with targeting
 gfx1200 on Windows (#5045)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Still addressing errors that are blocking the merge of TheRock PR:
https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382

## Technical Details

1. There are multiple fmha python scripts that are writing native paths
which are confusing cmake. I addressed one of these in an earlier PR
https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing
more that are exposed with gfx1200 target:

```
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```

2. In the following compiler error we see gemm_prec_str<ADataType,
BDataType> being passed as a function to concat(...), instead of being
evaluated with the parenthesis operator(), i.e.,
gemm_prec_str<ADataType, BDataType>(). There are multiples instances of
this, I wonder what non-msvc compilers do here:

```
[composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast]
[composable_kernel]   119 |     ((oss << sep << rest), ...);
[composable_kernel]       |                     ^~~~
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here
[composable_kernel]   248 |         return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName());
[composable_kernel]       |                ^
```

There are plenty of other places where we use gemm_prec_str with the
operator(), so I'm pretty sure these were just typos...but I'd like some
eyes on it.

3. There are 2 tests that fail to build on Windows, which I've excluded
from the build but will open bug tickets for:
    1.  gemm_weight_preshuffle
    2.  grouped_gemm_preshuffle

Here's a sample of the compiler error for these tests:

```
[composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe  -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt   -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations]
[composable_kernel]   110 |         const char* vp = std::getenv(name);
[composable_kernel]       |                               ^
[composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here
[composable_kernel]  1183 |     _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s)
[composable_kernel]       |                    ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE'
[composable_kernel]   368 |         #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT(    \
[composable_kernel]       |                                                       ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT'
[composable_kernel]   358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text))
[composable_kernel]       |                                               ^
[composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation)
[composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918)
[composable_kernel] Target: x86_64-pc-windows-msvc
[composable_kernel] Thread model: posix
[composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin
[composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s).
[composable_kernel] ninja: build stopped: subcommand failed.
[composable_kernel FAILED WITH CODE 1 in 238 seconds]
ninja: build stopped: subcommand failed.
```

## Test Plan

Wait for internal CI and make sure build compiles locally.

## Test Result

Waiting on CI

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-03 21:55:14 +00:00
andrew clark
0d595e04ba [rocm-libraries] ROCm/rocm-libraries#4943 (commit ea40212)
[CK] Updating CI skip logic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CI skip logic has two issues that prevented it from working
correctly:

1. **Incorrect file patterns**: After migrating from standalone repo to
`rocm-libraries`, file paths now include the
`projects/composablekernel/` prefix (e.g.,
`projects/composablekernel/docs/README.md`). The skip patterns were
still checking for paths starting with `docs/`, which never matched.

2. **Incomplete build type support**: Jenkins multibranch pipelines
provide different environment variables for PR builds (`$CHANGE_TARGET`,
`$CHANGE_ID`) vs branch builds (`$BRANCH_NAME`). The previous logic only
compared `HEAD~1..HEAD` for branch builds, which missed changes from
multi-commit pushes and didn't properly handle feature branch builds.

When CI skipped or ran, there was no visibility into which files
triggered the decision, making it difficult to diagnose issues. You can
now see which files triggered the CI run.

## Technical Details

PR builds: Compares all commits against origin/$CHANGE_TARGET.
Feature branch builds: Uses git merge-base to find divergence point from
develop and checks all touched files since then.
Scheduled develop builds are unaffected. These builds are forced to run
from the pipeline parameters.

Example log output for PR Builds:
<img width="647" height="260" alt="image"
src="https://github.com/user-attachments/assets/c8673a81-acb2-4fb2-acbb-1c07b5ab3b69"
/>

Example log output for Branch Builds:
<img width="488" height="287" alt="image"
src="https://github.com/user-attachments/assets/fbb17ba7-eb2c-42a4-b820-b2a8b9e479c4"
/>

## Test Plan

Pre-PR validation (branch builds):

Push commits with only documentation changes → CI should skip. I will
have to verify this after this PR is merged!
Push commits with code changes → CI should run
Push commits that modify then revert code → CI should run (catching
reverts)
Verify debug output clearly shows skip/run decision

Post-PR validation (PR builds):

Create PR with only doc changes → CI should skip. I will have to verify
this after this PR is merged!
Create PR with mixed doc + code changes → CI should run and log which
files triggered it
Verify debug output clearly shows skip/run decision

## Test Result

All branch build checks succeeded.
All PR build checks succeeded.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-03 15:50:45 +00:00
Aviral Goel
f00ec5afd9 [rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)
test: Add umbrella test targets for CK Tile operations
 (#4301)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Adds operation-specific umbrella test targets for CK Tile to enable
running all tests for a specific operation without running the entire
test suite. This improves the development workflow by allowing faster
iteration when working on specific operations.

## Motivation

Previously, developers working on CK Tile operations could only:
- Run individual test executables one at a time
- Run global labels (, , ) which test the entire codebase
- Build all tests for an operation but had no simple way to run them all

This made it cumbersome to validate changes to a specific operation
(e.g., GEMM quantization) without either running tests individually or
running the entire test suite.

### Documentation

- - Comprehensive testing guide with usage examples and implementation
details

## Usage Examples

# Run all GEMM tests with 256 parallel jobs
ninja -j256 ck_tile_gemm_tests

# Run all GEMM block scale (quantization) tests
ninja -j256 ck_tile_gemm_block_scale_tests

# Run all GEMM StreamK tests
ninja -j256 ck_tile_gemm_streamk_tests

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-03-03 15:40:50 +00:00
Johannes Graner
1cd031c21d [rocm-libraries] ROCm/rocm-libraries#4800 (commit 9dcf0cf)
[CK Profiler] Instance selection for grouped conv profilers
 (#4800)

## Motivation

This PR adds instance selection support for ckProfiler grouped
convolution operations (forward, backward data, backward weight),
allowing users to run specific kernel instances rather than sweeping all
available instances.

When profiling or debugging convolution kernels, users often need to
test specific kernel configurations without running the full instance
sweep. This is particularly useful for:
- Debugging a specific failing instance
- Profiling a known-best configuration
- Quick validation during development

## Technical Details

**Features added**:
- `--instance <id>` flag to run only the N-th valid instance (0-indexed)
- `--list-instances` flag to list all valid instances without running
any kernels
- Named arguments can appear anywhere on the command line
- Best instance index is now printed with results for reference
- Python script support via `-ii` / `--instance_index` arguments

**Design decisions**:
- Named arguments (`--instance`, `--list-instances`) instead of
positional to avoid conflicts with existing parameters
- Instance index refers to the N-th valid instance (0-indexed), not the
global instance index
- Auto-disable verification when `--list-instances` is used for fast
enumeration
- Shared utilities in `profiler_arg_utils.hpp` to deduplicate parsing
logic

## Test Plan

Manual testing with various scenarios:

List all valid instances:
```bash
./bin/ckProfiler grouped_conv_fwd <usual args> --list-instances
```

Run only instance 5:
```bash
./bin/ckProfiler grouped_conv_fwd <usual args> --instance 5
```

Test cases:
- Single instance selection
- List instances mode
- Out-of-bounds instance index (verified warning messages)
- No instance flag (runs all instances - default behavior)
- All three operations (fwd, bwd_data, bwd_weight)

## Test Result

All test scenarios passed:
- Instance selection correctly filters kernel executions
- List mode enumerates valid instances without running kernels
- Invalid indices produce appropriate warnings without crashing
- Default behavior (all instances) unchanged when flags not provided
- Consistent behavior across all three grouped convolution operations

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-03 15:33:20 +00:00
Illia Silin
fff9f72ffb [rocm-libraries] ROCm/rocm-libraries#5036 (commit 0bee213)
[CK] Switch compiler branch from staging to develop and
 upgrade sccache. (#5036)

## Motivation

Upgrade to official sccache version 0.14, since it now supports hip.
Also, switching daily builds from amd-staging to develop compiler
branch, since it should be more stable.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-03 15:33:08 +00:00
John Shumway
54861f1f49 [rocm-libraries] ROCm/rocm-libraries#5038 (commit 6e74de7)
[CK_BUILDER] Update developer notes in the CK Builder source
 directories (#5038)

## Motivation

This PR updates the developer notes for the CK Tile builder. It captures the current state of the implementation in more detail, and frames the description around the need to have true facade.

There is no functional change, only better alignment of developer notes with the current code.

This doc clearly explains the current technical debt: that we have created many facades that expose
the implementation details. There is an expanded section on reflection that explains how unified
reflection will help clarify the unified builder design.

Additional changes are just better accounting for the current state of the code, including previously
undocumented operations. A few typos and cosmetic issues are cleaned up, too.
2026-03-03 15:16:59 +00:00
Emily Martins
f1746955fd [rocm-libraries] ROCm/rocm-libraries#4984 (commit 962b047)
[CK_TILE] Reduce Register Spills in Stream-K Reductions
 (#4984)

## Motivation

In CK Tile Stream-K, kernels using one of two non-atomic reduction
strategies (i.e., linear, tree) have high register spill count, with the
tree reduction generally being worse. These changes act a first step to
help decrease the register spill count.

## Technical Details
### Problem 1: Unvectorized access to partials
In both the linear and tree reductions, workgroups write partials
results to a global buffer; another workgroup will later read this data.
When the initial logic to support reading and writing to the partials
buffer was added (see
https://github.com/ROCm/composable_kernel/pull/3107), the tile
distribution encoding used to read from and write to partials matches
the register layout for the accumulator of the mfma instruction used for
the kernel. Since we do not currently use the transposed register layout
for the accumulator, we end with an encoding that is not optimized for
writing to HBM.

For example: Consider the register layout of the
`v_mfma_f32_16x16x32_fp8_fp8` instruction.
```bash
./matrix_calculator.py --architecture gfx942 --instruction  v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix
```
<img width="1113" height="537" alt="image"
src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02"
/>

The above shows that threads are responsible for consecutive elements
down a column of the C tile. If we use this distribution to read and
write to partials with C in row major, then threads are unable to
perform vectorized reads and writes. Note: thread 0 is shown in red and
thread 1 is shown in green.

Since the C-shuffle Epilogue only supports C in row major, reading and
writing to partials is highly unoptimized.
### Problem 2: Missed opportunity for SPGR use in tree reduction loop
Since the reduction occurs between workgroups, all threads in the
workgroup follow the same execution paths in the tree reduction logic,
hence various variables should be using SGPRs, but they are not.

### Implemented Solutions
1. Add a new tile distribution encoding that is optimized for accessing
partials in HBM. This encoding does not change the data assignment to
threads, it merely changes the addresses to which they write/read in the
partials buffer. For example, continuing with the
`v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result
in threads writing in the following layout:
<img width="517" height="342" alt="image"
src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202"
/>

This layout ensures that each thread writes along a row, enabling
`buffer_{store|load}_dwordx4` instructions (i.e., vectorized accesses).
This helps reduce register usage due to requiring fewer offset
calculations.

2. To force SGPR usage in the tree reduction loop, I make use of CK
Tile's `amd_wave_read_first_lane` which is a wrapper around
`__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the
tree reduction.

_These changes do not fully eliminate register spills. Future work will
aim to further reduce spills. But these changes make good progress._

## Test Plan

Added tests for different warp tile sizes to validate that the new
encoding works with different `WarpGemm` variants.

## Test Result

All tests pass locally on all gfx9 architectures.

Some results for decreases in register spills on gfx942: (BL = baseline)
| Kernel | SGPR Spill (BL) | SGPR Spill (new) | SGPR Delta | SGPR % |
VGPR Spill (BL) | VGPR Spill (new) | VGPR Delta | VGPR % |

|--------|------------------:|------------------:|-----------:|-------:|-------------------:|------------------:|-----------:|-------:|
| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 | 223 | 0 | -223 |
-100.0% | 21 | 20 | -1 | -4.8% |
| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 | 233 | 11 | -222 | -95.3%
| 443 | 23 | -420 | -94.8% |
| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 | 221 | 3 | -218 | -98.6%
| 12 | 6 | -6 | -50.0% |
| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 | 230 | 14 | -216 | -93.9%
| 396 | 12 | -384 | -97.0% |

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 17:40:34 +00:00
Kiefer van Teutem
b042e1805a [rocm-libraries] ROCm/rocm-libraries#4804 (commit 832dd0e)
Add Tile Distribution Encoding Register Mapping debug utility
 for MFMA / WMMA unification work. (#4804)

## Motivation

This PR adds a small utility that allows you to use Tile Distribution
Encodings to directly map matrix elements to register locations and vice
versa. It can also print forward and backward layout mappings similar to
the Matrix Calculator utility. The utility is not meant for index
calculations in actual kernels, but rather as a debugging tool and
probably for automated verification of the policy structs in the new
WMMA / MFMA unification design.

## Technical Details

Tile Distribution Encodings are a core part of CK Tile which can define
the relationship between register and intrinsic matrix fragment
elements. They allow for any mapping based on unmerge and merge
transformations. Also, they allow for a special "Repeat" dimensions
which acts like an additional matrix dimension and allows for
replication of certain matrix elements. The new mapping utility can deal
with all aspects.

## Test Plan

Since this is a debug utility there is nothing to directly test, but
there is an example file that defines four different Tile Distribution
Encodings and prints their forward and backward mappings, along with
some extra parameters.

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 16:42:42 +00:00
Yi DING
a8e2ec22cf [rocm-libraries] ROCm/rocm-libraries#5011 (commit b31a678)
[CK] Fix aiter tests in CI
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Updates the CK/AITER CI Docker build to source Composable Kernel either
from `ROCm/rocm-libraries` (via sparse-checkout) or directly from
`ROCm/composable_kernel`, aiming to make aiter tests reliable in CI.

**Changes:**
- Added a build arg to toggle fetching CK from `ROCm/rocm-libraries`
(enabled by default).
- Implemented sparse-checkout + local re-init/commit flow to materialize
CK into a local `ck/` directory.
- Updated aiter’s CK vendoring step to clone from the locally prepared
`ck/` directory.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 15:36:01 +00:00
SamiAario-AMD
95dc496d30 [rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)
Cleanup and refactoring related to tile loading
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Cleanup and refactoring done while implementing mixed precision for
fp16/bf16 x fp8

Key changes:

- Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and
refactored the API to use consistent naming conventions
- Updated load_tile_transpose functions to use output parameters instead
of return values for consistency
- Removed unused variable declarations and simplified type deduction
logic
- Define load_tile_with_elementwise to use tuple types explicitly for
clarity

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-03-02 12:21:44 +00:00
Max Podkorytov
0438ab1b79 [rocm-libraries] ROCm/rocm-libraries#4518 (commit dd161dc)
[CK_TILE] Fix CShuffleEpilogue test to use correct GEMM
 accumulator distribution (#4518)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

The test was using LDS distribution to create the accumulator tile, but
CShuffleEpilogue expects the GEMM accumulator distribution that
BlockGemm produces. This mismatch caused incorrect data permutation.

## Changes

- Use WarpGemmDispatcher to get correct accumulator distribution
encoding
- Load test input from host-initialized global memory for deterministic
verification
- Shard tests by data type (FP16, FP8) with gfx950-specific FP8 tests
- Extract scale tests into separate target for better organization
- Implement exact permutation verification (all unique values appear
once)
- Reduce tile size from 256x256 to 128x128 to fit in unique fp16 range
- Add parameterized test configurations for various warp layouts and
MFMA types

## Test plan

- [x] Run new cshuffle epilogue tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-02 08:55:05 +00:00
Linjun-AMD
78ae3835a6 [rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66)
[CK] Fix gptoss sink

## Motivation

This PR removes conditional logic for handling infinity values in the
sink mechanism across multiple FMHA pipeline implementations, defaulting
sink_size to 0 and adding a constraint in the kernel selection logic.

## Technical Details

Changes:

Removed __builtin_isinf_sign(sink_v) checks and conditional
initialization of LSE accumulators across 7 pipeline files
Added default initialization (= 0) for sink_size in 4 argument structs
Added F_sink == "f" constraint to kernel compatibility checking

## Test Plan

Local test

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-02 01:54:46 +00:00
jakpiase
d32d515f64 [rocm-libraries] ROCm/rocm-libraries#4873 (commit 580ad4f)
[CK] CK Tile improvements and fixes for depthwise merged
 convolutions forward (#4873)

## Motivation
Performance benchmarks showed that old CK's depthwise merged
convolutions are much faster than CK Tile's ones.

## Technical Details
After investigation it showed up that the requirement that A/CVectorload
is a multiple of gemm's rightmost dimension is too strict in case of
processing multiple groups, because if tensor is in NHWGC/NHWGK format,
then if C/K is equal to 1, we can use vectorloads on the G dimension,
which is added by this PR. Filter5x5 specialization was also added,
because some models are using it, it's similar to 3x3, the only
difference is the window size. This addition was needed, because of the
differences of tensor descriptor transformations betweeen CK and CK
Tile. In old CK the case of grouped depthwise 5x5 convs was supported
via Default specialization, but in CK Tile that case was not working
properly.

## Test Plan
Performance was tested by our internal test suite, which contains
several DL models.

## Test Result
Tests results showed significant performance uplift for depthwise(3x3,
5x5) cases
2026-03-01 13:27:18 +00:00
Max Podkorytov
1dd47118e2 [rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)
Add generate_identity_sequences helper and replace lambdas
 with named functors (#4828)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Add `generate_identity_sequences<N>()` helper that returns
`Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>`
- Replace lambdas with named functors in `transform_tensor_descriptor`
- Add `unpack_and_merge_sequences` helper functor
- Reduces `transform_tensor_descriptor` instantiations from 388 to 32
(92% reduction)

## Motivation

Multiple call sites use `generate_tuple([](auto i) { return
Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda
instantiations.

Additionally, each lambda in `transform_tensor_descriptor` creates a
unique closure type, causing the function to be instantiated separately
for every call site. Named functors share a single type, so the compiler
reuses the same instantiation.

## Changes

### Part 1: generate_identity_sequences helper
- Replaces common lambda pattern for generating identity sequences
- Each lambda expression creates a unique closure type, causing separate
template instantiations at every call site
- Named helper shares a single type across all uses

### Part 2: Named functors in transform_tensor_descriptor
- Add `unpack_and_merge_sequences` helper to replace lambda in
`GetNumOfHiddenDimension`
- Use `generate_identity_sequences` in `matrix_padder.hpp`

## Test Plan

- [x] Added 7 unit tests:
  - 4 tests for `generate_identity_sequences`
  - 3 tests for `unpack_and_merge_sequences`
- [ ] Waiting for full CI

## Related PRs

This PR merges the functionality from:
- ROCm/composable_kernel#3588 (generate_identity_sequences helper)
- ROCm/composable_kernel#3589 (Named functors in
transform_tensor_descriptor)

Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times)

**Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and
ROCm/composable_kernel#3589, which can be closed once this is merged.
2026-02-28 20:11:11 +00:00
Bartłomiej Kocot
ef82340e05 [rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2)
[CK] Port non-grouped convolution instances to the grouped
 kernels (#4875)

## Motivation

Port non-grouped convolution instances to the grouped kernels to
deprecated older non-grouped implementations.

## Technical Details

Add the same instances as non-grouped but using grouped kernel.

## Test Plan

test_grouped_convnd_fwd

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-724
2026-02-28 01:25:33 +00:00
Andriy Roshchenko
b661eab573 [rocm-libraries] ROCm/rocm-libraries#4821 (commit 9456e0f)
[CK TILE] Refactor MX FLATMM example

Refactor the MX FLATMM example to support more pipelines
across different architectures. This work facilitates the NPI team
roadmap.
2026-02-27 23:21:39 +00:00
John Shumway
711374fcab [rocm-libraries] ROCm/rocm-libraries#4975 (commit 5bee6f0)
[CK] Add gfx1103 to GPU target lists
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Motivation

We need to fix multi-architecture CI build convergence for the
gfx110X-all shard (ROCm/TheRock#3499). The gfx110X-all CI shard targets
gfx1100–gfx1103, but gfx1103 is missing from CK's default CK_GPU_TARGETS
lists. While CK's source code already fully supports gfx1103
(architecture enums, compiler defines, WMMA intrinsics, device
detection), the CMake target lists omit it, which prevents standalone
builds from including gfx1103 by default. This is a prerequisite for the
corresponding TheRock change that adds gfx1100–gfx1103 to the
`_ck_supported_gfx_targets` allowlist in ml-libs/CMakeLists.txt.

Technical Details

Add gfx1103 to the default CK_GPU_TARGETS fallback lists in
projects/composablekernel/CMakeLists.txt:

- Line 220: comment documenting supported GPU_ARCHS values
- Line 227: target list for HIP < 6.3 (non-Windows)
- Line 229: target list for HIP 6.3–6.4 (non-Windows)
- Line 231: target list for HIP 6.4–6.4.43483 (non-Windows)

The newest HIP version block (≥ 6.4.43483) already uses gfx11-generic,
which covers all gfx11 family targets including
   gfx1103, so no change is needed there.

No source code changes are required — all architecture-specific support
for gfx1103 is already in place:
  - include/ck/ck.hpp: __gfx1103__ included in __gfx11__ macro
- include/ck_tile/core/arch/arch.hpp: GFX1103 enum and device property
mappings
  - include/ck_tile/core/config.hpp: CK_TILE_ARCH_GFX1103 flag
- include/ck/host_utility/device_prop.hpp /
include/ck_tile/host/device_prop.hpp: is_gfx11_supported() includes
  gfx1103

Test Plan

- Configure CK standalone build with -DGPU_TARGETS="gfx1103" and verify
it configures without warnings and compiles
  successfully.
- After the companion TheRock PR lands, verify the gfx110X-all CI shard
builds CK and produces a CK-enabled
libMIOpen.so matching the structure of other shards (no "gfx110X is not
supported by composable kernel" warnings).

Test Result

I configured with gfx1103 and built with `ninja -j 192` on an in-memory
filesystem in 49 minutes.

The windows build was successful and took 2 1/2  hours on 192 cores.
2026-02-27 22:56:13 +00:00
Bartłomiej Kocot
c1ec24a2de [rocm-libraries] ROCm/rocm-libraries#4963 (commit cb6bbf6)
[CK][CK Tile] Fix batched gemm kernel 2 lds

## Motivation

Fix 2 lds batched gemm universal gemm call. Disable split k for not
valid atomic add instruction size.

## Technical Details

Fix 2 lds batched gemm universal gemm call. Disable split k for not
valid atomic add instruction size.

## Test Plan

CI overall

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-27 22:17:07 +00:00
Adam Osewski
22de6a19d9 [rocm-libraries] ROCm/rocm-libraries#4750 (commit c065793)
[CK_BUILDER] ck builder conv transfer fix

## Motivation

This PR fixes how CK Builder is validating transfer vector size and adds
proper validation for LDS transfer vector size as well.

## Changes:

* [__source vector dim__] -- Before this PR the data transfer validation
logic didn't allow to set the source vectorized dimension to 1. However
there are CK instances that are doing this when the group merging is
used. This is used only for
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle` kernel.
* [__valid vector size__] -- Before this PR the validation logic
concerned only single instruction maximum vector size. However our
buffer loading logic has implemented support for loading more values
through multiple buffer instructions. This again was discovered to be
used in some of the convolution instances. Thus this behavior was
reflected in validation logic.
* [__valid LDS vector size__] -- Before this PR the LDS vector size
validation was done in the same way as VMEM. This PR adds proper LDS
vector size validation based on the available LDS instruction sizes.

## Test Plan

Run CK BUILDER conv fwd factories tests

## Test Result

All CK BUILDER conv fwd factories work (except DL one & ck tile since
they're not yet added now)

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-27 13:48:58 +00:00
kabrahamAMD
5e06874aae [rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory
 (#4582)

This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
2026-02-27 03:06:29 +00:00
Aviral Goel
c8a8449eec [rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)
[CK] Add split-K support for ABQuantGrouped in
 block_scale_gemm (#4816)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Changes

### Split-K support in `gemm_quant_kernel.hpp`

- **`SplitKBatchOffset`**: Added `aq_group_offset` and
`aq_k_split_offset` fields (mirroring the existing `bq_*` fields for B)
to track each split-K batch's position within the AQ scale tensor. For
`ABQuantGrouped`, both offsets are computed from `k_id * KRead` divided
by `AQuantGroupSize::kK`.

- **`MakeAQBlockWindow`**: Added an `aq_group_offset` parameter
(defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group
dimension reflects only the remaining K-groups from the split-K offset,
consistent with how `MakeBQBlockWindow` handles the BQ tensor.

- **`RunGemm`**: Threads the `aq_k_split_offset` through to
`MakeAQBlockWindow` when in split-K mode.

### Constraints in `IsSupportedArgument()`

Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped:

1. **Mode check** — split-K is only allowed for `BQuantGrouped` (no
preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant
mode with `k_batch > 1` returns `false`.

2. **B quant group alignment** — `KRead` (per-batch K slice) must be
divisible by `BQuantGroupSize::kK`. Each batch must operate on complete
B quantization groups; a partial group would require splitting a scale
value across batches.

3. **A quant group alignment** (new, ABQuantGrouped only) — `KRead` must
also be divisible by `AQuantGroupSize::kK` for the same reason applied
to the AQ scale tensor.

4. **Minimum 2 K-tile iterations per batch** (new) — The
software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead,
so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When
`KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch
reads into the next batch's memory region and produces incorrect
results. Configurations where `K == k_batch * KPerBlock` are therefore
rejected.

### Example update (`run_gemm_quant_example.inc`)

Updated the comment above the `IsSupportedArgument` call to document
that split-K is now supported for both `BQuantGrouped` (no preshuffle)
and `ABQuantGrouped` (no `APreshuffleQuant`).

## Unit Tests

Two new test files covering decode and prefill tile shapes across a
range of `k_batch` values (2–8), data types (FP8, BF8), and quantization
group sizes (1×1×128 and 1×128×128 for B):

- `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile
shape (M=16, N=64, K_tile=256)
- `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile
shape (M=128, N=128, K_tile=128)

Each test calls `run_test_with_validation` which runs the kernel and
checks correctness against a CPU reference. Configurations excluded from
tests are annotated with comments explaining which constraint they
violate (typically the `per_batch_num_loop >= 2` requirement).

## Prerequisites

This PR depends on #4429, which must be merged before this can be
merged.
2026-02-26 23:57:17 +00:00
spolifroni-amd
6549c320fc [rocm-libraries] ROCm/rocm-libraries#4431 (commit ca33816)
[CK] updated github repo link

The location of the github repo has changed; the landing page of the
docs needs to reflect this.

Updated only the git repo links in the docs folder.

Also added info to the install doc about how to do a sparse checkout.

Updated some refs that were messed up while I was at it.
2026-02-26 18:36:34 +00:00
Thrupti Raj Lakshmana Gowda
9059730caf [rocm-libraries] ROCm/rocm-libraries#4592 (commit 45f76cb)
Tile Engine support for gfx950

## Motivation

This PR adds support for the gfx950 GPU architecture to the Tile Engine
in Composable Kernel library, focusing on GEMM operations with FP8 and
BF8 data types.

## Technical Details

Added gfx950-specific MFMA warp GEMM implementations with conditional
compilation.
Updated default GEMM configuration parameters for tile sizes and warp
configurations.
Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950
hardware.

## Test Plan

Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.

## Test Result

Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 16:15:41 +00:00
Illia Silin
17e4c8eac9 [rocm-libraries] ROCm/rocm-libraries#4883 (commit 56347bb)
[CK] Disable test_fmha_fwd_fp8fp16 on gfx90a by default.
 (#4883)

## Motivation

Since gfx90a has no native support for FP8 datatype, all FP8 tests
should be disabled there by default.

## Technical Details

The test_fmha_fwd_fp8fp16 is the last failing test in CK on gfx90a with
staging compiler.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 02:09:06 +00:00
Yung-sheng Tu
75aea70c2c [rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 00:28:58 +00:00
JP-Fernando
9a32f0ea19 [rocm-libraries] ROCm/rocm-libraries#4415 (commit b3b4af7)
[CK] Remove duplicated XDL/WMMA tests
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

When we started the RDNA4 support, the XDL instances were not supporting
WMMA instructions, so we duplicated some tests.

In this issue, we simplified most of the duplicated test files into
common test files.

## Technical Details

The following tests were unified:

- `batched_gemm`

- `batched_gemm_gemm`

- `gemm_add`

- `gemm_universal`

- `grouped_convnd_bwd_data`

The following tests were duplicated exactly, and copied into two files
with `_xdl` and `_wmma` suffixes. Now they are unified in one single
file without suffix:

- `gemm_multi_abd`

- `gemm_b_scale`

There is still an apparent duplication which is a special case, namely
`test_grouped_convnd_bwd_weight_interface_{suffix}` where `{suffix}` is
`xdl` or `wmma`.
However, the WMMA code relies on an old implementation, and is expected
to be removed in the future. In addition, it differs from the XDL
implementation significantly.
Therefore, it was decided to keep both files separate instead of
attempting any unification.

## Test Plan

`CMakeLists.txt` files were modified to support the new, unified tests.
In particular, testing was done for `gfx90a`, `gfx1201` and `gfx11`
architectures.

## Test Result

All tests passed successfully on all three tested architectures.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 23:23:02 +00:00
John Afaganis
6bf2423685 [rocm-libraries] ROCm/rocm-libraries#4898 (commit 753f2ac)
Create operation support matrix for CK Tile Engine

Introduce operation support matrix for CK Tile kernels detailing data
types, layouts, and GPU targets.

## Motivation

The tile engine currently supports a subset of CK Tile operations, but
there is no in-repo reference that maps which operations, data types,
layouts, and GPU targets are covered by the tile engine versus only
available through hand-written examples or tests. This makes it
difficult for developers to know what the tile engine already handles,
what requires manual integration, and where coverage gaps exist. This PR
introduces an operation support matrix as a markdown file in
tile_engine/, intended to be maintained as a living document alongside
the code. Because it lives in the repository rather than an external
wiki or PDF, it can be reviewed and updated in the same pull requests
that add or extend tile engine operations, keeping it accurate as
coverage evolves.

## Technical Details

Documentation only change.

## Test Plan

N/A
## Test Result

N/A

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 23:07:25 +00:00
Bartłomiej Kocot
eede24de0d [rocm-libraries] ROCm/rocm-libraries#4872 (commit ca623f7)
[CK] Small improvements for grouped conv backward weight
 (#4872)

## Motivation

Improvements for CK Tile convolution builder run function and atol/rtol
calculations.

## Technical Details

- Add preprocessing function for wrw when k_batch is larger than 1 for
builder run function
- Divide num acums by number of groups to get real number of accums

## Test Plan

CI wrw tests

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-783
2026-02-25 20:11:01 +00:00
Brock Hargreaves
c90a363589 [rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd)
[CK] Use as_posix() instead of str() for paths in
 fmha_fwd_appendkv.py (#4812)

## Motivation

This is causing a failing PR for Windows:
https://github.com/ROCm/TheRock/pull/3382
```

[composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```

## Technical Details

The file:
[fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78)
writes a bunch of paths to a text file which is later parsed by cmake.
When passing a pathlib.Path to str(), str() converts to a native path,
in this case / to \\ on Windows which confuses cmake. In this case we
need to write paths with forward slashes and then pass those onward to
cmake.

## Test Plan

1. Ensure this doesn't impact existing CI.
2. Ensure compilation of Windows pass locally.

## Test Result

1. Passes existing CI
2. This fixes the compilation error locally.

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 16:13:56 +00:00
Brock Hargreaves
abf13bdec1 [rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b)
[CK] Fix windows build issues

## Motivation

Full build on Windows is currently broken due to compiler errors, this
PR should help fix that. This is also holding up the following PR in the
TheRock: https://github.com/ROCm/TheRock/pull/3382

## Technical Details

1. I don't see a good reason to be nesting a windows include inside the
ck_tile namespace. It was causing compiler errors too: Windows.h comes
with min and max, which was conflicting with ck_tile::min and
ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this
inclusion in the future.
2. The TRUE/FALSE macros are already used by Windows.h, which causes an
error. So I've opted for True/False. You can see this pattern in other
rocm-libraries.
3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN
context, from \<cmath\> on Windows. We'll be able to use
std::numbers::v_pi\<float\> when we have C++20 support.
4. There was a missing \<chrono\> include.

## Test Plan

Test locally and make sure this doesn't impact existing CI.

## Test Result

Compiles locally and passes existing ci.

## Submission Checklist

- [ x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 16:13:13 +00:00
Zoltán Lakatos
a32d704d89 [rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for
 rdna4 (#4425)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Add support for grouped gemm multi ABD fixed NK. MR

## Technical Details

Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned apparently for xdl this feature was missing tests
so our assumption is either there is an implemenetation bug or these
instances were not set up correctly. Has the potential for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.

Fix epilogue selection in device implementation that caused unit test
failures

## Test Plan

Covered by added unit tests

## Test Result

CI successfully passing

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 05:17:08 +00:00
Bartłomiej Kocot
1a2c0d835a [rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6)
[CK][CK TILE] Improve oob check

## Motivation

Improve OOB checks. Remove permutes which have been generated by thread
buffer zero clear. at now in assembly there is only condmask instead of
permute + condmask.

Change number of KPack for generated instances

## Technical Details

Remove permute instructions from assembly

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-24 21:41:44 +00:00
Max Podkorytov
f3f4d7d842 [rocm-libraries] ROCm/rocm-libraries#4430 (commit 3bcf68c)
[CK] Add project root marker for monorepo compatibility
 (#4430)

## Summary
- Add `.ck-project-root` marker file at the composablekernel project
root
- Update `find_project_root()` in `script/tools/common.sh` to look for
this marker instead of `.git`
- Fixes project root detection when CK is part of the rocm-libraries
monorepo

  ## Background
Since the project was moved into the monorepo, the `.git` directory is
at the monorepo root rather
than the CK project root. This caused `find_project_root()` to return
the wrong path, breaking tools
   in `script/tools/`.

  ## Test plan
- [x] Verify `find_project_root` returns correct path from any CK
subdirectory
  - [x] Verify `ck-build --help` works
  - [x] Verify `ck-configure --help` works

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-02-24 21:00:34 +00:00
Matti Eskelinen
cd12e8e31f [rocm-libraries] ROCm/rocm-libraries#4295 (commit fa2cfc8)
[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to
 allow caller to pass desciptors directly (#4295)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create
tensor views from strides and sizes. This refactors the descriptor
creation out and add overloaded definitions, allowing descriptors to be
created separately by the caller instead of passing explicit strides,
with no functional changes.

This will enable further refactoring of `RunGemm` to do likewise,
enabling derived kernels like BatchedContractionKernel to avoid creating
separate versions (PR
[#3457](https://github.com/ROCm/composable_kernel/pull/3457)).

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

Since the logic within the MakeXBlockviews chains together operations on
tuples, and thus the descriptors are also passed as such, adding a
template parameter for the type of the input tuple was the simplest
option to enable the overload without too much verbiage. However, for
`MakeCBlockView` this adds a complications as the templated definitions
are prone to overlap. This for now is avoided by just moving the
arguments around for the descriptor version, which avoids the collision.
It's not a great solution, so feel free to suggest a better one.
2026-02-24 20:44:27 +00:00
Enrico Degregori
4c626aeaa6 [rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant
 (#4267)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Supported types combinations using BQuant=e8m0:
 - A=bf16
 - B=bf16,bf8,fp4

Summary:
- remove usage of `pk_fp4_raw_t`: consistent with other implementations
and avoid taking into account of the packed size explicitly. In general,
the raw type should not be used because CK Tile internally takes care of
the PackedSize, so using the raw type adds unnecessary complexity to the
implementation
- handle microscaling by checking for `e8m0` type for BQuant (previous
implementation was inconsistent)
 - add support for scaling instructions in `DequantPack8`
 - mx pipeline:
   - extend existing pipeline to support different B types
- add support to scale and cast before writing to LDS or after reading
from LDS (this can be defined in the `Problem` by the user)
 - block gemm:
   - mx pipeline is now using block gemm BQuant
- block gemm BQuant can now load from LDS and apply scale and then call
block gemm universal operator. This adds new functionalities and remove
code duplication
 - warp gemm:
- add case to support 128bit ds_read/write for both A and B when A=16bit
and B=8bit
- add examples and tests: note that some tests for bf16/fp4 already
existed but were removed during previous tests refactoring. I added them
again and other relevant tests for new types combinations

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-24 17:57:02 +00:00