848 Commits

Author SHA1 Message Date
Bartłomiej Kocot
9c414d2e59 [rocm-libraries] ROCm/rocm-libraries#5454 (commit 8dade31)
[CK][CK Tile] Grouped Convolution backward weight profiler
 flush cache (#5454)

## Motivation

Flush cache to get more stable results during profiling old ck and ck
tile.

## Technical Details

Flush cache before each kernel call and one more first run.

## Test Plan

test_grouped_conv_bwd_weight_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-16 17:47:07 +00:00
lalala-sh
a3ccd5dca1 [rocm-libraries] ROCm/rocm-libraries#5225 (commit 880166b)
[CK] fix moe memset size which is bigger than alloc

## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-16 09:30:57 +00:00
Bartłomiej Kocot
2169367735 [rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5)
[CK][CK Tile] Improvements for grouped conv fwd tile
 profiling (#5114)

## Motivation

Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details

- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-786
2026-03-11 22:39:20 +00:00
JP-Fernando
d8ee107a47 [rocm-libraries] ROCm/rocm-libraries#4421 (commit 5bb5769)
[CK] Unify the grouped convolution gridwise Run() functions
 (#4421)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

There are currently three different grouped convolution related Run()
function overloads that exist in `gridwise_gemm_wmma_cshuffle_v3.hpp`.
These are used for the different types of grouped convolution: Forward,
Backward weights, and Backward data.
The functions are very similar and should be unified to a single `Run()`
function for all types of grouped convolution.

## Technical Details

The three old `Run<>()` functions were replaced with a single unified
function.
The new `Run<>()` function is run from device implementations:

-  DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3

-  DeviceGroupedConvBwdDataMultipleD_Wmma_CShuffleV3

-  DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3

-  DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3

-  DeviceGroupedConvBwdWeight_Wmma_CShuffleV3

The DeviceGroupedConvFwdMultipleD_Wmma_CShuffle_V3_Large_Tensor
implementation uses a different `Run<>()` overload and was therefore not
modified.

## Test Plan

Run the following grouped convolution tests on `gfx1201`, as this
architecture is WMMA-capable:

- `test_grouped_convnd_fwd`

- `test_grouped_convnd_bwd_weight`

- `test_grouped_convnd_bwd_data`

Compilation and testing were also executed on `gfx1100` to avoid CI
problems.

## Test Result

First part (unification of `Run<>()` function): All tests successful.

Second part (integration of single `Run<>()` function as a direct call):
All tests successful.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-11 16:40:12 +00:00
John Shumway
9f47b8a63d [rocm-libraries] ROCm/rocm-libraries#5284 (commit 76b5b15)
[CK_BUILDER] Add
 DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)

Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.

## Motivation

As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.

## Technical Details

This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.

## Test Plan

Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.

## Test Result

I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 23:43:03 +00:00
Max Podkorytov
b8def2c724 [rocm-libraries] ROCm/rocm-libraries#5041 (commit 481aecc)
[CK] Precompute SpaceFillingCurve indices to reduce compile
 time by 31% (#5041)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Optimize `SpaceFillingCurve` in CK to reduce compile time by
precomputing all index values into a static constexpr lookup table.

### Problem
- `GetIndex<N>` was instantiated separately for every index value (0 to
NumAccesses-1)
- Each instantiation triggered nested `static_for` loops with O(N²)
template depth
- This caused **34,000+ template instantiations** taking **69 seconds**
in frontend

### Solution
- Add `IndexLookupTable<NumAccesses, nDim>` to store all precomputed
indices
- Add `compute_single_index()` helper using O(N) `static_for` loops
- Add `compute_all_indices()` to build entire table in one constexpr
evaluation
- `GetIndex<N>` becomes simple array lookup: `return index_table[N]`

### Results (conv2d_fwd_xdl_nhwc_kyxc_nhwk_f16_instance.cpp)

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total compile time | 120.4s | 83.6s | **-31%** |
| Frontend time | 88.7s | 52.6s | **-41%** |
| GetIndex instantiations | 34,176 | 384 | **-99%** |
| GetIndex time | 69.0s | 0.11s | **-99.8%** |
| SpaceFillingCurve time | 75.7s | 4.3s | **-94%** |

## Test plan
- [x] Builds successfully with `-Werror -Weverything`
- [ ] Run existing unit tests
- [ ] Verify numerical correctness on sample kernels

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 19:41:40 +00:00
Márton Bidlek
683865895e [rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138)
Proof of concept for removing forward declarations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:

9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:

MyDeviceOp implements GetInstanceString(). This is typically defined
directly in the class definition (no forward declaration).

GetInstanceString() calls instance_string<MyDeviceOp>()

instance_string<MyDeviceOp>() calls
InstanceTraits<MyDeviceOp>::instance_string()

InstanceTraits has a specialization for MyDeviceOp which implements
instance_string()

So order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. In order to do that, we are currently using
aforementioned forward declaration.

## Technical Details

C++'s lazy template evaluation is used by calling into an as-of-yet
undefined function static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.

## Test Plan

The results were verified by running the existing regression tests for
CK Builder

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-09 16:35:26 +00:00
Christopher Millette
e2ce0cad54 [rocm-libraries] ROCm/rocm-libraries#4673 (commit ec385da)
Compile-time optimize threadwise slice transfer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Profiling with `-ftime-trace` on representative translation units (e.g.,

`device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`)
revealed
that **92% of frontend time was spent in template instantiation**. The
primary
bottleneck was redundant instantiation of identical helper logic across
multiple
threadwise transfer class variants.

Each `ThreadwiseTensorSliceTransfer_v*` class independently contained
its own
copy of the same helper computations — serpentine traversal, coordinate
stepping, thread scratch descriptors, lambda-like functors, and
compile-time
constants — duplicated across 13 header files. When a typical GEMM or
convolution kernel TU includes blockwise operations (e.g.,
`blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants
simultaneously, causing the compiler to instantiate the same helper
logic
multiple times with the same template arguments.

This was compounded by the helpers being defined as members of the outer
`ThreadwiseTensorSliceTransfer_v*` classes, which carry 14+ template
parameters.
Functions like `ComputeForwardSweep` depend only on their two argument
types,
but as inline members of the outer class, the compiler was forced to
create
separate instantiations for every unique combination of all outer
parameters
(data types, descriptors, vector widths, etc.) — even when most of those
parameters had no effect on the helper's output.

## Technical Details

### The Fix: Shared Helper Struct Hierarchy

Duplicated logic was extracted into a standalone helper hierarchy in
`threadwise_tensor_slice_transfer_util.hpp`:

```
ThreadwiseTransferHelper_Base          (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor,
|                                       ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple)
+-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex,
|                                       ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable)
+-- ThreadwiseTransferHelper_SFC       (ComputeSFCCoordinateResetStep)
```

Each helper method is now parameterized **only by what it actually
uses**:

- `ComputeForwardSweep(idx, lengths)` — parameterized only by the two
argument
  types, not by `SrcData`, `DstData`, `SrcDesc`, etc.
- `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by
the
  descriptor and access sequence types.
- `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector,
DimAccessOrder>()` — parameterized only by the four values it actually
needs.

This reduces template instantiation work in two ways:
1. **Across different transfer variants** (v3r1 vs v3r2 vs v3r1_gather):
the
compiler reuses a single instantiation instead of creating one per
variant.
2. **Across different outer class instantiations** (fp16 vs bf16 vs
int8): the
compiler reuses the helper instantiation because the helper doesn't
depend
   on the data type at all.

### Refactored Headers

**13 headers** now delegate to the shared helpers instead of duplicating
logic:
- Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant
- SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter
- Dead code removed: v4r1, v5r1

### Additional Fixes Found During Refactoring

- Two latent bugs in v3r2 (`forward_sweep` indexing,
`GetDstCoordinateResetStep` extraction)
- Dead `SrcCoordStep` variables in v4r1 and v5r1
- Unused `scale_element_op_` member in v3r1_dequant (restored with note)

### Net Code Change

+1,428 / -2,297 lines (~870 lines removed).

## Test Plan

### Unit Tests

28 host-side gtests in
`test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp`
covering the full helper hierarchy:

| Suite | Tests | What is verified |
|-------|-------|------------------|
| ThreadwiseTransferHelperBase | 6 | Compile-time constants,
inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false
in 2D and 3D |
| ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep`
(even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete),
`ComputeDataIndex`, `ComputeCoordinateResetStep`,
`VectorSizeLookupTable`, `VectorOffsetsLookupTable` |
| ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep` —
single access, 2D row-major, 2D column-major, 3D batch, even/odd inner
access counts |
| ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive
from Base, are not related to each other |
| DetailFunctors | 4 | `lambda_scalar_per_access`,
`lambda_scalar_step_in_vector`,
`lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) |

### Semantic Equivalence

GPU ISA comparison using `--cuda-device-only -S` confirmed identical
assembly
output (modulo `__hip_cuid_*` metadata) between baseline and refactored
code.

## Test Results

All measurements on a 384-core machine, `-j64`, freshly rebooted,
near-idle.

### Targeted Builds (affected targets only)

| Target | Baseline | Refactored | Wall-clock Delta | CPU Delta |
|--------|----------|------------|-----------------|-----------|
| `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU |
6m 53s / 161m CPU | **-9.7%** | **-14.9%** |
| `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU |
6m 42s / 182m CPU | **-31.8%** | **-10.0%** |
| **Combined** | **17m 27s / 392m CPU** | **13m 35s / 344m CPU** |
**-22.2%** | **-12.4%** |

### Full Project Build (8,243 targets)

| Metric | Baseline | Refactored | Delta |
|--------|----------|------------|-------|
| Wall-clock | 103m 38s | 111m 56s | +8.0%* |
| CPU time | 4705m 7s | 4648m 17s | **-1.2%** |

\*Wall-clock inflated by external load spike during refactored build
(load 90 vs 66). CPU time is the reliable metric.

### Context

~15% of all build targets (1,262 / 8,243) transitively include the
modified
headers. These are primarily GEMM and convolution kernel instantiations
— the
core compute workloads. The 12-15% CPU savings on affected targets is
diluted
to 1.2% across the full project because 85% of targets are unaffected.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-06 16:27:59 +00:00
lalala-sh
b0c13f3124 [rocm-libraries] ROCm/rocm-libraries#5094 (commit d4548e6)
[CK] use int64 for ptr offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

When the number of experts (E) is large (e.g., E=257 in DeepSeek-V3),
the `expert_id * expert_stride` calculation in MOE GEMM kernels
overflows `int32` (`index_t`), causing the weight matrix (B) pointer to
wrap to an invalid address and triggering a GPU memory access fault.

For example, with `N=1024, K=7168, IsInputGemm=true`:
- `expert_stride = N * K * 2 = 14,680,064`
- `INT32_MAX / expert_stride ≈ 146`
- Any `expert_id >= 147` causes overflow → negative offset → illegal
memory access → GPU crash

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: amd-shiraz <shiraz.ali@amd.com>
2026-03-06 02:01:03 +00:00
Max Podkorytov
1dd47118e2 [rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)
Add generate_identity_sequences helper and replace lambdas
 with named functors (#4828)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Add `generate_identity_sequences<N>()` helper that returns
`Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>`
- Replace lambdas with named functors in `transform_tensor_descriptor`
- Add `unpack_and_merge_sequences` helper functor
- Reduces `transform_tensor_descriptor` instantiations from 388 to 32
(92% reduction)

## Motivation

Multiple call sites use `generate_tuple([](auto i) { return
Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda
instantiations.

Additionally, each lambda in `transform_tensor_descriptor` creates a
unique closure type, causing the function to be instantiated separately
for every call site. Named functors share a single type, so the compiler
reuses the same instantiation.

## Changes

### Part 1: generate_identity_sequences helper
- Replaces common lambda pattern for generating identity sequences
- Each lambda expression creates a unique closure type, causing separate
template instantiations at every call site
- Named helper shares a single type across all uses

### Part 2: Named functors in transform_tensor_descriptor
- Add `unpack_and_merge_sequences` helper to replace lambda in
`GetNumOfHiddenDimension`
- Use `generate_identity_sequences` in `matrix_padder.hpp`

## Test Plan

- [x] Added 7 unit tests:
  - 4 tests for `generate_identity_sequences`
  - 3 tests for `unpack_and_merge_sequences`
- [ ] Waiting for full CI

## Related PRs

This PR merges the functionality from:
- ROCm/composable_kernel#3588 (generate_identity_sequences helper)
- ROCm/composable_kernel#3589 (Named functors in
transform_tensor_descriptor)

Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times)

**Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and
ROCm/composable_kernel#3589, which can be closed once this is merged.
2026-02-28 20:11:11 +00:00
Bartłomiej Kocot
ef82340e05 [rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2)
[CK] Port non-grouped convolution instances to the grouped
 kernels (#4875)

## Motivation

Port non-grouped convolution instances to the grouped kernels to
deprecated older non-grouped implementations.

## Technical Details

Add the same instances as non-grouped but using grouped kernel.

## Test Plan

test_grouped_convnd_fwd

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-724
2026-02-28 01:25:33 +00:00
kabrahamAMD
5e06874aae [rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory
 (#4582)

This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
2026-02-27 03:06:29 +00:00
Yung-sheng Tu
75aea70c2c [rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 00:28:58 +00:00
Zoltán Lakatos
a32d704d89 [rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for
 rdna4 (#4425)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Add support for grouped gemm multi ABD fixed NK. MR

## Technical Details

Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned apparently for xdl this feature was missing tests
so our assumption is either there is an implemenetation bug or these
instances were not set up correctly. Has the potential for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.

Fix epilogue selection in device implementation that caused unit test
failures

## Test Plan

Covered by added unit tests

## Test Result

CI successfully passing

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 05:17:08 +00:00
Illia Silin
1915cdfcc2 [rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in
 GridwiseGemm_wmma_cshuffle_v3" (#4762)

Reverts ROCm/rocm-libraries#4638
unfortunately, this PR interfered with the PR#4299 and caused build
errors for gfx11:

In file included from
/rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from
/rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:

/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21:
error: no matching function for call to 'CheckValidity'
  553 |                 if(!GridwiseGemm::CheckValidity(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-02-20 22:41:34 +00:00
linqunAMD
29781f2ac4 [rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3
 (#4638)

1. Add GetEstimateVgprCount to estimate the VGPR usage in
GridwiseGemm_wmma_cshuffle_v3
2. Add IsValidCompilationParameter to disable kernel which use too many
vgprs.
- Currently, the threashold is AvailableVgprCount * 1.25
3. Modify examples to avoid test is disabled on gfx11

It is port from internal repo
PR[#192](https://github.com/ROCm/composable_kernel/issues/192)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-20 15:57:18 +00:00
Márton Bidlek
7b97e197ef [rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk
instance library using for WMMA.

The implementation is based on the existing
DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level
structure, but replaces the XDL kernel with a WMMA-based one. It uses
the GridwiseGemm_wmma_cshuffle_v3 kernel.

At this stage, the focus is functional correctness and compatibility,
not performance tuning.

## Technical Details

- Device struct for grouped gemm fixed NK
- Example code for the WMMA version
- Unit tests for both new wmma implementation and the reference XDL code
(previously missing)
- Generic ck profiler interface with the purpose of calling unit tests.

## Checklist

Please put an into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [x] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run  on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-19 08:13:46 +00:00
Jan Patrick Lehr
9c2dd2941b [rocm-libraries] ROCm/rocm-libraries#4419 (commit e241f8b)
[CK] Work around staging compiler lifetime warning

## Motivation
The staging compiler enables lifetime-safety warnings and we already
worked around a few of them.
This works around a few more instances that came up recently on gfx950
builds.
The initial PR that resolved most issues:
https://github.com/ROCm/composable_kernel/pull/3640

## Technical Details
This follows the pattern to locally ignore the newly added
lifetime-safety warnings that were moved from experimental to production
in upstream LLVM.
As a result, CK turned them on and treats them as errors, which prevents
the staging compiler from building CK.

## Test Plan

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-12 22:12:57 +00:00
lalala-sh
dae352e8dc [rocm-libraries] ROCm/rocm-libraries#4282 (commit 2050f93)
add memsetasync for ck moe splitk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

add memsetasync for ck moe splitk to fix

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-12 17:45:52 +00:00
Christopher Millette
e1e2f7ac2e [rocm-libraries] ROCm/rocm-libraries#4447 (commit 6d08a99)
[CK] Optimize multi-dimensional static for loop decomposition
 (#4447)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
Recursive template implementations might initially seem attractive to
minimize necessary coding.

Unfortunately, this style is often affects readability and requires
significant resources from the compiler to generate instantiation
chains. In "high-traffic" code (e.g., used in many places + compilation
units), this generally does not scale well and can bloat the overall
compile times to unnecessary lengths.

The aim of this PR is to take some of most high-traffic utility code and
try our best to eliminate recursive templates in favor of fold
expansions and constexpr function helpers.

In local tests with clang build analyzer,
device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp
showed high hit-rates on slow template instantiations in static_for,
dimensional static_for (static_ford), which are subsequently affected by
implementation of the Sequence class and associated transforms.

Example:
**** Templates that took longest to instantiate:
70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1... (372 times, avg 188 ms) // **70 seconds!**

The above is part of the implementation of static_for which uses
Sequence classes..

## Technical Details

### Summary of Optimization Techniques

| Technique | Used In | Benefit |
 |-----------|---------|---------|
| __Constexpr for-loop computation__ | sequence_reverse_inclusive_scan,
sequence_map_inverse | Moves O(N) work from template instantiation to
constexpr evaluation |
| __Pack expansion with indexing__ | sequence_reverse, Sequence::Modify
| Single template instantiation instead of recursive |
| __Flat iteration + decomposition__ | ford, static_ford | O(1) template
depth instead of O(N^D) |
| __Pre-computed strides__ | index_decomposer | Enables O(1)
linear-to-multi-index conversion |

### Impact on Compile Time

These optimizations reduce template instantiation depth from O(N) or
O(N^D) to O(1), which:

1. Reduces compiler memory usage
2. Reduces compile time exponentially for deep instantiation chains
3. Enables larger iteration spaces without hitting template depth limits

## Test Plan

* Existing tests for Sequence are re-used to affirm correctness
* Unit tests for ford and static_ford are added (dimensional looping)
* 8 new regression tests specifically verify the fixes for the PR
feedback:

  - `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford
  - `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford
- `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford
consistency
  - `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford
  - `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford
  - `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford
  - `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford
- `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with
non-trivial ordering

## Test Result
### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized)

#### Commits in Range (8 commits)

1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and
sequence_reverse
2. `7a7e3fdeef` - Optimize sequence_map_inverse
3. `92855c9913` - Optimize ford and static_ford calls to eliminate
nested template recursion
4. `88a564032b` - Add unit tests for ford and static_ford
5. `1a0fb22217` - Fix clang-format
6. `8a0d26bddf` - Increase template recursion depth to 1024
7. `dc53bb6e20` - Address copilot feedback and add regression tests
8. `477e06861d` - Increase bracket depth to 1024

#### Build Timing Results

| File | Base (8b72bc8759d9 | HEAD(a0438bd398) | Improvement |
|------|------|------|-------------|
| grouped_conv2d_fwd (f16) -j1 | 313.31s | 272.93s | __12.9% faster__ |
| grouped_conv1d_fwd (bf16) -j1 | 79.33s | 68.61s | __13.5% faster__ |
| grouped_conv1d_bwd_weight (f16) -j1| 15.77s | 14.31s | __9.2% faster__
|
| device_grouped_conv2d_fwd_instance -j64 | s | s | __% faster__ |

#### Key Optimizations

1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1)
template depth
2. __sequence_map_inverse__: O(N) → O(1) template depth
3. __ford/static_ford__: O(N^D) → O(1) template depth using flat
iteration with index decomposition
4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial
orderings

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-11 22:13:15 +00:00
Christopher Millette
04eddbc5ce [rocm-libraries] ROCm/rocm-libraries#4471 (commit 10fa702)
[CK] Optimize vector type build times

**Supercedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI
issues on import**

## Proposed changes

Build times can be affected by many different things and is highly
attributed to the way we write and use the code. Two critical areas of
the builds are **frontend parsing** and **backend codegen and
compilation**.

### Frontend Parsing
The length of the code, the include header tree and macro expansions all
affect the front-end parsing time.
This PR seeks to reduce the parsing time of the dtype_vector.hpp
vector_type class by reducing redundant code by generalization.
* Partial specializations of vector_type for native and non-native
datatypes have been generalized to one single class, consolidating all
of the data initialization and AsType casting requirements into one
place.
* The class nnvb_data_t_selector (e.g., Non-native vector base dataT
selector) class has been removed and replaced with scalar_type
instantiations as they have the same purpose. Scalar type class' purpose
is already to map generalized datatypes to native types compatible with
ext_vector_t.

### Backend Codegen
Template instantiation behavior can also affect build times. Recursive
instantiations are very slow versus concrete instantiations. The
compiler must make multiple passes to expand template instantiations so
we need to be careful about how they are used.
* Previous vector_type classes declared a union storage class, which
aliases StaticallyIndexedArray<T,N>.
```
template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
    using d1_t = T;
    typedef T d2_t __attribute__((ext_vector_type(2)));
    typedef T d4_t __attribute__((ext_vector_type(4)));

    using type = d4_t;

    union
    {
        d4_t d4_;
        StaticallyIndexedArray<d1_t, 4> d1x4_;
        StaticallyIndexedArray<d2_t, 2> d2x2_;
        StaticallyIndexedArray<d4_t, 1> d4x1_;
    } data_;
   ...
};
```
* Upon further inspection, StaticallyIndexedArray is built on-top of a
recursive Tuple concatenation.
```
template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
    using type =
        typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
                              typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};
```
This union storage has been removed from the vector_type storage class.

* Further references to StaticallyIndexedArray have been replaced with
StaticallyIndexedArray_v2, which is a concrete implementation using
C-style arrays.
```
template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
    ...

    T data_[N];
};
```

### Fixes
* Using bool datatype with vector_type was previously error prone. Bool,
as a native datatype would be stored into bool ext_vector_type(N) for
storage, which is a packed datatype. Meaning that for example,
sizeof(bool ext_vector_type(4)) == 1, which does not equal
sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4> == 4. The
union of these datatypes has incorrect data slicing, meaning that the
bits location of the packed bool do not match with the
StaticallyIndexedArray member. As such, vector_type will use C-Style
array storage for bool type instead of ext_vector_type.
```
template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));

sizeof(NativeVectorT<bool, 4>) == 1  (1 byte per 4 bool - packed)
element0 = bit 0 of byte 0
element1 = bit 1 of byte 0
element2 = bit 2 of byte 0
element3 = bit 3 of byte 0

sizeof(StaticallyIndexedArray[NativeVectorT<bool, 1>, 4] == 4  (1 byte per bool)
element0 = bit 0 of byte 0
element1 = bit 0 of byte 1
element1 = bit 0 of byte 2
element1 = bit 0 of byte 3

union{
    NativeVectorT<bool, 4> d1_t;
    ...
    StaticallyIndexedArray[NativeVectorT<bool,1>, 4] d4x1;
};

// union size == 4 which means invalid slicing!
```
* Math utilities such as next_power_of_two addressed for invalid cases
of X < 2
* Remove redundant implementation of next_pow2

### Additions
* integer_log2_floor to math.hpp
* is_power_of_two_integer to math.hpp

### Build Time Analysis

Machine:  banff-cyxtera-s78-2
Target: gfx942

| Build Target | Threads | Frontend Parse Time (s) | Backend Codegen
Time (s) | TotalTime (s) | commitId |

|---------------|---------|-------------------------|--------------------------|---------------|
2026-02-11 19:01:05 +00:00
Johannes Graner
e88f139c6c [rocm-libraries] ROCm/rocm-libraries#4271 (commit 6fce58e)
[Conv] Add NumGroupsToMerge to BwdWeight type string
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Add parameter to bwd weight V3 type string showing the number of groups
to merge. This is required for MIOpen to be properly tuned since it uses
type strings for performance database entries.

In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-11 09:08:38 +00:00
John Shumway
1af75d290e [rocm-libraries] ROCm/rocm-libraries#4277 (commit 4348901)
Add a README.md file to ck/library/util
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

I'm collecting information about our current testing (#3664). As part of
this work I a README to the directory to emphasize the GPU-first testing
strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does
not need ROCm documentation review.
2026-02-10 21:27:27 +00:00
Bartłomiej Kocot
ea6363ad78 [rocm-libraries] ROCm/rocm-libraries#4399 (commit 331512e)
[CK] Fix grouped conv fwd transform for merged groups

## Motivation

[CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.

## Technical Details

After optimizations for 2d there is a lack of implementation for 1d and
3d

## Test Plan

test_grouped_convnd_fwd

## Test Result

pending CI

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-09 15:37:36 +00:00
Ville Pietilä
57d26db844 [rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504)
[CK] Add fwd conv group merging to v3 conv instances
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Added conv group merging to the (universal) V3 fwd conv pipeline. The
new instance improves fwd conv performance when the number of
input/output channel per group is low.

On MI300 (`gfx942`) we get

| CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1
| 3.86035 | 8.36796 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1
| 10.1867 | 13.4677 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1
| 11.7875 | 16.3657 |
2026-02-08 11:35:56 +00:00
Enrico Degregori
984a3d1828 [rocm-libraries] ROCm/rocm-libraries#4372 (commit 738ffd7)
[CK] Workaround blockscale wp test failure

## Motivation

Workaround to fix blockscale wp test failure for pipeline v3

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-07 00:09:58 +00:00
Illia Silin
4dd4869fbf [rocm-libraries] ROCm/rocm-libraries#4361 (commit 37a74ef)
[CK]  a bunch of CI fixes.

## Motivation

Fixing some of the CK CI issues

## Technical Details

fixing paths to dockerfiles and scripts;
moving codegen tests to separate stage (collides with main build since
you must call cmake from same folder but different options);
fixing a couple of clang compilation issues with staging compiler;
2026-02-06 01:07:34 +00:00
Illia Silin
569640dc70 Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)" (#3705)
This reverts commit 301eb5cf08.
2026-02-03 09:52:14 -08:00
Max Podkorytov
3f04d27b68 Remove concrete performance numbers from BUILD_TIME_OPTIMIZATION.md (#3702)
Replace specific benchmark numbers with qualitative descriptions since
measurements vary across environments and may become outdated.

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-03 03:54:18 -07:00
Illia Silin
8b56ffb6ae Fix one more lifetimebound error. (#3703)
* fix staging compiler errors

* fix clang format
2026-02-02 18:25:56 -08:00
Zoltán Lakatos
301eb5cf08 Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)
* device struct implementation

* added xdl grouped multi abd fixed nk testing

* wmma implementation fixed

* avoid unnecessary device mem allocation and code cleanups

* cleanup instances definitions

* wmma examples added

* code cleanups

* fix clang format

* typo and compilation fixes related to reference gemm

* fix compilation error due to std::remove_cvref_t

* added missing hip_check_error includes

* correction to example instances

* review commentes addressed

* removed split-k from testing

* code formatting

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 13:58:11 -08:00
Jan Patrick Lehr
069500464d [Compiler] Addressing new compiler warnings (#3640)
* [Compiler] Addressing new compiler warnings

Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.

The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.

* Update some more instances

* Adds file-level ignores via clang diagnostic pragma

The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.

It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.

* This adds the remaining instances

For a build on gfx90a.

* fix clang format

* Adding couple more instances from gfx1200 build

* Fixed another few instances

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 09:39:48 -08:00
ApoorvaKalyani
70d71b1514 Test fix for gemm_b_scale_xdl_v3. (#3674) 2026-01-30 10:34:54 -07:00
Kiefer van Teutem
2377a62837 Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)
* Adding remaining flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu

* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.

* Do not build f8 / bf8 only flavor tests on RDNA3

* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.

* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.

* adding int8 and fp16 overloads to the elementwise operations

* fixed copilot nits

* Addressing review comments:

- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files

* clang-format

* reduced no of tests

---------

Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
2026-01-30 17:02:14 +01:00
Zoltán Lakatos
565fea2645 fix undefined behaviour in softmax kernel (#3683)
Co-authored-by: root <zoltan.lakatos@streamhpc.com>
2026-01-30 15:22:54 +08:00
Enrico Degregori
f16d9100e4 Multi AB support for wave transfer (#3578)
* Add multi AB support to wave transfer

* Improviments to multi ABD examples

* Add instances and use intrawave v1 instead of interwave

* Apply changes to other transfers

* Wave transfer: add support for multiple internal vgpr buffers

* Fix compilation error gfx11
2026-01-29 10:29:40 -08:00
Johannes Graner
fabac7e2c3 [Conv] Enable bwd weight splitk autodeduction with cap (#3656)
* Enable bwd weight splitk autodeduction with cap

* Fix error threshold calculations

* Add missing logic to wmma multiple d kernel

* Fix threshold calculation

* Update test with new applicability
2026-01-29 17:40:28 +00:00
Bartłomiej Kocot
83b58bb0c3 Grouped Conv Bwd Weight Direct Load (#3648)
* Grouped Conv Bwd Weight Direct Load

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

* Implement group merging for bwd_weight and add instances

* Link direct load instances

* builder fixes

* fix

* fixes

* fix

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
2026-01-28 15:31:54 -06:00
Robin Voetter
42048bdb7d [CK_BUILDER] Integrate CKB validation with CK verification (#3649)
* ck-builder: tensor copy function

This function copies one tensor to another, so that the memory
layout can be changed between them.

* ck-builder: fix ck::bhalf literals

These types don't work properly.

* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it

This reduces the amount of duplicated code a bit.

* ck-builder: add flat tensor iterator

This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.

* ck-builder: integrate validation with ck gpu verification

By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.

* ck-builder: add check_by_accumulations

This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.

* ck: fix test_gpu_verification GenerateRandomData for bhalf

is_integer_it<bhalf_t> yields true, but it is not actually
an integer.

* ck: make gpu_verification kernels be proper persistent kernels

Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.

* ck: clean up gpu_verification.hpp using block_reduce

This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.

* ck-builder: doc typos

* ck-builder: update testing readme with validation interface.

* ck-builder: rebase fixes + review comments

* ck-builder: fix device integer generation with float types

Passing bfloat here causes a nans due to type_convert performing
a bitcast.

* ck: another bhalf_t bug

CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
2026-01-28 17:41:02 +01:00
linqunAMD
23cefda140 [ck] add gridwise base class for in all xdl kernel (#186) (#3544)
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
2026-01-27 12:49:47 -08:00
Michał Kulikowski
b737f1dee5 [CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263)
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-27 10:48:16 -08:00
Illia Silin
b26cb596b0 fix some syntax errors (#3658) 2026-01-27 09:59:39 -08:00
Max Podkorytov
b66597ed96 Add build time optimization documentation (#3608)
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:

- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations

These techniques can significantly reduce build times for template-heavy code.
2026-01-27 06:07:27 -07:00
Johannes Graner
c190d8d61f [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible
2026-01-27 09:49:42 +01:00
Enrico Degregori
2e49b6b2f7 Padding support for wave transfer (#3537)
* Add padding support with transpose

Also move check before writing storing is_src_valid during reading

* Add/modify instances to use wave transfer for gemm universal

Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer

* Fix clang format

* Modify example

* Fix bwd data

* Add restriction for wave transfer with padding and transpose

Add test case which shows this limitation

* Fix validity checks 8 bit types

* Add validity check gemm_bias_add_reduce

* Add validity check grouped gemm tile loop

* Fix validity checks new flavours

* Minor fixes

* Fix clang format
2026-01-26 12:57:09 -08:00
yinglu
8942a19d5e ck: add CK_USE_GFX950 macro (#3636) 2026-01-26 11:38:45 -08:00
chris-tsiaousis-hpc
917f35553a Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617)
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation

This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-26 10:20:30 -08:00
Max Podkorytov
de59c0716c Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585)
This change significantly improves compile-time performance by reducing template
instantiation depth for sequence generation and merging operations:

Optimizations:
- sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using
  __make_integer_seq to generate indices in a single step, then applying the
  functor via pack expansion
- uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq
  with a helper that applies a constant value via pack expansion
- sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction
  strategy. Added direct concatenation specializations for 1-4 sequences to
  avoid recursion in common cases, falling back to binary tree merging for 5+
  sequences

Documentation:
- Added extensive inline comments explaining why sequence_merge cannot achieve
  O(1) depth like sequence_gen (requires computing cumulative sequence lengths
  from heterogeneous inputs, inherently requiring recursion)
- Documented the binary tree reduction approach and why it's superior to fold
  expressions for this use case

Testing:
- Added comprehensive unit tests for uniform_sequence_gen with different values,
  sizes, and edge cases
- Added tests for sequence_gen with custom functors (double, square, identity,
  constant) to verify the new implementation works with arbitrary functors
- Added tests for sequence_merge with 4, 5, and many sequences to verify both
  the direct concatenation path and binary tree reduction path
- Added tests for empty sequence edge cases
2026-01-26 10:08:55 -08:00
Ville Pietilä
7ac3794284 Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639)
Co-authored-by: Ville Pietilä <>
2026-01-25 13:42:23 +01:00
chris-tsiaousis-hpc
e1c46ff548 Remove code duplications in batched gemm wmma (#3580)
* Moved device struct for batched gemm wmma to a common file

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Use the common device struct in the scaled batched gemm wmma implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: Remove unused includes and ambiguous comment

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved pointer offset calculation and gridwise argument to common struct

This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common"

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-23 12:39:03 -08:00