Commit Graph

839 Commits

Author SHA1 Message Date
Max Podkorytov
1dd47118e2 [rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)
Add generate_identity_sequences helper and replace lambdas
 with named functors (#4828)

## Summary

- Add `generate_identity_sequences<N>()` helper that returns
`Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>`
- Replace lambdas with named functors in `transform_tensor_descriptor`
- Add `unpack_and_merge_sequences` helper functor
- Reduces `transform_tensor_descriptor` instantiations from 388 to 32
(92% reduction)
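
As a rough illustration of the first bullet, a helper of this shape can be written with a single pack expansion. This is a minimal sketch, not the code from the PR: `Sequence` and `Tuple` below are simplified stand-ins for the CK types, and `detail::make_identity_sequences` is a hypothetical name.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Simplified stand-ins for CK's Sequence and Tuple (assumptions for illustration).
template <int... Is>
struct Sequence
{
};

template <typename... Ts>
using Tuple = std::tuple<Ts...>;

namespace detail {
// Hypothetical helper: expands 0..N-1 into Tuple<Sequence<0>, ..., Sequence<N-1>>
// with one pack expansion, so no lambda is involved.
template <std::size_t... Is>
constexpr auto make_identity_sequences(std::index_sequence<Is...>)
{
    return Tuple<Sequence<static_cast<int>(Is)>...>{};
}
} // namespace detail

template <std::size_t N>
constexpr auto generate_identity_sequences()
{
    return detail::make_identity_sequences(std::make_index_sequence<N>{});
}

static_assert(std::is_same_v<decltype(generate_identity_sequences<3>()),
                             Tuple<Sequence<0>, Sequence<1>, Sequence<2>>>);
```

Because the helper is an ordinary function template keyed only on `N`, every call site that asks for the same `N` reuses the same instantiation.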

## Motivation

Multiple call sites use the `generate_tuple([](auto i) { return
Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda
instantiations.

Additionally, each lambda in `transform_tensor_descriptor` creates a
unique closure type, causing the function to be instantiated separately
for every call site. Named functors share a single type, so the compiler
reuses the same instantiation.
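
The closure-type effect can be checked in isolation. The sketch below uses a hypothetical function template `apply_twice` in place of `transform_tensor_descriptor`; only the language rule (each lambda expression has a distinct type) is taken as given.

```cpp
#include <type_traits>

// Hypothetical stand-in for a heavily used function template:
// the compiler generates one instantiation per distinct type F.
template <typename F>
constexpr int apply_twice(F f, int x)
{
    return f(f(x));
}

// Named functor: every call site that passes add_one passes the same type,
// so apply_twice<add_one> is instantiated once.
struct add_one
{
    constexpr int operator()(int x) const { return x + 1; }
};

// Two textually identical lambdas still have two distinct closure types,
// so they produce two separate instantiations of apply_twice.
inline constexpr auto lambda_a = [](int x) { return x + 1; };
inline constexpr auto lambda_b = [](int x) { return x + 1; };

static_assert(!std::is_same_v<decltype(lambda_a), decltype(lambda_b)>,
              "each lambda expression has its own closure type");

static_assert(apply_twice(lambda_a, 0) == 2);
static_assert(apply_twice(lambda_b, 0) == 2);
static_assert(apply_twice(add_one{}, 0) == 2);
```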

## Changes

### Part 1: generate_identity_sequences helper
- Replaces common lambda pattern for generating identity sequences
- Each lambda expression creates a unique closure type, causing separate
template instantiations at every call site
- Named helper shares a single type across all uses

### Part 2: Named functors in transform_tensor_descriptor
- Add `unpack_and_merge_sequences` helper to replace lambda in
`GetNumOfHiddenDimension`
- Use `generate_identity_sequences` in `matrix_padder.hpp`

## Test Plan

- [x] Added 7 unit tests:
  - 4 tests for `generate_identity_sequences`
  - 3 tests for `unpack_and_merge_sequences`
- [ ] Waiting for full CI

## Related PRs

This PR merges the functionality from:
- ROCm/composable_kernel#3588 (generate_identity_sequences helper)
- ROCm/composable_kernel#3589 (Named functors in
transform_tensor_descriptor)

Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times)

**Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and
ROCm/composable_kernel#3589, which can be closed once this is merged.
2026-02-28 20:11:11 +00:00
Bartłomiej Kocot
ef82340e05 [rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2)
[CK] Port non-grouped convolution instances to the grouped
 kernels (#4875)

## Motivation

Port non-grouped convolution instances to the grouped kernels to
deprecate the older non-grouped implementations.

## Technical Details

Add the same instances as non-grouped but using grouped kernel.

## Test Plan

test_grouped_convnd_fwd

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-724
2026-02-28 01:25:33 +00:00
kabrahamAMD
5e06874aae [rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory
 (#4582)

This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
2026-02-27 03:06:29 +00:00
Yung-sheng Tu
75aea70c2c [rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4

## Proposed changes

Summary:

- Modified implementation for grouped_gemm_fixed_nk_bias
- FP16 WMMA examples
- WMMA instances
- Profiler for grouped_gemm_fixed_nk_bias
- Add WMMA instances to existing tests

**This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299
and should be merged after it.
Only the last 6 commits are in the scope of this PR.**

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers
with understanding the motivation
- [x] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-26 00:28:58 +00:00
Zoltán Lakatos
a32d704d89 [rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for
 rdna4 (#4425)

## Motivation

Add support for grouped gemm multi ABD fixed NK. MR

## Technical Details

Changes from the reverted PR:
- Device struct for grouped gemm with multiple ABD and fixed NK
(DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK).
- Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD
- Unit tests for both new wmma implementation and the reference xdl code
(previously missing)
- Note: Some Xdl instances were commented out because of unit test
failures. As mentioned, this feature apparently had no tests for xdl,
so our assumption is that either there is an implementation bug or these
instances were not set up correctly. This is a candidate for a follow-up
issue.
- Generic ck profiler interface with the purpose of calling unit tests.
- Gemm instances with specific elementwise operations for gemm bias gelu
calculations.
- Added class for grouped gemm multi ABD reference calculations.

Fix epilogue selection in device implementation that caused unit test
failures

## Test Plan

Covered by added unit tests

## Test Result

CI successfully passing

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-25 05:17:08 +00:00
Illia Silin
1915cdfcc2 [rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in
 GridwiseGemm_wmma_cshuffle_v3" (#4762)

Reverts ROCm/rocm-libraries#4638
Unfortunately, this PR interfered with PR #4299 and caused build
errors for gfx11:

In file included from
/rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7:
In file included from
/rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11:

/rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21:
error: no matching function for call to 'CheckValidity'
  553 |                 if(!GridwiseGemm::CheckValidity(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
2026-02-20 22:41:34 +00:00
linqunAMD
29781f2ac4 [rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3
 (#4638)

1. Add GetEstimateVgprCount to estimate the VGPR usage in
GridwiseGemm_wmma_cshuffle_v3
2. Add IsValidCompilationParameter to disable kernels which use too many
vgprs.
- Currently, the threshold is AvailableVgprCount * 1.25
3. Modify examples to avoid tests being disabled on gfx11
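
The threshold rule in item 2 amounts to a single comparison. The sketch below is an assumption-laden illustration: `fits_vgpr_budget` is a hypothetical name, the inclusive comparison is a guess, and the register counts are made-up example values, not figures from the PR.

```cpp
// Hypothetical helper, not the CK API: accept a kernel when its estimated
// VGPR count is at most 1.25x the available count.
constexpr bool fits_vgpr_budget(int estimated_vgprs, int available_vgprs)
{
    // Integer form of: estimated <= available * 1.25
    // (whether the real check is inclusive is an assumption).
    return 4 * estimated_vgprs <= 5 * available_vgprs;
}

// Example values only: with 256 available registers the cut-off is 320.
static_assert(fits_vgpr_budget(320, 256));
static_assert(!fits_vgpr_budget(321, 256));
```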

It is ported from the internal repo
PR[#192](https://github.com/ROCm/composable_kernel/issues/192)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-20 15:57:18 +00:00
Márton Bidlek
7b97e197ef [rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4

## Proposed changes

This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk
instance library using WMMA.

The implementation is based on the existing
DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level
structure, but replaces the XDL kernel with a WMMA-based one. It uses
the GridwiseGemm_wmma_cshuffle_v3 kernel.

At this stage, the focus is functional correctness and compatibility,
not performance tuning.

## Technical Details

- Device struct for grouped gemm fixed NK
- Example code for the WMMA version
- Unit tests for both new wmma implementation and the reference XDL code
(previously missing)
- Generic ck profiler interface with the purpose of calling unit tests.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [x] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [x] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [x] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-19 08:13:46 +00:00
Jan Patrick Lehr
9c2dd2941b [rocm-libraries] ROCm/rocm-libraries#4419 (commit e241f8b)
[CK] Work around staging compiler lifetime warning

## Motivation
The staging compiler enables lifetime-safety warnings and we already
worked around a few of them.
This works around a few more instances that came up recently on gfx950
builds.
The initial PR that resolved most issues:
https://github.com/ROCm/composable_kernel/pull/3640

## Technical Details
This follows the pattern to locally ignore the newly added
lifetime-safety warnings that were moved from experimental to production
in upstream LLVM.
As a result, CK turned them on and treats them as errors, which prevents
the staging compiler from building CK.

## Test Plan

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-12 22:12:57 +00:00
lalala-sh
dae352e8dc [rocm-libraries] ROCm/rocm-libraries#4282 (commit 2050f93)
add memsetasync for ck moe splitk

## Proposed changes

add memsetasync for ck moe splitk to fix

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-12 17:45:52 +00:00
Christopher Millette
e1e2f7ac2e [rocm-libraries] ROCm/rocm-libraries#4447 (commit 6d08a99)
[CK] Optimize multi-dimensional static for loop decomposition
 (#4447)

## Motivation
Recursive template implementations might initially seem attractive to
minimize necessary coding.

Unfortunately, this style often affects readability and requires
significant resources from the compiler to generate instantiation
chains. In "high-traffic" code (e.g., used in many places + compilation
units), this generally does not scale well and can bloat the overall
compile times to unnecessary lengths.

The aim of this PR is to take some of the most high-traffic utility code and
try our best to eliminate recursive templates in favor of fold
expansions and constexpr function helpers.
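
To illustrate the difference, a compile-time loop can be expressed as one fold over the comma operator instead of a chain of recursive instantiations. This is a simplified sketch: the `static_for<N>(f)` interface and the helper `sum_of_indices` are assumptions made for the example and do not match CK's actual `static_for` signature.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

template <typename F, std::size_t... Is>
constexpr void static_for_impl(F&& f, std::index_sequence<Is...>)
{
    // One fold expression calls f with each index as a compile-time constant;
    // no static_for<I + 1> style recursion is instantiated.
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}

// Simplified interface (assumption): iterate over 0..N-1.
template <std::size_t N, typename F>
constexpr void static_for(F&& f)
{
    static_for_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// Hypothetical usage helper: sums the indices 0..N-1 at compile time.
template <std::size_t N>
constexpr std::size_t sum_of_indices()
{
    std::size_t total = 0;
    static_for<N>([&](auto i) { total += decltype(i)::value; });
    return total;
}

static_assert(sum_of_indices<5>() == 10);
```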

In local tests with clang build analyzer,
device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp
showed high hit-rates on slow template instantiations in static_for,
dimensional static_for (static_ford), which are subsequently affected by
implementation of the Sequence class and associated transforms.

Example:
**** Templates that took longest to instantiate:
70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1... (372 times, avg 188 ms) // **70 seconds!**

The above is part of the implementation of static_for, which uses
Sequence classes.

## Technical Details

### Summary of Optimization Techniques

| Technique | Used In | Benefit |
|-----------|---------|---------|
| __Constexpr for-loop computation__ | sequence_reverse_inclusive_scan, sequence_map_inverse | Moves O(N) work from template instantiation to constexpr evaluation |
| __Pack expansion with indexing__ | sequence_reverse, Sequence::Modify | Single template instantiation instead of recursive |
| __Flat iteration + decomposition__ | ford, static_ford | O(1) template depth instead of O(N^D) |
| __Pre-computed strides__ | index_decomposer | Enables O(1) linear-to-multi-index conversion |
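
The "flat iteration + decomposition" and "pre-computed strides" rows can be sketched as follows. The names `Index`, `make_strides`, `decompose`, `ford_sketch` and `checksum_2x3` are hypothetical, and the sketch assumes a plain row-major order; it does not model the `Orders` permutation that the real `ford`/`static_ford` support.

```cpp
#include <array>
#include <cstddef>

template <std::size_t Rank>
using Index = std::array<std::size_t, Rank>;

// Row-major strides, computed once by an ordinary constexpr loop.
template <std::size_t Rank>
constexpr Index<Rank> make_strides(const Index<Rank>& lengths)
{
    Index<Rank> strides{};
    std::size_t running = 1;
    for(std::size_t d = Rank; d-- > 0;)
    {
        strides[d] = running;
        running *= lengths[d];
    }
    return strides;
}

// Convert a flat counter into a multi-index: one divide and one modulo per dimension.
template <std::size_t Rank>
constexpr Index<Rank> decompose(std::size_t linear,
                                const Index<Rank>& lengths,
                                const Index<Rank>& strides)
{
    Index<Rank> idx{};
    for(std::size_t d = 0; d < Rank; ++d)
    {
        idx[d] = (linear / strides[d]) % lengths[d];
    }
    return idx;
}

// Visit the whole iteration space with a single flat loop instead of Rank nested loops.
template <std::size_t Rank, typename F>
constexpr void ford_sketch(const Index<Rank>& lengths, F&& f)
{
    const Index<Rank> strides = make_strides(lengths);
    std::size_t total         = 1;
    for(std::size_t d = 0; d < Rank; ++d)
    {
        total *= lengths[d];
    }
    for(std::size_t i = 0; i < total; ++i)
    {
        f(decompose(i, lengths, strides));
    }
}

// Usage: accumulate 10 * row + col over a 2x3 space.
constexpr std::size_t checksum_2x3()
{
    std::size_t sum = 0;
    ford_sketch(Index<2>{2, 3}, [&](const Index<2>& idx) { sum += 10 * idx[0] + idx[1]; });
    return sum;
}

static_assert(make_strides(Index<3>{2, 3, 4})[0] == 12);
static_assert(checksum_2x3() == 36);
```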

### Impact on Compile Time

These optimizations reduce template instantiation depth from O(N) or
O(N^D) to O(1), which:

1. Reduces compiler memory usage
2. Reduces compile time exponentially for deep instantiation chains
3. Enables larger iteration spaces without hitting template depth limits

## Test Plan

* Existing tests for Sequence are re-used to affirm correctness
* Unit tests for ford and static_ford are added (dimensional looping)
* 8 new regression tests specifically verify the fixes for the PR
feedback:

  - `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford
  - `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford
  - `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford
    consistency
  - `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford
  - `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford
  - `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford
  - `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford
  - `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with
    non-trivial ordering

## Test Result
### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized)

#### Commits in Range (8 commits)

1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and
sequence_reverse
2. `7a7e3fdeef` - Optimize sequence_map_inverse
3. `92855c9913` - Optimize ford and static_ford calls to eliminate
nested template recursion
4. `88a564032b` - Add unit tests for ford and static_ford
5. `1a0fb22217` - Fix clang-format
6. `8a0d26bddf` - Increase template recursion depth to 1024
7. `dc53bb6e20` - Address copilot feedback and add regression tests
8. `477e06861d` - Increase bracket depth to 1024

#### Build Timing Results

| File | Base (8b72bc8759d9) | HEAD (a0438bd398) | Improvement |
|------|------|------|-------------|
| grouped_conv2d_fwd (f16) -j1 | 313.31s | 272.93s | __12.9% faster__ |
| grouped_conv1d_fwd (bf16) -j1 | 79.33s | 68.61s | __13.5% faster__ |
| grouped_conv1d_bwd_weight (f16) -j1 | 15.77s | 14.31s | __9.2% faster__ |
| device_grouped_conv2d_fwd_instance -j64 | s | s | __% faster__ |

#### Key Optimizations

1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1)
template depth
2. __sequence_map_inverse__: O(N) → O(1) template depth
3. __ford/static_ford__: O(N^D) → O(1) template depth using flat
iteration with index decomposition
4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial
orderings

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-11 22:13:15 +00:00
Christopher Millette
04eddbc5ce [rocm-libraries] ROCm/rocm-libraries#4471 (commit 10fa702)
[CK] Optimize vector type build times

**Supersedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI
issues on import**

## Proposed changes

Build times can be affected by many different things and are largely
determined by the way we write and use the code. Two critical areas of
the builds are **frontend parsing** and **backend codegen and
compilation**.

### Frontend Parsing
The length of the code, the include header tree and macro expansions all
affect the front-end parsing time.
This PR seeks to reduce the parsing time of the dtype_vector.hpp
vector_type class by reducing redundant code by generalization.
* Partial specializations of vector_type for native and non-native
datatypes have been generalized to one single class, consolidating all
of the data initialization and AsType casting requirements into one
place.
* The nnvb_data_t_selector class (i.e., the non-native vector base dataT
selector) has been removed and replaced with scalar_type
instantiations, as they serve the same purpose. The scalar_type class
already maps generalized datatypes to native types compatible with
ext_vector_t.

### Backend Codegen
Template instantiation behavior can also affect build times. Recursive
instantiations are very slow versus concrete instantiations. The
compiler must make multiple passes to expand template instantiations so
we need to be careful about how they are used.
* Previous vector_type classes declared a union storage class, which
aliases StaticallyIndexedArray<T,N>.
```
template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
    using d1_t = T;
    typedef T d2_t __attribute__((ext_vector_type(2)));
    typedef T d4_t __attribute__((ext_vector_type(4)));

    using type = d4_t;

    union
    {
        d4_t d4_;
        StaticallyIndexedArray<d1_t, 4> d1x4_;
        StaticallyIndexedArray<d2_t, 2> d2x2_;
        StaticallyIndexedArray<d4_t, 1> d4x1_;
    } data_;
   ...
};
```
* Upon further inspection, StaticallyIndexedArray is built on-top of a
recursive Tuple concatenation.
```
template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
    using type =
        typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
                              typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};
```
This union storage has been removed from the vector_type storage class.

* Further references to StaticallyIndexedArray have been replaced with
StaticallyIndexedArray_v2, which is a concrete implementation using
C-style arrays.
```
template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
    ...

    T data_[N];
};
```

### Fixes
* Using the bool datatype with vector_type was previously error-prone.
Bool, as a native datatype, would be stored in bool ext_vector_type(N),
which is a packed datatype. For example,
sizeof(bool ext_vector_type(4)) == 1, which does not equal
sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4. The
union of these datatypes slices the data incorrectly: the bit
locations of the packed bools do not match the
StaticallyIndexedArray members. As such, vector_type now uses C-style
array storage for the bool type instead of ext_vector_type.
```
template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));

sizeof(NativeVectorT<bool, 4>) == 1  (1 byte per 4 bool - packed)
element0 = bit 0 of byte 0
element1 = bit 1 of byte 0
element2 = bit 2 of byte 0
element3 = bit 3 of byte 0

sizeof(StaticallyIndexedArray<NativeVectorT<bool, 1>, 4>) == 4  (1 byte per bool)
element0 = bit 0 of byte 0
element1 = bit 0 of byte 1
element2 = bit 0 of byte 2
element3 = bit 0 of byte 3

union{
    NativeVectorT<bool, 4> d1_t;
    ...
    StaticallyIndexedArray<NativeVectorT<bool, 1>, 4> d4x1;
};

// union size == 4 which means invalid slicing!
```
* Math utilities such as next_power_of_two now handle the invalid cases
of X < 2
* Remove redundant implementation of next_pow2

### Additions
* integer_log2_floor to math.hpp
* is_power_of_two_integer to math.hpp
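
For orientation, helpers with these names could look like the sketch below. Only the names come from the lists above; the signatures, the sentinel returned for zero, and the choice to return 1 for inputs below 2 are assumptions, not the behaviour of the code in math.hpp.

```cpp
#include <cstdint>

// True when exactly one bit is set.
constexpr bool is_power_of_two_integer(std::uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

// floor(log2(x)) for x >= 1; returns -1 for x == 0 as a sentinel (assumption).
constexpr std::int32_t integer_log2_floor(std::uint32_t x)
{
    std::int32_t result = -1;
    while(x != 0)
    {
        x >>= 1;
        ++result;
    }
    return result;
}

// Smallest power of two >= x. Inputs below 2 are handled explicitly by
// returning 1 (assumption). Overflow for x > 2^31 is not handled in this sketch.
constexpr std::uint32_t next_power_of_two(std::uint32_t x)
{
    if(x < 2)
    {
        return 1;
    }
    return std::uint32_t{1} << (integer_log2_floor(x - 1) + 1);
}

static_assert(is_power_of_two_integer(64) && !is_power_of_two_integer(48));
static_assert(integer_log2_floor(8) == 3 && integer_log2_floor(9) == 3);
static_assert(next_power_of_two(5) == 8 && next_power_of_two(4) == 4);
```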

### Build Time Analysis

Machine:  banff-cyxtera-s78-2
Target: gfx942

| Build Target | Threads | Frontend Parse Time (s) | Backend Codegen Time (s) | TotalTime (s) | commitId |
|---------------|---------|-------------------------|--------------------------|---------------|----------|
2026-02-11 19:01:05 +00:00
Johannes Graner
e88f139c6c [rocm-libraries] ROCm/rocm-libraries#4271 (commit 6fce58e)
[Conv] Add NumGroupsToMerge to BwdWeight type string

## Proposed changes

Add parameter to bwd weight V3 type string showing the number of groups
to merge. This is required for MIOpen to be properly tuned since it uses
type strings for performance database entries.

In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered
2026-02-11 09:08:38 +00:00
John Shumway
1af75d290e [rocm-libraries] ROCm/rocm-libraries#4277 (commit 4348901)
Add a README.md file to ck/library/util

I'm collecting information about our current testing (#3664). As part of
this work I added a README to the directory to emphasize the GPU-first
testing strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does
not need ROCm documentation review.
2026-02-10 21:27:27 +00:00
Bartłomiej Kocot
ea6363ad78 [rocm-libraries] ROCm/rocm-libraries#4399 (commit 331512e)
[CK] Fix grouped conv fwd transform for merged groups

## Motivation

[CK] Fix grouped conv fwd transform for merged groups for 1d and 3d.

## Technical Details

After the optimizations for 2d, the corresponding implementation for 1d
and 3d is missing.

## Test Plan

test_grouped_convnd_fwd

## Test Result

pending CI

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-09 15:37:36 +00:00
Ville Pietilä
57d26db844 [rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504)
[CK] Add fwd conv group merging to v3 conv instances

## Proposed changes

Added conv group merging to the (universal) V3 fwd conv pipeline. The
new instance improves fwd conv performance when the number of
input/output channels per group is low.

On MI300 (`gfx942`) we get

| CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1 | 3.86035 | 8.36796 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 | 10.1867 | 13.4677 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1 | 11.7875 | 16.3657 |
2026-02-08 11:35:56 +00:00
Enrico Degregori
984a3d1828 [rocm-libraries] ROCm/rocm-libraries#4372 (commit 738ffd7)
[CK] Workaround blockscale wp test failure

## Motivation

Workaround to fix blockscale wp test failure for pipeline v3

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-07 00:09:58 +00:00
Illia Silin
4dd4869fbf [rocm-libraries] ROCm/rocm-libraries#4361 (commit 37a74ef)
[CK]  a bunch of CI fixes.

## Motivation

Fixing some of the CK CI issues

## Technical Details

fixing paths to dockerfiles and scripts;
moving codegen tests to separate stage (collides with main build since
you must call cmake from same folder but different options);
fixing a couple of clang compilation issues with staging compiler;
2026-02-06 01:07:34 +00:00
Illia Silin
569640dc70 Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)" (#3705)
This reverts commit 301eb5cf08.
2026-02-03 09:52:14 -08:00
Max Podkorytov
3f04d27b68 Remove concrete performance numbers from BUILD_TIME_OPTIMIZATION.md (#3702)
Replace specific benchmark numbers with qualitative descriptions since
measurements vary across environments and may become outdated.

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-03 03:54:18 -07:00
Illia Silin
8b56ffb6ae Fix one more lifetimebound error. (#3703)
* fix staging compiler errors

* fix clang format
2026-02-02 18:25:56 -08:00
Zoltán Lakatos
301eb5cf08 Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)
* device struct implementation

* added xdl grouped multi abd fixed nk testing

* wmma implementation fixed

* avoid unnecessary device mem allocation and code cleanups

* cleanup instances definitions

* wmma examples added

* code cleanups

* fix clang format

* typo and compilation fixes related to reference gemm

* fix compilation error due to std::remove_cvref_t

* added missing hip_check_error includes

* correction to example instances

* review comments addressed

* removed split-k from testing

* code formatting

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 13:58:11 -08:00
Jan Patrick Lehr
069500464d [Compiler] Addressing new compiler warnings (#3640)
* [Compiler] Addressing new compiler warnings

Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.

The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.

* Update some more instances

* Adds file-level ignores via clang diagnostic pragma

The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.

It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.

* This adds the remaining instances

For a build on gfx90a.

* fix clang format

* Adding couple more instances from gfx1200 build

* Fixed another few instances

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 09:39:48 -08:00
ApoorvaKalyani
70d71b1514 Test fix for gemm_b_scale_xdl_v3. (#3674)
2026-01-30 10:34:54 -07:00
Kiefer van Teutem
2377a62837 Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)
* Adding remaining flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu

* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.

* Do not build f8 / bf8 only flavor tests on RDNA3

* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.

* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.

* adding int8 and fp16 overloads to the elementwise operations

* fixed copilot nits

* Addressing review comments:

- removed unnecessary examples for dynamic op
- removed unnecessary conv specializations for all the flavors
- removed spurious bilinear and scale source files

* clang-format

* reduced no of tests

---------

Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
2026-01-30 17:02:14 +01:00
Zoltán Lakatos
565fea2645 fix undefined behaviour in softmax kernel (#3683)
Co-authored-by: root <zoltan.lakatos@streamhpc.com>
2026-01-30 15:22:54 +08:00
Enrico Degregori
f16d9100e4 Multi AB support for wave transfer (#3578)
* Add multi AB support to wave transfer

* Improvements to multi ABD examples

* Add instances and use intrawave v1 instead of interwave

* Apply changes to other transfers

* Wave transfer: add support for multiple internal vgpr buffers

* Fix compilation error gfx11
2026-01-29 10:29:40 -08:00
Johannes Graner
fabac7e2c3 [Conv] Enable bwd weight splitk autodeduction with cap (#3656)
* Enable bwd weight splitk autodeduction with cap

* Fix error threshold calculations

* Add missing logic to wmma multiple d kernel

* Fix threshold calculation

* Update test with new applicability
2026-01-29 17:40:28 +00:00
Bartłomiej Kocot
83b58bb0c3 Grouped Conv Bwd Weight Direct Load (#3648)
* Grouped Conv Bwd Weight Direct Load

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

* Implement group merging for bwd_weight and add instances

* Link direct load instances

* builder fixes

* fix

* fixes

* fix

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
2026-01-28 15:31:54 -06:00
Robin Voetter
42048bdb7d [CK_BUILDER] Integrate CKB validation with CK verification (#3649)
* ck-builder: tensor copy function

This function copies one tensor to another, so that the memory
layout can be changed between them.

* ck-builder: fix ck::bhalf literals

These types don't work properly.

* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it

This reduces the amount of duplicated code a bit.

* ck-builder: add flat tensor iterator

This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.

* ck-builder: integrate validation with ck gpu verification

By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.
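
A minimal sketch of the idea, not the code from this commit: a small strided view exposes `operator[]` like a flat pointer, so a comparison routine templated over "pointer-like" types accepts raw pointers and the view alike. `FlatIterator2D`, `count_mismatches`, and the sample arrays are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Hypothetical 2-D strided view that can be indexed like a flat pointer.
template <typename T>
struct FlatIterator2D
{
    T* data;
    std::size_t cols;       // logical row length
    std::size_t row_stride; // elements between consecutive rows in memory
    std::size_t col_stride; // elements between consecutive columns in memory

    // Map a flat logical index to (row, col), then to a memory offset.
    constexpr T& operator[](std::size_t flat) const
    {
        const std::size_t r = flat / cols;
        const std::size_t c = flat % cols;
        return data[r * row_stride + c * col_stride];
    }
};

// Generic over pointer-like types: only operator[] is required.
template <typename ItA, typename ItB>
constexpr std::size_t count_mismatches(ItA a, ItB b, std::size_t n)
{
    std::size_t bad = 0;
    for(std::size_t i = 0; i < n; ++i)
        if(!(a[i] == b[i]))
            ++bad;
    return bad;
}

// The same logical 2x3 matrix stored row-major and column-major.
constexpr int kRowMajor[6] = {1, 2, 3, 4, 5, 6};
constexpr int kColMajor[6] = {1, 4, 2, 5, 3, 6};

// A raw pointer and the strided view compare equal element by element.
static_assert(count_mismatches(kRowMajor, FlatIterator2D<const int>{kColMajor, 3, 1, 2}, 6) == 0);
```

Because the comparison only relies on indexing, neither the verification routine nor the tensor type has to know about the other's layout.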

* ck-builder: add check_by_accumulations

This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.

* ck: fix test_gpu_verification GenerateRandomData for bhalf

is_integer_it<bhalf_t> yields true, but it is not actually
an integer.
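
One plausible way such a misclassification arises, sketched under the assumption that the bf16 type is an alias of a 16-bit unsigned integer: a trait built on `std::is_integral` cannot tell the alias apart from the integer it aliases. `bhalf_alias`, `bhalf_strong`, and `looks_like_integer` are hypothetical names for this sketch, not CK identifiers.

```cpp
#include <cstdint>
#include <type_traits>

// Assumption for this sketch: bf16 values are stored as raw 16-bit patterns
// behind a plain type alias.
using bhalf_alias = std::uint16_t;

// A distinct wrapper type carrying the same bits.
struct bhalf_strong
{
    std::uint16_t bits;
};

// Naive trait: anything integral counts as an integer type.
template <typename T>
inline constexpr bool looks_like_integer = std::is_integral_v<T>;

// The alias is indistinguishable from uint16_t, so it is reported as integer.
static_assert(looks_like_integer<bhalf_alias>);
// A wrapper type is not integral and would not be misclassified.
static_assert(!looks_like_integer<bhalf_strong>);
```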

* ck: make gpu_verification kernels be proper persistent kernels

Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.
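
The sizing rule described above can be sketched in plain C++. This is an illustration under assumptions, not the commit's code: on a real device the two inputs would come from the runtime's occupancy and device-property queries, while here they are ordinary parameters, and the grid-stride loop is emulated on the host. All function names are hypothetical.

```cpp
#include <cstddef>

// Grid size for a persistent kernel: enough blocks to fill every
// multiprocessor once. Both inputs are plain parameters in this sketch.
constexpr std::size_t persistent_grid_size(std::size_t blocks_per_multiprocessor,
                                           std::size_t num_multiprocessors)
{
    return blocks_per_multiprocessor * num_multiprocessors;
}

// Host-side emulation of the grid-stride loop one block would run: returns
// how many of the n elements the given block processes.
constexpr std::size_t elements_for_block(std::size_t block_id,
                                         std::size_t grid_size,
                                         std::size_t block_size,
                                         std::size_t n)
{
    std::size_t count = 0;
    for(std::size_t base = block_id * block_size; base < n; base += grid_size * block_size)
    {
        const std::size_t remaining = n - base;
        count += remaining < block_size ? remaining : block_size;
    }
    return count;
}

// Sum over all blocks of the grid; every element should be covered once.
constexpr std::size_t elements_for_grid(std::size_t grid_size,
                                        std::size_t block_size,
                                        std::size_t n)
{
    std::size_t total = 0;
    for(std::size_t b = 0; b < grid_size; ++b)
        total += elements_for_block(b, grid_size, block_size, n);
    return total;
}

static_assert(persistent_grid_size(2, 4) == 8);
static_assert(elements_for_grid(8, 256, 10000) == 10000);
```

The grid size no longer depends on the problem size; the stride loop absorbs whatever work is left over.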

* ck: clean up gpu_verification.hpp using block_reduce

This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.
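
As a rough host-side illustration of what a generic block reduce does (not the function added in this commit), the sketch below emulates the usual tree reduction over one block's values with a caller-supplied operator. `block_reduce_emulated`, `max_op`, and `add_op` are hypothetical names, and the power-of-two block size is an assumption of the sketch.

```cpp
#include <array>
#include <cstddef>

// Emulates a tree reduction over the values held by one block's threads.
// On a GPU the inner loop runs in parallel with a barrier between strides.
template <std::size_t BlockSize, typename T, typename Op>
constexpr T block_reduce_emulated(std::array<T, BlockSize> lanes, Op op)
{
    static_assert(BlockSize > 0 && (BlockSize & (BlockSize - 1)) == 0,
                  "sketch assumes a power-of-two block size");
    for(std::size_t stride = BlockSize / 2; stride > 0; stride /= 2)
        for(std::size_t tid = 0; tid < stride; ++tid)
            lanes[tid] = op(lanes[tid], lanes[tid + stride]);
    return lanes[0];
}

// Named functors so that every reduction with the same operator shares a type.
struct max_op
{
    template <typename T>
    constexpr T operator()(T a, T b) const
    {
        return a < b ? b : a;
    }
};

struct add_op
{
    template <typename T>
    constexpr T operator()(T a, T b) const
    {
        return a + b;
    }
};

static_assert(block_reduce_emulated(std::array<int, 8>{3, 1, 4, 1, 5, 9, 2, 6}, max_op{}) == 9);
static_assert(block_reduce_emulated(std::array<int, 8>{3, 1, 4, 1, 5, 9, 2, 6}, add_op{}) == 31);
```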

* ck-builder: doc typos

* ck-builder: update testing readme with validation interface.

* ck-builder: rebase fixes + review comments

* ck-builder: fix device integer generation with float types

Passing bfloat here causes NaNs due to type_convert performing
a bitcast.

* ck: another bhalf_t bug

CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
2026-01-28 17:41:02 +01:00
linqunAMD
23cefda140 [ck] add gridwise base class for all xdl kernels (#186) (#3544)
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- To select the correct LDS layout and epilogue behavior, three additional parameters are added.
- ForceNaiveLdsLayout: disables the XOR-based LDS layout when true
- DirectLoad: the pipeline only uses direct load; the naive layout must be forced and any padding ignored on gfx9
- IsMxGemm: the epilogue has two additional dimensions
2. Move all LDS descriptor layout related functions to the base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper functions to the base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all C epilogue related code to the base class; four kinds of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
2026-01-27 12:49:47 -08:00
Michał Kulikowski
b737f1dee5 [CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263)
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-27 10:48:16 -08:00
Illia Silin
b26cb596b0 fix some syntax errors (#3658) 2026-01-27 09:59:39 -08:00
Max Podkorytov
b66597ed96 Add build time optimization documentation (#3608)
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:

- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations

These techniques can significantly reduce build times for template-heavy code.
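
A small C++17 sketch of the listed techniques, written for illustration only and not taken from the document this commit adds. All names (`recursive_sum`, `fold_sum`, `loop_sum`, `times_two`, `apply_and_sum`) are hypothetical.

```cpp
#include <cstddef>
#include <utility>

// Recursive accumulation: one class template instantiation per element,
// so instantiation depth grows linearly with the pack size.
template <std::size_t... Is>
struct recursive_sum;

template <>
struct recursive_sum<>
{
    static constexpr std::size_t value = 0;
};

template <std::size_t I, std::size_t... Rest>
struct recursive_sum<I, Rest...>
{
    static constexpr std::size_t value = I + recursive_sum<Rest...>::value;
};

// Fold expression: the whole pack is accumulated in a single instantiation.
template <std::size_t... Is>
constexpr std::size_t fold_sum()
{
    return (Is + ... + std::size_t{0});
}

// constexpr loop: ordinary iteration instead of template recursion.
constexpr std::size_t loop_sum(std::size_t n)
{
    std::size_t total = 0;
    for(std::size_t i = 0; i < n; ++i)
        total += i;
    return total;
}

// Named functor: one type shared by every call site. Each lambda expression
// would instead have its own closure type and force a new instantiation of
// any template it is passed to.
struct times_two
{
    constexpr std::size_t operator()(std::size_t x) const { return 2 * x; }
};

// Pack expansion over an index sequence, combined with a fold.
template <typename F, std::size_t... Is>
constexpr std::size_t apply_and_sum(F f, std::index_sequence<Is...>)
{
    return (f(Is) + ... + std::size_t{0});
}

static_assert(recursive_sum<1, 2, 3, 4>::value == 10);
static_assert(fold_sum<1, 2, 3, 4>() == 10);
static_assert(loop_sum(5) == 10);
static_assert(apply_and_sum(times_two{}, std::make_index_sequence<4>{}) == 12);
```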
2026-01-27 06:07:27 -07:00
Johannes Graner
c190d8d61f [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible
2026-01-27 09:49:42 +01:00
Enrico Degregori
2e49b6b2f7 Padding support for wave transfer (#3537)
* Add padding support with transpose

Also move the check to before writing, storing is_src_valid during reading

* Add/modify instances to use wave transfer for gemm universal

The condition is changed so that the vector size of vmem reading and lds
writing must both be equal to 8 in order to use the wave transfer

* Fix clang format

* Modify example

* Fix bwd data

* Add restriction for wave transfer with padding and transpose

Add test case which shows this limitation

* Fix validity checks for 8-bit types

* Add validity check gemm_bias_add_reduce

* Add validity check grouped gemm tile loop

* Fix validity checks new flavours

* Minor fixes

* Fix clang format
2026-01-26 12:57:09 -08:00
yinglu
8942a19d5e ck: add CK_USE_GFX950 macro (#3636) 2026-01-26 11:38:45 -08:00
chris-tsiaousis-hpc
917f35553a Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617)
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation

This file includes all shared components: the kernel (shared between the two implementations), the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct, and the argument struct.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-26 10:20:30 -08:00
Max Podkorytov
de59c0716c Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585)
This change significantly improves compile-time performance by reducing template
instantiation depth for sequence generation and merging operations:

Optimizations:
- sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using
  __make_integer_seq to generate indices in a single step, then applying the
  functor via pack expansion
- uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq
  with a helper that applies a constant value via pack expansion
- sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction
  strategy. Added direct concatenation specializations for 1-4 sequences to
  avoid recursion in common cases, falling back to binary tree merging for 5+
  sequences
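
The single-step generation described for sequence_gen and uniform_sequence_gen can be sketched in portable C++17. This is an approximation under assumptions: `std::make_index_sequence` stands in for the `__make_integer_seq` builtin (standard libraries commonly implement it with a compiler builtin, but that is not guaranteed), and `Seq`, `gen_impl`, `sequence_gen_t`, `uniform_sequence_gen_t`, `square`, and `constant` are hypothetical names rather than CK's.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Stand-in for a compile-time integer sequence type.
template <std::size_t... Is>
struct Seq
{
};

// Apply the functor F to every index in one pack expansion: no recursion.
template <typename F, std::size_t... Is>
constexpr auto gen_impl(std::index_sequence<Is...>) -> Seq<F{}(Is)...>
{
    return {};
}

// Generate Seq<F(0), F(1), ..., F(N-1)>.
template <std::size_t N, typename F>
using sequence_gen_t = decltype(gen_impl<F>(std::make_index_sequence<N>{}));

// Example functors; both are literal types with constexpr call operators.
struct square
{
    constexpr std::size_t operator()(std::size_t i) const { return i * i; }
};

template <std::size_t Value>
struct constant
{
    constexpr std::size_t operator()(std::size_t) const { return Value; }
};

// A uniform sequence is generation with a functor that ignores the index.
template <std::size_t N, std::size_t Value>
using uniform_sequence_gen_t = sequence_gen_t<N, constant<Value>>;

static_assert(std::is_same_v<sequence_gen_t<4, square>, Seq<0, 1, 4, 9>>);
static_assert(std::is_same_v<uniform_sequence_gen_t<3, 7>, Seq<7, 7, 7>>);
static_assert(std::is_same_v<sequence_gen_t<0, square>, Seq<>>);
```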

Documentation:
- Added extensive inline comments explaining why sequence_merge cannot achieve
  O(1) depth like sequence_gen (requires computing cumulative sequence lengths
  from heterogeneous inputs, inherently requiring recursion)
- Documented the binary tree reduction approach and why it's superior to fold
  expressions for this use case

Testing:
- Added comprehensive unit tests for uniform_sequence_gen with different values,
  sizes, and edge cases
- Added tests for sequence_gen with custom functors (double, square, identity,
  constant) to verify the new implementation works with arbitrary functors
- Added tests for sequence_merge with 4, 5, and many sequences to verify both
  the direct concatenation path and binary tree reduction path
- Added tests for empty sequence edge cases
2026-01-26 10:08:55 -08:00
Ville Pietilä
7ac3794284 Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639)
Co-authored-by: Ville Pietilä <>
2026-01-25 13:42:23 +01:00
chris-tsiaousis-hpc
e1c46ff548 Remove code duplications in batched gemm wmma (#3580)
* Moved device struct for batched gemm wmma to a common file

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Use the common device struct in the scaled batched gemm wmma implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: Remove unused includes and ambiguous comment

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved pointer offset calculation and gridwise argument to common struct

This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-23 12:39:03 -08:00
Wojciech Laskowski
81ee19bd2c WMMA grouped conv fwd large tensor extra flavors (#3582)
* Additional flavors for WMMA conv fwd large tensor

- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accommodate extra tensors

* changed strategy to handle GemmArgs array

* Adding generic instance

* Added generic instance to clamp and bias_clamp ops
2026-01-23 12:19:51 +01:00
Erwin Terpstra
d5ae81b292 Implement batched gemm add relu gemm add for rdna4 (#3391)
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation

* wip: many fixes in implementation of batched gemm gemm multiple d

* wip: batched gemm gemm multiple d gridwise op compiling, not working yet

* fix: incorrect d0 grid indexing in batched gemm gemm multiple d

* feat: add instances for batched gemm add relu gemm add

* chore: configure instance with low vector transfer size for odd sizes

* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense

* fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes

* fix: disable odd size tests on XDL archs

* chore: removed temporary logging

* chore: update some references to C tensor to E tensor

* Tentative fix for example template params

* Tentative fix for non-multi-D batched gemm gemm device impl.

* Tentative fix for xdl example template params

* Tentative fix for profiler build on gfx90a

* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions

* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply

* fix: make the gemm1 data types match what happens in the device op

* feat: add d0s/d1s datatypes and layouts to the device op type string

* chore: change element-wise op so addition happens in fp32

* chore: add static asserts for gemm0/gemm1 calculated wave sizes

* chore: also updated other element-wise ops to use fp32 calculations

* chore: log number of supported instances

* chore: update instance comment

* chore: disable kernel timing in example by default

* fix: gemm1 wave size calculation

* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions

* chore: remove increased tolerance in batched gemm gemm multiple d example

* chore: add comment explaining that verification fails for certain input values

* chore: clarify instance comment

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2026-01-20 13:06:59 -08:00
music-dino
6300ad3c62 Batched gemm softmax gemm descriptor fix (#3564)
* Add rocm to prefix path for codegen

* Fix issue with c0_matrix_mask construction
2026-01-20 07:25:30 -08:00
Wojciech Laskowski
b09121f860 WMMA support for batched_gemm_reduce (#3332)
Summary:
- added new device impl of Batched GEMM Reduce for WMMA
- added instance library
- added WMMA impl to the Batched GEMM Reduce tests
2026-01-20 10:50:46 +01:00
Bartłomiej Kocot
0727e85e52 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BUILDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes
2026-01-19 22:29:01 -07:00
Erwin Terpstra
fe40a5d139 Implement batched gemm bias permute for RDNA4 (#3534)
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)

* wip: device struct for WMMA batched contraction multiple d based on new gridwise op

* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases

* fix: failure to resolve template parameters when calling new function overload

* fix: passing reference type as parameter instead of underlying types

* fix: merge error caused duplicate definitions

* fix: make sure constness of template and parameters types match

* fix: don't compile batched contraction test on unsupported architectures

* feat: add example for new wmma implementation, and consolidate example code between platforms

* style: return inline instead of with branch

* chore: add extra assert on vector memory access sizes

* chore: clean up some unused variables

* fix: correct tail number calculation, added small cases and extra instances to the test

* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
2026-01-17 08:30:27 +01:00
logicat
fec81109f1 Remove unnecessary hip_fp16 include from stream_config (#3549) 2026-01-16 10:40:05 -08:00
Yung-sheng Tu
6df2d70143 Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp

* add examples

* add instances to test

* remove duplicate code between examples
2026-01-15 07:19:31 -08:00
John Shumway
5122637215 [CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547)
* Factor helpers out of conv_traits.hpp

* Create a non-templated conv_traits struct

* Migrate to new instance-specific instance_to_conv_traits functions

* Clean up reflection concepts

* Clean up ConvTraits helpers

* Update testing for convolution traits

This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.

* Address reviewer comments and resolve merge conflict
2026-01-15 10:03:21 +01:00