39 Commits

Author SHA1 Message Date
Johannes Graner
58475d3f45 [rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)
[CK Tile] StreamK support for Bwd Weight grouped convolutions
 (#5393)

## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 09:18:14 +00:00
Ville Pietilä
e2f5ab8000 [rocm-libraries] ROCm/rocm-libraries#5237 (commit ef10dc6)
[CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK
 Tile profiler (#5237)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.

## Technical Details

Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.

## Test Plan

Added units tests for CK Builder that verify the two-stage kernel
construction.

## Test Result

If CI passes, the added unit tests are passing.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-13 01:21:08 +00:00
John Shumway
9f47b8a63d [rocm-libraries] ROCm/rocm-libraries#5284 (commit 76b5b15)
[CK_BUILDER] Add
 DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)

Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.

## Motivation

As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.

## Technical Details

This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.

## Test Plan

Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.

## Test Result

I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 23:43:03 +00:00
Márton Bidlek
683865895e [rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138)
Proof of concept for removing forward declarations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:

9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:

MyDeviceOp implements GetInstanceString(). This is typically defined
directly in the class definition (no forward declaration).

GetInstanceString() calls instance_string<MyDeviceOp>()

instance_string<MyDeviceOp>() calls
InstanceTraits<MyDeviceOp>::instance_string()

InstanceTraits has a specialization for MyDeviceOp which implements
instance_string()

So order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. In order to do that, we are currently using
aforementioned forward declaration.

## Technical Details

C++'s lazy template evaluation is used by calling into an as-of-yet
undefined function static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.

## Test Plan

The results were verified by running the existing regression tests for
CK Builder

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-09 16:35:26 +00:00
Adam Osewski
22de6a19d9 [rocm-libraries] ROCm/rocm-libraries#4750 (commit c065793)
[CK_BUILDER] ck builder conv transfer fix

## Motivation

This PR fixes how CK Builder is validating transfer vector size and adds
proper validation for LDS transfer vector size as well.

## Changes:

* [__source vector dim__] -- Before this PR the data transfer validation
logic didn't allow to set the source vectorized dimension to 1. However
there are CK instances that are doing this when the group merging is
used. This is used only for
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle` kernel.
* [__valid vector size__] -- Before this PR the validation logic
concerned only single instruction maximum vector size. However our
buffer loading logic has implemented support for loading more values
through multiple buffer instructions. This again was discovered to be
used in some of the convolution instances. Thus this behavior was
reflected in validation logic.
* [__valid LDS vector size__] -- Before this PR the LDS vector size
validation was done in the same way as VMEM. This PR adds proper LDS
vector size validation based on the available LDS instruction sizes.

## Test Plan

Run CK BUILDER conv fwd factories tests

## Test Result

All CK BUILDER conv fwd factories work (except DL one & ck tile since
they're not yet added now)

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-27 13:48:58 +00:00
kabrahamAMD
5e06874aae [rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory
 (#4582)

This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
2026-02-27 03:06:29 +00:00
Ville Pietilä
57d26db844 [rocm-libraries] ROCm/rocm-libraries#4273 (commit 591f504)
[CK] Add fwd conv group merging to v3 conv instances
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Proposed changes

Added conv group merging to the (universal) V3 fwd conv pipeline. The
new instance improves fwd conv performance when the number of
input/output channel per group is low.

On MI300 (`gfx942`) we get

| CK prof command | Baseline (TFLOPS) | V3 group merging (TFLOPS) |
|:-----|:------:|------:|
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1
| 3.86035 | 8.36796 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1
| 10.1867 | 13.4677 |
| grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1
| 11.7875 | 16.3657 |
2026-02-08 11:35:56 +00:00
Robin Voetter
e33f15709f ck-builder: fix test related to changed xdl bwd cshuf v3 interface (#3677)
Force merging because I verified this fix manually:

git checkout develop
git pull
ninja smoke-builder (failed to build, as expected)
git checkout rvoetter/ckb-fix
ninja smoke-builder (passed!)
2026-01-29 07:15:56 -08:00
Bartłomiej Kocot
83b58bb0c3 Grouped Conv Bwd Weight Direct Load (#3648)
* Grouped Conv Bwd Weight Direct Load

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

* Implement group merging for bwd_weight and add instances

* Link direct load instances

* builder fixes

* fix

* fixes

* fix

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
2026-01-28 15:31:54 -06:00
kabrahamAMD
d6cccf6093 [CK_BUILDER] Add reflection for wmma and bwd weight instances to ck builder reflection (#3592)
* added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp

* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle

* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3

* added reflection of max_transpose parameters

* fix printing of std optional parameters

* fix use of undefined ck::index

* added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle

* added xdl two stage instance to reflection

* added additional variables

* added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3,

* added reflection for device_grouped_conv_bwd_weigh_wmma_cshuffle_v3

* added reflection for bwd_weight_wmma_cshuffle

* added comments back in

* add printed output for optional parameters

* update README

* fix typo

* added num_gemm_k_prefetch_stage and small fixes

* modified test string due to reflection of new parameter

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
2026-01-28 09:33:45 -07:00
Robin Voetter
cc75948d1c [CK_BUILDER] conv bwd weight testing (#3618)
* ck-builder: restructure testing conv

In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.

* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp

This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.

* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3

Turns out that the supplied layout isn't actually supported...

* ck-builder: ck and reference conv integration for bwd weight

* ck-builder: ck bwd weight execution test

* ck-builder: ckt::run support for ck-tile bwd weight

* ck-builder: ck tile bwd weight execution test

* ck-builder: extra debug printing in MatchesReference

* ck-builder: make ckt::run return RunResult

This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.

* ck-builder: RunResult matcher

Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.

* ck-builder: doc fixes

* ck-builder: add missing headers
2026-01-26 23:50:15 +01:00
Bartłomiej Kocot
0727e85e52 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes
2026-01-19 22:29:01 -07:00
John Shumway
5122637215 [CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547)
* Factor helpers out of conv_traits.hpp

* Create a non-templated conv_traits struct

* Migrate to new instance-specific instance_to_conv_traits functions

* Clean up reflection concepts

* Clean up ConvTraits helpers

* Update testing for convolution traits

This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.

* Address reviewer comments and resolve merge conflict
2026-01-15 10:03:21 +01:00
Ville Pietilä
9908a87c31 [CK_BUILDER] Add bwd weight factories (#3509)
* Add placeholder test.

* Initial conv bwd weight factory.

* Conv builder test refactoring.

* Add missing pieces to bwd weight factory.

* Improve compile time erros message when no matching factory is found.

* Use amcro to ensure automatic macthing between concepts are their string representations.

* Improve compile time diagnostics.

* Small improvements.

* Improve missing member/wrong type compile-time errors.

* Improve compile time diagnostics.

* Concept bug fixes.

* Remove debug assert.

* Update algorithm signature diagnostics.

* Factory bug fixes.

* First functional version of bwd weight conv factory.

* Refactor handing of GEMM-K batch template parameter in conv bwd weight factory.

* Concept improvements.

* Improve concept diagnostics.

* Introduve a common size type for concepts.

* Update compiletime diagnostics to use the size type.

* Update conv specialization enum.

* Fix fwd conv builder tests.

* Fix smoke tests.

* Separate bwd weigth and bwd data tests into separate targets.

* Clean-up CK Tile builder tests.

* Add bwd weight XDL CShuffle V3 factory.

* Build conv bwd weigth v3 instances successfully.

* Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Test fix.

* Add instance traits for bwd weight algorithms.

* Add unit tests for instance strings.

* Build new instance traits unit tests but exclude WMMA for now.

* Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Conv bwd weight DL factory.

* Final implementation for bwd weight DL factory.

* Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance.

* Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle

* Treat ref algorithm the same way as real algorithms in the dispatcher.

* Refactor large tensor support and WMMA configuration.

* Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3.

* Update Readme.

* Fix WMMA bwd weight tests.

* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3.

* Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle.

* Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle.

* Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3

* Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and  compute types for input and output tensor in bwd weigth convs.

* Fix fwd factories after refactoring.

* clang-format

* Move compile-time diagnostics to a separate branch.

* Fix ref algorithm dispatching.

* Fix smoke tests.

* clang-format

* Fix factory for regular WMMA conv bwd weight.

* Clarify builder Readme.

* Remove obsolete test file.

* Fix test after merge.

* clang-format

* Remove the C++26 extensions.

* Unify conv elementwise ops and layout definitions for fwd and bwd directions.

* Remove old layout and elementwise ops.

* Unify handling of conv tensor types between fwd and bwd directions.

* Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank.

* Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms.

* clang-format

---------

Co-authored-by: Ville Pietilä <>
2026-01-13 18:12:38 +02:00
Robin Voetter
e3884bbf05 [CK_BUILDER] Debug utilities (#3528)
* ck-builder: make toString to_string

We are using snake case for CK-Builder

* ck-builder: add debug.hpp with tensor descriptor printing function

This adds some initial functionality to debug.hpp, a header which will
be used to house some debug utilities.

* ck-builder: abstract nd-iteration

Abstracting this makes it easier to test, clearer, and allows us to
use it elsewhere (such as in debug.hpp soon)

* ck-builder: tensor printing

* ck-builder: rename INT32 to I32

This makes it more in line with the other data type definitions.
2026-01-08 10:14:13 +01:00
John Shumway
a7d6b1e700 Add unit test coverage for conversion to convolution traits (#3515)
Our concept-base conversions are fragile and too complex. We want to refactor to straightforward functions
for each intance trace class template. This change adds unit test coverage to make that refactoring safer.
2026-01-07 07:44:21 -08:00
Robin Voetter
1c433c64ec [CK_BUILDER] Integrate reference conv with testing (#3511)
* ck-builder: explicitly delete forward declarations

Before, these functions were seen as a forward declaration for an existing function.
If no actual implementation overload could be found, these would be selected and
a linker error or warning would be generated. By marking these functions as explicitly
deleted, they incorrect invocations are generated as compile error instead.

* ck-builder: ckt::run plumbing for reference conv

This implements the ckt::run plumbing for the reference convolution
implementation and sets up the first complete end-to-end test.

* ck-builder: make validation system check for all-zeros

When both the actual and reference output are both all zero bits,
there is probably something wrong in the test framework.

* ck-builder: proper implementation+tests for TensorDescriptor::is_packed

* ck-builder: fix typos
2026-01-06 09:29:06 +01:00
Robin Voetter
e6e7dc2910 [CK_BUILDER] validation (#3471)
This pull request builds on #3267 by proving the "validation" infrastructure, the means to compare a set of `Outputs`.

The design of the validation infrastructure is relatively straight forward:
- Each SIGNATURE should come with a `validate()` implementation, which should be implemented in a similar way that the other functions/types from `testing.hpp` are implemented.
- `validate()` returns a `ValidationReport`, which is a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Note that crucially, `validate()` should not do any reporting by itself. Rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this clue code for CK-Builder itself in `testing_utils.hpp`, its `MatchesReference()`. This functionality is relatively barebones right now, it will be expanded upon in a different PR to keep the scope of this one down.

The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors and with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor.
- For that it was useful that the `TensorDescriptor` has a rank which is known at compile-time, so I've changed the implementation of `TensorDescriptor` for that. I felt like it was a better approach than keeping it dynamic, for multiple reasons:
  - This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
  - We know already know the rank of tensors statically, as it can be derived from the SIGNATURE.
  - It simpifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
2026-01-05 04:57:34 -08:00
Max Podkorytov
e339101e9c [CK-Tile] move out memory operation from cshuffle epilogue class (#3359)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder

* hacky poc

* fix tile engine

* kill the lambda

* maybe fix test build

* more fixes

* clang-format

* save temp

* clang-format

* mostly fix examples

* clang-format

* remove dead code

* more cleanup

* fix fmha bwd build (default epilogue set/add appears to be broken)

* fix default epilogue tests but not correctness

* clang-format

* fix bquant

* clang-format

* cleanup dead code

* rearrange make windows for readability

* restore changes to IsSupportedArgument

* fix smoke-builder

* clang-format

* fixup rename class

* build fixes

* clang-format

* fix builder

* fixup

* remove set from builder tests

* fix test

* clang-format

* re-refactor the kernels

* clang-format

* fix header license

* remove memory operation from conv bwd test

* clang-format

* clang-format example,include

* clang-format test

* build fixes

* clang-format

* solve compilation error

* fix the CI

* solve compilation error

* clang format

* solve merge conflict

* solve merge conflict

* solve the gfx11 error

* solve test error

* moar build fixes

* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-04 03:28:14 -08:00
kabrahamAMD
f86bbb1aef [CK_Builder] [testing] Integrate device random generators (#3427)
Implemented device random number generators for ck tensors.
Includes tests and integration to ck builder testing interface.
2025-12-30 10:03:05 -08:00
Bartłomiej Kocot
cbc8335964 Improve XDL to WMMA porting for grouped conv fwd (#3456)
Refactors the way the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, especially to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles. 
The changes use MXdlPerWave instead of NXdlPerWave to increase number of waves per M dim.
2025-12-19 15:58:51 -07:00
John Shumway
9ac51aa0f4 Add describe() method to device ops for runtime introspection (#3375)
Introduces a polymorphic describe() method to BaseOperator that enables runtime introspection of kernel configurations through a unified interface.

Key changes:

* Add virtual describe() method to BaseOperator returning Description objects
* Implement describe() in 6 device operation classes (conv fwd/bwd variants)
* Create conv_describe.hpp with factory function for ConvDescription
* Extract type definitions to conv_types.hpp to resolve circular dependencies
* Add InstanceStringDescription for kernels without full ConvDescription support

Other Improvements:

* Update tests to use describe() instead of GetInstanceString()
* Remove circular dependency include from conv_traits.hpp
* Add ODD_C to ConvFwdSpecialization enum and fix OddC mapping
* Replace silent fallback in conv_layout() with compile-time error

This provides a foundation for runtime kernel introspection and better tooling support for analyzing and debugging kernel configurations.
2025-12-14 12:49:12 -08:00
Robin Voetter
6219b12730 [CK_BUILDER] convolution testing (#3267)
* Add README.md for testing

* Add tensor_memory_manager.

* ck-builder: tensor memory manager rebase fixes

This fixes some issues caused by the API being changed recently.
Also, this streamlines the ckt namespace to always be ck_tile::builder::test,
as this is already being used by other tests

Really, this commit should be squashed into the previous,
but I'm keeping it separate for brevity.

* ck-builder: test arguments initial prototype

* ck-builder: test system initial prototype

* ck-builder: fix non-standardized copyright comments

* ck-builder: new prototype

* ck-builder: group testing inputs/outputs into a separate structure

This is basically the return of the tensor memory manager after all,
except that the design is more closely tied to the actual operation.
Using a struct allows us to add additional input/output tensors
without breaking code (by defaulting those new parameters). Note
that the tensors are split into a separate inputs/outputs because we
usually want to allocate the output _twice_: once for the real
computation and once for the reference computation.

* ck-builder: simplify prototype naming; start docs

* ck-builder: update testing readme

* ck-builder: testing documentation

* ck-builder: HipStatusMatcher

This matcher can be used to check HIP status codes and provide
nice and readable error messages.

* ck-builder: tensor_buffer.hpp tests

* ck-builder: conv_fwd.hpp tests

* ck-builder: add example end-to-end test in conv fwd 2d fp16

* ck-builder: simplify extent usage

* ck-builder: update testing doc

* ck-builder: skip end to end test on non-gfx9

* fix check_copyright_year interpreter

/bin/bash is not guaranteed to exist on Linux. Signed,
a NixOS user

* ck-builder: fix copyrights

* ck-builder: reduce conv fwd testing size

This test allocated 24GB of memory, too much for 16GB cards.

---------

Co-authored-by: John Shumway <jshumway@amd.com>
2025-12-13 15:33:41 +01:00
Ville Pietilä
d66e5f667c [CK_BUILDER] Improve CK Builder and CK Builder tests (#3382)
* Remove stale documentation.

* Add placeholder for conv algorithm design description. Add link to conv factory description.

* Improve testing transfer parameters.

* Python script to check the block tilings.

* Improve tests and conv types serialization.

* Change representation of boolean values from 1/0 to true/false in instance strings.

* Change representation of boolean values from 1/0 to true/false in conv algorithm types.

* Test code improvements.

* Improve covn descriptions tests.

* Improve conv signature definition in conv fwd builder tests.

* clang-format.

* Remove obsolete script.

* Revert StaticAssertTypeEq changes in conv layout tests.

* Remove obsolete using declaration.

---------

Co-authored-by: Ville Pietilä <>
2025-12-11 09:50:00 +02:00
Bartłomiej Kocot
04612c30ce [CK_BUILDER] Ck Tile Grouped convolution factory (#3352)
* [BUILDER] Ck Tile Grouped convolution factory

* Part 2

* Fixes after rebase

* Remove leftovers
2025-12-08 10:32:56 +01:00
Ville Pietilä
9cb1f421bc [CK_BUILDER] Refactor convolution signature to provide data type/layout/elementwise op per tensor (#3331)
* Separate layouts into separate entities for input, weight, and output tensors.

* Add test for handling bias tensor layouts.

* Use instance string in builder tests.

* Add handling of output bias data types and layouts.

* Generalize handling of the elementwise ops.

* Test fix.

* Create builder for layouts.

* Layout builder improvements.

* Improve layout builder.

* Simplify bias layout handling.

* Code clean-up.

* Move layout utils into separate file.

* Remove hard-coded layout combinations.

* Small code clean-up.

* Move data type utils into a separate file.

* Add data types, layouts, and elementwise ops per conv tensor.

* Builder bug fixes after refactoring.

* Working baseline.

* Make signature definition look nice in the test code.

* Move TensorConfig into test implementations.

* Fix all fwd conv builder tests.

* Fix conv traits and descriptors tests.

* More factory assets under a separate directory.

* Fix building conv traits.

* Fix clang-format.

* Add Readme doc to describe the design.

* Add link to main Readme. Fix links in the builder design doc.

* Clean-up data type/layout/elementwise op conversions.

* Switch from dimension and tensor type specific layouts to a flat list of tensor layouts.

* Fix clang-formatting.

* Fix clang-format for test code.

* Simplify fwd conv signature definitions in the test code.

* Remove accidental edits.

* Fix comment string.

* Fix instance factory after rebase.

* Fix tests after rebase.

* Unify layout handling.

* Add more conv layout unit tests.

* Clang-format.

* Fix merge conflicts.

* Improve elementwise op handling.

---------

Co-authored-by: Ville Pietilä <>
2025-12-04 12:58:31 +02:00
John Shumway
1bc7529977 Guard a builder test to avoid gfx11 and gfx12 (#3268)
We're getting a compile error on gfx11 and gfx12 for an I8 test that doesn't have a supported WMMA implmentation. We'll need to build architecture support into the builder, but to get things green I'm just adding an ifndef guard around the test.
2025-11-24 10:10:09 -08:00
John Shumway
f38c3de9f9 Fix copyright messages in experimental/builder. (#3253)
Our copyright were were mostly correct, but we inconsistently used (C) instead of (c) like the rest of the CK code. This PR fixes that (using lowercase c) and adds a missing copyright header to one file.
2025-11-20 17:40:55 -08:00
Robin Voetter
245c6011cf ck-builder: group transfer operations per tensor (#3217)
Grouping transfer operations per tensor makes it easier to
constrain on and operate with the transfer operations. As an
example, we can now deduplicate the logic for translating
the transfer operations from the ck-builder interface to the old
ck interface for the A and B tensors.
2025-11-20 10:40:48 -08:00
Robin Voetter
7fe7aa76f5 [CK_BUILDER] fixes (#3222)
* ck-builder: some miscellaneous fixes
* ck-builder: fix InstanceSet.FromFactory test

The exact syntax that the instance string functionality
returns has changed. This commit updates the test to expect
the right string.
2025-11-19 09:05:25 -08:00
John Shumway
ad57f6ef0b [CK_BUILDER] Put global CK functions in an the CK namespace (#3232)
* Wrap ck host utitlies in CK namespace.

The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.

Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.

There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate  kernels from either template library.

* Add using declarations to test code.

After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.

* Add using declarations to client examples.
2025-11-19 11:23:02 +01:00
Ville Pietilä
7d57bc169f [CK_BUILDER] Forward convolution builder improvements (#3179)
Proposed changes
Improve the forward convolution builder implementation and addressed leftover feedback left from PR #3138. Main changes

Refactored tests such that they reflect better the builder pattern. The templates and types for the convolution algorithm concepts are created via factory that facilitates programmatic creation of the device op instances.
Moved tests into anonymous namespace.
The convolution factory had lot of if-else constructs when CK Builder types were converted into CK library types. I had initially trouble in using static_assert in the default branch of switch as the static_assert was evaluated at compile time even for valid types. However, if we change the static_assert to throw "<error message>", it will result in a compile-time error only if the default branch is actually hit. This assumes that the function is consteval. Hence, changed all conversions in the convolution factory to use switch, which is more intuitive.
Removed the explicit device op definition from convolution signature and the corresponding predicate file. The device ops are defined by the corresponding concepts. This allowed to remove lot of boilerplate code from the convolution factory.
Adde inheritance and convolution algorithm specialization to handle device ops that are specialization of a more generic ones. The large tensor support is more naturally expressed by this pattern.
Added support for the FP8 data type.

* WIP: Builder for expected test results.

* Improve ckb fwd conv instance tests.

* clang-format

* Change if-else statements into switch in conv factory.

* Fix clang-formatting.

* Removed unnecessary includes.

* Added missing copyright.

* Remove explicit device op flag from from convolution signature.

* Add missing concept.

* Fix build.

* clang-format

* Add test for building conv fwd FP8 instances.

* Add missing header to instance traits.

* Clean-up recently added instances.

* Introduce inheritance and specialization.

* Use builder to build conv algorithm templates and types.

* clang-format

* Fix conv description tests.

---------

Co-authored-by: John Shumway <john.shumwayjr@gmail.com>
2025-11-13 08:47:25 -08:00
JH-Leon-KIM-AMD
5f3cae3e28 [CK_BUILDER]ckb add remining fwd conv device ops (#3155)
* Add device operation to conv signature. Use unions to hold conv layouts and device operations.

* Add predicates for all device op instances.

* Use the device op signature for validation.

* Fix ckb CMakeLists.txt file for tests.

* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.

* Fix clang-formatting.

* add device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk

* Add full DL configurability with Option A implementation

- Added 5 DL descriptor structs (39 configurable parameters)
- Added 10 C++20 concepts for type-safe validation
- Updated factory to read all parameters from descriptors
- Updated test helper to populate all descriptors
- All tests passing (13/13 including 3 new DL tests)

* Add factory and test support for DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor

- Add factory specialization for Large_Tensor device operation (conv_factory.hpp lines 1145-1265)
- Add macro collision workaround using pragma push/pop (conv_factory.hpp lines 43-51)
- Add test helper function run_test_DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor
- Add builder test file test_ckb_conv_fwd_2d_large_tensor_fp16.cpp with 2 test cases
- Update CMakeLists.txt to include new test file
- Reuse existing ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle descriptor
- Map all 42 template parameters identical to regular XDL CShuffle
- All 15 builder tests passing including 2 new Large_Tensor tests

Completes Task 350: All 4 forward convolution device operations now supported in CK Builder.

* Update copyright headers to new format

- Change copyright format to: Copyright (C) Advanced Micro Devices, Inc., or its affiliates.
- Reorder headers: Copyright first, then SPDX-License-Identifier
- Updated files:
  * experimental/builder/test/conv/test_ckb_conv_fwd_2d_dl_fp16.cpp
  * experimental/builder/test/conv/test_ckb_conv_fwd_2d_large_tensor_fp16.cpp
  * experimental/builder/include/ck_tile/builder/device_op_types.hpp

* fix c++ 18 format

* Fix clang-format-18 error in device_op_types.hpp

---------

Co-authored-by: Ville Pietilä <ville.pietila@amd.com>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
2025-11-06 16:29:48 -08:00
Adam Osewski
b8527a9236 [CK_BUILDER] Convolution traits. (#3152)
Added:

1. Convolution traits & unit tests
2. Update builder enumerators to have representation of Convolution Kernels properties.
3. Unified builder pipeline version & scheduler enumerators
2025-11-05 08:53:06 -08:00
John Shumway
0be0288f58 [CK_BUILDER] Update copyright messages. (#3150)
* Update copyright messages.

Copyright messages should no longer include a year. This PR updates all 38 source files to the new format.

* Switch to (C) from unicode copyright symbol.

The unicodein comments  was causing compilation errors.
2025-11-04 15:35:16 +01:00
Ville Pietilä
3ae3992c18 [CK_BUILDER] Add conv factories for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle and DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (#3138)
* Add device operation to conv signature. Use unions to hold conv layouts and device operations.

* Add predicates for all device op instances.

* Use the device op signature for validation.

* Fix ckb CMakeLists.txt file for tests.

* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.

* Fix clang-formatting.

* Add factory for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle device op.

* Add conv factory for  DeviceGroupedConvFwdMultipleD_Wmma_CShuffle

* Rename elements per wave per shuffle member in the epilogue concept.

* clang-format

* Add concepts and types for optional device op template parameters.

* Add optional compute, direct load, and loop scheduler arguments to conv factory.

* Add number of groups to merge template parameter.

* clang-format.
2025-11-03 09:03:25 +02:00
Ville Pietilä
b387249fd9 [CK_BUILDER] Generalize convolution factory to build arbitrary device operations. (#3116)
Generalize the current convolution factory in CK Builder to be able to build instances of any relevant convolution device operation. The main changes are:

* Added new enums FwdGroupConvDeviceOperation, BwdDataGroupConvDeviceOperation, and * BwdWeightGroupConvDeviceOperation that contain the device operations for which the builder should be able to build instances.
* Create a union structure GroupConvDeviceOp that can represent a single value of the fwd, bwd weight, or bwd data device operations. This would be more naturally represented by std::variant object, but we cannot use std::variant in NTTPs because it is not a structural object.
* Introduced a new member device_operation in the ConvSignatureDescriptor concept that assumes GroupConvDeviceOp value.
* Added predicates to be used in creation ConvFactory specialization for the different device operation. When we add support for a new device operation, we'll just create a new ConvFactory specialization with appropriate predicates.
* Changed handling of the convolution layouts (GroupConvLayout1D, GroupConvLayout2D, GroupConvLayout3D) to use the union based handling, i.e., there's now a GroupConvLayout union struct that can hold a single value of the 1D, 2D, or 3D layouts. This simplifies the handling of the different layouts as we get rid of templatized convolution signature.

These code changes allow developers to work more easily in parallel when adding new device operations.

* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.

* Fix clang-formatting.
2025-10-30 16:13:58 -07:00
Ville Pietilä
13e13ce359 [CK_BUILDER] Clean-up fwd conv builder implementation (#3110) 2025-10-28 20:37:33 -07:00
Ville Pietilä
6c2ca1211a [CK_BUILDER] First fwd convolution builder implementation (#3070)
* Add experimental builder infrastructure for composable_kernel

- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.

This establishes the directory structure  for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.

* Fix clang formatting.

* Fix CMake build infrastructure for experimental builder

- Add experimental/builder CMakeLists.txt with proper subdirectory structure
- Add placeholder include/ck_tile/builder CMakeLists.txt for header installation
- Fix gtest.cmake to use include_guard to prevent multiple inclusions
- Update root CMakeLists.txt to include full builder directory instead of just tests

* Scope C++20 settingto the test code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove redundant GTest::gtest linkage

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Introduce basic types, and convolution algorithm concepts and limits.

* Add convolution signature concepts.

* Add convolution factory.

* Finalize conv factory implementation for fwd convolutions.

* Add type definitions for testing.

* Add placeholder test.

* Add convolution builder definition.

* Fully functional fwd conv builder.

* Test improvements.

* Clean-up include headers.

* Enable the limit checks for the convolution algorithm parameters.

* Remove dead code.

* clang formatting.

* Add more tests and missing conv specialization argument.

* clang formatting.

* Add explicit handling of the tensor layouts.

* Add complete 2D/3D layout support to CK Builder

  - Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW
  - Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK
  - Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support
  - Add 3 tests for new 2D/3D layouts
  - All tests pass (5/5)

* Add tests for remaining 2D/3D layouts

  - Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0
  - Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last)
  - All 7 tests pass (complete coverage for all 2D/3D forward layouts)

* Change enum converters to consteval.

* 7 tests with pipeline and specialization| Test # | Dim | Type | Layout               | Pipeline | Specialization          |
  |--------|-----|------|----------------------|----------|-------------------------|
  | 1      | 2D  | BF16 | NHWGC_GKYXC_NHWGK    | V1       | DEFAULT                 |
  | 2      | 2D  | FP16 | GNHWC_GKYXC_GNHWK    | V3       | FILTER_1X1_PAD0         |
  | 3      | 2D  | FP32 | NGCHW_GKCYX_NGKHW    | V4       | FILTER_1X1_STRIDE1_PAD0 |
  | 4      | 2D  | BF16 | NHWGC_GKYXC_NHWGK    | V5       | FILTER_3x3              |
  | 5      | 3D  | FP32 | NGCDHW_GKCZYX_NGKDHW | V1       | FILTER_1X1_PAD0         |
  | 6      | 3D  | BF16 | GNDHWC_GKZYXC_GNDHWK | V3       | DEFAULT                 |
  | 7      | 3D  | FP16 | NDHWGC_GKZYXC_NDHWGK | V4       | FILTER_1X1_PAD0         |

* Add missing convolution layouts and provide better compile-time error in instance traits.

* Fix clang formatting.

* Changed I8 -> S8.

* Fix signature.

* Rename concepts and corresponding members.

* Rename LDS related parameters.

* Remove ODD_C specialization. Add V2 pipeline.

* Add missing types.

* Add elementwise operation to the conv signature.

* Improve compile-time error message for unsupported elementwise ops.

* Separate different fwd conv builder tests into separate compilation units.

* Fix layout to string and add name to old CK PassThrough elementwise op.

* Enable both CK and CK Tile tensor layouts in instance traits.

* Fix clang-format.

---------

Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: JH-Leon-KIM-AMD <jeonghyun.kim@amd.com>
2025-10-27 20:09:24 +02:00