[CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)
Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.
## Motivation
As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.
## Technical Details
This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few TODOs since we still need to
generalize our convolution traits across WMMA/MFMA and CK/CKTile.
## Test Plan
Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and added longer tests that instantiate
kernels, following the existing pattern in other kernels.
## Test Result
I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Proof of concept for removing forward declarations
## Motivation
Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:
9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:
- MyDeviceOp implements GetInstanceString(). This is typically defined
  directly in the class definition (no forward declaration).
- GetInstanceString() calls instance_string<MyDeviceOp>()
- instance_string<MyDeviceOp>() calls
  InstanceTraits<MyDeviceOp>::instance_string()
- InstanceTraits has a specialization for MyDeviceOp which implements
  instance_string()
So in order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. To do that, we are currently using the
aforementioned forward declaration.
## Technical Details
C++'s lazy template evaluation is used by calling into an as-yet-undefined
static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.
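
The ordering described above can be illustrated with a minimal sketch. It assumes the device op is a class template, so the body of `GetInstanceString()` is only instantiated on first use. `MyDeviceOp<BlockSize>` and the returned string format are hypothetical stand-ins, not the actual CK definitions.

```cpp
#include <cassert>
#include <string>

// Primary template is only declared; each device op supplies a specialization later.
template <typename T>
struct InstanceTraits;

// Forwards to the traits. The call is a dependent expression, so it is
// resolved only when instance_string<T>() is actually instantiated.
template <typename T>
std::string instance_string()
{
    return InstanceTraits<T>::instance_string();
}

// Hypothetical device op (assumption: a class template, as lazy instantiation requires).
template <int BlockSize>
struct MyDeviceOp
{
    // Refers to InstanceTraits<MyDeviceOp> before any specialization exists.
    std::string GetInstanceString() const { return instance_string<MyDeviceOp>(); }
};

// Specialization placed after the device op; no forward declaration of MyDeviceOp is needed.
template <int BlockSize>
struct InstanceTraits<MyDeviceOp<BlockSize>>
{
    static std::string instance_string()
    {
        return "MyDeviceOp<" + std::to_string(BlockSize) + ">";
    }
};

// First use: both the device op and its traits specialization are in scope here.
inline void lazy_traits_example()
{
    assert(MyDeviceOp<256>{}.GetInstanceString() == "MyDeviceOp<256>");
}
```

If the specialization were missing from the translation unit at the point of first use, the call would fail to compile because `InstanceTraits<MyDeviceOp<256>>` would be an incomplete type.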
## Test Plan
The results were verified by running the existing regression tests for
CK Builder.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_BUILDER] ck builder conv transfer fix
## Motivation
This PR fixes how CK Builder validates the transfer vector size and adds
proper validation for the LDS transfer vector size as well.
## Changes:
* [__source vector dim__] -- Before this PR, the data transfer validation
logic did not allow the source vectorized dimension to be set to 1. However,
some CK instances do this when group merging is used. This applies only to the
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle` kernel.
* [__valid vector size__] -- Before this PR, the validation logic
considered only the maximum vector size of a single instruction. However, our
buffer loading logic supports loading more values through multiple buffer
instructions, and some convolution instances rely on this. The validation
logic now reflects this behavior.
* [__valid LDS vector size__] -- Before this PR the LDS vector size
validation was done in the same way as VMEM. This PR adds proper LDS
vector size validation based on the available LDS instruction sizes.
## Test Plan
Run CK BUILDER conv fwd factories tests
## Test Result
All CK BUILDER conv fwd factories work (except the DL one and CK Tile, since
they are not yet added).
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_Builder] added bwd data kernels to builder factory (#4582)
This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
Force merging because I verified this fix manually:
git checkout develop
git pull
ninja smoke-builder (failed to build, as expected)
git checkout rvoetter/ckb-fix
ninja smoke-builder (passed!)
* added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3
* added reflection of max_transpose parameters
* fix printing of std optional parameters
* fix use of undefined ck::index
* added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle
* added xdl two stage instance to reflection
* added additional variables
* added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3
* added reflection for device_grouped_conv_bwd_weight_wmma_cshuffle_v3
* added reflection for bwd_weight_wmma_cshuffle
* added comments back in
* add printed output for optional parameters
* update README
* fix typo
* added num_gemm_k_prefetch_stage and small fixes
* modified test string due to reflection of new parameter
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
* ck-builder: restructure testing conv
In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.
* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp
This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.
* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3
Turns out that the supplied layout isn't actually supported...
* ck-builder: ck and reference conv integration for bwd weight
* ck-builder: ck bwd weight execution test
* ck-builder: ckt::run support for ck-tile bwd weight
* ck-builder: ck tile bwd weight execution test
* ck-builder: extra debug printing in MatchesReference
* ck-builder: make ckt::run return RunResult
This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.
* ck-builder: RunResult matcher
Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.
* ck-builder: doc fixes
* ck-builder: add missing headers
* Factor helpers out of conv_traits.hpp
* Create a non-templated conv_traits struct
* Migrate to new instance-specific instance_to_conv_traits functions
* Clean up reflection concepts
* Clean up ConvTraits helpers
* Update testing for convolution traits
This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.
* Address reviewer comments and resolve merge conflict
* Add placeholder test.
* Initial conv bwd weight factory.
* Conv builder test refactoring.
* Add missing pieces to bwd weight factory.
* Improve compile-time error message when no matching factory is found.
* Use macro to ensure automatic matching between concepts and their string representations.
* Improve compile time diagnostics.
* Small improvements.
* Improve missing member/wrong type compile-time errors.
* Improve compile time diagnostics.
* Concept bug fixes.
* Remove debug assert.
* Update algorithm signature diagnostics.
* Factory bug fixes.
* First functional version of bwd weight conv factory.
* Refactor handling of GEMM-K batch template parameter in conv bwd weight factory.
* Concept improvements.
* Improve concept diagnostics.
* Introduce a common size type for concepts.
* Update compiletime diagnostics to use the size type.
* Update conv specialization enum.
* Fix fwd conv builder tests.
* Fix smoke tests.
* Separate bwd weight and bwd data tests into separate targets.
* Clean-up CK Tile builder tests.
* Add bwd weight XDL CShuffle V3 factory.
* Build conv bwd weight v3 instances successfully.
* Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Test fix.
* Add instance traits for bwd weight algorithms.
* Add unit tests for instance strings.
* Build new instance traits unit tests but exclude WMMA for now.
* Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Conv bwd weight DL factory.
* Final implementation for bwd weight DL factory.
* Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
* Treat ref algorithm the same way as real algorithms in the dispatcher.
* Refactor large tensor support and WMMA configuration.
* Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3.
* Update Readme.
* Fix WMMA bwd weight tests.
* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3.
* Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle.
* Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
* Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs.
* Fix fwd factories after refactoring.
* clang-format
* Move compile-time diagnostics to a separate branch.
* Fix ref algorithm dispatching.
* Fix smoke tests.
* clang-format
* Fix factory for regular WMMA conv bwd weight.
* Clarify builder Readme.
* Remove obsolete test file.
* Fix test after merge.
* clang-format
* Remove the C++26 extensions.
* Unify conv elementwise ops and layout definitions for fwd and bwd directions.
* Remove old layout and elementwise ops.
* Unify handling of conv tensor types between fwd and bwd directions.
* Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank.
* Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms.
* clang-format
---------
Co-authored-by: Ville Pietilä <>
* ck-builder: make toString to_string
We are using snake case for CK-Builder
* ck-builder: add debug.hpp with tensor descriptor printing function
This adds some initial functionality to debug.hpp, a header which will
be used to house some debug utilities.
* ck-builder: abstract nd-iteration
Abstracting this makes it easier to test, clearer, and allows us to
use it elsewhere (such as in debug.hpp soon)
* ck-builder: tensor printing
* ck-builder: rename INT32 to I32
This makes it more in line with the other data type definitions.
Our concept-based conversions are fragile and too complex. We want to refactor to straightforward functions
for each instance traits class template. This change adds unit test coverage to make that refactoring safer.
* ck-builder: explicitly delete forward declarations
Before, these functions were seen as forward declarations of existing functions.
If no actual implementation overload could be found, these would be selected and
a linker error or warning would be generated. By marking these functions as explicitly
deleted, incorrect invocations now produce a compile error instead.
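
A minimal sketch of the effect, assuming a catch-all function template next to a real overload. The names `describe_instance`, `SupportedKernel`, and `UnsupportedKernel` are hypothetical and do not come from CK-Builder. Because selecting a deleted function is ill-formed, the failure is visible at compile time, which the detection trait below makes observable without breaking the build.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

struct SupportedKernel {};
struct UnsupportedKernel {};

// Catch-all: explicitly deleted, so choosing it is a compile error
// rather than an unresolved symbol at link time.
template <typename T>
std::string describe_instance(const T&) = delete;

// Real implementation overload for a supported type.
inline std::string describe_instance(const SupportedKernel&) { return "SupportedKernel"; }

// Detection trait: true only if the call would compile for T.
template <typename T, typename = void>
struct is_describable : std::false_type {};

template <typename T>
struct is_describable<T, std::void_t<decltype(describe_instance(std::declval<const T&>()))>>
    : std::true_type {};

static_assert(is_describable<SupportedKernel>::value, "real overload is selected");
static_assert(!is_describable<UnsupportedKernel>::value, "deleted catch-all is rejected");

inline void deleted_overload_example()
{
    assert(describe_instance(SupportedKernel{}) == "SupportedKernel");
    // describe_instance(UnsupportedKernel{}); // would not compile: use of deleted function
}
```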
* ck-builder: ckt::run plumbing for reference conv
This implements the ckt::run plumbing for the reference convolution
implementation and sets up the first complete end-to-end test.
* ck-builder: make validation system check for all-zeros
When the actual and reference outputs are both all zero bits,
there is probably something wrong in the test framework.
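
The idea can be sketched on the host as follows. This is only an illustration under assumptions: the helper names are hypothetical, the buffers are plain byte vectors, and the real check may well run on the GPU alongside the rest of the comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// True if every byte in the buffer is zero (an empty buffer also counts as all-zero).
inline bool all_zero_bits(const std::vector<std::uint8_t>& bytes)
{
    return std::all_of(bytes.begin(), bytes.end(), [](std::uint8_t b) { return b == 0; });
}

// Two all-zero outputs compare equal trivially, so a match tells us nothing.
// Flag that case instead of reporting success.
inline bool suspicious_all_zero(const std::vector<std::uint8_t>& actual,
                                const std::vector<std::uint8_t>& reference)
{
    return all_zero_bits(actual) && all_zero_bits(reference);
}

inline void all_zero_example()
{
    assert(suspicious_all_zero({0, 0, 0}, {0, 0, 0}));  // nothing was written anywhere
    assert(!suspicious_all_zero({0, 1, 0}, {0, 1, 0})); // genuine matching data
}
```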
* ck-builder: proper implementation+tests for TensorDescriptor::is_packed
* ck-builder: fix typos
This pull request builds on #3267 by providing the "validation" infrastructure, the means to compare a set of `Outputs`.
The design of the validation infrastructure is relatively straightforward:
- Each SIGNATURE should come with a `validate()` implementation, which should be implemented in a similar way that the other functions/types from `testing.hpp` are implemented.
- `validate()` returns a `ValidationReport`, which is a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Note that crucially, `validate()` should not do any reporting by itself. Rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this glue code for CK-Builder itself in `testing_utils.hpp`, in `MatchesReference()`. This functionality is relatively barebones right now; it will be expanded upon in a different PR to keep the scope of this one down.
The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors and with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor.
- For that it was useful that the `TensorDescriptor` has a rank which is known at compile-time, so I've changed the implementation of `TensorDescriptor` for that. I felt like it was a better approach than keeping it dynamic, for multiple reasons:
  - This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
  - We already know the rank of tensors statically, as it can be derived from the SIGNATURE.
  - It simplifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
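
To illustrate why a compile-time rank helps, here is a host-side sketch of a `tensor_foreach`-style helper. The `TensorDescriptor` layout (lengths plus strides), the callback signature, and the iteration order are assumptions made for this example; the actual CK-Builder types run on the GPU and may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical descriptor: the rank is a template parameter, so index arrays
// have a fixed size and no runtime rank assertions are needed.
template <std::size_t Rank>
struct TensorDescriptor
{
    std::array<std::size_t, Rank> lengths;
    std::array<std::size_t, Rank> strides;
};

// Invokes callback(index, offset) once for every element of the tensor.
template <std::size_t Rank, typename F>
void tensor_foreach(const TensorDescriptor<Rank>& desc, F&& callback)
{
    std::size_t total = 1;
    for (std::size_t len : desc.lengths)
        total *= len;

    for (std::size_t flat = 0; flat < total; ++flat)
    {
        std::array<std::size_t, Rank> index{};
        std::size_t rest   = flat;
        std::size_t offset = 0;
        // Unflatten the linear counter, treating the last dimension as fastest.
        for (std::size_t d = Rank; d-- > 0;)
        {
            index[d] = rest % desc.lengths[d];
            rest /= desc.lengths[d];
            offset += index[d] * desc.strides[d];
        }
        callback(index, offset);
    }
}

inline void tensor_foreach_example()
{
    // A packed 2x3 tensor: row stride 3, column stride 1.
    TensorDescriptor<2> desc{{2, 3}, {3, 1}};
    std::size_t count = 0;
    tensor_foreach(desc, [&](const std::array<std::size_t, 2>&, std::size_t) { ++count; });
    assert(count == 6);
}
```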
Refactors the way the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, especially to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles.
The changes use MXdlPerWave instead of NXdlPerWave to increase the number of waves along the M dimension.
Introduces a polymorphic describe() method to BaseOperator that enables runtime introspection of kernel configurations through a unified interface.
Key changes:
* Add virtual describe() method to BaseOperator returning Description objects
* Implement describe() in 6 device operation classes (conv fwd/bwd variants)
* Create conv_describe.hpp with factory function for ConvDescription
* Extract type definitions to conv_types.hpp to resolve circular dependencies
* Add InstanceStringDescription for kernels without full ConvDescription support
Other Improvements:
* Update tests to use describe() instead of GetInstanceString()
* Remove circular dependency include from conv_traits.hpp
* Add ODD_C to ConvFwdSpecialization enum and fix OddC mapping
* Replace silent fallback in conv_layout() with compile-time error
This provides a foundation for runtime kernel introspection and better tooling support for analyzing and debugging kernel configurations.
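
The shape of such an interface might look like the sketch below. Everything except the names `BaseOperator`, `describe()`, `Description`, and `InstanceStringDescription` is an assumption: the return type, the `brief()` accessor, and the `MyConvFwd` operator are hypothetical and only show how a polymorphic call site can inspect a kernel without knowing its concrete type.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Assumed minimal description interface.
struct Description
{
    virtual ~Description() = default;
    virtual std::string brief() const = 0;
};

// Fallback description that simply wraps an instance string.
struct InstanceStringDescription : Description
{
    explicit InstanceStringDescription(std::string s) : text(std::move(s)) {}
    std::string brief() const override { return text; }
    std::string text;
};

struct BaseOperator
{
    virtual ~BaseOperator() = default;
    // Runtime introspection hook; the return type here is an assumption.
    virtual std::unique_ptr<Description> describe() const = 0;
};

// Hypothetical device operation.
struct MyConvFwd : BaseOperator
{
    std::unique_ptr<Description> describe() const override
    {
        return std::make_unique<InstanceStringDescription>("MyConvFwd<256, 128, 128>");
    }
};

inline void describe_example()
{
    MyConvFwd op;
    const BaseOperator& base = op; // caller only sees the base interface
    assert(base.describe()->brief() == "MyConvFwd<256, 128, 128>");
}
```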
* Add README.md for testing
* Add tensor_memory_manager.
* ck-builder: tensor memory manager rebase fixes
This fixes some issues caused by the API being changed recently.
Also, this streamlines the ckt namespace to always be ck_tile::builder::test,
as this is already being used by other tests
Really, this commit should be squashed into the previous,
but I'm keeping it separate for brevity.
* ck-builder: test arguments initial prototype
* ck-builder: test system initial prototype
* ck-builder: fix non-standardized copyright comments
* ck-builder: new prototype
* ck-builder: group testing inputs/outputs into a separate structure
This is basically the return of the tensor memory manager after all,
except that the design is more closely tied to the actual operation.
Using a struct allows us to add additional input/output tensors
without breaking code (by defaulting those new parameters). Note
that the tensors are split into a separate inputs/outputs because we
usually want to allocate the output _twice_: once for the real
computation and once for the reference computation.
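
    A rough sketch of that grouping, with hypothetical member names and
    plain host vectors standing in for device buffers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input bundle for a convolution test.
struct Inputs
{
    std::vector<float> input;
    std::vector<float> weight;
    // A tensor added later: the default keeps existing call sites compiling.
    std::vector<float> bias = {};
};

// Outputs live in their own struct so they can be allocated more than once.
struct Outputs
{
    std::vector<float> output;
};

inline Outputs allocate_outputs(std::size_t element_count)
{
    return Outputs{std::vector<float>(element_count, 0.0f)};
}

inline void inputs_outputs_example()
{
    Inputs in{{1.0f, 2.0f}, {0.5f}};        // bias is left defaulted
    Outputs actual    = allocate_outputs(4); // written by the real computation
    Outputs reference = allocate_outputs(4); // written by the reference computation
    assert(in.bias.empty());
    assert(actual.output.size() == reference.output.size());
}
```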
* ck-builder: simplify prototype naming; start docs
* ck-builder: update testing readme
* ck-builder: testing documentation
* ck-builder: HipStatusMatcher
This matcher can be used to check HIP status codes and provide
nice and readable error messages.
* ck-builder: tensor_buffer.hpp tests
* ck-builder: conv_fwd.hpp tests
* ck-builder: add example end-to-end test in conv fwd 2d fp16
* ck-builder: simplify extent usage
* ck-builder: update testing doc
* ck-builder: skip end to end test on non-gfx9
* fix check_copyright_year interpreter
/bin/bash is not guaranteed to exist on Linux. Signed,
a NixOS user
* ck-builder: fix copyrights
* ck-builder: reduce conv fwd testing size
This test allocated 24GB of memory, too much for 16GB cards.
---------
Co-authored-by: John Shumway <jshumway@amd.com>