* added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3
* added reflection of max_transpose parameters
* fix printing of std optional parameters
* fix use of undefined ck::index
* added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle
* added xdl two stage instance to reflection
* added additional variables
* added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3
* added reflection for device_grouped_conv_bwd_weight_wmma_cshuffle_v3
* added reflection for bwd_weight_wmma_cshuffle
* added comments back in
* add printed output for optional parameters
* update README
* fix typo
* added num_gemm_k_prefetch_stage and small fixes
* modified test string due to reflection of new parameter
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
[ROCm/composable_kernel commit: d6cccf6093]
* Factor helpers out of conv_traits.hpp
* Create a non-templated conv_traits struct
* Migrate to new instance-specific instance_to_conv_traits functions
* Clean up reflection concepts
* Clean up ConvTraits helpers
* Update testing for convolution traits
This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.
* Address reviewer comments and resolve merge conflict
[ROCm/composable_kernel commit: 5122637215]
* Add placeholder test.
* Initial conv bwd weight factory.
* Conv builder test refactoring.
* Add missing pieces to bwd weight factory.
* Improve compile-time error message when no matching factory is found.
* Use macro to ensure automatic matching between concepts and their string representations.
* Improve compile time diagnostics.
* Small improvements.
* Improve missing member/wrong type compile-time errors.
* Improve compile time diagnostics.
* Concept bug fixes.
* Remove debug assert.
* Update algorithm signature diagnostics.
* Factory bug fixes.
* First functional version of bwd weight conv factory.
* Refactor handling of GEMM-K batch template parameter in conv bwd weight factory.
* Concept improvements.
* Improve concept diagnostics.
* Introduce a common size type for concepts.
* Update compiletime diagnostics to use the size type.
* Update conv specialization enum.
* Fix fwd conv builder tests.
* Fix smoke tests.
* Separate bwd weight and bwd data tests into separate targets.
* Clean-up CK Tile builder tests.
* Add bwd weight XDL CShuffle V3 factory.
* Build conv bwd weight v3 instances successfully.
* Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Test fix.
* Add instance traits for bwd weight algorithms.
* Add unit tests for instance strings.
* Build new instance traits unit tests but exclude WMMA for now.
* Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Conv bwd weight DL factory.
* Final implementation for bwd weight DL factory.
* Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
* Treat ref algorithm the same way as real algorithms in the dispatcher.
* Refactor large tensor support and WMMA configuration.
* Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3.
* Update Readme.
* Fix WMMA bwd weight tests.
* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3.
* Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle.
* Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
* Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs.
* Fix fwd factories after refactoring.
* clang-format
* Move compile-time diagnostics to a separate branch.
* Fix ref algorithm dispatching.
* Fix smoke tests.
* clang-format
* Fix factory for regular WMMA conv bwd weight.
* Clarify builder Readme.
* Remove obsolete test file.
* Fix test after merge.
* clang-format
* Remove the C++26 extensions.
* Unify conv elementwise ops and layout definitions for fwd and bwd directions.
* Remove old layout and elementwise ops.
* Unify handling of conv tensor types between fwd and bwd directions.
* Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank.
* Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms.
* clang-format
---------
Co-authored-by: Ville Pietilä <>
[ROCm/composable_kernel commit: 9908a87c31]
Added instance traits for the following bwd weight conv algorithms
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_DL
DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
Also added unit tests for instance traits of those bwd weight algorithms that are currently exposed by the narrow CK build for MIOpen.
---------
Co-authored-by: Ville Pietilä <>
[ROCm/composable_kernel commit: 6e8c401e33]
* [CK_BUILDER] Integrate GPU reference as ConvAlgorithm
Add GPU reference as a ConvAlgorithm specialization, enabling:
- Unified Builder API for reference and optimized kernels
- Future ckProfiler integration for validation
- First step toward numerical validation in Builder tests
Changes:
- Add ConvAlgorithmSpecialization::REFERENCE enum
- Add ConvAlgorithm_Reference struct
- Add IsReferenceAlgorithm concept
- Create 3 reference factories (Forward, BwdData, BwdWeight)
- Wire into conv_dispatcher
- Add proof-of-concept test (passing)
Test result: Can instantiate reference through Builder API
* Add GPU reference execution tests
- Reference kernel executes through Builder (459ms)
- Both reference and optimized can instantiate
- Tests passing
Next: Implement utilities for comparison
* Optimized Builder kernel execution works
- MakeArgument pattern implemented
- Builder-generated kernel executes successfully
- Tests passing (451ms execution)
Next: Add comparison
* VALIDATION COMPLETE: Builder == Reference
Builder-generated kernel output matches GPU reference!
Test: Validate_Optimized_vs_Reference_Forward_2D_FP16
Result: PASS ✓
This proves CK Builder generates correct code!
* Update to new Builder API
All tests passing
* Rename test file for clarity
test_builder_kernel_execution -> test_builder_kernel_validation
* Add all 3 directions support
- Forward, Backward Data, Backward Weight
- All reference factories working
- Dispatcher wired for all directions
- 9 tests passing
Tests:
- test_reference_execution: 3 tests (all directions)
- test_optimized_execution: 3 tests (all directions)
- test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders)
* Add backward direction support
- Backward data and weight dispatcher wiring
- Fix factories for new API
- All 3 directions tested
- 9 tests passing
* Refactor: Change IsReferenceAlgorithm from concept to consteval function
Address review feedback: Use consteval function in dispatcher instead of
concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL).
- Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp
- Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp
- Update dispatcher to use function call: IsReferenceAlgorithm<T>()
- Remove redundant algorithm checks from reference factory requires clauses
All tests passing (9/9).
* Move Tile algorithm check outside direction block to support all directions
* Implement MakeInvokerPointer interface and add random input validation
- Implement full Argument/Invoker structs for old CK interface (not just nullptr)
- Refactor with reference_common.hpp to reduce code duplication
- Add random input validation tests: Builder vs direct GPU reference (all directions)
- Fix layout: GNHWC -> NHWGC to match reference kernel expectations
- All 12 tests pass with IDENTICAL results on random input
* Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp
Keep types.hpp for data types only (enums), move algorithm descriptors
to conv_algorithm_types.hpp as suggested by review.
* Add static_assert to ensure reference factories only accept PassThrough operations
Reference implementation doesn't support fused elementwise operations.
Add compile-time validation to fail early with clear error message if
non-PassThrough operations are specified on input, weight, or output.
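A minimal sketch of such a compile-time guard. `PassThrough` here is an empty stand-in for the CK elementwise op, and `Relu`, `all_pass_through`, and `ToyReferenceFactory` are hypothetical names, not the actual factory.

```cpp
#include <type_traits>

struct PassThrough {}; // stand-in for the CK elementwise op
struct Relu {};        // hypothetical fused op

// True only when input, weight, and output ops are all PassThrough.
template <typename InOp, typename WeiOp, typename OutOp>
inline constexpr bool all_pass_through =
    std::is_same_v<InOp, PassThrough> && std::is_same_v<WeiOp, PassThrough> &&
    std::is_same_v<OutOp, PassThrough>;

template <typename InOp, typename WeiOp, typename OutOp>
struct ToyReferenceFactory
{
    static_assert(all_pass_through<InOp, WeiOp, OutOp>,
                  "Reference kernels only support PassThrough elementwise operations");
    static constexpr bool ok = true;
};

static_assert(ToyReferenceFactory<PassThrough, PassThrough, PassThrough>::ok);
// ToyReferenceFactory<PassThrough, PassThrough, Relu>::ok would fail to compile
// with the message above.
```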
* Add InstanceTraits support for reference kernels
- Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection
- Create shared ReferenceCommonTraits base for common properties
- Add 3 direction-specific InstanceTraits specializations in one file
- Include data type and layouts in instance_string output
* Remove optimized kernel validation tests from reference-only branch
* Use existing layout helper and organize reference tests
Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits
test to validation folder.
* Merge develop branch
Fix DataType switch for new mixed precision types.
* Fix comment spacing for CI
* Convert IsReferenceAlgorithm from function to concept
* Add reference tests to CI smoke tests
* Consolidate 3 reference factories into single unified factory
---------
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
[ROCm/composable_kernel commit: a0acc83a72]
We have some metaprogramming helper functions that only exist to throw an error at build time. These should have the [[noreturn]] attribute, which is now required in our CI builds.
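A sketch of the pattern, assuming a hypothetical helper `fail_unsupported` and a toy `Layout` enum (neither name comes from CK). The helper is not constexpr, so reaching it during constant evaluation is a build error, and `[[noreturn]]` tells the compiler that control never comes back from it.

```cpp
#include <string_view>

enum class Layout { NHWGC, GNHWC, Unknown }; // toy subset for illustration

// Hypothetical helper: exists only to stop the build with a message.
// It never returns, so it carries [[noreturn]].
[[noreturn]] inline void fail_unsupported(const char* msg) { throw msg; }

constexpr std::string_view layout_name(Layout l)
{
    switch(l)
    {
    case Layout::NHWGC: return "NHWGC";
    case Layout::GNHWC: return "GNHWC";
    default: fail_unsupported("unsupported layout"); // not a constant expression
    }
}

static_assert(layout_name(Layout::NHWGC) == "NHWGC");
// static_assert(layout_name(Layout::Unknown) == ""); // would fail to compile
```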
[ROCm/composable_kernel commit: 9a6e61de97]
Introduces a polymorphic describe() method to BaseOperator that enables runtime introspection of kernel configurations through a unified interface.
Key changes:
* Add virtual describe() method to BaseOperator returning Description objects
* Implement describe() in 6 device operation classes (conv fwd/bwd variants)
* Create conv_describe.hpp with factory function for ConvDescription
* Extract type definitions to conv_types.hpp to resolve circular dependencies
* Add InstanceStringDescription for kernels without full ConvDescription support
Other Improvements:
* Update tests to use describe() instead of GetInstanceString()
* Remove circular dependency include from conv_traits.hpp
* Add ODD_C to ConvFwdSpecialization enum and fix OddC mapping
* Replace silent fallback in conv_layout() with compile-time error
This provides a foundation for runtime kernel introspection and better tooling support for analyzing and debugging kernel configurations.
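A simplified sketch of how a polymorphic describe() could be shaped. Only the names `BaseOperator`, `describe()`, `Description`, and `InstanceStringDescription` come from this commit; the signatures, the `unique_ptr` return type, `ToyConvFwd`, and `describe_or_empty` are assumptions for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct Description
{
    virtual ~Description() = default;
    virtual std::string instance_string() const = 0;
};

// Fallback for kernels without a full ConvDescription: wraps a plain string.
struct InstanceStringDescription : Description
{
    explicit InstanceStringDescription(std::string s) : text(std::move(s)) {}
    std::string instance_string() const override { return text; }
    std::string text;
};

struct BaseOperator
{
    virtual ~BaseOperator() = default;
    // Assumed default: no description available.
    virtual std::unique_ptr<Description> describe() const { return nullptr; }
};

struct ToyConvFwd : BaseOperator // hypothetical device operation
{
    std::unique_ptr<Description> describe() const override
    {
        return std::make_unique<InstanceStringDescription>("ToyConvFwd<256,128>");
    }
};

// Callers only need the base-class interface.
inline std::string describe_or_empty(const BaseOperator& op)
{
    auto d = op.describe();
    return d ? d->instance_string() : std::string{};
}

inline void self_check()
{
    assert(describe_or_empty(ToyConvFwd{}) == "ToyConvFwd<256,128>");
}
```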
[ROCm/composable_kernel commit: 9ac51aa0f4]
We only want to import enums and types into the builder reflection code. But, some of the enums are included in much larger files or even big trees of include files. This leads to unintended mixing of code and very confusing interactions and symbol conflicts. We organize the includes and extract two new enum-only headers to help with decoupling in CK. This refactoring is critical if we want to include reflection in a device-operator "describe" method.
* Remove a few unnecessary includes from headers in builder/reflect/.
* Extract enums scheduler and pipeline to their own headers so they can be used without importing other code.
* Order includes alphabetically for better organization.
The immediate goal is to unblock reflection integration, and this type of cleanup helps the flexibility and robustness of the CK header library.
[ROCm/composable_kernel commit: f5b0af2272]
When I asked for a description of operators that didn't have ConvTraits, I was getting very long confusing errors about ConvTraits not being defined. Now we get specific errors explaining which concepts are violated, making it easier to know which code to generalize or update.
* Add concepts to conv_traits.hpp to get better error message.
* Put the correct requires clauses in the right places to get descriptive error messages.
* General cleanup of functions in conv_traits.hpp to make functions easier to read.
[ROCm/composable_kernel commit: 13f6d63565]
* Separate layouts into separate entities for input, weight, and output tensors.
* Add test for handling bias tensor layouts.
* Use instance string in builder tests.
* Add handling of output bias data types and layouts.
* Generalize handling of the elementwise ops.
* Test fix.
* Create builder for layouts.
* Layout builder improvements.
* Improve layout builder.
* Simplify bias layout handling.
* Code clean-up.
* Move layout utils into separate file.
* Remove hard-coded layout combinations.
* Small code clean-up.
* Move data type utils into a separate file.
* Add data types, layouts, and elementwise ops per conv tensor.
* Builder bug fixes after refactoring.
* Working baseline.
* Make signature definition look nice in the test code.
* Move TensorConfig into test implementations.
* Fix all fwd conv builder tests.
* Fix conv traits and descriptors tests.
* More factory assets under a separate directory.
* Fix building conv traits.
* Fix clang-format.
* Add Readme doc to describe the design.
* Add link to main Readme. Fix links in the builder design doc.
* Clean-up data type/layout/elementwise op conversions.
* Switch from dimension and tensor type specific layouts to a flat list of tensor layouts.
* Fix clang-formatting.
* Fix clang-format for test code.
* Simplify fwd conv signature definitions in the test code.
* Remove accidental edits.
* Fix comment string.
* Fix instance factory after rebase.
* Fix tests after rebase.
* Unify layout handling.
* Add more conv layout unit tests.
* Clang-format.
* Fix merge conflicts.
* Improve elementwise op handling.
---------
Co-authored-by: Ville Pietilä <>
[ROCm/composable_kernel commit: 9cb1f421bc]
* Create Description::instance_string() function
To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor.
This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR.
* Test describe().instance_string()
Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the XDL kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept.
[ROCm/composable_kernel commit: f29b67cf9b]
Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace.
The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations.
Major changes in this PR:
Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection.
Helper code is moved to a subdirectory builder/factory/helpers.
Helpers now have unit tests.
Factories are moved to their own files.
Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal.
This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail.
Quality changes:
Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility.
Separating code into separate files and subdirectories helps readability and extensibility.
Adding unit tests for helpers documents behavior and will enable more complex logic and functionality.
Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor.
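A much simplified sketch of explicit dispatch in one function. The real make_conv_instance dispatches on signature and algorithm descriptors; here a hypothetical `Algo` enum stands in for them, and the toy factory types and `always_false` helper are illustrative assumptions.

```cpp
#include <string_view>

enum class Algo { Xdl, Wmma, Dl }; // hypothetical algorithm tag

struct ToyXdlFactory
{
    static constexpr std::string_view name = "xdl";
};
struct ToyWmmaFactory
{
    static constexpr std::string_view name = "wmma";
};

// Dependent false, so the static_assert fires only when its branch is instantiated.
template <auto>
inline constexpr bool always_false = false;

// Every supported case is listed in one place, in order.
template <Algo A>
constexpr auto make_conv_instance()
{
    if constexpr(A == Algo::Xdl)
        return ToyXdlFactory{};
    else if constexpr(A == Algo::Wmma)
        return ToyWmmaFactory{};
    else
        static_assert(always_false<A>, "make_conv_instance: no factory matches this algorithm");
}

static_assert(decltype(make_conv_instance<Algo::Xdl>())::name == "xdl");
// make_conv_instance<Algo::Dl>() would stop the build with the message above.
```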
[ROCm/composable_kernel commit: 280bc42191]
The ConvDescription::detailed command wasn't using TreeFormatter::writeLast correctly, which led to extra lines being drawn in the tree view. It's a simple fix, just a cosmetic improvement to our reflection output (ASCII art).
[ROCm/composable_kernel commit: d17994f3df]
Remove obsolete feature for extracting a description from a builder, since this should apply directly to the instance type. Also add some documentation, including a README.md for reflection.
[ROCm/composable_kernel commit: abd6a4b3fc]
Our copyright notices were mostly correct, but we inconsistently used (C) instead of (c) like the rest of the CK code. This PR fixes that (using lowercase c) and adds a missing copyright header to one file.
[ROCm/composable_kernel commit: f38c3de9f9]
* ck-builder: some miscellaneous fixes
* ck-builder: fix InstanceSet.FromFactory test
The exact syntax that the instance string functionality
returns has changed. This commit updates the test to expect
the right string.
[ROCm/composable_kernel commit: 7fe7aa76f5]
Proposed changes
Improve the forward convolution builder implementation and address leftover feedback from PR #3138. Main changes:
Refactored tests so that they better reflect the builder pattern. The templates and types for the convolution algorithm concepts are created via a factory that facilitates programmatic creation of the device op instances.
Moved tests into an anonymous namespace.
The convolution factory had a lot of if-else constructs where CK Builder types were converted into CK library types. I initially had trouble using static_assert in the default branch of a switch, because the static_assert was evaluated at compile time even for valid types. However, if we replace the static_assert with throw "<error message>", it results in a compile-time error only if the default branch is actually hit. This assumes that the function is consteval. Hence, all conversions in the convolution factory now use switch, which is more intuitive.
Removed the explicit device op definition from the convolution signature, along with the corresponding predicate file. The device ops are defined by the corresponding concepts. This allowed us to remove a lot of boilerplate code from the convolution factory.
Added inheritance and convolution algorithm specialization to handle device ops that are specializations of more generic ones. Large tensor support is more naturally expressed by this pattern.
Added support for the FP8 data type.
* WIP: Builder for expected test results.
* Improve ckb fwd conv instance tests.
* clang-format
* Change if-else statements into switch in conv factory.
* Fix clang-formatting.
* Removed unnecessary includes.
* Added missing copyright.
* Remove explicit device op flag from convolution signature.
* Add missing concept.
* Fix build.
* clang-format
* Add test for building conv fwd FP8 instances.
* Add missing header to instance traits.
* Clean-up recently added instances.
* Introduce inheritance and specialization.
* Use builder to build conv algorithm templates and types.
* clang-format
* Fix conv description tests.
---------
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>
[ROCm/composable_kernel commit: 7d57bc169f]
* Update copyright messages.
Copyright messages should no longer include a year. This PR updates all 38 source files to the new format.
* Switch to (C) from unicode copyright symbol.
The Unicode in comments was causing compilation errors.
[ROCm/composable_kernel commit: 0be0288f58]
* Add backward weight instance traits for xdl cshuffle.
To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file.
* Fix copyright notices.
* Remove (c) symbol, replace with (C).
Having UTF-8 in source caused an error with code generation.
[ROCm/composable_kernel commit: 6dbee64886]
* Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle
* Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d
* A few small changes to fix broken instance traits.
[ROCm/composable_kernel commit: 5ed2046bee]
Generalize the current convolution factory in CK Builder to be able to build instances of any relevant convolution device operation. The main changes are:
* Added new enums FwdGroupConvDeviceOperation, BwdDataGroupConvDeviceOperation, and BwdWeightGroupConvDeviceOperation that contain the device operations for which the builder should be able to build instances.
* Created a union structure GroupConvDeviceOp that can represent a single value of the fwd, bwd weight, or bwd data device operations. This would be more naturally represented by a std::variant object, but we cannot use std::variant in NTTPs because it is not a structural type.
* Introduced a new member device_operation in the ConvSignatureDescriptor concept that holds a GroupConvDeviceOp value.
* Added predicates to be used when creating ConvFactory specializations for the different device operations. When we add support for a new device operation, we'll just create a new ConvFactory specialization with appropriate predicates.
* Changed handling of the convolution layouts (GroupConvLayout1D, GroupConvLayout2D, GroupConvLayout3D) to use union-based handling, i.e., there's now a GroupConvLayout union struct that can hold a single value of the 1D, 2D, or 3D layouts. This simplifies the handling of the different layouts, as we get rid of the templatized convolution signature.
These code changes allow developers to work more easily in parallel when adding new device operations.
* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.
* Fix clang-formatting.
[ROCm/composable_kernel commit: b387249fd9]
* Refactor instance_traits_util and add unit tests
* Address reviewer comments.
Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection.
Also add a comment to the cmake to explain the unit test target test_conv_builder.
* Addressed more reviewer comments.
* Remove duplicate PassThrough::name
Accidentally added this field to the end of the struct, too. The `name` field should be at the start of the struct for consistency.
[ROCm/composable_kernel commit: 54746e9329]
We need to check all the architectures for build errors. This missing tf32 type came up as a build failure when I compiled for different instinct architectures.
[ROCm/composable_kernel commit: 6d709dac41]
* Add name member to unary elementwise ops.
* Update elementwise_op_name to check for name attribute.
* Require that the layout is derived from BaseTensorLayout struct.
[ROCm/composable_kernel commit: f53d857b25]
* Disable c++20-compat warnings when building old CK in C++20 mode
Turns out that this creates some warnings for no good reason.
* ck-builder: add missing layouts and element-wise op names
For layouts, we can directly use the ::name attribute, which should
cover all layouts. For element-wise ops, I just added the ones which
are currently missing when compiling CK with -DMIOPEN_REQ_LIBS_ONLY.
[ROCm/composable_kernel commit: d0364641ed]
* [CK_BUILDER] Add compile-time reflection for a convolution instance
Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template pattern matching. In particular, the GetInstanceString method returns a string that uniquely identifies the kernel by explicitly serializing all template parameter values.
This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous and only captures some of the template parameters.
The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface.
Key architectural decisions:
- Forward-declare device kernels in instance_traits.hpp to avoid circular dependencies, since device implementation headers will include the reflection headers
- Use compile-time constants and type aliases to expose kernel parameters, enabling zero-overhead introspection
- Provide a templated instance_string() function that generates human-readable kernel configuration strings by serializing all template parameters in order, useful for debugging and kernel identification
- Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep it opt-in until the API stabilizes
- Add GetInstanceString() virtual method to BaseOperator, allowing runtime polymorphic access to compile-time kernel information
This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically.
Includes unit tests validating both the trait extraction accuracy and the string generation format.
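A toy sketch of the two-level design: a declared-only primary `InstanceTraits` template plus one specialization that pattern-matches a kernel's template parameters. `ToyConvKernel`, its three parameters, the `TypeName` helper, and the output format are hypothetical; real CK kernels have many more parameters.

```cpp
#include <string>
#include <type_traits>

// Forward declaration only, as the reflection header would do for device kernels.
template <int BlockSize, int MPerBlock, typename DataType>
struct ToyConvKernel;

// Hypothetical helper mapping a type to a printable name.
template <typename T>
struct TypeName;
template <>
struct TypeName<float>
{
    static constexpr const char* value = "fp32";
};

// Primary template: the interface, with no definition.
template <typename Instance>
struct InstanceTraits;

// Kernel-specific specialization: extracts parameters by pattern matching.
template <int BlockSize, int MPerBlock, typename DataType>
struct InstanceTraits<ToyConvKernel<BlockSize, MPerBlock, DataType>>
{
    static constexpr int block_size  = BlockSize;
    static constexpr int m_per_block = MPerBlock;
    using data_type                  = DataType;

    // Serializes every template parameter in declaration order.
    static std::string instance_string()
    {
        return "ToyConvKernel<" + std::to_string(BlockSize) + "," +
               std::to_string(MPerBlock) + "," + TypeName<DataType>::value + ">";
    }
};

using ToyInstance = ToyConvKernel<256, 128, float>;
static_assert(InstanceTraits<ToyInstance>::block_size == 256);
```

Because the kernel is only forward-declared, the traits can be queried without including the kernel's implementation header.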
[ROCm/composable_kernel commit: 37dff024c1]