[CK_Builder] added bwd data kernels to builder factory
(#4582)
This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
[CK_BUILDER] Move some smoke tests that require GPU
Since the CK builder is focused on compile time logic, let's keep the
`smoke-builder` target CPU-only so that it can be ran without a CPU.
Alternatively, we could define a `smoke-cpu-builder` or some special
subtarget, but it's probably simpler to just stick to CPU for this. (My
thinking is that in general GPU testing will be heavier than the smoke
tests. Further, the GPU testing code will likely move outside of the
builder once builder code is moved out of experimental.)
This PR clarifies that CPU-only intention for `smoke-builder` and moves
some GPU testing code to `smoke-regression`.
* ck-builder: restructure testing conv
In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.
* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp
This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.
* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3
Turns out that the supplied layout isn't actually supported...
* ck-builder: ck and reference conv integration for bwd weight
* ck-builder: ck bwd weight execution test
* ck-builder: ckt::run support for ck-tile bwd weight
* ck-builder: ck tile bwd weight execution test
* ck-builder: extra debug printing in MatchesReference
* ck-builder: make ckt::run return RunResult
This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.
* ck-builder: RunResult matcher
Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.
* ck-builder: doc fixes
* ck-builder: add missing headers
* Factor helpers out of conv_traits.hpp
* Create a non-templated conv_traits struct
* Migrate to new instance-specific instance_to_conv_traits functions
* Clean up reflection concepts
* Clean up ConvTraits helpers
* Update testing for convolution traits
This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.
* Address reviewer comments and resolve merge conflict
* Add placeholder test.
* Initial conv bwd weight factory.
* Conv builder test refactoring.
* Add missing pieces to bwd weight factory.
* Improve compile time erros message when no matching factory is found.
* Use amcro to ensure automatic macthing between concepts are their string representations.
* Improve compile time diagnostics.
* Small improvements.
* Improve missing member/wrong type compile-time errors.
* Improve compile time diagnostics.
* Concept bug fixes.
* Remove debug assert.
* Update algorithm signature diagnostics.
* Factory bug fixes.
* First functional version of bwd weight conv factory.
* Refactor handing of GEMM-K batch template parameter in conv bwd weight factory.
* Concept improvements.
* Improve concept diagnostics.
* Introduve a common size type for concepts.
* Update compiletime diagnostics to use the size type.
* Update conv specialization enum.
* Fix fwd conv builder tests.
* Fix smoke tests.
* Separate bwd weigth and bwd data tests into separate targets.
* Clean-up CK Tile builder tests.
* Add bwd weight XDL CShuffle V3 factory.
* Build conv bwd weigth v3 instances successfully.
* Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Test fix.
* Add instance traits for bwd weight algorithms.
* Add unit tests for instance strings.
* Build new instance traits unit tests but exclude WMMA for now.
* Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Conv bwd weight DL factory.
* Final implementation for bwd weight DL factory.
* Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
* Treat ref algorithm the same way as real algorithms in the dispatcher.
* Refactor large tensor support and WMMA configuration.
* Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3.
* Update Readme.
* Fix WMMA bwd weight tests.
* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3.
* Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle.
* Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle.
* Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
* Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weigth convs.
* Fix fwd factories after refactoring.
* clang-format
* Move compile-time diagnostics to a separate branch.
* Fix ref algorithm dispatching.
* Fix smoke tests.
* clang-format
* Fix factory for regular WMMA conv bwd weight.
* Clarify builder Readme.
* Remove obsolete test file.
* Fix test after merge.
* clang-format
* Remove the C++26 extensions.
* Unify conv elementwise ops and layout definitions for fwd and bwd directions.
* Remove old layout and elementwise ops.
* Unify handling of conv tensor types between fwd and bwd directions.
* Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank.
* Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms.
* clang-format
---------
Co-authored-by: Ville Pietilä <>
This adds some utilities to automatically generate UniqueInputs,
UniqueOutputs, alloc_inputs, alloc_outputs, and validate, based
on a Inputs::reflect() and Outputs::reflect().
* ck-builder: make toString to_string
We are using snake case for CK-Builder
* ck-builder: add debug.hpp with tensor descriptor printing function
This adds some initial functionality to debug.hpp, a header which will
be used to house some debug utilities.
* ck-builder: abstract nd-iteration
Abstracting this makes it easier to test, clearer, and allows us to
use it elsewhere (such as in debug.hpp soon)
* ck-builder: tensor printing
* ck-builder: rename INT32 to I32
This makes it more in line with the other data type definitions.
Our concept-base conversions are fragile and too complex. We want to refactor to straightforward functions
for each intance trace class template. This change adds unit test coverage to make that refactoring safer.
This pull request builds on #3267 by proving the "validation" infrastructure, the means to compare a set of `Outputs`.
The design of the validation infrastructure is relatively straight forward:
- Each SIGNATURE should come with a `validate()` implementation, which should be implemented in a similar way that the other functions/types from `testing.hpp` are implemented.
- `validate()` returns a `ValidationReport`, which is a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Note that crucially, `validate()` should not do any reporting by itself. Rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this clue code for CK-Builder itself in `testing_utils.hpp`, its `MatchesReference()`. This functionality is relatively barebones right now, it will be expanded upon in a different PR to keep the scope of this one down.
The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors and with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor.
- For that it was useful that the `TensorDescriptor` has a rank which is known at compile-time, so I've changed the implementation of `TensorDescriptor` for that. I felt like it was a better approach than keeping it dynamic, for multiple reasons:
- This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
- We know already know the rank of tensors statically, as it can be derived from the SIGNATURE.
- It simpifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
Added instance traits for the following bwd weight conv algorithms
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_DL
DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
Added also unit tests for instance traits of those bwd weigth algorithms that are currently exposed by the narrow CK build for MIOpen.
---------
Co-authored-by: Ville Pietilä <>
* [CK_BUILDER] Integrate GPU reference as ConvAlgorithm
Add GPU reference as a ConvAlgorithm specialization, enabling:
- Unified Builder API for reference and optimized kernels
- Future ckProfiler integration for validation
- First step toward numerical validation in Builder tests
Changes:
- Add ConvAlgorithmSpecialization::REFERENCE enum
- Add ConvAlgorithm_Reference struct
- Add IsReferenceAlgorithm concept
- Create 3 reference factories (Forward, BwdData, BwdWeight)
- Wire into conv_dispatcher
- Add proof-of-concept test (passing)
Test result: Can instantiate reference through Builder API
* Add GPU reference execution tests
- Reference kernel executes through Builder (459ms)
- Both reference and optimized can instantiate
- Tests passing
Next: Implement utilities for comparison
* Optimized Builder kernel execution works
- MakeArgument pattern implemented
- Builder-generated kernel executes successfully
- Tests passing (451ms execution)
Next: Add comparison
* VALIDATION COMPLETE: Builder == Reference
Builder-generated kernel output matches GPU reference!
Test: Validate_Optimized_vs_Reference_Forward_2D_FP16
Result: PASS ✓
This proves CK Builder generates correct code!
* Update to new Builder API
All tests passing
* Rename test file for clarity
test_builder_kernel_execution -> test_builder_kernel_validation
* Add all 3 directions support
- Forward, Backward Data, Backward Weight
- All reference factories working
- Dispatcher wired for all directions
- 9 tests passing
Tests:
- test_reference_execution: 3 tests (all directions)
- test_optimized_execution: 3 tests (all directions)
- test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders)
* Add backward direction support
- Backward data and weight dispatcher wiring
- Fix factories for new API
- All 3 directions tested
- 9 tests passing
* Refactor: Change IsReferenceAlgorithm from concept to consteval function
Address review feedback: Use consteval function in dispatcher instead of
concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL).
- Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp
- Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp
- Update dispatcher to use function call: IsReferenceAlgorithm<T>()
- Remove redundant algorithm checks from reference factory requires clauses
All tests passing (9/9).
* Move Tile algorithm check outside direction block to support all directions
* Implement MakeInvokerPointer interface and add random input validation
- Implement full Argument/Invoker structs for old CK interface (not just nullptr)
- Refactor with reference_common.hpp to reduce code duplication
- Add random input validation tests: Builder vs direct GPU reference (all directions)
- Fix layout: GNHWC -> NHWGC to match reference kernel expectations
- All 12 tests pass with IDENTICAL results on random input
* Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp
Keep types.hpp for data types only (enums), move algorithm descriptors
to conv_algorithm_types.hpp as suggested by review.
* Add static_assert to ensure reference factories only accept PassThrough operations
Reference implementation doesn't support fused elementwise operations.
Add compile-time validation to fail early with clear error message if
non-PassThrough operations are specified on input, weight, or output.
* Add InstanceTraits support for reference kernels
- Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection
- Create shared ReferenceCommonTraits base for common properties
- Add 3 direction-specific InstanceTraits specializations in one file
- Include data type and layouts in instance_string output
* Remove optimized kernel validation tests from reference-only branch
* Use existing layout helper and organize reference tests
Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits
test to validation folder.
* Merge develop branch
Fix DataType switch for new mixed precision types.
* Fix comment spacing for CI
* Convert IsReferenceAlgorithm from function to concept
* Add reference tests to CI smoke tests
* Consolidate 3 reference factories into single unified factory
---------
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build erros
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemma_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* removed unnecessary ck parts from compilation
* initial gemm_add_multiply instance implementations
* fixed profiler help message for gemm_add_multiply
* improved multiply_add profiler layout help
* fixed template arguments for test instances
* added test for gemm_add_multiply
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemma_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* switched to splitK interface
* log print added to splitk benchmarks
* revert main cmake comments
* newline change reverted
* added add_fastgelu instances
* revert unintended change in xdl add_fastgelu
* created gemm_add_add_fastgelu instances
* created fastegelu instances
* added tests for all splitk fastgelus
* Added tests.
* multiply_add instances created
* updates to add_multiply splitk instances
* splitk xdl test fixes
* added wmma multiply_multiply instances
* fixed ONLY_XDL_AND_WMMA_KERNELS tag
* Added gemm_add examples for wmma v1 and v3
* fixed / workarounded i8 instances
* Modified the v3 code to added one fp16 bxdl instance.
* added bf16 xdl instance.
* adding gemm_add wmma_cshuffle and other support
(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* add instances into camkelists
(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* This is work in progress, edited the template parameters in order to build
(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype
(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* added datatype and use clang-format-12
(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* Fixing build errors
* Added instances for v3
* Adding instances and executables
* Code update of template parameters modified.
* Renamed file.
* Added tests.
* resolved error tests.
* Fixing build errors
* Updated comments
* removed the changes as per the MR review comment.
* Updated tests.
* fp8 instances - not tested
* Restored the Cmake file that was reverted by mistake during rebase.
* fixed wmma_op test
* Updated comments.
* Updated the template parameter description
* fixed rdna4 instances
* fixed back compatibility on gfx11
* cleanups
* fix ckProfiler
* one more cmake fix
* added fp8 instances
* Updated tests to ad BF16 instances as per review comment
* Added include file and cleaned up(as per review comment)
* Updated and optimized the example code for all types.
* Fixed clang format
* Resolve "Implement `device_gemm_bilinear` for RDNA4"
* test generalization to handle FP16 shuffle better
* added missing changes
* Added bf16 wmma instance for add_relu
* Added f16 wmma instance and corrected bf16 instance errors.
* Added instances to Cmake
* Modified the template parameters to make the instances work.
* Fixed typo in profiler
* Added v3 instances for gemm_add_relu
* addressed core review comments
* Added test for gemm_add_relu wmma instance
* Cleaned up the code.
* Added examples for gemm_add_relu
* Fixing typo to resolve build errors.
* Fixes applied to fix the precision loss.
* fix billinear test after merge
* Removed the old wmma instances.
* Added wrapper and renamed the wmma_v3 instances
* Updated copyrights and added wrappers.
* Fixes applied according to review comments
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Robin Voetter <robin@streamhpc.com>
* Removed the old wmma instances.
* Updated wrapper for the v3 instances
* removed the old wmma examples
* Renamed the v3 instances
* Deleted the gtest file added by mistake.
* Updated thge profiler with wrapper
* Fixed test errors.
* Fixed the review comments
* Fixed the if condition MACROS.
* REVERTED THE PROFILER CHANGES
* Revert "REVERTED THE PROFILER CHANGES"
This reverts commit 21cb98546c.
* Revert "Fixed test errors."
This reverts commit 13efcc6fe1.
* Revert "Updated thge profiler with wrapper"
This reverts commit 536f86661d.
* Added missing wrapper instances
* Updated copyrights.
* Fixed typo.
* Fixed copyrights.
* Updated copyrights.
* updated copyrights.
* comments on the atomics workaround
* fixed cmake comment
* Fix bug from merge
* clang-format-18
* Fix compilation error
* multi_abd wmma support:
- Add multiple A and B support to multiple D implementation (gridwise level)
- Add multi_abd GEMM (device level)
- Add instances (xdl parity)
- Add tests (both xdl and wmma)
- Add examples
- Add ckProfiler support (both xdl and wmma)
* Fix bug in device print function
* Fix unused template parameter
* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.
* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.
* Make relevant profilers print the number of valid instances to aid testing.
* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.
* Reset output buffer after each run in profile_grouped_conv_fwd_impl().
* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.
* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.
* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.
* Add int8 instances for 2D vanilla grouped conv fwd all layouts.
* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd
* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.
* Implement multi-AB support for grouped conv fwd and add example.
* Add 1D instances
* Add D layout tests to IsSupportedArgument()
* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.
* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.
* Add some more tests for vanilla grouped conv fwd
* Add 2D bias clamp instances and tests
* Add 3D bias clamp instances and tests
* Add 2D and 3D clamp instances and tests
* Unify problem sizes across vanilla and clamp flavor tests
* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.
* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.
* Remove wmma fp8 and bf8 instances when not targetting gfx12
* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build
* Remove old years for newly created files.
* No need to time kernels for now.
* Fixup comments
* Pass struct args to Gridwise Run() function by reference.
* Don't use workspace memory in the case where A needs explicit transposition but B does not.
* Move calculation of rotating memory buffer sizes to Argument member functions.
* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.
* Unify xdl and wmma example code for grouped conv fwd scaleadd ab
* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.
* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.
* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.
* Small post-merge fixup, everything seems to work.
* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.
* Extend scaleadd_ab instance lists
* Extend merged groups instance lists, including adaptations of xdl "2x" instances.
* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.
* Extend "mem" instance lists.
* Extend regular instance lists.
* Fixup comments and ignored kernel arg name
* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.
* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.
* Re-enable sharding for wmma cshufflev3 instances
* Post merge fix to vanilla test
* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.
* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.
* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.
* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.
* Re-enable old wmma instances
* Re-enable Linqun's Xdl Wmma instances
* Small post-merge fixes
* Fix copyright headers
* Remove commented code snippet in gridwise
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.
* Adding tuned instace list for groupoed conv fwd (#3288)
Following flavors are updated with tuned instance list:
- grouped_conv2d_fwd
- grouped_conv2d_fwd_bias_clamp
- grouped_conv2d_fwd_clamp
- grouped_conv3d_fwd
- grouped_conv3d_fwd_bias_clamp
- grouped_conv3d_fwd_clamp
- grouped_conv3d_fwd_scaleadd_ab
Re-factored instance selection:
- removed all the unnecessary instance tuples (comp/mem/16x16/generic)
- removed all unnecessary layouts and data types
* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.
* Splitting grouped conv fwd instances (#3449)
* Disable unnecessary and failing tests related to experimental CK builder
* Disable unnecessary ck builder experimental tests fully
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Wojciech Laskowski <77888887+wj-laskowski@users.noreply.github.com>
* Add README.md for testing
* Add tensor_memory_manager.
* ck-builder: tensor memory manager rebase fixes
This fixes some issues caused by the API being changed recently.
Also, this streamlines the ckt namespace to always be ck_tile::builder::test,
as this is already being used by other tests
Really, this commit should be squashed into the previous,
but I'm keeping it separate for brevity.
* ck-builder: test arguments initial prototype
* ck-builder: test system initial prototype
* ck-builder: fix non-standardized copyright comments
* ck-builder: new prototype
* ck-builder: group testing inputs/outputs into a separate structure
This is basically the return of the tensor memory manager after all,
except that the design is more closely tied to the actual operation.
Using a struct allows us to add additional input/output tensors
without breaking code (by defaulting those new parameters). Note
that the tensors are split into a separate inputs/outputs because we
usually want to allocate the output _twice_: once for the real
computation and once for the reference computation.
* ck-builder: simplify prototype naming; start docs
* ck-builder: update testing readme
* ck-builder: testing documentation
* ck-builder: HipStatusMatcher
This matcher can be used to check HIP status codes and provide
nice and readable error messages.
* ck-builder: tensor_buffer.hpp tests
* ck-builder: conv_fwd.hpp tests
* ck-builder: add example end-to-end test in conv fwd 2d fp16
* ck-builder: simplify extent usage
* ck-builder: update testing doc
* ck-builder: skip end to end test on non-gfx9
* fix check_copyright_year interpreter
/bin/bash is not guaranteed to exist on Linux. Signed,
a NixOS user
* ck-builder: fix copyrights
* ck-builder: reduce conv fwd testing size
This test allocated 24GB of memory, too much for 16GB cards.
---------
Co-authored-by: John Shumway <jshumway@amd.com>
* Separate layouts into separate entities for input, weight, and output tensors.
* Add test for handling bias tensor layouts.
* Use instance string in builder tests.
* Add handling of output bias data types and layouts.
* Generalize handling of the elementwise ops.
* Test fix.
* Create builder for layouts.
* Layout builder improvements.
* Improve layout builder.
* Simplify bias layout handling.
* Code clean-up.
* Move layout utils into separate file.
* Remove hard-coded layout combinations.
* Small code clean-up.
* Move data type utils into a separate file.
* Add data types, layouts, and elementwise ops per conv tensor.
* Builder bug fixes after refactoring.
* Working baseline.
* Make signature definition look nice in the test code.
* Move TensorConfig into test implementations.
* Fix all fwd conv builder tests.
* Fix conv traits and descriptors tests.
* More factory assets under a separate directory.
* Fix building conv traits.
* Fix clang-format.
* Add Readme doc to describe the design.
* Add link to main Readme. Fix links in the builder design doc.
* Clean-up data type/layout/elementwise op conversions.
* Switch from dimension and tensor type specific layouts to a flat list of tensor layouts.
* Fix clang-formatting.
* Fix clang-format for test code.
* Simplify fwd conv signature definitions in the test code.
* Remove accidental edits.
* Fix comment string.
* Fix instance factory after rebase.
* Fix tests after rebase.
* Unify layout handling.
* Add more conv layout unit tests.
* Clang-format.
* Fix merge conflicts.
* Improve elementwise op handling.
---------
Co-authored-by: Ville Pietilä <>
* Create Description::instance_string() function
To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor.
This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR.
* Test describe().instance_string()
Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the xld kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept.
Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace.
The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations.
Major changes in this PR:
Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection.
Helper code is moved to a subdirectory builder/factory/helpers.
Helpers now have unit tests.
Factories are moved to their own files.
Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal.
This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail.
Quality changes:
Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility.
Separating code into separate files and subdirectories helps readability and extensibility.
Adding unit tests for helpers documents behavior and will enable more complex logic and functionality.
Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor.
* [CKBuilder] Update the testing documentation
Now that we have clear sets for smoke tests and regression test, we rearange the CMakeLists.txt file to be organized and have description and instructional comments.
Move all the test targets that compile quickly into the smoke test suite.
Update the builder README.md to reflect this new test organization and functionality.
* Update experimental/builder/README.md
Clarify integration tests description from review comment.
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Correct README.md
The regression tests here still run very fast like smoke tests, but can take minutes or even tens of minutes to compile. We test most of the builder functionality without compiling heavily-templated kernel code, but these regression tests do an expensive full build of the CK kernels.
---------
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* build and run ck_builder tests
* add test_ckb_all to targets
* fix syntax
* fix test path
* Update CMake targets for builder testing in CI (#3290)
Our existing CMake only had build targets. Update CMakeLists.txt to have CTEST targets:
* smoke-builder
* regression-builder
* check-builder
Co-authored-by: John Shumway <jshumway@amd.com>
* use check-builder target
* get rid of test_ckb_all target
* call ninja check-builder separately
---------
Co-authored-by: John Shumway <jshumway@amd.com>
There were four errors to fix:
1. The checks for defaulted direction were not implemented in the predicate concept.
2. Had to delete an obsolete and undefined operation enum.
3. A factory was passing a boolean in place of an integer.
4. Some of the factory tests are not compiling correctly when linking in the full source (with CK_EXPERIMENTAL_BUILDER=ON), so I commented them out.
Grouping transfer operations per tensor makes it easier to
constrain on and operate with the transfer operations. As an
example, we can now deduplicate the logic for translating
the transfer operations from the ck-builder interface to the old
ck interface for the A and B tensors.
Proposed changes
Improve the forward convolution builder implementation and addressed leftover feedback left from PR #3138. Main changes
Refactored tests such that they reflect better the builder pattern. The templates and types for the convolution algorithm concepts are created via factory that facilitates programmatic creation of the device op instances.
Moved tests into anonymous namespace.
The convolution factory had lot of if-else constructs when CK Builder types were converted into CK library types. I had initially trouble in using static_assert in the default branch of switch as the static_assert was evaluated at compile time even for valid types. However, if we change the static_assert to throw "<error message>", it will result in a compile-time error only if the default branch is actually hit. This assumes that the function is consteval. Hence, changed all conversions in the convolution factory to use switch, which is more intuitive.
Removed the explicit device op definition from convolution signature and the corresponding predicate file. The device ops are defined by the corresponding concepts. This allowed to remove lot of boilerplate code from the convolution factory.
Adde inheritance and convolution algorithm specialization to handle device ops that are specialization of a more generic ones. The large tensor support is more naturally expressed by this pattern.
Added support for the FP8 data type.
* WIP: Builder for expected test results.
* Improve ckb fwd conv instance tests.
* clang-format
* Change if-else statements into switch in conv factory.
* Fix clang-formatting.
* Removed unnecessary includes.
* Added missing copyright.
* Remove explicit device op flag from from convolution signature.
* Add missing concept.
* Fix build.
* clang-format
* Add test for building conv fwd FP8 instances.
* Add missing header to instance traits.
* Clean-up recently added instances.
* Introduce inheritance and specialization.
* Use builder to build conv algorithm templates and types.
* clang-format
* Fix conv description tests.
---------
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>
* Add device operation to conv signature. Use unions to hold conv layouts and device operations.
* Add predicates for all device op instances.
* Use the device op signature for validation.
* Fix ckb CMakeLists.txt file for tests.
* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.
* Fix clang-formatting.
* add device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk
* Add full DL configurability with Option A implementation
- Added 5 DL descriptor structs (39 configurable parameters)
- Added 10 C++20 concepts for type-safe validation
- Updated factory to read all parameters from descriptors
- Updated test helper to populate all descriptors
- All tests passing (13/13 including 3 new DL tests)
* Add factory and test support for DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor
- Add factory specialization for Large_Tensor device operation (conv_factory.hpp lines 1145-1265)
- Add macro collision workaround using pragma push/pop (conv_factory.hpp lines 43-51)
- Add test helper function run_test_DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor
- Add builder test file test_ckb_conv_fwd_2d_large_tensor_fp16.cpp with 2 test cases
- Update CMakeLists.txt to include new test file
- Reuse existing ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle descriptor
- Map all 42 template parameters identical to regular XDL CShuffle
- All 15 builder tests passing including 2 new Large_Tensor tests
Completes Task 350: All 4 forward convolution device operations now supported in CK Builder.
* Update copyright headers to new format
- Change copyright format to: Copyright (C) Advanced Micro Devices, Inc., or its affiliates.
- Reorder headers: Copyright first, then SPDX-License-Identifier
- Updated files:
* experimental/builder/test/conv/test_ckb_conv_fwd_2d_dl_fp16.cpp
* experimental/builder/test/conv/test_ckb_conv_fwd_2d_large_tensor_fp16.cpp
* experimental/builder/include/ck_tile/builder/device_op_types.hpp
* fix c++ 18 format
* Fix clang-format-18 error in device_op_types.hpp
---------
Co-authored-by: Ville Pietilä <ville.pietila@amd.com>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
* Add backward weight instance traits for xdl cshuffle.
To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file.
* Fix copyright notices.
* Remove (c) symbol, replace with (C).
Having UTF-8 in source caused an error with code generation.
* Add device operation to conv signature. Use unions to hold conv layouts and device operations.
* Add predicates for all device op instances.
* Use the device op signature for validation.
* Fix ckb CMakeLists.txt file for tests.
* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.
* Fix clang-formatting.
* Add factory for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle device op.
* Add conv factory for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle
* Rename elements per wave per shuffle member in the epilogue concept.
* clang-format
* Add concepts and types for optional device op template parameters.
* Add optional compute, direct load, and loop scheduler arguments to conv factory.
* Add number of groups to merge template parameter.
* clang-format.
* Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle
* Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d
* A few small changes to fix broken instance traits.
* Rename CK Builder test targets with consistent prefix test_ckb.
* Add test_ckb_all target to build all CK Builder tests.
* Update Readme for CK Builder.
This pull requests adds some initial "factory tests" - these check that the instances which are used in MIOpen are actually present in CK. The main reason for this is documentation and sanity checking. Its likely that these tests get outdated fast, so we'll have to maintain them, but fortunately this is quite straight forward and shouldn't take a lot of time once they are in place.
* Refactor instance_traits_util and add unit tests tests
* Address reviewer comments.
Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection.
Also add a comment to the cmake to explain the unit test target test_conv_builder.
* Addressed more reviewer comments.
* Remove duplicate PassThrough::name
Accidentally added this field to the end of the struct, too. The `name` field should be a the start of the struct for consistency.
Adds new testing functionality: an inline diff for string comparison.
Example usage:
EXPECT_THAT("Actual string", ck_tile::test::StringEqWithDiff("Expected string"));
Failure message:
Value of: "Actual string"
Expected: "Expected string"
Actual: "Actual string" (of type char [14]),
Diff: "[Expe|A]ct[ed|ual] string"
The inline-diff function uses the Wagner-Fischer algorithm to find the minimum edit distance and generate diff markers, which has O(N^2) complexity. It has optional color codes that are enabled with the matcher.
* [CK_BILDER] Add compile-time reflection for a convolution instance
Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template
pattern matching. In particular, the GetInstanceString method returns a string that uniquely idenitfies the kernel, by explicitly serializing all template paramter values.
This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template paramters.
The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface.
Key architectural decisions:
- Forward-declare device kernels in instance_traits.hpp to avoid circular dependencies, since device implementation headers will include the reflection headers
- Use compile-time constants and type aliases to expose kernel parameters, enabling zero-overhead introspection
- Provide a templated instance_string() function that generates human-readable kernel configuration strings by serializing all template parameters in order, useful for debugging and kernel identification
- Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep it opt-in until the API stabilizes
- Add GetInstanceString() virtual method to BaseOperator, allowing runtime polymorphic access to compile-time kernel information
This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically.
Includes unit tests validating both the trait extraction accuracy and the string generation format.
Add experimental builder infrastructure for composable_kernel
- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.
This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.