* Added device level implementation for bwd_data_wmma_v3.
* Added first instance of bwd_data_wmma_v3(f16).
* Add support for bwd data in gridwise implementation
Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight
* Initial device implementation of bwd data
* Remove unused template parameters in device impl
* Add one instance for different layout
initial check of device implementation
* Add tests for splitk and for different layouts
* Appended more instances to wmma_v3_f16.
* Added conv_2d bf16 wmma_v3 instances.
* Added conv_3d_bf16 wmma_v3_instances.
* Added conv_3d_f16_wmma_v3_instances.
* Added SplitN test cases for wmma.
* Conv3d_bwd_data_scale_wmma_v3 instances.
* Conv3d_bwd_data_bilinear_wmma_v3_instances
* Renaming the device level instances file to a common name, since it is defined for different DataTypes.
* Renaming the instances and fixing typo
* Added the test cases to regression test list
* NCHW support for wmma_v3
* Examples for bf16 and f16 bwd_data_wmma_v3
* Added transpose conditions for device impl
* fixing bugs
* Added the gemm_args array implementation
* WIP debug conv bwd
* fix splitk
* Grouped gemm fix
* Update CmakeLists with EOF
* Added more instances for tests
* Fixed the run time error in examples and removed 3d conv examples.
* Fixed a typo.
* Updated CMakeLists to remove the deleted 3d convolution files
* Added print error statements for unsupported arguments
* Added the merge conflict related changes
* Fixed compilation error
* Fixed the InstanceFactory duplication error.
* Removed the print statements and added logs to Arg function
* All the merge conflict related errors resolved
* Added d_tensor tests.
* Added the missing example types of wmma_v3
* Merge error fix
* Corrected the instance name
* Reverted the bias relu change
* Reverted the transpose load local change
* Updated the regression test list with bwd_data_scale
* Revert "Revereted the transpose load local change"
This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.
* Revert "Merge error fix"
This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.
* Reverting the local change
* Added merge error fix
* Build error fix due to merge conflicts
* Added bias_relu example for wmma_v3
* Modified the main method in dtensor tests
* Updated the dtensor tests to pick all the shapes
* Updated the dtensor test shapes.
* Updated the mem operations in tests.
* Added reference func
* Fixed typos in device impl
* Added new header file and modified the include file for 3d tests
* Renamed the test file and added reference func call.
* clang format fix
* Added ignore params
* Modified device impl and tests
* Removed debug print statements and updated dtensor test shapes
* Fixing merge conflicts
* Fixing more merge conflicts
* Fixed copyrights
* Updated the tuned instances to bilinear and scale.
* Adding tuned instances to vanilla wmma_v3
* Removed all unused instances and modified test layouts.
* Cleaned up all instances, reverted fwd fp16 instances and updated tuned fp16 instances.
* Fix clang format
* Updated tuned f16/generic instances
* Formatting the instances file
* Fixed copyrights and clang issues
* Nonsense commit to force git to force
* Removed the transpose instances
* Added verified generic instances
* Fixing namespace errors
* Added todo for failing shapes
* Formatting instance file
* Fix instance list formatting
* Removing unnecessary formats
* Renamed the common file
* Unification of xdl and wmma bwd_data tests
* Updated Cmake
* Added all layout types and deleted code.
* Updated Cmake to add the condition to all tests.
---------
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
[ROCm/composable_kernel commit: 53a1e4f551]
* [CK_BUILDER] Integrate GPU reference as ConvAlgorithm
Add GPU reference as a ConvAlgorithm specialization, enabling:
- Unified Builder API for reference and optimized kernels
- Future ckProfiler integration for validation
- First step toward numerical validation in Builder tests
Changes:
- Add ConvAlgorithmSpecialization::REFERENCE enum
- Add ConvAlgorithm_Reference struct
- Add IsReferenceAlgorithm concept
- Create 3 reference factories (Forward, BwdData, BwdWeight)
- Wire into conv_dispatcher
- Add proof-of-concept test (passing)
Test result: Can instantiate reference through Builder API
* Add GPU reference execution tests
- Reference kernel executes through Builder (459ms)
- Both reference and optimized can instantiate
- Tests passing
Next: Implement utilities for comparison
* Optimized Builder kernel execution works
- MakeArgument pattern implemented
- Builder-generated kernel executes successfully
- Tests passing (451ms execution)
Next: Add comparison
* VALIDATION COMPLETE: Builder == Reference
Builder-generated kernel output matches GPU reference!
Test: Validate_Optimized_vs_Reference_Forward_2D_FP16
Result: PASS ✓
This proves CK Builder generates correct code!
* Update to new Builder API
All tests passing
* Rename test file for clarity
test_builder_kernel_execution -> test_builder_kernel_validation
* Add all 3 directions support
- Forward, Backward Data, Backward Weight
- All reference factories working
- Dispatcher wired for all directions
- 9 tests passing
Tests:
- test_reference_execution: 3 tests (all directions)
- test_optimized_execution: 3 tests (all directions)
- test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders)
* Add backward direction support
- Backward data and weight dispatcher wiring
- Fix factories for new API
- All 3 directions tested
- 9 tests passing
* Refactor: Change IsReferenceAlgorithm from concept to consteval function
Address review feedback: Use consteval function in dispatcher instead of
concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL).
- Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp
- Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp
- Update dispatcher to use function call: IsReferenceAlgorithm<T>()
- Remove redundant algorithm checks from reference factory requires clauses
All tests passing (9/9).
* Move Tile algorithm check outside direction block to support all directions
* Implement MakeInvokerPointer interface and add random input validation
- Implement full Argument/Invoker structs for old CK interface (not just nullptr)
- Refactor with reference_common.hpp to reduce code duplication
- Add random input validation tests: Builder vs direct GPU reference (all directions)
- Fix layout: GNHWC -> NHWGC to match reference kernel expectations
- All 12 tests pass with IDENTICAL results on random input
* Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp
Keep types.hpp for data types only (enums), move algorithm descriptors
to conv_algorithm_types.hpp as suggested by review.
* Add static_assert to ensure reference factories only accept PassThrough operations
Reference implementation doesn't support fused elementwise operations.
Add compile-time validation to fail early with clear error message if
non-PassThrough operations are specified on input, weight, or output.
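A compile-time check of this kind might look like the sketch below. The `PassThrough` and `Scale` structs are simplified stand-ins for CK's elementwise operations, and `ReferenceFactory` with its `supported` member is a hypothetical name. The real factory signature is not shown in this log.

```cpp
#include <cassert>
#include <type_traits>

// Simplified stand-in for CK's PassThrough elementwise operation.
struct PassThrough {
    template <typename T>
    constexpr T operator()(T x) const { return x; }
};

// Example of a fused (non-PassThrough) operation, for contrast.
struct Scale {
    float factor;
    constexpr float operator()(float x) const { return factor * x; }
};

// Hypothetical factory: rejects fused elementwise operations at compile time.
template <typename InOp, typename WeiOp, typename OutOp>
struct ReferenceFactory {
    static_assert(std::is_same_v<InOp, PassThrough> &&
                      std::is_same_v<WeiOp, PassThrough> &&
                      std::is_same_v<OutOp, PassThrough>,
                  "Reference implementation supports only PassThrough "
                  "elementwise operations on input, weight, and output.");
    static constexpr bool supported = true;
};

// Compiles: all three operations are PassThrough.
using OkFactory = ReferenceFactory<PassThrough, PassThrough, PassThrough>;

// Would fail to compile with the message above once instantiated:
// constexpr bool bad = ReferenceFactory<PassThrough, PassThrough, Scale>::supported;
```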
* Add InstanceTraits support for reference kernels
- Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection
- Create shared ReferenceCommonTraits base for common properties
- Add 3 direction-specific InstanceTraits specializations in one file
- Include data type and layouts in instance_string output
* Remove optimized kernel validation tests from reference-only branch
* Use existing layout helper and organize reference tests
Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits
test to validation folder.
* Merge develop branch
Fix DataType switch for new mixed precision types.
* Fix comment spacing for CI
* Convert IsReferenceAlgorithm from function to concept
* Add reference tests to CI smoke tests
* Consolidate 3 reference factories into single unified factory
---------
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
[ROCm/composable_kernel commit: a0acc83a72]
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests.
* Tentative fix for bwd weight bilinear bf16bf16bf16: the bilinear elementwise overload for this case (bf16, f32 accumulator, bf16) appears to have been wrong.
[ROCm/composable_kernel commit: 88ae445580]
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time
* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times
* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems
* fix: missing check for transposed load in bquant pipeline
* chore: lower unit test tensor dimensions a bit for faster tests
* chore: set grouped gemm example M/N warp tile to 16
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
[ROCm/composable_kernel commit: e08efa551f]
* fix for splitk if splitk < grid
* add different splitk implementation
* minor bugfix for streamk gemm
* Add test
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
[ROCm/composable_kernel commit: c0797c1671]
* reinstate conv_signature_utils.hpp
* added tests for elementwise operation getters
* add tests for getDataType functions
* added test for no data type specified
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
[ROCm/composable_kernel commit: 4ce7d4c511]
* add splitk support to ck tile conv bwd data
* add reviewers' suggestions
* minor fix
* removed splitkbatchoffset struct
[ROCm/composable_kernel commit: ead81d1b0b]
Replace `decltype(TailHandler<>(...)){}` with direct function call
to fix compilation error when return type is void.
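A minimal sketch of the kind of expression involved, using a hypothetical `tail_handler` and `run` rather than the real `TailHandler` (whose signature is elided in this log):

```cpp
#include <cassert>

// Hypothetical handler that returns void and records that it ran.
template <bool HasTail>
void tail_handler(int& calls)
{
    if(HasTail)
    {
        ++calls;
    }
}

template <bool HasTail>
void run(int& calls)
{
    // Before (sketch): decltype(tail_handler<HasTail>(calls)){};
    //   - the operand of decltype is unevaluated, so the handler is never called;
    //   - when the handler returns void, this spells `void{}`, which compilers
    //     have historically rejected.
    // After: a direct call, valid for any return type.
    tail_handler<HasTail>(calls);
}
```

The direct call does not need to name the return type at all, so it compiles for `void` and non-`void` handlers alike.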
Co-authored-by: Yi DING <yi.ding@amd.com>
[ROCm/composable_kernel commit: 8b73633e65]
* remove duplicate aliases
* Split scaleadd_ab instances for WMMA grouped conv fwd
* removed big shape from the test
[ROCm/composable_kernel commit: a8aebb7a8e]
* [CK-TILE] Guard against compiler lexer diagnostic
A recent change to Clang added a lexer-level diagnostic about a C2y
language feature. Because the diagnostic is emitted by the lexer, the
`__extension__` compiler built-in does not help: it is only respected
*after* lexing, during parsing.
This change adds guarding pragmas that disable the diagnostic in the
lexer, so that the warning is no longer treated as an error.
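The guarding pattern can be sketched as below. The warning flag name `-Wc2y-extensions` is an assumption for illustration, since the commit message does not name the diagnostic, and `guarded_value` is a hypothetical placeholder for the code that triggers it.

```cpp
#include <cassert>

#if defined(__clang__)
#pragma clang diagnostic push
// Keep older Clang versions quiet if they do not know the flag below.
#pragma clang diagnostic ignored "-Wunknown-warning-option"
// ASSUMPTION: illustrative flag name; the commit does not state which
// diagnostic group is disabled.
#pragma clang diagnostic ignored "-Wc2y-extensions"
#endif

// Placeholder for the code that uses the flagged language feature.
inline int guarded_value() { return 42; }

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```

The `push`/`pop` pair limits the suppression to the guarded region, so the diagnostic stays active for the rest of the translation unit.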
* Fixing remaining build issue
Once the first warning was removed, another one popped up. Both are
related to the same C2y feature, so both are now ignored.
* clang-format handling
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: 9bd67c2cf2]
Refactors how the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, mainly to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles.
The changes use MXdlPerWave instead of NXdlPerWave to increase the number of waves along the M dimension.
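As a rough illustration of why the per-wave count affects the wave layout, the sketch below assumes the usual tiling relation `MPerBlock = MWaves * MXdlPerWave * MPerXdl`. The helper `waves_along` and the tile sizes are hypothetical and not taken from the commit.

```cpp
#include <cassert>

// Number of waves along one block dimension, assuming
//   PerBlock = Waves * XdlPerWave * PerXdl.
constexpr int waves_along(int per_block, int xdl_per_wave, int per_xdl)
{
    return per_block / (xdl_per_wave * per_xdl);
}

// Hypothetical block tile: 128 rows, 16x16 instruction tiles.
// Four instruction tiles per wave along M gives 2 waves along M.
static_assert(waves_along(128, 4, 16) == 2, "4 tiles per wave -> 2 waves");
// Two instruction tiles per wave along M gives 4 waves along M.
static_assert(waves_along(128, 2, 16) == 4, "2 tiles per wave -> 4 waves");
```

Under this relation, a smaller per-wave tile count along M means more waves along M for the same block size.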
[ROCm/composable_kernel commit: cbc8335964]