Refactor and integrate CK GPU references into ckProfiler.
- All convolution layouts and groupings supported for all three directions
- Unit tests verifying GPU and CPU reference is the same
- Support added to profiler (do_verification = 2 enables GPU reference)
- One profiler-based test per direction changed to GPU reference to demonstrate usag
Closes AICK-427
* Convolution bwd weight device implementation
* Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma'
Convolution bwd weight device implementation
See merge request amd/ai/composable_kernel!38
* Fix bug and disable splitK=-1 tests for wmma
* Add generic instances for bf16 f32 bf16
* check gridwise level validity in device impl for 1 stage D0
* Fix bugs in device implementation:
- rdna3 compilation error
- gridwise layouts (need to be correct to ensure that CheckValidaity()
works correctly)
* Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization
* Remove workaround for 1x1Stride1Pad0 conv specialization
* Add instances for xdl parity (for pipeline v1)
* Add two stage instances (xdl parity)
* Add multiple Ds instances
* Add examples
* Uncomment scale instances
* Fix copyright
* Fix examples compilation
* Add atomic add float4
* Fix compilation error
* Fix instances
* Compute tolerances in examples instead of using default ones
* Compute tolerances instead of using default ones in bilinear and scale tests
* Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma'
Grouped conv: Instances and example bwd weight
See merge request amd/ai/composable_kernel!47
* Device implementation of explicit gemm for grouped conv bwd weight
Based on batched gemm multiple D
* Add instances for pipeline v1 and v3
* Add support for occupancy-based splitk
* Fix ckProfiler dependencies
* Review fixes
* Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma'
Device implementation of explicit gemm for grouped conv bwd weight
See merge request amd/ai/composable_kernel!52
* Fix cmake file for tests
* fix clang format
* fix instance factory error
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test."
This reverts commit d20c869d3d.
* Disable splitk for 2stage xdl on rdna (bug to be fixed)
* Fix add_test_executable
* Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet.
* Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20.
* Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances.
* Actually print the reason when a device implementation is not supported.
* Print number of valid instances in profiler and tests.
* Fix clang format for Two Stage implementation
* Fix copyright
* Address review comments
* Fix explicit conv bwd weight struct
* Fix gridwise common
* Fix gridwise ab scale
* Remove autodeduce 1 stage
* Restore example tolerance calculation
* Fix compilation error
* Fix gridwise common
* Fix gridwise gemm
* Fix typo
* Fix splitk
* Fix splitk ab scale
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad.
* Add explicit oddMN support with custom tuned instances
* Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTranserScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists.
* Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases.
* Remove some instances that give incorrect results (f16 NHWGC)
* Add bf16 f32 bf16 instances based on tuned b16 NHWGC GKYXC instances.
* Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process.
* Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support.
* Disable two stage f16 instances which produce incorrect results.
* Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear.
* Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit.
* Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3.
* Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target.
* Remove straggler comments
* Remove [[maybe_unused]]
* Fix clang format
* Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types).
* Add comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution
This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).
Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability
The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
* update gpu reference for ck tile grouped conv
* correct c++ 18 format
* Add GPU Reference Implementations for Old CK Convolution
This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.
Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
* Fixed BF16 NaN issue (use type_convert instead of static_cast)
* Fixed FP8/BF8 arithmetic (accumulate in float)
* Fixed uninitialized variables
* All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)
- Created backward data GPU reference (naive_conv_bwd_data.hpp)
* Implements input gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
* Implements weight gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Integrated with old CK examples
* Forward: 10 XDL examples now support do_verification=2
* Backward data: Integrated with example/17_convnd_bwd_data/
* Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
* Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types
Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints
Result: Old CK GPU references can replace CPU references for verification
with 50-100x performance improvement for large tensors.
* Apply clang-format to old CK GPU reference files
* Fix C++17 compatibility: use brace initialization for aggregate types
* add get_rtol, get_atl and consistency cout message
* Use triple bracket syntax for kernel launch per review feedback
Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.
All tests still passing after this change.
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode
- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)
* Use ConvDims structure to simplify GPU reference kernels
Replace 24 individual parameters with ConvDims structure per review feedback.
- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
* Use get_block_id() and get_thread_id() helpers in CK Tile
Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.
Updated 3 CK Tile GPU reference kernels per review feedback.
* Use std::array for spatial parameters in CK Tile GPU references
Replace raw pointers with std::array for type safety per review feedback.
- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters
* Use NDimSpatial+3 for stride array sizes
Replace hardcoded [10] with [NDimSpatial+3] per review feedback.
Array sizes now correctly reflect actual dimensions needed.
* Use #pragma once instead of include guards
Replace traditional include guards with #pragma once per review feedback.
Updated 3 Old CK GPU reference headers.
* Fix element-wise operation output in Old CK GPU references
Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.
This ensures element-wise operations are correctly applied to output.
* Initialize element-wise operation variables
Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.
Updated backward data and backward weight kernels.
* Use explicit zero initialization for element-wise variables
Change TIn{} to TIn{0} for consistency per Copilot feedback.
All 3 kernels now use consistent zero initialization.
* Fix copyright headers to match existing style
- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range
Addresses consistency feedback.
* Rename GPU reference files: add _gpu suffix
* Refactor index calculations: use std::array and extract to helper functions
* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Wrap ck host utitlies in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.
* Add using declarations to client examples.
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
Refine README files by removing outdated references to the Tile Engine
* Updates based on PR feedback 1
* Updates based on PR feedback 2
* Updates based on PR feedback 3
* Updates based on PR feedback 4
* Updates based on PR feedback 5
* Updates based on PR feedback 6
* Updates based on PR feedback 7
* Updates based on PR feedback 8
* Content Modification of CK Tile Example
* Modify the ck_tile gemm config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only linnk test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* sinmplify the main cmake file
* add conv_fwd_bf8 instance declaration
* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
* Add wei_strides to grouped conv3d wei to keep consistency
* Fix strides in client examples
* Unify backward weight api with forward
* Fix for example
* Fixes for examples
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* enable gfx941/942 targets
* fix clang format
* fix the cmake logic for multiple targets
* fix cmake syntax for looping over targets
* add gfx941/942 support for gemm_xdl instances
* enable dl kernels on navi3
* do not build xdl tests and examples on Navi
* run tests before building everything on jenkins
* disable gemm_bilinear on gfx1030
* add gpu targets to installer on Navi
* put tests in the same order as before
* reduce the number of navi targets in CI
* build CI installed for gfx940 as well
* only build for MI300 during QA runs
* fix a bug blocking wmma_gemm_multipleD
* Utilize matrix padder in device_wmma_op
* cosmetic change for gemmpadding format
* clang format
* Change gridwise gemm from FIFO to KMN loop fashion
* Add DeviceOp and examples
* Format DeviceOp template arguments
* Remove bf16 example
* Format
* Format
* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N
* Refactor argument preparation
* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl
* Rename device op file
* Update include directive in the example file
* Update descriptor preparation for grouped op
* Update the argument
* Update batch handling
* Add gridwise gemm supporting batched input
* Update blockwise indexing, working version
* Update copyright year
* Update check if argument is supported
* Refactor and make consistent with xdl examples
* Update check if argument is supported
* Add changelog entry
* Added comments on Dl op split_k>1 support
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Remove redundant CMake setting
* Extract common code from files
* Rename folder 'convnd' to 'conv'
* Use std::array<> to accept compile-time kwnown # of arguments
* Fix compilation error of tuning parameter
* In example, use same setting as unit-test
* Remove no-longer used include directive
* Add interface for grouped conv bwd weight
* Add group support for conv bwd weight
* Add grouped conv bwd weight example
* Use group parameter in example
* Rename example folder
* Remove non-grouped version example source files
* Rename device op template
* Add group support to convolution backward weight
* Remove debug messages
* Use smaller group size in example
* Use named variable as loop terminate condition
* Prettify example output message
* Enlarge used grid size
* Allow real grid size exceeds expected grid size
* Rename interface file
* Add client example for grouped conv2d bwd weight
* Fix wrong include directive
* Rename client example folder