mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-12 17:26:00 +00:00
a89e7522e38a531e615ab496da80dfcdc6dfc548
408 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6df2d70143 |
Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)
* add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples |
||
|
|
693ff3bbb3 |
Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465)
- Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed) |
||
|
|
eb041079a3 |
Implement grouped gemm tile loop for RDNA4 (#3304)
* feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3 |
||
|
|
aad4cf0985 |
Wmma support for gemm_bias_add_reduce (#3316)
* Add tests for gemm_bias_add_reduce * Initial working implementation * Generalize implementation of reduce epilogue * Add tests for all layouts * Add instances * Fix test archs * Fix xdl bug * Remove library/profiler duplications * Fix num_byted error profiler * Fix typos * Fix copyright |
||
|
|
f9c6ba0403 |
Implement grouped gemm fastgelu for RDNA4 (#3303)
* Implement grouped gemm fastgelu for RDNA4 * chore: some cleanup and minor inconsistencies in grouped gemm profiler * chore: clarified logic and reporting of supported instance warnings |
||
|
|
53a1e4f551 |
Grouped convolution backward data WMMA v3 implementation (#3460)
* Added device level implementation for bwd_data_wmma_v3. * Added first instance of bwd_data_wmma_v3(f16). * Add support for bwd data in gridwise implementation Some changes are general for convolution and some are specific for bwd data. We need to generalize them once we have fwd, bwd data and bwd weight * Initial device implementation of bwd data * Remove unused template parameters in device impl * Add one instance for different layout initial check of device implementation * Add tests for splitk and for different layouts * Appended more instances to wmma_v3_f16. * Added conv_2d bf16 wmma_v3 instances. * Added conv_3d_bf16 wmma_v3_instances. * Added conv_3d_f16_wmma_v3_instances. * Added SplitN test cases for wmma. * Conv3d_bwd_data_scale_wmma_v3 instances. * Conv3d_bwd_data_bilinear_wmma_v3_instances * Renaming the device level instances file to common name , since it is defined for different DataTypes. * Renaming the instances and fixing typo * Added the test cases to regression test list * NCHW support for wmma_v3 * Examples for bf16 and f16 bwd_data_wmma_v3 * Added transpose conditons for device impl * fixing bugs * Added the gemm_args array implmentation * WIP debug conv bwd * fix splitk * Grouped gemm fix * Update CmakeLists with EOF * Added more instances for tests * Fixed the run time error in examples and removed 3d conv examples. * Fixed a typo. * Updated CmakeLists to removed the 3d convultion deleted files * Added print error statements for unsupoorted argument * Added the merge conflict related changes * Fixed compilation error * Fixed the InstanceFactory duplication error. * Removed the print statements and added logs to Arg function * All the merge conflict related errors resolved * Added d_tensor tests. * Added the missing example types of wmm_v3 * Merge error fix * Corrected the instance name * Reverted the bias relu change * Revereted the transpose load local change * Updated the regression test list with bwd_data_scale * Revert "Revereted the transpose load local change" This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b. * Revert "Merge error fix" This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b. * Reverting the local change * Added merge error fix * Build error fix due to merge conflicts * Added bias_relu example for wmma_v3 * Modified the main method in dtensor tests * Updated the dtensor tests to pick all the shapes * Updated the dtensor test shapes. * Updated the mem operations in tests. * Added reference func * Fixed typos in device impl * Added new header file and modified the include file for 3d tests * Renamed the test file and added reference func call. * clang format fix * Added ignore params * Modified device impl and tests * Removed debug print statements and updated dtensor test shapes * Fixing merge conflicts * Fixing more merge conflicts * Fixed copyrights * Updated the tuned instances to bilinear and scale. * Adding tuned instances to vanilla wmma_v3 * Removed all unused instances and modified test layouts. * Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances. * Fix clang format * Updated tuned f16/-genric instances * Formatting the instances file * Fixed copyrights and clang issues * Nonsense commit to force git to force * Removed the transpose instances * Added verified genric instances * Fixing namespace errors * Added todo for failing shapes * Formatting instance file * Fix instance list formatting * Removing unnecessary formats * Renamed the common file * Unification of xdl and wmma bwd_data tests * Updated Cmake * Added all layout types and deleted code. * Updated Cmake to add the condition to all tests. --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> |
||
|
|
88ae445580 |
Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470)
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests. * Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong. |
||
|
|
a8aebb7a8e |
Post-merge cleanup for WMMA grouped conv fwd (#3468)
* remove duplicate aliases * Split scaleadd_ab instances for WMMA grouped conv fwd * removed big shape from the test |
||
|
|
0fd2b2f045 |
Adding support for scale and bilinear ops for WMMA grouped conv fwd (#3450)
* Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * removed unnecessary ck parts from compilation * initial gemm_add_multiply instance implementations * fixed profiler help message for gemm_add_multiply * improved multiply_add profiler layout help * fixed template arguments for test instances * added test for gemm_add_multiply * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * switched to splitK interface * log print added to splitk benchmarks * revert main cmake comments * newline change reverted * added add_fastgelu instances * revert unintended change in xdl add_fastgelu * created gemm_add_add_fastgelu instances * created fastegelu instances * added tests for all splitk fastgelus * Added tests. * multiply_add instances created * updates to add_multiply splitk instances * splitk xdl test fixes * added wmma multiply_multiply instances * fixed ONLY_XDL_AND_WMMA_KERNELS tag * Added gemm_add examples for wmma v1 and v3 * fixed / workarounded i8 instances * Modified the v3 code to added one fp16 bxdl instance. * added bf16 xdl instance. * adding gemm_add wmma_cshuffle and other support (cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * add instances into camkelists (cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * This is work in progress, edited the template parameters in order to build (cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype (cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * added datatype and use clang-format-12 (cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * Fixing build errors * Added instances for v3 * Adding instances and executables * Code update of template parameters modified. * Renamed file. * Added tests. * resolved error tests. * Fixing build errors * Updated comments * removed the changes as per the MR review comment. * Updated tests. * fp8 instances - not tested * Restored the Cmake file that was reverted by mistake during rebase. * fixed wmma_op test * Updated comments. * Updated the template parameter description * fixed rdna4 instances * fixed back compatibility on gfx11 * cleanups * fix ckProfiler * one more cmake fix * added fp8 instances * Updated tests to ad BF16 instances as per review comment * Added include file and cleaned up(as per review comment) * Updated and optimized the example code for all types. * Fixed clang format * Resolve "Implement `device_gemm_bilinear` for RDNA4" * test generalization to handle FP16 shuffle better * added missing changes * Added bf16 wmma instance for add_relu * Added f16 wmma instance and corrected bf16 instance errors. * Added instances to Cmake * Modified the template parameters to make the instances work. * Fixed typo in profiler * Added v3 instances for gemm_add_relu * addressed core review comments * Added test for gemm_add_relu wmma instance * Cleaned up the code. * Added examples for gemm_add_relu * Fixing typo to resolve build errors. * Fixes applied to fix the precision loss. * fix billinear test after merge * Removed the old wmma instances. * Added wrapper and renamed the wmma_v3 instances * Updated copyrights and added wrappers. * Fixes applied according to review comments * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Robin Voetter <robin@streamhpc.com> * Removed the old wmma instances. * Updated wrapper for the v3 instances * removed the old wmma examples * Renamed the v3 instances * Deleted the gtest file added by mistake. * Updated thge profiler with wrapper * Fixed test errors. * Fixed the review comments * Fixed the if condition MACROS. * REVERTED THE PROFILER CHANGES * Revert "REVERTED THE PROFILER CHANGES" This reverts commit |
||
|
|
7795e73b47 |
Added large tensor support for grouped conv fwd wmma (#3437)
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * removed unnecessary ck parts from compilation * initial gemm_add_multiply instance implementations * fixed profiler help message for gemm_add_multiply * improved multiply_add profiler layout help * fixed template arguments for test instances * added test for gemm_add_multiply * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * switched to splitK interface * log print added to splitk benchmarks * revert main cmake comments * newline change reverted * added add_fastgelu instances * revert unintended change in xdl add_fastgelu * created gemm_add_add_fastgelu instances * created fastegelu instances * added tests for all splitk fastgelus * Added tests. * multiply_add instances created * updates to add_multiply splitk instances * splitk xdl test fixes * added wmma multiply_multiply instances * fixed ONLY_XDL_AND_WMMA_KERNELS tag * Added gemm_add examples for wmma v1 and v3 * fixed / workarounded i8 instances * Modified the v3 code to added one fp16 bxdl instance. * added bf16 xdl instance. * adding gemm_add wmma_cshuffle and other support (cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * add instances into camkelists (cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * This is work in progress, edited the template parameters in order to build (cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype (cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * added datatype and use clang-format-12 (cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * Fixing build errors * Added instances for v3 * Adding instances and executables * Code update of template parameters modified. * Renamed file. * Added tests. * resolved error tests. * Fixing build errors * Updated comments * removed the changes as per the MR review comment. * Updated tests. * fp8 instances - not tested * Restored the Cmake file that was reverted by mistake during rebase. * fixed wmma_op test * Updated comments. * Updated the template parameter description * fixed rdna4 instances * fixed back compatibility on gfx11 * cleanups * fix ckProfiler * one more cmake fix * added fp8 instances * Updated tests to ad BF16 instances as per review comment * Added include file and cleaned up(as per review comment) * Updated and optimized the example code for all types. * Fixed clang format * Resolve "Implement `device_gemm_bilinear` for RDNA4" * test generalization to handle FP16 shuffle better * added missing changes * Added bf16 wmma instance for add_relu * Added f16 wmma instance and corrected bf16 instance errors. * Added instances to Cmake * Modified the template parameters to make the instances work. * Fixed typo in profiler * Added v3 instances for gemm_add_relu * addressed core review comments * Added test for gemm_add_relu wmma instance * Cleaned up the code. * Added examples for gemm_add_relu * Fixing typo to resolve build errors. * Fixes applied to fix the precision loss. * fix billinear test after merge * Removed the old wmma instances. * Added wrapper and renamed the wmma_v3 instances * Updated copyrights and added wrappers. * Fixes applied according to review comments * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Robin Voetter <robin@streamhpc.com> * Removed the old wmma instances. * Updated wrapper for the v3 instances * removed the old wmma examples * Renamed the v3 instances * Deleted the gtest file added by mistake. * Updated thge profiler with wrapper * Fixed test errors. * Fixed the review comments * Fixed the if condition MACROS. * REVERTED THE PROFILER CHANGES * Revert "REVERTED THE PROFILER CHANGES" This reverts commit |
||
|
|
c0ee71d735 |
Dev/a8w4 and a8w8splitk (#3447)
* Ck moe bs splitk pr (#3440) * splitk kick-off. Compilation fail * splitk hack pass * fix scale offset calc. * clang-format for a8w8_moe_blk_gemm1 splitk change * fix testcase error --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> * Zan/moe a8w4 (#3441) * update * update * update ck moe a8w4 * update * update * update * compile pass * update * update * python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready * support new a8w4 kernel * update * update ck_tile * re format * update * update * fix conflict * fix build * update ck_tile moe * fix clang format * fix the problem * fix accruacy issue * fix --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com> Co-authored-by: Zzz9990 <zanzhang@amd.com> Co-authored-by: felix <felix.li@amd.com> |
||
|
|
ba897f8435 | ck:tf32:complement CK_ENABLE_TF32 controls (#3426) | ||
|
|
2ea710e88b |
Grouped convolution forward device implementation and base flavors for RDNA3/4 (#2964)
* Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * removed unnecessary ck parts from compilation * initial gemm_add_multiply instance implementations * fixed profiler help message for gemm_add_multiply * improved multiply_add profiler layout help * fixed template arguments for test instances * added test for gemm_add_multiply * Support multiple D in GridwiseGemm_wmma_cshuffle_v3 DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters. * Use ThreadGroupTensorSliceTransfer_v7r3 * Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support * Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma * Implement DeviceGemmMultipleD_Wmma_CShuffleV3 * Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3 * Prepare gemma_add tests for adding wmma * Add gemm_add_fastgelu instances and test * Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use DeviceGemmMultipleDSplitK instances there. * switched to splitK interface * log print added to splitk benchmarks * revert main cmake comments * newline change reverted * added add_fastgelu instances * revert unintended change in xdl add_fastgelu * created gemm_add_add_fastgelu instances * created fastegelu instances * added tests for all splitk fastgelus * Added tests. * multiply_add instances created * updates to add_multiply splitk instances * splitk xdl test fixes * added wmma multiply_multiply instances * fixed ONLY_XDL_AND_WMMA_KERNELS tag * Added gemm_add examples for wmma v1 and v3 * fixed / workarounded i8 instances * Modified the v3 code to added one fp16 bxdl instance. * added bf16 xdl instance. * adding gemm_add wmma_cshuffle and other support (cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * add instances into camkelists (cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * This is work in progress, edited the template parameters in order to build (cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype (cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * added datatype and use clang-format-12 (cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec) Co-authored-by: Cenxuan <cenxuan@streamhpc.com> * Fixing build errors * Added instances for v3 * Adding instances and executables * Code update of template parameters modified. * Renamed file. * Added tests. * resolved error tests. * Fixing build errors * Updated comments * removed the changes as per the MR review comment. * Updated tests. * fp8 instances - not tested * Restored the Cmake file that was reverted by mistake during rebase. * fixed wmma_op test * Updated comments. * Updated the template parameter description * fixed rdna4 instances * fixed back compatibility on gfx11 * cleanups * fix ckProfiler * one more cmake fix * added fp8 instances * Updated tests to ad BF16 instances as per review comment * Added include file and cleaned up(as per review comment) * Updated and optimized the example code for all types. * Fixed clang format * Resolve "Implement `device_gemm_bilinear` for RDNA4" * test generalization to handle FP16 shuffle better * added missing changes * Added bf16 wmma instance for add_relu * Added f16 wmma instance and corrected bf16 instance errors. * Added instances to Cmake * Modified the template parameters to make the instances work. * Fixed typo in profiler * Added v3 instances for gemm_add_relu * addressed core review comments * Added test for gemm_add_relu wmma instance * Cleaned up the code. * Added examples for gemm_add_relu * Fixing typo to resolve build errors. * Fixes applied to fix the precision loss. * fix billinear test after merge * Removed the old wmma instances. * Added wrapper and renamed the wmma_v3 instances * Updated copyrights and added wrappers. * Fixes applied according to review comments * Apply 1 suggestion(s) to 1 file(s) Co-authored-by: Robin Voetter <robin@streamhpc.com> * Removed the old wmma instances. * Updated wrapper for the v3 instances * removed the old wmma examples * Renamed the v3 instances * Deleted the gtest file added by mistake. * Updated thge profiler with wrapper * Fixed test errors. * Fixed the review comments * Fixed the if condition MACROS. * REVERTED THE PROFILER CHANGES * Revert "REVERTED THE PROFILER CHANGES" This reverts commit |
||
|
|
bb8445dca8 |
[CK] Integrate GPU reference into ckProfiler for convolutions (#3379)
Refactor and integrate CK GPU references into ckProfiler. - All convolution layouts and groupings supported for all three directions - Unit tests verifying GPU and CPU reference is the same - Support added to profiler (do_verification = 2 enables GPU reference) - One profiler-based test per direction changed to GPU reference to demonstrate usag Closes AICK-427 |
||
|
|
87dd073887 |
Wmma support for grouped convolution bwd weight (#2947)
* Convolution bwd weight device implementation
* Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma'
Convolution bwd weight device implementation
See merge request amd/ai/composable_kernel!38
* Fix bug and disable splitK=-1 tests for wmma
* Add generic instances for bf16 f32 bf16
* check gridwise level validity in device impl for 1 stage D0
* Fix bugs in device implementation:
- rdna3 compilation error
- gridwise layouts (need to be correct to ensure that CheckValidaity()
works correctly)
* Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization
* Remove workaround for 1x1Stride1Pad0 conv specialization
* Add instances for xdl parity (for pipeline v1)
* Add two stage instances (xdl parity)
* Add multiple Ds instances
* Add examples
* Uncomment scale instances
* Fix copyright
* Fix examples compilation
* Add atomic add float4
* Fix compilation error
* Fix instances
* Compute tolerances in examples instead of using default ones
* Compute tolerances instead of using default ones in bilinear and scale tests
* Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma'
Grouped conv: Instances and example bwd weight
See merge request amd/ai/composable_kernel!47
* Device implementation of explicit gemm for grouped conv bwd weight
Based on batched gemm multiple D
* Add instances for pipeline v1 and v3
* Add support for occupancy-based splitk
* Fix ckProfiler dependencies
* Review fixes
* Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma'
Device implementation of explicit gemm for grouped conv bwd weight
See merge request amd/ai/composable_kernel!52
* Fix cmake file for tests
* fix clang format
* fix instance factory error
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test."
This reverts commit
|
||
|
|
4dcc3e59c1 |
chore: update copyright header for misc files (#3402)
* chore: update copyright header for misc files * fix: typo in kernel resulting in ci failure |
||
|
|
ce99cab605 |
Wmma support for gemm_ab_scale (#3314)
* Support gemm_ab_scale: - Add tests - Integrate scaling implementation in multiple D - Generalize existing b_scale for ab_scale - Add instances - Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK - Add support for all layouts supported by xdl - Fix splitk xdl * Fix copyright * Wmma support for gemm_blockscale_wp (#3315) * Support for preshuffle with ab scale - add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale - add support for AScaleLayout amnd BScaleLayout (can be different from ALayout and BLayout, respectively) - add Run method in v1 pipeline to support preshuffle + scaling - add support for preshuffle gemms in common invoker - Add splitk support * Fix copyright header |
||
|
|
8fec8054b2 |
ck: add tf32 in DTYPES to control instances build(#3317)
|
||
|
|
4baa4c9fae |
[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216)
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution
This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).
Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability
The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
* update gpu reference for ck tile grouped conv
* correct c++ 18 format
* Add GPU Reference Implementations for Old CK Convolution
This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.
Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
* Fixed BF16 NaN issue (use type_convert instead of static_cast)
* Fixed FP8/BF8 arithmetic (accumulate in float)
* Fixed uninitialized variables
* All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)
- Created backward data GPU reference (naive_conv_bwd_data.hpp)
* Implements input gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
* Implements weight gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Integrated with old CK examples
* Forward: 10 XDL examples now support do_verification=2
* Backward data: Integrated with example/17_convnd_bwd_data/
* Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
* Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types
Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints
Result: Old CK GPU references can replace CPU references for verification
with 50-100x performance improvement for large tensors.
* Apply clang-format to old CK GPU reference files
* Fix C++17 compatibility: use brace initialization for aggregate types
* add get_rtol, get_atl and consistency cout message
* Use triple bracket syntax for kernel launch per review feedback
Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.
All tests still passing after this change.
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode
- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)
* Use ConvDims structure to simplify GPU reference kernels
Replace 24 individual parameters with ConvDims structure per review feedback.
- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
* Use get_block_id() and get_thread_id() helpers in CK Tile
Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.
Updated 3 CK Tile GPU reference kernels per review feedback.
* Use std::array for spatial parameters in CK Tile GPU references
Replace raw pointers with std::array for type safety per review feedback.
- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters
* Use NDimSpatial+3 for stride array sizes
Replace hardcoded [10] with [NDimSpatial+3] per review feedback.
Array sizes now correctly reflect actual dimensions needed.
* Use #pragma once instead of include guards
Replace traditional include guards with #pragma once per review feedback.
Updated 3 Old CK GPU reference headers.
* Fix element-wise operation output in Old CK GPU references
Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.
This ensures element-wise operations are correctly applied to output.
* Initialize element-wise operation variables
Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.
Updated backward data and backward weight kernels.
* Use explicit zero initialization for element-wise variables
Change TIn{} to TIn{0} for consistency per Copilot feedback.
All 3 kernels now use consistent zero initialization.
* Fix copyright headers to match existing style
- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range
Addresses consistency feedback.
* Rename GPU reference files: add _gpu suffix
* Refactor index calculations: use std::array and extract to helper functions
* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
|
||
|
|
161835533b |
Wmma support for gemm_multiply_multiply_wp (#3278)
* Initial implementation with splitK support * Add gfx11 support * Fix compilation error * Add instances * Add irregular instances * Fix GetBuffer arguments * Minor changes * Address review comments * Fix compilation errors * Fix copyright header |
||
|
|
46f1d740f0 |
Add grouped gemm instances for RDNA4 (#3237)
* wip: grouped_gemm implementation based on wmma kernel + example for fp16 * chore: clean up grouped_gem_wmma_splitk_fp16 example * chore: add cmake options to fully disable XDL or WMMA kernels * feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts) * chore: add grouped gemm wmma bf16 example * refactor: reuse more code between instance factory functions * chore: turn test failure if not all batch sizes are supported into a warning * chore: made failing of test on unsupported instances conditional to not break old tests * chore: add log message to failure case where AK1/BK1/KBatch is too high for K value * fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run() * fix: stray comma after parameter list * fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran * chore: update copyright in header comments * nit: minor feebdack * refactor: unified XDL / wma tests * fix: properly disable FP8 instances when ONLY targeting gfx11 * refactor: add v3 suffix to grouped_gemm device struct name * fix: small typos in example code * fix: fully exclude xdl/wmma instances when using the corresponding cmake flags * chore: remove unused destructor and added pipeline support checks to remove unnecessary paths * fix: make sure to not add instance library to group if library was skipped * fix: make sure xdl grouped gemm doesnt fail the new test * fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case * fix: examples not working since dependent types and functions were moved to ck namespace in develop * fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances * chore: replace/add copyright headers with new format |
||
|
|
4155eb24f9 |
fix:bf16x3:enable all instances on gfx950 (#3248)
* fix:bf16x3:enable all instances on gfx950 * fix clang-format fail * fix clang-format fail * fix:modified wrong params previously |
||
|
|
ad57f6ef0b |
[CK_BUILDER] Put global CK functions in an the CK namespace (#3232)
* Wrap ck host utitlies in CK namespace. The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace. Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace. There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library. * Add using declarations to test code. After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace. * Add using declarations to client examples. |
||
|
|
22a934a229 |
chore(copyright): update copyright header for include directory (#3219)
* chore(copyright): update copyright header for tile_engine directory * chore(copyright): update copyright header for script directory * chore(copyright): update copyright header for test_data directory * chore(copyright): update copyright header for python directory * chore(copyright): update copyright header for profiler directory * chore(copyright): update copyright header for library directory |
||
|
|
2a73eb3bc0 |
Simulate TF32 with BF16x3 (#3142)
* tf32:bf16x3:use bf16x3 emulate tf32 gemm * change blockwiseGemm to demo bf16x3 * temp push * self review * self review * fix multi-device compile error * bug fix * code refactor * limit to gfx950 * enhance gemm gfx942 threshold * lower change from blockwise to warpwise * refact codes * refact codes * error fix * change threshold * bug fix * fix threshold error * change host reference implement to same as device * bug fix * bug fix * code refact * fix clang-format fail * code refine |
||
|
|
4ebc48a3cd |
WMMA gemm_add_relu_add_layernorm (#2989)
* Summary:
- Refactor epilogue (with CShuffle) to support fused operations:
- EpilogueCShuffleBase holds common parts
- EpilogueCShuffle: runs CShuffle and write out
- EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out
- Extend thread transfer v7r3:
- Support for intermediate data type different from src and dst type
- New functionality to write to dst buffer and keep data (to be able to use them for additional operations)
* Adress review comments
|
||
|
|
66bae4306c |
Grouped conv fwd with direct load (#3082)
* Grouped conv fwd with direct load * fix * fix * Add IsSupported check * Fix * fix inductor |
||
|
|
6f58d6e457 |
[CK_BUILDER] Factory tests (#3071)
This pull requests adds some initial "factory tests" - these check that the instances which are used in MIOpen are actually present in CK. The main reason for this is documentation and sanity checking. Its likely that these tests get outdated fast, so we'll have to maintain them, but fortunately this is quite straight forward and shouldn't take a lot of time once they are in place. |
||
|
|
6bbc05e1bd |
conv:tf32:add missed instances (#3081)
* conv:tf32:add missed instances |
||
|
|
c4b2da9cbd |
implement device batched gemm b scale for wmma (#2825)
* rebased on top of develop * fixed missing shuffeling and wrong indexing * added tests for batched_b_scale * added missing files * fixed wrong stride computation and removed k batching (for now) due to precision issues * reinstated k-batching with PRNG constrained to -1..1 * added specialization of GeneratorTensor_3 for int4 and fixed internal overflow * added k-batching to reference and increased tolerances for test * changed gemm_b_scale and gemm_universal tests to use correct parameters * adressed review commentsd * ported fixes back to non-batched version of b_scale * adressed review comments * run clang-format on older commits * add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior * added newline at end of file * reflected changes from muitl-abd branch in batched b_scale * fixed gfx11 issue * changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed * run clang format * set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested. * reduced range for pk_i4 even further to 0..0 * removed failing xld instances. Failure now uncovered now that tests were fixed * removed generation of int4 values entierly * divide B buffer by BPackedSize --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> |
||
|
|
fada1a3cae |
Conv:TF32: add more instances - 2 (#2879)
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * tf32:conv:add instances for base class DeviceConvFwd * tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD * tf32:conv:add instances for base class DeviceGroupedConvBwdWeight * add tf32 in profiler * remove gnhwc/ngchw/ngcdhw instances * remove non-ndhwgc/nhwgc/nhwc instances * add check in IsSupportedArgument() |
||
|
|
5477811670 |
Grouped Conv Bwd Data out index calculation optimizations (#2917)
* Grouped Conv Bwd Data index calculation optimizations * fixes * refactor instances * gfx12 fixes * temporary disable splitK for gfx12 |
||
|
|
db2524be2d |
Verify HostTensorDescriptor when it is created (#2829)
* add proper GEMM layout verification * Handle "auto" strides. CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides). CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims. ValidateStrides throws if any of the tensor's strides is <=0. profile_gemm_multiply_add updated to support "auto" strides for tensors. Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes) auto-strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1 Note, -1 should be deprecated (use 0 instead) explicit strides (same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128 explicit strides (not the same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 mix of explicit and auto strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0 invalid stride bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64 terminate called after throwing an instance of 'std::runtime_error' what(): Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1 Aborted (core dumped) * - add more names to ck::tensor_layout for easier namespace hierarchy checking - updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp * added handling of partially initialized strides for GEMM. fixed more tests. * clang-format and more fixes * replace long dash by a simple hyphen - causes build failure in CK codegen. * increase sizeof input, otherwise output size becomes zero or negative with large filter size * select stride based on layout * specify layout explicitly to avoid errors in HostTensorDescriptor creation * add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor` * Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors * fix test (incl removing invalid config) * fix moe examples: - (in .cpp) add layout argument to non-2D tensors - (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index) * fix moe_gemm2 example. * fix profile and wmma examples * clean-up early mods for ckprofile. verified with: ``` ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 # ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128 # ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0 # ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0 # not implemented ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128 # ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138 # example_gemm_add_multiply_dl_fp16 example_gemm_add_multiply_xdl_fp16 # ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0 ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128 ``` * temporary skip first 8 test configs - they throw error * temporary skip first 8 test configs in wmma too - they throw error --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
df97a286d5 |
Conv:TF32: add more instances - 1 (#2867)
* conv:tf32:add more instances * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * remove gnhwc/ngchw/ngcdhw instances |
||
|
|
3d29bff2f0 |
Wmma support for multiple ABD GEMM (#2803)
* multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma) * Fix bug in device print function * Fix unused template parameter * Fix batched gemm for multiABD gridwise implementation * Fix gemm_universal_reduce with multiABDs gridwise implementation --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
dd7af118d7 |
TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852)
* Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)"
This reverts commit
|
||
|
|
f97b2a3f5d |
Added wmma support for gemm quantization: (#2841)
- profiler for gemm quantization for DL/XDL - tests for gemm quantization for DL/XDL - implementation for gemm quantization for WMMA - profiler/tests for gemm qunatization for WMMA Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
f22740df82 |
Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725)
* Enable xdl in gfx11 & gfx12 * update cmake file * fix all instance build (cmake) * fix batched_gemm_gemm(cmake) * rebase cmake files * fix cmake build error * remve CK_ENABLE_DYNAMIC_WARP_SIZE * update cmake build error2 * fix gfx11 build CK_USE_XDL is enabled on gfx11 and gfx12 * fix gfx10 build * fix gfx11 error --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> |
||
|
|
03b59f8c76 |
Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)
This reverts commit
|
||
|
|
c51102144f | feature:tf32:add initial conv3d fwd kernel support (#2763) | ||
|
|
b25d4d684a |
WMMA support for GEMM reduce (#2823)
Added gemm + reduce instance library for RDNA4. This includes: - New device implementation running GEMM and reduction kernel - instances for wmma (xdl parity) - examples for wmma (xdl parity) - tests for existing xdl and wmma |
||
|
|
b740380906 |
Wmma support for multiple Ds based GEMMs (#2613)
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision (cherry picked from commit e65d71180393e7b66169c56565a6bac740427de6) Co-authored-by: Anca Hamuraru <anca@streamhpc.com> * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip (cherry picked from commit f8c06322df0abcbd5945a56cdf5bffe56480f9f0) Co-authored-by: Anca Hamuraru <anca@streamhpc.com> * Added support for F8xF16xF16 to gemm_wmma_universal (cherry picked from commit 15c851de6daa513a12c2e3af299bab0176175fb5) Co-authored-by: Anca Hamuraru <anca@streamhpc.com> * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal (cherry picked from commit c6a4a69d2d43d59bae8bdabfae80d648646f217e) Co-authored-by: Anca Hamuraru <anca@streamhpc.com> * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" (cherry picked from commit |
||
|
|
7330ec37ee |
Implement batched gemm gemm for RDNA (3 and 4) (#2612)
* Create new copies of existing device struct and gridwise struct for batched_gemm_softmax_gemm and disable the softmax part. Still based on old wmma pipelines. Also copy the example and remove the softmax part from the reference calculation. Works and results match reference except for tiny float errors in problem 2. * Turn DeviceBatchedGemmGemm_Wmma_CShuffleV3 into a proper DeviceBatchedGemmGemm derived class, with the right argument and invoker functions. Update example to use new definitions. * Remove unused cross-attention and self-attention kernels, arguments, and invokers. Also remove other unused Argument types. * Remove masking related code, test unusual sizes in example. * Remove remaining softmax related code from GridwiseBatchedGemmGemm_wmma_cshuffle_v3 and example. * Remove code related to numDims, bias, and TensorSpec from Device struct and example. * Add layout template parameters to device struct * Move (NPerBlock, LTilePerBlock) device struct template arguments up by two places to match XDL template argument ordering. * Merge accumulation data types into one type to match XDL device struct. * Remove NPerWmma template parameter from device struct and just set it equal to LPerWmma. Now device struct template params exactly match those for XDL batched gemm gemm. * Add support for RCCR layout and test this in example * Add batched_gemm_gemm_wmma to instance library + profiler, and add gtest just like for xdl. * Add RCCR instance and additional RCRR instance to library. * Remove unused permute and alpha related code. Time all tests. Fix B1 strides in argument verification. * Remove references to G0, G1 in favor of batch, reduce dimensionality of length and stride arrays. * Managed to replace old wmma gridwise pipeline and blockwise struct with new wmma blockwise pipeline. Some cleanup required but all tests pass. * Make TransposeC a proper template parameter that gets passed all the way from BlockGemmPipeline_Selector to WmmaGemm so we can use the correct settings for bacthed gemm gemm as well as regular gemm. Gemm universal tests now pass again. * Replace old LoopSched and PipelineVer params with BlockwiseGemm pipeline equivalents, and use these in instance factory. The v3 pipeline does not work yet, but v1 works for intrawave and interwave. * Adapt the A wave descriptor to deal with RDNA4 wmma. This fixes batched gemm gemm functionality on RDNA4. * Fixed two aspects of the v3 pipeline that were incorrect: First of all the blockwise copy operator was invoked once too many in all cases (RunRead and move window), which broke batched gemm gemm when the blockwise pipeline was used multiple times. Furthermore we should be using the mainloop (hotloop) for num_k_loop >=2 instead of num_k_loop >=3. Now we can use support any K dimension. * Remove num prefetch parameter from gridwise struct since we don't use it and it doesn't do anything, * Remove unused non-lds paths. * Test and update the IsSupportedArgument() and CheckValidity() functions for all layouts + padding modes and various problem sizes. * Add a lot of instances to the profiler with various blocksizes and pipelines, all verified. * Add support for BF16: instance library, tests, and examples. * Add examples for int8 and fp8, had to add type_convert_sp template specializations for the latter. * Template the library instance lists and add default padding instances. * Move memory calculations from the kernel to the Argument contructor. Also actually parse and use the user-provided batch strides. * Actually parse and use user-provided regular strides. * More refactor: remove references to multiple dims per dims, and g0 / g1. Also move xdl specific test utils out of generic test util header. * Small post-rebase-on-develop fix due to bscale-related pipeline changes. All tests rerun + tested bscale and regular gemm. * Introduce the correct GetCThreadDescriptor function in the blockwise gemm pipelines for the TransposeC=true case. It turns out to be identical for our batched gemm gemm (gemm0) usecases, but could theoretically be different for wmma_gemm instances with smaller-than-4-byte output data size. * Remove unused NumPrefetch template parameter, we don't need to match the XDL template params one-to-one. * Implement proper TailNum and HasMainLoop template parameters for the v3 pipeline. Now the Run() function knows at compile time whether there are 1, 2, or more loops in total, and adds or removes sections accordingly. It still uses the blockwise copy operators the correct amount of times. * Add print lambda with env check and file and func to device and gridwise level compatibility error messages. Also respect compatibility in example script. * RDNA3 does not support fp8 |
||
|
|
26d3300930 |
Add other layouts for FP8 block scaled gemm (#2665)
* Start adding other layouts for gemm_ab_scale * Add some instances * Create tensor descriptors for A/B scales depending on A/B layout * Fix formatting * Revert some comments * Revert commented instances in CMakeLists.txt * Add some more instances for col-row gemm * enable more row,row instances * Use occupancy=1 for col,row layout to avoid spills |
||
|
|
5328b232b2 |
Grouped Convolution Forward Infer Bias Bnorm Activ (#2621)
* Grouped Convolution Forward Infer Bias Bnorm Activ * 3d |
||
|
|
504b101da3 |
upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config |
||
|
|
b01a27ff22 |
Support b_scale: (#2350)
- extend pipeline v1 and v3 - add instances - add tests - add example Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
fbd9f32abe |
[CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459)
1. Port NCHW support from ConvFwd (#2375) to conv bwd data 2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw Co-authored-by: azhuang <anzhong.huang@amd.com> |
||
|
|
25b359d630 |
MX GEMM - Add FP6 GEMM Test (#2488)
* Add F6 GEMM MX Test * Add BF6 GEMM MX Test |
||
|
|
1b66f3f4a3 |
Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add * add template for unsigned short atomic add * use atomicCAS in atomic add for fp16 and unsigned short * revrt back to atomic add using casting |