* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding is not supported when BDataType is pk_i4_t. Fixed the check and removed the padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build errors
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
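The out-of-bounds read above is the kind of bug that appears when a copy is sized by element count although two 4-bit values share one byte. A minimal sketch of the size calculation, assuming a packing factor of 2 for pk_i4_t; `packed_size_bytes` and its parameters are hypothetical names, not the profiler's actual code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed to store `num_elements` logical values
// when `packed_size` of them share one storage unit of `unit_bytes` bytes.
// For a packed int4 type, packed_size == 2 and unit_bytes == 1 (assumption).
constexpr std::size_t packed_size_bytes(std::size_t num_elements,
                                        std::size_t packed_size,
                                        std::size_t unit_bytes)
{
    // Round up so an odd element count still gets a full storage unit.
    return (num_elements + packed_size - 1) / packed_size * unit_bytes;
}

// A K x N int4 matrix with K = 64, N = 32 holds 2048 values in 1024 bytes.
static_assert(packed_size_bytes(64 * 32, 2, 1) == 1024, "two int4 values per byte");
// An unpacked 1-byte type needs one byte per value.
static_assert(packed_size_bytes(64 * 32, 1, 1) == 2048, "one value per byte");
```

Sizing both the host tensor and the device buffer through one such helper keeps the two byte counts in agreement.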
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemm_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions); the wrapper allows using
DeviceGemmMultipleDSplitK instances there.
* removed unnecessary ck parts from compilation
* initial gemm_add_multiply instance implementations
* fixed profiler help message for gemm_add_multiply
* improved multiply_add profiler layout help
* fixed template arguments for test instances
* added test for gemm_add_multiply
* switched to splitK interface
* log print added to splitk benchmarks
* revert main cmake comments
* newline change reverted
* added add_fastgelu instances
* revert unintended change in xdl add_fastgelu
* created gemm_add_add_fastgelu instances
* created fastgelu instances
* added tests for all splitk fastgelus
* Added tests.
* multiply_add instances created
* updates to add_multiply splitk instances
* splitk xdl test fixes
* added wmma multiply_multiply instances
* fixed ONLY_XDL_AND_WMMA_KERNELS tag
* Added gemm_add examples for wmma v1 and v3
* fixed / worked around i8 instances
* Modified the v3 code to add one fp16 bxdl instance.
* added bf16 xdl instance.
* adding gemm_add wmma_cshuffle and other support
(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* add instances into CMakeLists
(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* This is work in progress, edited the template parameters in order to build
(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* temp work saved, changed the BDataType to f16 or bf16 since wmma currently does not support non-equal A and B datatypes
(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* added datatype and use clang-format-12
(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* Fixing build errors
* Added instances for v3
* Adding instances and executables
* Code update of template parameters modified.
* Renamed file.
* Added tests.
* resolved error tests.
* Fixing build errors
* Updated comments
* removed the changes as per the MR review comment.
* Updated tests.
* fp8 instances - not tested
* Restored the Cmake file that was reverted by mistake during rebase.
* fixed wmma_op test
* Updated comments.
* Updated the template parameter description
* fixed rdna4 instances
* fixed back compatibility on gfx11
* cleanups
* fix ckProfiler
* one more cmake fix
* added fp8 instances
* Updated tests to add BF16 instances as per review comment
* Added include file and cleaned up (as per review comment)
* Updated and optimized the example code for all types.
* Fixed clang format
* Resolve "Implement `device_gemm_bilinear` for RDNA4"
* test generalization to handle FP16 shuffle better
* added missing changes
* Added bf16 wmma instance for add_relu
* Added f16 wmma instance and corrected bf16 instance errors.
* Added instances to Cmake
* Modified the template parameters to make the instances work.
* Fixed typo in profiler
* Added v3 instances for gemm_add_relu
* addressed core review comments
* Added test for gemm_add_relu wmma instance
* Cleaned up the code.
* Added examples for gemm_add_relu
* Fixing typo to resolve build errors.
* Fixes applied to fix the precision loss.
* fix bilinear test after merge
* Removed the old wmma instances.
* Added wrapper and renamed the wmma_v3 instances
* Updated copyrights and added wrappers.
* Fixes applied according to review comments
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Robin Voetter <robin@streamhpc.com>
* Removed the old wmma instances.
* Updated wrapper for the v3 instances
* removed the old wmma examples
* Renamed the v3 instances
* Deleted the gtest file added by mistake.
* Updated thge profiler with wrapper
* Fixed test errors.
* Fixed the review comments
* Fixed the if condition MACROS.
* REVERTED THE PROFILER CHANGES
* Revert "REVERTED THE PROFILER CHANGES"
This reverts commit 21cb98546c.
* Revert "Fixed test errors."
This reverts commit 13efcc6fe1.
* Revert "Updated thge profiler with wrapper"
This reverts commit 536f86661d.
* Added missing wrapper instances
* Updated copyrights.
* Fixed typo.
* Fixed copyrights.
* Updated copyrights.
* updated copyrights.
* comments on the atomics workaround
* fixed cmake comment
* Fix bug from merge
* clang-format-18
* Fix compilation error
* multi_abd wmma support:
- Add multiple A and B support to multiple D implementation (gridwise level)
- Add multi_abd GEMM (device level)
- Add instances (xdl parity)
- Add tests (both xdl and wmma)
- Add examples
- Add ckProfiler support (both xdl and wmma)
* Fix bug in device print function
* Fix unused template parameter
* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.
* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.
* Make relevant profilers print the number of valid instances to aid testing.
* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.
* Reset output buffer after each run in profile_grouped_conv_fwd_impl().
* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.
* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.
* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.
* Add int8 instances for 2D vanilla grouped conv fwd all layouts.
* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd
* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.
* Implement multi-AB support for grouped conv fwd and add example.
* Add 1D instances
* Add D layout tests to IsSupportedArgument()
* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.
* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.
* Add some more tests for vanilla grouped conv fwd
* Add 2D bias clamp instances and tests
* Add 3D bias clamp instances and tests
* Add 2D and 3D clamp instances and tests
* Unify problem sizes across vanilla and clamp flavor tests
* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.
* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.
* Remove wmma fp8 and bf8 instances when not targeting gfx12
* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build
* Remove old years for newly created files.
* No need to time kernels for now.
* Fixup comments
* Pass struct args to Gridwise Run() function by reference.
* Don't use workspace memory in the case where A needs explicit transposition but B does not.
* Move calculation of rotating memory buffer sizes to Argument member functions.
* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.
* Unify xdl and wmma example code for grouped conv fwd scaleadd ab
* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.
* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.
* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.
* Small post-merge fixup, everything seems to work.
* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.
* Extend scaleadd_ab instance lists
* Extend merged groups instance lists, including adaptations of xdl "2x" instances.
* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.
* Extend "mem" instance lists.
* Extend regular instance lists.
* Fixup comments and ignored kernel arg name
* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.
* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.
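A minimal sketch of the value-initialization idea, assuming a small offset helper with a fixed number of strides; `BatchOffsets` and its members are hypothetical and do not reproduce the real ComputePtrOffset struct:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical offset helper; member names are illustrative only.
struct BatchOffsets
{
    // `{}` value-initializes every stride to 0, so a default-constructed
    // object never holds indeterminate values.
    std::array<long, 3> batch_stride{};

    // Offset of group `group_idx` for tensor number `tensor`.
    constexpr long offset(std::size_t tensor, long group_idx) const
    {
        return group_idx * batch_stride[tensor];
    }
};
```

With the default member initializer, a caller that forgets to set a stride gets a deterministic offset of 0 rather than undefined behavior.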
* Re-enable sharding for wmma cshufflev3 instances
* Post merge fix to vanilla test
* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.
* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.
* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.
* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.
* Re-enable old wmma instances
* Re-enable Linqun's Xdl Wmma instances
* Small post-merge fixes
* Fix copyright headers
* Remove commented code snippet in gridwise
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.
* Adding tuned instance list for grouped conv fwd (#3288)
Following flavors are updated with tuned instance list:
- grouped_conv2d_fwd
- grouped_conv2d_fwd_bias_clamp
- grouped_conv2d_fwd_clamp
- grouped_conv3d_fwd
- grouped_conv3d_fwd_bias_clamp
- grouped_conv3d_fwd_clamp
- grouped_conv3d_fwd_scaleadd_ab
Re-factored instance selection:
- removed all the unnecessary instance tuples (comp/mem/16x16/generic)
- removed all unnecessary layouts and data types
* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.
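`std::remove_cvref_t` was standardized in C++20, so a C++17 build needs its own alias. A minimal sketch of such a replacement; the alias name and its placement in the global namespace are assumptions, not necessarily what the library uses:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical C++17 stand-in for std::remove_cvref_t (standardized in C++20):
// strip the reference first, then the top-level const/volatile qualifiers.
template <typename T>
using remove_cvref_t = std::remove_cv_t<std::remove_reference_t<T>>;

static_assert(std::is_same_v<remove_cvref_t<const int&>, int>, "");
static_assert(std::is_same_v<remove_cvref_t<volatile float&&>, float>, "");
// Only top-level qualifiers are removed; the pointee stays const.
static_assert(std::is_same_v<remove_cvref_t<const int* const&>, const int*>, "");
```

The order matters: applying `std::remove_cv_t` directly to a reference type does nothing, because a reference itself has no cv-qualifiers.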
* Splitting grouped conv fwd instances (#3449)
* Disable unnecessary and failing tests related to experimental CK builder
* Disable unnecessary ck builder experimental tests fully
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Wojciech Laskowski <77888887+wj-laskowski@users.noreply.github.com>
Refactor and integrate CK GPU references into ckProfiler.
- All convolution layouts and groupings supported for all three directions
- Unit tests verifying GPU and CPU reference is the same
- Support added to profiler (do_verification = 2 enables GPU reference)
- One profiler-based test per direction changed to GPU reference to demonstrate usage
Closes AICK-427
* Convolution bwd weight device implementation
* Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma'
Convolution bwd weight device implementation
See merge request amd/ai/composable_kernel!38
* Fix bug and disable splitK=-1 tests for wmma
* Add generic instances for bf16 f32 bf16
* check gridwise level validity in device impl for 1 stage D0
* Fix bugs in device implementation:
- rdna3 compilation error
- gridwise layouts (need to be correct to ensure that CheckValidity() works correctly)
* Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization
* Remove workaround for 1x1Stride1Pad0 conv specialization
* Add instances for xdl parity (for pipeline v1)
* Add two stage instances (xdl parity)
* Add multiple Ds instances
* Add examples
* Uncomment scale instances
* Fix copyright
* Fix examples compilation
* Add atomic add float4
* Fix compilation error
* Fix instances
* Compute tolerances in examples instead of using default ones
* Compute tolerances instead of using default ones in bilinear and scale tests
* Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma'
Grouped conv: Instances and example bwd weight
See merge request amd/ai/composable_kernel!47
* Device implementation of explicit gemm for grouped conv bwd weight
Based on batched gemm multiple D
* Add instances for pipeline v1 and v3
* Add support for occupancy-based splitk
* Fix ckProfiler dependencies
* Review fixes
* Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma'
Device implementation of explicit gemm for grouped conv bwd weight
See merge request amd/ai/composable_kernel!52
* Fix cmake file for tests
* fix clang format
* fix instance factory error
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test."
This reverts commit d20c869d3d.
* Disable splitk for 2stage xdl on rdna (bug to be fixed)
* Fix add_test_executable
* Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet.
* Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20.
* Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances.
* Actually print the reason when a device implementation is not supported.
* Print number of valid instances in profiler and tests.
* Fix clang format for Two Stage implementation
* Fix copyright
* Address review comments
* Fix explicit conv bwd weight struct
* Fix gridwise common
* Fix gridwise ab scale
* Remove autodeduce 1 stage
* Restore example tolerance calculation
* Fix compilation error
* Fix gridwise common
* Fix gridwise gemm
* Fix typo
* Fix splitk
* Fix splitk ab scale
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad.
* Add explicit oddMN support with custom tuned instances
* Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTransferScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists.
* Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases.
* Remove some instances that give incorrect results (f16 NHWGC)
* Add bf16 f32 bf16 instances based on tuned b16 NHWGC GKYXC instances.
* Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process.
* Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support.
* Disable two stage f16 instances which produce incorrect results.
* Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear.
* Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit.
* Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3.
* Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target.
* Remove straggler comments
* Remove [[maybe_unused]]
* Fix clang format
* Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types).
* Add comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
* Support gemm_ab_scale:
- Add tests
- Integrate scaling implementation in multiple D
- Generalize existing b_scale for ab_scale
- Add instances
- Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK
- Add support for all layouts supported by xdl
- Fix splitk xdl
* Fix copyright
* Wmma support for gemm_blockscale_wp (#3315)
* Support for preshuffle with ab scale
- add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale
- add support for AScaleLayout and BScaleLayout (can be different from ALayout and BLayout, respectively)
- add Run method in v1 pipeline to support preshuffle + scaling
- add support for preshuffle gemms in common invoker
- Add splitk support
* Fix copyright header
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution
This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).
Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability
The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
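The grid-stride loop pattern mentioned above can be illustrated on the host. This is a CPU emulation under stated assumptions, not the CK Tile kernel: `grid_stride_for` and `visit_counts` are hypothetical helpers, and the thread index is passed in explicitly instead of being read from GPU built-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a grid-stride loop (hypothetical helper).
// On a GPU, `thread_id` would be blockIdx.x * blockDim.x + threadIdx.x and
// `total_threads` would be gridDim.x * blockDim.x.
template <typename F>
void grid_stride_for(std::size_t thread_id, std::size_t total_threads, std::size_t n, F&& body)
{
    // Each thread starts at its own index and jumps by the grid size, so a
    // fixed-size launch covers any n.
    for(std::size_t i = thread_id; i < n; i += total_threads)
    {
        body(i);
    }
}

// Run every emulated thread in turn and count how often each element is visited.
inline std::vector<int> visit_counts(std::size_t total_threads, std::size_t n)
{
    std::vector<int> counts(n, 0);
    for(std::size_t t = 0; t < total_threads; ++t)
    {
        grid_stride_for(t, total_threads, n, [&](std::size_t i) { ++counts[i]; });
    }
    return counts;
}
```

Every element is visited exactly once regardless of how `n` compares with the thread count, which is why the pattern scales to tensors larger than a single launch could index one-to-one.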
* update gpu reference for ck tile grouped conv
* correct c++ 18 format
* Add GPU Reference Implementations for Old CK Convolution
This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.
Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
* Fixed BF16 NaN issue (use type_convert instead of static_cast)
* Fixed FP8/BF8 arithmetic (accumulate in float)
* Fixed uninitialized variables
* All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)
- Created backward data GPU reference (naive_conv_bwd_data.hpp)
* Implements input gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
* Implements weight gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Integrated with old CK examples
* Forward: 10 XDL examples now support do_verification=2
* Backward data: Integrated with example/17_convnd_bwd_data/
* Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
* Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
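A minimal sketch of how the integer verification parameter could be decoded, assuming the three values listed above; `VerifyMode` and `parse_verify_mode` are hypothetical names and do not appear in the examples themselves:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical enum mirroring the integer values listed above.
enum class VerifyMode
{
    None         = 0,
    CpuReference = 1,
    GpuReference = 2
};

// Map the integer command-line value onto the enum, rejecting anything else.
inline VerifyMode parse_verify_mode(int do_verification)
{
    switch(do_verification)
    {
    case 0: return VerifyMode::None;
    case 1: return VerifyMode::CpuReference;
    case 2: return VerifyMode::GpuReference;
    default: throw std::invalid_argument("do_verification must be 0, 1, or 2");
    }
}
```

Rejecting unknown values explicitly avoids the old boolean behavior where any non-zero argument silently meant "verify".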
Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types
Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints
Result: Old CK GPU references can replace CPU references for verification
with 50-100x performance improvement for large tensors.
* Apply clang-format to old CK GPU reference files
* Fix C++17 compatibility: use brace initialization for aggregate types
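One plausible reading of this fix: parenthesized initialization of an aggregate is only valid from C++20 on, so C++17 code has to use braces. A minimal sketch with a hypothetical `Extent` type, not taken from the reference headers:

```cpp
#include <cassert>

// Hypothetical aggregate: no user-declared constructors, so it can only be
// initialized member-wise.
struct Extent
{
    int height;
    int width;
};

inline Extent make_extent(int h, int w)
{
    // `return Extent(h, w);` is accepted only from C++20 on (parenthesized
    // aggregate initialization); braces work in C++17 as well.
    return Extent{h, w};
}
```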
* add get_rtol, get_atol and consistency cout message
* Use triple bracket syntax for kernel launch per review feedback
Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.
All tests still passing after this change.
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode
- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)
* Use ConvDims structure to simplify GPU reference kernels
Replace 24 individual parameters with ConvDims structure per review feedback.
- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
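A minimal sketch of the parameter-bundling idea for the 2D case. The real ConvDims layout in conv_common.hpp is not shown in this message, so `ConvDims2D`, its fields, and both helper functions are hypothetical, and symmetric padding is assumed:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle for a 2D convolution.
struct ConvDims2D
{
    std::int64_t N, C, K;                // batch, input channels, output channels
    std::int64_t Hi, Wi;                 // input spatial size
    std::int64_t Y, X;                   // filter spatial size
    std::int64_t stride_h, stride_w;     // convolution strides
    std::int64_t dilation_h, dilation_w; // filter dilations
    std::int64_t pad_h, pad_w;           // padding applied on both sides (assumption)
};

// Output length along one axis for the usual convolution size formula.
constexpr std::int64_t out_len(std::int64_t in, std::int64_t filter, std::int64_t stride,
                               std::int64_t dilation, std::int64_t pad)
{
    return (in + 2 * pad - dilation * (filter - 1) - 1) / stride + 1;
}

// One struct argument instead of a long list of scalars.
constexpr std::int64_t output_elements(const ConvDims2D& d)
{
    const std::int64_t ho = out_len(d.Hi, d.Y, d.stride_h, d.dilation_h, d.pad_h);
    const std::int64_t wo = out_len(d.Wi, d.X, d.stride_w, d.dilation_w, d.pad_w);
    return d.N * d.K * ho * wo;
}
```

Passing one trivially copyable struct keeps kernel signatures short and makes it harder to swap two same-typed scalar arguments by accident.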
* Use get_block_id() and get_thread_id() helpers in CK Tile
Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.
Updated 3 CK Tile GPU reference kernels per review feedback.
* Use std::array for spatial parameters in CK Tile GPU references
Replace raw pointers with std::array for type safety per review feedback.
- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters
* Use NDimSpatial+3 for stride array sizes
Replace hardcoded [10] with [NDimSpatial+3] per review feedback.
Array sizes now correctly reflect actual dimensions needed.
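A minimal sketch of why the array size is `NDimSpatial + 3`, assuming the three extra entries are the group, batch, and channel dimensions; `packed_strides` is a hypothetical helper, not part of the reference kernels:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: packed (contiguous) strides for a tensor whose
// dimensions are G, N, C followed by NDimSpatial spatial lengths, hence
// NDimSpatial + 3 entries in total.
template <std::size_t NDimSpatial>
constexpr std::array<long, NDimSpatial + 3>
packed_strides(const std::array<long, NDimSpatial + 3>& lengths)
{
    std::array<long, NDimSpatial + 3> strides{};
    long running = 1;
    // Walk from the innermost dimension outwards.
    for(std::size_t i = NDimSpatial + 3; i-- > 0;)
    {
        strides[i] = running;
        running *= lengths[i];
    }
    return strides;
}
```

Because the array size appears in a non-deduced context, the spatial rank is given explicitly at the call site, for example `packed_strides<2>(lengths)` for a 2D convolution.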
* Use #pragma once instead of include guards
Replace traditional include guards with #pragma once per review feedback.
Updated 3 Old CK GPU reference headers.
* Fix element-wise operation output in Old CK GPU references
Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.
This ensures element-wise operations are correctly applied to output.
* Initialize element-wise operation variables
Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.
Updated backward data and backward weight kernels.
* Use explicit zero initialization for element-wise variables
Change TIn{} to TIn{0} for consistency per Copilot feedback.
All 3 kernels now use consistent zero initialization.
* Fix copyright headers to match existing style
- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range
Addresses consistency feedback.
* Rename GPU reference files: add _gpu suffix
* Refactor index calculations: use std::array and extract to helper functions
* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* wip: grouped_gemm implementation based on wmma kernel + example for fp16
* chore: clean up grouped_gemm_wmma_splitk_fp16 example
* chore: add cmake options to fully disable XDL or WMMA kernels
* feat: add tests for grouped gemm wmma instances for f16 and bf16 (all layouts)
* chore: add grouped gemm wmma bf16 example
* refactor: reuse more code between instance factory functions
* chore: turn test failure if not all batch sizes are supported into a warning
* chore: made failing of test on unsupported instances conditional to not break old tests
* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value
* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()
* fix: stray comma after parameter list
* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being run
* chore: update copyright in header comments
* nit: minor feedback
* refactor: unified XDL / wmma tests
* fix: properly disable FP8 instances when ONLY targeting gfx11
* refactor: add v3 suffix to grouped_gemm device struct name
* fix: small typos in example code
* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags
* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths
* fix: make sure to not add instance library to group if library was skipped
* fix: make sure xdl grouped gemm doesn't fail the new test
* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case
* fix: examples not working since dependent types and functions were moved to ck namespace in develop
* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances
* chore: replace/add copyright headers with new format
* Wrap ck host utilities in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our two code bases (CK and CKTile) so that they can both coexist. Moreover, establishing this compatibility is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the utils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compilation units for functions that were previously in the global namespace.
* Add using declarations to client examples.
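A minimal sketch of the namespace move and the matching using declaration. `check_error` is a hypothetical stand-in that takes a plain `int`; the real hip_check_error operates on a HIP status type, which this HIP-free example deliberately avoids:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace ck {
// Hypothetical stand-in for a host utility that used to live in the global
// namespace.
inline void check_error(int status)
{
    if(status != 0)
    {
        throw std::runtime_error("error code " + std::to_string(status));
    }
}
} // namespace ck

// A using declaration lets existing call sites keep the unqualified name.
using ck::check_error;

inline bool succeeds(int status)
{
    try
    {
        check_error(status); // resolves to ck::check_error
        return true;
    }
    catch(const std::runtime_error&)
    {
        return false;
    }
}
```

A using declaration imports a single name, so it is less likely to cause clashes than a blanket `using namespace ck;` in the same file.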
* Summary:
- Refactor epilogue (with CShuffle) to support fused operations:
- EpilogueCShuffleBase holds common parts
- EpilogueCShuffle: runs CShuffle and write out
- EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out
- Extend thread transfer v7r3:
- Support for intermediate data type different from src and dst type
- New functionality to write to dst buffer and keep data (to be able to use them for additional operations)
* Address review comments
This pull request adds some initial "factory tests": these check that the instances which are used in MIOpen are actually present in CK. The main reason for this is documentation and sanity checking. It's likely that these tests will get outdated fast, so we'll have to maintain them, but fortunately this is quite straightforward and shouldn't take a lot of time once they are in place.
* rebased on top of develop
* fixed missing shuffling and wrong indexing
* added tests for batched_b_scale
* added missing files
* fixed wrong stride computation and removed k batching (for now) due to precision issues
* reinstated k-batching with PRNG constrained to -1..1
* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow
* added k-batching to reference and increased tolerances for test
* changed gemm_b_scale and gemm_universal tests to use correct parameters
* addressed review comments
* ported fixes back to non-batched version of b_scale
* addressed review comments
* run clang-format on older commits
* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior
* added newline at end of file
* reflected changes from muitl-abd branch in batched b_scale
* fixed gfx11 issue
* changed range for pk_i4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and should always have caused compiler errors, but since there was no int4 specialization of GeneratorTensor_3 until now, this passed)
* run clang format
* set range of i4 generation to 0...1 for upstream tests to pass. This replicates previous behavior, which however means that it is NOT properly tested.
* reduced range for pk_i4 even further to 0..0
* removed failing xdl instances. The failures were uncovered once the tests were fixed
* removed generation of int4 values entirely
* divide B buffer by BPackedSize
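The two-step conversion mentioned above (first to AccDataType, then to CDataType) can change the result compared with a single direct conversion. A minimal sketch, assuming a host reference that holds its sum in `double`; the function names and the concrete types are illustrative only and are not taken from the reference implementation.

```cpp
// Convert the way the commit describes: host value -> AccDataType -> CDataType.
template <typename CDataType, typename AccDataType>
constexpr CDataType convert_like_gpu(double host_sum)
{
    return static_cast<CDataType>(static_cast<AccDataType>(host_sum));
}

// Convert in a single step, skipping the accumulator type.
template <typename CDataType>
constexpr CDataType convert_directly(double host_sum)
{
    return static_cast<CDataType>(host_sum);
}

// 2^24 + 1 is exactly representable in double but not in float.
constexpr double host_sum_example = 16777217.0;

static_assert(convert_directly<int>(host_sum_example) == 16777217, "direct conversion keeps the low bit");
static_assert(convert_like_gpu<int, float>(host_sum_example) == 16777216,
              "going through float rounds first");
```

If the reference skips the intermediate type, it can differ from the device result by one rounding step, which is enough to fail an exact or tight-tolerance comparison.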
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances
* add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances
* add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances
* tf32:conv:add instances for base class DeviceConvFwd
* tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD
* tf32:conv:add instances for base class DeviceGroupedConvBwdWeight
* add tf32 in profiler
* remove gnhwc/ngchw/ngcdhw instances
* remove non-ndhwgc/nhwgc/nhwc instances
* add check in IsSupportedArgument()
* Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)"
This reverts commit 03b59f8c76.
* fix compile error on gfx12x
* only run tf32 example on gfx942
* only build tf32 instance on gfx942
* ckProfiler:only support tf32 in gfx942
* delete useless messages
- profiler for gemm quantization for DL/XDL
- tests for gemm quantization for DL/XDL
- implementation for gemm quantization for WMMA
- profiler/tests for gemm quantization for WMMA
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Added gemm + reduce instance library for RDNA4. This includes:
- New device implementation running GEMM and reduction kernel
- instances for wmma (xdl parity)
- examples for wmma (xdl parity)
- tests for existing xdl and wmma
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Updates to support mixed precision
(cherry picked from commit e65d71180393e7b66169c56565a6bac740427de6)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip
(cherry picked from commit f8c06322df0abcbd5945a56cdf5bffe56480f9f0)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Added support for F8xF16xF16 to gemm_wmma_universal
(cherry picked from commit 15c851de6daa513a12c2e3af299bab0176175fb5)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Added support for F16xF8xF16 to gemm_wmma_universal
* Added support for BF16xI4xBF16 to gemm_wmma_universal
(cherry picked from commit c6a4a69d2d43d59bae8bdabfae80d648646f217e)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Added support for F16xI4xF16 to gemm_wmma_universal
* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType
* Added missing test class for FP16_KM_NK
* Pre-commit hooks fixes
* Added padding instances for f16xf16xf16
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
(cherry picked from commit 5bdc993dbf)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Fixed cmake build errors related to test_fp8
(cherry picked from commit 12176616b6)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Amending changes for adding support for padding instances for f16xf16xf16
* Fixes for padding instances for f16xf16xf16
* Added padding instances for bf16xbf16, f8xf8
* Added packed instances for bf16xi4xbf16
* Added padding instances for f8xf16xf16
* Added padding instances for f16xf8xf16, f16xi4xf16
* Fixed typos for bf16xbf16xbf16 padding instances
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding is not supported when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build errors
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemm_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions); the wrapper allows
DeviceGemmMultipleDSplitK instances to be used there.
* removed unnecessary ck parts from compilation
* initial gemm_add_multiply instance implementations
* fixed profiler help message for gemm_add_multiply
* improved multiply_add profiler layout help
* fixed template arguments for test instances
* added test for gemm_add_multiply
* switched to splitK interface
* log print added to splitk benchmarks
* revert main cmake comments
* newline change reverted
* added add_fastgelu instances
* revert unintended change in xdl add_fastgelu
* created gemm_add_add_fastgelu instances
* created fastgelu instances
* added tests for all splitk fastgelus
* Added tests.
* multiply_add instances created
* updates to add_multiply splitk instances
* splitk xdl test fixes
* added wmma multiply_multiply instances
* fixed ONLY_XDL_AND_WMMA_KERNELS tag
* Added gemm_add examples for wmma v1 and v3
* fixed / worked around i8 instances
* Modified the v3 code to add one fp16 bxdl instance.
* added bf16 xdl instance.
* adding gemm_add wmma_cshuffle and other support
(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* add instances into CMakeLists
(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* This is work in progress, edited the template parameters in order to build
(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* temp work saved, changed the BDataType to f16 or bf16 since wmma currently does not support differing A and B data types
(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* added datatype and use clang-format-12
(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* Fixing build errors
* Added instances for v3
* Adding instances and executables
* Code update of template parameters modified.
* Renamed file.
* Added tests.
* resolved error tests.
* Fixing build errors
* Updated comments
* removed the changes as per the MR review comment.
* Updated tests.
* fp8 instances - not tested
* Restored the Cmake file that was reverted by mistake during rebase.
* fixed wmma_op test
* Updated comments.
* Updated the template parameter description
* fixed rdna4 instances
* fixed back compatibility on gfx11
* cleanups
* fix ckProfiler
* one more cmake fix
* added fp8 instances
* Updated tests to add BF16 instances as per review comment
* Added include file and cleaned up (as per review comment)
* Updated and optimized the example code for all types.
* Fixed clang format
* Resolve "Implement `device_gemm_bilinear` for RDNA4"
* test generalization to handle FP16 shuffle better
* added missing changes
* Added bf16 wmma instance for add_relu
* Added f16 wmma instance and corrected bf16 instance errors.
* Added instances to Cmake
* Modified the template parameters to make the instances work.
* Fixed typo in profiler
* Added v3 instances for gemm_add_relu
* addressed core review comments
* Added test for gemm_add_relu wmma instance
* Cleaned up the code.
* Added examples for gemm_add_relu
* Fixing typo to resolve build errors.
* Fixes applied to fix the precision loss.
* fix bilinear test after merge
* Removed the old wmma instances.
* Added wrapper and renamed the wmma_v3 instances
* Updated copyrights and added wrappers.
* Fixes applied according to review comments
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Robin Voetter <robin@streamhpc.com>
* Removed the old wmma instances.
* Updated wrapper for the v3 instances
* removed the old wmma examples
* Renamed the v3 instances
* Deleted the gtest file added by mistake.
* Updated the profiler with wrapper
* Fixed test errors.
* Fixed the review comments
* Fixed the if condition MACROS.
* REVERTED THE PROFILER CHANGES
* Revert "REVERTED THE PROFILER CHANGES"
This reverts commit 21cb98546c.
* Revert "Fixed test errors."
This reverts commit 13efcc6fe1.
* Revert "Updated thge profiler with wrapper"
This reverts commit 536f86661d.
* Added missing wrapper instances
* Updated copyrights.
* Fixed typo.
* Fixed copyrights.
* Updated copyrights.
* updated copyrights.
* comments on the atomics workaround
* fixed cmake comment
* Fix bug from merge
* clang-format-18
* Fix compilation error
* Fix linking error
* Fix bug in add and add_relu examples
* Fix error including file (typo)
* Quick fix to compile examples for different targets
* Fix for multi target
* implemented f16 and bf16 instances for gemm_silu
* addressed review comments
* addressed review comments
* Fix clang format
* Fix clang format
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Create new copies of existing device struct and gridwise struct for batched_gemm_softmax_gemm and disable the softmax part. Still based on old wmma pipelines. Also copy the example and remove the softmax part from the reference calculation. Works and results match reference except for tiny float errors in problem 2.
* Turn DeviceBatchedGemmGemm_Wmma_CShuffleV3 into a proper DeviceBatchedGemmGemm derived class, with the right argument and invoker functions. Update example to use new definitions.
* Remove unused cross-attention and self-attention kernels, arguments, and invokers. Also remove other unused Argument types.
* Remove masking related code, test unusual sizes in example.
* Remove remaining softmax related code from GridwiseBatchedGemmGemm_wmma_cshuffle_v3 and example.
* Remove code related to numDims, bias, and TensorSpec from Device struct and example.
* Add layout template parameters to device struct
* Move (NPerBlock, LTilePerBlock) device struct template arguments up by two places to match XDL template argument ordering.
* Merge accumulation data types into one type to match XDL device struct.
* Remove NPerWmma template parameter from device struct and just set it equal to LPerWmma. Now device struct template params exactly match those for XDL batched gemm gemm.
* Add support for RCCR layout and test this in example
* Add batched_gemm_gemm_wmma to instance library + profiler, and add gtest just like for xdl.
* Add RCCR instance and additional RCRR instance to library.
* Remove unused permute and alpha related code. Time all tests. Fix B1 strides in argument verification.
* Remove references to G0, G1 in favor of batch, reduce dimensionality of length and stride arrays.
* Managed to replace old wmma gridwise pipeline and blockwise struct with new wmma blockwise pipeline. Some cleanup required but all tests pass.
* Make TransposeC a proper template parameter that gets passed all the way from BlockGemmPipeline_Selector to WmmaGemm so we can use the correct settings for batched gemm gemm as well as regular gemm. Gemm universal tests now pass again.
* Replace old LoopSched and PipelineVer params with BlockwiseGemm pipeline equivalents, and use these in instance factory. The v3 pipeline does not work yet, but v1 works for intrawave and interwave.
* Adapt the A wave descriptor to deal with RDNA4 wmma. This fixes batched gemm gemm functionality on RDNA4.
* Fixed two aspects of the v3 pipeline that were incorrect. First, the blockwise copy operator was invoked once too many times in all cases (RunRead and move window), which broke batched gemm gemm when the blockwise pipeline was used multiple times. Second, the main loop (hot loop) should be used for num_k_loop >= 2 instead of num_k_loop >= 3. Now we can support any K dimension.
* Remove num prefetch parameter from gridwise struct since we don't use it and it doesn't do anything.
* Remove unused non-lds paths.
* Test and update the IsSupportedArgument() and CheckValidity() functions for all layouts + padding modes and various problem sizes.
* Add a lot of instances to the profiler with various blocksizes and pipelines, all verified.
* Add support for BF16: instance library, tests, and examples.
* Add examples for int8 and fp8, had to add type_convert_sp template specializations for the latter.
* Template the library instance lists and add default padding instances.
* Move memory calculations from the kernel to the Argument constructor. Also actually parse and use the user-provided batch strides.
* Actually parse and use user-provided regular strides.
* More refactor: remove references to multiple dims per dims, and g0 / g1. Also move xdl specific test utils out of generic test util header.
* Small post-rebase-on-develop fix due to bscale-related pipeline changes. All tests rerun + tested bscale and regular gemm.
* Introduce the correct GetCThreadDescriptor function in the blockwise gemm pipelines for the TransposeC=true case. It turns out to be identical for our batched gemm gemm (gemm0) usecases, but could theoretically be different for wmma_gemm instances with smaller-than-4-byte output data size.
* Remove unused NumPrefetch template parameter, we don't need to match the XDL template params one-to-one.
* Implement proper TailNum and HasMainLoop template parameters for the v3 pipeline. Now the Run() function knows at compile time whether there are 1, 2, or more loops in total, and adds or removes sections accordingly. It still uses the blockwise copy operators the correct amount of times.
* Add print lambda with env check and file and func to device and gridwise level compatibility error messages. Also respect compatibility in example script.
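The v3 pipeline fixes above concern how often the blockwise copy runs and when the hot loop is entered. The following is a generic sketch of a software-pipelined K loop with a single prefetch stage, not the actual v3 pipeline; `PipelineCounts` and `run_pipeline_sketch` are hypothetical names, and the counters only model the number of invocations.

```cpp
struct PipelineCounts
{
    int reads;    // blockwise copy invocations (hypothetical counter)
    int computes; // blockwise GEMM invocations (hypothetical counter)
};

// Assumes num_k_loop >= 1.
constexpr PipelineCounts run_pipeline_sketch(int num_k_loop)
{
    PipelineCounts c{0, 0};
    c.reads += 1; // prologue: prefetch K tile 0
    // Hot loop: entered as soon as there are 2 or more K tiles.
    for(int i = 0; i + 1 < num_k_loop; ++i)
    {
        c.reads += 1;    // prefetch K tile i + 1
        c.computes += 1; // compute on K tile i
    }
    c.computes += 1; // tail: compute on the last tile, no further read
    return c;
}

static_assert(run_pipeline_sketch(1).reads == 1, "one tile, one read");
static_assert(run_pipeline_sketch(2).reads == 2, "the hot loop runs once for two tiles");
```

In this model every K tile is read exactly once and computed exactly once. An extra read in the tail would advance the copy window past the end, which only becomes visible when the same pipeline object is run again.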
* RDNA3 does not support fp8
* Start adding other layouts for gemm_ab_scale
* Add some instances
* Create tensor descriptors for A/B scales depending on A/B layout
* Fix formatting
* Revert some comments
* Revert commented instances in CMakeLists.txt
* Add some more instances for col-row gemm
* enable more row,row instances
* Use occupancy=1 for col,row layout to avoid spills
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw
Co-authored-by: azhuang <anzhong.huang@amd.com>
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
* revert back to atomic add using casting
1. When the conv spec is 1x1 stride1 pad0, NCHW is equivalent to a column-major matrix A, so we only need a minor change in the conv transformer to support it.
2. When the output is NKHW, it is equivalent to a column-major matrix C. We need to swap A and B to get the best performance.
3. Add new instance device_grouped_conv_fwd_xdl_f16_nchw_instances for nchw.
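The equivalence in item 1 can be checked with plain index arithmetic. This is a sketch under simplifying assumptions: it looks at a single image (fixed n, no groups), and the helper names are hypothetical and not part of the conv transformer.

```cpp
#include <cstddef>

// Offset of element (c, h, w) inside one NCHW image with H x W pixels per channel.
constexpr std::size_t nchw_offset(
    std::size_t c, std::size_t h, std::size_t w, std::size_t H, std::size_t W)
{
    return (c * H + h) * W + w;
}

// Offset of A(m, k) in a column-major matrix with M rows.
constexpr std::size_t col_major_offset(std::size_t m, std::size_t k, std::size_t M)
{
    return k * M + m;
}

// For a 1x1, stride-1, pad-0 convolution: m = h * W + w, k = c, M = H * W.
constexpr bool layouts_match(std::size_t C, std::size_t H, std::size_t W)
{
    for(std::size_t c = 0; c < C; ++c)
        for(std::size_t h = 0; h < H; ++h)
            for(std::size_t w = 0; w < W; ++w)
                if(nchw_offset(c, h, w, H, W) != col_major_offset(h * W + w, c, H * W))
                    return false;
    return true;
}

static_assert(layouts_match(3, 4, 5), "NCHW image matches a column-major (H*W) x C matrix");
```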
* Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty.
* Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good.
* Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet.
* Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3. In principle the whole thing works now, just need to add other data types and perhaps do some cleanup.
* Add other layouts for batched gemm wmma cshuffle v3 f16 f16 f16. Now matching XDL (for f16).
* Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts.
* Fixup comments and TODOs
* Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small.
* Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests.
* Take batching into account when using rotating memory or clearing the C tensor.
* Implement small refactors / comments etc. from review.
* Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists.
* Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems.
* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups.
* Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function.
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo to break up tuples.
Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.
time ninja -j 64 device_grouped_conv3d_fwd_instance
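The idea of breaking a tuple of instances into shards by index can be sketched as follows. This is an illustration only, not the implementation of filter_tuple_by_modulo: the names carry a `_sketch` or `_impl` suffix to mark them as hypothetical, and a tuple of integers stands in for a tuple of kernel instances.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Picks elements Shard, Shard + NumShards, Shard + 2 * NumShards, ... from the tuple.
template <std::size_t Shard, std::size_t NumShards, typename Tuple, std::size_t... Is>
constexpr auto filter_by_modulo_impl(const Tuple& t, std::index_sequence<Is...>)
{
    return std::make_tuple(std::get<Shard + Is * NumShards>(t)...);
}

// Keeps the elements whose index i satisfies i % NumShards == Shard.
template <std::size_t Shard, std::size_t NumShards, typename Tuple>
constexpr auto filter_tuple_by_modulo_sketch(const Tuple& t)
{
    static_assert(Shard < NumShards, "shard index must be smaller than the shard count");
    constexpr std::size_t size = std::tuple_size<Tuple>::value;
    constexpr std::size_t kept =
        size > Shard ? (size - Shard + NumShards - 1) / NumShards : 0;
    return filter_by_modulo_impl<Shard, NumShards>(t, std::make_index_sequence<kept>{});
}

// Five stand-in "instances" split into two shards: indices {0, 2, 4} and {1, 3}.
constexpr auto all_instances = std::make_tuple(10, 11, 12, 13, 14);
constexpr auto shard0        = filter_tuple_by_modulo_sketch<0, 2>(all_instances);
constexpr auto shard1        = filter_tuple_by_modulo_sketch<1, 2>(all_instances);

static_assert(std::tuple_size<decltype(shard0)>::value == 3, "shard 0 holds three elements");
static_assert(std::tuple_size<decltype(shard1)>::value == 2, "shard 1 holds two elements");
static_assert(std::get<2>(shard0) == 14, "last element of shard 0");
```

Each shard can then be instantiated in its own source file, so the shards compile in parallel instead of as one large translation unit.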
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments I tried on the code.
This was hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for the
instantiating the kernels and to generate the calling function.
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
* Remove accidental copy of a file
* Remove accidental copies of template files.
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>