* Added bias_bnorm_clamp instances.
* fwd_bias_bnorm_clamp comp instances
* fwd_bias_bnorm_mem_inter and mem_intra instances
* fwd_bias_bnorm_merged_group_instances
* fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances
* Device level changes for fwd_bias_bnorm_clamp
* Added the test to the regression test list.
* Removed the part 2 and 2x instances
* Removed the irrelevant checks in wmma
* Refactored the instances to adapt to new device implementation
* Updated the reference and include files
* enabling tests
* Added missing profiler
* Added missing instance entry , deleted by mistake
* Reduce bias bnorm clamp instances to only a single generic one.
* Clean up cmakelists file
* clang-format
* Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up.
* Renaming some instance lists and add functions to be more standardized.
* Commented out non default instances.
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation
* wip: many fixes in implementation of batched gemm gemm multiple d
* wip: batched gemm gemm multiple d gridwise op compiling, not working yet
* fix: incorrect d0 grid indexing in batched gemm gemm multipled
* feat: add instances for batched gemm add relu gemm add
* chore: configure instance with low vector transfer size for odd sizes
* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense
* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes
* fix: disable odd size tests on XDL archs
* chore: removed temporary logging
* chore: update some references to C tensor to E tensor
* Tentative fix for example template params
* Tentative fix for non-multi-D batched gemm gemm device impl.
* Tentative fix for xdl example template params
* Tentative fix for profiler build on gfx90a
* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions
* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply
* fix: make the gemm1 data types match what happens in the device op
* feat: add d0s/d1s datatypes and layouts to the device op type string
* chore: change element-wise op so addition happens in fp32
* chore: add static asserts for gemm0/gemm1 calculated wave sizes
* chore: also updated other element-wise ops to use fp32 calculations
* chore: log number of supported instances
* chore: update instance comment
* chore: disable kernel timing in example by default
* fix: gemm1 wave size calculation
* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions
* chore: remove increased tolerance in batched gemm gemm multiple d example
* chore: add comment explaining that verification fails for certain input values
* chore: clarify instance comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)
* wip: device struct for WMMA batched contraction multiple d based on new gridwise op
* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases
* fix: failure to resolve template parameters when calling new function overload
* fix: passing reference type as parameter instead of underlying types
* fix: merge error caused duplicate definitions
* fix: make sure constness of template and parameters types match
* fix: don't compile batched contraction test on unsupported architectures
* feat: add example for new wmma implementation, and consolidate example code between platforms
* style: return inline instead of with branch
* chore: add extra assert on vector memory access sizes
* chore: clean up some unused variables
* fix: correct tail number calculation, added small cases and extra instances to the test
* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
* feat: grouped gemm tile loop support for RDNA4
* fix: removed extra parameter from grouped gemm example instance
* fix: FP8 check incorrectly enabling FP8 on RDNA3
* Added device level implementation for bwd_data_wmma_v3.
* Added first instance of bwd_data_wmma_v3(f16).
* Add support for bwd data in gridwise implementation
Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight
* Initial device implementation of bwd data
* Remove unused template parameters in device impl
* Add one instance for different layout
initial check of device implementation
* Add tests for splitk and for different layouts
* Appended more instances to wmma_v3_f16.
* Added conv_2d bf16 wmma_v3 instances.
* Added conv_3d_bf16 wmma_v3_instances.
* Added conv_3d_f16_wmma_v3_instances.
* Added SplitN test cases for wmma.
* Conv3d_bwd_data_scale_wmma_v3 instances.
* Conv3d_bwd_data_bilinear_wmma_v3_instances
* Renaming the device level instances file to common name , since it is defined for different DataTypes.
* Renaming the instances and fixing typo
* Added the test cases to regression test list
* NCHW support for wmma_v3
* Examples for bf16 and f16 bwd_data_wmma_v3
* Added transpose conditons for device impl
* fixing bugs
* Added the gemm_args array implmentation
* WIP debug conv bwd
* fix splitk
* Grouped gemm fix
* Update CmakeLists with EOF
* Added more instances for tests
* Fixed the run time error in examples and removed 3d conv examples.
* Fixed a typo.
* Updated CmakeLists to removed the 3d convultion deleted files
* Added print error statements for unsupoorted argument
* Added the merge conflict related changes
* Fixed compilation error
* Fixed the InstanceFactory duplication error.
* Removed the print statements and added logs to Arg function
* All the merge conflict related errors resolved
* Added d_tensor tests.
* Added the missing example types of wmm_v3
* Merge error fix
* Corrected the instance name
* Reverted the bias relu change
* Revereted the transpose load local change
* Updated the regression test list with bwd_data_scale
* Revert "Revereted the transpose load local change"
This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.
* Revert "Merge error fix"
This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.
* Reverting the local change
* Added merge error fix
* Build error fix due to merge conflicts
* Added bias_relu example for wmma_v3
* Modified the main method in dtensor tests
* Updated the dtensor tests to pick all the shapes
* Updated the dtensor test shapes.
* Updated the mem operations in tests.
* Added reference func
* Fixed typos in device impl
* Added new header file and modified the include file for 3d tests
* Renamed the test file and added reference func call.
* clang format fix
* Added ignore params
* Modified device impl and tests
* Removed debug print statements and updated dtensor test shapes
* Fixing merge conflicts
* Fixing more merge conflicts
* Fixed copyrights
* Updated the tuned instances to bilinear and scale.
* Adding tuned instances to vanilla wmma_v3
* Removed all unused instances and modified test layouts.
* Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances.
* Fix clang format
* Updated tuned f16/-genric instances
* Formatting the instances file
* Fixed copyrights and clang issues
* Nonsense commit to force git to force
* Removed the transpose instances
* Added verified genric instances
* Fixing namespace errors
* Added todo for failing shapes
* Formatting instance file
* Fix instance list formatting
* Removing unnecessary formats
* Renamed the common file
* Unification of xdl and wmma bwd_data tests
* Updated Cmake
* Added all layout types and deleted code.
* Updated Cmake to add the condition to all tests.
---------
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build erros
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemma_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* removed unnecessary ck parts from compilation
* initial gemm_add_multiply instance implementations
* fixed profiler help message for gemm_add_multiply
* improved multiply_add profiler layout help
* fixed template arguments for test instances
* added test for gemm_add_multiply
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemma_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* switched to splitK interface
* log print added to splitk benchmarks
* revert main cmake comments
* newline change reverted
* added add_fastgelu instances
* revert unintended change in xdl add_fastgelu
* created gemm_add_add_fastgelu instances
* created fastegelu instances
* added tests for all splitk fastgelus
* Added tests.
* multiply_add instances created
* updates to add_multiply splitk instances
* splitk xdl test fixes
* added wmma multiply_multiply instances
* fixed ONLY_XDL_AND_WMMA_KERNELS tag
* Added gemm_add examples for wmma v1 and v3
* fixed / workarounded i8 instances
* Modified the v3 code to added one fp16 bxdl instance.
* added bf16 xdl instance.
* adding gemm_add wmma_cshuffle and other support
(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* add instances into camkelists
(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* This is work in progress, edited the template parameters in order to build
(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype
(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* added datatype and use clang-format-12
(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* Fixing build errors
* Added instances for v3
* Adding instances and executables
* Code update of template parameters modified.
* Renamed file.
* Added tests.
* resolved error tests.
* Fixing build errors
* Updated comments
* removed the changes as per the MR review comment.
* Updated tests.
* fp8 instances - not tested
* Restored the Cmake file that was reverted by mistake during rebase.
* fixed wmma_op test
* Updated comments.
* Updated the template parameter description
* fixed rdna4 instances
* fixed back compatibility on gfx11
* cleanups
* fix ckProfiler
* one more cmake fix
* added fp8 instances
* Updated tests to ad BF16 instances as per review comment
* Added include file and cleaned up(as per review comment)
* Updated and optimized the example code for all types.
* Fixed clang format
* Resolve "Implement `device_gemm_bilinear` for RDNA4"
* test generalization to handle FP16 shuffle better
* added missing changes
* Added bf16 wmma instance for add_relu
* Added f16 wmma instance and corrected bf16 instance errors.
* Added instances to Cmake
* Modified the template parameters to make the instances work.
* Fixed typo in profiler
* Added v3 instances for gemm_add_relu
* addressed core review comments
* Added test for gemm_add_relu wmma instance
* Cleaned up the code.
* Added examples for gemm_add_relu
* Fixing typo to resolve build errors.
* Fixes applied to fix the precision loss.
* fix billinear test after merge
* Removed the old wmma instances.
* Added wrapper and renamed the wmma_v3 instances
* Updated copyrights and added wrappers.
* Fixes applied according to review comments
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Robin Voetter <robin@streamhpc.com>
* Removed the old wmma instances.
* Updated wrapper for the v3 instances
* removed the old wmma examples
* Renamed the v3 instances
* Deleted the gtest file added by mistake.
* Updated thge profiler with wrapper
* Fixed test errors.
* Fixed the review comments
* Fixed the if condition MACROS.
* REVERTED THE PROFILER CHANGES
* Revert "REVERTED THE PROFILER CHANGES"
This reverts commit 21cb98546c.
* Revert "Fixed test errors."
This reverts commit 13efcc6fe1.
* Revert "Updated thge profiler with wrapper"
This reverts commit 536f86661d.
* Added missing wrapper instances
* Updated copyrights.
* Fixed typo.
* Fixed copyrights.
* Updated copyrights.
* updated copyrights.
* comments on the atomics workaround
* fixed cmake comment
* Fix bug from merge
* clang-format-18
* Fix compilation error
* multi_abd wmma support:
- Add multiple A and B support to multiple D implementation (gridwise level)
- Add multi_abd GEMM (device level)
- Add instances (xdl parity)
- Add tests (both xdl and wmma)
- Add examples
- Add ckProfiler support (both xdl and wmma)
* Fix bug in device print function
* Fix unused template parameter
* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.
* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.
* Make relevant profilers print the number of valid instances to aid testing.
* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now. Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.
* Reset output buffer after each run in profile_grouped_conv_fwd_impl().
* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.
* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.
* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.
* Add int8 instances for 2D vanilla grouped conv fwd all layouts.
* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd
* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.
* Implement multi-AB support for grouped conv fwd and add example.
* Add 1D instances
* Add D layout tests to IsSupportedArgument()
* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.
* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.
* Add some more tests for vanilla grouped conv fwd
* Add 2D bias clamp instances and tests
* Add 3D bias clamp instances and tests
* Add 2D and 3D clamp instances and tests
* Unify problem sizes across vanilla and clamp flavor tests
* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.
* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.
* Remove wmma fp8 and bf8 instances when not targetting gfx12
* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build
* Remove old years for newly created files.
* No need to time kernels for now.
* Fixup comments
* Pass struct args to Gridwise Run() function by reference.
* Don't use workspace memory in the case where A needs explicit transposition but B does not.
* Move calculation of rotating memory buffer sizes to Argument member functions.
* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.
* Unify xdl and wmma example code for grouped conv fwd scaleadd ab
* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.
* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.
* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.
* Small post-merge fixup, everything seems to work.
* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.
* Extend scaleadd_ab instance lists
* Extend merged groups instance lists, including adaptations of xdl "2x" instances.
* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.
* Extend "mem" instance lists.
* Extend regular instance lists.
* Fixup comments and ignored kernel arg name
* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.
* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.
* Re-enable sharding for wmma cshufflev3 instances
* Post merge fix to vanilla test
* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.
* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.
* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.
* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.
* Re-enable old wmma instances
* Re-enable Linqun's Xdl Wmma instances
* Small post-merge fixes
* Fix copyright headers
* Remove commented code snippet in gridwise
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.
* Adding tuned instace list for groupoed conv fwd (#3288)
Following flavors are updated with tuned instance list:
- grouped_conv2d_fwd
- grouped_conv2d_fwd_bias_clamp
- grouped_conv2d_fwd_clamp
- grouped_conv3d_fwd
- grouped_conv3d_fwd_bias_clamp
- grouped_conv3d_fwd_clamp
- grouped_conv3d_fwd_scaleadd_ab
Re-factored instance selection:
- removed all the unnecessary instance tuples (comp/mem/16x16/generic)
- removed all unnecessary layouts and data types
* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.
* Splitting grouped conv fwd instances (#3449)
* Disable unnecessary and failing tests related to experimental CK builder
* Disable unnecessary ck builder experimental tests fully
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Wojciech Laskowski <77888887+wj-laskowski@users.noreply.github.com>
Refactor and integrate CK GPU references into ckProfiler.
- All convolution layouts and groupings supported for all three directions
- Unit tests verifying GPU and CPU reference is the same
- Support added to profiler (do_verification = 2 enables GPU reference)
- One profiler-based test per direction changed to GPU reference to demonstrate usag
Closes AICK-427
* Support gemm_ab_scale:
- Add tests
- Integrate scaling implementation in multiple D
- Generalize existing b_scale for ab_scale
- Add instances
- Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK
- Add support for all layouts supported by xdl
- Fix splitk xdl
* Fix copyright
* Wmma support for gemm_blockscale_wp (#3315)
* Support for preshuffle with ab scale
- add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale
- add support for AScaleLayout amnd BScaleLayout (can be different
from ALayout and BLayout, respectively)
- add Run method in v1 pipeline to support preshuffle + scaling
- add support for preshuffle gemms in common invoker
- Add splitk support
* Fix copyright header
Old sequence sort code was showing up on build profiles. Convert it to constexpr functions for much more efficient build-time execution. The sorting is still O(N^2), but our sequences are small enough it executes quickly. This reduced compilation time of a small convolution by more than 10% and time overall time spent in the compiler on a narrow build by %6.
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
* Persistent Stream-K Kernel Implementation
This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.
The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.
* Unit Tests for Persistent Stream-K Kernel
This change contains the inital test suite for the Persitent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.
A future commit will add tests for the non-persistent kernel.
Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.
* Adding implementation for the Non-Persistent Stream-K kernel
This code is adding the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.
There has been an addition to the TilePartitioner to get access to the whether
Persistent has been set to true or false in the StreamKKernel.
* Adding in the tests for the Non-Persistent Stream-K kernel
* Refactor Stream-K Reboot Unit Tests
This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
This ensures that each test case is one of Edge Case, SK Only, DP
Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
moves this case into a separate .inc file and labels it as an extended
test.
- Since the extended test takes > 30 seconds to run, this test is added
to the list of regression tests.
* Fix spelling errors in comments for test cases
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Changes based on review
Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* rebased on top of develop
* fixed missing shuffeling and wrong indexing
* added tests for batched_b_scale
* added missing files
* fixed wrong stride computation and removed k batching (for now) due to precision issues
* reinstated k-batching with PRNG constrained to -1..1
* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow
* added k-batching to reference and increased tolerances for test
* changed gemm_b_scale and gemm_universal tests to use correct parameters
* adressed review commentsd
* ported fixes back to non-batched version of b_scale
* adressed review comments
* run clang-format on older commits
* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior
* added newline at end of file
* reflected changes from muitl-abd branch in batched b_scale
* fixed gfx11 issue
* changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed
* run clang format
* set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested.
* reduced range for pk_i4 even further to 0..0
* removed failing xld instances. Failure now uncovered now that tests were fixed
* removed generation of int4 values entierly
* divide B buffer by BPackedSize
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite
* Account for stride when computing K offsets for A and B tensor
This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correct regardless of
whether A and B are row or column major.
* Move helper code to test_gemm_streamk_util.hpp
* Separate tests into smoke/regression/extended. Add bf16 datatype
* Run clang-format
* Refactor combinatorial macro expansion and naming
* Adjust the initialization values to account for better tolerance on bf16
* Correct BF16 datatypes in comments
* Move the extended tests under the REGRESSION_TESTS label
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
- profiler for gemm quantization for DL/XDL
- tests for gemm quantization for DL/XDL
- implementation for gemm quantization for WMMA
- profiler/tests for gemm qunatization for WMMA
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Added gemm + reduce instance library for RDNA4. This includes:
- New device implementation running GEMM and reduction kernel
- instances for wmma (xdl parity)
- examples for wmma (xdl parity)
- tests for existing xdl and wmma
* Improve random number generation
* use different seed for each input (Q, K, V...);
* use deterministic generation of:
* seqstart_q/k (for group mode);
* block_table (for paged-kvcahe);
* cache_batch_idx (for kvcache);
* Extract arg_parser-related code from run functions to use them as tests
* Split examples into main programs and fmha runners, build instances separately
* Add dummy tests that use instances and runners
* Fix a missed corner case of f32->f8 conversion
When value if < min f8 denormal but > min f8 denormal / 2, it must be
rounded to min f8 denormal (i.e. 0b1), not to 0.
* Fix incorrect fp8 scales for P and O in validation code
DataTypeConfig was incorrectly compared with fp8_t.
* Add host generation of dropout random values and use it for validation
Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.
* Implement tests from smoke_test_bwd.sh
* Return result as enum to distinguish failure and missing instance
* Add tests for bwd features: bias, alibi, dropout
* Implement tests from smoke_test_fwd.sh
* Pass seqlen_q/k as vectors to fwd and bwd runners
* Add tests for fwd features: bias, alibi, dropout
* Add tests for pagedkv and splitkv
* Fix conditions when to use splitkv and pagedkv kernels
splitkv was executed only when use_kvcache which == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.
* Add tests for appendkv
* Use is_v_rowmajor = true because there are no instances with column layout anymore
* Split public and private compile options for instances
Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.
* Improve parsing validation in bias and mask
* Pass bias as string for consistency with mask
* Catch parsing and other exceptions
* Add bwd test for deterministic flag
* Initialize fp8 tensors (-init=ufq) similarly to uf
* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null
seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr
may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).
* Fix use_pagedkv value when kvcache = true but page_block_size = 0
In this case block_table_ptr is nullptr which is accessed in the kernel.
* Clean up bwd tests
* Unify fwd tests for f16/bf16 and fp8
* Use better explicit instantiation declaration for fmha_bwd<2>
* Use the same seed for all tests, allow to override it with env variable
* Undo clang-format of one irrelevant file
For some reason my local clang-format-18 and the one in CI work differently.
* Do not build instances and tests on unsupported archs
* Build instance libraries as OBJECT library
* CI: Enable sccache for HIP
There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache
* Add tests to REGRESSION_TESTS
* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0
The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.
* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options
The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.
* Fix formatting and names
* Create tests for ck tile batched transpose using example
* Create ck tile tests for smoothquant using examples
* fix precision input strings and convert batched transpose to regression tests
* Code cleanup and fix asserts
* add missing licenses
* update copyright and licensing in files
* Update smoothquant tests to use example's smoothquant.cpp
* Add custom target for batched transpose tests
* Add missing new lines at end of files for CMakelists
* fix typo in batched transpose CMakeList target_compile_options
---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
* Fix argument order for calls to profile_batched_gemm_impl()
* Revert previous and swap the order of the profile_batched_gemm_impl() function arguments instead.
* Revert copyright years for unchanged files.
* Remove test_batched_gemm from REGRESSION_TESTS since it no longer takes more than 30 seconds to run.
---------
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Updates to support mixed precision
* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip
* Added support for F8xF16xF16 to gemm_wmma_universal
* Added support for F16xF8xF16 to gemm_wmma_universal
* Added support for BF16xI4xBF16 to gemm_wmma_universal
* Added support for F16xI4xF16 to gemm_wmma_universal
* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType
* Added missing test class for FP16_KM_NK
* Pre-commit hooks fixes
* Added padding instances for f16xf16xf16
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Ammending changes for adding support for padding instances for f16xf16xf16
* Fixes for padding instances for f16xf16xf16
* Added padding instances for bf16xbf16, f8xf8
* Added packed instances for bf16xi4xbf16
* Added padding instances for f8xf16xf16
* Added padding instances for f16xf8xf16, f16xi4xf16
* Fixed typos for bf16xbf16xbf16 padding instances
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build erros
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Fix int8 DTYPES check for gemm_bilinear
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
* Add gemm_mx_fp8_bf8 example with row-major B
* Add more overloads of MX MFMA instructions
* Add MK_KN (RRR) tests
* Add KM_NK (CCR) tests
* Add more problem sizes to Large tests
* Add test_gemm_mx to the list of regression tests
* enable batched_gemm_softmax_gemm_perm_wmma for gfx12
* disable instances with blocksize=256 in attention examples
* debuggging
* debug
* fixed lds_enabled
* debugging
* Fix and add limit to skiplds feature
* Enable skipLds feature and fix compilation bugs
* add ck_tile definitions for gfx12
* fix clang format and test/wmma_op
* updage instances cmake for gfx12
* disable the test_wmma_op on gfx12
* fix the builds for gfx950
* add gfx12 and gfx950 to default target list
* clean-up cmake file
* Initial introduction of OFP8 data types.
* Renamed FP8 and BF8 tests into FP8_FNUZ and BF8_FNUZ.
* Implementation of ConvertFP32Nearest in test_fp8_ocp.
* Remove dependence on possibly undeclared alias.
* Implement FP8OCP test for stochastic rounding mode.
* Implement FP8OCP tests for half_t type conversions.
* enable bf16 atomic add on gfx950
* Implement ConvertFP32Nearest test.
* Implement ConvertFP32Stochastic test.
* Implement ConvertFP16Nearest and ConvertFP16Stochastic tests.
* Refactoring. Move FP8 definitions into a separate header file.
* Enable easy switching between architectures.
* Fix compilation error for gfx942 architecture.
* Add fp4 type with constants
* only builf gfx950 branch for gfx950 target by default
* Enable OCP build of example_gemm_xdl_fp8.
* Fix formatting.
* fix the build logic for gfx950
* Improve GEMM example verbosity.
* Add constexpr where applicable.
* fix the logic of enabling XDL and WMMA instances
* Improve GEMM example verbosity.
* Enable build of example_gemm_xdl_fp8_bf8 test.
* Fix tests for gfx1101 architecture.
* Build DPP examples only on gfx103 and gfx11 architectures.
* Optionaly run either CPU or GPU verifications with GEMM examples.
* Extend GeneratorTensor_Sequential to produce values of prescribed data types.
* Add missing constructor.
* Add scale type and mxfp conversions
* Update conversions
* Add conversion tests
* Fix typo
* Improve infrastructure for OFP8 data type support.
* BUGFIX. Should not use FP8 as Compute/Accum data type.
* Add custom target for grouped_convnd_bwd_weight tests.
* Can build `tests` target on gfx950.
* Bugfixes on gfx1101 architecture.
* Fix dependencies.
* Add stochastic rounding tests
* Provide single point of truth for FP8 INF and NAN checks
* Prevent instantiation of operators that are not supported by FP8 data types
* Add FP8 type selection into client_axample CMakeLists.txt
* Prevent sccache server from shutting down during build
* Fix test success reporting logic
* Change default verification method to CPU.
GPU verification takes too much time to complete on the emulator.
* Add scale <-> float conversions
* Add scaled conversions with tests
* Add device conversions
* Make sure all tests and examples are built for gfx950
* Facilitate testing of FP8 data types on the emulator
* Introduce two new tensor generators
* Enable instances built for gfx94 to be built on gfx950
* Verify 35_splitk_gemm on floating point numbers.
splitk gemm appears to be losing precision VS reference implementation when FP numbers are involved.
* Format
* Verify 04_gemm_add_add_fastgelu on floating point numbers
* Verify 20_grouped_conv_bwd_weight on floating point numbers
* Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers
* Verify more tests on floating point data
* Fix data types and improve testing verbocity.
* Add fp4 vectors
* Add debug tests
* Upgrade to NPI 573 build docker.
* Skip on gemm_universal tests.
The tests take too long to complete on the emulator.
Need to see if it is possible to reduce the scope of the testing to just FP8 data types.
* Add new mfma instructions and examples
* Add preprocessor directives for gfx950 specific code
* Fix gfx1101 build
* Document test availability
* Re-enable fp8 gemms for gfx94/95
* Cherry-pick GEMM Universal tests for FP8 data types
* Cleanup
* Add vector types and tests
* Add check_err function
* Add tensor generators
* CK_USE_GFX94 has already been set on this branch
* Fix
* Address formatting issues and leftovers
* Make fail/pass logic consistent within 01_gemm folder
Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.
* Fix GPU verification reporting logic.
* Update year in copyright notice.
* Cleanup
* Use `enum class` instead of `enum`
* Remove set_property for FP8 tests
* Add vector conversions
* Fix
* Fix linker errror
* Clean up
* Fix gfx950 conversions
* Clean up
* Fix more gfx950 conversions
* Fix even more gfx950 conversions
* Narrowing the scope of PR to OCP FP8 enablement only
* Add tests for OCP FP8 vector_type storage
* Fix client examples build
* Fix typo
* Update e8m0 casting
* Rename E8M0 type
* Update unpack method
* Cleanup merge artifacts
* Enable gemm kernel on all gfx9 architectures (#227)
* clean-up
* Implement `non_native_vector_base` with `ext_vector_type` array. (#232)
* Enable support of 1, 2, 4, and 8-byte custom types in CK.
* Fix pool tests for OCP FP8 data type
* Fix build
* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on gfx950
* fix clang format
* Add new mfma instructions and examples
* Add preprocessor directives for gfx950 specific code
* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on gfx950
* fix clang format
* Fix clang format for the newly merged files
* Use the existing example instances for fp16 bf16 and int8
* Remove comment on new mfma instructions in MfmaInstr
* Update include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_gemm_xdl_cshuffle_v1.hpp
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* merge from public repo
* Fix ck build
* Fix ck build
* Use double for max_abs_in_val
* Move scaled_type_convert functions to a separate header (#251)
* re-enable building mha lib and gemm_universal_f8 instances for gfx950
* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* fix typo for CK_USE_OCP_FP8
* fix typo for CK_USE_OCP_FP8
* Add FP6 and BF6 types (#261)
* Add a rounding flag
* Add FP6 and BF6
* Add tests
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* Clean up
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* fix one more typo
* Refactor E8M0 scale implementation (#262)
* Refactor E8M0 scale implementation
* Add MXFP6 and MXBF6 conversion methods (#270)
* Add conversions
* Add tests
* Add docstrings
* Add scaled conversions
* Add fp6/bf6 tests
* Remove misleading fp4 test case
* Add docstrings
* Clean up
* Address comments
* Set stricter tolerances for RNE tests
* Add missing tests
* Add native conversions to float
* Revert "Add native conversions to float"
This reverts commit 09467111f73b753c8cc3d597533b187940353dab.
* Update copyright years
* replace the fp6 with bf6 convert calls in test_bf6
* fix test_bf6
* enable smfmac test
* [MX FP8] Add Scaled Type Convert Functions for OCP FP8/BF8 data types (#271)
* Move scaled_type_convert functions to a separate header
* Introduce MX data tests
* Build MX tests only on relevant architectures
* Refactor E8M0 scale implementation
* Fix `config.h` typo
* Cleanup deprecated symbols
* Refactor `amd_ck_fp8.hpp`
* `scaled_type_convert` for `f8_ocp_t`
* Implement test for MX FP8 scaled type convert
* Implement test for MX BF8 scaled type convert
* Scaled type convert for vectors of 2 FP8 elements
* Scaled type convert for vectors of 16 FP8 elements
* Implementation of scaled conversion from F32 to F8
* Add tests for scaled conversions from FP32 to FP8
* Add documentation to the test functions
* Implementation of scaled conversion from F32x2 to F8x2
* Implementation of scaled conversion from F32x16 to F8x16
* Implementation of scaled conversion from F32x32 to F8x32
* Implementation of scaled conversion from F8x32 to F32x32
* Verified on the emulator
* MX FP GEMM - Example Template (#277)
Temporarily uses `DeviceGemmMultiD_ABScale_Xdl_CShuffle_V3` kernel and 128x128 scaling matrices.
Must be modified to use MX-native GEMM kernell with 16 or 32 component vectors per scale.
Verified on the emulator.
* Add vector support
* Add tests
* Add missing type aliases
* Fix test naming
* only build mx example for gfx950
* disable CK_USE_AMD_MFMA_GFX950 by default
* fic build for multiple archs
* fix typo
* fix typo
* Update unpack signature
* Fix merge
* Add size checks in pack function
* Add a flag
* Add conversions
* Fix build logic
* Update pack/unpack methods
* Remove unneeded AsType accessors
* Add docstrings
* Add a flag to config file
* Test the functionality of V_MFMA_F32_16X16X128_F8F6F4 and V_MFMA_F32_32X32X64_F8F6F4 instructions. (#293)
* Introduced MFMA tests
* Verified f8f6f4 MFMA Instructions
* Move flag logic to scaled_type_convert header
* Use pointers instead of array indices
* Fix a typo
* Update tests and pack functions
* Fix gemm gemm on gfx950
* Fix clang format
* restore the default gput target lists
* fix the jenkinsfile
* add missing ifdef
---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>
* smoke and regression targets working with tests
* test filters work for both examples and test
* removed uneccesary comments
* added a missing comment
* added a missing comment
* fixed typo in the comments
* updated README
* Update PULL_REQUEST_TEMPLATE.md
updating the template for future addition of test cases
* Update PULL_REQUEST_TEMPLATE.md
* Disable building DPP kernels by default
* Disable building dpp instances, examples, or tests if DPP_KERNELS is not set
* Add new DPP_KERNELS flag to readme