* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Use simple best occupancy model to calculate the split-K.
* Handle split-K autodeduction in explicit gemm conv.
* Add unit tests for split-K autodeduction.
* Remove oversubscription.
* Small fixes.
* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.
* Run clang formatting.
* Fix error handling in the conv profiler.
* Add missing documentation for the autodeducted split-K values.
* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.
* Fix clang formatting and split-K profiler documentation.
* Rename max_occupancy value variable.
* Calculate grid size for split-K autodeduction directly from input array shapes and template params.
---------
Co-authored-by: Ville Pietilä <>
* Add tests for host convesion f32/f16 to f8
* Add tests for host convesion from f8 to f32/f16
* Fix UB and corner cases in f32/f16 to/from f8 conversion
* There are UBs when very small values are converted to f8: bitshifts
can be larger that type width. Using unsigned long long does not help
because exponent_diff >= 64 in such cases. This causes that values
like 2.117582368e-22 are converted to non-zero f8 in host validation
of FMHA tests, test_f8 crashes with segfault in completely irrelevant
code like GTest internals or produces non-deterministic results etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
conversion.
* Replace some magic numbers with values from numeric_traits
* Build tests only on devices supporting the type
* update gpu_timer for rotating buffer as hipblasLt's implementation
* timing fix
* Updating gpu timer for old ck as well
* Revert "Updating gpu timer for old ck as well"
This reverts commit 958cd1bc99.
* code clean up with runtime argument; function rename
* code cleanup
* general timer fixes
* bug fix
* clang formatted
* addressing reveiew comments
* clang formatted
* Addressing review comments
* CI fix
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Expand the bandwidth of direct_global_to_lds for gfx950
* clang-format
* fix the remod.py and script for clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>