* Add logic to use new mfma instructions for fp8 and bf8
* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format
* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* Fix intrin_mfma f8 calls due to merge mistake
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* Unify test interface for different layouts.
* WIP: Introducing FP4/FP6/FP8 abstractions
* WIP: Introducing packed storage abstraction
* WIP: Introducing packed storage abstraction
* WIP: Improved support for FP6 data type
* Refactor packed storage for f6_t
* WIP: FP6 MFMA test
* Test if we correctly represent all FP6/FP4 numbers
* Add extra output for failed FP4 test
* More failing conversion tests
* Even more failing conversion tests
* Working FP6 MFMA tests
* Expand MX MFMA testing to BF8/6
* Update and verify MX MFMA test for packed types
* Fix fp4 and fp6 conversions on host
* Working MX MFMA tests for FP8/6/4
* Cleanup
* Add missing type
* Cleanup
* Final cleanup
* Restrict FP6/4 values output to CK_LOGGING=1
* Use CHAR_BIT instead of number 8
* Fix typo
* Remove FP6 and FP4 from the list of native types
---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
* Remove comment with special characters
* Fix arg/template change after merge from develop
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Make the work compile
* Fix the example code; the profiler error remains
* Finish the feature
* Clang format and update the CHANGELOG
* Solve the preshuffle v1 & v2 problem
* Address review comments
* Address review comments
* Add conversion tests
* Fix ctor
* Fix nan logic
* Fix conversion logic
* Permute packed f4_t values
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector sr conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add ops
* Add tests
* Add missing utils
* Update reference mx gemm
* Add f4x2 init mode
* Update host tensor utils
* Update chunk size for f4x2
* Add non scaled ops
* Add a type utility
* Update non scaled reference kernel
* Add non scaled tests
* Debug mfma arguments
* Add more debug info
* Update chunk size
* Update data layout
* Add more debugging
* Fix B stride
* Fix reference gemm
* Fix build
* One more reference fix
* Add more debug info
* Disable some tests
* Enable tests
* Add fp4 dimensions
* Update reference kernels
* Temp edits
* Remove leftovers
* Fix conflicts
* Clean up
* More clean up
* Revert "More clean up"
This reverts commit d8d35a0846.
* Add layouts to tests
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>