* upgrade default docker to rocm7.0.1
* turn on build and test on gfx950 by default
* use rocm-dev instead of rocm
* link libhiprtc for codegen targets
* resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards
---------
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
* updating codegen build for MIOpen access: adding .cmake for codegen component
* updating CMake
* adding in header guards for some headers due to issues with hiprtc compilation in MIOpen
* some more header guards
* putting env file in header guard
* cleaning up some includes
* updated types file for hiprtc purposes
* fixed types file: bit-wise/memcpy issue
* updating multiple utility files to deal with standard header inclusion for hiprtc
* added some more header guards in the utility files, replacing some standard header functionality
* added some more header guards
* fixing some conflicts in utility files, another round of header guards
* fixing errors in data type file
* resolved conflict errors in a few utility files
* added header guards/replicated functionality in device files
* resolved issues with standard headers in device files: device_base and device_grouped_conv_fwd_multiple_abd
* resolved issues with standard headers in device files: device_base.hpp, device_grouped_conv_fwd_multiple_abd.hpp, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp
* added header guards for gridwise gemm files: gridwise_gemm_multiple_abd_xdl_cshuffle.hpp and gridwise_gemm_multiple_d_xdl_cshuffle.hpp
* fixed issue with numerics header, removed from transform_conv_fwd_to_gemm and added to device_column_to_image_impl, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3, device_image_to_column_impl
* replaced standard header usage and added header guards in block to ctile map and gridwise_gemm_pipeline_selector
* resolved errors in device_gemm_xdl_splitk_c_shuffle files in regards to replacement of standard headers in previous commit
* added replicated functionality for standard header methods in utility files
* replaced standard header functionality in threadwise tensor slice transfer files and added header guards in element_wise_operation.hpp
* temp fix for namespace error in MIOpen
* remove standard header usage in codegen device op
* removed standard header usage in elementwise files, resolved namespace errors
* formatting fix
* changed codegen argument to ON for testing
* temporarily removing codegen compiler flag for testing purposes
* added codegen flag again, set default to ON
* set codegen flag default back to OFF
* replaced enable_if_t standard header usage in data_type.hpp
* added some debug prints to pinpoint issues in MIOpen
* added print outs to debug in MIOpen
* removed debug print outs from device op
* resolved stdexcept include error
* formatting fix
* adding includes to new fp8 file to resolve ck::enable_if_t errors
* made changes to amd_wave_read_first_lane
* updated functionality in type utility file
* fixed end of file issue
* resovled errors in type utility file, added functionality to array utility file
* fixed standard header usage replication in data_type file, resolves error with failing examples on navi3x
* formatting fix
* replaced standard header usage in amd_ck_fp8 file
* added include to random_gen file
* removed and replicated standard header usage from data_type and type_convert files for fp8 changes
* replicated standard unsigned integer types in random_gen
* resolved comments from review: put calls to reinterpret_cast for size_t in header guards
* updated/added copyright headers
* removed duplicate header
* fixed typo in header guard
* updated copyright headers
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Overload output stream operator for LoopScheduler and PiplineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in exmaple.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity untill we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfers paramters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
* dump lds content in appropriate precision type
* add squared add reduction op; allows sq sum
* initial stub from regular gemm impl
* layernorm example code & host verification
* initial layernorm implementation
* tidy up
* make C0 precision type consistent with C
* clang-tidy and additional comments
* tighten up example code
* account for extra flops/bytes from normalization
* clang-format
* c0 bias/beta/gamma now have its own precision type
* AccElemOp for gemm outputs prior to feeding to layernorm
* update workgroup mapping
* rename kernel template param to reflect its dual use
* use LDS mem pool for reduction workspace
* change cshuffle precision type to f16; clean up
* clang-format
* correct naming
* explicit cast
* fully implemented gemm + bias + activation + add + norm
* activation in correct order
* reflect reduction API's recent change
* amend
* clean up; add comment
* keep up with recent changes in reduction API
* format
* resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>