1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
* replace buffer load/store intrinsics with builtins
* fix clang format
* replace buffer load/store intrinsics with built-ins in ck_tile
* fix clang format
* add switch between buffer intrinsics and built-ins
* change the builtins threshold to clang20
* fix clang format
* fix some compilation errors
* revert changes in ck_tile
* revert changes in ck_tile
* delete all root files and folders when CI completes
* try changing the username in CI
* fix groovy syntax
* add user and group id info to ci dockers
* change ownership of all files in CI to jenkins at the end
* update changelog
* Add basic fp8 definitions and prn-generator
* Format
* Add fp8<->fp32 type_convert
* Format
* Split type_convert and cast_to/from_f8
* Format
* Minor fix
* Minor fix
* Move fp8 utils to a separate header
* Add elementwise ops
* Add fp8_convert_sr
* Format
* Add element op
* Eliminate magic numbers
* Split f8_convert_sr in host and device
* Format
* Add some constexpr
* Add a datatype test
* Format
* Another format
* Add fp8<->fp16 tests
* Update type_converts
* Format
* Add fp16 casting functions
* Format
* Use seed as a runtime arg
* Use element location for PRNG
* Format
* Add fp8<->fp16 to PassThrough element op
* Clean up
* Merge host and device implementations
* Add comments on rounding modes
* Remove leftover code
* Put type_converts into a separate header
* Put random number gen to a separate header
* Rearrange f8_utils' namespaces
* Refactor type_convert.hpp
* Move f8_t definition
* Add overloaded version of __builtin_amdgcn_readfirstlane()
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigne char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to funcion definition
* Fix wrong boundar calculation logic
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
* Reorder statements
* [What] Rename the example
[Why] Prepare to add unary reduction
* Add global oparation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template the specialize the float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template augument
* Fix clang format
* Keep flexible of different stride for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming
[Why] Prepare to add out elementop
* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
* compile ck for all targets
* update the target criteria
* change the target condition
* fixed some typos
* fixed missed file
* revert changes in README
* revert device_conv3d_fwd_xdl_...
* update device_conv3d_fwd_xdl_...
* update device_batched_gemm_reduce...
* test the unused arguments fix
* test the warning suppression
* try suppress warnings in device_batched_gemm_reduce_xdl...
* fix the last warnings
* replace UNUSED with std::ignore
* fix a typo
* replaced std::ignore with ignore
* add igonre header to common_header
* refactor atomicAdd
Co-authored-by: Chao Liu <chao.liu2@amd.com>