* Initial implementation:
- add new thread group transfer supporting transpose instruction
- refactor AB transfer to switch between thread and wave tiles methods
* Add some comments and remove explicit wave and lane calculations
* Remove compiler option for performance
* fp16 example: use tuned instance
* Missing cleanup
* Integrate wave transfer in existing gemm and batched gemm instances
* Add fast instances
* extend implementation for 8 bit datatypes
packed types not supported
* Address review comments
* Optimize pipeline v1 and re-introduce compiler option
* Disable wave tile approach for b scale gemm
* Fix for clang20
* Avoid code duplication of amd_global_load_transpose_to_vgpr function
* replace buffer load/store intrinsics with builtins
* fix clang format
* replace buffer load/store intrinsics with built-ins in ck_tile
* fix clang format
* add switch between buffer intrinsics and built-ins
* change the builtins threshold to clang20
* fix clang format
* fix some compilation errors
* revert changes in ck_tile
* revert changes in ck_tile
* delete all root files and folders when CI completes
* try changing the username in CI
* fix groovy syntax
* add user and group id info to ci dockers
* change ownership of all files in CI to jenkins at the end
* update changelog
* Add basic support for direct loads from global to LDS
* Clean the code and comments
* Add support for fp16
* Add comments
* Add check for thread cluster lengths
* Align non-direct-load fp16 example
* Small fixes
* Extend IsSupported to check for supported GPU gens
* Build examples only on the supported HW
* Do not throw when instance not supported in 04 example
* Review: Apply review suggestions
* Review: small fix
* Review: small fix
* Add column to image kernel
* Minor fixes for dtypes and client examples
* Disable tests for disabled dtypes
* Disable add instances functions for disabled data types
* Minor stylistic fixes
* Revert "Disable add instances functions for disabled data types"
This reverts commit 728b869563.
* Instances reduction
* Add comments in device_column_to_image_impl
* Update changelog and Copyrights
* Improve changelog
* Add int8 specialization for elementwise Add and Subtract.
* CGEMM examples bf16, fp32, int8
* Add convert reference output to CDataType.
* Skip BF16 data type during testing.
* Lower K value to get rid of accumulation error.
* Fix merge artifact.
* Fix changed function name: GetElementSpaceSize()
* Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
* Update to host layer and host reduction
* Merge and remove reduction kernels
* Merge and remove reduction device interfaces and update pooling device interface
* Merge and remove useless reduction device instances
* Update to reduction profiler and reduction ctests
* Update to reduction and pooling examples and add one reduction example
* Change to reduction examples to let them testable by ctest
* Add explicit pass checking for reduction and pooling examples
* Explicit assignment of tensor shapes in example reduce_blockwise_two_call
* Use atomic_add to repace atomicAdd and add atomic_add for double type
* Add reduce ctest support for double data type
* Replace to_int_vector() by using c++ std::vector::assign()
* Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
* Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
* Add GetAtomicOperationZeroValue() support for AtomicMax
* Tiny change to reduce example README.md
* Fix some tiny issues due to branch merging
* Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
* Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
* Renaming
* Clean the header includings in device_reduce instances header files
* [What] Rename the example
[Why] Prepare to add unary reduction
* Add global oparation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template the specialize the float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template augument
* Fix clang format
* Keep flexible of different stride for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming
[Why] Prepare to add out elementop
* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
* compile ck for all targets
* update the target criteria
* change the target condition
* fixed some typos
* fixed missed file
* revert changes in README
* revert device_conv3d_fwd_xdl_...
* update device_conv3d_fwd_xdl_...
* update device_batched_gemm_reduce...
* test the unused arguments fix
* test the warning suppression
* try suppress warnings in device_batched_gemm_reduce_xdl...
* fix the last warnings
* replace UNUSED with std::ignore
* fix a typo
* replaced std::ignore with ignore
* add igonre header to common_header
* refactor atomicAdd
Co-authored-by: Chao Liu <chao.liu2@amd.com>