* [What] Rename the example
[Why] Prepare to add unary reduction
* Add global oparation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template the specialize the float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template augument
* Fix clang format
* Keep flexible of different stride for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming
[Why] Prepare to add out elementop
* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
* Rename the folder name for the pool2d and reduce examples
* Update to reduction test scripts
* Add Readme for pool2d_fwd and reduce_blockwise examples
* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
* Tiny fix in reduce profiler and tiny update in reduce testing scripts
* Tiny fix in testing script profile_reduce_no_index.sh
* Tiny fix in testing script profile_reduce_no_index.sh
* Add support for bfp16 reduction (using bhalf_t = ushort)
* Tiny fix in amd_buffer_addressing.hpp
* Tiny change in script/profile_reduce_with_index.sh
* Use AccDataType for Beta value and use element_wise::PassThrough
* Use type_convert for type converting in host layer reduction
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming all NumReduceDims to NumReduceDim
* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
* Update to testing scripts to add bf16 support
* added more static_assert
* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
* minor change
* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
* Tiny fix in script/profile_reduce_no_index.sh
* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
* Generic renaming in host reduction and DeviceReduce layer
* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
* Use multi-thread and simplification for host Reduction implementation
* Add ctest for reduction
* Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/
* Update to the reduce CTest executables to enable default testing behavior when no command argument
* Renaming
Co-authored-by: Jianfeng yan <jfyan008@gmail.com>