* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
* Rename the folder name for the pool2d and reduce examples
* Update to reduction test scripts
* Add Readme for pool2d_fwd and reduce_blockwise examples
* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
* Tiny fix in reduce profiler and tiny update in reduce testing scripts
* Tiny fix in testing script profile_reduce_no_index.sh
* Tiny fix in testing script profile_reduce_no_index.sh
* Add support for bfp16 reduction (using bhalf_t = ushort)
* Tiny fix in amd_buffer_addressing.hpp
* Tiny change in script/profile_reduce_with_index.sh
* Use AccDataType for Beta value and use element_wise::PassThrough
* Use type_convert for type converting in host layer reduction
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming all NumReduceDims to NumReduceDim
* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
* Update to testing scripts to add bf16 support
* added more static_assert
* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
* minor change
* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
* Tiny fix in script/profile_reduce_no_index.sh
* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
* Generic renaming in host reduction and DeviceReduce layer
* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
* Use multi-thread and simplification for host Reduction implementation
* Add ctest for reduction
* Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/
* Update to the reduce CTest executables to enable default testing behavior when no command argument
* Renaming
Co-authored-by: Jianfeng yan <jfyan008@gmail.com>