* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]
* Add CountDataType as template parameter in blockwise_welford
* Add utility/get_shift.hpp
* Add BatchNorm multiblock single-kernel implementation
* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a
* Renaming in device_batchnorm_forward_impl.hpp
* Tiny fix in the batchnorm_fwd profiler
* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"
This reverts commit d16d00919c.
* Use the old two-kernel batchnorm multiblock method for gfx1030
* Use the old two-kernel batchnorm multiblock method for gfx908
* use the single-kernel batchnorm multiblock method only for gfx90a
* Remove get_wave_id() from utility/get_id.hpp since it is not used
* Set true for testing running mean/variance and saving mean/invvariance in the examples
* Fix to copy-right words
* Remove un-needed including in utility/get_id.hpp
* Add comments to workgroup_synchronization.hpp
* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp
* Renaming in the kernels
* Remove un-used kernel file
* Re-structure ckProfiler source files
* Rename profiler.cpp to main.cpp
* Modularize ckProfiler operations
* Add description for profiler operations
* Use longer name to avoid name collision
* Use macro to delay expansion
* Use std::move() to avoid object copying
* Prohibit users from calling dtor
* Use macro to eliminate redundant code
* Make friend function hidden
* Add missing include directive <iostream>
* Fix wrong include directives
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* Update to device_batchnorm_forward base class to include all template parameters for problem description
* Add batchnorm forward instances and external api
* Add batchnorm forward profiler module which uses the external api
* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
* Improvement to the batchnorm infer base API
* Add batchnorm forward client example which shows using the batchnorm forward external API
* Add test for batchnorm forward
* Tuning the batchnorm profiler initialized values and error threshold
* Add support for bhalf_t in instances/external api/tests
* Add support for int8_t in instances/external api/tests
* Add support for double in instances/external api/tests
* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
* Checking before running best instance in batchnorm_fwd_nhwc client example
* Add checking for YElementwiseOp in batchnorm_forward external API
* Add more types in batchnorm forward profiler
* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>