* Refactor elementwise kernels
* Instances fixes
* Fix cmake
* Fix max pool bwd test
* Update two stage gemm split k
* Restore elementwise scale for hiptensor backward compatiblity
* Fix Acc data type check in conv fwd multiple abd
* Disable conv fp64 fwd example
* Update grouped conv weight multi d
* File renaming and class renaming for device element-wise operation
* Add batchnorm-infer instances, external API and client example
* Add batchnorm-infer profiler module and gtests
* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
* Rename class and file due to conflict from device_elementwise_2d.hpp
* Fix namespace in batcnnorm_infer_nhwc client example
* Update to the batchnorm-forward API and base class
* Fix leeked header including in gridwise_set_buffer_value.hpp
* Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction
* Update to the batchnorm-forward example to use the new batchnorm-forward device interface
* Change the batchnorm-forward reference to use sequential welford method
* Change to assign the workspace into four buffers in the host layer
* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford
* Tiny correction and remove un-used file under example/34_batchnorm
* Renaming in the kernel arguments
* Explicitly use ck::math::sqrt in batchnorm-forward kernels
* Add some comments to some kernels
* Tiny fix
* Generalize the data types in reference_batchnorm_forward_nhwc_c
* Use ck::ignore to mark un-used parameters
* Move GetReduceCountPerThread functor codes from kernel to device
* Remove some un-used codes in device_batchnorm_forward_impl.hpp
* Tiny fix in batchnorm_forward example
* Move GetReduceCountPerThread() to welford_helper.hpp
* Use seperate data type for Scale and Bias
* Renaming in device Op
* Tiny fix in forward example
* Updata to batchnorm-infer (type spliting, renaming)
* Add time and bandwidth measurement to the batchnorm-forward example
* Add support of elementwise operation for batchnorm forward output
* Reduce object copying by passing object as reference type
* Tiny change for performance
* Updates for performance again
* Some Renamings
* Add GetActualVariance template parameter for ThreadwiseWelfordMerge
* Tiny update in reference batchnorm forward nhwc/c
* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory
* Fuse mean and bias in the normalization calculation
Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
* Move kernel implementation files under impl directory.
* Update examples paths.
* Update device kernel impl include paths.
* Update tensor operation instances include paths.
* Update profiler and tests include paths.
* Clang-format
* Update include paths for batched gemm reduce
* Refactor UnitTest ConvNDBwdWeight.
* Refactor fwd and bwd data convND UT.
* Fix used test macro.
* Fix include path.
* Fix include paths.
* Fix include paths in profiler and tests.
* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* Implement multiple-reduction in one kernel (kernels, device ops, examples)
* Add generic elementwise kernel and device interface
* Add generator for normal-distributed data initialization
* Add host refer implementation of batchnorm-forward and batchnorm-infer
* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
* Remove un-needed including in batchnorm example
* Renaming generic_elementwise to elementiwise in kernel and device classes/functions
* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
* Removing and renaming files
* Update to synchronize gemm_layernorm client example to the generic element-wise device op API
* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
* Merge two static member functions in device_elementwise.hpp
* Remove unary_elementwise_1d kernel and device