* Add maxpool f32 kernel and example
* Revise copyright
* Add device pool bwd device op
* Support f16 and bf16
* Add compute datatype for reference code.
Prevent error in bf16
* Fix type error
* Remove layout
* Fix bf16 error
* Add f16 and bf16 example
* Add more operations
* Implement IsSupportedArgument
* Add changelog
* Add comment
* Add comment
* Remove useless header
* Move initialize of workspace to the run
* Move set din zero to the device operator
* Save din_length_raw
* Remove useless header
* Calculate gridsize according to the number of CU
* Calculate gridSize according to the number of CU.
Remove useless header
* Add put example
* Remove useless header
* Fix CI fail
[ROCm/composable_kernel commit: 341ad95665]
* Add DeviceOp and examples
* Format DeviceOp template arguments
* Remove bf16 example
* Format
* Format
* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N
* Refactor argument preparation
* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl
* Rename device op file
* Update include directive in the example file
* Update descriptor preparation for grouped op
* Update the argument
* Update batch handling
* Add gridwise gemm supporting batched input
* Update blockwise indexing, working version
* Update copyright year
* Update check if argument is supported
* Refactor and make consistent with xdl examples
* Update check if argument is supported
* Add changelog entry
* Added comments on Dl op split_k>1 support
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
[ROCm/composable_kernel commit: 246ceee49e]
* Sync the order of type string with template parameter
* Add more instances
* Check the vector size and remove redundant var
* Extract var to static, prepare to separate sweep once kernel
* Separate sweeponce flow and optimize the flow
* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
* Remove useless code
* Update naive variance kernel
* Refine string
* Fix typo
* Support naive variance for device_normalization
* Check the blocksize
* Share the VGPR of x and y
* Share the VGPR of gamma and beta
* Add more instances
* Support fp16 sqrt for experiment
* Add CHANGELOG
* Fix typo
* clang-format
[ROCm/composable_kernel commit: 6a6163a3d1]