* Add always_false<> util to delay symbol resolution
* Use always_false<> to prevent trying instantiate unwanted method
* Add new specializations of AddAddFastGelu::operator() method
* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
* Use floating point literal to simplify code
* Remove unnecessary capture in lambda expressions
* Extract fast GeLU calculation as standalone method
* Mark methods as 'constexpr'
* Add constraint for HostTensorDescriptor templated ctors
* Simplify HostTensorDescriptor ctor calls
* Add C++23 std::size_t literal suffix
* Use _uz suffix to shorten example code
* Remove unnecessary conversion to std::array<>
* Re-order include directives
* Remove C-style casting by literal suffix
* Remove unnecessary statements in main()
* Remove unused type parameter of always_false<>
* Remove unused include directive
* Exit main() by returning meaningful value
* Use 'if constexpr' to switch example flow
* Use std::is_same_v<> to shorten example code
* Add 'inline' specifier to literal functions
* Unify output methods in example
* Move common codes into .inc file
* Add type check in type_convert<>()
* Add type_convert<float>() before computation
* Merge AddAddFastGelu method specializations
* Remove always_false<>
* Add constraint to AddAddFastGelu::operator() parameter types
* Add int8 specialization for elementwise Add and Subtract.
* CGEMM examples bf16, fp32, int8
* Add convert reference output to CDataType.
* Skip BF16 data type during testing.
* Lower K value to get rid of accumulation error.
* Fix merge artifact.
* Fix changed function name: GetElementSpaceSize()
* Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* add verify flag and update scripts
* replace old check_error function with the new check_err
* fix syntax
* remove blank spaces
* remove empty line
* add check_err for tensors
* fix syntax
* replace tensors with vectors in check_err calls
* fix syntax
* remove blank spaces
* fix syntax
* add new line at end of file
* disable conv2d_bwd_weight test, add gpu check
* set check_gpu using export
* check GPU using runShell
* add definition of runShell
* fix script syntax
* reduce the number of threads, add full qa option
* run processing scripts in bash
* fix the branch and host names in performance scripts, add chronos
* replace parameterizedCron with cron
* archive the perf log files
* try to fix git call
* pass branch and host names as arguments into scripts
* fix script arguments
* fix script arguments
* process results on master
* fix pipeline
* add definition of gpu_arch
* run processing scripts in docker
* fix the brackets
* add agent master for the processing stage
* get rid of show_node_info call on master
* try using mici label instead of master, disable MI100 tests for now
* fix syntax
* simplify container for results processing
* remove node(master) from the process_results stage
* put all stages in original order
* change the agent label from master to mici for gfx908
* Implement layernorm kernel and deviceOp
* verify gpu kernel with host code
* 1. Separate gamma aand beta from affine
2. Check if argument is valid
* clean
* Sync the naming
* Support sweep once mode if we can put k dimension data inside one block
* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.
* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.
* Use 1d descriptor for gamma and beta
* Add accElementwiseOp
* Extract layernorm host code
* Support different YVectorDim in GridwiseLayernorm
* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp
* Gamma and beta can share the VGPR.
* Add test for fp32 and fp16
* Fix bug of concurrency and add test case which may fail orignally
* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* dump lds content in appropriate precision type
* add squared add reduction op; allows sq sum
* initial stub from regular gemm impl
* layernorm example code & host verification
* initial layernorm implementation
* tidy up
* make C0 precision type consistent with C
* clang-tidy and additional comments
* tighten up example code
* account for extra flops/bytes from normalization
* clang-format
* c0 bias/beta/gamma now have its own precision type
* AccElemOp for gemm outputs prior to feeding to layernorm
* update workgroup mapping
* rename kernel template param to reflect its dual use
* use LDS mem pool for reduction workspace
* change cshuffle precision type to f16; clean up
* clang-format
* correct naming
* explicit cast
* fully implemented gemm + bias + activation + add + norm
* activation in correct order
* reflect reduction API's recent change
* amend
* clean up; add comment
* keep up with recent changes in reduction API
* format
* resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* use 'sweep once' softmax kernel where applicable
* threadwise copy's dst buffer can specify invalid element value
* add int8 in/out float compute softmax support
give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
* format
* softmax inherits DeviceNormalization
* softmax profiler stub
* tighten up reference softmax interface
* example prints tensor dimension
* add fp32 to softmax profiler
* rename header
* hook with ckProfiler
* format
* resolve merge conflict
* resolve merge conflicts
* update normalization profiler help string
* resolve conflict
* typo
* remove residual
* softmax profiler: address feedback
* test for mixed precision input/output
* fully qualify ck::math::isnan
* add comment for device normalization interface
* revise wording
* constness for alpha/beta scaler pointer
* Extract base class for elementwise
* Refactor interface of DeviceGemmReduce. Do not use tuple in interface
* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add
* Unify base class of gemm + reduce and gemm + bias + add + reduce
* 1. Rename gemm_bias_add_reduce for external api
2. Refine cmake
* Add normalize device operation
* [What] Reorder the argument
[Why] Because d0 is also the input of c.
* Add type string
* Add example of gemm_bias_add_layernorm via external api
* Refactor example code
* clang-format
* Fix compile error
* clang-format
* Add external api for gemm_add_add_layernorm and normalize
* Add client example
* clang-format
* Switch to standard ROCm packaging
* Revert .gitignore changes
* install new rocm-cmake version
* update readme
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* UniforFill with integer values.
* Log tested instance type string.
* Add UT for all convolution specializations.
* debugging conv
* Fix dangling reference bug.
* Small refinements.
* Fix call to error checking function.
* Small refinements to tests.
* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.
* Add helper traits for AccumulatorDataType.
* Print first 5 errs in check_err for integral types.
* Rename FillUniform to FillUniformDistribution
* Refactor
* Do not use typed tests.
* Instead use plain fixture class with templatized member functions.
* Initialize tensors with integer values.
* Refine test instances.
* Properly set accumulator data type.
* Add another "big" instance.
* Refactor convolution tests.
* Revert "debugging conv"
This reverts commit b109516455.
* Add pragma once + format + small refinement.
* Fix some unwanted changes.
* Clang-format
* Fix profile_convnd to use renamed tensor initializer.
* Add instances for ConvFWDND kernel case 2D
* Helpers to get ConvNDFwd 2D instances.
* Refactoring.
* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.
* Refine and fix test.
* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.
* Remove AccumulatorDataType type trait.
* Update unit-tests.
* Remove operator<< overload.
* Unlock conv1d/3d nd fwd instances.
* Enable skipping calculating reference using flag.
* Fix number of channels for first ResNet50 layer.
* Clang-format.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for standalone softmax
* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
* host softmax validates
* compiles; to implement beta scaling
* use NaN trick to efficiently ignore OOB values during sum of exponentials
* freeload device_reduce's utility functions
* clean up interface
* adding prior value (beta scaling)
* remove restriction related to perf considerations
* apply clang-format
* clean; disable diagnostics
* resolve conflicts
* add exp wrapper
* honor HostTensorDesc interface; allow implicit cast from different vector<T> type
* test softmax for fp16/fp32
* update readme
* amend commit NaN trick
* remove redundant param added during development
* format
* replace ScalarDataType with AccDataType
* separate out test programs by precision type
* move softmax sample code to its own folder
* format
* keep up with recent changes in reduction API
* remove extra header
* Remove template from Reducton operation classes and add template to their operator() and GetIdentityValue() interfaces
* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
* Use struct-scope operator template instantiation for binary and unary element-wise operations
* Change a few more elementwise operations to use template for operator()
* Tiny correction in Normalize operator
* Add static_assert to check the data type appliability for some reduction accumulator and element-wise operatons
* Correction in some examples with regard to using ReduceAccDataType
* Use static_assert for UnaryDivide
* Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
* Tiny fix with regard to SetWorkSpacePointer()
* Copy "gemm reduce" to "gemm bias add reduce"
* Implement gemm bias add reduction
* Fix compiler error due to merge from develop
* Add tensor operation for gemm + bias + add + reduce
* Add gemm_bais_add_reduce to ckProfiler
* Add c1 functor
* Refine type
* Use reduceAccDataType instead of explicitly float
* Change to use check_err()
* Do relu in float32 instead of bhalf_t. Because bhalf_t is unsigned
* Refactor relu. using type_trait instead of overloading
* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
* Fix denominator
* Refine nameing
* Fix denominator in host
* Remove useless include header
* Use AccDataType
* Fix static_cast order
* Refine type
* [What] Remove tuple type in the base class
[Why] External api depend on base class. if base class has relationship with type, we will need many class for different type
* Use the unified naming for math functions on host and HIP kernel
* Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming
* Renaming GetReductionZeroVal() to GetIdentityValue()
* Tiny renaming in profile_reduce_impl.hpp
* More renaming in profile_reduce_impl.hpp
* Replace zeroVal by identiyVal
* Remove ck_ prefix in the naming of ck::math provided functions
* Reference CGEMM + test stub
* Format.
* Incomplete simple implementation
* Library instances
* Sketch of tests
* Test fixes.
* Example added
* Cosmetics
* Add elementwise operation kernel and example
* Add comment
* Add template argument of dim . Prepare to support multiple dimension
* Rename example
* Support 1 dimension
* Add static assert
* Add comment
* Second auxiliary buffer added
* Extract pad
* Remove redundant argument
* Support any dimension for elementwise operation
* Remove line
* Let it be the multiple number of CU
* Move thread per block to the parameter of constructor
* Consuming binary ops to do A+B / A-B
* Fix + cosmetics + bf16 test commented out temporarily
* Format
* Enabling bf16 test
* Revert "Enabling bf16 test"
This reverts commit f497e2ba44.
* Fix + test reenabled
* fix build
* Revert "fix build"
This reverts commit d73102384b.
* post PR #235 merge fix
* amend
* Single workspace for cgemm + helper
* Perf calc fix
* Review remarks: static_cast
* Review remarks: binary ops templated
* Cleaning
* Removal of instances and their tests
* Review remarks from aosew addressed
* Review remark: unnecessary attribute
* Post-merge fixes
* Restrict 4gemm to PassThrough + bug fix
* Review remarks
* update licence
* change cgemm example to fp16
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
* Implement reduction meand and reduction square mean
* Refine file name
* Add reduce mean and square mean
* Fix parameter name
* Add normalize device op (not implement invoker::run())
* Remove epislon
* Refine deviceop
* Add 5ary elementwise for normalization
* Add layernorm example
* layerNorm verication
* Fix compiler error due to merge from develop
* Fix typo
* Fix compile error
* Refine naming
* [What] Suport non pointer for invoker and argument
[Why] Snyc coding style with gemm
* Refine folder name
* Refine class name
* Evaluate perf of the kernel
* Fix compile error
* [What] Refine perf evaluation in example of gemm + reduction
[Why] evaluation of gemm + reduction may cause verification fail. Because evaluation will not initial global memory
* clang-format
* add intrin_mfma_f64_16x16x4f64
* add example
* gemm reference add double data type
* chang init data
* fix M N PerXdlops
* fix ifdef
* add comparsion config
* add conv fwd example
* format log out
* change rc matrix egister layout
* reorganize example
* reorganize example 2
* format,because merge develop
* fix call impl adding acc data type
* lost ;
* add compiler warning
* change example tunning parameters
* add test for fp64
* add instance
* add test/gemm/gemm_fp64.cpp
* fix get name issue
* remove some tunning parameter
* fix conflict
* format
* use integer value for GEMM test
* add acc data type
* remove typeid because fp16
* fix streamconfig etc bug from merging develop
* format
* remove test_gemm_xdl_fp64
* add AccDataType
* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* start adding navi21 GEMM
* navi_gemm_km_kn_mn_fp32 compiles and passes one test.
* rename variables and functions in gridwise_gemm_dlops_v1r3
* add other 3 layouts; format instance
* adding more tuning parameters
add tuning parameters for other 3 layouts
* add gemm_dlops_f16
* tmp
* add dependence of DeviceGemm::IsSupportedArg() on arch
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* push gemm_dlops into profiler
* minor changes
* if using xdl or dlops is moved into profiler_gemm_impl
* minor changes
* minor changes
* remove is_xdl from profile_gemm_impl
* make IsSupportedArg dependent on arch for other device_gemm
* minor changes
* minor changes
* fix a bug in f_generate_tensor_value
* add 64x64x64 for gemm_dlops_int8
* add 64x64x64 for gemm_dlops_int8
* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
* fix
* start fixing tuning parameters
* monir
* minor changes
* minor changes
* minor changes
* fixing
* adding example
* adding example
* adding example
* add gemm fp32 example
* clean up
* use 128x128x16 as MNK tile in navi21 gemm example
* bug fix
* fix test
* use new block c tile
* clean
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>
* Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
* Update to host layer and host reduction
* Merge and remove reduction kernels
* Merge and remove reduction device interfaces and update pooling device interface
* Merge and remove useless reduction device instances
* Update to reduction profiler and reduction ctests
* Update to reduction and pooling examples and add one reduction example
* Change to reduction examples to let them testable by ctest
* Add explicit pass checking for reduction and pooling examples
* Explicit assignment of tensor shapes in example reduce_blockwise_two_call
* Use atomic_add to repace atomicAdd and add atomic_add for double type
* Add reduce ctest support for double data type
* Replace to_int_vector() by using c++ std::vector::assign()
* Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
* Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
* Add GetAtomicOperationZeroValue() support for AtomicMax
* Tiny change to reduce example README.md
* Fix some tiny issues due to branch merging
* Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
* Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
* Renaming
* Clean the header includings in device_reduce instances header files
* enable example of conv 1d/3d for bwd weight
* make bf16 kernel do not use atomic add
* using new gridwise gemm for bwd weight on convnd bwd weight
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* [What] Rename the example
[Why] Prepare to add unary reduction
* Add global oparation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template the specialize the float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template augument
* Fix clang format
* Keep flexible of different stride for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming
[Why] Prepare to add out elementop
* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
* Add elementwise operation kernel and example
* Add comment
* Add template argument of dim . Prepare to support multiple dimension
* Rename example
* Support 1 dimension
* Add static assert
* Add comment
* Extract pad
* Remove redundant argument
* Support any dimension for elementwise operation
* Remove line
* Let it be the multiple number of CU
* Move thread per block to the parameter of constructor
* rename threadPerBlock with blockSize
* Support double
* rename kernel function name
* remove redundant include header
* Refine type
* Need to the final dimension
* Refine variable name
* Refine type
* Use index_t instead of int in API
Co-authored-by: rocking <chunylai@amd.com>
* Turning compare warnings on
* Cleaning part I
* Cleaning part II
* Explicit static_cast to ck::type_convert
* Resolving large tensor size issue.
* format
* revert change to tensor descriptor; promote lementSpaceSize to 64bit
* use integer value for GEMM test
* Review remarks
* Review remarks + issues with (un)signed arithmetic
* Format fix
* Format
* Clang-format.
* fix 2gb limit issue
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
* [Experimental] Change to gemm+reduce and batched-gemm+reduce
* Use threadwise-reduce function to improve the gridwise_gemm_reduce_xdl_cshuffle kernel
* Tiny fix in device_batched_gemm_xdl.hpp
* clang-format library/src/utility/conv_fwd_util.cpp
* Convolution ND
* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances
* Move convnd f32 instance file to comply with repo structure.
* Conv 1D tensor layouts.
* Formatting and use ReferenceConv
* Reference ConvFwd supporting 1D and 2D convolution.
* Debug printing TensorLayout name.
* Conv fwd 1D instance f32
* Refactor conv ND example.
Needed to support various conv dimensio.
Needed to support various conv dimensions
* Rename conv nd example director to prevent conflicts.
* Refactor some common utility to single file.
Plus some tests.
* Refactor GetHostTensorDescriptor + UT.
* Add 1D test case.
* Test reference convolution 1d/2d
* Remove some leftovers.
* Fix convolution example error for 1D
* Refactor test check errors utility function.
* Test Conv2D Fwd XDL
* More UT for 1D case.
* Parameterize input & weight initializers.
* Rename example to prevent conflicts.
* Split convnd instance into separate files for 1d/2d
* Address review comments.
* Fix data type for flops/gbytes calculations.
* Assign example number 11.
* 3D cases for convolution utility functions.
* 3D reference convolution.
* Add support for 3D convolution.
* Check for inputs bigger than 2GB.
* Formatting
* Support for bf16/f16/f32/i8 - conv instances + UT.
* Use check_err from test_util.hpp.
* Split convnd test into separate files for each dim.
* Fix data generation and use proper instances.
* Formatting
* Skip tensor initialization if not necessary.
* Fix CMakefiles.
* Remove redundant conv2d_fwd test.
* Lower problem size for conv3D UT.
* 3D case for convnd example.
* Remove leftovers after merge.
* Add Conv Specialization string to GetTypeString
* Skip instance causing numerical errors.
* Small fixes.
* Remove redundant includes.
* Fix namespace name error.
* Script for automatic testing and logging convolution fwd UTs
* Comment out numactl cmd.
* Refine weights initalization and relax rtol for fp16
* Move test_util.hpp to check_err.hpp
* Refine weights initalization and relax rtol for fp16
* Refactor common part of test conv utils.
* Move utility function to single common place.
* Add additional common functions to utility.
* Refactor convnd_fwd_xdl examples.
* Remove redundant files.
* Unify structure.
* Add constructor to ConvParams.
* And add input parameters validation.
* Modify conv examples to use single utility file.
* Remove check_error from host_tensor.hpp
* Get rid of check_indices function.
* Remove bf16_to_f32 function overload for scalars.
* Fix namespace.
* Add half_float::half for check_err.
* Fix conv params size in UT.
* Fix weights initialization for int8.
* Fix weights initialization for int8.
* Add type_convert when store output in ref conv 1D.
* Get back old conv2d_fwd_xdl operation.
* Silence conv debug print.
* format
* clean
* clean
* Fix merge.
* Fix namespace for check_err
* Formatting.
* Fix merge artifacts.
* Remove deleted header.
* Fix some includes and use ck::utils::check_err.
* Remove unused check_indices restored by previous merge.
* Fix namespaces after merge.
* Fix compilation error.
* Small fixes.
* Use common functions.
* Fix filename
* Fix namespaces.
* Fix merge artifact - retrieve removed by accident fun.
* Fix ConvForwardSpecialization.
* Working example of OpInstanceRunEngine for conv2dfwd UT.
* Adhere to coding style rules.
* Formatting and adhere to coding style rules.
* Fix merge artifacts.
* Utility for collecting conv fwd instances.
+ Plus commmon part for parsing cmdline params.
* Refactor FillUniform because of segfault for int8_t.
* Naming convention.
* Elegant version of device mem allocation.
* Use OpInstanceRunEngine in conv fwd nd tests.
* Multiple refinements.
* conditional init
* don't run reference op if not provided.
* Use OpInstanceRunEngine for ckProfiler conv_fwd
* Refactor common tensor fill function to separate file.
* Clean up unused functions.
* Support different init methods.
* Create CMake target for conv_fwd_util.
* Add header for profile_convnd_fwd.cpp
* Fix CMakefiles to link with conv_fwd_util where needed.
* Fix some clutter.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Add math functions for host
* Change to host reduction to use ck::math:
* Remove the using of half_float::half and half.hpp from reduction example/profiler/ctest
* Convolution ND
* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances
* Move convnd f32 instance file to comply with repo structure.
* Conv 1D tensor layouts.
* Formatting and use ReferenceConv
* Reference ConvFwd supporting 1D and 2D convolution.
* Debug printing TensorLayout name.
* Conv fwd 1D instance f32
* Refactor conv ND example.
Needed to support various conv dimensio.
Needed to support various conv dimensions
* Rename conv nd example director to prevent conflicts.
* Refactor some common utility to single file.
Plus some tests.
* Refactor GetHostTensorDescriptor + UT.
* Add 1D test case.
* Test reference convolution 1d/2d
* Remove some leftovers.
* Fix convolution example error for 1D
* Refactor test check errors utility function.
* Test Conv2D Fwd XDL
* More UT for 1D case.
* Parameterize input & weight initializers.
* Rename example to prevent conflicts.
* Split convnd instance into separate files for 1d/2d
* Address review comments.
* Fix data type for flops/gbytes calculations.
* Assign example number 11.
* 3D cases for convolution utility functions.
* 3D reference convolution.
* Add support for 3D convolution.
* Check for inputs bigger than 2GB.
* Formatting
* Support for bf16/f16/f32/i8 - conv instances + UT.
* Use check_err from test_util.hpp.
* Split convnd test into separate files for each dim.
* Fix data generation and use proper instances.
* Formatting
* Skip tensor initialization if not necessary.
* Fix CMakefiles.
* Remove redundant conv2d_fwd test.
* Lower problem size for conv3D UT.
* 3D case for convnd example.
* Remove leftovers after merge.
* Add Conv Specialization string to GetTypeString
* Skip instance causing numerical errors.
* Small fixes.
* Remove redundant includes.
* Fix namespace name error.
* Script for automatic testing and logging convolution fwd UTs
* Comment out numactl cmd.
* Refine weights initalization and relax rtol for fp16
* Move test_util.hpp to check_err.hpp
* Refine weights initalization and relax rtol for fp16
* Refactor common part of test conv utils.
* Move utility function to single common place.
* Add additional common functions to utility.
* Refactor convnd_fwd_xdl examples.
* Remove redundant files.
* Unify structure.
* Add constructor to ConvParams.
* And add input parameters validation.
* Modify conv examples to use single utility file.
* Remove check_error from host_tensor.hpp
* Get rid of check_indices function.
* Remove bf16_to_f32 function overload for scalars.
* Fix namespace.
* Add half_float::half for check_err.
* Fix conv params size in UT.
* Fix weights initialization for int8.
* Fix weights initialization for int8.
* Add type_convert when store output in ref conv 1D.
* Get back old conv2d_fwd_xdl operation.
* Silence conv debug print.
* format
* clean
* clean
* Fix merge.
* Fix namespace for check_err
* Formatting.
* Fix merge artifacts.
* Remove deleted header.
* Fix some includes and use ck::utils::check_err.
* Remove unused check_indices restored by previous merge.
* Fix namespaces after merge.
* Fix compilation error.
* Small fixes.
* Use common functions.
* Fix filename
* Fix namespaces.
* Fix merge artifact - retrieve removed by accident fun.
* Fix ConvForwardSpecialization.
* Adhere to coding style rules.
* Fix merge artifacts.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>