* Extract base class for elementwise
* Refactor interface of DeviceGemmReduce. Do not use tuple in interface
* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add
* Unify base class of gemm + reduce and gemm + bias + add + reduce
* 1. Rename gemm_bias_add_reduce for external api
2. Refine cmake
* Add normalize device operation
* [What] Reorder the argument
[Why] Because d0 is also the input of c.
* Add type string
* Add example of gemm_bias_add_layernorm via external api
* Refactor example code
* clang-format
* Fix compile error
* clang-format
* Add external api for gemm_add_add_layernorm and normalize
* Add client example
* clang-format
* Remove template from Reducton operation classes and add template to their operator() and GetIdentityValue() interfaces
* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
* Use struct-scope operator template instantiation for binary and unary element-wise operations
* Change a few more elementwise operations to use template for operator()
* Tiny correction in Normalize operator
* Add static_assert to check the data type appliability for some reduction accumulator and element-wise operatons
* Correction in some examples with regard to using ReduceAccDataType
* Use static_assert for UnaryDivide
* Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
* Tiny fix with regard to SetWorkSpacePointer()
* Copy "gemm reduce" to "gemm bias add reduce"
* Implement gemm bias add reduction
* Fix compiler error due to merge from develop
* Add tensor operation for gemm + bias + add + reduce
* Add gemm_bais_add_reduce to ckProfiler
* Add c1 functor
* Refine type
* Use reduceAccDataType instead of explicitly float
* Change to use check_err()
* Do relu in float32 instead of bhalf_t. Because bhalf_t is unsigned
* Refactor relu. using type_trait instead of overloading
* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
* Fix denominator
* Refine nameing
* Fix denominator in host
* Remove useless include header
* Use AccDataType
* Fix static_cast order
* Refine type
* [What] Remove tuple type in the base class
[Why] External api depend on base class. if base class has relationship with type, we will need many class for different type
* Use the unified naming for math functions on host and HIP kernel
* Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming
* Renaming GetReductionZeroVal() to GetIdentityValue()
* Tiny renaming in profile_reduce_impl.hpp
* More renaming in profile_reduce_impl.hpp
* Replace zeroVal by identiyVal
* Remove ck_ prefix in the naming of ck::math provided functions
* Implement reduction meand and reduction square mean
* Refine file name
* Add reduce mean and square mean
* Fix parameter name
* Add normalize device op (not implement invoker::run())
* Remove epislon
* Refine deviceop
* Add 5ary elementwise for normalization
* Add layernorm example
* layerNorm verication
* Fix compiler error due to merge from develop
* Fix typo
* Fix compile error
* Refine naming
* [What] Suport non pointer for invoker and argument
[Why] Snyc coding style with gemm
* Refine folder name
* Refine class name
* Evaluate perf of the kernel
* Fix compile error
* [What] Refine perf evaluation in example of gemm + reduction
[Why] evaluation of gemm + reduction may cause verification fail. Because evaluation will not initial global memory
* clang-format
* add intrin_mfma_f64_16x16x4f64
* add example
* gemm reference add double data type
* chang init data
* fix M N PerXdlops
* fix ifdef
* add comparsion config
* add conv fwd example
* format log out
* change rc matrix egister layout
* reorganize example
* reorganize example 2
* format,because merge develop
* fix call impl adding acc data type
* lost ;
* add compiler warning
* change example tunning parameters
* add test for fp64
* add instance
* add test/gemm/gemm_fp64.cpp
* fix get name issue
* remove some tunning parameter
* fix conflict
* format
* use integer value for GEMM test
* add acc data type
* remove typeid because fp16
* fix streamconfig etc bug from merging develop
* format
* remove test_gemm_xdl_fp64
* add AccDataType
* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* [What] Rename the example
[Why] Prepare to add unary reduction
* Add global oparation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template the specialize the float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template augument
* Fix clang format
* Keep flexible of different stride for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming
[Why] Prepare to add out elementop
* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
* [Experimental] Change to gemm+reduce and batched-gemm+reduce
* Use threadwise-reduce function to improve the gridwise_gemm_reduce_xdl_cshuffle kernel
* Tiny fix in device_batched_gemm_xdl.hpp
* clang-format library/src/utility/conv_fwd_util.cpp