* Add gemm + layernorm instance
* Add ckProfiler
* Add test
* Add client example
* Detect if user forger to set the workrspace
* Use literal in the example
* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
* check gemm vaildity in IsSupportedArgument
* Add more testcases
* Merge duplicated folder in client example
* Print more infomation
* Use better kernel parameter for MS problem size
* clang format
* Add constexpr for if condition and remove redundant include
* Remove cstdlib and add constexpr
[ROCm/composable_kernel commit: f7d28f3e4b]
* add fused addition lyernorm
* add fused addition lyernorm
* changed CMakelist
* removed annotates
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete a annotation line
* fixed some bug to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved immidiate results into lds, it become faster in non-sweeponce cases
* changed naming of C into X to make the defination more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confused comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
[ROCm/composable_kernel commit: 8a4253baaf]
* Sync the naming
* Sync the test of layernorm with groupnorm
* Sync the naming
* Minor change for comment and log
* [What] Add saveMean and SaveInvVariance in the interface.
[Why] These can optimize the backward
[ROCm/composable_kernel commit: d4d1147f0a]
* add fused addition lyernorm
* add fused addition lyernorm
* changed CMakelist
* removed annotates
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete a annotation line
* fixed some bug to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved immidiate results into lds, it become faster in non-sweeponce cases
* changed naming of C into X to make the defination more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confused comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
[ROCm/composable_kernel commit: efbcc6eddc]