* rename folder
* Add type string
* Remove typo
* Add deviceOp to backward x
* Add comment to describe the behavior of backward normalization
* Add kernel function, prepare to implement
* implement generic kernel
* Check vector size
* Add sweep once pipeline for small reduce size
* Fix bug of KRaw_ error
* Fix bug of dx stride
* sanity check for mean and rstd
* backward x for groupnorm
* Add bwd x instance
* add layernorm 2d bwd gamma beta instances
* Change save mean var type from f32 to f16 in f16 mode
* Change the example to f16
* Add groupnorm bwd gamma beta instance
* Add groupnorm bwd x instance
* Fix naming
* Add layernorm bwd x ckprofiler
* Add groupnorm bwd x profiler
* clang format
* Rename bwd x to bwd data
* Fix bug of verification in profiler
* Add test of layernorm and groupnorm bwd data
* Add missing cmake
* Add layernorm2d bwd data
* rename fwd example
* Add groupnorm client example
* Fix typo. replace Invarient with Invariant
* Add checking before running the best instance
* spolit the static library into several
* update lib paths and fix client example
* do not use device_mha_operarions for client examples
* use appropriate libs to link to client examples
* remove the gpu/transpose path from the list
* try fixing clinet examples 3,4,9
* add necessary libs for client examples
* fix the layernorm client example
* fix the client examples 23 and 24
* fix typo
* add interface library and refresh clang format
* Rename folder
* Add layernorm 4d fwd example
* Rename original layernorm example
* Add layernorm 4d f16 test
* Add layernorm4d_fwd client example
* Support layernorm4D in ckProfiler
* Rename groupnorm to groupnorm fwd in example
* Rename layernorm and group fwd in test
* Rename normalization to normalization_fwd (instances)
* Add fwd to DeviceNormalization
* Rename external api header
* Rename folder, because we can also add bwd in this folder
* Add fwd in layernorm and groupnorm (profiler
* Fix compile error
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>