* [Experimental] Change to gemm+reduce and batched-gemm+reduce
* Use threadwise-reduce function to improve the gridwise_gemm_reduce_xdl_cshuffle kernel
* Tiny fix in device_batched_gemm_xdl.hpp
* clang-format library/src/utility/conv_fwd_util.cpp
* adding batched_gemm_and_reduction
* batched_gemm_reduce works with batch_count=1
* fix a bug in grid_size; batched_gemm_reduce works for batch_count > 1
* adding profiler for batched_gemm_fp16
* fixed a bug in the declaration of d1 and d0; both example and profiler work
* clang-format
* cleanup
* batched_gemm_reduce: add test
* minor change
* fixed some typos in function names