* Added two kernel for M=32 problem
* Comment the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipelie v1 and v2. v1 is work for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented codes.
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
* add ab_scale init support
* enabled interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
---------
Co-authored-by: root <jizhan@amd.com>