* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix not closed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before lunching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comment.s
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This give 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation err and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment into VectorSize
* Unblock tests.
* ake the cshuffle compilable
* modify Mhe reference on gpu and cpu. Correaccess of cshuffle
* fix the cpu reference code
* Complete the in tile shuffle logic
* restructure the kernel template input
* change the naming pattern of ck_tile gemm pipeline
* Re-format files using remod.py
* Solve the fmha conflict with gemm
* Comment Addressed from Carlus
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Fix text alignment of ArgParser::print()
* Update example README files
* Clarify make-ck-dev.sh <arch> usage
* Only keep some of the argument from '-?' output
* Undo command line output changes in README
* Only keep existing argument on doc and update description
* Fix text alignment
* Make cmake-ck-*.sh compatible with 'sh' command
* Finished the feature of gpu verification
* Add the ck_tile_gemm test in the CI CD
* add the include of tensor_layou in reference_gemm
* Comment Addressed
* split ck_tile fhma and gemm tests into separate stages
* restructure the reference gemm
* restructure a new reference_gemm api that could read the device mem
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
* Fix: Clang Format, API fixed from fmha
* fix with better naming convention
* revert back the pipeline code of fmha
* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.
* clang format with the reference_gemm file
* convert the clang format with the remod.py
* Changed the format and variable name of the kernel gemm_shape and partitioner
---------
Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>