* Support bf16/fb8/bf8 datatypes for ck_tile/gemm
* remove commented out code.
* Addressing code review comments and enabling universal_gemm for all the supported data types.
* Merge conflict resolution.
* Solve the memory pipeline compilation error. Merge with the new change of CShuffle
* finish the feature, pass the tests
* Fix the pipeline and add the benchmark script for other data types
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Refactor universal gemm policy.
* Adapt example to refactor changes.
* Introduce static encoding pattern
* Adding shuffled encoding patterns.
* Fix err in reverse tuple.
* Add transpose_tile2d
* Small refactoring + doc
* Enable reading on contiguous dimension in all layouts.
* Transpose A/B register tile if needed for comp v3 pipeline.
* Take contiguous dim size when calculating dram vector load size.
* A/B smem pack size taken from WarpGemm attributes
* Update B LDS layout and setup tile distribution pattern at class level.
* Fix static assert.
* Fix errors in examples.
* Formatting & fix IsTranspose
* Fix VectorSize & refactor.
* Add error loging messages.
* Fix VecLoadSize and TranspseC for mem pipeline.
* Update unit-tests & disable mem pipeline.
* Clang format
* Update include/ck_tile/core/tensor/tile_window.hpp
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
* Fix compilation and reviewers comments.
* Refactor unit-test. Fallback to non-universal gemm.
Need to use GemmPipelineAGmemBGmemCRegV1 for now,
since GemmKernel is now supporting also non-K major vector reads.
---------
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
* not using structures under ck_tile/ops for ck_tile/host
* update as constexpr function
* Rename fn
* Update other examples.
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* add interwave scheduler for gemm mem pipeline
* Fix merge artifacts.
* Refactor unit tests.
* Switch to interwave scheduler for mem example
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Block universal gemm.
* Universal block gemm with interwave scheduler - draft.
* Refactoring
* Move a/b_warp_tiles into BlockGemmImpl
* set BlockGemmImpl as a class member
* Change tile size for more suitable to memory bound cases.
* Introduce kKPerThread to WarpGemm
* Add documentation comment.
* Fix Interwave scheduler block gemm.
* Add compute/memory friendly tile configuration.
* Clean
* New tile configurations in gemm mem example.
* Add more static checks and fix loop order in block gemm.
* Add more static checks and use warp gemm mfma dispatcher.
* Add default scheduler block gemm.
* Remove logging in example.
* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix not closed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before lunching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comment.s
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This give 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation err and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment into VectorSize
* Unblock tests.