* Avoid passing indices (std::vector) by value to host tensor's operator()
Each access requires 2 allocations and copies of the vector.
* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification
* Compute ds_hp_host_ref in parallel
This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.
* Do not use ForEach for simple copy and conversion of large tensors
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
* add transpose load; no real logic
* fix some compile errors
* fix some issues
* update transpose load logic
* add some fixes
* fix a distribution issue
* update some codes
* add some fix
* can pass; but no logic
* transpose load enable
* update tile transpose
* miss output tile distribution mapping
* hack for transpose 16x16
* update output tensor distribution
* delete unused variables
* fix transpose related codes
* update transpose load example
* exchange the iteration order
* fix 16x16 related dimension transpose
* fix a transpose index issue
* fix a transpose index issue
* fix clang format check
* update load tile transpose related codes
* fix compile errors and pass 16x16 tests
* fix a typo
* update logic
* check other data types
* add transpose load api
* update transpose load api
* fix clang format check
* change file name
* refactor codes
* update code name
* delete some unused codes
* delete the unused oob flag for transpose load
* update tensor view api for transpose load
* update for testing
* fix a typo error
* move transpose ops to example directory
* update transpose api
* update include file
* fix for pr review
* fix compile errors
* add transpose load; no real logic
* fix some compile errors
* fix some issues
* update transpose load logic
* add some fixes
* fix a distribution issue
* update some codes
* add some fix
* can pass; but no logic
* transpose load enable
* update tile transpose
* miss output tile distribution mapping
* hack for transpose 16x16
* update output tensor distribution
* delete unused variables
* fix transpose related codes
* update transpose load example
* exchange the iteration order
* fix 16x16 related dimension transpose
* fix a transpose index issue
* fix a transpose index issue
* fix clang format check
* update load tile transpose related codes
* fix compile errors and pass 16x16 tests
* fix a typo
* update logic
* check other data types
* add transpose load api
* update transpose load api
* fix clang format check
* change file name
* refactor codes
* update code name
* delete some unused codes
* delete the unused oob flag for transpose load
* update tensor view api for transpose load
* update for testing
* fix a typo error
* move transpose ops to example directory
* update transpose api
* update include file
* fix for pr review
* fix compile errors
* change directory name
* delete the duplicated directory
* update cmakelists file
* delete the unused codes
* update function names
* update transpose policy
* update code after remod.py
* update codes
* add some comment
* Polish the instr infrastructure
* build up the fixed instr
* redesign the transpose api, currently it has numerical error
* add the bf16 transpose
* fix some issues
* add some comments
* update document
* Finished the refactor of API and pass through the verification
* fix the merging issue
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* [ck-tile] fix default epilogue in gemm universal
* argument validation needs vector size D
* operator() needs to specify dram windows
* copy/paste from cshuffle epilogue
* clang-format
* mark unused argument
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when use you need compare gemm perf with different tile/pipeline config
- we also can use it simplify the code for wmma support in the furture.
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Address review comments
* rebase code and fix clang format.
* fix clang format
* support pipeline v5.
* fix merge conflict
* address review comment
* add missing file
* address review comment v2
* fix build error
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* Add support for specifying valid flag when fetching elements for tile_scatter_gather
Add constexpr for operator[] of TrueGenerator
* Use different path when valid is enabled
* Multiple d, initial commit
* Check Ds Layout
* Readme and clang format
* Update branch & conflicts
* Multiple D - fix clang-formatter
* Rename elemetwise_op
* Fix CI
* Code review part1
* Remove printf
* Remove unnecessary comment
* Add new tests with Col layout
* Review part 2
* Added support for Multiple D GEMM
* Update comment
* Remove maybe_unused
* Clang-format
* Review part 3
* Add comment to function
* Add comment to function: another
* Take number of params for a refrence function
* Remove additional d param for 0 tensor
* Change name of function
* Fix CI fails
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround
* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo
---------
Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero.
For examaple, fixing K,V to be all 1, and Q is random, which outputs are
expected to be all 1. But we found that it would have some incorrect
outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 =
0.0005 < 0.001) Thus, ck do not report this error.
* Add option to switch rtn/rtz for fmha fwd
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
* window_origin variable now in base class
* abstracted more functions
* consolidated tile_window_static_distribution and tile_window_static_lengths
* clang format
* skeleton code for tile_window and tile_window_linear consolidation
* more abstraction
* moved variables from child to parent
* clang format
* removed comments
* removed debug code
* removed debug code
* abstracting traits WIP
* consolidated traits
* removed comments and clang formatted