* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large amount of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* clang format v12 to be consistent with miopen
* avoid clang-format for config file
* reopen masking att instance due to CI is upgraded
* re-enable instances previously failed on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working
add non-permuted problem statement
evaluating whether overhead comes from permutation or the extra kernel arg
* interface for bias addition without implementing it
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
* add tensor specialization to template args
since tensor spec packed shows perf parity when permutation isn't needed
remove redundant template args
comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor
Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring an unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* attention host validation
* add blockwsie softmax v1
* iteratively update softmax+gemm
* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
* add init method for easier debugging
* do away with manual thread cluster calculation
* generalize blockwise softmax interface
* row-wise softmax sum & max
* format
* rename to DeviceBatchedGemmSoftmaxGemm
* add gemm_softmax_gemm instances and tests
* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Turning compare warnings on
* Cleaning part I
* Cleaning part II
* Explicit static_cast to ck::type_convert
* Resolving large tensor size issue.
* format
* revert change to tensor descriptor; promote lementSpaceSize to 64bit
* use integer value for GEMM test
* Review remarks
* Review remarks + issues with (un)signed arithmetic
* Format fix
* Format
* Clang-format.
* fix 2gb limit issue
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>