* re-enable masking attention instance now that CI is upgraded
* re-enable instances that previously failed on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working
add non-permuted problem statement
evaluating whether overhead comes from permutation or the extra kernel arg
* add interface for bias addition (not yet implemented)
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn w/o permute
* add tensor specialization to template args
since tensor spec packed shows perf parity when permutation isn't needed
remove redundant template args
comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor
Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* add space_filling_curve
* cleanup and move space_filling_curve into test
* WIP: start refactoring threadwise_transfer_v1r3
* threadwise_copy works but needs further refactoring
* add some comments
* add SpaceFillingCurve::GetIndices()
* minor changes
* removed GetIndices; refactored GetDstCoordinateResetStep
* add DynamicBuffer::Transfer, but Add is not tested
* rebased against develop
* threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work
* minor changes
* refactored threadcopy v3r1, v2; removed old implementations
* clang-format
* cleanup
* fix a typo in v6r3
* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* add space_filling_curve
* cleanup and move space_filling_curve into test
* add functions for backward and forward step; hard coded results in unit test
* minor changes