Detailed comments explain:
- sequence_find_value: constexpr loop with O(1) template instantiation depth vs. O(N) for the recursive version
- find_in_tuple_of_sequences: Pack expansion instead of nested static_for loops
- Why constexpr search reduces template instantiations dramatically
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach
This documentation helps maintainers understand the compile-time search optimization
strategy without relying on specific benchmark numbers that may vary by use case.
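
As a rough illustration of the constexpr-loop idea described above, the sketch below searches a compile-time integer sequence with an ordinary loop. The names `Sequence` and `find_value_sketch` are hypothetical stand-ins written for this note, not the library's actual definitions or signatures.

```cpp
// Hypothetical stand-in for a compile-time integer sequence type.
template <int... Is>
struct Sequence
{
};

// Returns the index of the first element equal to `value`, or -1 if absent.
// The loop runs inside the constant evaluator, so searching N elements
// instantiates this one function template once, instead of one template
// per element as a recursive Head/Tail search would.
template <int... Is>
constexpr int find_value_sketch(Sequence<Is...>, int value)
{
    constexpr int n = sizeof...(Is);
    // Size is at least 1 so an empty sequence still yields a valid array.
    const int values[n > 0 ? n : 1] = {Is...};
    for(int i = 0; i < n; ++i)
    {
        if(values[i] == value)
            return i;
    }
    return -1;
}

// Usage: the result is available at compile time.
static_assert(find_value_sketch(Sequence<4, 7, 9>{}, 7) == 1, "found at index 1");
static_assert(find_value_sketch(Sequence<4, 7, 9>{}, 5) == -1, "not present");
```

The template depth stays constant because the iteration happens in the loop rather than in nested instantiations.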
The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385ms).
Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.
Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
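
The change from a lambda-based generate_tuple to direct pack expansion can be sketched roughly as follows. `ToyDescriptor` and `element_size_sketch` are hypothetical names for illustration, and the sketch assumes the element size is the product of the visible dimension lengths; the real InitializeElementSize may differ in detail.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical descriptor that stores the length of every dimension.
template <std::size_t N>
struct ToyDescriptor
{
    std::array<long, N> lengths;
};

// Helper that expands the visible-dimension ids directly in a fold
// expression. No lambda is involved, so no unique closure type is
// created for each descriptor type.
template <std::size_t N, std::size_t... VisibleIds>
constexpr long element_size_sketch(const ToyDescriptor<N>& desc,
                                   std::index_sequence<VisibleIds...>)
{
    // Binary left fold: ((1 * len[a]) * len[b]) * ...
    return (1L * ... * desc.lengths[VisibleIds]);
}

// Usage: dimensions 1 and 3 are treated as visible, so the size is 3 * 7.
constexpr ToyDescriptor<4> desc{{2, 3, 5, 7}};
static_assert(element_size_sketch(desc, std::index_sequence<1, 3>{}) == 21, "3 * 7");
```

The helper is instantiated once per (descriptor, index pack) pair, whereas a lambda passed to a generic tuple generator adds a distinct closure type plus the generator instantiations that depend on it.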
The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).
Replace with find_in_tuple_of_sequences helper that uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.
Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
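
A minimal sketch of searching a tuple of sequences without nested static_for lambdas is shown below. All names (`Sequence`, `Tuple`, `FindResult`, `find_in_tuple_sketch`) are hypothetical, and the sketch uses a plain constexpr loop over a pack-expanded array rather than the if-constexpr recursion mentioned above, so it illustrates the idea rather than the actual find_in_tuple_of_sequences helper.

```cpp
// Hypothetical stand-ins for compile-time sequence and tuple types.
template <int... Is>
struct Sequence
{
};

template <typename... Ts>
struct Tuple
{
};

struct FindResult
{
    int tuple_index; // which sequence holds the value, -1 if none
    int position;    // index inside that sequence, -1 if none
};

// Linear search inside one sequence using a constexpr array.
template <int... Is>
constexpr int find_in_sequence_sketch(Sequence<Is...>, int value)
{
    constexpr int n = sizeof...(Is);
    const int values[n > 0 ? n : 1] = {Is...};
    for(int i = 0; i < n; ++i)
    {
        if(values[i] == value)
            return i;
    }
    return -1;
}

// One pack expansion runs the per-sequence search for every tuple element;
// a loop then picks the first sequence that contains the value.
template <typename... Seqs>
constexpr FindResult find_in_tuple_sketch(Tuple<Seqs...>, int value)
{
    constexpr int n = sizeof...(Seqs);
    const int positions[n > 0 ? n : 1] = {find_in_sequence_sketch(Seqs{}, value)...};
    for(int i = 0; i < n; ++i)
    {
        if(positions[i] >= 0)
            return {i, positions[i]};
    }
    return {-1, -1};
}

// Usage: value 2 lives in the second sequence (index 1) at position 1.
using UpperIds = Tuple<Sequence<0>, Sequence<1, 2>, Sequence<3>>;
constexpr FindResult hit = find_in_tuple_sketch(UpperIds{}, 2);
static_assert(hit.tuple_index == 1 && hit.position == 1, "found in second sequence");
```

Because the search is expressed as ordinary functions, the instantiation count grows with the number of distinct sequence types rather than with the number of lambda call operators produced by nested loops.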
* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large number of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* clang format v12 to be consistent with miopen
* avoid clang-format for config file
* reopen masking att instance now that CI is upgraded
* re-enable instances that previously failed on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working
add non-permuted problem statement
evaluating whether overhead comes from permutation or the extra kernel arg
* interface for bias addition without implementing it
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn w/o permute
* add tensor specialization to template args
since tensor spec packed shows perf parity when permutation isn't needed
remove redundant template args
comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor
Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring a unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* attention host validation
* add blockwise softmax v1
* iteratively update softmax+gemm
* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
* add init method for easier debugging
* do away with manual thread cluster calculation
* generalize blockwise softmax interface
* row-wise softmax sum & max
* format
* rename to DeviceBatchedGemmSoftmaxGemm
* add gemm_softmax_gemm instances and tests
* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Turning compare warnings on
* Cleaning part I
* Cleaning part II
* Explicit static_cast to ck::type_convert
* Resolving large tensor size issue.
* format
* revert change to tensor descriptor; promote ElementSpaceSize to 64bit
* use integer value for GEMM test
* Review remarks
* Review remarks + issues with (un)signed arithmetic
* Format fix
* Format
* Clang-format.
* fix 2gb limit issue
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>