* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* Add help for example
* Refactor the batched contraction reference computation to handle stride-aware calculation, plus some code cleanup
* Add stride-aware reference for batched contraction with independent D tensor layouts
* Add -num_d argument for runtime D tensor count selection in batched contraction
* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs
* Add descriptor-based architecture for batched contraction multi-dimensional stride support
* Add multi-dimensional non-contiguous stride support to batched contraction (num_d = 0 case)
* Add complete multi-dimensional stride support via descriptors
* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm
* Clean up batched contraction: remove old UniversalGemmKernel path
* Clean up batched contraction: remove legacy paths and finalize docs
* Optimize batched contraction example: pass dimension sizes not vectors
* correct the reference calculation, unsigned int to int
* Fix batched_contraction C++17 build errors for gfx90a CI
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch
* fix cpp17 compile error in the ck-tile examples
---------
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
* Initial commit. create batched_contraction_kernel file
* initial problem definition
* implement initial example to launch kernel
* add universal gemm to contraction. initial phase
* complete implementation for the special case where all Dims are 1 and there are no Ds
* clean code
* initial changes to support multi dimensional G
* more progress in implementing multiple G
* tmp commit
* manage dynamic NumDimG in kernel
* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit
* implement the example for general multi-dimensional G, M, N, K and test different reference calculation algorithms
* add 2 reference functions, one using multi-dimensional indexing and one using flat indexing
* clean up the code for multi-dimensional G, M, N, K contraction and add some logs
* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E
* some cleaning on kernel
* clean the code for calculating the offsets from flatten batch number
* Start adding MultiD support to kernel and example
* more changes to manage multi D in kernel and example
* manage passing multi d to kernel and testing.
* complete multi D support in kernel. modify example code to support it
* Fix the algorithm that calculates offset values for D tensor batches, plus some code cleanup
* Minor fix
* Generalize example code for variable NumD tensors and apply cleanup based on review feedback
* Refactored code and addressed review feedback
* refactor, clean up, and add documentation in the kernel and example code
* Optimize batch offset calculation in kernel
* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>