* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm
* Update rmsnorm host reference
* Update tree reduction of rmsnorm for reference host
* Fix cross warp for m > 1 cases
* Add RMSNorm model selectable option for host reference
* Fix save_unquant cases
* Update reference rmsnorm forward function to use enum for model sensitivity
* Update reference rmsnorm calculation for model sensitivity
* Fix m warp for layernorm
* Adjust parameter of reference for twoPass
* Fix clang format
* Run clang-format-overwrite.sh to fix formating issue
* fix clang format
---------
Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Our current cmake/gtest.cmake file does not enable gmock. Gmock is needed for matchers that are needed for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake:
* Enable BUILD_GMOCK by default (was previously disabled)
* Patch gtest-src/googlemock/CMakeLists.txt for broken include path.
* Add configuration to gmock if the target is used.
No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code.
* Enable the adapted LDS B layout for Row-Major
* fix formatting
* Implement specialized col-major A LDS block descriptor
* Fix formatting
* Use VecLoadSize for AK1/BK1
* Fix some thread access pattern values
* Use GetVectorSizeA for A
* Fix formatting
* Add extra condition to avoid division by zero
* disable layout for wave32
* remove extra else
* fix formatting
* Fix formatting
* Rename one remaining TileDistributionEncodingPattern2D
* Use integer ceil division
* revert remod.py changes
* also revert utility.hpp
* use getA/BTileAccessPattern everywhere
* use integer_divide_ceil for AK0 too
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Initial commit. create batched_contraction_kernel file
* initial problem definition
* implement initial example to launch kernel
* add universal gemm to contraction. initial phase
* complete implementation for special case all Dims are 1 and no Ds
* clean code
* initial changes to support multi dimensional G
* more progress in implementing multiple G
* tmp commit
* manage dynamic NumDimG in kernel
* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit
* implement the example for general Multi dimension G M N K and test different reference calculation algorithms
* 2 functions for reference using multi dimensional and flat indexing
* clean the code for muti dimentional G, M, N, K contraction and add some logs
* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E
* some cleaning on kernel
* clean the code for calculating the offsets from flatten batch number
* Start adding MultiD support to kernel and example
* more changes to manage multi D in kernel and example
* manage passing multi d to kernel and testing.
* complete multi D support in kernel. modify example code to support it
* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning
* Minor fix
* Generalize example code for variable NumD tensors and apply cleanup based on review feedback
* Refactored code and addressed review feedback
* refactoring, cleaning, add documents, in kernel side and example codes
* Optimize batch offset calculation in kernel
* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>