* Fix instances dtype check
* Fix source dtypes seletor for examples and tests
* Sync with new cmakefile changes
* Remove not needed ifdefs
* Remove not needed ifdefs
* reinterpret_cast to const char* in dumpBufferToFile to be compatible with both const and non-const input pointers
* Add seed input to GeneratorTensor_4 for normal_distribution generator
* Add GetTypeString() for DeviceElementwiseImpl
* Add HIP_CHECK_ERROR macro
* save mean and inverse std in normalization
* Save mean and inverse std in splitK
* Vector save mean and inv std
* Modify instance for save mean and std
* simplify the layernorm example
* Save mean and std in groupnorm example
* Save mean and inv std in ckProfiler and test
* Remove compute data type from base class
* Save mean and inv std in client example
* Add changelog
* clang format
* Fix compile error
* Refine naming
* Avoid error in bf16
* revert changelog
* add a hipTensor test to CI
* use jenkins git plugin
* change hipTensor folder location in CI
* change the git method for hipTensor
* run tests usign ctest
* check the hipTensor contents
* only build hipTensor on MI100/200
* pull hipTensor as zip archive
* fix jenkins syntax
* add path to the CK installation
* combine build commands into one shell
* change jenkins syntax for CK installer path
* try different syntax
* allow unzip overwrite
* fix jenkins file syntax
* remove any old versions of hipTensor before building
* add option to select hipTensor branch for testing
* Introduce LocalBlockToCTileMap.
* Change the signature of CalculateBottomIndex() function which now does
not accept any argument. The B2C map which is already passed as an
argument to the kernel Run function is calculating block's local id
already outside at kernel entry point __global__ function.
The LocalB2C map stores as members local block ID.
* Use LocalBlockToCTile map in device ops.
* First draft of tile loop work distribution.
* Fix typo.
* Simplify kernel arguments.
Calculate descriptors & B2C maps on the device.
* Use looping kernel.
* Fix B2C constructor.
* Fix Navi21 errors.
* Calculate tile start/end in device kernel.
* Change Run API to accept user provided workspace buffer.
* Add new line at EOF.
* Move Gemm KernelArguments to device op interface.
* Remove unused code.
* Update API.
* Launch grid size which is min of occupancy vs tile count
* Get back to use constant memory for gemm descriptors.
* Remove unused code.
* Add default virtual method implementation.
* Update comments to conform with doxygen style.
* Fix doc style and unused parameters.
* Add thread cluster lengths to kernel name.
* Remove old splitk impl and replace it with tile looping one.
* Modify instances.
* set KPerBlock to 64
* maximize wherever possible vector load size.
* Fix instances cluster lengths.
* Change comment style.
* Use 128b store where possible in instances.
* Update test cases, since KPerBlock has doubled.
* Update output stream operator for Sequence.
* Add pipeline version to GroupedGEMM device op type string.
* Fix pipeline version type logging.
* Fix input tensors type after merge.
* Fix compiler error.
* Fix output stream operator for Pipeline version.
* Store using 128b.
* Set of instances with kpb 32/64
* Limit number of instances
* Remove commented out instances.
* Fix function name.
* Limit the number of instances.
Add pipline version to the regular instances
* Change thr cluster layout for reading B tensor.
* disabled failed instances
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
* workaround nan problem by changing output to fp16
* enable f8/bf8 gemm tests on MI200
* workaround f16 to f8 conversion
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
* Extract common functionality to separate files
* Reference contraction: Remove incorrect consts from type_converts
* Reference contraction: Add missing type_convert for dst value
* Reference contraction: Fix incorrect order of B matrix dimensions
* Add support for mixed precision in contraction scale and bilinear
* Move using statements from instances to a common file
* Move using statements from examples to a common file
* Fix the order of B matrix dimensions across examples and profiler
* Fix the computation of error threshold
* Make ComputeDataType an optional argument
* Include possible DataType -> ComputeDataType casting error in the threshold
* Remove commented code
* Added error check after kernel launch (#919)
Co-authored-by: Xiaodong Wang <xdwang@meta.com>
Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
* remove M=0 test cases for test_gemm_splitk
---------
Co-authored-by: Xiaodong Wang <xdwang@meta.com>
Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
* Add column to image kernel
* Minor fixes for dtypes and client examples
* Disable tests for disabled dtypes
* Disable add instances functions for disabled data types
* Minor stylistic fixes
* Revert "Disable add instances functions for disabled data types"
This reverts commit 728b869563.
* Instances reduction
* Add comments in device_column_to_image_impl
* Update changelog and Copyrights
* Improve changelog
* split the types in gemm_bilinear instances, add condition to cmake policy
* fix syntax
* split the data types in batchnorm examples
* fix the batchnorm_bwd test
* fix types in the batchnorm_bwd test