* save mean and inverse std in normalization
* Save mean and inverse std in splitK
* Vector save mean and inv std
* Modify instance for save mean and std
* simplify the layernorm example
* Save mean and std in groupnorm example
* Save mean and inv std in ckProfiler and test
* Remove compute data type from base class
* Save mean and inv std in client example
* Add changelog
* clang format
* Fix compile error
* Refine naming
* Avoid error in bf16
* revert changelog
* Introduce LocalBlockToCTileMap.
* Change the signature of CalculateBottomIndex() function which now does
not accept any argument. The B2C map which is already passed as an
argument to the kernel Run function is calculating block's local id
already outside at kernel entry point __global__ function.
The LocalB2C map stores as members local block ID.
* Use LocalBlockToCTile map in device ops.
* First draft of tile loop work distribution.
* Fix typo.
* Simplify kernel arguments.
Calculate descriptors & B2C maps on the device.
* Use looping kernel.
* Fix B2C constructor.
* Fix Navi21 errors.
* Calculate tile start/end in device kernel.
* Change Run API to accept user provided workspace buffer.
* Add new line at EOF.
* Move Gemm KernelArguments to device op interface.
* Remove unused code.
* Update API.
* Launch grid size which is min of occupancy vs tile count
* Get back to use constant memory for gemm descriptors.
* Remove unused code.
* Add default virtual method implementation.
* Update comments to conform with doxygen style.
* Fix doc style and unused parameters.
* Add thread cluster lengths to kernel name.
* Remove old splitk impl and replace it with tile looping one.
* Modify instances.
* set KPerBlock to 64
* maximize wherever possible vector load size.
* Fix instances cluster lengths.
* Change comment style.
* Use 128b store where possible in instances.
* Update test cases, since KPerBlock has doubled.
* Update output stream operator for Sequence.
* Add pipeline version to GroupedGEMM device op type string.
* Fix pipeline version type logging.
* Fix input tensors type after merge.
* Fix compiler error.
* Fix output stream operator for Pipeline version.
* Store using 128b.
* Set of instances with kpb 32/64
* Limit number of instances
* Remove commented out instances.
* Fix function name.
* Limit the number of instances.
Add pipline version to the regular instances
* Change thr cluster layout for reading B tensor.
* disabled failed instances
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
* workaround nan problem by changing output to fp16
* enable f8/bf8 gemm tests on MI200
* workaround f16 to f8 conversion
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
* Extract common functionality to separate files
* Reference contraction: Remove incorrect consts from type_converts
* Reference contraction: Add missing type_convert for dst value
* Reference contraction: Fix incorrect order of B matrix dimensions
* Add support for mixed precision in contraction scale and bilinear
* Move using statements from instances to a common file
* Move using statements from examples to a common file
* Fix the order of B matrix dimensions across examples and profiler
* Fix the computation of error threshold
* Make ComputeDataType an optional argument
* Include possible DataType -> ComputeDataType casting error in the threshold
* Remove commented code
* Add column to image kernel
* Minor fixes for dtypes and client examples
* Disable tests for disabled dtypes
* Disable add instances functions for disabled data types
* Minor stylistic fixes
* Revert "Disable add instances functions for disabled data types"
This reverts commit 728b869563.
* Instances reduction
* Add comments in device_column_to_image_impl
* Update changelog and Copyrights
* Improve changelog
* Redesign the DPP8 GEMM kernel to use warp-wise component
* Review: Improve error messages
* Review: Remove unnecessary empty lines
* Review: Fix M, N per thread names
* Review: Rename mfma_input_type to dpp_input_type
* Review: Fix tensor adaptor; remove unnecessary element
* Review: Remove calls to dpp_gemm's MakeCDescriptor
* Review: Add blockwise doc, change function names to include dimension names
* Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file
* Review: Add __restrict__ keywords
* Review: Use MatrixPadder for padding A, B, C matrices
* Review: Remove hardcoded datatypes
* Review: Change names from FloatX to XDataType
* Review: Introduce AK0 and BK0 instead of a single K0
* Review: Remove construction of dpp_datatypes object
* Review: Rename DppInstrRunner to DppLanegroupGemm
* Add image to column kernel
* Add instances, tests, profiler, example
* Add client example
* Several fixes of image to column
* Fix variable name in device_image_to_column_impl
* Several fixes of image to column profiler
* Fix num_btype calculation
* Make new mesaurements for correct bytes calculation
* Add maxpool instances
* Rename index pool to max pool.
* Add maxpool bwd bf16 instances
* Add avg pool bwd instances
* Rename avgpool and maxpool to avg_pool3d and max_pool
* Add bf16 pool fwd instances
* Add max pool bwd to ckProfiler
* Add avg pool3d bwd to ckProfiler
* Add avg pool bwd test
* Fix bug of reference pool fwd (dilation)
* Fix bug of max pool bwd (dilation and initZero)
* Support bf16 compute data type
* Force compute type be f32. Because atomicAdd only support f32
* Add max pool bwd test
* Rename folder
* Rename pool
* Add max pool bwd client example
* Add avg pool bwd client example
* Add missing workspace
* clang format
* Rename macro
* remove useless header
* remove useless layout
* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large amount of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* clang format v12 to be consistent with miopen
* avoid clang-format for config file