* first change bias load
* add bias dim and scalervector parameter
* make CDE0BlockTransferSrcVectorDim not work
* changse toinstance
* add limit for CDE0BlockTransferSrcScalarPerVector
* allow building CK for specific data types
* add CI build and test stage on Naiv3x without some int8 instances
* add missing gemm fp16 instances
* add the changes to the missed cmake file
* add empty lines at end of source files
* Do not build quantization client example on navi3 in CI
* disable batched_gemm_multi_d_int8 instances with DTYPES
* disable device_conv2d_bwd_data_instance with DTYPES
* fix ckprofiler for conv_bwd_data for int8
* properly isolate the conv_bwd_data int8 instances
* remove empty line
* Move source file into sub-directories
* Add missing include directive
* Split DeviceGemmXdl<> fp16 instances
* Fix format
* Remove unnecessary CMakeLists.txt
* Add macros to toggle new features
* Remove debug message
* Turn off GEMM v2 pipeline optimization by default
* Fix format
* Extract duplicated string as list
* Enlarge indent in CMakeLists.txt
* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]
* Add CountDataType as template parameter in blockwise_welford
* Add utility/get_shift.hpp
* Add BatchNorm multiblock single-kernel implementation
* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a
* Renaming in device_batchnorm_forward_impl.hpp
* Tiny fix in the batchnorm_fwd profiler
* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"
This reverts commit d16d00919c.
* Use the old two-kernel batchnorm multiblock method for gfx1030
* Use the old two-kernel batchnorm multiblock method for gfx908
* use the single-kernel batchnorm multiblock method only for gfx90a
* Remove get_wave_id() from utility/get_id.hpp since it is not used
* Set true for testing running mean/variance and saving mean/invvariance in the examples
* Fix to copy-right words
* Remove un-needed including in utility/get_id.hpp
* Add comments to workgroup_synchronization.hpp
* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp
* Renaming in the kernels
* Remove un-used kernel file
* Add basic fp8 definitions and prn-generator
* Format
* Add fp8<->fp32 type_convert
* Format
* Split type_convert and cast_to/from_f8
* Format
* Minor fix
* Minor fix
* Move fp8 utils to a separate header
* Add elementwise ops
* Add fp8_convert_sr
* Format
* Add element op
* Eliminate magic numbers
* Split f8_convert_sr in host and device
* Format
* Add some constexpr
* Add a datatype test
* Format
* Another format
* Add fp8<->fp16 tests
* Update type_converts
* Format
* Add fp16 casting functions
* Format
* Use seed as a runtime arg
* Use element location for PRNG
* Format
* Add fp8<->fp16 to PassThrough element op
* Clean up
* Merge host and device implementations
* Add comments on rounding modes
* Remove leftover code
* Put type_converts into a separate header
* Put random number gen to a separate header
* Rearrange f8_utils' namespaces
* Refactor type_convert.hpp
* Move f8_t definition
* Add maxpool f32 kernel and example
* Revise copyright
* Add device pool bwd device op
* Support f16 and bf16
* Add compute datatype for reference code.
Prevent error in bf16
* Fix type error
* Remove layout
* Fix bf16 error
* Add f16 and bf16 example
* Add more operations
* Implement IsSupportedArgument
* Add changelog
* Add comment
* Add comment
* Remove useless header
* Move initialize of workspace to the run
* Move set din zero to the device operator
* Save din_length_raw
* Remove useless header
* Calculate gridsize according to the number of CU
* Calculate gridSize according to the number of CU.
Remove useless header
* Add put example
* Remove useless header
* Fix CI fail
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting
* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization
* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax
* Add testing of GetGenericInstance() in client_example of Softmax
* Revert "Add testing of GetGenericInstance() in client_example of Softmax"
This reverts commit f629cd9a93.
* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"
This reverts commit a9f0d000eb.
* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm
* Move generic kernel instance to separate tuple for elementwise op of normalization
* Remove un-used files for softmax instance
* Store generic kernel instance to separate tuple for softmax
* Add IsSupported checking for generic instance to client example of softmax
* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization
* clang-format fix
* Remove int8 from softmax instances
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* enable gfx941/942 targets
* fix clang format
* fix the cmake logic for multiple targets
* fix cmake syntax for looping over targets
* add gfx941/942 support for gemm_xdl instances
* Add generic instance gemm_add_add_fastgelu
* Add a client example for generic gemm_add_add_fastgelu
* Update CMakeLists
* Format
* Format
* Add generic instance gemm_add_fastgelu
* Format
* Add a gemm_add_fastgelu client example
* Format
* Add generic instance gemm_fastgelu
* Format
* Fix argument order
* Add gemm_fastgelu client example
* Add exceptions if argument is not supported
* Remove M/N/KPad local variables
* Use M/N/KPad to name padded lengths
* Replace duplicated local variable by parameters
* Rename variables M/N/KRaw to M/N/K
* Move AK0/BK0 compute logic into GridwiseGemm
* Use macro to shorten code
* Move CalculateGridSize() logic into GridwiseGemm
* Add comment to credit the implementation source
* Reuse the existing implementation
* Remove no-longer used data members
* Remove elementwise-op objects from interfaces
* Reserve kernel arg as whole object in interfaces
* Remove redundant data member
* Make 3rd type parameter optional
* Remove unnesscary type parameters
* Remove no-longer used descriptor-creation methods
* Move kernel arg type definition into GridwiseGemm
* Add macro to switch between code sections
* Move argument field computing logic into device op side
* Make utility method 'static'
* Declare special methods
* Unify MakeArgument() usage
* Adapt the new GridwiseGemm interface
* Push-down class 'GridwiseGemm::Argument' fields
* Remove no-longer used methods
* Add unused parameters
* Force copying parameters in 'Embed' ctor
* Remove no-longer used descriptors
* Fallback change on BaseArgument
* Remove macro 'INTEGER_DIVIDE_CEIL'
* Make variable naming more consistent
* Make sure methods are only invoked on right place
* Remove tailing underscore in public attribute name
* Remove necessary methods
* Hide computing logic of derived attributes
* Make new 'Embed' ctor only available for device code
* Make sure 'Embed' type args are not references
* Move check for karg.K into CheckValidity()
* Remove more integer division logic form device code
* Undo changes on Embed
* Separate 'Problem' concept out from 'Argument'
* Add overloaded version of __builtin_amdgcn_readfirstlane()
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigne char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to funcion definition
* Share same name for kernel interfaces
* Fix wrong boundar calculation logic
* Leave the third template arg for compatibility
* Remove unnecessary parameters
* Fix wrong error message (for type name)
* Create descriptor on device side
* Fix wrong debug message
* Remove no-longer used data members
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Reject unsupported argument
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
* Update function calls
* Reorder statements
* Re-format files
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add overloaded version of __builtin_amdgcn_readfirstlane()
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigne char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to funcion definition
* Fix wrong boundar calculation logic
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
* Reorder statements