* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 in test name
* resolve conflicts for dpp instances
* Redesign the DPP8 GEMM kernel to use warp-wise component
* Review: Improve error messages
* Review: Remove unnecessary empty lines
* Review: Fix M, N per thread names
* Review: Rename mfma_input_type to dpp_input_type
* Review: Fix tensor adaptor; remove unnecessary element
* Review: Remove calls to dpp_gemm's MakeCDescriptor
* Review: Add blockwise doc, change function names to include dimension names
* Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file
* Review: Add __restrict__ keywords
* Review: Use MatrixPadder for padding A, B, C matrices
* Review: Remove hardcoded datatypes
* Review: Change names from FloatX to XDataType
* Review: Introduce AK0 and BK0 instead of a single K0
* Review: Remove construction of dpp_datatypes object
* Review: Rename DppInstrRunner to DppLanegroupGemm
* Add image to column kernel
* Add instances, tests, profiler, example
* Add client example
* Several fixes of image to column
* Fix variable name in device_image_to_column_impl
* Several fixes of image to column profiler
* Fix num_btype calculation
* Make new measurements for correct bytes calculation
* Add maxpool instances
* Rename index pool to max pool.
* Add maxpool bwd bf16 instances
* Add avg pool bwd instances
* Rename avgpool and maxpool to avg_pool3d and max_pool
* Add bf16 pool fwd instances
* Add max pool bwd to ckProfiler
* Add avg pool3d bwd to ckProfiler
* Add avg pool bwd test
* Fix bug of reference pool fwd (dilation)
* Fix bug of max pool bwd (dilation and initZero)
* Support bf16 compute data type
* Force compute type to be f32, because atomicAdd only supports f32 (see the sketch below)
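A minimal HIP sketch of the constraint behind this commit (kernel and names are hypothetical, not CK's actual device op): the backward pooling pass scatters gradients with atomicAdd, and the global-memory atomicAdd used here is the hardware-backed float overload, so accumulation stays in fp32 even for fp16/bf16 tensors.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel: scatter output gradients back to the input positions
// recorded by the forward max pool. din is fp32 because the global-memory
// atomicAdd overload used here is the float one.
template <typename OutGradType>
__global__ void max_pool_bwd_scatter(const OutGradType* dout,
                                     const int* indices,
                                     float* din,
                                     int num_elems)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < num_elems)
    {
        // Accumulate in fp32; a separate elementwise pass can convert the
        // result back to fp16/bf16 if the output tensor is narrow.
        atomicAdd(&din[indices[i]], static_cast<float>(dout[i]));
    }
}
```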
* Add max pool bwd test
* Rename folder
* Rename pool
* Add max pool bwd client example
* Add avg pool bwd client example
* Add missing workspace
* clang format
* Rename macro
* remove useless header
* remove useless layout
* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large number of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* use clang-format v12 to be consistent with MIOpen
* avoid clang-format for config file
* Do not hardcode stride
* Make devicePool2DFwd inherit from devicePool3DFwd
* Move instance declaration out of common
* Add dilation
* use the pool3d rank, because pool2d inherits pool3d
* calculate Do Ho Wo for the dilation (see the sketch below)
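A minimal sketch of the output-size arithmetic (helper name is illustrative): with dilation the window's effective extent grows to dilation * (window - 1) + 1, and Do/Ho/Wo follow the usual convolution formula.

```cpp
#include <cstdint>

// Illustrative helper: length of one output dimension (Do, Ho or Wo).
inline std::int64_t pool_out_len(std::int64_t in_len,   // Di / Hi / Wi
                                 std::int64_t window,   // pooling window length
                                 std::int64_t stride,
                                 std::int64_t dilation,
                                 std::int64_t pad_left,
                                 std::int64_t pad_right)
{
    // With dilation the window spans dilation * (window - 1) + 1 elements.
    const std::int64_t eff_window = dilation * (window - 1) + 1;
    return (in_len + pad_left + pad_right - eff_window) / stride + 1;
}
```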
* Fix header name
* Modify ckProfiler
* Remove pool2d instance
* Remove pool2d in profiler
* Remove pool2d and add dilation
* In the client example, this commit revises the following (see the sketch after the list):
1. Add dilation.
2. Use pool3d to implement pool2d
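A hedged sketch of the pool2d-via-pool3d idea (struct and helper are hypothetical, not CK's actual API): the 2D parameters get a degenerate depth dimension prepended, so a single 3D device op serves both ranks.

```cpp
#include <array>
#include <cstdint>

// Hypothetical parameter pack for the 3D device op.
struct Pool3dParams
{
    std::array<std::int64_t, 3> window, stride, dilation, pad_left, pad_right;
};

// Expand 2D pooling parameters with a degenerate depth dimension
// (window 1, stride 1, dilation 1, pad 0) so the pool3d kernel can run them.
inline Pool3dParams expand_pool2d(const std::array<std::int64_t, 2>& window,
                                  const std::array<std::int64_t, 2>& stride,
                                  const std::array<std::int64_t, 2>& dilation,
                                  const std::array<std::int64_t, 2>& pad_l,
                                  const std::array<std::int64_t, 2>& pad_r)
{
    return Pool3dParams{{1, window[0], window[1]},
                        {1, stride[0], stride[1]},
                        {1, dilation[0], dilation[1]},
                        {0, pad_l[0], pad_l[1]},
                        {0, pad_r[0], pad_r[1]}};
}
```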
* Refine naming and IsSupportedArgument()
* Add dilation to maxpool bwd example
* clang format
* 1. Remove useless header
2. Fix copyright
3. Refine naming
* Add layout parameter to pool fwd
* clang format
* Fix merge error
* Fix compile error
* Remove layout parameter in derived class
* Refine changelog
* Fix compile error
* Fix compiler error
* Add layout to external api and profiler
* Add avgpool bwd reference code
* Refine naming
* Fix invalid in_element op in ref_conv
* Add example (only reference now)
* Add the full example of avgpool bwd
* Fix copyright
* Imitate MakeDescriptor from transform_conv_bwd_data_to_gemm_v1.hpp
* rename channel from k to c
* Arrange the code
* Imitate the argument from conv bwd
* Implement invoker
* Fix order of parameter in example
* Refactor reference code for different dimension
* Support different stride
* Check if argument is valid
* Fix kernel parameter for NDHWC, fastest dimension C is not reduced
* Add more data type in example
* Fix bug in example
* calculate Do Ho Wo according to the dilation
* Remove useless header
* Add comment in reference code
* Add layout parameter
* Remove layout in derived class
* Refine reference comment
* Enable grouped conv with small K or C
* Add missing instances
* Refactor grouped conv fwd instances
* Fix fp16 instances, since fp16 only supports src_per_vec % 2 == 0
* Add generic instances
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softmax_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
* initial stream-k implementation with example
* fix unexpected change in err
* improve performance a little by reorganizing the pipeline
* improve perf a little by swizzling the block idx (see the sketch below)
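An illustrative sketch of the block-idx swizzle (not necessarily CK's exact mapping): the linear workgroup id is remapped so consecutive workgroups cover a GroupSizeM-wide strip of C tiles, which improves L2 reuse of the A and B tiles they share.

```cpp
#include <cstdint>

// Illustrative remap from a linear block id to (m, n) tile coordinates.
// Consecutive block ids walk a GroupSizeM-tall strip of tiles before
// advancing in N, so neighbouring workgroups share A/B tiles in L2.
template <std::uint32_t GroupSizeM>
inline void swizzle_block_idx(std::uint32_t block_id,
                              std::uint32_t grid_m,
                              std::uint32_t grid_n,
                              std::uint32_t& m_idx,
                              std::uint32_t& n_idx)
{
    const std::uint32_t width   = GroupSizeM * grid_n;   // blocks per group
    const std::uint32_t group   = block_id / width;
    const std::uint32_t first_m = group * GroupSizeM;
    // The last group may be shorter than GroupSizeM.
    const std::uint32_t size_m =
        (grid_m - first_m) < GroupSizeM ? (grid_m - first_m) : GroupSizeM;
    const std::uint32_t local = block_id % width;
    m_idx = first_m + local % size_m; // walk M within the group
    n_idx = local / size_m;           // then advance N
}
```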
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer memory coherence via glc/slc bits from a template parameter
* control memory coherence while constructing the dynamic buffer
* update reduction for streamk (not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
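A hedged sketch of the coherence template parameter (enum and member names are illustrative; CK's actual identifiers may differ): the glc/slc bits of AMD buffer loads/stores are fixed at compile time from a template argument instead of being hardcoded in the buffer.

```cpp
// Illustrative enum; CK's actual names may differ.
enum struct AmdBufferCoherence
{
    DefaultCoherence = 0, // normal cached accesses
    GLC              = 1, // bypass L0/L1 (globally coherent)
    SLC              = 2, // bypass L2 (system-level coherent)
    GLC_SLC          = 3,
};

// The coherence setting travels as a compile-time template argument, so the
// buffer_load/buffer_store intrinsics can bake the glc/slc bits into the
// instruction encoding with no runtime branch.
template <typename T,
          AmdBufferCoherence Coherence = AmdBufferCoherence::DefaultCoherence>
struct DynamicBuffer
{
    T* p_data_;
    // load()/store() would dispatch on `Coherence` at compile time.
};
```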
* fix build issue
* fix several bugs
* now the result is correct and everything works (but uses scratch)
* remove scratch by manually resetting the coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum to use camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw an exception on an empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* allow building CK for specific data types
* add CI build and test stage on Navi3x without some int8 instances
* add missing gemm fp16 instances
* add the changes to the missed cmake file
* add empty lines at end of source files
* Do not build quantization client example on navi3 in CI
* disable batched_gemm_multi_d_int8 instances with DTYPES
* disable device_conv2d_bwd_data_instance with DTYPES
* fix ckprofiler for conv_bwd_data for int8
* properly isolate the conv_bwd_data int8 instances
* remove empty line
* Add basic fp8 definitions and PRN generator
* Format
* Add fp8<->fp32 type_convert
* Format
* Split type_convert and cast_to/from_f8
* Format
* Minor fix
* Minor fix
* Move fp8 utils to a separate header
* Add elementwise ops
* Add fp8_convert_sr
* Format
* Add element op
* Eliminate magic numbers
* Split f8_convert_sr into host and device versions
* Format
* Add some constexpr
* Add a datatype test
* Format
* Another format
* Add fp8<->fp16 tests
* Update type_converts
* Format
* Add fp16 casting functions
* Format
* Use seed as a runtime arg
* Use element location for PRNG
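A minimal sketch of location-based random bits for stochastic rounding (the hash is illustrative): mixing the runtime seed with the element's linear index makes the conversion reproducible without carrying per-thread PRNG state.

```cpp
#include <cstdint>

// Illustrative integer hash; any cheap mixer works here.
inline std::uint32_t prand_from_location(std::uint32_t seed,
                                         std::uint32_t element_idx)
{
    std::uint32_t x = seed ^ element_idx;
    x ^= x >> 17;
    x *= 0xed5ad4bbu;
    x ^= x >> 11;
    x *= 0xac4c1b51u;
    x ^= x >> 15;
    return x;
}
// The stochastic-rounding conversion would add these bits below the f8
// mantissa before truncating, so the same (seed, index) pair always rounds
// the same way.
```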
* Format
* Add fp8<->fp16 to PassThrough element op
* Clean up
* Merge host and device implementations
* Add comments on rounding modes
* Remove leftover code
* Put type_converts into a separate header
* Put random number gen to a separate header
* Rearrange f8_utils' namespaces
* Refactor type_convert.hpp
* Move f8_t definition
* Add maxpool f32 kernel and example
* Revise copyright
* Add device pool bwd device op
* Support f16 and bf16
* Add compute datatype for reference code to prevent errors in bf16
* Fix type error
* Remove layout
* Fix bf16 error
* Add f16 and bf16 example
* Add more operations
* Implement IsSupportedArgument
* Add changelog
* Add comment
* Add comment
* Remove useless header
* Move initialize of workspace to the run
* Move setting din to zero into the device operator
* Save din_length_raw
* Remove useless header
* Calculate gridsize according to the number of CU
* Calculate gridSize according to the number of CU.
Remove useless header
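A minimal sketch of CU-based grid sizing (the occupancy factor is an assumption): query the device's compute-unit count and launch a fixed-size grid, letting each workgroup loop over tiles persistent-kernel style.

```cpp
#include <hip/hip_runtime.h>

// Derive the grid size from the CU count instead of the problem size;
// the kernel then loops each workgroup over as many tiles as needed.
inline int grid_size_from_cu_count(int blocks_per_cu = 1)
{
    hipDeviceProp_t prop{};
    hipGetDeviceProperties(&prop, 0 /* device id */);
    return prop.multiProcessorCount * blocks_per_cu;
}
```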
* Add put example
* Remove useless header
* Fix CI fail
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting
* Move the generic kernel instance to be the first in the instance list for the elementwise op of normalization
* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax
* Add testing of GetGenericInstance() in client_example of Softmax
* Revert "Add testing of GetGenericInstance() in client_example of Softmax"
This reverts commit f629cd9a93.
* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"
This reverts commit a9f0d000eb.
* Make the generic kernel instance the first instance returned by GetInstances() for GroupNorm
* Move generic kernel instance to separate tuple for elementwise op of normalization
* Remove un-used files for softmax instance
* Store generic kernel instance to separate tuple for softmax
* Add IsSupported checking for generic instance to client example of softmax
* Replace get_device_normalize_from_mean_meansquare_instances() with the DeviceOperationInstanceFactory class for elementwise-normalization
* clang-format fix
* Remove int8 from softmax instances
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Expand the base class of pool2d, preparing to share it with pool3d
* Add pool3d device op
* Add pool3d f16 example
* Refactor the base class; generic pooling will be implemented in the future
* clang format
* get original index in max pooling
* Add outputindex to base class
* Fix dimension
* Add pooling instance
* Use indexType instead
* Remove useless header
* Extract IndexDataType to template
* Extract pooling reference code
* clang format
* clang format
* Fix typo
* Add tensor stride
* Add missing header
* Add index stride and output stride
* Refine naming
* Add type to base class
* Rename file
* Use proper size
* Fix typo
* Refine naming
* Modify the argument into a vector.
* Add max pool profiler
* Refine naming
* Support f32 pool
* Fix typo
* Add avg pool2d fwd in profiler
* clang format
* Rename AccDatatype to ComputeDatatype
* Fix init
* test pool
* Extract variable
* Add client example
* Check the pooling dim
* clang format
* Connect argv and arg_parser
* Add found check
* Remove useless header
* Refine naming
* Adjust the order of device_pool_fwd
* Add contraction profiler and tests
* Build and style fixes
* Allow using any elementwise operator for ref_contraction
* Introduce profile_contraction_scale and profile_contraction_bilinear
* Make ref_contraction generic and extend interface tests
* Stylistic minor fixes
* Extend test_contraction_interface
* Add TypeConvert class and start refactoring
* Refactor TypeConvert as a struct
* Get back to template functions type_convert
* Add a type_convert_bf16_rtn, set rtz as default
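A hedged sketch of the two fp32 -> bf16 roundings (bit-level only; NaN handling omitted for brevity): rtz truncates the low 16 bits, while rtn adds a round-to-nearest-even bias before truncating.

```cpp
#include <cstdint>
#include <cstring>

// Round toward zero: drop the low 16 mantissa bits.
inline std::uint16_t bf16_convert_rtz(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return static_cast<std::uint16_t>(u >> 16);
}

// Round to nearest even: add a bias derived from the bits being discarded.
inline std::uint16_t bf16_convert_rtn(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    const std::uint32_t bias = 0x7fffu + ((u >> 16) & 1u); // ties go to even
    return static_cast<std::uint16_t>((u + bias) >> 16);
}
```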
* Clean up
* Add UnaryConvertPrecision struct for high-precision workloads
* Format
* Update type_convert to UnaryConvert on threadwise level
* Update UnaryConvertPrecision
* Format
* Fix chmod
* Add a flag to pick the conversion method
* Format
* Remove the added flag
* Merge elementwise op with type conversion
* Move type_convert to elemwise op, update the op
* Update type_convert_precision -> bf16_convert_rtn
* Clean up
* Update comments
* Update the CK_WORKAROUND_DENORM_FIX flag handling
* Update the unneeded op to work but warn the user
* Remove the message
* Use a PassThrough instead of ConvertBF16RTN to calculate the reference
* Format
* Add missing include
* [What] Remove pure conv int8 instance
[Why] We will never use pure int8 conv in AI; use int8 quantization instead
* Change layout
* Share the kernel parameter
* Support more types of NHWGC for group conv
* Revise client example of conv 2d, use NHWGC layout
* Add instance to cmake
* Revise layout of group conv quantization instance
* Revise layout of external api of group conv quantization
* Revise layout of group conv quantization client example
* Fix clang format
* Add comment to describe meaning of each parameter
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp