* Add generic instance gemm_add_add_fastgelu
* Add a client example for generic gemm_add_add_fastgelu
* Update CMakeLists
* Format
* Format
* Add generic instance gemm_add_fastgelu
* Format
* Add a gemm_add_fastgelu client example
* Format
* Add generic instance gemm_fastgelu
* Format
* Fix argument order
* Add gemm_fastgelu client example
* Add exceptions if argument is not supported
* Remove M/N/KPad local variables
* Use M/N/KPad to name padded lengths
* Replace duplicated local variable by parameters
* Rename variables M/N/KRaw to M/N/K
* Move AK0/BK0 compute logic into GridwiseGemm
* Use macro to shorten code
* Move CalculateGridSize() logic into GridwiseGemm
* Add comment to credit the implementation source
* Reuse the existing implementation
* Remove no-longer used data members
* Remove elementwise-op objects from interfaces
* Reserve kernel arg as whole object in interfaces
* Remove redundant data member
* Make 3rd type parameter optional
* Remove unnecessary type parameters
* Remove no-longer used descriptor-creation methods
* Move kernel arg type definition into GridwiseGemm
* Add macro to switch between code sections
* Move argument field computing logic into device op side
* Make utility method 'static'
* Declare special methods
* Unify MakeArgument() usage
* Adapt the new GridwiseGemm interface
* Push-down class 'GridwiseGemm::Argument' fields
* Remove no-longer used methods
* Add unused parameters
* Force copying parameters in 'Embed' ctor
* Remove no-longer used descriptors
* Fallback change on BaseArgument
* Remove macro 'INTEGER_DIVIDE_CEIL'
* Make variable naming more consistent
* Make sure methods are only invoked in the right place
* Remove trailing underscore in public attribute name
* Remove unnecessary methods
* Hide computing logic of derived attributes
* Make new 'Embed' ctor only available for device code
* Make sure 'Embed' type args are not references
* Move check for karg.K into CheckValidity()
* Remove more integer division logic from device code
* Undo changes on Embed
* Separate 'Problem' concept out from 'Argument'
* Add overloaded version of __builtin_amdgcn_readfirstlane()
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigned char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to function definition
* Share same name for kernel interfaces
* Fix wrong boundary calculation logic
* Leave the third template arg for compatibility
* Remove unnecessary parameters
* Fix wrong error message (for type name)
* Create descriptor on device side
* Fix wrong debug message
* Remove no-longer used data members
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Reject unsupported argument
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlane.hpp to amd_wave_read_first_lane.hpp
* Update function calls
* Reorder statements
* Re-format files
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add overloaded version of __builtin_amdgcn_readfirstlane()
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigned char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to function definition
* Fix wrong boundary calculation logic
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlane.hpp to amd_wave_read_first_lane.hpp
* Reorder statements
* Remove M/N/KPad local variables
* Use M/N/KPad to name padded lengths
* Replace duplicated local variable by parameters
* Rename variables M/N/KRaw to M/N/K
* Move AK0/BK0 compute logic into GridwiseGemm
* Use macro to shorten code
* Move CalculateGridSize() logic into GridwiseGemm
* Add comment to credit the implementation source
* Reuse the existing implementation
* Remove no-longer used data members
* Remove elementwise-op objects from interfaces
* Reserve kernel arg as whole object in interfaces
* Remove redundant data member
* Make 3rd type parameter optional
* Remove unnecessary type parameters
* Remove no-longer used descriptor-creation methods
* Move kernel arg type definition into GridwiseGemm
* Add macro to switch between code sections
* Move argument field computing logic into device op side
* Make utility method 'static'
* Declare special methods
* Unify MakeArgument() usage
* Adapt the new GridwiseGemm interface
* Push-down class 'GridwiseGemm::Argument' fields
* Remove no-longer used methods
* Add unused parameters
* Force copying parameters in 'Embed' ctor
* Remove no-longer used descriptors
* Fallback change on BaseArgument
* Remove macro 'INTEGER_DIVIDE_CEIL'
* Make variable naming more consistent
* Make sure methods are only invoked in the right place
* Remove trailing underscore in public attribute name
* Remove unnecessary methods
* Hide computing logic of derived attributes
* Make new 'Embed' ctor only available for device code
* Make sure 'Embed' type args are not references
* Move check for karg.K into CheckValidity()
* Remove more integer division logic from device code
* Undo changes on Embed
* Separate 'Problem' concept out from 'Argument'
* Share same name for kernel interfaces
* Reject unsupported argument
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add license header.
* Reduce number of logged output. Add constant initialization.
* Add functional tests for grouped_gemm with different kbatch value.
* Add debug log information and remove unused code.
* Don't pass kbatch to CalculateKPadded.
* Turn on logging in grouped gemm and gemm splitk profiler
* Debug: limit number of test cases to run.
* Log more information and initialize with constant value.
* Turn on DEBUG_LOG
* Add more debug log information.
* Limit the number of instances to compile.
* Use GridwiseGemmPipeline
* Use KBatch to calculate K0
* Multiple DebugLog messages.
* Unit tests for multiple KBatch values.
* Refactoring
* Disable logging
* Extract KBatch update out of the if statement.
* Uncomment instances.
* Disable DebugLog.
* Use KBatch when calculating KPadded.
* Fix CGridDesc padding.
* Use available helper functions.
* Uncomment code commented out for debugging.
* Remove unnecessary debug log messages.
* Uncomment code previously commented out for debugging purposes.
* Add KBatch info to profiler output summary log.
* Add gtests for gemm splitk using ckProfiler API.
* Add more test-cases for different data layout.
* Add more test cases for gemm splitk
* Remove old test.
* Unit tests for MKNK ggemm interface.
* Fix and add more unit-tests.
* Constexpr everything!
* Increase error threshold for fp16 and splitk.
Since we're using fp16 atomic add for splitk there's a
known precision loss.
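
The precision loss mentioned above can be illustrated on the host with a minimal sketch. It assumes each split-K batch adds its partial result to the output one at a time and that the output is rounded to fp16 precision after every add. `round_to_fp16_precision`, `accumulate_in_fp16` and `accumulate_in_fp32` are hypothetical helpers written for this illustration, not CK functions, and the rounding helper ignores subnormals and overflow.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of CK): round a float to the nearest value
// that fits an 11-bit significand, which is the precision of fp16.
// Assumption: subnormals and overflow to infinity are ignored.
inline float round_to_fp16_precision(float x)
{
    if(x == 0.0f || !std::isfinite(x))
        return x;
    int exponent = 0;
    // x == mantissa * 2^exponent with |mantissa| in [0.5, 1)
    const float mantissa  = std::frexp(x, &exponent);
    const float quantized = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
    return std::ldexp(quantized, exponent);
}

// Models split-K accumulation where the output is rounded to fp16 precision
// after every add (one add per K batch).
inline float accumulate_in_fp16(const std::vector<float>& partials)
{
    float c = 0.0f;
    for(float p : partials)
        c = round_to_fp16_precision(c + round_to_fp16_precision(p));
    return c;
}

// Reference: accumulate in fp32 and round to fp16 precision once at the end.
inline float accumulate_in_fp32(const std::vector<float>& partials)
{
    float acc = 0.0f;
    for(float p : partials)
        acc += p;
    return round_to_fp16_precision(acc);
}
```

With the partial sums {2048, 1, 1, 1, 1}, the fp16-style accumulation stays at 2048 because each +1 falls on a rounding tie at a spacing of 2, while the fp32 reference gives 2052. A wider error threshold for fp16 split-K tests accounts for this kind of drift.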
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Expand the base class of pool2d, prepare to share base class with pool3d
* Add pool3d device op
* Add pool3d f16 example
* Refactor the base class to allow implementing generic pooling in the future
* clang format
* get original index in max pooling
* Add outputindex to base class
* Fix dimension
* Add pooling instance
* Use indexType instead
* Remove useless header
* Extract IndexDataType to template
* Extract pooling reference code
* clang format
* clang format
* Fix typo
* Add tensor stride
* Add missing header
* Add index stride and output stride
* Refine naming
* Add type to base class
* Rename file
* Use proper size
* Fix typo
* Refine naming
* Modify the argument into vector.
* Add max pool profiler
* Refine naming
* Support f32 pool
* Fix typo
* Add avg pool2d fwd in profiler
* clang format
* Rename AccDatatype to ComputeDatatype
* Fix init
* test pool
* Extract variable
* Add client example
* Check the pooling dim
* clang format
* Connect argv and arg_parser
* Add found check
* Remove useless header
* Refine naming
* Adjust the order of device_pool_fwd
* enable dl kernels on navi3
* do not build xdl tests and examples on Navi
* run tests before building everything on jenkins
* disable gemm_bilinear on gfx1030
* add gpu targets to installer on Navi
* put tests in the same order as before
* reduce the number of navi targets in CI
* build CI installed for gfx940 as well
* only build for MI300 during QA runs
* update documentation dependencies
* add version number to docs
* rename doc config directories
* enable more doc formats on rtd
* add license section in docs
* Add contraction profiler and tests
* Build and style fixes
* Allow to use any elementwise operator for ref_contraction
* Introduce profile_contraction_scale and profile_contraction_bilinear
* Make ref_contraction generic and extend interface tests
* Stylistic minor fixes
* Extend test_contraction_interface
* Add TypeConvert class and start refactoring
* Refactor TypeConvert as a struct
* Get back to template functions type_convert
* Add a type_convert_bf16_rtn, set rtz as default
* Clean up
* Add UnaryConvertPrecision struct for high-precision workloads
* Format
* Update type_convert to UnaryConvert on threadwise level
* Update UnaryConvertPrecision
* Format
* Fix chmod
* Add a flag to pick conversion method
* Format
* Remove the added flag
* Merge elementwise op with type conversion
* Move type_convert to elemwise op, update the op
* Update type_convert_precision -> bf16_convert_rtn
* Clean up
* Update comments
* Update the CK_WORKAROUND_DENORM_FIX flag handling
* Update the unneeded op to work but warn user
* Remove the message
* Use a PassThrough instead of ConvertBF16RTN to calculate reference
* Format
* Add missing include
* replace amd_buffer_atomic_add with hip_atomic_add
* fix grouped_gemm_splitk kernels on mi300
* fix syntax
* revert experimental atomic_add changes
* fix the group of kernels from ticket 723 on MI300
---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
Completes the incomplete fix from https://github.com/ROCmSoftwarePlatform/composable_kernel/pull/670
The warning occurs not only in gtest but also in CK code.
These need to be fixed as a quality improvement, but for now this warning is suppressed in immediate releases:
http://compiler-ci.amd.com/blue/rest/organizations/jenkins/pipelines/compiler-psdb-amd-stg-open/runs/2540/nodes/282/steps/3202/log/?start=0
For example:
```
[2023-04-26T17:26:31.524Z] /jenkins/workspace/compiler-psdb-amd-stg-open/Libs/MIOpen/deps_hip/cget/build/tmp-a3db5da587a64213bde99fb856db1b43/composable_kernel-0f98035df1cc5ba3e90ab03187e672b426a25b00/include/ck/utility/generic_memory_space_atomic.hpp:52:19: error: unsafe pointer arithmetic [-Werror,-Wunsafe-buffer-usage]
[2023-04-26T17:26:31.524Z] atomicAdd(c_style_pointer_cast<float*>(p_dst) + 1, vx.template AsType<float>()[I1]);
[2023-04-26T17:26:31.524Z] ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
```
[2023-04-26T17:26:31.523Z] /jenkins/workspace/compiler-psdb-amd-stg-open/Libs/MIOpen/deps_hip/cget/build/tmp-a3db5da587a64213bde99fb856db1b43/composable_kernel-0f98035df1cc5ba3e90ab03187e672b426a25b00/include/ck/utility/amd_inline_asm.hpp:62:20: error: 'p_a_half2' is an unsafe pointer used for buffer access [-Werror,-Wunsafe-buffer-usage]
[2023-04-26T17:26:31.523Z] const half2_t* p_a_half2 = c_style_pointer_cast<const half2_t*>(&a);
[2023-04-26T17:26:31.523Z] ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
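
As an illustration of what suppressing this diagnostic can look like, here is a minimal sketch using clang diagnostic pragmas around a function that performs the same kind of pointer arithmetic. This message does not state whether the actual change used pragmas or a build-wide `-Wno-unsafe-buffer-usage` flag, so this is only one possible approach; `sum_pair` is a hypothetical function, not CK code.

```cpp
#if defined(__clang__)
#pragma clang diagnostic push
// Older clang versions do not know -Wunsafe-buffer-usage; keep them quiet too.
#pragma clang diagnostic ignored "-Wunknown-warning-option"
#pragma clang diagnostic ignored "-Wunsafe-buffer-usage"
#endif

// Hypothetical example function: reads two adjacent floats through raw
// pointer arithmetic, the pattern flagged in the logs above.
inline float sum_pair(const float* p)
{
    return *p + *(p + 1);
}

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```

The push/pop pair limits the suppression to the enclosed code, so the warning stays active elsewhere until the flagged sites are fixed.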
* [What] Remove pure conv int8 instance
[Why] We will never use pure int8 conv in AI, use int8 quantization instead
* Change layout
* Share the kernel parameter
* Support more types of NHWGC for group conv
* Revise client example of conv 2d, use NHWGC layout
* Add instance to cmake
* Revise layout of group conv quantization instance
* Revise layout of external api of group conv quantization
* Revise layout of group conv quantization client example
* Fix clang format
* Add comment to describe meaning of each parameter
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try to fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp