* Add int4 example for convnd_fwd_bias_relu_add
* Fix AddReluAdd for building without int4 support
* Update CMakeLists.txt
* Format
* Convert int4 tensors for int8 kernel
* Fix device memory allocation
* Format
* Format
* Add GEMM examples for int4
Currently the source files are just copied from int8 examples
* Re-use pre-defined alias in int4 examples
* Distinguish user-side type from kernel-side type
* Add int4_t support for check_err()
* Allow conversion between Tensor<> specializations
* Re-format source files
* Use different type for host tensors
* Re-use CopyAsType<>() to implement copy ctor
* Re-use element-wise operation type alias
* Fix typo in alias names
* Complete the int4 examples
* Add constraint to Tensor<> templated methods
* Add type traits 'is_signed_integral<>'
* Add type constraints for integer version check_err<>()
* Allow comparing different-sized integral types in check_err()
* Check converted Tensor<int4_t> with golden Tensor<int8_t>
* Remove constraint of Tensor<>::CopyAsType()
* Avoid compilation error when ck::int4_t support is disabled
* Remove debug messages
* Add #error directive to prevent compiling sources with the wrong setting
* Simplify tensor usages in examples
* Add constraint to check_err() input reference type
* Align design with other PR
* Use ""_uz to simplify example code
* Avoid too much generalizing check_err()
* Re-format GEMM instance template arguments
* Extract common code of int4 examples
* Sort include directives
* Move #include directives into new header
* Move common code together
* Re-format template argument in example code
* Reuse same implementation code for most of GEMM examples
* Re-format common.hpp
* Unify structured comment in examples
* Use reinterpret_cast<>() for cross-type pointer conversion
* Revert "Add type traits 'is_signed_integral<>'"
This reverts commit f2c148efae.
* Allow unsigned integer arguments for check_err()
* Fix compilation error in check_err()
* Remove unnecessary copy ctor for Tensor<>
* Mark Tensor<> special member functions as 'default'
* Use more strict condition to add code in examples
* Fix wrong program return value of GEMM examples
* Handle the case where the user specifies all the strides
* Fix never-ran examples
* Exit successfully if GEMM instance does not support given problem
* Add missing 'else' keyword
* Re-format CMakeLists.txt
* Add wrapper function to hide value conversion while copying memory
* Add new DeviceMem API to copy memory
* Use new DeviceMem API to implement examples
* Revert "Add new DeviceMem API to copy memory"
This reverts commit 3f190b0779.
* Add conversion ctor for Tensor<>
* Write Tensor<> conversion logic explicitly in example code
* Convert Tensor<> values after transferring data to host
* comment on specialization for TensorSpecialization::Packed
* gemm_softmax_gemm with output permutation
* scaling
* refactor MatrixPadder; rename to GemmPadder
* remove old sanity check
* restore original gemm_softmax_gemm
* revise comment in gemm_softmax_gemm example
* use GetElementSpaceSize()
* remove extra header
* typo
* remove archaic DeviceOpPtr
* add examples into grouped/batched_gemm
* adding splitK examples
* fixed splitK
* add bfp16 int8 example into splitK
* formatting
* use static_cast
* added common for batched_gemm
* add commons for examples of splitK/batched/grouped_gemm
* return true
* adjust splitK check tol
* update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* Add custom target to bundle examples together
* Add int4 example conditionally (just copy from int8 example)
* Extract common code into common.hpp
* Move ref gemm type alias into data-type-specific sources
* Add #error directive to prevent compiling with the wrong setting
* Let AddAddFastGelu support int4 parameter type
* Let check_err() support int4 parameter type
* Add wrapper function to hide value conversion while copying memory
* Finish int4 example for GEMM + AddAddFastGelu
* Add new DeviceMem API to copy memory
* Use new DeviceMem API to implement examples
* Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
* Revert "Add new DeviceMem API to copy memory"
This reverts commit e26e7af71e.
* Add conversion ctor for Tensor<>
* Add 'const' specifier to Tensor<>::CopyAsType()
* Convert Tensor<> values before/after transfer between host & device
* GemmPadder and GemmGemmPadder
* proper padding using GemmGemmPadder
* test gemm_gemm padding
* properly check size K in IsSupportedArgument()
* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
* comment
* format
* Introduce int4 data type.
* Add unit-tests for int4
* Compile int4 UT only when int4 enabled.
* clang-format
Co-authored-by: Adam Osewski <aosewski@amd.com>
* Implement multiple-reduction in one kernel (kernels, device ops, examples)
* Add generic elementwise kernel and device interface
* Add generator for normal-distributed data initialization
* Add host reference implementation of batchnorm-forward and batchnorm-infer
* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
* Remove unneeded include in batchnorm example
* Renaming generic_elementwise to elementwise in kernel and device classes/functions
* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
* Change in example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
* Removing and renaming files
* Update to synchronize gemm_layernorm client example to the generic element-wise device op API
* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
* Merge two static member functions in device_elementwise.hpp
* Remove unary_elementwise_1d kernel and device
* Change all device operations to use add_instance_library to avoid duplicated cmake configuration.
* update DeviceMem
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Add threadwise and blockwise welford
* Rename gridwise op, prepare to add welford version
* implement welford and integrate welford into layernorm
* Take care of tail loop
* Fix bug when ThreadSliceK > 1
* Fix bug when merging two empty sets
* Rename clip to clamp
* 1. Fix type of count
2. Remove useless static_assert
* Do not inherit Reduction::Argument
* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait on both lgkmcnt(0) and vmcnt(0)
* Add y stride
* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm
* Move literal ""_uz & ""_zu into namespace 'literals'
* Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring a unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* add gemm_gemm instances and tests
* avoid LDS data hazard
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Update the reduce_blockwise example to support user-specified data type and input+reducing dimensions
* Add examples for using reduce_multiblock_atomic_add
* Add more running examples to the default command-line
* Remove unnecessary header include
* Update the example README.md
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring a unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* attention host validation
* add blockwise softmax v1
* iteratively update softmax+gemm
* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
* add init method for easier debugging
* do away with manual thread cluster calculation
* generalize blockwise softmax interface
* row-wise softmax sum & max
* format
* rename to DeviceBatchedGemmSoftmaxGemm
* add gemm_softmax_gemm instances and tests
* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* [LWPCK-359] Initial commit
* Working version for fp16, add results to readme
* Update according to PR #341
* Update results in readme
* Add fp32 example
* Add bf16 example
* Update fp16 and fp32 examples
* Add int8 example
* Add separate lengths and strides tensors for D tensors
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* build docker in separate stage
* build docker with only one prefix
* add parallel statement
* add docker repo url
* fix the name of perf_conv_bwd_data log file
* Add always_false<> util to delay symbol resolution
* Use always_false<> to prevent instantiating an unwanted method
* Add new specializations of AddAddFastGelu::operator() method
* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
* Use floating point literal to simplify code
* Remove unnecessary capture in lambda expressions
* Extract fast GeLU calculation as standalone method
* Mark methods as 'constexpr'
* Add constraint for HostTensorDescriptor templated ctors
* Simplify HostTensorDescriptor ctor calls
* Add C++23 std::size_t literal suffix
* Use _uz suffix to shorten example code
* Remove unnecessary conversion to std::array<>
* Re-order include directives
* Remove C-style casting by literal suffix
* Remove unnecessary statements in main()
* Remove unused type parameter of always_false<>
* Remove unused include directive
* Exit main() by returning meaningful value
* Use 'if constexpr' to switch example flow
* Use std::is_same_v<> to shorten example code
* Add 'inline' specifier to literal functions
* Unify output methods in example
* Move common code into .inc file
* Add type check in type_convert<>()
* Add type_convert<float>() before computation
* Merge AddAddFastGelu method specializations
* Remove always_false<>
* Add constraint to AddAddFastGelu::operator() parameter types
* allow selecting compiler version
* fix typo
* add Wno-deprecated flag for google tests
* change git repo, fix qa log files names
* change the git clone syntax
* use Omkar's git credentials
* try to use jenkins as git user
* try using illsilin username for gerrit repo with ssh key
* try new gerrit authorization
* change ssh key syntax
* try another way of passing ssh key to docker
* add mount ssh in dockerfile
* create .ssh folder
* move ssh-keyscan to later
* get rid of npm call
* build first docker image on master
* check the contents of the .ssh folder
* try replacing omkars creds with gerrit creds
* use open repo, clean up changes
* get rid of ssh default argument
* Add int8 specialization for elementwise Add and Subtract.
* CGEMM examples bf16, fp32, int8
* Add convert reference output to CDataType.
* Skip BF16 data type during testing.
* Lower K value to get rid of accumulation error.
* Fix merge artifact.
* Fix changed function name: GetElementSpaceSize()
* Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* turn on full qa only on gfx90a, use int initialization
* change script syntax
* update script parsing clinfo, throw exception if 0 devices
* fix syntax
* try using toBoolean for the QA conditions
* run regular CI on MI100 only, use MI200 only for daily QA
* evaluate when conditions before agent
* launch QA on develop branch and update profile_reduce script
* update test script
* update script
* remove false dependency from dockerfile
* try removing rbuild completely
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* add verify flag and update scripts
* replace old check_error function with the new check_err
* fix syntax
* remove blank spaces
* remove empty line
* add check_err for tensors
* fix syntax
* replace tensors with vectors in check_err calls
* fix syntax
* remove blank spaces
* fix syntax
* add new line at end of file
* disable conv2d_bwd_weight test, add gpu check
* set check_gpu using export
* check GPU using runShell
* add definition of runShell
* fix script syntax
* reduce the number of threads, add full qa option
* run processing scripts in bash
* fix the branch and host names in performance scripts, add chronos
* replace parameterizedCron with cron
* archive the perf log files
* try to fix git call
* pass branch and host names as arguments into scripts
* fix script arguments
* fix script arguments
* process results on master
* fix pipeline
* add definition of gpu_arch
* run processing scripts in docker
* fix the brackets
* add agent master for the processing stage
* get rid of show_node_info call on master
* try using mici label instead of master, disable MI100 tests for now
* fix syntax
* simplify container for results processing
* remove node(master) from the process_results stage
* put all stages in original order
* change the agent label from master to mici for gfx908
* Implement layernorm kernel and deviceOp
* verify gpu kernel with host code
* 1. Separate gamma and beta from affine
2. Check if argument is valid
* clean
* Sync the naming
* Support sweep once mode if we can put k dimension data inside one block
* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.
* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.
* Use 1d descriptor for gamma and beta
* Add accElementwiseOp
* Extract layernorm host code
* Support different YVectorDim in GridwiseLayernorm
* Rename XSrcVectorDim to XYSrcVectorDim because we use the same parameter in deviceOp
* Gamma and beta can share the VGPR.
* Add test for fp32 and fp16
* Fix concurrency bug and add test case which may fail originally
* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* adding scripts for full perf test suite
* uncomment the sql queries
* fix typo and chmod a+x for scripts
* dos2unix for all new scripts
* disable verification in full performance test
* fix reduction scripts, add grouped_gemm hotfix
* fix the grouped_gemm hotfix and only run reduction for fp16
* change compiler flag syntax
* fix syntax
* add predefinition of dockerArgs
* avoid redefinitions of dockerArgs
* add blank space at the end of dockerArgs
* try to build with release compiler
* adding spaces inside if condition
* limit the number of threads for building 9110 compiler
* change the way HIP_CLANG_PATH is set
* remove the export command
* change the conditional ENV syntax
* set HIP_CLANG_PATH at docker run time
* update scripts for full qa
* enable the sql write query
* fix typo
* remove a comment from a script
* format
* improving pipeline
* fix typo
* format
* adding thread group
* adding thread group
* adding thread group
* adding gemm pipeline
* tweak
* refactor
* refactor
* add missing type convert
* refactor
* refactor
* refactor
* clean
* fix build
* refactor
* format
* clean up
* use remove_cvref_t
* clean
* use pipeline_v2 for gemm kernel
* Remove inconsistent indent
* Fix compilation errors due to incomplete merge process
* Add missing include directives
* Fix compilation errors in currently unused files
* Add license in newly added files
* Re-format touched files by clang-format-10
* Fix wrong template argument count of DeviceGemm<>
* Use language construct to choose between types
* Use language construct to choose GEMM example instance
* Fix compilation error due to interface change
* Re-use type alias to avoid duplication
* Unify type alias usage in source file
* Only use v2 pipeline in one gridwise GEMM type
* Remove no-longer used include directives
* Add static_assert() to check pipeline type requirements
* Revert "Add static_assert() to check pipeline type requirements"
This reverts commit f0985f0a13.
* clean
* clean
* clean
* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>