* Add int8 specialization for elementwise Add and Subtract.
* CGEMM examples bf16, fp32, int8
* Add convert reference output to CDataType.
* Skip BF16 data type during testing.
* Lower K value to get rid of accumulation error.
* Fix merge artifact.
* Fix changed function name: GetElementSpaceSize()
* Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* turn on full qa only on gfx90a, use int initialization
* change script syntax
* update script parsing clinfo, throw exception if 0 devices
* fix syntax
* try using toBoolean for the QA conditions
* run regular CI on MI100 only, use MI200 only for daily QA
* evaluate when conditions before agent
* launch QA on develop branch and update profile_reduce script
* update test script
* update script
* remove false dependency from dockerfile
* try removing rbuild completely
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* add verify flag and update scripts
* replace old check_error function with the new check_err
* fix syntax
* remove blank spaces
* remove empty line
* add check_err for tensors
* fix syntax
* replace tensors with vectors in check_err calls
* fix syntax
* remove blank spaces
* fix syntax
* add new line at end of file
* disable conv2d_bwd_weight test, add gpu check
* set check_gpu using export
* check GPU using runShell
* add definition of runShell
* fix script syntax
* reduce the number of threads, add full qa option
* run processing scripts in bash
* fix the branch and host names in performance scripts, add chronos
* replace parameterizedCron with cron
* archive the perf log files
* try to fix git call
* pass branch and host names as arguments into scripts
* fix script arguments
* fix script arguments
* process results on master
* fix pipeline
* add definition of gpu_arch
* run processing scripts in docker
* fix the brackets
* add agent master for the processing stage
* get rid of show_node_info call on master
* try using mici label instead of master, disable MI100 tests for now
* fix syntax
* simplify container for results processing
* remove node(master) from the process_results stage
* put all stages in original order
* change the agent label from master to mici for gfx908
* Implement layernorm kernel and deviceOp
* verify gpu kernel with host code
* 1. Separate gamma aand beta from affine
2. Check if argument is valid
* clean
* Sync the naming
* Support sweep once mode if we can put k dimension data inside one block
* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.
* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.
* Use 1d descriptor for gamma and beta
* Add accElementwiseOp
* Extract layernorm host code
* Support different YVectorDim in GridwiseLayernorm
* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp
* Gamma and beta can share the VGPR.
* Add test for fp32 and fp16
* Fix bug of concurrency and add test case which may fail orignally
* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* adding scripts for full perf test suite
* uncomment the sql queries
* fix typo and chmod a+x for scripts
* dos2unix for all new scripts
* disable verification in full performance test
* fix reduction scripts, add gfrouped_gemm hotfix
* fix the grouped_gemm hotfix and only run reduction for fp16
* change compiler flag syntax
* fix syntax
* add predefinition of dockerArgs
* avoid redefinitions of dockerArgs
* add blank space at the end of dockerArgs
* try to build with release compiler
* adding spaces inside if condition
* limit the number of threads for building 9110 compiler
* change the way HIP_CLANG_PATH is set
* remove the export command
* change the conditional ENV syntax
* set HIP_CLANG_PATH at docker run time
* update scripts for full qa
* enable the sql write query
* fix typo
* remove a comment from a script
* format
* improving pipeline
* fix typo
* format
* adding thread group
* adding thread group
* adding thread group
* adding gemm pipeline
* tweak
* refactor
* refactor
* add missing type convert
* refactor
* refactor
* refactor
* clean
* fix build
* refactor
* format
* clean up
* use remove_cvref_t
* clean
* use pipeline_v2 for gemm kernel
* Remove inconsistent indent
* Fix compilation errors due to incomplete merge process
* Add missing include directives
* Fix compilation errors in currently unused files
* Add license in newly added files
* Re-format touched files by clang-format-10
* Fix wrong template argument count of DeviceGemm<>
* Use language construct to choose between types
* Use language construct to choose GEMM example instance
* Fix compilation error due to interface change
* Re-use type alias to avoid duplication
* Unify type alias usage in source file
* Only use v2 pipeline in one gridwise GEMM type
* Remove no-longer used include directives
* Add static_assert() to check pipeline type requirements
* Revert "Add static_assert() to check pipeline type requirements"
This reverts commit f0985f0a13.
* clean
* clean
* clean
* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>
* dump lds content in appropriate precision type
* add squared add reduction op; allows sq sum
* initial stub from regular gemm impl
* layernorm example code & host verification
* initial layernorm implementation
* tidy up
* make C0 precision type consistent with C
* clang-tidy and additional comments
* tighten up example code
* account for extra flops/bytes from normalization
* clang-format
* c0 bias/beta/gamma now have its own precision type
* AccElemOp for gemm outputs prior to feeding to layernorm
* update workgroup mapping
* rename kernel template param to reflect its dual use
* use LDS mem pool for reduction workspace
* change cshuffle precision type to f16; clean up
* clang-format
* correct naming
* explicit cast
* fully implemented gemm + bias + activation + add + norm
* activation in correct order
* reflect reduction API's recent change
* amend
* clean up; add comment
* keep up with recent changes in reduction API
* format
* resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* use 'sweep once' softmax kernel where applicable
* threadwise copy's dst buffer can specify invalid element value
* add int8 in/out float compute softmax support
give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
* format
* softmax inherits DeviceNormalization
* softmax profiler stub
* tighten up reference softmax interface
* example prints tensor dimension
* add fp32 to softmax profiler
* rename header
* hook with ckProfiler
* format
* resolve merge conflict
* resolve merge conflicts
* update normalization profiler help string
* resolve conflict
* typo
* remove residual
* softmax profiler: address feedback
* test for mixed precision input/output
* fully qualify ck::math::isnan
* add comment for device normalization interface
* revise wording
* constness for alpha/beta scaler pointer
* Extract base class for elementwise
* Refactor interface of DeviceGemmReduce. Do not use tuple in interface
* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add
* Unify base class of gemm + reduce and gemm + bias + add + reduce
* 1. Rename gemm_bias_add_reduce for external api
2. Refine cmake
* Add normalize device operation
* [What] Reorder the argument
[Why] Because d0 is also the input of c.
* Add type string
* Add example of gemm_bias_add_layernorm via external api
* Refactor example code
* clang-format
* Fix compile error
* clang-format
* Add external api for gemm_add_add_layernorm and normalize
* Add client example
* clang-format
* Switch to standard ROCm packaging
* Revert .gitignore changes
* install new rocm-cmake version
* update readme
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* UniforFill with integer values.
* Log tested instance type string.
* Add UT for all convolution specializations.
* debugging conv
* Fix dangling reference bug.
* Small refinements.
* Fix call to error checking function.
* Small refinements to tests.
* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.
* Add helper traits for AccumulatorDataType.
* Print first 5 errs in check_err for integral types.
* Rename FillUniform to FillUniformDistribution
* Refactor
* Do not use typed tests.
* Instead use plain fixture class with templatized member functions.
* Initialize tensors with integer values.
* Refine test instances.
* Properly set accumulator data type.
* Add another "big" instance.
* Refactor convolution tests.
* Revert "debugging conv"
This reverts commit b109516455.
* Add pragma once + format + small refinement.
* Fix some unwanted changes.
* Clang-format
* Fix profile_convnd to use renamed tensor initializer.
* Add instances for ConvFWDND kernel case 2D
* Helpers to get ConvNDFwd 2D instances.
* Refactoring.
* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.
* Refine and fix test.
* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.
* Remove AccumulatorDataType type trait.
* Update unit-tests.
* Remove operator<< overload.
* Unlock conv1d/3d nd fwd instances.
* Enable skipping calculating reference using flag.
* Fix number of channels for first ResNet50 layer.
* Clang-format.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for standalone softmax
* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
* host softmax validates
* compiles; to implement beta scaling
* use NaN trick to efficiently ignore OOB values during sum of exponentials
* freeload device_reduce's utility functions
* clean up interface
* adding prior value (beta scaling)
* remove restriction related to perf considerations
* apply clang-format
* clean; disable diagnostics
* resolve conflicts
* add exp wrapper
* honor HostTensorDesc interface; allow implicit cast from different vector<T> type
* test softmax for fp16/fp32
* update readme
* amend commit NaN trick
* remove redundant param added during development
* format
* replace ScalarDataType with AccDataType
* separate out test programs by precision type
* move softmax sample code to its own folder
* format
* keep up with recent changes in reduction API
* remove extra header
* use pre-built docker instead of building a new one
* try docker.image.pull
* change syntax in docker.image()
* add 30 min timeout
* increase timeout to 3 hours
* move performance tests to first stage for testing
* set image variable to the new container name
* update image name
* check available images
* check available images in both places
* try different image name
* use image ID to refer to image
* run performance on gfx90a
* fix the gpu_arch labeling, add parameter
* move env vars out of stages
* add stand-alone performance script, MI200 tests, CU numbers
* dos2unix for run_perf_tests.sh
* try the new git credentials
* use env var for git credentials
* don't look up /sys/module/amdgpu/version
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Remove template from Reducton operation classes and add template to their operator() and GetIdentityValue() interfaces
* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
* Use struct-scope operator template instantiation for binary and unary element-wise operations
* Change a few more elementwise operations to use template for operator()
* Tiny correction in Normalize operator
* Add static_assert to check the data type appliability for some reduction accumulator and element-wise operatons
* Correction in some examples with regard to using ReduceAccDataType
* Use static_assert for UnaryDivide
* Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
* Tiny fix with regard to SetWorkSpacePointer()
* Copy "gemm reduce" to "gemm bias add reduce"
* Implement gemm bias add reduction
* Fix compiler error due to merge from develop
* Add tensor operation for gemm + bias + add + reduce
* Add gemm_bais_add_reduce to ckProfiler
* Add c1 functor
* Refine type
* Use reduceAccDataType instead of explicitly float
* Change to use check_err()
* Do relu in float32 instead of bhalf_t. Because bhalf_t is unsigned
* Refactor relu. using type_trait instead of overloading
* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
* Fix denominator
* Refine nameing
* Fix denominator in host
* Remove useless include header
* Use AccDataType
* Fix static_cast order
* Refine type
* [What] Remove tuple type in the base class
[Why] External api depend on base class. if base class has relationship with type, we will need many class for different type
* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
* add bwd weight for bf16: init
* remove redundant compute
* use datatype and split k to check whether a workspace is used
* remove unused computation for work space size
* add some code for bfp16
* add device/grid unary op
* add unary type convert to bwd-weight example
* support bf16 splitk kernel for convnd bwd weight
* 1. remove comments. 2. add checkvalidity. 3. add gridsize computation
* add workspace size check
* fix format
* change function name
* use pre-built docker instead of building a new one
* try docker.image.pull
* change syntax in docker.image()
* add 30 min timeout
* increase timeout to 3 hours
* move performance tests to first stage for testing
* set image variable to the new container name
* update image name
* check available images
* check available images in both places
* try different image name
* use image ID to refer to image
* run performance on gfx90a
* fix the gpu_arch labeling, add parameter
* move env vars out of stages
* add stand-alone performance script, MI200 tests, CU numbers
* dos2unix for run_perf_tests.sh
* try the new git credentials
* use env var for git credentials
* use pre-built docker instead of building a new one
* try docker.image.pull
* change syntax in docker.image()
* add 30 min timeout
* increase timeout to 3 hours
* move performance tests to first stage for testing
* set image variable to the new container name
* update image name
* check available images
* check available images in both places
* try different image name
* use image ID to refer to image
* run performance on gfx90a
* fix the gpu_arch labeling, add parameter
* move env vars out of stages
* add stand-alone performance script, MI200 tests, CU numbers
* add resnet50 test to performance tests
* add blanks before gpu_arch in log files
* add resnet50 test with N=4 and process its results
* add ROCM and HIP versions to test tables
* uncomment the sql queries
* fix script syntax in jenkinsfile
* Use the unified naming for math functions on host and HIP kernel
* Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming
* Renaming GetReductionZeroVal() to GetIdentityValue()
* Tiny renaming in profile_reduce_impl.hpp
* More renaming in profile_reduce_impl.hpp
* Replace zeroVal by identiyVal
* Remove ck_ prefix in the naming of ck::math provided functions