* Add conv2d requant example
* Fix bash error
* Rename example
* 1. Rename gemm quantization
2. shares the requantization lambda function with conv
* Refine declare type
* Add conv bias relu quantization exmaple
* clang format
* Fix compile error due to merge develop
* Fix CI error
* Extract quantization post operation into another file
* Support quantization for non piecewise linear function
* Add instance for conv quantization
* Add convolution quantization factory
* Add convolution quantization client example
* Add more instances with different template parameters
* clang format
* Sync the naming with the develop
* build CK only once, use deb package in all subsequent stages
* update jenkins file
* change prefix for build_CK stage
* update writing deb metadata to control file
* update ubuntu source for docker, script syntax for deb package metadata
* try different way to create deb metadata
* clean up DEBIAN before creating one
* fix the CI folder names, fix splitK qa
* use correct docker in all stages, separate tests for splitK verification and performance
* clean old comments, change dir before packaging
* use different package syntax
* change packaging syntax
* package with cmake
* remove unnecessary build prefix
* get rid of unnecessary paths
* change paths during unpacking
* change script syntax while unpacking
* get rid of unneccesary steps
* get rid of comments in the scripts
* use double quotes for scripts
* add ccache during build, try dpkg -x
* pull and install each package separately
* use full package names
* try to use stashing for packages
* change stash/unstash syntax
* move unstash out of shell, run tests on any gpu node
* unpack each package separately
* try re-using existing workspace
* merge the build and test stages, only stash ckProfiler
* merge the build and test stages, only stash zipped ckProfiler
* fix syntax
* add GPU check before build and test, rename docker to usual name
* upgrade the OS and ROCM versions in CK docker
* add cxx flags to link code with rocm5.2 and ck-9110 compiler
* rename the docker image
* run ONNX gemms using init=1
* add processing for the onng_gemm and splitK_gemm
* add profile_onnx_gemm.sh
* add stderr to logfiles, add splitK and onnx gemm parsing
* enable splitK gemm wresults posting to db
* allow selecting compiler version
* fix typo
* add Wno-deprecated flag for google tests
* change git repo, fix qa log files names
* change the git clone syntax
* use Omkar's git credentials
* try to use jenkins as git user
* try using illsilin username for gerrit repo with ssh key
* try new gerrit authorization
* change ssh key syntax
* try another way of passing ssh key to docker
* add mount ssh in dockerfile
* create .ssh folder
* move ssh-keyscan to later
* get rid of npm call
* build first docker image on master
* check the contents of the .ssh folder
* try replacing omkars creds with gerrit creds
* use open repo, clean up changes
* get rid of ssh default argument
* turn on full qa only on gfx90a, use int initialization
* change script syntax
* update script parsing clinfo, throw exception if 0 devices
* fix syntax
* try using toBoolean for the QA conditions
* run regular CI on MI100 only, use MI200 only for daily QA
* evaluate when conditions before agent
* launch QA on develop branch and update profile_reduce script
* update test script
* update script
* remove false dependency from dockerfile
* try removing rbuild completely
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* add verify flag and update scripts
* replace old check_error function with the new check_err
* fix syntax
* remove blank spaces
* remove empty line
* add check_err for tensors
* fix syntax
* replace tensors with vectors in check_err calls
* fix syntax
* remove blank spaces
* fix syntax
* add new line at end of file
* disable conv2d_bwd_weight test, add gpu check
* set check_gpu using export
* check GPU using runShell
* add definition of runShell
* fix script syntax
* reduce the number of threads, add full qa option
* run processing scripts in bash
* fix the branch and host names in performance scripts, add chronos
* replace parameterizedCron with cron
* archive the perf log files
* try to fix git call
* pass branch and host names as arguments into scripts
* fix script arguments
* fix script arguments
* process results on master
* fix pipeline
* add definition of gpu_arch
* run processing scripts in docker
* fix the brackets
* add agent master for the processing stage
* get rid of show_node_info call on master
* try using mici label instead of master, disable MI100 tests for now
* fix syntax
* simplify container for results processing
* remove node(master) from the process_results stage
* put all stages in original order
* change the agent label from master to mici for gfx908
* adding scripts for full perf test suite
* uncomment the sql queries
* fix typo and chmod a+x for scripts
* dos2unix for all new scripts
* disable verification in full performance test
* fix reduction scripts, add gfrouped_gemm hotfix
* fix the grouped_gemm hotfix and only run reduction for fp16
* change compiler flag syntax
* fix syntax
* add predefinition of dockerArgs
* avoid redefinitions of dockerArgs
* add blank space at the end of dockerArgs
* try to build with release compiler
* adding spaces inside if condition
* limit the number of threads for building 9110 compiler
* change the way HIP_CLANG_PATH is set
* remove the export command
* change the conditional ENV syntax
* set HIP_CLANG_PATH at docker run time
* update scripts for full qa
* enable the sql write query
* fix typo
* remove a comment from a script
* UniforFill with integer values.
* Log tested instance type string.
* Add UT for all convolution specializations.
* debugging conv
* Fix dangling reference bug.
* Small refinements.
* Fix call to error checking function.
* Small refinements to tests.
* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.
* Add helper traits for AccumulatorDataType.
* Print first 5 errs in check_err for integral types.
* Rename FillUniform to FillUniformDistribution
* Refactor
* Do not use typed tests.
* Instead use plain fixture class with templatized member functions.
* Initialize tensors with integer values.
* Refine test instances.
* Properly set accumulator data type.
* Add another "big" instance.
* Refactor convolution tests.
* Revert "debugging conv"
This reverts commit b109516455.
* Add pragma once + format + small refinement.
* Fix some unwanted changes.
* Clang-format
* Fix profile_convnd to use renamed tensor initializer.
* Add instances for ConvFWDND kernel case 2D
* Helpers to get ConvNDFwd 2D instances.
* Refactoring.
* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.
* Refine and fix test.
* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.
* Remove AccumulatorDataType type trait.
* Update unit-tests.
* Remove operator<< overload.
* Unlock conv1d/3d nd fwd instances.
* Enable skipping calculating reference using flag.
* Fix number of channels for first ResNet50 layer.
* Clang-format.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* use pre-built docker instead of building a new one
* try docker.image.pull
* change syntax in docker.image()
* add 30 min timeout
* increase timeout to 3 hours
* move performance tests to first stage for testing
* set image variable to the new container name
* update image name
* check available images
* check available images in both places
* try different image name
* use image ID to refer to image
* run performance on gfx90a
* fix the gpu_arch labeling, add parameter
* move env vars out of stages
* add stand-alone performance script, MI200 tests, CU numbers
* dos2unix for run_perf_tests.sh
* try the new git credentials
* use env var for git credentials
* use pre-built docker instead of building a new one
* try docker.image.pull
* change syntax in docker.image()
* add 30 min timeout
* increase timeout to 3 hours
* move performance tests to first stage for testing
* set image variable to the new container name
* update image name
* check available images
* check available images in both places
* try different image name
* use image ID to refer to image
* run performance on gfx90a
* fix the gpu_arch labeling, add parameter
* move env vars out of stages
* add stand-alone performance script, MI200 tests, CU numbers
* add resnet50 test to performance tests
* add blanks before gpu_arch in log files
* add resnet50 test with N=4 and process its results
* add ROCM and HIP versions to test tables
* uncomment the sql queries
* fix script syntax in jenkinsfile
* Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
* Update to host layer and host reduction
* Merge and remove reduction kernels
* Merge and remove reduction device interfaces and update pooling device interface
* Merge and remove useless reduction device instances
* Update to reduction profiler and reduction ctests
* Update to reduction and pooling examples and add one reduction example
* Change to reduction examples to let them testable by ctest
* Add explicit pass checking for reduction and pooling examples
* Explicit assignment of tensor shapes in example reduce_blockwise_two_call
* Use atomic_add to repace atomicAdd and add atomic_add for double type
* Add reduce ctest support for double data type
* Replace to_int_vector() by using c++ std::vector::assign()
* Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
* Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
* Add GetAtomicOperationZeroValue() support for AtomicMax
* Tiny change to reduce example README.md
* Fix some tiny issues due to branch merging
* Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
* Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
* Renaming
* Clean the header includings in device_reduce instances header files
* modify ckProfiler_gemm output
* fix syntax
* change ckProfiler output and return 0
* fix syntax
* output datatype
* fix syntax
* output datatype in another way
* fix syntax
* fix syntax
* test return values of ckProfiler
* add layout info and tests, make sure ckprofiler returns 0
* fix syntax
* change layout output
* fix syntax
* fix syntax again
* update script to process perf results
* rearrange jenkins stages
* fix typo
* add python packages to Docker file
* adding setuptools-rust package
* modify parsing for new test parameters
* test db credentials on jenkins
* fix syntax
* update python script to handle incomplete lines
* ungrade python to 3.8 and write the gemm_params table
* add sqlalchemy package to docker
* move perf data processing to master node
* move the master node inside a steps region
* add new stage for result processing
* move results processing to separate stage
* reduce number of tests to speedup debugging
* pass config to processPerfResults stage
* run script on master in a docker container
* replace show_node_info
* try loading docker on master node again
* use ansible node instead of master
* get rid of pymysql package
* try ssh connection using paramiko
* put back pymysql
* put the perf data processing back on the gpu node
* put back artifact definition
* archive the perf_log before parsing
* clean up jenkinsfile, fix parsing
* fix typo
* enable all perf tests
* put all stages in original order, finalize script
* fix gpu_arch version
* update parsing script
* remove obsolete file causing merge conflict
* add performance test to jenkins pipeline
* fix typo
* fix the syntax in conv_fwd_util.cpp
* fix the error message syntax spacing
* fix the error message syntax spacing again
* run profile_gemm and archive results
* fix typo
* try to figure out the paths
* try to figure out the paths one more time
* skip the copying step
* build ckProfiler release only once
* change directory using dir
* fix dir syntax
* change the gemm parameters
* do not pipe script output to file
* try running ckProfiler directly
* fix typo
* use set +e
* run profile_gemm.sh || true
* run multiple gemms and parse results
* fix typo in jenkinsfile
* fix syntax
* add new gemm sizes, update scripts
* put all jenkins steps in original order
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* compile ck for all targets
* update the target criteria
* change the target condition
* fixed some typos
* fixed missed file
* revert changes in README
* revert device_conv3d_fwd_xdl_...
* update device_conv3d_fwd_xdl_...
* update device_batched_gemm_reduce...
* test the unused arguments fix
* test the warning suppression
* try suppress warnings in device_batched_gemm_reduce_xdl...
* fix the last warnings
* replace UNUSED with std::ignore
* fix a typo
* replaced std::ignore with ignore
* add igonre header to common_header
* refactor atomicAdd
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Convolution ND
* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances
* Move convnd f32 instance file to comply with repo structure.
* Conv 1D tensor layouts.
* Formatting and use ReferenceConv
* Reference ConvFwd supporting 1D and 2D convolution.
* Debug printing TensorLayout name.
* Conv fwd 1D instance f32
* Refactor conv ND example.
Needed to support various conv dimensio.
Needed to support various conv dimensions
* Rename conv nd example director to prevent conflicts.
* Refactor some common utility to single file.
Plus some tests.
* Refactor GetHostTensorDescriptor + UT.
* Add 1D test case.
* Test reference convolution 1d/2d
* Remove some leftovers.
* Fix convolution example error for 1D
* Refactor test check errors utility function.
* Test Conv2D Fwd XDL
* More UT for 1D case.
* Parameterize input & weight initializers.
* Rename example to prevent conflicts.
* Split convnd instance into separate files for 1d/2d
* Address review comments.
* Fix data type for flops/gbytes calculations.
* Assign example number 11.
* 3D cases for convolution utility functions.
* 3D reference convolution.
* Add support for 3D convolution.
* Check for inputs bigger than 2GB.
* Formatting
* Support for bf16/f16/f32/i8 - conv instances + UT.
* Use check_err from test_util.hpp.
* Split convnd test into separate files for each dim.
* Fix data generation and use proper instances.
* Formatting
* Skip tensor initialization if not necessary.
* Fix CMakefiles.
* Remove redundant conv2d_fwd test.
* Lower problem size for conv3D UT.
* 3D case for convnd example.
* Remove leftovers after merge.
* Add Conv Specialization string to GetTypeString
* Skip instance causing numerical errors.
* Small fixes.
* Remove redundant includes.
* Fix namespace name error.
* Script for automatic testing and logging convolution fwd UTs
* Comment out numactl cmd.
* Refine weights initalization and relax rtol for fp16
* Fix weights initialization for int8.
* Add type_convert when store output in ref conv 1D.
* Get back old conv2d_fwd_xdl operation.
* Silence conv debug print.
* format
* clean
* clean
* Fix merge.
* Fix namespace for check_err
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
* Rename the folder name for the pool2d and reduce examples
* Update to reduction test scripts
* Add Readme for pool2d_fwd and reduce_blockwise examples
* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)
* Tiny fix in reduce profiler and tiny update in reduce testing scripts
* Tiny fix in testing script profile_reduce_no_index.sh
* Tiny fix in testing script profile_reduce_no_index.sh
* Add support for bfp16 reduction (using bhalf_t = ushort)
* Tiny fix in amd_buffer_addressing.hpp
* Tiny change in script/profile_reduce_with_index.sh
* Use AccDataType for Beta value and use element_wise::PassThrough
* Use type_convert for type converting in host layer reduction
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming all NumReduceDims to NumReduceDim
* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2
* Update to testing scripts to add bf16 support
* added more static_assert
* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp
* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations
* minor change
* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass
* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp
* Tiny fix in script/profile_reduce_no_index.sh
* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims
* Generic renaming in host reduction and DeviceReduce layer
* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances
* Use multi-thread and simplification for host Reduction implementation
* Add ctest for reduction
* Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/
* Update to the reduce CTest executables to enable default testing behavior when no command argument
* Renaming
Co-authored-by: Jianfeng yan <jfyan008@gmail.com>
* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction
* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter
* Rename the folder name for the pool2d and reduce examples
* Update to reduction test scripts
* Add Readme for pool2d_fwd and reduce_blockwise examples
* Tiny fix in reduce profiler and tiny update in reduce testing scripts
* Tiny fix in testing script profile_reduce_no_index.sh
* Tiny change in script/profile_reduce_with_index.sh
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming and refining in Reduction profiler/device layer/examples
* Renaming all NumReduceDims to NumReduceDim
* Initial adding of generic reduction
* Initial adding of generic reduction ...
* Updates to make compiling done
* clang-format all files
* clang-format some files again
* Renaming in profiler/include/profile_reduce.hpp
* Updates and make BlockWise cases passed
* Updates and make ThreadWise and MultiBlockTwoCall cases passed
* Remove the support for MUL and NORM1 reduceOp from the profiler and the device instances
* Change to replace the dim0_max_vector_size/dim1_max_vector_size template argument in the device reduce classes
* format
* adding pooling
* added max and average pooling
* comment out cout and kernel timing
* Tiny simplification in profiler/reduce_profiler.cpp
* Add example for reduce_blockwise
* Tiny updates
* Change to pass the ElementWiseOp from device layer to kernel
* Fix the vectorDim and vectorSize in Device layer
* Enable vector load on both dim0 and dim1 for Threadwise method
* Tiny updates
* Change to let the user to pass the preUnaryOp and posUnaryOp
* Make pooling example work
* split device_reduce_instance into two libraries
* Tiny update
* Replace nanPropaOpt enum by boolean propagate_nan
* Simplification in DeviceReduce layer codes
* update build
* Change to clarify the difference between ck::half_t and half_float::half
* Renaming in all the reduction codes
* Add VectorSize as template parameter for device layer
* Add BetaIsZero as kernel template and as AccDataType for alpha
* print
* Small updates for pooling
* Updates for host_generic_reduction for reference
* Update to make AVG pooling pass
* Update to make MAX pooling with indices output pass
* fix
* add OutDst vector store to threadwise reduction and pooling
* tweak
* turn off check_indices that caused build issue
* refactor pooling
* clean up
* turn off check_indices for building issue for php-compiler
* add more tile size for odd C
* tweak conv for odd C
* update script
* clean up elementwise op
* add hack in reduction_operator.hpp to avoid compile error. To fix it, need to use element_wise_op in reduction op
* Add OutVectorSize as device and kernel tunable, also update to Elementwise Operations
* Move reduce operator mapping to host layer file reduction_operator_mapping.hpp from reduction_operator.hpp
* Change to the unary operators
* Move the definitions of unary operations to element_wise_operation.hpp
* re-org files
* Refine in device interfaces and multiblock kernels
* Split the reduction configurations into instances for specific methods
* Update in getTypeString() of device pool2d
* Renaming in host and kernel
* Tiny update in profiler/src/profiler.cpp
* Uncomment in device_operation/CMakeLists.txt to enable the building of all operations
* Make check_indices a templated function to remove some linking issue
* Renaming in the profiler reduce module
* Add support for double Reduction (but disable MultiblockAtomicAdd for double)
* Tiny correction of literal string
* Rename DevicePoolFwd to DevicePool2dFwd
* Split device_reduce_instance_xxx.cpp files according to the data types to speed up compiling
* Add comments for lists of configurations, lists of instances and references of add_reduce_instances_xxx
* Remove un-used header file gridwise_generic_reduction_wrapper_common.hpp
* Renaming and refining in the Reduction codes
* Tiny change in the unary operators
* Renaming symbols and files
* Renaming symbols in the kernels
* Move kernel kernel_set_buffer_value to separate file
* Add IndexDataType template parameter for kernels and use int32_t as index data type in device layer
* Tiny update in the kernels
* Remove definition of sqrtf()/isnan()/abs() for half_t due to some ADL issue
* Simplify a helper function in device layer
* Tiny adjustment in testing data initialization
* Renaming in kernel/device/host
* Add two testing scripts for reduction
* Refine the Unary operators in element_wise_operation.hpp
* Update in the reduce profiler module
* Update to the reduction testing scripts
* reduce compile parallelism
* change CI docker to rocm5.0
* remove unused variables
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* tweak conv for odd C
* update script
* clean up elementwise op
* fix build
* clean up
* added example for gemm+bias+relu+add
* added example for gemm+bias+relu
* add profiler for gemm_s_shuffle; re-org files
* add profiler
* fix build
* clean up
* clean up
* clean up
* fix build
* [What]
1. Add DeviceGemmXdl_C_Shuffle
2. Revise example of gemm_xdl
[Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
[How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp
* Add online-compiling facility
* Synchronize from fwd-v4r5 and implement host interfaces to call conv-fwd v4r4/v4r5 using on-line compiling method
* Tiny adjustment to time reporting
* Use object assignment to replace explicit bytes copying in the first kernel of v4r4/v4r5
* Use single thread to assign descriptor object to device memory
* Adjust to the workload assignment of the two kernels of v4r4 (experimental)
* Revert "Adjust to the workload assignment of the two kernels of v4r4 (experimental)"
This reverts commit eb38461456bb0c82b6c0d32cdd616e181907e20c.
* Update to make constexpr for generating descriptor types in kernel 2 of dynamic conv-fwd v4r4
* Update to dynamic conv-fwd v4r4 online-compiling
* Update to dynamic conv-fwd v4r5 online-compiling (result not accurate)
* Tiny update to driver/CMakeLists.txt
* clang-format
* Tiny comments change
* Add env OLC_DUMP_SAVE_TMP_DIR to support saving of temperary dir
* Fwd v4r5 olc perf (#39)
* added hip-clang flags that fix perf issue of online compilation
* fix bug for olc fwd-v4r5-nchw
* Move constexpr and type reference statements out of the function body in conv-fwd v4r4/v4r5 kernel wrapper
* Remove printing in hip_build_utils.cpp
* Update to root CMakeLists.txt
* Revert "Move constexpr and type reference statements out of the function body in conv-fwd v4r4/v4r5 kernel wrapper"
This reverts commit 3d2c5d8ecdd8298b72d127110500ed5b38d9835c.
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@dc-smc-18.amd.com>
* Use DynamicBuffer to hold raw pointer (to global and LDS memory)
* add workaround for compiler issue (inefficient ISA) of ds_write for int8x4, int8x8, int8x16
* Replace most raw index calculation to coordinate transformation
* Overhaul blockwise and threadwise GEMM
* Overhaul driver for gridwies GEMM kernel
Co-authored-by: Jing Zhang <jizhan@amd.com>