* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
* added bf16 tests for batched_gemm_softmax_gemm_permute
* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* aligned annotations
* modified CMakeLists for examples
* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
* use macro to control the instances
* added macro control into instances
* clang-format some files
* changed error tolerance for bf16
* changed index for 10_elementwise_normalization
* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
We can use this template to eliminate duplicated iterator computation
logic. By passing the return type to ck::accumulate_n(), we can avoid
type conversion operations.
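A minimal sketch of the idea, not the actual ck::accumulate_n() source: the `sketch` namespace and the `element_count` helper are hypothetical. It assumes the helper simply forwards to std::accumulate over the first `count` elements, with the accumulator type given explicitly by the caller.

```cpp
#include <cstddef>
#include <functional>
#include <iterator>
#include <numeric>
#include <vector>

namespace sketch {

// Hypothetical helper: fold the first `count` elements starting at `first`.
// The caller names the accumulator type T, so the result needs no cast.
template <typename T, typename ForwardIterator, typename Size, typename BinaryOperation>
T accumulate_n(ForwardIterator first, Size count, T init, BinaryOperation op)
{
    return std::accumulate(first, std::next(first, count), init, op);
}

// Illustrative use: number of elements spanned by the first `rank` lengths.
inline std::size_t element_count(const std::vector<int>& lengths, std::size_t rank)
{
    return accumulate_n<std::size_t>(
        lengths.begin(), rank, std::size_t{1}, std::multiplies<>{});
}

} // namespace sketch
```

Calling `sketch::element_count({2, 3, 4}, 2)` multiplies only the first two lengths and yields a std::size_t directly, without a cast at the call site.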
* Rangify check_err()
By rangifying check_err(), we can compare not only values between
std::vector<>s, but also any ranges that have the same value
type.
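What a range-based comparison might look like is sketched below. This is not the real check_err() signature: the `sketch` namespace, the parameter names, and the tolerance defaults are assumptions made for illustration only.

```cpp
#include <array>
#include <cmath>
#include <iterator>
#include <type_traits>
#include <vector>

namespace sketch {

// Hypothetical range-based check: accepts any two ranges (std::vector,
// std::array, ...) as long as their value types are identical.
template <typename Range, typename RefRange>
bool check_err(const Range& out, const RefRange& ref, double rtol = 1e-5, double atol = 1e-8)
{
    using T = std::decay_t<decltype(*std::begin(out))>;
    static_assert(std::is_same_v<T, std::decay_t<decltype(*std::begin(ref))>>,
                  "both ranges must have the same value type");

    if(std::size(out) != std::size(ref))
        return false;

    auto r = std::begin(ref);
    for(auto o = std::begin(out); o != std::end(out); ++o, ++r)
    {
        const double expected = static_cast<double>(*r);
        const double diff     = std::abs(static_cast<double>(*o) - expected);
        if(diff > atol + rtol * std::abs(expected))
            return false;
    }
    return true;
}

} // namespace sketch
```

With this shape, a std::vector<float> can be compared against a std::array<float, N> without first copying either one into a vector.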
* Re-format example code
* Rangify STL algorithms
This commit adapts rangified std::copy(), std::fill() & std::transform()
* Allow constructing Tensor<> like a HostTensorDescriptor
* Simplify Tensor<> object construction logic
* Remove more unnecessary 'HostTensorDescriptor' objects
* Re-format example code
* Re-write more HostTensorDescriptor ctor calls
* add client example for elementwise_normalization
* clang format elementwise_layernorm2d.cpp
* changed some naming to make it more understandable
* changed naming of input into ab_input
* fixed bug for threadwise_x_store
* add elementwise operation to reference
* Rename example folder for GroupedConvFwdMultipleD
* Unify example codes
* Change target names
* Add fp16 example for multiple d instance
* Re-format common.hpp
* Add interface 'DeviceGroupedConvFwd'
* Use simpler interface
* Move common conv params out
* Rename conv fwd client example folder
* Add missing include directive
* Update grouped conv instance implementations
* Simplify ckProfiler (grouped conv forward)
* Use GroupedConvFwd to implement client example
* Use greater group count in example
* Add custom target to group examples
* Add extra tag param to instance factory function
* Use tag to differentiate factory functions
* Add missing tag argument for factory function
* Remove inheritance relationship
* Remove no-longer used include directive
* Add license in front of file
* Remove redundant CMake setting
* Extract common code from files
* Rename folder 'convnd' to 'conv'
* Use std::array<> to accept compile-time known # of arguments
* Fix compilation error of tuning parameter
* In example, use same setting as unit-test
* Remove no-longer used include directive
* Add interface for grouped conv bwd weight
* Add group support for conv bwd weight
* Add grouped conv bwd weight example
* Use group parameter in example
* Rename example folder
* Remove non-grouped version example source files
* Rename device op template
* Add group support to convolution backward weight
* Remove debug messages
* Use smaller group size in example
* Use named variable as loop terminate condition
* Prettify example output message
* Enlarge used grid size
* Allow real grid size exceeds expected grid size
* Rename interface file
* Add client example for grouped conv2d bwd weight
* Fix wrong include directive
* Rename client example folder
* add fused addition layernorm
* changed CMakelist
* removed annotates
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete an annotation line
* fixed some bugs to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved intermediate results into lds, which is faster in non-sweeponce cases
* changed naming of C into X to make the definition clearer
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confusing comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
* Sync the naming
* Sync the test of layernorm with groupnorm
* Sync the naming
* Minor change for comment and log
* [What] Add saveMean and SaveInvVariance in the interface.
[Why] These can be reused to optimize the backward pass
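To illustrate why saving these values helps, here is a host-side sketch of normalizing one row while returning the statistics. The names `RowStats` and `normalize_row` are hypothetical, and the sketch assumes that "InvVariance" means 1 / sqrt(variance + epsilon); the actual interface may define it differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical container for the statistics saved by the forward pass.
struct RowStats
{
    float mean;
    float inv_std; // assumed meaning: 1 / sqrt(variance + epsilon)
};

// Normalizes one row into `y` and returns the statistics, so a backward
// pass could reuse them instead of reducing over the row a second time.
inline RowStats normalize_row(const std::vector<float>& x, std::vector<float>& y, float epsilon)
{
    const float n = static_cast<float>(x.size());

    float mean = 0.f;
    for(float v : x)
        mean += v;
    mean /= n;

    float var = 0.f;
    for(float v : x)
        var += (v - mean) * (v - mean);
    var /= n;

    const float inv_std = 1.f / std::sqrt(var + epsilon);

    y.resize(x.size());
    for(std::size_t i = 0; i < x.size(); ++i)
        y[i] = (x[i] - mean) * inv_std;

    return {mean, inv_std};
}

} // namespace sketch
```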
* Improve example reusability
* Remove no-longer used file
* Rename folder of grouped_conv_bwd_data example
* Add normal grouped conv bwd example
* Add interface 'DeviceGroupedConvBwdData'
* Prettify comment of device op type arguments
* Add grouped conv2d/conv3d backward data fp16 instances
* Fix wrong template argument
* Add grouped_conv2d_bwd_data client example
* Use simpler expression to calculate memory size
* Fix formatting
* Remove grouped_conv3d_bw_data instances
Underlying device operator is not ready to handle 3D input
* Remove no-longer necessary include directive
* Add missing include directive
* Use more realistic conv param in example
* Add gridwise gemm pipeline v1/v2 selector
* Pipeline selector working, test-wise add pipeline options to one instance
* Add gemm instances
* Add debug info to DeviceGemmXdl
* Add debug info to DeviceGemmXdl_CShuffle
* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu
* Minor fix
* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm
* set up inter-wave configuration
* use default loop scheduling for supported gemm ops
To blanket-apply interwave scheduling to all supported gemm ops, define the macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. This is discouraged, though, as it is not covered by CI.
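A sketch of how such a macro could select the default at compile time. Only the macro name comes from the message above; the `sketch` namespace, the enum, and the function name are illustrative assumptions, as is treating an undefined macro as 0.

```cpp
// Assumption: the macro counts as 0 when the build does not define it.
#ifndef CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING
#define CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING 0
#endif

namespace sketch {

// Hypothetical enum naming the two scheduling strategies.
enum class LoopScheduler
{
    Default,
    Interwave
};

// Picks the scheduler that instances use when none is given explicitly.
constexpr LoopScheduler make_default_loop_scheduler()
{
#if CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING
    return LoopScheduler::Interwave;
#else
    return LoopScheduler::Default;
#endif
}

} // namespace sketch
```

Under this sketch, passing -DCK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1 to the compiler flips the default for every op that calls the helper.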
* Add enum PipelineVersion
* Update instances
* Format
* Fix the merge conflict
* Add flags to disable added instances
* Test disable flag check
* Disable flag check
* Enable the instances
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
* Add reduction across all dims cases.
* host softmax: handle all reduce
* Test cases when reduced dim is not innermost axis.
* Fix syntax.
* Test non innermost dim for fp32 and int8
* Group test suites wrt NumReduceDim.
* Additionally test failing cases.
* Throw error when Rank or NumReduceDims doesn't match arguments.
* Check reducedDims has correct values
* Don't reuse the DeviceReduceMultiblock IsSupportedArgument method.
Implement a dedicated one instead (in fact, this just drops one check to enable
reduction across inner dimensions).
* Reorganize unit tests to better cover use scenarios.
* Test input validation
* Test reduction of inner dimensions with custom op instances.
* Refactor fp32 and int8 unit tests.
* Fix FP32 instance template parameters.
* Add more instances.
* Instances with InSrcVectorDim=0.
* Do not initialize and copy data when arg not supported.
* ckProfiler Softmax use instance factory.
* Refactor device softmax IsSupported.
* Additionally add non-polymorphic api functions
* Split softmax instances into multiple files.
* Fix profiler.
* Reorganize tests to reuse profiler and cover edge cases.
* Clang-format
* I8 Softmax instances along with UT.
* Reuse type alias definitions from instance factory header.
* Clean included headers
* Fix variable names.
* Add missing checks in Argument constructor.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
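The reduction cases listed above (all dims, and a reduced dim that is not the innermost axis) can be illustrated with a small host reference for a row-major M x N tensor. This is a sketch under stated assumptions, not the reference implementation in the repository; `host_softmax` and its boolean parameters are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

namespace sketch {

// Softmax over a row-major M x N tensor. reduce_m / reduce_n select the
// reduced dimensions; setting both reduces across the whole tensor.
inline std::vector<double> host_softmax(const std::vector<double>& x,
                                        std::size_t M,
                                        std::size_t N,
                                        bool reduce_m,
                                        bool reduce_n)
{
    // Elements sharing the same invariant indices form one reduction group.
    const std::size_t group_cols = reduce_n ? 1 : N;
    const std::size_t groups     = (reduce_m ? 1 : M) * group_cols;
    auto group_of = [&](std::size_t m, std::size_t n) {
        return (reduce_m ? 0 : m) * group_cols + (reduce_n ? 0 : n);
    };

    std::vector<double> max_v(groups, -std::numeric_limits<double>::infinity());
    std::vector<double> sum_v(groups, 0.0);

    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            max_v[group_of(m, n)] = std::max(max_v[group_of(m, n)], x[m * N + n]);

    // Subtract the group maximum before exponentiating for stability.
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            sum_v[group_of(m, n)] += std::exp(x[m * N + n] - max_v[group_of(m, n)]);

    std::vector<double> y(x.size());
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
        {
            const std::size_t g = group_of(m, n);
            y[m * N + n]        = std::exp(x[m * N + n] - max_v[g]) / sum_v[g];
        }
    return y;
}

} // namespace sketch
```

Passing `reduce_m = true, reduce_n = false` exercises the non-innermost case: each column is normalized independently.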
* Add conv2d requant example
* Fix bash error
* Rename example
* 1. Rename gemm quantization
2. Share the requantization lambda function with conv
* Refine declare type
* Add conv bias relu quantization example
* clang format
* Fix compile error due to merge develop
* Fix CI error
* Extract quantization post operation into another file
* Support quantization for non-piecewise-linear functions
* Add instance for conv quantization
* Add convolution quantization factory
* Add convolution quantization client example
* Add more instances with different template parameters
* clang format
* Sync the naming with the develop
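The distinction between piecewise-linear and non-piecewise-linear activations in requantization can be sketched as follows. All function names here are hypothetical and the rounding mode is an assumption; the sketch only shows why ReLU needs a single combined scale while something like tanh needs the accumulator scaled to real values first.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

namespace sketch {

// Round to nearest and saturate into the int8 range.
inline std::int8_t saturate_to_int8(float v)
{
    return static_cast<std::int8_t>(std::clamp(std::nearbyint(v), -128.0f, 127.0f));
}

// Piecewise-linear activation (ReLU): relu(s * acc) == s * relu(acc) for
// s > 0, so one combined scale applied after the activation is enough.
inline std::int8_t requant_relu(std::int32_t acc, float scale)
{
    const float activated = static_cast<float>(std::max(acc, std::int32_t{0}));
    return saturate_to_int8(scale * activated);
}

// Non-piecewise-linear activation (tanh): scale the accumulator to real
// values, apply the activation, then rescale into the int8 output range.
inline std::int8_t requant_tanh(std::int32_t acc, float scale_acc, float inv_scale_out)
{
    return saturate_to_int8(inv_scale_out * std::tanh(scale_acc * static_cast<float>(acc)));
}

} // namespace sketch
```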
* add device of dl
* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3
* init version for dl conv
* add example(init)
* result right
* disable elementwise operation
* check parameters
* add fp32,int8 example and change check code
* change device file and class name
* add check vector access of C
* add instance
* add to ckProfiler
* add Filter1x1Pad0 instances
* fix ignore error
* fix for CI
Co-authored-by: letaoqin <letaoqin@amd.com>
* Update to the batchnorm-forward API and base class
* Fix leaked header including in gridwise_set_buffer_value.hpp
* Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction
* Update to the batchnorm-forward example to use the new batchnorm-forward device interface
* Change the batchnorm-forward reference to use sequential welford method
* Change to assign the workspace into four buffers in the host layer
* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford
* Tiny correction and remove un-used file under example/34_batchnorm
* Renaming in the kernel arguments
* Explicitly use ck::math::sqrt in batchnorm-forward kernels
* Add some comments to some kernels
* Tiny fix
* Generalize the data types in reference_batchnorm_forward_nhwc_c
* Use ck::ignore to mark un-used parameters
* Move GetReduceCountPerThread functor codes from kernel to device
* Remove some un-used codes in device_batchnorm_forward_impl.hpp
* Tiny fix in batchnorm_forward example
* Move GetReduceCountPerThread() to welford_helper.hpp
* Use separate data type for Scale and Bias
* Renaming in device Op
* Tiny fix in forward example
* Update to batchnorm-infer (type splitting, renaming)
* Add time and bandwidth measurement to the batchnorm-forward example
* Add support of elementwise operation for batchnorm forward output
* Reduce object copying by passing object as reference type
* Tiny change for performance
* Updates for performance again
* Some Renamings
* Add GetActualVariance template parameter for ThreadwiseWelfordMerge
* Tiny update in reference batchnorm forward nhwc/c
* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory
* Fuse mean and bias in the normalization calculation
Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
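For readers unfamiliar with the Welford method mentioned above, a minimal host-side sketch follows. The `Welford` struct is hypothetical and is not the kernel or reference code; `update` is the sequential form, and `merge` shows how partial results (for example from several blocks) can be combined.

```cpp
#include <cmath>
#include <cstddef>

namespace sketch {

// Running mean / variance using Welford's algorithm.
struct Welford
{
    double mean       = 0.0;
    double m2         = 0.0; // sum of squared deviations from the mean
    std::size_t count = 0;

    // Sequential update with one new sample.
    void update(double x)
    {
        ++count;
        const double delta = x - mean;
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);
    }

    // Combine with statistics computed independently over other samples.
    void merge(const Welford& other)
    {
        const std::size_t total = count + other.count;
        if(total == 0)
            return;
        const double delta = other.mean - mean;
        const double na    = static_cast<double>(count);
        const double nb    = static_cast<double>(other.count);
        mean += delta * nb / static_cast<double>(total);
        m2 += other.m2 + delta * delta * na * nb / static_cast<double>(total);
        count = total;
    }

    // Population variance (divides by count, not count - 1).
    double variance() const { return count ? m2 / static_cast<double>(count) : 0.0; }
};

} // namespace sketch
```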
* reduce the number of default targets
* re-write the setting of target flags
* move all options to one place
* add new custom target instances for installing CK
* reopen masking attention instance now that CI is upgraded
* re-enable instances previously failed on 9110
* enable ksize-kpadding pair validity test
* add non-masked attention+permute test; expose masking boolean to attention kernel handles
* disable bench
* fix test
* move files
* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute
* format
* amend rename
* disable bench in test
* add mask/no-mask test for non-permute attention kernels
* disable broken kernel instance
* example working
add non-permuted problem statement
evaluating whether overhead comes from permutation or the extra kernel arg
* interface for bias addition without implementing it
* test and profiler running
* tidy
* mask type determined by enum class
* unify example code
* move masking specialization to its own header
* align formats
* extract helper functions
* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute
* add tensor specialization to template args
since tensor spec packed shows perf parity when permutation isn't needed
remove redundant template args
comment on 'packed' tensor specialization
* grouped attention with input/output permute example
* format
* clean up
* refactor acc0 tile visitor
* fused attention client example
* format
Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
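Selecting the mask type through an enum class rather than a boolean could look like the sketch below. The enumerator names, `MaskPredicate`, and the convention that the upper triangle means column index greater than row index are assumptions for illustration.

```cpp
namespace sketch {

// Hypothetical enum replacing a plain "masking on/off" boolean.
enum class MaskingSpecialization
{
    MaskDisabled,
    MaskOutUpperTriangle
};

// Compile-time selected predicate: the disabled case folds to `false`.
template <MaskingSpecialization Mask>
struct MaskPredicate
{
    // m is the row (query) index, n is the column (key) index.
    constexpr bool is_masked_out(int m, int n) const
    {
        if constexpr(Mask == MaskingSpecialization::MaskOutUpperTriangle)
        {
            return n > m;
        }
        else
        {
            (void)m;
            (void)n;
            return false;
        }
    }
};

} // namespace sketch
```

An enum leaves room for further mask kinds later without changing the template parameter's type.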
* Fix for lwpck-425, update BlockTransferSrcVectorDim
* Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim"
This reverts commit fd24e280e2.
* Add Batched Gemm int8 test, expect it to fail
* Format
* Re-add the fix
* prototype
4 layouts
fix default stride
all problem sizes
tidy
move file
update build script
restore old file
fix build
* refactor standalone test to use gemm test harness
* simplify gemm test
* update build script
* remove redundant
* early return when cmd arg doesn't match
* tidy
* report failure when result not validated
* tidy
* Apply suggestions from code review
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances
* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces
* Remove DeviceSoftmaxImpl's dependency on DeviceReduceMultiblock
* Split the cpp and hpp files for reduction instances to enable more parallel compiling
* Remove the use of macros for declaring reduction instances and instance references
* Update to add_device_reduce_instance_xxxx templated functions
* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to replace the ReduceOpId in defining add_reduce_instance_xxxx() templates
* Change return format
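One benefit of moving lengths and strides from std::vector to std::array is that the rank becomes part of the type. The sketch below uses a hypothetical `packed_strides` helper to show this; it is not taken from the reduction device interfaces.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

// With std::array the rank is a compile-time constant, so a lengths /
// strides rank mismatch is a type error instead of a runtime check.
template <std::size_t Rank>
std::array<std::size_t, Rank> packed_strides(const std::array<std::size_t, Rank>& lengths)
{
    static_assert(Rank > 0, "rank must be positive");

    std::array<std::size_t, Rank> strides{};
    strides[Rank - 1] = 1;
    // Walk from the innermost dimension outwards.
    for(std::size_t i = Rank - 1; i > 0; --i)
        strides[i - 1] = strides[i] * lengths[i];
    return strides;
}

} // namespace sketch
```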
* Move kernel implementation files under impl directory.
* Update examples paths.
* Update device kernel impl include paths.
* Update tensor operation instances include paths.
* Update profiler and tests include paths.
* Clang-format
* Update include paths for batched gemm reduce
* Refactor UnitTest ConvNDBwdWeight.
* Refactor fwd and bwd data convND UT.
* Fix used test macro.
* Fix include path.
* Fix include paths.
* Fix include paths in profiler and tests.
* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* start split k
* add base device class
* add example after merge develop
* add gridwise gemm
* add b matrix split k
* split=1
* change name for kb
* not bias result right
* bias only add once
* fix register spill
* regular code
* add fp32 example
* fix for 64bit index
* fix CheckValidity of gridwise
* use another instance to check the efficiency
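The "bias only add once" item reflects a general property of split-K: every K chunk contributes a partial sum to the same output element, so the bias must be added by exactly one chunk. A host-side sketch under that assumption (the function name and layout are hypothetical; on a GPU the partial sums would typically be combined with atomic adds or a follow-up reduction):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// C[m][n] = bias[n] + sum_k A[m][k] * B[k][n], with K split into k_batch
// chunks that each accumulate a partial sum into C. All matrices row-major.
inline std::vector<float> gemm_bias_split_k(const std::vector<float>& a,
                                            const std::vector<float>& b,
                                            const std::vector<float>& bias,
                                            std::size_t M,
                                            std::size_t N,
                                            std::size_t K,
                                            std::size_t k_batch)
{
    assert(k_batch > 0);
    std::vector<float> c(M * N, 0.f);
    const std::size_t k_per_batch = (K + k_batch - 1) / k_batch;

    for(std::size_t kb = 0; kb < k_batch; ++kb)
    {
        const std::size_t k_begin = kb * k_per_batch;
        const std::size_t k_end   = std::min(K, k_begin + k_per_batch);
        for(std::size_t m = 0; m < M; ++m)
            for(std::size_t n = 0; n < N; ++n)
            {
                // Only the first chunk carries the bias; adding it in every
                // chunk would count it k_batch times.
                float partial = (kb == 0) ? bias[n] : 0.f;
                for(std::size_t k = k_begin; k < k_end; ++k)
                    partial += a[m * K + k] * b[k * N + n];
                c[m * N + n] += partial;
            }
    }
    return c;
}

} // namespace sketch
```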
* optimize group layer norm
* 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and division into an outer static loop
* add more instances to layernorm
* add 2 more test cases
* remove ignore in generating tuple of vector
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* enable ccache and decouple it from MIOpen ccache use
* fix the ccache check script
* use another method to get server name
* fix syntax
* add quotes around the server name variable
* use check_host as function
* change syntax
* fix syntax
* test if server name is parsed correctly
* try different syntax
* check the env var value
* test new check node function
* add ROCMVERSION parameter and fix script syntax
* fix script syntax
* add missing instances of rocm version
* install ccache in the docker image
* do not check GPU in clang format stage, clean up old code
* update defaults and clean up