* add instance for gemm bias softmax gemm
* add client example
* change CGridDesc_G_M_N to CGridDesc_G_M_O
* add gridwise
* change c grid name
* device add d0s data
* fix 08 client_example
* add example 47_fused_attention
* example output correct
* add d0 to example
* add d0 element op
* rechange instance code
* change Acc0ElementwiseOperation to C0DEElementwiseOperation
* change example name
* update instance for cdeelementwiseop
* add bhalf_t ScaleAdd
* add test
* not support gemm1 bias
* remove some ignore
* fix test bug
* test the QA cron parameter for compiler commit
* create separate dockers for latest and fixed amd-stg-open compiler versions
* change groovy syntax
* apply cron timers back to develop branch
* File renaming and class renaming for device element-wise operation
* Add batchnorm-infer instances, external API and client example
* Add batchnorm-infer profiler module and gtests
* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
* Rename class and file due to conflict from device_elementwise_2d.hpp
* Fix namespace in batchnorm_infer_nhwc client example
* Use double as alpha/beta values type in reduce device op api
* Use double as alpha/beta values type in softmax device op api
* Use double as alpha/beta values type in multiple-reduce device op api
* Use double as epsilon value type in normalization/elementwise-normalization device op api
* add multi embeddings support
* fix format
* optimize sqrt
* add reduce operation
* change to elementwise op
* fix name
* rename
* run ci cd
* format example
* format code
* format code
* add example
* fix example
* add instance for gemm permute
* add to client example
* change configs
* change instance file name
* format
* change client example file name and remove example
* Change to the DeviceReduce base class template to include all problem description information
* Add external api for reduction
* Add client example to test the reduction external api
* Spelling correction
* Re-implement the host_reduction to follow the DeviceReduce base API format
* Change the reduce profiler to call the external API for collecting device instances
* Rename reduce client example directory from 08_reduce to 12_reduce
* Remove (void) before the function call
* Tiny update in reduce client example
* Tiny update in profile_reduce_impl.hpp
* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Add device op of gemm layernorm
* [What] Rename F to H
[Why] F and G prepare for welford tensor
* Add gridwise gemm + welford
* Extract template parameter
* Rename kernel. Prepare to add second half kernel
* Extract var
* Add second kernel for gemm+layernorm
* Move to the gemm_layernorm folder
* Rename F and G to mean and var
* Do not use snakeCurved, it makes determination of padding for welford difficult
* Rewrite the device interface and rename some var
* Add welford count
* Update interface
* Sync code, prepare to test on MI200
* Clean the code
* Implement layernorm
* Add comment to mention hipFree
* Write out e for debugging.
This could be removed, using h instead
* 1. Allocate mean, var and count via SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size
* Add gemm layernorm host code
* use reference layernorm
* Fix bug of blockwise welford for first kernel
* Fix bug of mean var padding for layernorm
* Use sgpr for shuffleM_index
* padding for GemmMeanVarCountGridDescriptor_M_NBlock
* Add layout parameter
* Check argument for gemm
* calculate max count for tail block
* Share E and H memory in device op
* Hard code the vector dim
* Refine the MakeDescriptor
* 1. Remove E parameter, because E is inside of device op
2. Check vector size
* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor
* Use 1D global memory for count
* Prevent redundant IO
* Update parameter
* Add pipeline v1/v2 selector
* Rename the example name
* Add base class for gemm layernorm
* Refine naming to distinguish naive and welford
* Add comment to explain in detail
* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1
* Rewrite the 2nd kernel, use multiple blocks along N dimension in layernorm kernel
* Share the vector size
* Refine var name
* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce
* Add comment
* Extract divisor out of the loop in reference layernorm
* Pad different size for E and H in layernorm kernel according to different block tile
* Refine naming
* Refine naming
* Prevent implicit cast
* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf only supports float; double would cause casting
* Cast only constant
* Change of post shuffle thread descriptor
* Add EMeanVarDataType parameter.
* Merge the mean and var threadwise copy
* Add missing index
* Fix Typo
* Sync the variable with previous if
* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* add DEBUG_LOG macro to enable/disable debug output
* fix syntax
* fix syntax again
* fix syntax one more time
* remove blank spaces
* use ifdefs
* add the Print argument
* move the definition of DEBUG_LOG to ck.hpp
* add the missing argument to Print()
* start add example
* add multiple d fp16 example
* device transfer elementwiseop to gridwise
* gridwise add multiple d
* change example for multiple d
* fix spill registers
* fix for passthrough element op
* fix int8 overflow
* change example file name
* add instance for dl multiple d
* example add DsDataType
* remove grouped_convolution_forward_dl.hpp
* add header file (was deleted before)
* fix not support device issue
* format
* remove passthrough check
Co-authored-by: letaoqin <letaoqin@amd.com>
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* Remove int4 related
* delete deprecated test
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Re-structure ckProfiler source files
* Rename profiler.cpp to main.cpp
* Modularize ckProfiler operations
* Add description for profiler operations
* Use longer name to avoid name collision
* Use macro to delay expansion
* Use std::move() to avoid object copying
* Prohibit users from calling dtor
* Use macro to eliminate redundant code
* Make friend function hidden
* Add missing include directive <iostream>
* Fix wrong include directives
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* Refine the device batchnorm-backward base API templates and data type assignments
* Remove duplicated kernel file
* Add batchnorm backward instances and external API
* Add batchnorm-backward profiler and tests
* Add client example which uses batchnorm backward external API
* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
* Loosen the threshold for batchnorm-backward check_err()
* Implemented batchnorm-backward Blockwise and Multiblock kernels
* Add batchnorm-backward device op
* Add batchnorm-backward host-reference op
* Add batchnorm-backward example
* Parameters renaming in batchnorm backward kernels and device op
* Change the example to loosen the threshold for ScaleDiff checking
* Add comments to explain the implementation of batchnorm-backward
* Parameters renaming again in batchnorm backward kernels
* Improve the expression calculation for performance
* Add batchnorm backward to README
* Add comments to explain inv-variance in batchnorm forward and backward
* Renaming the batchnorm forward training and inferring examples
* Add/update the comments for batchnorm-backward kernels
* Renaming again
* Add block_sync_lds between two consecutive blockwise reductions
* Move common expression 1/N out of the static_for loops
* Add dy_elementwise_op
* Renaming in backward example again
* Add checking for reduceDims in reference_batchnorm_backward
* Update to comments and codes format
* Rename in the comments
* Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c
* Add block_sync_lds() between blockwise reduction again
* Fix comments again
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
* Update to device_batchnorm_forward base class to include all template parameters for problem description
* Add batchnorm forward instances and external api
* Add batchnorm forward profiler module which uses the external api
* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
* Improvement to the batchnorm infer base API
* Add batchnorm forward client example which shows using the batchnorm forward external API
* Add test for batchnorm forward
* Tuning the batchnorm profiler initialized values and error threshold
* Add support for bhalf_t in instances/external api/tests
* Add support for int8_t in instances/external api/tests
* Add support for double in instances/external api/tests
* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
* Checking before running best instance in batchnorm_fwd_nhwc client example
* Add checking for YElementwiseOp in batchnorm_forward external API
* Add more types in batchnorm forward profiler
* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
* FastGelu support for more data types.
* AddFastGelu & FastGelu instances.
* Client example.
* clang-format
* Remove unused stride variable.
* Add new line at EOF.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
* added bf16 tests for batched_gemm_softmax_gemm_permute
* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* aligned annotations
* modified CMakeLists for examples
* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
* use macro to control the instances
* added macro control into instances
* clang-format some files
* changed error tolerance for bf16
* changed index for 10_elementwise_normalization
* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
We can use this template to eliminate duplicated iterator computation
logic. By providing a return type to ck::accumulate_n(), we can avoid
type conversion operations.
* Rangify check_err()
By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have the same value
type.
* Re-format example code
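
The last two commit bodies describe an accumulate helper that takes an explicit return type and a range-based check_err(). The sketch below illustrates both ideas under stated assumptions: the names accumulate_n_sketch and check_err_sketch, their signatures, and the default tolerance are all hypothetical and are not taken from the library's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <iterator>
#include <list>
#include <vector>

// Hypothetical helper (not the library's ck::accumulate_n): fold the first n
// elements starting at `first`. T is given explicitly by the caller, so the
// accumulator and the result share one type and no cast is needed afterwards.
template <typename T, typename ForwardIterator, typename BinaryOperation>
T accumulate_n_sketch(ForwardIterator first, std::size_t n, T init, BinaryOperation op)
{
    for(std::size_t i = 0; i < n; ++i, ++first)
        init = op(init, *first);
    return init;
}

// Hypothetical range-based comparison (not the library's check_err): accepts
// any two ranges usable with begin()/end(), not only std::vector<>.
// The absolute tolerance default is an assumption for illustration.
template <typename Range, typename RefRange>
bool check_err_sketch(const Range& out, const RefRange& ref, double atol = 1e-6)
{
    assert(atol >= 0.0);
    using std::begin;
    using std::end;

    // Ranges of different length can never match.
    if(std::distance(begin(out), end(out)) != std::distance(begin(ref), end(ref)))
        return false;

    auto r = begin(ref);
    for(auto o = begin(out); o != end(out); ++o, ++r)
    {
        if(std::abs(static_cast<double>(*o) - static_cast<double>(*r)) > atol)
            return false;
    }
    return true;
}
```

With these sketches, a call such as accumulate_n_sketch<std::size_t>(lengths.begin(), 3, 1, std::multiplies<>{}) multiplies three int lengths directly into a std::size_t, and check_err_sketch can compare a std::vector<float> against a std::list<float> without copying either into a vector first.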