* Add image to column kernel
* Add instances, tests, profiler, example
* Add client example
* Several fixes of image to column
* Fix variable name in device_image_to_column_impl
* Several fixes of image to column profiler
* Fix num_btype calculation
* Make new mesaurements for correct bytes calculation
* Add maxpool instances
* Rename index pool to max pool.
* Add maxpool bwd bf16 instances
* Add avg pool bwd instances
* Rename avgpool and maxpool to avg_pool3d and max_pool
* Add bf16 pool fwd instances
* Add max pool bwd to ckProfiler
* Add avg pool3d bwd to ckProfiler
* Add avg pool bwd test
* Fix bug of reference pool fwd (dilation)
* Fix bug of max pool bwd (dilation and initZero)
* Support bf16 compute data type
* Force compute type be f32. Because atomicAdd only support f32
* Add max pool bwd test
* Rename folder
* Rename pool
* Add max pool bwd client example
* Add avg pool bwd client example
* Add missing workspace
* clang format
* Rename macro
* remove useless header
* remove useless layout
* Do not hardcode stride
* devicePool2DFwd Inherit devicePool3DFwd
* Move instance declaration out of common
* Add dilation
* use the pool3d rank, because pool2d inherit pooo3d
* calculate Do Ho Wo for the dilation
* Fix header name
* Modify ckProfiler
* Remove pool2d instance
* Remove pool2d in profiler
* Remove pool2d and add dilation
* In to client example, this commit revise following:
1. Add dilation.
2. Use pool3d to implement pool2d
* Refine naming and IsSupportedArgument()
* Add dilation to maxpool bwd example
* clang format
* 1. Remove useless header
2. Fix copyright
3. Refine naming
* Add layout parameter to pool fwd
* clang format
* Fix merge error
* Fix compile error
* Remove layout parameter in derived class
* Refine changlog
* Fix compile error
* Fix compiler error
* Add layout to external api and profiler
* Add wei_strides to grouped conv3d wei to keep consistency
* Fix strides in client examples
* Unify backward weight api with forward
* Fix for example
* Fixes for examples
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* initial stream-k implementation with example
* fix unexpected change in err
* improve a little bit performance by reorganize pipeline.
* improve perf a little bit by swizzle block idx
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using memory coherence glc_slc bit from template
* control memory coherence while construct dynamic buffer
* update reduction for streamk(not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bug
* now result is correct, everything works (but has scratch)
* remove scratch by manually reset coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum as camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception if have empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting
* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization
* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax
* Add testing of GetGenericInstance() in client_example of Softmax
* Revert "Add testing of GetGenericInstance() in client_example of Softmax"
This reverts commit f629cd9a93.
* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"
This reverts commit a9f0d000eb.
* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm
* Move generic kernel instance to separate tuple for elementwise op of normalization
* Remove un-used files for softmax instance
* Store generic kernel instance to separate tuple for softmax
* Add IsSupported checking for generic instance to client example of softmax
* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization
* clang-format fix
* Remove int8 from softmax instances
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add license header.
* Reduce number of logged output. Add constant initialization.
* Add functional tests for grouped_gemm with different kbatch value.
* Add debug log informations + remove unused code.
* Don't pass kbatch to CalculateKPadded.
* Turn on logging in grouped gemm and gemm splitk profiler
* Debug: limit number of test cases to run;
* Log more information and initialize with constant value.
* Turn on DEBUG_LOG
* Add more debug log informations.
* Limit the number of instances to compile.
* Use GridwiseGemmPipeline
* Use KBatch to calculate K0
* Multiple DebugLog messages.
* Unit tests for multiple KBatch values.
* Refactoring
* Disable logging
* extract out of if statement KBatch update.
* Uncomment instances.
* Disable DebugLog.
* Use Kbatch when calculate KPadded.
* Fix CGridDesc padding.
* Use available helper functions.
* Uncomment code commented for debuggin.
* Remove unnecessary debug log messages.
* Uncomment previously commented code for debug purposes.
* Add KBatch info to profiler output summary log.
* Add gtests for gemm splitk using ckProfiler API.
* Add more test-cases for different data layout.
* Add more test cases for gemm splitk
* Remove old test.
* Unit tests for MKNK ggemm interface.
* Fix and add more unit-tests.
* Constepxr everything!
* Increase error threshold for fp16 and splitk.
Since we're using fp16 atomic add for splitk there's a
known precision loss.
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Expand the base class of pool2d, prepare to share base class with pool3d
* Add pool3d device op
* Add pool3d f16 example
* Refactor the base class. implement generic pooling in the future
* clang format
* get original index in max pooling
* Add outputindex to base class
* Fix dimension
* Add pooling instance
* Use indexType instead
* Remove useless header
* Extract IndexDataType to template
* Extract pooling reference code
* clang format
* clang format
* Fix typo
* Add tensor stride
* Add missing header
* Add index stride and output stride
* Refine naming
* Add type to base class
* Rename file
* Use proper size
* Fix typo
* Refine naming
* Modify the argument into vector.
* Add max pool profiler
* Refine naming
* Support f32 pool
* Fix typo
* Add avg pool2d fwd in profiler
* clang format
* Rename AccDatatype to ComputeDatatype
* Fix init
* test pool
* Extract variable
* Add client example
* Check the pooling dim
* clang format
* Connect argv and arg_parser
* Add found check
* Remove useless header
* Refine naming
* Adjust the order of device_pool_fwd
* Add contraction profiler and tests
* Build and style fixes
* Allow to use any elementwise operator for ref_contraction
* Introduce profile_contraction_scale and profile_contraction_bilinear
* Make ref_contraction generic and extend interface tests
* Stylistic minor fixes
* Extend test_contraction_interface
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Ractor instance generation, split into multiple cpp file
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* Grouped gemm + Gelu instances.
* Device Instance Factory for GroupedGemm+Gelu
* Client example
* Rangify fill helper functions.
* Fix name clash.
* Profiler for grouped_gemm+gelu
* No need to use full namespace name.
* Add check for MRaw divisible by vector load.
* Ugly fix for big errors.
* Add grouped_gemm+gelu to profiler CMakelists.
* Store in argument additional info.
* Information about Mraw, Nraw, Kraw values.
* Use FastGelu instead of Gelu.
* Change client ex to use FastGelu
* Remove relaxed error precision.
* Remove duplicate output elementwise-op
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Sync the order of type string with template parameter
* Add more instances
* Check the vector size and remove redundant var
* Extract var to static, prepare to separate sweep once kernel
* Separate sweeponce flow and optimize the flow
* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
* Remove useless code
* Update naive variance kernel
* Refine string
* Fix typo
* Support naive variance for device_normalization
* Check the blocksize
* Share the VGPR of x and y
* Share the VGPR of gamma and beta
* Add more instances
* Support fp16 sqrt for experiment
* Add CHANGELOG
* Fix typo
* clang-format
* Add gemm + layernorm instance
* Add ckProfiler
* Add test
* Add client example
* Detect if user forger to set the workrspace
* Use literal in the example
* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
* check gemm vaildity in IsSupportedArgument
* Add more testcases
* Merge duplicated folder in client example
* Print more infomation
* Use better kernel parameter for MS problem size
* clang format
* Add constexpr for if condition and remove redundant include
* Remove cstdlib and add constexpr
* add instance for gemm bias softmax gemm
* add client example
* change CGridDesc_G_M_N to CGridDesc_G_M_O
* add gridwise
* change c grid name
* device add d0s data
* fix 08 client_example
* add example 47_fused_attention
* example output correct
* add d0 to example
* add d0 element op
* rechange instance code
* change Acc0ElementwiseOperation to C0DEElementwiseOperation
* change example name
* update instance for cdeelementwiseop
* add bhalf_t ScaleAdd
* add test
* not surport geem1 bias
* remove some ignore
* fix test bug
* File renaming and class renaming for device element-wise operation
* Add batchnorm-infer instances, external API and client example
* Add batchnorm-infer profiler module and gtests
* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
* Rename class and file due to conflict from device_elementwise_2d.hpp
* Fix namespace in batcnnorm_infer_nhwc client example
* Use double as alpha/beta values type in reduce device op api
* Use double as alpha/beta values type in softmax device op api
* Use double as alpha/beta values type in multiple-reduce device op api
* Use double as epsilon value type in normalization/elementwise-normalization device op api
* Change to the DeviceReduce base class template to include all problem description information
* Add external api for reduction
* Add client example to test the reduction external api
* Spelling correction
* Re-implement the host_reduction to follow the DeviceReduce base API format
* Change the reduce profiler to call the external API for collecting device instances
* Rename reduce client example directory from 08_reduce to 12_reduce
* Remove (void) before the functional call
* Tiny update in reduce client example
* Tiny update in profile_reduce_impl.hpp
* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* start add example
* add multiple d fp16 example
* device transfer elementwiseop to gridwise
* gridwise add multiple d
* change example for multiple d
* fix spill registers
* fix for passthrough element op
* fix int8 overflow
* change example file name
* add instance for dl multiple d
* example add DsDataType
* remove grouped_convolution_forward_dl.hpp
* add head file(was deleted before)
* fix not support device issue
* format
* remove passthrough check
Co-authored-by: letaoqin <letaoqin@amd.com>
* Re-structure ckProfiler source files
* Rename profiler.cpp to main.cpp
* Modularize ckProfiler operations
* Add description for profiler operations
* Use longer name to avoid name collision
* Use macro to delay expansion
* Use std::move() to avoid object copying
* Prohibit users from calling dtor
* Use macro to eliminate redundant code
* Make friend function hidden
* Add missing include directive <iostream>
* Fix wrong include directives
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* Refine the device batchnorm-backward base API templates and data type assignments
* Remove duplicated kernel file
* Add batchnorm backward instances and external API
* Add batchnorm-backward profiler and tests
* Add client example which uses batchnorm backward external API
* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
* Loose the threshold for batchnorm-backward check_err()
* Update to device_batchnorm_forward base class to include all template parameters for problem description
* Add batchnorm forward instances and external api
* Add batchnorm forward profiler module which uses the external api
* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
* Improvement to the batchnorm infer base API
* Add batchnorm forward client example which shows using the batchnorm forward external API
* Add test for batchnorm forward
* Tuning the batchnorm profiler initialized values and error threshold
* Add support for bhalf_t in instances/external api/tests
* Add support for int8_t in instances/external api/tests
* Add support for double in instances/external api/tests
* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
* Checking before running best instance in batchnorm_fwd_nhwc client example
* Add checking for YElementwiseOp in batchnorm_forward external API
* Add more types in batchnorm forward profiler
* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
* added bf16 tests for batched_gemm_softmax_gemm_permute
* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* aligned annotations
* modified CMakeLists for examples
* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
* use macro to control the instances
* added macro control into instances
* clang-format some files
* changed error tolerance for bf16
* changed index for 10_elementwise_normalization
* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Rangify STL algorithms
This commit adapts rangified std::copy(), std::fill() & std::transform()
* Rangify check_err()
By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have same value
type.
* Allow constructing Tensor<> like a HostTensorDescriptor
* Simplify Tensor<> object construction logics
* Remove more unnecessary 'HostTensorDescriptor' objects
* Re-format example code
* Re-write more HostTensorDescriptor ctor call
* Rename example folder for GroupedConvFwdMultipleD
* Unify example codes
* Change target names
* Add fp16 example for multiple d instance
* Re-format common.hpp
* Add interface 'DeviceGroupedConvFwd'
* Use simpler interface
* Move common conv params out
* Rename conv fwd client example folder
* Add missing include directive
* Update grouped conv instance implementations
* Simplify ckProfiler (grouped conv forward)
* Use GroupedConvFwd to implement client example
* Use greater groupe count in example
* Add custom target to group examples
* Add extra tag param to instance factory function
* Use tag to differentiate factory functions
* Add missing tag argument for factory function
* Remove inheritance relationship
* Remove no-longer used include directive
* Add license in front of file
* Remove redundant CMake setting
* Extract common code from files
* Rename folder 'convnd' to 'conv'
* Use std::array<> to accept compile-time kwnown # of arguments
* Fix compilation error of tuning parameter
* In example, use same setting as unit-test
* Remove no-longer used include directive
* Add interface for grouped conv bwd weight
* Add group support for conv bwd weight
* Add grouped conv bwd weight example
* Use group parameter in example
* Rename example folder
* Remove non-grouped version example source files
* Rename device op template
* Add group support to convolution backward weight
* Remove debug messages
* Use smaller group size in example
* Use named variable as loop terminate condition
* Prettify example output message
* Enlarge used grid size
* Allow real grid size exceeds expected grid size
* Rename interface file
* Add client example for grouped conv2d bwd weight
* Fix wrong include directive
* Rename client example folder
* add fused addition lyernorm
* add fused addition lyernorm
* changed CMakelist
* removed annotates
* modified descriptor of C
* fixed bug in gridwise add layernorm
* format the files
* modified name from add&layernorm into elementwise&layernorm
* created fused elementwise layernorm branch
* change input into tuple type
* add sweep once to reduce load & read of C from global memory
* modified Argument api
* modified way to malloc c in global memory
* changed gamma and beta to m_k_desc
* fixed bug when sweep once and move CDataType when define device level struct
* add src dim for gamma and beta
* implement optimization for coalesced
* delete a annotation line
* fixed some bug to meet the requirements of ck
* add bandwidth computing in example, and fixed the time unit
* move device_elementwise_layernorm_impl.hpp into device/impl
* fixed bug in device_elementwise_layernorm_impl.hpp
* changed name from layernorm into normalization
* clang-format the changed files
* changed the names
* moved immidiate results into lds, it become faster in non-sweeponce cases
* changed naming of C into X to make the defination more clear
* changed naming in example
* add tests for elementwise normalization
* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization
* move test_elementwise_layernorm_fp16 into new folder
* move elementwise_normalization_instances into a new folder
* add more tests in test_elementwise_layernorm_fp16.cpp
* added some corner cases in test
* fixed method to compute lds size for matrix X
* changed name of 44_elementwise_normalization into 45_elementwise_normalization
* modified some comments
* modified some other confused comments
* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
* Sync the naming
* Sync the test of layernorm with groupnorm
* Sync the naming
* Minor change for comment and log
* [What] Add saveMean and SaveInvVariance in the interface.
[Why] These can optimize the backward
* Add reduction across all dims cases.
* host softmax: handle all reduce
* Test cases when reduced dim is not innermost axis.
* Fix syntax.
* Test non innermost dim for fp32 and int8
* Group test suites wrt NumReduceDim.
* Additionally test failing cases.
* Throw error when Rank or NumReduceDims doesn't match arguments.
* Check reducedDims has correct values
* Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
Instead implement own. (in fact just get rid of one check to enable
reduction across inner dimensions).
* Reorganize unit tests to better cover use scenarios.
* Test input validation
* Test reduction of inner dimensions with custom op instances.
* Refactor fp32 and int8 unit tests.
* Fix FP32 instance template parameters.
* Add more instances.
* Instances with InSrcVectorDim=0.
* Do not initialize and copy data when arg not supported.
* ckProfiler Softmax use instance factory.
* Refactor device softmax IsSupported.
* Additionally add non-polymorphic api functions
* Split softmax instances into multiple files.
* Fix profiler.
* Reorganize tests to reuse profiler and cover edge cases.
* Clang-format
* I8 Softmax instances along with UT.
* Reuse type alias definitions from instance factory header.
* Clean included headers
* Fix variable names.
* Add missing checks in Argument constructor.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
* add device of dl
* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3
* init version for dl conv
* add example(init)
* result right
* disable elementwise operation
* check parameters
* add fp32,int8 example and change check code
* change deive file and class name
* add check vector access of C
* add instance
* add to ckProfiler
* add Filter1x1Pad0 instances
* fix ignore error
* fix for CI
Co-authored-by: letaoqin <letaoqin@amd.com>