* Add maxpool instances
* Rename index pool to max pool.
* Add maxpool bwd bf16 instances
* Add avg pool bwd instances
* Rename avgpool and maxpool to avg_pool3d and max_pool
* Add bf16 pool fwd instances
* Add max pool bwd to ckProfiler
* Add avg pool3d bwd to ckProfiler
* Add avg pool bwd test
* Fix bug of reference pool fwd (dilation)
* Fix bug of max pool bwd (dilation and initZero)
* Support bf16 compute data type
* Force compute type to be f32, because atomicAdd only supports f32
* Add max pool bwd test
* Rename folder
* Rename pool
* Add max pool bwd client example
* Add avg pool bwd client example
* Add missing workspace
* clang format
* Rename macro
* remove useless header
* remove useless layout
* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large amount of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* clang format v12 to be consistent with miopen
* avoid clang-format for config file
* Do not hardcode stride
* devicePool2DFwd inherits devicePool3DFwd
* Move instance declaration out of common
* Add dilation
* use the pool3d rank, because pool2d inherits pool3d
* calculate Do Ho Wo for the dilation
* Fix header name
* Modify ckProfiler
* Remove pool2d instance
* Remove pool2d in profiler
* Remove pool2d and add dilation
* In the client example, this commit revises the following:
1. Add dilation.
2. Use pool3d to implement pool2d
* Refine naming and IsSupportedArgument()
* Add dilation to maxpool bwd example
* clang format
* 1. Remove useless header
2. Fix copyright
3. Refine naming
* Add layout parameter to pool fwd
* clang format
* Fix merge error
* Fix compile error
* Remove layout parameter in derived class
* Refine changelog
* Fix compile error
* Fix compiler error
* Add layout to external api and profiler
* Add avgpool bwd reference code
* Refine naming
* Fix invalid in_element op in ref_conv
* Add example (only reference now)
* Add the full example of avgpool bwd
* Fix copyright
* Imitate MakeDescriptor from transform_conv_bwd_data_to_gemm_v1.hpp
* rename channel to c from k
* Arrange the code
* Imitate the argument from conv bwd
* Implement invoker
* Fix order of parameter in example
* Refactor reference code for different dimension
* Support different stride
* Check if argument is valid
* Fix kernel parameter for NDHWC, fastest dimension C is not reduced
* Add more data type in example
* Fix bug in example
* calculate Do Ho Wo according to the dilation
* Remove useless header
* Add comment in reference code
* Add layout parameter
* Remove layout in derived class
* Refine reference comment
* Enable grouped conv with small K or C
* Add missing instances
* Refactor grouped conv fwd instances
* Fix fp16 instances since it supports src_per_vec %2 = 0
* Add generic instances
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softmax_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
* Add wei_strides to grouped conv3d wei to keep consistency
* Fix strides in client examples
* Unify backward weight api with forward
* Fix for example
* Fixes for examples
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add s_nops after v_dot to avoid hazard
* Fix builtin for inner_product fp16
* Skip inline version to builtin
* Add comments regarding isa
* Fix comment regarding s_nop
* initial stream-k implementation with example
* fix unexpected change in err
* improve performance a little by reorganizing the pipeline
* improve perf a little by swizzling block idx
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using memory coherence glc_slc bit from template
* control memory coherence when constructing dynamic buffer
* update reduction for streamk (not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bugs
* now result is correct, everything works (but has scratch)
* remove scratch by manually reset coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum as camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception on empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* first change bias load
* add bias dim and scalar vector parameter
* make CDE0BlockTransferSrcVectorDim not work
* changes to instance
* add limit for CDE0BlockTransferSrcScalarPerVector
* allow building CK for specific data types
* add CI build and test stage on Navi3x without some int8 instances
* add missing gemm fp16 instances
* add the changes to the missed cmake file
* add empty lines at end of source files
* Do not build quantization client example on navi3 in CI
* disable batched_gemm_multi_d_int8 instances with DTYPES
* disable device_conv2d_bwd_data_instance with DTYPES
* fix ckprofiler for conv_bwd_data for int8
* properly isolate the conv_bwd_data int8 instances
* remove empty line
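
Note on the "calculate Do Ho Wo for the dilation" entries: the snippet below is a minimal sketch of the standard output-length formula for a dilated pooling window, applied once per spatial dimension (D, H, W). The helper name `pool_output_length` and its parameter names are hypothetical and chosen for illustration; they are not taken from the library's headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the library's API).
// Computes the output length along one spatial dimension of a pooling
// window with the given stride, dilation and left/right padding.
inline std::int64_t pool_output_length(std::int64_t in_len,
                                       std::int64_t window,
                                       std::int64_t stride,
                                       std::int64_t dilation,
                                       std::int64_t pad_left,
                                       std::int64_t pad_right)
{
    assert(window > 0 && stride > 0 && dilation > 0);

    // A dilated window of `window` taps spans this many input elements.
    const std::int64_t effective_window = dilation * (window - 1) + 1;
    const std::int64_t padded_len       = in_len + pad_left + pad_right;

    // The window must fit at least once into the padded input.
    assert(padded_len >= effective_window);

    return (padded_len - effective_window) / stride + 1;
}
```

Calling the helper three times, with the D, H and W extents and their respective window, stride, dilation and padding values, yields Do, Ho and Wo. With dilation 1 the expression reduces to the usual non-dilated pooling formula.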