* [What] Add 2d version of bias, prepare to implement alpha / beta scaling
* Add alpha / beta functor
* Refine parameter of example
* [What] Use real type instead of template
[Why] Prevent implicit cast
* Rename parameter for general operator
* Remove redundant comment
* Fix compile error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
- device_gemm_xdl_c_shuffle function signature matches split-k
- retire host_driver since it is no longer maintained
- linter error (unused variable)
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* tweak conv for odd C
* update script
* clean up elementwise op
* fix build
* clean up
* added example for gemm+bias+relu+add
* added example for gemm+bias+relu
* add profiler for gemm_s_shuffle; re-org files
* add profiler
* fix build
* clean up
* clean up
* clean up
* fix build
* Do not hardcode the function parameter, use template instead.
* [What] Remove AThreadTransferSrcResetCoordinateAfterRun and BThreadTransferSrcResetCoordinateAfterRun in host API
[Why] "C_Shuffle" version is supposed to be similar to the vanilla one
* Fix typo
Let DeviceGemmXdl_C_Shuffle use kernel_gemm_xdlops_v3r1
* [What]
1. Add DeviceGemmXdl_C_Shuffle
2. Revise example of gemm_xdl
[Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
[How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp
* fix relu
* clean up
* clean up
* adding 1x1 conv
* adding 1x1 conv
* added 1x1 conv
* refactor
* refactor
* refactor
* added profiler for conv+bias+relu+add
* clean up
* adding conv+bias+relu
* adding conv+bias+relu
* added conv+bias+relu
* Update README.md
* update cpu verification
* adding c shuffle
* update static_tensor for dealing with invalid element
* adding c shuffle
* debugging
* fix bug
* convert to fp16 before shuffle
* shuffle more than one M/NRepeat
* clean up
* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
* clean up
* remove coordinate step hack from all gridwise gemm xdl
* clean up coordinate step hack
* clean up coordinate step hack
* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
* adding output shuffle in conv+bias+relu+add
* update
* added conv+bias+relu+add with c shuffle
* added conv+bias+relu+add with c shuffle
* fix forward_sweep bugs in threadwise copy
* clean up
* refactor
* clean up
* clean up
* added conv_c_shuffle+bias_relu
* clean up
* added conv+bias+relu+atomic_add
* clean up
* clean up
* clean up
* clean up
* clean up
* clean up
* misc fixes; add 1x1 specialization
* clean up
* delete unused device op
* clean up
* add support for odd C value