* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamicBuffer and buffer_load take a customized invalid-element-value, add buffer_load/store for fp64
* use remove_cvref_t
[ROCm/composable_kernel commit: 10bb811060]
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)
* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction
* Update to generic reduction C++ host interface layer to support dynamic generic reduction
* Update to remove tidy complaints in host interface layer
* Change the unary operator form from void op(T &x) to T op(T x)
* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)
* Use cppcheck-suppress to prevent some strange warnings
* Re-use operator [] and () for DynamicBuffer and update dependent code
* Remove unused code in first call threadwise/warpwise/blockwise kernel wrappers
* [performance] Remove un-needed local buffer initialization
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>
[ROCm/composable_kernel commit: 9e80cdceb7]
* add f32/i32 atomicAdd support into DynamicBuffer, and enable it in v1r3
* fixed
* fixed
* update comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: a7a758d8ce]
* add threadwise copy that copies a tensor in one copy, add kpack to DL GEMM
* add kpack into fwd v4r5 nchw fp32
[ROCm/composable_kernel commit: b8b2d0a6d1]
* Add online-compiling facility
* Synchronize from fwd-v4r5 and implement host interfaces to call conv-fwd v4r4/v4r5 using on-line compiling method
* Tiny adjustment to time reporting
* Use object assignment to replace explicit bytes copying in the first kernel of v4r4/v4r5
* Use single thread to assign descriptor object to device memory
* Adjust to the workload assignment of the two kernels of v4r4 (experimental)
* Revert "Adjust to the workload assignment of the two kernels of v4r4 (experimental)"
This reverts commit eb38461456bb0c82b6c0d32cdd616e181907e20c.
* Update to make constexpr for generating descriptor types in kernel 2 of dynamic conv-fwd v4r4
* Update to dynamic conv-fwd v4r4 online-compiling
* Update to dynamic conv-fwd v4r5 online-compiling (result not accurate)
* Tiny update to driver/CMakeLists.txt
* clang-format
* Tiny comments change
* Add env OLC_DUMP_SAVE_TMP_DIR to support saving of temporary dir
* Fwd v4r5 olc perf (#39)
* added hip-clang flags that fix perf issue of online compilation
* fix bug for olc fwd-v4r5-nchw
* Move constexpr and type reference statements out of the function body in conv-fwd v4r4/v4r5 kernel wrapper
* Remove printing in hip_build_utils.cpp
* Update to root CMakeLists.txt
* Revert "Move constexpr and type reference statements out of the function body in conv-fwd v4r4/v4r5 kernel wrapper"
This reverts commit 3d2c5d8ecdd8298b72d127110500ed5b38d9835c.
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@dc-smc-18.amd.com>
[ROCm/composable_kernel commit: 1685048a67]
* Use DynamicBuffer to hold raw pointer (to global and LDS memory)
* add workaround for compiler issue (inefficient ISA) of ds_write for int8x4, int8x8, int8x16
[ROCm/composable_kernel commit: 78b987fbd6]
* Replace most raw index calculations with coordinate transformations
* Overhaul blockwise and threadwise GEMM
* Overhaul driver for gridwise GEMM kernel
Co-authored-by: Jing Zhang <jizhan@amd.com>
[ROCm/composable_kernel commit: 01055d95d9]
* initial implementation for magic number division and DynamicMerge_v2_magic_division that uses it
* turn off DynamicMerge_v2_magic_division (which uses magic number division) by default
[ROCm/composable_kernel commit: 3bf52e60c5]
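
For readers unfamiliar with the technique named in the commit above: magic number division replaces an unsigned integer division by a known divisor with a multiply-high, an add and a shift. The following is a minimal host-side sketch of one common formulation, not the code from this commit. The names `Magic`, `make_magic` and `magic_divide` are hypothetical, and the sketch assumes a 32-bit unsigned dividend and a divisor in the range 1 to 2^31.

```cpp
#include <cstdint>

// Hypothetical helper type: precomputed constants for one divisor.
struct Magic
{
    std::uint32_t multiplier;
    std::uint32_t shift;
};

// Precompute the constants for a divisor d, assuming 1 <= d <= 2^31.
inline Magic make_magic(std::uint32_t d)
{
    // shift = smallest s such that 2^s >= d
    std::uint32_t shift = 0;
    while(shift < 31 && (std::uint32_t{1} << shift) < d)
    {
        ++shift;
    }

    // multiplier = floor(2^32 * (2^shift - d) / d) + 1, computed in 64 bits
    const std::uint64_t one = 1;
    const std::uint64_t m   = ((one << 32) * ((one << shift) - d)) / d + 1;

    return Magic{static_cast<std::uint32_t>(m), shift};
}

// Compute n / d using only a multiply, an add and a shift.
inline std::uint32_t magic_divide(std::uint32_t n, Magic magic)
{
    // High 32 bits of the 64-bit product n * multiplier
    const std::uint64_t hi = (static_cast<std::uint64_t>(n) * magic.multiplier) >> 32;

    // The sum is taken in 64 bits so that hi + n cannot wrap around
    return static_cast<std::uint32_t>((hi + n) >> magic.shift);
}
```

On a GPU the high half of the product would typically come from a multiply-high instruction, and a 32-bit add would restrict the valid dividend range; this sketch takes the sum in 64 bits to sidestep that restriction.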
* use address_space(4) in kernel signature to fix performance issue when passing tensor descriptor from host to kernel by (void) pointers
* remove passing by pointer* option (only use pass by value or void*)
[ROCm/composable_kernel commit: d2217f3040]