Mirror of https://github.com/ROCm/composable_kernel.git
Synced 2026-05-14 02:02:46 +00:00 at commit f3d8052ad219afef6b48172dfa654027de27215f
* Squashed 'src/composable_kernel/' content from commit a4b211238
git-subtree-dir: src/composable_kernel
git-subtree-split: a4b21123849265d90a6b8fa86905a9a8ab253787
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* Squashed 'src/composable_kernel/' changes from a4b211238..5805b5dc4
5805b5dc4 Update develop (#5) (#6)
ede23b251 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
8b079b5c6 refactor
c3d788bfa refactor
fcf913481 rename
git-subtree-dir: src/composable_kernel
git-subtree-split: 5805b5dc442dd8d71295954c4a755a6ef30593bb
* fix
* refactor
* remove online compilation from CK
* refactor
* fix
* add ctest
* tidy
* add tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* add c-style pointer cast
* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
* fix clang warning suppression
* tidy
* suppress cppcheck
* fix enum issue
* revert changes to hip build
* fix kernel filename
* update CK build script
* rename
* rename
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* refactor
* refactor
* change cmakelist
* change ck common utility
* fix
* Squashed 'src/composable_kernel/' changes from 5805b5dc4..dd3d4444e
dd3d4444e Merge pull request #16 from ROCmSoftwarePlatform/develop
cb6b2dc63 Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
d9b2fcab4 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
57b74196a refactor
431c47bea refactor
9a0d05870 DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
bc4146402 use int instead of index_t in kernel wrapper
87a2fc094 compiler parameter use stream
24743c85e make inner product compatible on gfx900
7ad33d8e1 rename
5a3bace8d rename
12405c12a update CK build script
3c2effd43 fix kernel filename
12ff8d1ca fix enum issue
f0f97fd79 tidy
26f311aa9 fix clang warning suppression
c4f47ed09 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
35fd7bf79 add c-style pointer cast
9c31642f0 tidy
1a2efac60 tidy
ddd3b4e94 tidy
7daa0cfbf tidy
cab6e58d3 tidy
d9a8aebd8 tidy
533e356ce tidy
42639836b tidy
efe2836a2 add tidy
d53d7c666 fix
cf4ea1145 remove online compilation from CK
e63b17bdf refactor
5a2e56f78 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
702078bfd Merge pull request #7 from ROCmSoftwarePlatform/master
9ce85357a Update develop (#5)
10a172710 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
git-subtree-dir: src/composable_kernel
git-subtree-split: dd3d4444e9b9ed07a54f82d91d969770aa8d5074
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
* Fix with regard to implementing GetZeroVal() in both kernel and host
* Avoid converting to compType from dstDataType before writing the output value
* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
* Add CONSTANT decorator for descriptor read buffer
* Use get_thread_local_1d_id() for thread local Id
* Rename GetZeroVal() to GetReductionZeroVal() in the kernels
* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
* Occasional tiny simplification and update in the kernel files
* Update in src/reducetensor.cpp for consistent IDs passing to the kernel
* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
* Update to remove OpenCL tidy checking failures
* Small updates in src/reducetensor.cpp
* Update for better readability
* Remove unused codes and not-needed template parameters in the kernel wrappers
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>
[ROCm/composable_kernel commit: dfb80c4e39]
How to build and run
Docker
docker run \
-it \
--rm \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \
rocm/tensorflow:rocm4.2-tf2.4-dev \
/bin/bash
Install Boost for online compilation
https://www.boost.org/doc/libs/1_66_0/more/getting_started/unix-variants.html#easy-build-and-install
Build
Add the Boost library path:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
mkdir build && cd build
Run cmake. The GPU target ID must be specified; the example below uses gfx908:
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_CXX_FLAGS="-DCK_AMD_GPU_GFX908 -O3 --amdgpu-target=gfx908 -mllvm --amdgpu-spill-vgpr-to-agpr=0 -gline-tables-only -save-temps=$PWD" \
-D HIP_ONLINE_COMPILER_FLAGS="-DCK_AMD_GPU_GFX908" \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=ON \
..
Build the drivers:
conv_fwd_driver_offline is the offline-compilation driver for forward convolution
conv_bwd_driver_offline is the offline-compilation driver for backward-data convolution
conv_fwd_driver_online is the online-compilation driver for forward convolution
make -j conv_fwd_driver_offline
make -j conv_bwd_driver_offline
make -j conv_fwd_driver_online
Run
- layout: 0 = NCHW; 1 = NHWC
- algo: algorithm
- verify: 0 = no verification; 1 = do verification
- init: initialization method (0 ~ 5)
- log: 0 = no log; 1 = do log
- repeat: number of times the kernel is launched
######################################################## layout algo verify init log repeat N__ K___ C___ Y X Hi_ Wi__ Strides Dilations LeftPads RightPads
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
./host/driver_offline/conv_bwd_driver_offline 1 5 0 0 0 1 256 256 1024 3 3 14 14 1 1 1 1 1 1 1 1
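The output spatial sizes printed in the results below follow the standard convolution output-size formula. A minimal sketch (the function name is illustrative, not part of the drivers):

```python
# Standard convolution output-size formula (per spatial dimension):
#   Ho = (Hi + left_pad + right_pad - (dilation * (Y - 1) + 1)) // stride + 1
def conv_out_size(hi, y, stride, dilation, left_pad, right_pad):
    return (hi + left_pad + right_pad - (dilation * (y - 1) + 1)) // stride + 1

# First command: Hi = Wi = 71, Y = X = 3, stride 2, dilation 1, pads 1/1
print(conv_out_size(71, 3, 2, 1, 1, 1))  # -> 36

# Second command: Hi = Wi = 14, stride 1
print(conv_out_size(14, 3, 1, 1, 1, 1))  # -> 14
```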
Result
Forward convolution, FP16, NCHW
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
layout: 0
in: dim 4, lengths {128, 192, 71, 71}, strides {967872, 5041, 71, 1}
wei: dim 4, lengths {256, 192, 3, 3}, strides {1728, 9, 3, 1}
out: dim 4, lengths {128, 256, 36, 36}, strides {331776, 1296, 36, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {2, 2, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r2_xdlops_nchw_kcyx_nkhw
a_k0_m_k1_grid_desc{216, 256, 8}
b_k0_n_k1_grid_desc{216, 165888, 8}
c_m_n_grid_desc{ 256, 165888}
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.4155 ms, 103.686 TFlop/s
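The reported rate can be cross-checked against the printed GEMM descriptors, assuming the usual 2 * N * K * C * Y * X * Ho * Wo FLOP count for convolution (a sketch, not part of the driver):

```python
# Cross-check of the first result: implicit-GEMM sizes and FLOP rate.
# Assumed convolution FLOP count: 2 * N * K * C * Y * X * Ho * Wo.
N, K, C, Y, X, Ho, Wo = 128, 256, 192, 3, 3, 36, 36

gemm_m = K            # 256      (matches c_m_n_grid_desc)
gemm_n = N * Ho * Wo  # 165888   (matches c_m_n_grid_desc)
gemm_k = C * Y * X    # 1728 = 216 * 8, i.e. K0 = 216, K1 = 8

flops = 2 * N * K * C * Y * X * Ho * Wo
avg_time_ms = 1.4155  # from the log above
tflops = flops / (avg_time_ms * 1e-3) / 1e12
print(f"{tflops:.3f} TFlop/s")  # -> 103.686 TFlop/s
```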
Forward convolution, FP16, NCHW
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
layout: 0
in: dim 4, lengths {256, 256, 14, 14}, strides {50176, 196, 14, 1}
wei: dim 4, lengths {1024, 256, 3, 3}, strides {2304, 9, 3, 1}
out: dim 4, lengths {256, 1024, 14, 14}, strides {200704, 196, 14, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r2_xdlops_nchw_kcyx_nkhw
a_k0_m_k1_grid_desc{288, 1024, 8}
b_k0_n_k1_grid_desc{288, 50176, 8}
c_m_n_grid_desc{ 1024, 50176}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 2.21357 ms, 106.959 TFlop/s
Forward convolution, FP16, NHWC
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {128, 71, 71, 192}, strides {967872, 13632, 192, 1}
wei: dim 4, lengths {256, 3, 3, 192}, strides {1728, 576, 192, 1}
out: dim 4, lengths {128, 36, 36, 256}, strides {331776, 9216, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {2, 2, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r4_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{216, 165888, 8}
b_k0_n_k1_grid_desc{216, 256, 8}
c_m_n_grid_desc{ 165888, 256}
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.12014 ms, 131.025 TFlop/s
Forward convolution, FP16, NHWC
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {256, 14, 14, 256}, strides {50176, 3584, 256, 1}
wei: dim 4, lengths {1024, 3, 3, 256}, strides {2304, 768, 256, 1}
out: dim 4, lengths {256, 14, 14, 1024}, strides {200704, 14336, 1024, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r4_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{288, 50176, 8}
b_k0_n_k1_grid_desc{288, 1024, 8}
c_m_n_grid_desc{ 50176, 1024}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.86877 ms, 126.693 TFlop/s
Backward data convolution, FP16, NHWC
./host/driver_offline/conv_bwd_driver_offline 1 1 0 3 0 1 256 256 1024 3 3 14 14 1 1 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {256, 14, 14, 1024}, strides {200704, 14336, 1024, 1}
wei: dim 4, lengths {256, 3, 3, 1024}, strides {9216, 3072, 1024, 1}
out: dim 4, lengths {256, 14, 14, 256}, strides {50176, 3584, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_backward_data_implicit_gemm_v4r1r2_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{288, 50176, 8}
b_k0_n_k1_grid_desc{288, 1024, 8}
c_m_n_grid_desc{ 50176, 1024}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 2.22461 ms, 106.428 TFlop/s
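Backward-data convolution performs the same number of multiply-accumulates as the matching forward convolution, so the same sanity check applies to this last result (a sketch under that assumption):

```python
# Backward-data convolution does the same number of multiply-accumulates
# as the matching forward convolution, so the same FLOP count applies.
N, K, C, Y, X, Ho, Wo = 256, 256, 1024, 3, 3, 14, 14
flops = 2 * N * K * C * Y * X * Ho * Wo
avg_time_ms = 2.22461  # from the log above
tflops = flops / (avg_time_ms * 1e-3) / 1e12
print(f"{tflops:.3f} TFlop/s")  # -> 106.428 TFlop/s
```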