Mirror of https://github.com/ROCm/composable_kernel.git
Synced 2026-05-14 02:02:46 +00:00 at commit f3d8052ad219afef6b48172dfa654027de27215f
* Squashed 'src/composable_kernel/' content from commit a4b211238
git-subtree-dir: src/composable_kernel
git-subtree-split: a4b21123849265d90a6b8fa86905a9a8ab253787
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* Squashed 'src/composable_kernel/' changes from a4b211238..5805b5dc4
5805b5dc4 Update develop (#5) (#6)
ede23b251 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
8b079b5c6 refactor
c3d788bfa refactor
fcf913481 rename
git-subtree-dir: src/composable_kernel
git-subtree-split: 5805b5dc442dd8d71295954c4a755a6ef30593bb
* fix
* refactor
* remove online compilation from CK
* refactor
* fix
* add ctest
* tidy
* add tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* add c-style pointer cast
* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
* fix clang warning suppression
* tidy
* suppress cppcheck
* fix enum issue
* revert changes to hip build
* fix kernel filename
* update CK build script
* rename
* rename
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* refactor
* refactor
* change cmakelist
* change ck common utility
* fix
* Squashed 'src/composable_kernel/' changes from 5805b5dc4..dd3d4444e
dd3d4444e Merge pull request #16 from ROCmSoftwarePlatform/develop
cb6b2dc63 Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
d9b2fcab4 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
57b74196a refactor
431c47bea refactor
9a0d05870 DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
bc4146402 use int instead of index_t in kernel wrapper
87a2fc094 compiler parameter use stream
24743c85e make inner product compatible on gfx900
7ad33d8e1 rename
5a3bace8d rename
12405c12a update CK build script
3c2effd43 fix kernel filename
12ff8d1ca fix enum issue
f0f97fd79 tidy
26f311aa9 fix clang warning suppression
c4f47ed09 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
35fd7bf79 add c-style pointer cast
9c31642f0 tidy
1a2efac60 tidy
ddd3b4e94 tidy
7daa0cfbf tidy
cab6e58d3 tidy
d9a8aebd8 tidy
533e356ce tidy
42639836b tidy
efe2836a2 add tidy
d53d7c666 fix
cf4ea1145 remove online compilation from CK
e63b17bdf refactor
5a2e56f78 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
702078bfd Merge pull request #7 from ROCmSoftwarePlatform/master
9ce85357a Update develop (#5)
10a172710 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
git-subtree-dir: src/composable_kernel
git-subtree-split: dd3d4444e9b9ed07a54f82d91d969770aa8d5074
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
* Fix with regard to implementing GetZeroVal() in both kernel and host
* Avoid converting to compType from dstDataType before writing the output value
* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
* Add CONSTANT decorator for descriptor read buffer
* Use get_thread_local_1d_id() for thread local Id
* Rename GetZeroVal() to GetReductionZeroVal() in the kernels
* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
* Occasional tiny simplification and update in the kernel files
* Update in src/reducetensor.cpp for consistent IDs passing to the kernel
* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
* Update to remove OpenCL tidy checking failures
* Small updates in src/reducetensor.cpp
* Update for better readability
* Remove unused codes and not-needed template parameters in the kernel wrappers
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>
[ROCm/composable_kernel commit: dfb80c4e39]
How to build and run
Docker
docker run \
-it \
--rm \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \
rocm/tensorflow:rocm4.2-tf2.4-dev \
/bin/bash
Install Boost for online compilation
https://www.boost.org/doc/libs/1_66_0/more/getting_started/unix-variants.html#easy-build-and-install
Build
Add the Boost library path:
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
mkdir build && cd build
Run cmake. The GPU target ID must be specified; the example below uses gfx908:
cmake \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_CXX_FLAGS="-DCK_AMD_GPU_GFX908 -O3 --amdgpu-target=gfx908 -mllvm --amdgpu-spill-vgpr-to-agpr=0 -gline-tables-only -save-temps=$PWD" \
-D HIP_ONLINE_COMPILER_FLAGS="-DCK_AMD_GPU_GFX908" \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_VERBOSE_MAKEFILE:BOOL=ON \
..
Build the drivers:
conv_fwd_driver_offline is the offline-compilation driver for forward convolution
conv_bwd_driver_offline is the offline-compilation driver for backward-data convolution
conv_fwd_driver_online is the online-compilation driver for forward convolution
make -j conv_fwd_driver_offline
make -j conv_bwd_driver_offline
make -j conv_fwd_driver_online
Run
- layout: 0 = NCHW; 1 = NHWC
- algo: algorithm
- verify: 0 = no verification; 1 = do verification
- init: initialization method (0 ~ 5)
- log: 0 = no log; 1 = do log
- repeat: number of times the kernel is launched
######################################################## layout algo verify init log repeat N__ K___ C___ Y X Hi_ Wi__ Strides Dilations LeftPads RightPads
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
./host/driver_offline/conv_bwd_driver_offline 1 5 0 0 0 1 256 256 1024 3 3 14 14 1 1 1 1 1 1 1 1
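The output spatial sizes printed in the results below follow the standard convolution output-size formula. A minimal sketch (the function name is illustrative, not part of the drivers):

```python
# Standard convolution output-size formula (per spatial dimension):
#   Ho = (Hi + left_pad + right_pad - (dilation * (Y - 1) + 1)) // stride + 1
def conv_out_size(hi, y, stride, dilation, left_pad, right_pad):
    return (hi + left_pad + right_pad - (dilation * (y - 1) + 1)) // stride + 1

# First command: Hi = Wi = 71, Y = X = 3, stride 2, dilation 1, pads 1/1
print(conv_out_size(71, 3, 2, 1, 1, 1))  # -> 36

# Second command: Hi = Wi = 14, stride 1
print(conv_out_size(14, 3, 1, 1, 1, 1))  # -> 14
```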
Result
Forward convolution, FP16, NCHW
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
layout: 0
in: dim 4, lengths {128, 192, 71, 71}, strides {967872, 5041, 71, 1}
wei: dim 4, lengths {256, 192, 3, 3}, strides {1728, 9, 3, 1}
out: dim 4, lengths {128, 256, 36, 36}, strides {331776, 1296, 36, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {2, 2, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r2_xdlops_nchw_kcyx_nkhw
a_k0_m_k1_grid_desc{216, 256, 8}
b_k0_n_k1_grid_desc{216, 165888, 8}
c_m_n_grid_desc{ 256, 165888}
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.4155 ms, 103.686 TFlop/s
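The reported rate can be cross-checked against the printed GEMM descriptors, assuming the usual 2 * N * K * C * Y * X * Ho * Wo FLOP count for convolution (a sketch, not part of the driver):

```python
# Cross-check of the first result: implicit-GEMM sizes and FLOP rate.
# Assumed convolution FLOP count: 2 * N * K * C * Y * X * Ho * Wo.
N, K, C, Y, X, Ho, Wo = 128, 256, 192, 3, 3, 36, 36

gemm_m = K            # 256      (matches c_m_n_grid_desc)
gemm_n = N * Ho * Wo  # 165888   (matches c_m_n_grid_desc)
gemm_k = C * Y * X    # 1728 = 216 * 8, i.e. K0 = 216, K1 = 8

flops = 2 * N * K * C * Y * X * Ho * Wo
avg_time_ms = 1.4155  # from the log above
tflops = flops / (avg_time_ms * 1e-3) / 1e12
print(f"{tflops:.3f} TFlop/s")  # -> 103.686 TFlop/s
```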
Forward convolution, FP16, NCHW
./host/driver_offline/conv_fwd_driver_offline 0 4 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
layout: 0
in: dim 4, lengths {256, 256, 14, 14}, strides {50176, 196, 14, 1}
wei: dim 4, lengths {1024, 256, 3, 3}, strides {2304, 9, 3, 1}
out: dim 4, lengths {256, 1024, 14, 14}, strides {200704, 196, 14, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r2_xdlops_nchw_kcyx_nkhw
a_k0_m_k1_grid_desc{288, 1024, 8}
b_k0_n_k1_grid_desc{288, 50176, 8}
c_m_n_grid_desc{ 1024, 50176}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 2.21357 ms, 106.959 TFlop/s
Forward convolution, FP16, NHWC
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 128 256 192 3 3 71 71 2 2 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {128, 71, 71, 192}, strides {967872, 13632, 192, 1}
wei: dim 4, lengths {256, 3, 3, 192}, strides {1728, 576, 192, 1}
out: dim 4, lengths {128, 36, 36, 256}, strides {331776, 9216, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {2, 2, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r4_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{216, 165888, 8}
b_k0_n_k1_grid_desc{216, 256, 8}
c_m_n_grid_desc{ 165888, 256}
launch_and_time_kernel: grid_dim {1296, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.12014 ms, 131.025 TFlop/s
Forward convolution, FP16, NHWC
./host/driver_offline/conv_fwd_driver_offline 1 5 0 0 0 1 256 1024 256 3 3 14 14 1 1 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {256, 14, 14, 256}, strides {50176, 3584, 256, 1}
wei: dim 4, lengths {1024, 3, 3, 256}, strides {2304, 768, 256, 1}
out: dim 4, lengths {256, 14, 14, 1024}, strides {200704, 14336, 1024, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_forward_implicit_gemm_v4r4r4_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{288, 50176, 8}
b_k0_n_k1_grid_desc{288, 1024, 8}
c_m_n_grid_desc{ 50176, 1024}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 1.86877 ms, 126.693 TFlop/s
Backward data convolution, FP16, NHWC
./host/driver_offline/conv_bwd_driver_offline 1 1 0 3 0 1 256 256 1024 3 3 14 14 1 1 1 1 1 1 1 1
layout: 1
in: dim 4, lengths {256, 14, 14, 1024}, strides {200704, 14336, 1024, 1}
wei: dim 4, lengths {256, 3, 3, 1024}, strides {9216, 3072, 1024, 1}
out: dim 4, lengths {256, 14, 14, 256}, strides {50176, 3584, 256, 1}
InLeftPads size 2, {1, 1, }
InRightPads size 2, {1, 1, }
ConvStrides size 2, {1, 1, }
ConvDilations size 2, {1, 1, }
device_convolution_backward_data_implicit_gemm_v4r1r2_xdlops_nhwc_kyxc_nhwk
a_k0_m_k1_grid_desc{288, 50176, 8}
b_k0_n_k1_grid_desc{288, 1024, 8}
c_m_n_grid_desc{ 50176, 1024}
launch_and_time_kernel: grid_dim {1568, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 1 times...
Average time : 2.22461 ms, 106.428 TFlop/s
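Backward-data convolution performs the same number of multiply-accumulates as the matching forward convolution, so the same sanity check applies to this last result (a sketch under that assumption):

```python
# Backward-data convolution does the same number of multiply-accumulates
# as the matching forward convolution, so the same FLOP count applies.
N, K, C, Y, X, Ho, Wo = 256, 256, 1024, 3, 3, 14, 14
flops = 2 * N * K * C * Y * X * Ho * Wo
avg_time_ms = 2.22461  # from the log above
tflops = flops / (avg_time_ms * 1e-3) / 1e12
print(f"{tflops:.3f} TFlop/s")  # -> 106.428 TFlop/s
```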