Commit Graph

473 Commits

Author SHA1 Message Date
Jing Zhang
ea6fa92eea updated bfloat16_to_float
[ROCm/composable_kernel commit: 89e1ebd4d5]
2021-11-16 18:01:25 +00:00
zjing14
456f5306df Add bfp16/int8 support into XDL GEMM operator (#50)
* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bfl16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* cleanb

* clean

* clean

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>

[ROCm/composable_kernel commit: 3737bb039a]
2021-11-15 10:24:39 -06:00
Chao Liu
8791d26e52 FP16 data in-register transpose (#41)
* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file

[ROCm/composable_kernel commit: b491ebf384]
2021-11-15 10:05:58 -06:00
Chao Liu
2f5ccb68f5 ckProfiler and device-level XDL GEMM operator (#48)
* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up

[ROCm/composable_kernel commit: e823d518cb]
2021-11-14 11:28:32 -06:00
ltqin
1f6ca26819 [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
* change method computering kpad

* remove unusing variable: batchlen

* change KPerBlock to K0PerBlock

* fix bug for k0 == k0perblock

* fix bug for get k0 index

* use math::integer_divide_ceil

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 6014185ac6]
2021-10-27 09:39:18 -05:00
Chao Liu
d3dc0bcbe3 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all
update ck from miopen ck_upstream

[ROCm/composable_kernel commit: 3e9113707f]
2021-10-27 09:07:38 -05:00
ltqin
1ae6a49f43 Merge branch 'develop' into miopen_downstream_all
[ROCm/composable_kernel commit: 211dae8229]
2021-10-27 13:34:19 +08:00
Jun Liu
0c17233608 [Composable Kernel] update develop branch code to ck_upstream
Merge pull request #1236 from ROCmSoftwarePlatform/develop

[ROCm/composable_kernel commit: 5890e30076]
2021-10-25 19:49:17 -07:00
Chao Liu
3fcbcb776c fix bug in gridwise gemm xdlops v2r3 (#45)
[ROCm/composable_kernel commit: d5297abae9]
2021-10-21 16:42:24 -05:00
Chao Liu
02fc6ba269 bug fix (#39)
[ROCm/composable_kernel commit: c3018794b4]
2021-10-19 18:43:10 -05:00
ltqin
0d74bff825 add nchw atomic , nhwc and nhwc atomic method for backward weight (#30)
* add add new algorithm from v4r4r2

* program once issue

* add split k functiion

* redefine code

* add a matrix unmerge

* add b matrix unmerge k0

* trans a and b to gridegemm

* nhwc init

* no hacks and vector load

* add hacks

* modify some parameter

* fix tuning prometer for fp32

* fix tuning prometer for fp16

* start change gridwise k split

* init ok

* revome a b matrix k0mk1 desc in grid

* carewrite lculate gridsize

* add kbatch to CalculateBottomIndex

* remove some unused funtion

* add clear data function before call kernel

* out hacks

* in hacks

* rename device convolution file and function name

* modify kBatch value

* fix some tuning code

* start from v4r4 nhwc

* nhwc atomic is able to run

* just for fp32

* enable nchw atomic

* tweak

* tweak

* re-arrange gridwise gemm hot loop for wrw

* add wrw v4r5

* v4r4r5 fp16

* v4r4r4 fp16

* v4r4r2 fp16

* V4R4R4XDLNHWC fp16

* V4R4R2XDLATOMICNCHW fp16

* adjust for fp16

* input gridsize

* change kbatch to gridsize

* testing wrw

* clean up

* k_batch to gridsize

* fix bug

* wrw v4r4r4 kbatch change to gride size

* wrw v4r4r2 kbatch change to gride size

* after merge , change gridwise gemm v2r4

* change MakeCBlockClusterAdaptor

* other method use new gridwise gemm

* clean up

* chapad method nge to make_right_pad_transform

* kbatch out from transform function

* clean up and fix bug

* fix bug

* using function type reduce template parameters

* using auto replace define fuction type

* clean up

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

[ROCm/composable_kernel commit: fd49ff8080]
2021-10-19 18:42:34 -05:00
Qianfeng
afe31f1e41 [MIOpen Downstream] Fix Reduction Kernel (#34)
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid convert to compType from dstDataType before writting the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: b2dc55f82c]
2021-10-06 14:43:17 -05:00
Chao Liu
720cf3d6b2 Tweak GEMM kernel (#38)
* add parameters

* tweak gemm

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* clean up

[ROCm/composable_kernel commit: b3e8d57d51]
2021-10-06 11:12:36 -05:00
zjing14
8159394bfa Add VectorType support into StaticBuffer (#27)
* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* adjust blockwise_gemm_xdlops

* reorder ops in GEMM hot loop

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 846f462bd4]
2021-10-06 10:13:52 -05:00
Qianfeng
d1c185cde7 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156)
* Squashed 'src/composable_kernel/' content from commit aa8c98119

git-subtree-dir: src/composable_kernel
git-subtree-split: aa8c981198

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from aa8c98119..1d8dbe3c5

1d8dbe3c5 Update develop (#5) (#6)
8ce0728ae Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
f017e3448 refactor
9eb35eec8 refactor
041c48a06 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 1d8dbe3c57

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* tidy

* add tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert chagnes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make innner product compatiable on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

* Squashed 'src/composable_kernel/' changes from 1d8dbe3c5..887df7b12

887df7b12 Merge pull request #16 from ROCmSoftwarePlatform/develop
7e6b9fb7a Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
833701f40 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
e25c4c2f1 refactor
27048b771 refactor
65e834905 DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
b3759bf6a use int instead of index_t in kernel wrapper
04ed8ddf4 compiler parameter use stream
9f40048d1 make innner product compatiable on gfx900
f7df8c7ee rename
1e312fef1 rename
c9869a5ac update CK build script
c825eb6b1 fix kernel filename
594b1cf91 fix enum issue
286475c6b tidy
a7c943aba fix clang warning suppression
d49e0ddcb vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
314b9d78e add c-style pointer cast
d4b35bd09 tidy
cb2edf210 tidy
4771cfa34 tidy
eb7f9f35b tidy
b14b5d337 tidy
9c589af82 tidy
e8def0e77 tidy
9e2c3c776 tidy
51ab4abaf add tidy
cba13cb6b fix
5ed1b840a remove online compilation from CK
5856acc10 refactor
7221bedc9 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
0bb6c85c2 Merge pull request #7 from ROCmSoftwarePlatform/master
a0b9a203a Update develop (#5)
898807d60 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

git-subtree-dir: src/composable_kernel
git-subtree-split: 887df7b129

* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid convert to compType from dstDataType before writting the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update in src/reducetensor.cpp for consistent IDs passing to the kernel

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Small updates in src/reducetensor.cpp

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

[ROCm/composable_kernel commit: dfb80c4e39]
2021-09-29 08:12:11 -07:00
Jun Liu
47bf3c6a3d Merge pull request #1165 from ROCmSoftwarePlatform/develop
Merge develop into CK_upstream (Please don't squash when merging)

[ROCm/composable_kernel commit: 8557901d02]
2021-09-21 15:52:12 -07:00
Chao Liu
c420cdb1a4 Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr
[MIOpen Downstream] Dynamic Reduction PR

[ROCm/composable_kernel commit: f305bebdc3]
2021-09-21 11:59:23 -05:00
Chao Liu
2a43644437 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
[ROCm/composable_kernel commit: b725e3fc84]
2021-09-21 11:55:26 -05:00
Chao Liu
78178aecec :Merge remote-tracking branch 'origin/develop' into CK_upstream
[ROCm/composable_kernel commit: df0d68106e]
2021-09-20 20:44:01 -05:00
Chao Liu
f1d7806427 Add a version of Merge transform that use integerdivision and mod (#25)
* add Merg_v3_division_mod

* refactor

[ROCm/composable_kernel commit: f3acd2510b]
2021-09-05 12:57:57 -05:00
Chao Liu
079adb1e7d GEMM driver and kernel (#29)
* add gemm driver

* tweak

* add gemm kernel: mk_kn_mn and km_kn_mn

* tweak

* add GEMM km_nk_mn

* fix comment

[ROCm/composable_kernel commit: 19613902b5]
2021-09-05 12:41:28 -05:00
ltqin
2f4f6427f5 Backward weight v4r4r2 with xdlops (#18)
* start

* modify transformat

* modify device convolutiion

* modify host

* added host conv bwd and wrw

* remove bwd, seperate wrw

* clean

* hacall k to zero

* out log

* fixed

* fixed

* change to (out in wei)

* input hack

* hack to out

* format

* fix by comments

* change wei hacks(wei transform has not merge)

* fix program once issue

* fix review comment

* fix vector load issue

* tweak

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 627d8ef35a]
2021-08-30 22:49:17 -05:00
Chao Liu
a44dd0d851 Misc fixes (#24)
* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamcBuffer and buffer_load take customized invalid-element-value, add buffer_load/store for fp64

* use remove_cvref_t

[ROCm/composable_kernel commit: 10bb811060]
2021-08-26 20:05:19 -05:00
Qianfeng
b315c39b11 [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108)
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)

* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction

* Update to generic reduction C++ host interface layer to support dynamic generic reduction

* Update to remove tidy complaints in host interface layer

* Change the unary operator form from void op(T &x) to T op(T x)

* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)

* Use cppcheck-suppress to prevent some strange warnings

* Re-use operator [] and () for DynamicBuffer and update to depending codes

* Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers

* [performance] Remove un-needed local buffer initialization

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

[ROCm/composable_kernel commit: 9e80cdceb7]
2021-08-26 18:04:55 -07:00
zjing14
20e76381da GlobalAtomicAdd for fp32/int32 (#23)
* add f32/i32 atomicAdd support into dynamicBuffer, and enable it in v1r3

* fixed

* fixed

* update comment

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: a7a758d8ce]
2021-08-25 10:55:55 -05:00
zjing14
14e4d7ba7d Xdlops refactor fix (#22)
* added constexpr ahead of adptor; clean unused driver; rename M/NPerWave to M/NPerXDL

* fixed bwd

* fixed comment

[ROCm/composable_kernel commit: 9d3f634a3c]
2021-08-23 11:22:10 -05:00
Chao Liu
eda5a8852c magic division use __umulhi() (#19)
[ROCm/composable_kernel commit: c6f26bb480]
2021-08-23 10:40:27 -05:00
Chao Liu
ee428d2d6f Composable kernel init integration v3 (#1097)
* Squashed 'src/composable_kernel/' content from commit f6edda611

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c

5781adf5c Update develop (#5) (#6)
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert chagnes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make innner product compatiable on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

Co-authored-by: JD <Jehandad.Khan@amd.com>

[ROCm/composable_kernel commit: 6fe3627a9e]
2021-08-19 10:55:03 -05:00
zjing14
6dde0995c5 refactor dynamic xdlops iGemm (#13)
* xdlops refactor

* fixed commnt

* clean xdlops_gemm

* add make c into xldops-gemm

* change mfma_info

* refactor xdlops, hide c desc

* clean

* clean

* clean

* apply hacks changes to v4r4r4_nhwc

* rename hacks and use single stage adapter

* enable fp16 mfma

[ROCm/composable_kernel commit: a2ad6d3531]
2021-08-19 09:54:10 -05:00
zjing14
f8e4daa52c Added host_conv_wrw for verification (#15)
* added host conv wrw

[ROCm/composable_kernel commit: ba6f79a75e]
2021-08-19 01:00:41 -05:00
Chao Liu
887df7b129 Merge pull request #16 from ROCmSoftwarePlatform/develop
Merge develop into master

[ROCm/composable_kernel commit: 31b403526e]
2021-08-18 11:22:34 -05:00
Chao Liu
7e6b9fb7a6 Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
MIOpen Downstream: Initial integration 2nd PR

[ROCm/composable_kernel commit: b62bf8c3f8]
2021-08-16 16:39:40 -05:00
Chao Liu
833701f408 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
[ROCm/composable_kernel commit: ccc4a1d365]
2021-08-16 16:28:53 -05:00
Chao Liu
e25c4c2f15 refactor
[ROCm/composable_kernel commit: 67ad47e7c1]
2021-08-16 21:01:33 +00:00
Chao Liu
27048b7714 refactor
[ROCm/composable_kernel commit: 16effa767c]
2021-08-16 20:36:47 +00:00
Chao Liu
65e8349051 DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
[ROCm/composable_kernel commit: a91b68dfcb]
2021-08-13 23:40:19 +00:00
Chao Liu
b3759bf6a2 use int instead of index_t in kernel wrapper
[ROCm/composable_kernel commit: 2cbabbba54]
2021-08-13 20:55:39 +00:00
Chao Liu
04ed8ddf43 compiler parameter use stream
[ROCm/composable_kernel commit: 0834bc7635]
2021-08-13 01:05:14 +00:00
Chao Liu
9f40048d1c make innner product compatiable on gfx900
[ROCm/composable_kernel commit: f2ac7832c6]
2021-08-11 09:42:53 -05:00
Chao Liu
f7df8c7eea rename
[ROCm/composable_kernel commit: 4e57b30a6a]
2021-08-11 00:08:42 +00:00
Chao Liu
1e312fef12 rename
[ROCm/composable_kernel commit: c03045ce2d]
2021-08-10 23:45:36 +00:00
Chao Liu
c9869a5ace update CK build script
[ROCm/composable_kernel commit: b2589957f3]
2021-08-10 22:19:13 +00:00
Chao Liu
c825eb6b1c fix kernel filename
[ROCm/composable_kernel commit: 2c48039d0e]
2021-08-10 22:15:23 +00:00
Chao Liu
594b1cf918 fix enum issue
[ROCm/composable_kernel commit: d626dccc95]
2021-08-10 20:55:13 +00:00
Chao Liu
286475c6b6 tidy
[ROCm/composable_kernel commit: 643ebd4f3e]
2021-08-10 07:07:11 +00:00
Chao Liu
a7c943abab fix clang warning suppression
[ROCm/composable_kernel commit: ddd49ec9e7]
2021-08-10 06:20:24 +00:00
Chao Liu
d49e0ddcb2 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
[ROCm/composable_kernel commit: 4f566c6221]
2021-08-10 05:55:20 +00:00
Chao Liu
314b9d78ef add c-style pointer cast
[ROCm/composable_kernel commit: 172036d728]
2021-08-10 00:01:52 -05:00
Chao Liu
d4b35bd09f tidy
[ROCm/composable_kernel commit: 76f3131939]
2021-08-09 18:49:59 -05:00
Chao Liu
cb2edf2100 tidy
[ROCm/composable_kernel commit: d18428901e]
2021-08-09 18:20:02 -05:00