Chao Liu
a4f24233e5
manually apply bug fix changes in pr #63 ( #64 )
...
* Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
* Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"
2021-12-12 18:05:51 -06:00
Chao Liu
fd3d907a80
fix ReLU formula ( #61 )
...
* fix relu
* clean up
* clean up
2021-12-04 16:05:29 -06:00
Chao Liu
41cdd3801a
GEMM/Conv+BiasAdd+ReLU+Add ( #55 )
...
* gemm+activation
* move C pointwise operation into threadwise copy
* add pointwise operation to A/B matrix
* update ckProfiler
* adding bias add
* adding bias add
* adding bias add
* added bias add; worked around compiler issues
* clean up
* clean up
* Update README.md
* Update README.md
* Update README.md
* clean up
* add conv_xdl example
* adding conv_xdl_bias_relu_add example
* add conv+bias+relu+add, but has register spill issue
* tweak
* tweak
* refactor
* Update README.md
update readme for example/2_gemm_xdl_bias_relu_add
* clean up
* Update README.md
update readme for example/3_conv_xdl
* Update README.md
2021-12-02 20:07:37 -06:00
Jing Zhang
d7a0a3f94c
renaming/comments
2021-12-02 23:37:57 +00:00
Jing Zhang
2cbb897638
add static_buffer_v2 zero out
2021-12-02 05:54:19 +00:00
Jing Zhang
d798c9b8c6
fixed c_buffer alloc
2021-12-02 05:03:03 +00:00
Chao Liu
4041850f11
fix layout naming convention ( #56 )
2021-11-30 09:10:55 -06:00
Chao Liu
237d4ca03f
added test for magic number division ( #58 )
2021-11-30 09:09:28 -06:00
zjing14
567f5e9c5f
add args for packed gemm ( #54 )
2021-11-24 12:33:55 -06:00
Chao Liu
64350affc5
Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars ( #53 )
...
* reworking vector_type
* use __builtin_memcpy for bit_cast and vector access of scalar pointer
* clean up
2021-11-18 09:11:15 -06:00
zjing14
970fa3e92e
v5r1 fusion kernels for inference ( #49 )
...
* init
* refactor for 1x1
* rename e0_e1
* add e1 with bugs
* debug
* fixed
* fixed e1
* add timer
* imprve threadwise gemm with dot2
* add e2
* tuning
* seperate c2
* add nhwc
* restore nchwc
* clean
* opt
* fixed; tuning
* add BGlobalMoveSliceWindowStepHacks{}
* tuning
* repeat running
* adjust
* merge v5r1 nchwc
* add adaptors
* split k0 k1 in c_thread_grid
* split h and w
* remove v5r1 nhwc
* clean for pr
* remove host_conv_add
* clean code
* clean
* add dynamic support
* static mode
* test static
* add conv+add fusion
* fixed validation
* naming fix
* use activ_enum
* make static
* refactor conv_add for InMem::add
* add bias
* add conv_out
* add configurable makeddesc
* add maxpool fusion
* add maxpool host for validation
* enable static desc
* conv-only use v5r1_add
* test
* test
* for binary dumps
* fixed incorrect results due to typo
* clean
* debugging maxpool
* workaround with offset trick
* clean code
* modularize ops of fusion
* add gridwise_gemm_v3
* create seperate fusion fun
* enable dynamic mode of conv and conv+resize_add
* add dynamic mode of maxpool
* add pass by point
* add activ_type as arguments
* merge develop
* clean
* reset config to old default
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-11-18 08:34:07 -06:00
zjing14
a651ea4f7a
Fixed bfp16 host_conv_fwd ( #52 )
...
* fixed bfloat16 issues
* refactor type_convert
* fixed host_convolution_forward for ushort
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-11-18 08:10:56 -06:00
zjing14
0a66c54e95
fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler ( #51 )
...
* fixed bfloat16 issues
* refactor type_convert
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-11-16 15:44:17 -06:00
Jing Zhang
89e1ebd4d5
updated bfloat16_to_float
2021-11-16 18:01:25 +00:00
zjing14
3737bb039a
Add bfp16/int8 support into XDL GEMM operator ( #50 )
...
* init StaticBufferV2
* clean
* adopt old output stage for staticBufferV2
* clean
* remove hack
* clean
* clean
* add parameters
* clean code
* move c_buffer alloc into blockwise gemm
* add adaptors for m/n_thread_data_on_grid
* tweak gemm
* adjust blockwise_gemm_xdlops
* tweak
* update conv
* update script
* adding bwd 1x1
* update script
* adding 1x1 bwd
* debugging bwd 1x1 failure
* update script
* update script
* test
* test v100
* add bf16_1k
* clang-format
* clean
* add bfp16 for gfx908
* add verification
* clean up
* clean code
* restore bfl16
* clean
* add bfp16 support into gemm_driver
* apply new generator to other drivers
* add int8 support
* cleanb
* clean
* clean
* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com >
Co-authored-by: Chao Liu <lc.roy86@gmail.com >
Co-authored-by: root <root@hayabusa6111.amd.com >
2021-11-15 10:24:39 -06:00
Chao Liu
b491ebf384
FP16 data in-register transpose ( #41 )
...
* start fixing 16bit data packing
* adding StaticTensor
* adding StaticTensor
* adding StaticTensor
* add missing constexpr
* adding static tensor
* adding static tensor
* adding transpose
* add inline asm for transpose 2x2 of half_t
* add general transpose_vectors(), but have unnecessary register initialization using v_mov
* fix unnecessary register initialization in transpose_vector by using more pass-by-reference
* add hardcoded logic for NHWC wrw
* improve asm for v_pack
* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
* tweak
* reorganize file
2021-11-15 10:05:58 -06:00
Chao Liu
e823d518cb
ckProfiler and device-level XDL GEMM operator ( #48 )
...
* add DeviceGemmXdl
* update script
* fix naming issue
* fix comment
* output HostTensorDescriptor
* rename
* padded GEMM for fwd v4r4r4 nhwc
* refactor
* refactor
* refactor
* adding ckProfiler
* adding ckProfiler
* refactor
* fix tuning parameter bug
* add more gemm instances
* add more fp16 GEMM instances
* fix profiler driver
* fix bug in tuning parameter
* add fp32 gemm instances
* small fix
* refactor
* rename
* refactor gemm profiler; adding DeviceConv and conv profiler
* refactor
* fix
* add conv profiler
* refactor
* adding more GEMM and Conv instance
* Create README.md
Add build instruction for ckProfiler
* Create README.md
Add Readme for gemm_xdl example
* Update README.md
Remove build instruction from top most folder
* Update README.md
* clean up
2021-11-14 11:28:32 -06:00
ltqin
6014185ac6
[Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue ( #44 )
...
* change method computering kpad
* remove unusing variable: batchlen
* change KPerBlock to K0PerBlock
* fix bug for k0 == k0perblock
* fix bug for get k0 index
* use math::integer_divide_ceil
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-10-27 09:39:18 -05:00
Chao Liu
3e9113707f
Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all
...
update ck from miopen ck_upstream
2021-10-27 09:07:38 -05:00
ltqin
211dae8229
Merge branch 'develop' into miopen_downstream_all
2021-10-27 13:34:19 +08:00
Jun Liu
5890e30076
[Composable Kernel] update develop branch code to ck_upstream
...
Merge pull request #1236 from ROCmSoftwarePlatform/develop
2021-10-25 19:49:17 -07:00
Chao Liu
d5297abae9
fix bug in gridwise gemm xdlops v2r3 ( #45 )
2021-10-21 16:42:24 -05:00
Chao Liu
c3018794b4
bug fix ( #39 )
2021-10-19 18:43:10 -05:00
ltqin
fd49ff8080
add nchw atomic , nhwc and nhwc atomic method for backward weight ( #30 )
...
* add add new algorithm from v4r4r2
* program once issue
* add split k functiion
* redefine code
* add a matrix unmerge
* add b matrix unmerge k0
* trans a and b to gridegemm
* nhwc init
* no hacks and vector load
* add hacks
* modify some parameter
* fix tuning prometer for fp32
* fix tuning prometer for fp16
* start change gridwise k split
* init ok
* revome a b matrix k0mk1 desc in grid
* carewrite lculate gridsize
* add kbatch to CalculateBottomIndex
* remove some unused funtion
* add clear data function before call kernel
* out hacks
* in hacks
* rename device convolution file and function name
* modify kBatch value
* fix some tuning code
* start from v4r4 nhwc
* nhwc atomic is able to run
* just for fp32
* enable nchw atomic
* tweak
* tweak
* re-arrange gridwise gemm hot loop for wrw
* add wrw v4r5
* v4r4r5 fp16
* v4r4r4 fp16
* v4r4r2 fp16
* V4R4R4XDLNHWC fp16
* V4R4R2XDLATOMICNCHW fp16
* adjust for fp16
* input gridsize
* change kbatch to gridsize
* testing wrw
* clean up
* k_batch to gridsize
* fix bug
* wrw v4r4r4 kbatch change to gride size
* wrw v4r4r2 kbatch change to gride size
* after merge , change gridwise gemm v2r4
* change MakeCBlockClusterAdaptor
* other method use new gridwise gemm
* clean up
* chapad method nge to make_right_pad_transform
* kbatch out from transform function
* clean up and fix bug
* fix bug
* using function type reduce template parameters
* using auto replace define fuction type
* clean up
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Chao Liu <chao.liu2@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
2021-10-19 18:42:34 -05:00
Qianfeng
b2dc55f82c
[MIOpen Downstream] Fix Reduction Kernel ( #34 )
...
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
* Fix with regard to implementing GetZeroVal() in both kernel and host
* Avoid convert to compType from dstDataType before writting the output value
* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
* Add CONSTANT decorator for descriptor read buffer
* Use get_thread_local_1d_id() for thread local Id
* Rename GetZeroVal() to GetReductionZeroVal() in the kernels
* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
* Occasional tiny simplification and update in the kernel files
* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
* Update to remove OpenCL tidy checking failures
* Update for better readability
* Remove unused codes and not-needed template parameters in the kernel wrappers
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-10-06 14:43:17 -05:00
Chao Liu
b3e8d57d51
Tweak GEMM kernel ( #38 )
...
* add parameters
* tweak gemm
* tweak
* update conv
* update script
* adding bwd 1x1
* update script
* adding 1x1 bwd
* debugging bwd 1x1 failure
* update script
* update script
* test
* test v100
* clean up
2021-10-06 11:12:36 -05:00
zjing14
846f462bd4
Add VectorType support into StaticBuffer ( #27 )
...
* init StaticBufferV2
* clean
* adopt old output stage for staticBufferV2
* clean
* remove hack
* clean
* clean
* clean code
* move c_buffer alloc into blockwise gemm
* add adaptors for m/n_thread_data_on_grid
* adjust blockwise_gemm_xdlops
* reorder ops in GEMM hot loop
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-10-06 10:13:52 -05:00
Qianfeng
dfb80c4e39
[Enhancements] Several bugfixes and refactoring of dynamic generic reduction ( #1156 )
...
* Squashed 'src/composable_kernel/' content from commit f6edda611
git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c
5781adf5c Update develop (#5 ) (#6 )
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename
git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf
* fix
* refactor
* remove online compilation from CK
* refactor
* fix
* add ctest
* tidy
* add tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* tidy
* add c-style pointer cast
* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
* fix clang warning suppression
* tidy
* suppress cppcheck
* fix enum issue
* revert chagnes to hip build
* fix kernel filename
* update CK build script
* rename
* rename
* make innner product compatiable on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com >
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* refactor
* refactor
* change cmakelist
* change ck common utility
* fix
* Squashed 'src/composable_kernel/' changes from 5781adf5c..31b403526
31b403526 Merge pull request #16 from ROCmSoftwarePlatform/develop
b62bf8c3f Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
ccc4a1d36 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
67ad47e7c refactor
16effa767 refactor
a91b68dfc DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
2cbabbba5 use int instead of index_t in kernel wrapper
0834bc763 compiler parameter use stream
f2ac7832c make innner product compatiable on gfx900
4e57b30a6 rename
c03045ce2 rename
b2589957f update CK build script
2c48039d0 fix kernel filename
d626dccc9 fix enum issue
643ebd4f3 tidy
ddd49ec9e fix clang warning suppression
4f566c622 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
172036d72 add c-style pointer cast
76f313193 tidy
d18428901 tidy
f885c131d tidy
80120f0a0 tidy
c3efeb5e2 tidy
56fc0842b tidy
54fba515b tidy
e62bae7a4 tidy
24c872894 add tidy
61487e0a0 fix
ae98b52ad remove online compilation from CK
cb9542131 refactor
73ca97015 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
3b8664611 Merge pull request #7 from ROCmSoftwarePlatform/master
d09ea4f4e Update develop (#5 )
3d32ae940 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
git-subtree-dir: src/composable_kernel
git-subtree-split: 31b403526e
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel
* Fix with regard to implementing GetZeroVal() in both kernel and host
* Avoid convert to compType from dstDataType before writting the output value
* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator
* Add CONSTANT decorator for descriptor read buffer
* Use get_thread_local_1d_id() for thread local Id
* Rename GetZeroVal() to GetReductionZeroVal() in the kernels
* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp
* Occasional tiny simplification and update in the kernel files
* Update in src/reducetensor.cpp for consistent IDs passing to the kernel
* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers
* Update to remove OpenCL tidy checking failures
* Small updates in src/reducetensor.cpp
* Update for better readability
* Remove unused codes and not-needed template parameters in the kernel wrappers
Co-authored-by: Chao Liu <chao.liu2@amd.com >
Co-authored-by: JD <Jehandad.Khan@amd.com >
2021-09-29 08:12:11 -07:00
Jun Liu
8557901d02
Merge pull request #1165 from ROCmSoftwarePlatform/develop
...
Merge develop into CK_upstream (Please don't squash when merging)
2021-09-21 15:52:12 -07:00
Chao Liu
f305bebdc3
Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr
...
[MIOpen Downstream] Dynamic Reduction PR
2021-09-21 11:59:23 -05:00
Chao Liu
b725e3fc84
Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
2021-09-21 11:55:26 -05:00
Chao Liu
df0d68106e
:Merge remote-tracking branch 'origin/develop' into CK_upstream
2021-09-20 20:44:01 -05:00
Chao Liu
f3acd2510b
Add a version of Merge transform that use integerdivision and mod ( #25 )
...
* add Merg_v3_division_mod
* refactor
2021-09-05 12:57:57 -05:00
Chao Liu
19613902b5
GEMM driver and kernel ( #29 )
...
* add gemm driver
* tweak
* add gemm kernel: mk_kn_mn and km_kn_mn
* tweak
* add GEMM km_nk_mn
* fix comment
2021-09-05 12:41:28 -05:00
ltqin
627d8ef35a
Backward weight v4r4r2 with xdlops ( #18 )
...
* start
* modify transformat
* modify device convolutiion
* modify host
* added host conv bwd and wrw
* remove bwd, seperate wrw
* clean
* hacall k to zero
* out log
* fixed
* fixed
* change to (out in wei)
* input hack
* hack to out
* format
* fix by comments
* change wei hacks(wei transform has not merge)
* fix program once issue
* fix review comment
* fix vector load issue
* tweak
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-08-30 22:49:17 -05:00
Chao Liu
10bb811060
Misc fixes ( #24 )
...
* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamcBuffer and buffer_load take customized invalid-element-value, add buffer_load/store for fp64
* use remove_cvref_t
2021-08-26 20:05:19 -05:00
Qianfeng
9e80cdceb7
[SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction ( #1108 )
...
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* make inner product compatible on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)
* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction
* Update to generic reduction C++ host interface layer to support dynamic generic reduction
* Update to remove tidy complaints in host interface layer
* Change the unary operator form from void op(T &x) to T op(T x)
* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)
* Use cppcheck-suppress to prevent some strange warnings
* Re-use operator [] and () for DynamicBuffer and update to depending codes
* Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers
* [performance] Remove un-needed local buffer initialization
Co-authored-by: Chao Liu <chao.liu2@amd.com >
Co-authored-by: JD <Jehandad.Khan@amd.com >
2021-08-26 18:04:55 -07:00
zjing14
a7a758d8ce
GlobalAtomicAdd for fp32/int32 ( #23 )
...
* add f32/i32 atomicAdd support into dynamicBuffer, and enable it in v1r3
* fixed
* fixed
* update comment
Co-authored-by: Chao Liu <chao.liu2@amd.com >
2021-08-25 10:55:55 -05:00
zjing14
9d3f634a3c
Xdlops refactor fix ( #22 )
...
* added constexpr ahead of adptor; clean unused driver; rename M/NPerWave to M/NPerXDL
* fixed bwd
* fixed comment
2021-08-23 11:22:10 -05:00
Chao Liu
c6f26bb480
magic division use __umulhi() ( #19 )
2021-08-23 10:40:27 -05:00
Chao Liu
6fe3627a9e
Composable kernel init integration v3 ( #1097 )
...
* Squashed 'src/composable_kernel/' content from commit f6edda611
git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files
* Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c
5781adf5c Update develop (#5 ) (#6 )
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename
git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf
* fix
* refactor
* remove online compilation from CK
* refactor
* fix
* add ctest
* add c-style pointer cast
* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
* fix clang warning suppression
* tidy
* suppress cppcheck
* fix enum issue
* revert chagnes to hip build
* fix kernel filename
* update CK build script
* rename
* rename
* make innner product compatiable on gfx900
* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com >
* compiler parameter use stream
* use int instead of index_t in kernel wrapper
* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
* refactor
* refactor
* change cmakelist
* change ck common utility
* fix
Co-authored-by: JD <Jehandad.Khan@amd.com >
2021-08-19 10:55:03 -05:00
zjing14
a2ad6d3531
refactor dynamic xdlops iGemm ( #13 )
...
* xdlops refactor
* fixed commnt
* clean xdlops_gemm
* add make c into xldops-gemm
* change mfma_info
* refactor xdlops, hide c desc
* clean
* clean
* clean
* apply hacks changes to v4r4r4_nhwc
* rename hacks and use single stage adapter
* enable fp16 mfma
2021-08-19 09:54:10 -05:00
zjing14
ba6f79a75e
Added host_conv_wrw for verification ( #15 )
...
* added host conv wrw
2021-08-19 01:00:41 -05:00
Chao Liu
31b403526e
Merge pull request #16 from ROCmSoftwarePlatform/develop
...
Merge develop into master
2021-08-18 11:22:34 -05:00
Chao Liu
b62bf8c3f8
Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
...
MIOpen Downstream: Initial integration 2nd PR
2021-08-16 16:39:40 -05:00
Chao Liu
ccc4a1d365
Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
2021-08-16 16:28:53 -05:00
Chao Liu
67ad47e7c1
refactor
2021-08-16 21:01:33 +00:00
Chao Liu
16effa767c
refactor
2021-08-16 20:36:47 +00:00
Chao Liu
a91b68dfcb
DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
2021-08-13 23:40:19 +00:00
Chao Liu
2cbabbba54
use int instead of index_t in kernel wrapper
2021-08-13 20:55:39 +00:00