Commit Graph

181 Commits

Author SHA1 Message Date
Jing Zhang
89e1ebd4d5 updated bfloat16_to_float 2021-11-16 18:01:25 +00:00
zjing14
3737bb039a Add bfp16/int8 support into XDL GEMM operator (#50)
* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bfl16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* cleanb

* clean

* clean

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>
2021-11-15 10:24:39 -06:00
Chao Liu
b491ebf384 FP16 data in-register transpose (#41)
* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file
2021-11-15 10:05:58 -06:00
Chao Liu
e823d518cb ckProfiler and device-level XDL GEMM operator (#48)
* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up
2021-11-14 11:28:32 -06:00
ltqin
6014185ac6 [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
* change method computering kpad

* remove unusing variable: batchlen

* change KPerBlock to K0PerBlock

* fix bug for k0 == k0perblock

* fix bug for get k0 index

* use math::integer_divide_ceil

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-10-27 09:39:18 -05:00
Chao Liu
d5297abae9 fix bug in gridwise gemm xdlops v2r3 (#45) 2021-10-21 16:42:24 -05:00
Chao Liu
c3018794b4 bug fix (#39) 2021-10-19 18:43:10 -05:00
ltqin
fd49ff8080 add nchw atomic , nhwc and nhwc atomic method for backward weight (#30)
* add add new algorithm from v4r4r2

* program once issue

* add split k functiion

* redefine code

* add a matrix unmerge

* add b matrix unmerge k0

* trans a and b to gridegemm

* nhwc init

* no hacks and vector load

* add hacks

* modify some parameter

* fix tuning prometer for fp32

* fix tuning prometer for fp16

* start change gridwise k split

* init ok

* revome a b matrix k0mk1 desc in grid

* carewrite lculate gridsize

* add kbatch to CalculateBottomIndex

* remove some unused funtion

* add clear data function before call kernel

* out hacks

* in hacks

* rename device convolution file and function name

* modify kBatch value

* fix some tuning code

* start from v4r4 nhwc

* nhwc atomic is able to run

* just for fp32

* enable nchw atomic

* tweak

* tweak

* re-arrange gridwise gemm hot loop for wrw

* add wrw v4r5

* v4r4r5 fp16

* v4r4r4 fp16

* v4r4r2 fp16

* V4R4R4XDLNHWC fp16

* V4R4R2XDLATOMICNCHW fp16

* adjust for fp16

* input gridsize

* change kbatch to gridsize

* testing wrw

* clean up

* k_batch to gridsize

* fix bug

* wrw v4r4r4 kbatch change to gride size

* wrw v4r4r2 kbatch change to gride size

* after merge , change gridwise gemm v2r4

* change MakeCBlockClusterAdaptor

* other method use new gridwise gemm

* clean up

* chapad method nge to make_right_pad_transform

* kbatch out from transform function

* clean up and fix bug

* fix bug

* using function type reduce template parameters

* using auto replace define fuction type

* clean up

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2021-10-19 18:42:34 -05:00
Qianfeng
b2dc55f82c [MIOpen Downstream] Fix Reduction Kernel (#34)
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid convert to compType from dstDataType before writting the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-10-06 14:43:17 -05:00
Chao Liu
b3e8d57d51 Tweak GEMM kernel (#38)
* add parameters

* tweak gemm

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* clean up
2021-10-06 11:12:36 -05:00
zjing14
846f462bd4 Add VectorType support into StaticBuffer (#27)
* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* adjust blockwise_gemm_xdlops

* reorder ops in GEMM hot loop

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-10-06 10:13:52 -05:00
Chao Liu
b725e3fc84 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr 2021-09-21 11:55:26 -05:00
Chao Liu
f3acd2510b Add a version of Merge transform that use integerdivision and mod (#25)
* add Merg_v3_division_mod

* refactor
2021-09-05 12:57:57 -05:00
ltqin
627d8ef35a Backward weight v4r4r2 with xdlops (#18)
* start

* modify transformat

* modify device convolutiion

* modify host

* added host conv bwd and wrw

* remove bwd, seperate wrw

* clean

* hacall k to zero

* out log

* fixed

* fixed

* change to (out in wei)

* input hack

* hack to out

* format

* fix by comments

* change wei hacks(wei transform has not merge)

* fix program once issue

* fix review comment

* fix vector load issue

* tweak

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-08-30 22:49:17 -05:00
Chao Liu
10bb811060 Misc fixes (#24)
* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamcBuffer and buffer_load take customized invalid-element-value, add buffer_load/store for fp64

* use remove_cvref_t
2021-08-26 20:05:19 -05:00
Qianfeng
9e80cdceb7 [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108)
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)

* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction

* Update to generic reduction C++ host interface layer to support dynamic generic reduction

* Update to remove tidy complaints in host interface layer

* Change the unary operator form from void op(T &x) to T op(T x)

* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)

* Use cppcheck-suppress to prevent some strange warnings

* Re-use operator [] and () for DynamicBuffer and update to depending codes

* Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers

* [performance] Remove un-needed local buffer initialization

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>
2021-08-26 18:04:55 -07:00
zjing14
a7a758d8ce GlobalAtomicAdd for fp32/int32 (#23)
* add f32/i32 atomicAdd support into dynamicBuffer, and enable it in v1r3

* fixed

* fixed

* update comment

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-08-25 10:55:55 -05:00
zjing14
9d3f634a3c Xdlops refactor fix (#22)
* added constexpr ahead of adptor; clean unused driver; rename M/NPerWave to M/NPerXDL

* fixed bwd

* fixed comment
2021-08-23 11:22:10 -05:00
Chao Liu
c6f26bb480 magic division use __umulhi() (#19) 2021-08-23 10:40:27 -05:00
Chao Liu
6fe3627a9e Composable kernel init integration v3 (#1097)
* Squashed 'src/composable_kernel/' content from commit f6edda611

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c

5781adf5c Update develop (#5) (#6)
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert chagnes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make innner product compatiable on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

Co-authored-by: JD <Jehandad.Khan@amd.com>
2021-08-19 10:55:03 -05:00
zjing14
a2ad6d3531 refactor dynamic xdlops iGemm (#13)
* xdlops refactor

* fixed commnt

* clean xdlops_gemm

* add make c into xldops-gemm

* change mfma_info

* refactor xdlops, hide c desc

* clean

* clean

* clean

* apply hacks changes to v4r4r4_nhwc

* rename hacks and use single stage adapter

* enable fp16 mfma
2021-08-19 09:54:10 -05:00
Chao Liu
67ad47e7c1 refactor 2021-08-16 21:01:33 +00:00
Chao Liu
16effa767c refactor 2021-08-16 20:36:47 +00:00
Chao Liu
a91b68dfcb DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element 2021-08-13 23:40:19 +00:00
Chao Liu
2cbabbba54 use int instead of index_t in kernel wrapper 2021-08-13 20:55:39 +00:00
Chao Liu
f2ac7832c6 make innner product compatiable on gfx900 2021-08-11 09:42:53 -05:00
Chao Liu
4e57b30a6a rename 2021-08-11 00:08:42 +00:00
Chao Liu
c03045ce2d rename 2021-08-10 23:45:36 +00:00
Chao Liu
d626dccc95 fix enum issue 2021-08-10 20:55:13 +00:00
Chao Liu
ddd49ec9e7 fix clang warning suppression 2021-08-10 06:20:24 +00:00
Chao Liu
4f566c6221 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast 2021-08-10 05:55:20 +00:00
Chao Liu
172036d728 add c-style pointer cast 2021-08-10 00:01:52 -05:00
Chao Liu
76f3131939 tidy 2021-08-09 18:49:59 -05:00
Chao Liu
d18428901e tidy 2021-08-09 18:20:02 -05:00
Chao Liu
f885c131d8 tidy 2021-08-09 22:13:47 +00:00
Chao Liu
80120f0a0c tidy 2021-08-09 21:10:09 +00:00
Chao Liu
56fc0842b3 tidy 2021-08-09 19:27:49 +00:00
Chao Liu
54fba515b3 tidy 2021-08-09 17:33:32 +00:00
Chao Liu
24c8728942 add tidy 2021-08-08 17:41:54 +00:00
Chao Liu
82fae390fb update to clang-format-10 2021-07-30 16:37:00 -05:00
Chao Liu
f63a23acb1 [MIOpen Downstream] Initial MIOpen integration (#52)
* update online kernel wrapper bundle all descriptors in a tuple

* change __CONSTANT__ to CONSTANT

* rename

* adding tuning

* added IsValidCompileParameter

* reorginze

* adding tunable for fp16 and int8

* fix kernel compile warning and bug fixes

* suppress warning about cast CONSTANT (address space 4) pointer

* fix building issue
2021-07-27 00:02:27 -05:00
Chao Liu
1264925422 reorganize files to prepare for MIOpen integration (#51)
* change olc cmake

* adding online compile to fwd-v4r5r2

* update scripts

* remane fwd-v4r5r2 to fwd-v6r1

* clean up
2021-07-18 00:43:05 -05:00
zjing14
fbdf4332c7 Add xdlops v4r4r4 into online compilation (#48)
* init for v4r4 xdlops olc

* refactor wrap

* init impl of v4r4 nchw xdlops olc

* tuning

* test perf

* fixed v4r4 nhwc

* tuned v4r4 nhwc

* use gridwise_gemm_xdlops_v2r3

* swap a/b

* add pointer support into offline v2r3

* debugging v4r4r4 transform for olc

* change timer of olc

* refactor v4r4 xdlops nchw olc

* remove transform fun in v4r4 xdlops nhwc olc

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-07-16 23:27:08 -05:00
Chao Liu
58a8057011 default iterator hack for blockwise copy (#47) 2021-07-16 08:57:15 -05:00
Chao Liu
1c1b56fe61 fix bug: config for ThreadwiseDynamicTensorSliceTransfer_v2 (#46) 2021-07-09 10:13:20 -05:00
Chao Liu
2f82cfb190 Update default launch bounds (#43)
* update default launch bounds
2021-07-08 11:26:57 -05:00
Chao Liu
81c942cd7e Deprecate static kernel (#42)
* deprecate static kernels
2021-07-08 10:40:00 -05:00
Chao Liu
b8b2d0a6d1 DL GEMM fp32/fp16/int8 (#41)
* add threadwise copy the copy a tensor in one copy, added kpack to DL GEMM

* add kpack into fwd v4r5 nchw fp32
2021-07-04 22:50:29 -05:00
Chao Liu
11ec07e9d1 fix complain about divide by zero (#40) 2021-07-01 16:50:57 -05:00
zjing14
3835318cc3 xdlops_v4r4_fwd fp32/fp16 (#34)
* create files for xdlops

* working on blockwise_gemm_xdlops

* add KReduction

* add m/n repeats

* add 2x2 pipeline

* added 128x128 wavegemm

* use StaticBuffer of vector_type

* break vector type to blk_size

* add kpack into xldops_gemm and blockwise_gemm

* abroadcast only

* add fp32 mfma instructions

* adding fp16 mfma

* pack half4_t

* rename kperwave to kpack

* add 32x32x8fp16

* add fp16 mfma

* clean code

* clean code

* V4r4 xdlops kpack (#35)

* add kpack with incorrect results

* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2

* add 1x1 kernel

* add gridwise_gemm_v2 - single_buffer

* enabled dwordx4 for fp16

Co-authored-by: Chao Liu <chao.liu2@amd.com>

* refactor fwd-v4r4-xdlops

* add v4r4-nhwc-xdlop

* improve some perf of nhwc and nchw by tuning parameters, and change scheuduling in gridwise-gemm loop

* tweak scheduling in gridwise gemm

* add v4r3 with a single output copy

* init commit: output with slice win

* adding sliceWin

* add multiple repeats pattern

* starting adding bwd-v4r1-xdlops

* use tuple as SrcBuffer

* adding bwd-data v4r1 nhwc xdlops

* fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()

* fix bug in host bwd-data conv

* initial implementation of bwd-data v4r1 nhwc xdlops

* add launch bound flags

* enable launch bound

* add m/nrepeat=4

* tweak bwd-data v4r1 nhwc xdlops

* added bwd-data v4r1 nhwc xlops with output A and weight B

* add fwd-v4r4 nhwc xdlops, A input, B weight, C output

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2021-07-01 14:33:00 -05:00