Commit Graph

493 Commits

Author SHA1 Message Date
rocking5566
34ecda63ec Gemm alpha beta profiler (fp32 & fp16) (#91)
* [What] Refactor verification of gemm alpha_beta, move to reference operation
[Why] Sync with other verification

* Profile mk_nk for gemm bias 2d

* Support bias 2d with mn * kn in profiler

* Support bias 2d with km*kn and km*nk in profiler

* Support fp32 bias 2d in profiler

* format

* format

Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 19c5d6e651]
2022-02-21 11:35:21 -06:00
JD
ba2852c7a3 Initial Setup for CI (#86)
* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* clean up

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 2778e99758]
2022-02-18 21:44:11 -06:00
ltqin
32c128bcc5 NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73)
* add fwd bf16 conv

* change tunning parametor

* add int8 for conv fwd

* remove comments

* change tunning parametor for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add braket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 880fbee957]
2022-02-11 20:06:40 -06:00
zjing14
a76cc6deed Add small tile size for fp16/fp32 and NN layout (#80)
* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tunning parameter for NT

* add tunning parameter for TN

* add tunning parameter for TT

* add m=96tunning parameter

* add lost config

* debug

* fix sweep

* add failed tuning params

* fixed sweep logic

* clean

* add padding to M/N for irr tile size

* clean code

* add element wise operation

* fixed MPerBlock=96

* remove marco for slpitk swtich

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* seperate split-k instance files

* add tunning parameters

* change disired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add losting file device_gemm_xdl_splitk_instance.hpp

* change desired gride size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* clean code

* add small tile size in fp16 nn

* test for rocm 4.5

* merge develop

* clean

* clean

* clean

* remove no-use code

* add padding switch to device_gemm_xdl

* add padding switch for ksplit fp32

* clean

* clean

* add files

* rename

* Update profiler.cpp

* format

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 20a672d0b8]
2022-02-11 15:49:06 -06:00
zjing14
e57c9a886f Batched GEMM for fp16 (#79)
* prepare host for batched_gemm

* init commit of batched kernels

* fixed

* refine transform with freeze

* m/n padding

* fixed a bug; clean

* add small tiles

* clean

* clean code

* clean code

* add nt, tn, tt layout

* add missing file

* use StaticBufferTupleOfVector instead

* add reference_batched_gemm

* fixed a macro

[ROCm/composable_kernel commit: b53e9d08ed]
2022-02-11 09:36:52 -06:00
rocking5566
6822598f8a Support alpha beta scaling for GEMM (#78)
* [What] Add 2d version of bias, prepare to implement alpha / beta scaling

* Add alpha / beta functor

* Refine parameter of example

* [What] Use real type instead of template
[Why] Prevent implicit cast

* Rename parameter for general operator

* Remove redundant comment

* Fix compile error

Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 6f928a0876]
2022-02-11 00:48:41 -06:00
Anthony Chang
75ababb759 fix build breaks (#81)
- device_gemm_xdl_c_shuffle function signature matches split-k
- retire host_driver since it is no longer maintained
- linter error (unused variable)

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 904cbe2a8f]
2022-02-10 23:52:19 -06:00
Chao Liu
8efcb80fa5 GEMM+Bias+ReLU+Add (#76)
* tweak conv for odd C

* update script

* clean up elementwise op

* fix build

* clean up

* added example for gemm+bias+relu+add

* added example for gemm+bias+relu

* add profiler for gemm_s_shuffle; re-org files

* add profiler

* fix build

* clean up

* clean up

* clean up

* fix build

[ROCm/composable_kernel commit: 823657ed12]
2022-02-06 22:32:47 -06:00
ltqin
8890cc207d References for conv2d fwd bias relu and add (#75)
* add reference

* clean up

* add reference for conv

* rename

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 690c75a7eb]
2022-02-03 22:29:58 -06:00
zjing14
7a2e49df19 Replace llvm Intrinsics with clang buildins (#65)
* test mfma builtins

* add fp16 buildins

* add int8 buildins

* add bfl16 buildins

* simplify host conv forward

* clean

* clean

[ROCm/composable_kernel commit: 6d92959ad3]
2022-02-02 23:13:09 -06:00
ltqin
25d05d36c4 add split-k GEMM (#59)
* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tunning parameter for NT

* add tunning parameter for TN

* add tunning parameter for TT

* add m=96tunning parameter

* add lost config

* add element wise operation

* fixed MPerBlock=96

* remove marco for slpitk swtich

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* seperate split-k instance files

* add tunning parameters

* change disired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add losting file device_gemm_xdl_splitk_instance.hpp

* change desired gride size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* fix build issue

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

[ROCm/composable_kernel commit: 4be7f0198e]
2022-02-02 22:47:27 -06:00
rocking5566
e56dfde6dc Do not hardcode the function parameter, use template instead. (#72)
* Do not hardcode the function parameter, use template instead.

* [What] Remove AThreadTransferSrcResetCoordinateAfterRun and BThreadTransferSrcResetCoordinateAfterRun in host API
[Why] "C_Shuffle" version is supposed to be similar to the vanilla one

* Fix typo
Let DeviceGemmXdl_C_Shuffle use kernel_gemm_xdlops_v3r1

[ROCm/composable_kernel commit: ca47a6cfe2]
2022-01-24 22:44:13 -06:00
rocking5566
50367b2e80 Add gemm_shuffle host api (#71)
* [What]
1. Add DeviceGemmXdl_C_Shuffle
2. Revise example of gemm_xdl
[Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
[How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp

[ROCm/composable_kernel commit: 4d40b1974e]
2022-01-21 00:31:17 -06:00
Chao Liu
b622a9b463 Fix building issue for examples (#66)
* fix build issue

[ROCm/composable_kernel commit: 6260ced2f3]
2022-01-17 23:49:04 -06:00
Chao Liu
d6a0d8efcd Fusion Conv+Bias+ReLU(+Add) (#62)
* fix relu

* clean up

* clean up

* adding 1x1 conv

* adding 1x1 conv

* added 1x1 conv

* refactor

* refactor

* refactor

* added profiler for conv+bias+relu+add

* clean up

* adding conv+bias+relu

* adding conv+bias+relu

* added conv+bias+relu

* Update README.md

* update cpu verification

* adding c shuffle

* update static_tensor for dealing with invalid element

* adding c shuffle

* debugging

* fix bug

* convert to fp16 before shuffle

* shuffle more than one M/NRepeat

* clean up

* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1

* clean up

* remove coordinate step hack from all gridwise gemm xdl

* clean up coordinate step hack

* clean up coordinate step hack

* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst

* adding output shuffle in conv+bias+relu+add

* update

* added conv+bias+relu+add with c shuffle

* added conv+bias+relu+add with c shuffle

* fix forward_sweep bugs in threadwise copy

* clean up

* refactor

* clean up

* clean up

* added conv_c_shuffle+bias_relu

* clean up

* added conv+bias+relu+atomic_add

* clean up

* clean up

* clean up

* clean up

* clean up

* clean up

* misc fixes; add 1x1 specialization

* clean up

* delete unused device op

* clean up

* add support for odd C value

[ROCm/composable_kernel commit: acbd7bd7c5]
2021-12-26 07:43:42 -07:00
Chao Liu
861202da79 manually apply bug fix changes in pr #63 (#64)
* Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2()
* Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"

[ROCm/composable_kernel commit: a4f24233e5]
2021-12-12 18:05:51 -06:00
Chao Liu
f2d7b4bafb fix ReLU formula (#61)
* fix relu

* clean up

* clean up

[ROCm/composable_kernel commit: fd3d907a80]
2021-12-04 16:05:29 -06:00
Chao Liu
9c85245412 GEMM/Conv+BiasAdd+ReLU+Add (#55)
* gemm+activation

* move C pointwise operation into threadwise copy

* add pointwise operation to A/B matrix

* update ckProfiler

* adding bias add

* adding bias add

* adding bias add

* added bias add; worked around compiler issues

* clean up

* clean up

* Update README.md

* Update README.md

* Update README.md

* clean up

* add conv_xdl example

* adding conv_xdl_bias_relu_add example

* add conv+bias+relu+add, but has register spill issue

* tweak

* tweak

* refactor

* Update README.md

update readme for example/2_gemm_xdl_bias_relu_add

* clean up

* Update README.md

update readme for example/3_conv_xdl

* Update README.md

[ROCm/composable_kernel commit: 41cdd3801a]
2021-12-02 20:07:37 -06:00
Jing Zhang
0572b85298 renaming/comments
[ROCm/composable_kernel commit: d7a0a3f94c]
2021-12-02 23:37:57 +00:00
Jing Zhang
d7a0702062 add static_buffer_v2 zero out
[ROCm/composable_kernel commit: 2cbb897638]
2021-12-02 05:54:19 +00:00
Jing Zhang
50e0ec6fde fixed c_buffer alloc
[ROCm/composable_kernel commit: d798c9b8c6]
2021-12-02 05:03:03 +00:00
Chao Liu
727098dfa0 fix layout naming convention (#56)
[ROCm/composable_kernel commit: 4041850f11]
2021-11-30 09:10:55 -06:00
Chao Liu
7ead49ca42 added test for magic number division (#58)
[ROCm/composable_kernel commit: 237d4ca03f]
2021-11-30 09:09:28 -06:00
zjing14
35a48e1bc6 add args for packed gemm (#54)
[ROCm/composable_kernel commit: 567f5e9c5f]
2021-11-24 12:33:55 -06:00
Chao Liu
bf1768aea4 Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53)
* reworking vector_type

* use __builtin_memcpy for bit_cast and vector access of scalar pointer

* clean up

[ROCm/composable_kernel commit: 64350affc5]
2021-11-18 09:11:15 -06:00
zjing14
75f9af0fc5 v5r1 fusion kernels for inference (#49)
* init

* refactor for 1x1

* rename e0_e1

* add e1 with bugs

* debug

* fixed

* fixed e1

* add timer

* imprve threadwise gemm with dot2

* add e2

* tuning

* seperate c2

* add nhwc

* restore nchwc

* clean

* opt

* fixed; tuning

* add BGlobalMoveSliceWindowStepHacks{}

* tuning

* repeat running

* adjust

* merge v5r1 nchwc

* add adaptors

* split k0 k1 in c_thread_grid

* split h and w

* remove v5r1 nhwc

* clean for pr

* remove host_conv_add

* clean code

* clean

* add dynamic support

* static mode

* test static

* add conv+add fusion

* fixed validation

* naming fix

* use activ_enum

* make static

* refactor conv_add for InMem::add

* add bias

* add conv_out

* add configurable makeddesc

* add maxpool fusion

* add maxpool host for validation

* enable static desc

* conv-only use v5r1_add

* test

* test

* for binary dumps

* fixed incorrect results due to typo

* clean

* debugging maxpool

* workaround with offset trick

* clean code

* modularize ops of fusion

* add gridwise_gemm_v3

* create seperate fusion fun

* enable dynamic mode of conv and conv+resize_add

* add dynamic mode of maxpool

* add pass by point

* add activ_type as arguments

* merge develop

* clean

* reset config to old default

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 970fa3e92e]
2021-11-18 08:34:07 -06:00
zjing14
87a5e0056f Fixed bfp16 host_conv_fwd (#52)
* fixed bfloat16 issues

* refactor type_convert

* fixed host_convolution_forward for ushort

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: a651ea4f7a]
2021-11-18 08:10:56 -06:00
zjing14
43b1d325d4 fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler (#51)
* fixed bfloat16 issues

* refactor type_convert

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 0a66c54e95]
2021-11-16 15:44:17 -06:00
Jing Zhang
a3e1551535 updated bfloat16_to_float
[ROCm/composable_kernel commit: 89e1ebd4d5]
2021-11-16 18:01:25 +00:00
zjing14
c05b73844b Add bfp16/int8 support into XDL GEMM operator (#50)
* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* add parameters

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* tweak gemm

* adjust blockwise_gemm_xdlops

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* add bf16_1k

* clang-format

* clean

* add bfp16 for gfx908

* add verification

* clean up

* clean code

* restore bfl16

* clean

* add bfp16 support into gemm_driver

* apply new generator to other drivers

* add int8 support

* cleanb

* clean

* clean

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: root <root@hayabusa6111.amd.com>

[ROCm/composable_kernel commit: 3737bb039a]
2021-11-15 10:24:39 -06:00
Chao Liu
b827099a27 FP16 data in-register transpose (#41)
* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file

[ROCm/composable_kernel commit: b491ebf384]
2021-11-15 10:05:58 -06:00
Chao Liu
b9f9ed96ac ckProfiler and device-level XDL GEMM operator (#48)
* add DeviceGemmXdl

* update script

* fix naming issue

* fix comment

* output HostTensorDescriptor

* rename

* padded GEMM for fwd v4r4r4 nhwc

* refactor

* refactor

* refactor

* adding ckProfiler

* adding ckProfiler

* refactor

* fix tuning parameter bug

* add more gemm instances

* add more fp16 GEMM instances

* fix profiler driver

* fix bug in tuning parameter

* add fp32 gemm instances

* small fix

* refactor

* rename

* refactor gemm profiler; adding DeviceConv and conv profiler

* refactor

* fix

* add conv profiler

* refactor

* adding more GEMM and Conv instance

* Create README.md

Add build instruction for ckProfiler

* Create README.md

Add Readme for gemm_xdl example

* Update README.md

Remove build instruction from top most folder

* Update README.md

* clean up

[ROCm/composable_kernel commit: e823d518cb]
2021-11-14 11:28:32 -06:00
ltqin
ed91fc0f4c [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
* change method computering kpad

* remove unusing variable: batchlen

* change KPerBlock to K0PerBlock

* fix bug for k0 == k0perblock

* fix bug for get k0 index

* use math::integer_divide_ceil

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 6014185ac6]
2021-10-27 09:39:18 -05:00
ltqin
a2ccbe5550 Merge branch 'develop' into miopen_downstream_all
[ROCm/composable_kernel commit: 211dae8229]
2021-10-27 13:34:19 +08:00
Jun Liu
3fc733cd35 [Composable Kernel] update develop branch code to ck_upstream
Merge pull request #1236 from ROCmSoftwarePlatform/develop

[ROCm/composable_kernel commit: 5890e30076]
2021-10-25 19:49:17 -07:00
Chao Liu
58864d1082 fix bug in gridwise gemm xdlops v2r3 (#45)
[ROCm/composable_kernel commit: d5297abae9]
2021-10-21 16:42:24 -05:00
Chao Liu
aa41cfbb26 bug fix (#39)
[ROCm/composable_kernel commit: c3018794b4]
2021-10-19 18:43:10 -05:00
ltqin
3341ddbdf5 add nchw atomic , nhwc and nhwc atomic method for backward weight (#30)
* add add new algorithm from v4r4r2

* program once issue

* add split k functiion

* redefine code

* add a matrix unmerge

* add b matrix unmerge k0

* trans a and b to gridegemm

* nhwc init

* no hacks and vector load

* add hacks

* modify some parameter

* fix tuning prometer for fp32

* fix tuning prometer for fp16

* start change gridwise k split

* init ok

* revome a b matrix k0mk1 desc in grid

* carewrite lculate gridsize

* add kbatch to CalculateBottomIndex

* remove some unused funtion

* add clear data function before call kernel

* out hacks

* in hacks

* rename device convolution file and function name

* modify kBatch value

* fix some tuning code

* start from v4r4 nhwc

* nhwc atomic is able to run

* just for fp32

* enable nchw atomic

* tweak

* tweak

* re-arrange gridwise gemm hot loop for wrw

* add wrw v4r5

* v4r4r5 fp16

* v4r4r4 fp16

* v4r4r2 fp16

* V4R4R4XDLNHWC fp16

* V4R4R2XDLATOMICNCHW fp16

* adjust for fp16

* input gridsize

* change kbatch to gridsize

* testing wrw

* clean up

* k_batch to gridsize

* fix bug

* wrw v4r4r4 kbatch change to gride size

* wrw v4r4r2 kbatch change to gride size

* after merge , change gridwise gemm v2r4

* change MakeCBlockClusterAdaptor

* other method use new gridwise gemm

* clean up

* chapad method nge to make_right_pad_transform

* kbatch out from transform function

* clean up and fix bug

* fix bug

* using function type reduce template parameters

* using auto replace define fuction type

* clean up

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

[ROCm/composable_kernel commit: fd49ff8080]
2021-10-19 18:42:34 -05:00
Qianfeng
abda4c0cab [MIOpen Downstream] Fix Reduction Kernel (#34)
* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid convert to compType from dstDataType before writting the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: b2dc55f82c]
2021-10-06 14:43:17 -05:00
Chao Liu
0e75841071 Tweak GEMM kernel (#38)
* add parameters

* tweak gemm

* tweak

* update conv

* update script

* adding bwd 1x1

* update script

* adding 1x1 bwd

* debugging bwd 1x1 failure

* update script

* update script

* test

* test v100

* clean up

[ROCm/composable_kernel commit: b3e8d57d51]
2021-10-06 11:12:36 -05:00
zjing14
ad110e92ba Add VectorType support into StaticBuffer (#27)
* init StaticBufferV2

* clean

* adopt old output stage for staticBufferV2

* clean

* remove hack

* clean

* clean

* clean code

* move c_buffer alloc into blockwise gemm

* add adaptors for m/n_thread_data_on_grid

* adjust blockwise_gemm_xdlops

* reorder ops in GEMM hot loop

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 846f462bd4]
2021-10-06 10:13:52 -05:00
Qianfeng
f3d8052ad2 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156)
* Squashed 'src/composable_kernel/' content from commit a4b211238

git-subtree-dir: src/composable_kernel
git-subtree-split: a4b21123849265d90a6b8fa86905a9a8ab253787

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from a4b211238..5805b5dc4

5805b5dc4 Update develop (#5) (#6)
ede23b251 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
8b079b5c6 refactor
c3d788bfa refactor
fcf913481 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5805b5dc442dd8d71295954c4a755a6ef30593bb

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* tidy

* add tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert chagnes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make innner product compatiable on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

* Squashed 'src/composable_kernel/' changes from 5805b5dc4..dd3d4444e

dd3d4444e Merge pull request #16 from ROCmSoftwarePlatform/develop
cb6b2dc63 Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
d9b2fcab4 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
57b74196a refactor
431c47bea refactor
9a0d05870 DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
bc4146402 use int instead of index_t in kernel wrapper
87a2fc094 compiler parameter use stream
24743c85e make innner product compatiable on gfx900
7ad33d8e1 rename
5a3bace8d rename
12405c12a update CK build script
3c2effd43 fix kernel filename
12ff8d1ca fix enum issue
f0f97fd79 tidy
26f311aa9 fix clang warning suppression
c4f47ed09 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
35fd7bf79 add c-style pointer cast
9c31642f0 tidy
1a2efac60 tidy
ddd3b4e94 tidy
7daa0cfbf tidy
cab6e58d3 tidy
d9a8aebd8 tidy
533e356ce tidy
42639836b tidy
efe2836a2 add tidy
d53d7c666 fix
cf4ea1145 remove online compilation from CK
e63b17bdf refactor
5a2e56f78 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
702078bfd Merge pull request #7 from ROCmSoftwarePlatform/master
9ce85357a Update develop (#5)
10a172710 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

git-subtree-dir: src/composable_kernel
git-subtree-split: dd3d4444e9b9ed07a54f82d91d969770aa8d5074

* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid convert to compType from dstDataType before writting the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update in src/reducetensor.cpp for consistent IDs passing to the kernel

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Small updates in src/reducetensor.cpp

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

[ROCm/composable_kernel commit: dfb80c4e39]
2021-09-29 08:12:11 -07:00
Jun Liu
bd1af9250d Merge pull request #1165 from ROCmSoftwarePlatform/develop
Merge develop into CK_upstream (Please don't squash when merging)

[ROCm/composable_kernel commit: 8557901d02]
2021-09-21 15:52:12 -07:00
Chao Liu
c732ff2164 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
[ROCm/composable_kernel commit: b725e3fc84]
2021-09-21 11:55:26 -05:00
Chao Liu
b047701d37 :Merge remote-tracking branch 'origin/develop' into CK_upstream
[ROCm/composable_kernel commit: df0d68106e]
2021-09-20 20:44:01 -05:00
Chao Liu
5141373604 Add a version of Merge transform that use integerdivision and mod (#25)
* add Merg_v3_division_mod

* refactor

[ROCm/composable_kernel commit: f3acd2510b]
2021-09-05 12:57:57 -05:00
Chao Liu
115f77e17a GEMM driver and kernel (#29)
* add gemm driver

* tweak

* add gemm kernel: mk_kn_mn and km_kn_mn

* tweak

* add GEMM km_nk_mn

* fix comment

[ROCm/composable_kernel commit: 19613902b5]
2021-09-05 12:41:28 -05:00
ltqin
79b671c5dd Backward weight v4r4r2 with xdlops (#18)
* start

* modify transformat

* modify device convolutiion

* modify host

* added host conv bwd and wrw

* remove bwd, seperate wrw

* clean

* hacall k to zero

* out log

* fixed

* fixed

* change to (out in wei)

* input hack

* hack to out

* format

* fix by comments

* change wei hacks(wei transform has not merge)

* fix program once issue

* fix review comment

* fix vector load issue

* tweak

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 627d8ef35a]
2021-08-30 22:49:17 -05:00
Chao Liu
0dc65eae46 Misc fixes (#24)
* use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamcBuffer and buffer_load take customized invalid-element-value, add buffer_load/store for fp64

* use remove_cvref_t

[ROCm/composable_kernel commit: 10bb811060]
2021-08-26 20:05:19 -05:00
Qianfeng
b315c39b11 [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108)
* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities)

* Some updates to dynamic composable kernel facility for the need of dynamic generic reduction

* Update to generic reduction C++ host interface layer to support dynamic generic reduction

* Update to remove tidy complaints in host interface layer

* Change the unary operator form from void op(T &x) to T op(T x)

* Update to pass single workspace pointer for all kernels (fix for OpenCL backend)

* Use cppcheck-suppress to prevent some strange warnings

* Re-use operator [] and () for DynamicBuffer and update to depending codes

* Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers

* [performance] Remove un-needed local buffer initialization

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>

[ROCm/composable_kernel commit: 9e80cdceb7]
2021-08-26 18:04:55 -07:00