Chao Liu
04d90a65c5
refactor
...
[ROCm/composable_kernel commit: 16effa767c ]
2021-08-16 20:36:47 +00:00
Chao Liu
e02d6a0f21
[MIOpen Downstream] Initial MIOpen integration (#52)
...
* update online kernel wrapper to bundle all descriptors in a tuple
* change __CONSTANT__ to CONSTANT
* rename
* adding tuning
* added IsValidCompileParameter
* reorganize
* adding tunable for fp16 and int8
* fix kernel compile warning and bug fixes
* suppress warning about cast CONSTANT (address space 4) pointer
* fix building issue
[ROCm/composable_kernel commit: f63a23acb1 ]
2021-07-27 00:02:27 -05:00
zjing14
67dcc552b6
xdlops_v4r4_fwd fp32/fp16 (#34)
...
* create files for xdlops
* working on blockwise_gemm_xdlops
* add KReduction
* add m/n repeats
* add 2x2 pipeline
* added 128x128 wavegemm
* use StaticBuffer of vector_type
* break vector type to blk_size
* add kpack into xdlops_gemm and blockwise_gemm
* abroadcast only
* add fp32 mfma instructions
* adding fp16 mfma
* pack half4_t
* rename kperwave to kpack
* add 32x32x8fp16
* add fp16 mfma
* clean code
* clean code
* V4r4 xdlops kpack (#35)
* add kpack with incorrect results
* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2
* add 1x1 kernel
* add gridwise_gemm_v2 - single_buffer
* enabled dwordx4 for fp16
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* refactor fwd-v4r4-xdlops
* add v4r4-nhwc-xdlop
* improve perf of nhwc and nchw by tuning parameters, and change scheduling in the gridwise-gemm loop
* tweak scheduling in gridwise gemm
* add v4r3 with a single output copy
* init commit: output with slice win
* adding sliceWin
* add multiple repeats pattern
* starting adding bwd-v4r1-xdlops
* use tuple as SrcBuffer
* adding bwd-data v4r1 nhwc xdlops
* fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()
* fix bug in host bwd-data conv
* initial implementation of bwd-data v4r1 nhwc xdlops
* add launch bound flags
* enable launch bound
* add m/nrepeat=4
* tweak bwd-data v4r1 nhwc xdlops
* added bwd-data v4r1 nhwc xdlops with output A and weight B
* add fwd-v4r4 nhwc xdlops, A input, B weight, C output
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 3835318cc3 ]
2021-07-01 14:33:00 -05:00
Chao Liu
f4acec502e
Restructure gridwise and blockwise GEMM, add tensor contraction and FWD-v4r5 (#36)
...
* experimenting magic number division
* overhauling fwd-v4r4 to clearly reflect transformation graph
* added fwd-v4r5
* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2
* bug fix and added sanity-check in transform_dynamic_tensor_descriptor
* added conv_driver_v2
[ROCm/composable_kernel commit: 30072aec37 ]
2021-06-09 23:53:08 -05:00
Chao Liu
ce4662dbda
No raw index calculation (#31)
...
* Replace most raw index calculations with coordinate transformations
* Overhaul blockwise and threadwise GEMM
* Overhaul driver for gridwise GEMM kernel
Co-authored-by: Jing Zhang <jizhan@amd.com>
[ROCm/composable_kernel commit: 01055d95d9 ]
2021-05-11 00:09:25 -05:00
Chao Liu
b4dbf677ce
Dynamic tensor descriptor (#24)
...
* support dynamic tensor descriptor
* use buffer load OOB feature for padding case
* add navi support
* add int8x4 inference kernel
Co-authored-by: Chao Liu <chao@ixt-rack-81.local.lan>
Co-authored-by: Jing Zhang <jizhan@amd.com>
[ROCm/composable_kernel commit: fcbb978828 ]
2021-03-25 13:51:11 -05:00
Chao Liu
0eb214d1cd
Code clean up (#20)
...
* tuning parameters
* testing on v100
* add fp16
* remove deprecated tensor descriptor
* sync with miopen
* update build script
Co-authored-by: Jing Zhang <jizhan@amd.com>
[ROCm/composable_kernel commit: 5c7cec1115 ]
2020-06-23 20:31:27 -05:00
Chao Liu
ef393a2bb2
MIOpen integration (#13)
...
* update for MIOpen integration: cosmetic refactor
[ROCm/composable_kernel commit: 1a66e35b6f ]
2020-02-17 09:53:20 -06:00
Chao Liu
81e3c745dc
Update for recent MIOpen integration (#11)
...
* update for MIOpen integration
[ROCm/composable_kernel commit: 3406a1148a ]
2020-01-27 15:29:33 -06:00
Chao Liu
8b51bc4b1d
Added bwd data v3r1 v4r1, tweaking v1 (#10)
...
* Added bwd data v3r1: break down compute into a series of load-balanced GEMMs, launched in a single kernel
* Added bwd data v4r1: like v3r1, but launch GEMMs in multiple kernels
* Tweaked v1r1 and v1r2 (atomic) on AMD GPU
[ROCm/composable_kernel commit: c5da0377fb ]
2020-01-20 10:20:03 -06:00
Chao Liu
3799741fee
backward data (#7)
...
* enabled atomic add in tensor copy
* added gridwise GEMM
* added backward data conv using GEMM + atomic
* added backward data conv using GEMM, no atomic
[ROCm/composable_kernel commit: 8f5f64960e ]
2019-12-03 01:16:12 -06:00
Chao Liu
17564ecfec
adding merge transform
...
[ROCm/composable_kernel commit: ca42e9101d ]
2019-09-10 01:53:49 -05:00
Chao Liu
399be319a2
more utility code
...
[ROCm/composable_kernel commit: 7a7fe16086 ]
2019-09-09 00:29:33 -05:00
Chao Liu
6166bf61c2
adding dimension transformation
...
[ROCm/composable_kernel commit: 0c05f4279f ]
2019-09-05 00:19:06 -05:00
Chao Liu
e181a33df3
refactor
...
[ROCm/composable_kernel commit: 37b82b7e54 ]
2019-06-19 22:26:45 -05:00
Chao Liu
84ce32552d
fixed amd build
...
[ROCm/composable_kernel commit: 1f2cfcebb3 ]
2019-06-19 18:51:19 -05:00
Chao Liu
611944080a
refactor
...
[ROCm/composable_kernel commit: 21f7e9f103 ]
2019-06-19 17:43:56 -05:00