ltqin
998217be22
add split-k GEMM ( #59 )
...
* add DeviceGemmSplitKXdl
* add file device_gemm_splitk_xdl.hpp
* set c matrix zero
* using atomic
* add all tuning parameter to f32 mkkn
* grid size change to 720
* add tunning parameter for NT
* add tunning parameter for TN
* add tunning parameter for TT
* add m=96tunning parameter
* add lost config
* add element wise operation
* fixed MPerBlock=96
* remove marco for slpitk swtich
* add test
* add new line at the end of device_gemm_xdl_instance.hpp
* remove step hack
* seperate split-k instance files
* add tunning parameters
* change disired grid size to parameters
* remove slice length
* add desiredgridsize parameter to ckProfiler
* add losting file device_gemm_xdl_splitk_instance.hpp
* change desired gride size to kbatch
* format
* format
* clean up
* add selection of device_instances
* clean code
* fix build issue
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Chao Liu <chao.liu2@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
[ROCm/composable_kernel commit: 4be7f0198e ]
2022-02-02 22:47:27 -06:00
Chao Liu
886680ae94
Fusion Conv+Bias+ReLU(+Add) ( #62 )
...
* fix relu
* clean up
* clean up
* adding 1x1 conv
* adding 1x1 conv
* added 1x1 conv
* refactor
* refactor
* refactor
* added profiler for conv+bias+relu+add
* clean up
* adding conv+bias+relu
* adding conv+bias+relu
* added conv+bias+relu
* Update README.md
* update cpu verification
* adding c shuffle
* update static_tensor for dealing with invalid element
* adding c shuffle
* debugging
* fix bug
* convert to fp16 before shuffle
* shuffle more than one M/NRepeat
* clean up
* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
* clean up
* remove coordinate step hack from all gridwise gemm xdl
* clean up coordinate step hack
* clean up coordinate step hack
* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
* adding output shuffle in conv+bias+relu+add
* update
* added conv+bias+relu+add with c shuffle
* added conv+bias+relu+add with c shuffle
* fix forward_sweep bugs in threadwise copy
* clean up
* refactor
* clean up
* clean up
* added conv_c_shuffle+bias_relu
* clean up
* added conv+bias+relu+atomic_add
* clean up
* clean up
* clean up
* clean up
* clean up
* clean up
* misc fixes; add 1x1 specialization
* clean up
* delete unused device op
* clean up
* add support for odd C value
[ROCm/composable_kernel commit: acbd7bd7c5 ]
2021-12-26 07:43:42 -07:00
Chao Liu
4a141b5a04
GEMM/Conv+BiasAdd+ReLU+Add ( #55 )
...
* gemm+activation
* move C pointwise operation into threadwise copy
* add pointwise operation to A/B matrix
* update ckProfiler
* adding bias add
* adding bias add
* adding bias add
* added bias add; worked around compiler issues
* clean up
* clean up
* Update README.md
* Update README.md
* Update README.md
* clean up
* add conv_xdl example
* adding conv_xdl_bias_relu_add example
* add conv+bias+relu+add, but has register spill issue
* tweak
* tweak
* refactor
* Update README.md
update readme for example/2_gemm_xdl_bias_relu_add
* clean up
* Update README.md
update readme for example/3_conv_xdl
* Update README.md
[ROCm/composable_kernel commit: 41cdd3801a ]
2021-12-02 20:07:37 -06:00
Chao Liu
46ec6b3ef0
fix layout naming convention ( #56 )
...
[ROCm/composable_kernel commit: 4041850f11 ]
2021-11-30 09:10:55 -06:00
zjing14
220bc28498
add args for packed gemm ( #54 )
...
[ROCm/composable_kernel commit: 567f5e9c5f ]
2021-11-24 12:33:55 -06:00
zjing14
1e7102575b
fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler ( #51 )
...
* fixed bfloat16 issues
* refactor type_convert
Co-authored-by: Chao Liu <chao.liu2@amd.com >
[ROCm/composable_kernel commit: 0a66c54e95 ]
2021-11-16 15:44:17 -06:00
Chao Liu
8791d26e52
FP16 data in-register transpose ( #41 )
...
* start fixing 16bit data packing
* adding StaticTensor
* adding StaticTensor
* adding StaticTensor
* add missing constexpr
* adding static tensor
* adding static tensor
* adding transpose
* add inline asm for transpose 2x2 of half_t
* add general transpose_vectors(), but have unnecessary register initialization using v_mov
* fix unnecessary register initialization in transpose_vector by using more pass-by-reference
* add hardcoded logic for NHWC wrw
* improve asm for v_pack
* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
* tweak
* reorganize file
[ROCm/composable_kernel commit: b491ebf384 ]
2021-11-15 10:05:58 -06:00
Chao Liu
2f5ccb68f5
ckProfiler and device-level XDL GEMM operator ( #48 )
...
* add DeviceGemmXdl
* update script
* fix naming issue
* fix comment
* output HostTensorDescriptor
* rename
* padded GEMM for fwd v4r4r4 nhwc
* refactor
* refactor
* refactor
* adding ckProfiler
* adding ckProfiler
* refactor
* fix tuning parameter bug
* add more gemm instances
* add more fp16 GEMM instances
* fix profiler driver
* fix bug in tuning parameter
* add fp32 gemm instances
* small fix
* refactor
* rename
* refactor gemm profiler; adding DeviceConv and conv profiler
* refactor
* fix
* add conv profiler
* refactor
* adding more GEMM and Conv instance
* Create README.md
Add build instruction for ckProfiler
* Create README.md
Add Readme for gemm_xdl example
* Update README.md
Remove build instruction from top most folder
* Update README.md
* clean up
[ROCm/composable_kernel commit: e823d518cb ]
2021-11-14 11:28:32 -06:00