* Add int8 of mk_nk_mn to the ckProfiler
* Add example of int8 gemm
* Fix typo, use ushort instead of half_t for bfloat16
* replace ushortXXX_t to bhalfXXX_t
* rename ushort to bhalf_t
* Add bf16 example
* Add bf16 gemm to ckProfiler
* Fix alignment
* Fix typo
* Add unit test for gemm_xdl int8
* Add gemm_xdl fp32 unit test
* Add gemm_xdl bf16 unit test
* fix build
* fix build issue due to merge conflict
* Fix build
* Fix build error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* add docker file and make default target buildable
* add Jenkinsfile
* remove empty env block
* fix package stage
* remove render group from docker run
* clean up Jenkins file
* add cppcheck as dev dependency
* update cmake file
* Add profiler build stage
* add hip_version config file for reduction operator
* correct jenkins var name
* Build release instead of debug
* Update test CMakeLists.txt
reorg test dir
add test stage
* reduce compile threads to prevent compiler crash
* add optional debug stage, update second test
* remove old test target
* fix tests to return proper results and self review
* Fix package name and make test run without args
* change Dockerfile to ues rocm4.3.1
* remove parallelism from build
* Lower paralellism
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* add gitignore
* host tensor: allow generating sequentially increasing value in a given dimension
* gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor
- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
(API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)
* add LDS data dump utility
* profiler: reflect API change for distinct K0/K1 for A/B matrices
* profiler: add conflict-free LDS write FP16 kernel instances
* fix accidental perf regression
* address feedback; cosmetic changes
* clang-format for new files
* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* conv3d compiles but has memory error
* conv3d works
* fix performance issue by using __builtin_amdgc_readfirstlane
* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
* clang-format
* remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d
* format
* remove useless marc
* add comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>
- device_gemm_xdl_c_shuffle function signature matches split-k
- retire host_driver since it is no longer maintained
- linter error (unused variable)
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* tweak conv for odd C
* update script
* clean up elementwise op
* fix build
* clean up
* added example for gemm+bias+relu+add
* added example for gemm+bias+relu
* add profiler for gemm_s_shuffle; re-org files
* add profiler
* fix build
* clean up
* clean up
* clean up
* fix build
* fix relu
* clean up
* clean up
* adding 1x1 conv
* adding 1x1 conv
* added 1x1 conv
* refactor
* refactor
* refactor
* added profiler for conv+bias+relu+add
* clean up
* adding conv+bias+relu
* adding conv+bias+relu
* added conv+bias+relu
* Update README.md
* update cpu verification
* adding c shuffle
* update static_tensor for dealing with invalid element
* adding c shuffle
* debugging
* fix bug
* convert to fp16 before shuffle
* shuffle more than one M/NRepeat
* clean up
* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
* clean up
* remove coordinate step hack from all gridwise gemm xdl
* clean up coordinate step hack
* clean up coordinate step hack
* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
* adding output shuffle in conv+bias+relu+add
* update
* added conv+bias+relu+add with c shuffle
* added conv+bias+relu+add with c shuffle
* fix forward_sweep bugs in threadwise copy
* clean up
* refactor
* clean up
* clean up
* added conv_c_shuffle+bias_relu
* clean up
* added conv+bias+relu+atomic_add
* clean up
* clean up
* clean up
* clean up
* clean up
* clean up
* misc fixes; add 1x1 specialization
* clean up
* delete unused device op
* clean up
* add support for odd C value
* add add new algorithm from v4r4r2
* program once issue
* add split k functiion
* redefine code
* add a matrix unmerge
* add b matrix unmerge k0
* trans a and b to gridegemm
* nhwc init
* no hacks and vector load
* add hacks
* modify some parameter
* fix tuning prometer for fp32
* fix tuning prometer for fp16
* start change gridwise k split
* init ok
* revome a b matrix k0mk1 desc in grid
* carewrite lculate gridsize
* add kbatch to CalculateBottomIndex
* remove some unused funtion
* add clear data function before call kernel
* out hacks
* in hacks
* rename device convolution file and function name
* modify kBatch value
* fix some tuning code
* start from v4r4 nhwc
* nhwc atomic is able to run
* just for fp32
* enable nchw atomic
* tweak
* tweak
* re-arrange gridwise gemm hot loop for wrw
* add wrw v4r5
* v4r4r5 fp16
* v4r4r4 fp16
* v4r4r2 fp16
* V4R4R4XDLNHWC fp16
* V4R4R2XDLATOMICNCHW fp16
* adjust for fp16
* input gridsize
* change kbatch to gridsize
* testing wrw
* clean up
* k_batch to gridsize
* fix bug
* wrw v4r4r4 kbatch change to gride size
* wrw v4r4r2 kbatch change to gride size
* after merge , change gridwise gemm v2r4
* change MakeCBlockClusterAdaptor
* other method use new gridwise gemm
* clean up
* chapad method nge to make_right_pad_transform
* kbatch out from transform function
* clean up and fix bug
* fix bug
* using function type reduce template parameters
* using auto replace define fuction type
* clean up
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>