Commit Graph

35 Commits

Author SHA1 Message Date
Shaojie WANG
ebab84b6f9 MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425)
* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check lef bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete used comments

* format

Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-20 12:43:53 -05:00
Adam Osewski
e00149ac67 int4 data type (#364)
* Introduce int4 data type.

* Add unit-tests for int4

* Compile int4 UT only when int4 enabled.

* clang-format

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-08-18 14:53:47 -05:00
Anthony Chang
c20a75b07d Fused GEMM+GEMM (#351)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* add gemm_gemm instances and tests

* avoid LDS data hazard

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:18:58 -05:00
Anthony Chang
cac014f173 Fused attention (#345)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwsie softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment

Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 00:16:14 -05:00
Chao Liu
500fa99512 Clean up conv example, Instances, profiler and test (#324)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test
2022-07-29 18:19:25 -05:00
rocking5566
7f21662089 Standalone layernorm (#315)
* Implement layernorm kernel and deviceOp

* verify gpu kernel with host code

* 1. Separate gamma aand beta from affine
2. Check if argument is valid

* clean

* Sync the naming

* Support sweep once mode if we can put k dimension data inside one block

* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.

* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.

* Use 1d descriptor for gamma and beta

* Add accElementwiseOp

* Extract layernorm host code

* Support different YVectorDim in GridwiseLayernorm

* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp

* Gamma and beta can share the VGPR.

* Add test for fp32 and fp16

* Fix bug of concurrency and add test case which may fail orignally

* Propagate NaN for layernorm

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-13 11:16:14 -05:00
Shaojie WANG
763ca61581 add conv1d/3d bwd weight instances (#318)
* add conv1d/3d bwd weight instances

* add profiler code
2022-07-08 15:42:20 -05:00
Liam Wrubleski
b653c5eb2e Switch to standard ROCm packaging (#301)
* Switch to standard ROCm packaging

* Revert .gitignore changes

* install new rocm-cmake version

* update readme

Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-06-25 09:35:16 -05:00
Chao Liu
d1db6a0c3e Absolute include path (#281)
* ad gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obselete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporary disable client test on Jenkins

* clean

* clean

* clean
2022-06-24 20:51:04 -05:00
Anthony Chang
15c89e81f0 Standalone softmax kernel (#284)
* initial stub for standalone softmax

* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m

* host softmax validates

* compiles; to implement beta scaling

* use NaN trick to efficiently ignore OOB values during sum of exponentials

* freeload device_reduce's utility functions

* clean up interface

* adding prior value (beta scaling)

* remove restriction related to perf considerations

* apply clang-format

* clean; disable diagnostics

* resolve conflicts

* add exp wrapper

* honor HostTensorDesc interface; allow implicit cast from different vector<T> type

* test softmax for fp16/fp32

* update readme

* amend commit NaN trick

* remove redundant param added during development

* format

* replace ScalarDataType with AccDataType

* separate out test programs by precision type

* move softmax sample code to its own folder

* format

* keep up with recent changes in reduction API

* remove extra header
2022-06-21 14:59:19 -05:00
Jianfeng Yan
40b59a63cc Navi21 gemm (#197)
* start adding navi21 GEMM

* navi_gemm_km_kn_mn_fp32 compiles and passes one test.

* rename variables and functions in gridwise_gemm_dlops_v1r3

* add other 3 layouts; format instance

* adding more tuning parameters

add tuning parameters for other 3 layouts

* add gemm_dlops_f16

* tmp

* add dependence of DeviceGemm::IsSupportedArg() on arch

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* push gemm_dlops into profiler

* minor changes

* if using xdl or dlops is moved into profiler_gemm_impl

* minor changes

* minor changes

* remove is_xdl from profile_gemm_impl

* make IsSupportedArg dependent on arch for other device_gemm

* minor changes

* minor changes

* fix a bug in f_generate_tensor_value

* add 64x64x64 for gemm_dlops_int8

* add 64x64x64 for gemm_dlops_int8

* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1

* fix

* start fixing tuning parameters

* monir

* minor changes

* minor changes

* minor changes

* fixing

* adding example

* adding example

* adding example

* add gemm fp32 example

* clean up

* use 128x128x16 as MNK tile in navi21 gemm example

* bug fix

* fix test

* use new block c tile

* clean

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>
2022-05-24 12:19:27 -05:00
Anthony Chang
a054f7d604 Refactor block to C tile map (#235)
* refactor block-to-ctile-map

* gridwise gemm block2ctile generic validity check

* format

* amend split-k gemm block2ctile map refactor

* add test

* format

* amend

* revert to calculating batch index in kernel instead of passing as block_id_z

* move file

* add valid ctile index check to gridwise v2r4
2022-05-20 12:40:51 -05:00
Anthony Chang
9f71ff48e2 Validate examples in CI (#233)
* validate examples in ctest runs

* format

* fix usage of check_err

* amend

* add example codes to custom target 'check'

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-05-13 16:54:44 -05:00
JD
cec69bc3bc Add host API (#220)
* Add host API

* manually rebase on develop

* clean

* manually rebase on develop

* exclude tests from all target

* address review comments

* update client app name

* fix missing lib name

* clang-format update

* refactor

* refactor

* refactor

* refactor

* refactor

* fix test issue

* refactor

* refactor

* refactor

* upate cmake and readme

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-05-12 09:21:01 -05:00
ltqin
0f912e205e enable convnd bwd data test (#234) 2022-05-12 09:18:59 -05:00
Adam Osewski
712e464c4e Post PR183 review fixes. (#224)
* Suppress additional warnings for googltest.

* Rename file conv_fwd_util to conv_util.

* Update includes and ConvParams member access.

* Formatting.

* Change conv_fwd_util target to conv_util

* Fix compiler errors.

* Fix leftovers.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-05-10 15:41:29 -05:00
Adam Osewski
8eca05a633 Introduce GoogleTest framework. (#204)
* Use googletest for tests. Add conv2d_fwd UT.

* Add conv1D/3D to gtest UT.

* Fix: not duplicate test with CTest.

* Convert more tests to googltests.

* Fix: GIT_SHALLOW is not allowed for git commit hash.

* Clang-format

* use integer value for GEMM test

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-04-30 08:50:16 -05:00
Adam Osewski
abf4bdb9a9 Common forward convolution utility refactor. (#141)
* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensio.

Needed to support various conv dimensions

* Rename conv nd example director to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

* 3D cases for convolution utility functions.

* 3D reference convolution.

* Add support for 3D convolution.

* Check for inputs bigger than  2GB.

* Formatting

* Support for bf16/f16/f32/i8 - conv instances + UT.

* Use check_err from test_util.hpp.

* Split convnd test into separate files for each dim.

* Fix data generation and use proper instances.

* Formatting

* Skip tensor initialization if not necessary.

* Fix CMakefiles.

* Remove redundant conv2d_fwd test.

* Lower problem size for conv3D UT.

* 3D case for convnd example.

* Remove leftovers after merge.

* Add Conv Specialization string to GetTypeString

* Skip instance causing numerical errors.

* Small fixes.

* Remove redundant includes.

* Fix namespace name error.

* Script for automatic testing and logging convolution fwd UTs

* Comment out numactl cmd.

* Refine weights initalization and relax rtol for fp16

* Move test_util.hpp to check_err.hpp

* Refine weights initalization and relax rtol for fp16

* Refactor common part of test conv utils.

* Move utility function to single common place.

* Add additional common functions to utility.

* Refactor convnd_fwd_xdl examples.

* Remove redundant files.
* Unify structure.

* Add constructor to ConvParams.

* And add input parameters validation.

* Modify conv examples to use single utility file.

* Remove check_error from host_tensor.hpp

* Get rid of check_indices function.

* Remove bf16_to_f32 function overload for scalars.

* Fix namespace.

* Add half_float::half for check_err.

* Fix conv params size in UT.

* Fix weights initialization for int8.

* Fix weights initialization for int8.

* Add type_convert when store output in ref conv 1D.

* Get back old conv2d_fwd_xdl operation.

* Silence conv debug print.

* format

* clean

* clean

* Fix merge.

* Fix namespace for check_err

* Formatting.

* Fix merge artifacts.

* Remove deleted header.

* Fix some includes and use ck::utils::check_err.

* Remove unused check_indices restored by previous merge.

* Fix namespaces after merge.

* Fix compilation error.

* Small fixes.

* Use common functions.
* Fix filename
* Fix namespaces.

* Fix merge artifact - retrieve removed by accident fun.

* Fix ConvForwardSpecialization.

* Adhere to coding style rules.

* Fix merge artifacts.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-04-05 15:16:59 -05:00
ltqin
781cacd2e6 NHWC Conv2d Bwd weight fp16 ckprofiler and test (#166)
* change backward weight name

* start add bwd weight lib and profiler

* change tuning paramter

* change output info

* add bwd weight test

* change test info

* using conv_util

* change wgt to weight

* add }

* add fp32
2022-04-04 20:32:00 -05:00
Jianfeng Yan
34c661e71c Batched gemm and reduction (#156)
* adding batched_gemm_and_reduction

* batched_gemm_reduce works with bactch_count=1

* fix a bug in grid_size; batched_gemm_reduce works for batch_count > 1

* adding profiler for batched_gemm_fp16

* fixed a bug in declaration of d1 and d0; both example and profiler work

* clang-format

* cleanup

* batched_gemm_reduce: add test

* minor change

* fixed some typo in function names
2022-03-30 11:21:18 -05:00
ltqin
0536f2b312 Unified implementation of 1d/2d/3d conv bwd-data. fp32/fp16/bfp16/int8 (#134)
* start convnd bwd data

* add 3d laoyout name

* add conv1d reference

* add con3d reference

* finished example client code

* conv1d kernel finished

* fix input error

* add conv3d

* add 3d layout in conv_utils.hpp

* fix sepecial check

* addconvnd lib

* add test for bwd data

* finished test

* add check slice length

* convnd bwd data start

* profiler can be compiled

* fix some bug

* set input to zero

* modify readme for example

* fix test_convnd_bwd_data bug

* test_convnd_bwd_data parameter desc

* workaround for 1d

* workaroud for 2d

* change init value

* workaround for 3d int8

* fix init value bug

* remove workaround

* fix acc data type

* add int32

* change select function to template

* tilda to tilde

* remove int32 instance

* fix commit for device hpp

* fix comments for profiler

* using profile imp to test

* add pass verification

* fix conv2d reference

* fix conflict

* remove double batched_gemm

* fix exampel conv2d data and test convnd

* format

* change conv2d_bwd_data return value

* remove repeat = 1

* remove conv bwd data

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-03-29 10:52:25 -05:00
Chao Liu
f95267f166 Gemm+Reduce Fusion (#128)
* add gridwise gemm v4r1

* rename

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* use sfc in shuffling

* remove hardcode

* remove hardcode

* refactor

* fix build

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* adding gemm+reduce

* format

* clean

* adding gemm+reduce

* adding profiler for gemm+reduce

* adding gemm+reduce profiler

* fix build

* clean up

* gemm+reduce

* fix build

* update DeviceGemm_Xdl_CShuffle; update enum to enum class

* clean up

* add test for gemm+reduce

* clean up

* refactor

* fix build

* fix build
2022-03-23 22:18:42 -05:00
Adam Osewski
f91579aab6 Unified conv3D API + support for all data types. (#133)
* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensio.

Needed to support various conv dimensions

* Rename conv nd example director to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

* 3D cases for convolution utility functions.

* 3D reference convolution.

* Add support for 3D convolution.

* Check for inputs bigger than  2GB.

* Formatting

* Support for bf16/f16/f32/i8 - conv instances + UT.

* Use check_err from test_util.hpp.

* Split convnd test into separate files for each dim.

* Fix data generation and use proper instances.

* Formatting

* Skip tensor initialization if not necessary.

* Fix CMakefiles.

* Remove redundant conv2d_fwd test.

* Lower problem size for conv3D UT.

* 3D case for convnd example.

* Remove leftovers after merge.

* Add Conv Specialization string to GetTypeString

* Skip instance causing numerical errors.

* Small fixes.

* Remove redundant includes.

* Fix namespace name error.

* Script for automatic testing and logging convolution fwd UTs

* Comment out numactl cmd.

* Refine weights initalization and relax rtol for fp16

* Fix weights initialization for int8.

* Add type_convert when store output in ref conv 1D.

* Get back old conv2d_fwd_xdl operation.

* Silence conv debug print.

* format

* clean

* clean

* Fix merge.

* Fix namespace for check_err

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-03-23 10:23:13 -05:00
zjing14
716f1c7fb1 Grouped GEMM for fp16 (#126)
* init of grouped_gemm

* 2 gemm test

* perf test

* clean

* wrap desc into a struct

* test cast static_arr to pointer

* add ptr to GemmDesc

* add grouped gemm profiler

* fixed mem issue with unique_ptr

* clean

* clean

* finished ckprofiler

* Update README.md

* readme

* fixed readme

* add example

* improve code

* fixed comments: reserve, seperate ptr and gemm_shapes

* merge group and non-group

* fixed comments: replace push_back with emplace_back to avoid copy constructor

* fixed comments: unified blk2ctile; add test

* ci fix

* fixed ci

* fixed ci

* fixed ci
2022-03-22 18:18:18 -05:00
Qianfeng
9a8ee8a39a Reduction for int8 and bfloat16 (#125)
* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction

* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter

* Rename the folder name for the pool2d and reduce examples

* Update to reduction test scripts

* Add Readme for pool2d_fwd and reduce_blockwise examples

* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)

* Tiny fix in reduce profiler and tiny update in reduce testing scripts

* Tiny fix in testing script profile_reduce_no_index.sh

* Tiny fix in testing script profile_reduce_no_index.sh

* Add support for bfp16 reduction (using bhalf_t = ushort)

* Tiny fix in amd_buffer_addressing.hpp

* Tiny change in script/profile_reduce_with_index.sh

* Use AccDataType for Beta value and use element_wise::PassThrough

* Use type_convert for type converting in host layer reduction

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming all NumReduceDims to NumReduceDim

* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2

* Update to testing scripts to add bf16 support

* added more static_assert

* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp

* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations

* minor change

* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass

* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp

* Tiny fix in script/profile_reduce_no_index.sh

* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims

* Generic renaming in host reduction and DeviceReduce layer

* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances

* Use multi-thread and simplification for host Reduction implementation

* Add ctest for reduction

* Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/

* Update to the reduce CTest executables to enable default testing behavior when no command argument

* Renaming

Co-authored-by: Jianfeng yan <jfyan008@gmail.com>
2022-03-22 14:35:14 -05:00
Jianfeng Yan
cb87b049de refactored deviceBatchedGemm; removed GridwiseBatchedGemm; added fp32 and int8 to profiler (#120)
changed long_index_t to index_t when computing memory offset

uncomment other ops in profiler

added test for batched_gemm
2022-03-21 16:45:14 -05:00
Chao Liu
5d37d7bff4 Reorganize files, Part 1 (#119)
* delete obselete files

* move files

* build

* update cmake

* update cmake

* fix build

* reorg examples

* update cmake for example and test
2022-03-08 21:46:36 -06:00
Chao Liu
5b178874a1 Fix Tests build (#109)
* fix tests

* remove useless file

* fix test build

* reduce parallelism when compiling

* fix test
2022-03-05 00:44:11 -06:00
rocking5566
7e9a9d32c7 [Bf16 & int8] [example & ckprofiler] (#100)
* Add int8 of mk_nk_mn to the ckProfiler

* Add example of int8 gemm

* Fix typo, use ushort instead of half_t for bfloat16

* replace ushortXXX_t to bhalfXXX_t

* rename ushort to bhalf_t

* Add bf16 example

* Add bf16 gemm to ckProfiler

* Fix alignment

* Fix typo

* Add unit test for gemm_xdl int8

* Add gemm_xdl fp32 unit test

* Add gemm_xdl bf16 unit test

* fix build

* fix build issue due to merge conflict

* Fix build

* Fix build error

Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-03-04 15:56:44 -06:00
JD
992f71e371 Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88)
* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* Update test CMakeLists.txt
reorg test dir
add test stage

* reduce compile threads to prevent compiler crash

* add optional debug stage, update second test

* remove old test target

* fix tests to return proper results and self review

* Fix package name and make test run without args

* change Dockerfile to ues rocm4.3.1

* remove parallelism from build

* Lower paralellism

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-03-03 16:59:42 -06:00
Jianfeng Yan
bdedf64b98 Space filling curve (#96)
* add space_filling_curve

* cleanup and move space_filling_curve into test

* add functions for backward and forward step; hard coded results in unit test

* minor changes
2022-02-24 20:11:36 -06:00
Adam Osewski
756a761727 Unify Convolution FWD XDL 1D/2D implementation. (#93)
* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensio.

Needed to support various conv dimensions

* Rename conv nd example director to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-02-23 10:44:20 -06:00
ltqin
880fbee957 NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73)
* add fwd bf16 conv

* change tunning parametor

* add int8 for conv fwd

* remove comments

* change tunning parametor for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file pos because merge develop

* fwd int8 use reference

* test_conv_fwd use reference

* add braket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-02-11 20:06:40 -06:00
ltqin
4be7f0198e add split-k GEMM (#59)
* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tunning parameter for NT

* add tunning parameter for TN

* add tunning parameter for TT

* add m=96tunning parameter

* add lost config

* add element wise operation

* fixed MPerBlock=96

* remove marco for slpitk swtich

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* seperate split-k instance files

* add tunning parameters

* change disired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add losting file device_gemm_xdl_splitk_instance.hpp

* change desired gride size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* fix build issue

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2022-02-02 22:47:27 -06:00
Chao Liu
237d4ca03f added test for magic number division (#58) 2021-11-30 09:09:28 -06:00