Commit Graph

702 Commits

Author SHA1 Message Date
Po-Yen, Chen
f7288bc2b1 Reuse same implementation code for most of GEMM examples 2022-08-19 14:47:09 -04:00
Po-Yen, Chen
ed51c0638b Re-format template argument in example code 2022-08-19 14:31:46 -04:00
Po-Yen, Chen
5931c7ebe6 Move common codes together 2022-08-19 13:49:22 -04:00
Po-Yen, Chen
68a57e71e6 Move #include directives into new header 2022-08-19 13:24:00 -04:00
Po-Yen, Chen
42d75f356c Sort include directives 2022-08-19 12:59:46 -04:00
Po-Yen, Chen
dd5b139401 Extract int4 example common codes 2022-08-19 12:57:36 -04:00
Po-Yen, Chen
3e2f37a148 Re-format GEMM instance template arguments 2022-08-19 12:02:57 -04:00
Po-Yen, Chen
c1fbabea04 Avoid too much generalizing check_err() 2022-08-19 11:59:21 -04:00
Po-Yen, Chen
4d4a659cd6 Use ""_uz to simplify example code 2022-08-19 11:54:51 -04:00
Po-Yen, Chen
3e2371c554 Align design with other PR 2022-08-19 11:44:08 -04:00
Po-Yen, Chen
503f07c1e0 Add constraint to check_err() input reference type 2022-08-19 11:34:19 -04:00
Po-Yen, Chen
2fb766e852 Simplify tensor usages in examples 2022-08-19 11:33:25 -04:00
Po-Yen, Chen
0d5025befe Add #error directive to prevent compiling sources with wrong settings 2022-08-19 10:51:30 -04:00
Po-Yen, Chen
625f95ade4 Remove debug messages 2022-08-19 10:05:44 -04:00
Po-Yen, Chen
84843aa36f Avoid compilation error while disabling ck::int4_t support 2022-08-19 09:54:03 -04:00
Po-Yen, Chen
51d0c6794c Remove constraint of Tensor<>::CopyAsType() 2022-08-19 05:31:04 -04:00
Po-Yen, Chen
c34f8411c4 Check converted Tensor<int4_t> with golden Tensor<int8_t> 2022-08-19 04:40:13 -04:00
Po-Yen, Chen
a83c006098 Allow comparing different-sized integral types in check_err() 2022-08-19 04:39:20 -04:00
Po-Yen, Chen
726c115393 Add type constraints for integer version check_err<>() 2022-08-19 03:48:20 -04:00
Po-Yen, Chen
f2c148efae Add type traits 'is_signed_integral<>' 2022-08-19 03:47:22 -04:00
Po-Yen, Chen
463d15f9b5 Add constraint to Tensor<> templated methods 2022-08-19 03:27:41 -04:00
Po-Yen, Chen
f3f61f836b Complete the int4 examples 2022-08-19 02:19:50 -04:00
Po-Yen, Chen
2dc3357a20 Fix typo in alias names 2022-08-19 01:41:20 -04:00
Po-Yen, Chen
79480f0aee Re-use element-wise operation type alias 2022-08-19 01:39:46 -04:00
Po-Yen, Chen
dd849a8736 Re-use CopyAsType<>() to implement copy ctor 2022-08-19 01:02:36 -04:00
Po-Yen, Chen
e03cece9c4 Use different type for host tensors 2022-08-19 00:32:57 -04:00
Po-Yen, Chen
89a827cab9 Re-format source files 2022-08-19 00:32:24 -04:00
Po-Yen, Chen
cbbe2485b2 Allow conversion between Tensor<> specializations 2022-08-19 00:30:53 -04:00
Po-Yen, Chen
30ed3e218c Add int4_t support for check_err() 2022-08-19 00:30:28 -04:00
Po-Yen, Chen
194faf7837 Distinguish user-side type from kernel-side type 2022-08-18 23:43:19 -04:00
Po-Yen, Chen
70c87970ec Re-use pre-defined alias in int4 examples 2022-08-18 23:29:38 -04:00
Po-Yen, Chen
4b153bd974 Add GEMM examples for int4
Currently the source files are just copied from int8 examples
2022-08-18 23:03:36 -04:00
Illia Silin
9efd033bee restart the stages on MI200 in case of failures (#366)
* restart the stages on MI200

* fix the docker image storage issue
2022-08-18 14:54:47 -05:00
Adam Osewski
e00149ac67 int4 data type (#364)
* Introduce int4 data type.

* Add unit-tests for int4

* Compile int4 UT only when int4 enabled.

* clang-format

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-08-18 14:53:47 -05:00
Chao Liu
bac7df8faf use scale (#363) 2022-08-17 10:38:00 -05:00
Anthony Chang
c961ce9226 Hotfix LDS data hazard in fused attention (#360)
* avoid LDS data hazard in gemm_softmax_gemm pipeline

* trivial refactors

* comments

* shrink blockwise gemm v2 thread buffer size

* reclaim A block lds space during 2nd gemm

* amend

* amend
2022-08-15 12:04:20 -05:00
Qianfeng
53ea4713af Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host reference implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove unneeded include in batchnorm example

* Renaming generic_elementwise to elementwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device
2022-08-15 10:11:02 -05:00
Chao Liu
5ee304595c fix build issue (#357)
* fix build

* exclude example_gemm_max_xdl_fp16 from testing due to random failure on gfx908
2022-08-13 15:58:31 -05:00
cloudhan
fb1cbf025b Change all device operations to use add_instance_library (#338)
* Change all device operations to use add_instance_library to avoid duplicated cmake configuration.

* update DeviceMem

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 12:17:58 -05:00
rocking5566
0bd6b842b9 Layernorm welford (#346)
* Add threadwise and blockwise welford

* Rename gridwise op, prepare to add welford version

* implement welford and integrate welford into layernorm

* Take care of tail loop

* Fix bug when ThreadSliceK > 1

* Fix bug when merging two empty sets

* Rename clip to clamp

* 1. Fix type of count
2. Remove useless static_assert

* Do not inherit Reduction::Argument

* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)

* Add y stride

* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm

* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'

Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:43:18 -05:00
Anthony Chang
c20a75b07d Fused GEMM+GEMM (#351)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring a unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* add gemm_gemm instances and tests

* avoid LDS data hazard

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:18:58 -05:00
ltqin
10b3278b05 Skip lds of b matrix (#326)
* start

* read for gridwise gemm

* add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1

* add thread copy desc and register buffer

* add K0PerBlock dim

* add read global data

* finish gridwise gemm

* finish blockwise gemm

* add print data

* add smallest config

* add compare code for gridwise gemm

* fix NXdlPerWave

* fix k0perthread and gridwise gemm main loop

* remove b matrix lds alloc

* fix name

* add test code

* create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter

* add double register

* modify b_thread_desc_

* add float

* fp16 tag

* add tail for pipeline

* finish main loop

* optimize main loop

* start clear gridwise gemm

* clear code

* clear redundant code

* change file name

* change file name

* fix bug after merge develop

* fix input parameters

* using MultiK0 control b load data loop

* fix some config

* 4 buffer

* fix bug

* one can use

* change read order

* change buffer array to tuple

* change to 8 buffer

* interleave buffer load

* change to 16

* read 8 buffer

* add data buffer to template

* fix after merge develop (header file)

* format

* change to 4 buffer

* remove unnecessary lambda fun
2022-08-13 01:35:49 -05:00
Qianfeng
14932e8de3 Add examples for reduction fp16/fp32/bf16/int8/fp64 for 3d/4d/5d (#342)
* Update the reduce_blockwise example to support user specified data type and input+reducing dimensions

* Add examples for using reduce_multiblock_atomic_add

* Add more running examples to the default command-line

* Remove unnecessary header includes

* Update to the example README.md
2022-08-13 01:10:01 -05:00
rocking5566
6c3c06bf1f Gemm multiple d multiple r (#335)
* Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction

* Implement run of kernel

* Add example

* Fix typo in parameter

* Rewrite the reduceMax example

* Rewrite the reduceMean + reduceMeanSquare example

* Refine naming

* Refine folder name

* refine naming

* Rewrite the gemm + bias + relu + add + layernorm example

* Rewrite the gemm + layernorm example

* clang-format

* Fix bug if sync lds

* Fix compile error
2022-08-13 01:07:12 -05:00
Anthony Chang
cac014f173 Fused attention (#345)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring a unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwise softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment

Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 00:16:14 -05:00
Po Yen Chen
a670a5a092 Move literal ""_uz & ""_zu into namespace 'ck::literals' (#354)
* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'
2022-08-12 17:48:35 -05:00
Rostyslav Geyyer
0c6ef7c14e Add example of conv_fwd_bias_relu_add for int4, int8, bfp16, fp16, and fp32 (#343)
* [LWPCK-359] Initial commit

* Working version for fp16, add results to readme

* Update according to PR #341

* Update results in readme

* Add fp32 example

* Add bf16 example

* Update fp16 and fp32 examples

* Add int8 example

* Add separate lengths and strides tensors for D tensors

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
2022-08-12 15:30:27 -05:00
zjing14
35e49f2de6 add g; fixed strides (#355) 2022-08-12 15:22:39 -05:00
Illia Silin
de60d290b6 Build docker only once in CI, fix conv_bwd logfile names. (#353)
* build docker in separate stage

* build docker with only one prefix

* add parallel statement

* add docker repo url

* fix the name of perf_conv_bwd_data log file
2022-08-12 12:30:37 -05:00
Po Yen Chen
68b61504a3 Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340)
* Add always_false<> util to delay symbol resolution

* Use always_false<> to prevent instantiating unwanted method

* Add new specializations of AddAddFastGelu::operator() method

* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32

* Use floating point literal to simplify code

* Remove unnecessary capture in lambda expressions

* Extract fast GeLU calculation as standalone method

* Mark methods as 'constexpr'

* Add constraint for HostTensorDescriptor templated ctors

* Simplify HostTensorDescriptor ctor calls

* Add C++23 std::size_t literal suffix

* Use _uz suffix to shorten example code

* Remove unnecessary conversion to std::array<>

* Re-order include directives

* Remove C-style casting by literal suffix

* Remove unnecessary statements in main()

* Remove unused type parameter of always_false<>

* Remove unused include directive

* Exit main() by returning meaningful value

* Use 'if constexpr' to switch example flow

* Use std::is_same_v<> to shorten example code

* Add 'inline' specifier to literal functions

* Unify output methods in example

* Move common codes into .inc file

* Add type check in type_convert<>()

* Add type_convert<float>() before computation

* Merge AddAddFastGelu method specializations

* Remove always_false<>

* Add constraint to AddAddFastGelu::operator() parameter types
2022-08-11 17:31:28 -05:00