Commit Graph

95 Commits

Author SHA1 Message Date
Rostyslav Geyyer
c7a4d36147 Add padding device_gemm_xdl instances (#529)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-12-07 17:46:03 -06:00
ltqin
23ecf0fa9e Add multiple d gridwise gemm on Navi21 for ResNet50 (#517)
* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add head file(was deleted before)

* fix not support device issue

* format

* remove passthrough check

Co-authored-by: letaoqin <letaoqin@amd.com>
2022-12-02 11:42:31 -06:00
rocking5566
ad541ad6b9 gemm, conv perchannel quantization (#503)
* Use gemm_multiple_D instead

* Add gemm bias relu quantization example

* Add pure gemm quantization example

* Add quantization of perchannel conv + bias + relu example

* Refine the code

* Rename multiplier to requant_scale

* Rename the folder

* Remove redundant comment

* Rename the file. Prepare to add perchannel

* Add conv perchannel instance

* Move to quantization folder

* Add conv perchannel client example

* Apply Rangify constructor of HostTensorDescriptor & Tensor<>

* Fix merge error
2022-11-30 14:13:04 -06:00
Qianfeng
63af525c06 BatchNorm backward instance/external API/profiler/tests (#519)
* Refine the device batchnorm-backward base API templates and data type assignments

* Remove duplicated kernel file

* Add batchnorm backward instances and external API

* Add batchnorm-backward profiler and tests

* Add client example which uses batchnorm backward external API

* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory

* Loose the threshold for batchnorm-backward check_err()
2022-11-30 13:32:20 -06:00
Qianfeng
5bf0475afd Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516) 2022-11-28 14:33:00 -06:00
Qianfeng
4e6a5575be BatchNorm forward instance/external api/profiler/tests/client example (#511)
* Update to device_batchnorm_forward base class to include all template parameters for problem description

* Add batchnorm forward instances and external api

* Add batchnorm forward profiler module which uses the external api

* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]

* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward

* Improvement to the batchnorm infer base API

* Add batchnorm forward client example which shows using the batchnorm forward external API

* Add test for batchnorm forward

* Tuning the batchnorm profiler initialized values and error threshold

* Add support for bhalf_t in instances/external api/tests

* Add support for int8_t in instances/external api/tests

* Add support for double in instances/external api/tests

* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances

* Checking before running best instance in batchnorm_fwd_nhwc client example

* Add checking for YElementwiseOp in batchnorm_forward external API

* Add more types in batchnorm forward profiler

* Add more test lengths

Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
2022-11-24 18:02:27 -06:00
Adam Osewski
43a889b72e Client examples AddFastGelu and FastGelu + instances. (#509)
* FastGelu support for more data types.

* AddFastGelu & FastGelu instances.

* Client example.

* clang-format

* Remove unused stride variable.

* Add new line at EOF.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-11-19 22:08:26 -06:00
guangzlu
4c4c7328a6 Add BF16 tests for batched_gemm_softmax_gemm_permute (#504)
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm

* added bf16 tests for batched_gemm_softmax_gemm_permute

* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* aligned annotations

* modified CMakeLists for examples

* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl

* use macro to control the instances

* added macro control into instances

* clang-format some files

* changed error tolerance for bf16

* changed index for 10_elementwise_normalization

* fixed xdlops code bug in amd_xdlops.hpp

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2022-11-15 16:30:23 -06:00
ltqin
db0eb1ea9c Add Conv Backward Data on Navi21 for ResNet50 (#499)
* start add example

* add device dl

* change launch kernel

* change init data method

* change example config

* add config valid check

* add instance for dl bwd

* add instance to ckProfiler

* reserver to profiler and cmakelist

* add instance to ckProfiler2

* change instance f32 config

* fix example return value

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2022-11-15 16:22:20 -06:00
Po Yen Chen
730204eed0 Introduce ck::accumulate_n() (#439)
We can use this template to eliminate duplicated iterator computing
logics. By providing return type to ck::accumulate_n(), we can avoid
type conversion operations.
2022-11-14 19:53:39 -06:00
Po Yen Chen
38470e0497 Add client example of grouped conv2d backward weight (data type: fp16) (#498)
* Remove redundant CMake setting

* Extract common code from files

* Rename folder 'convnd' to 'conv'

* Use std::array<> to accept compile-time kwnown # of arguments

* Fix compilation error of tuning parameter

* In example, use same setting as unit-test

* Remove no-longer used include directive

* Add interface for grouped conv bwd weight

* Add group support for conv bwd weight

* Add grouped conv bwd weight example

* Use group parameter in example

* Rename example folder

* Remove non-grouped version example source files

* Rename device op template

* Add group support to convolution backward weight

* Remove debug messages

* Use smaller group size in example

* Use named variable as loop terminate condition

* Prettify example output message

* Enlarge used grid size

* Allow real grid size exceeds expected grid size

* Rename interface file

* Add client example for grouped conv2d bwd weight

* Fix wrong include directive

* Rename client example folder
2022-11-09 18:50:03 -06:00
Po Yen Chen
67423a2275 Remove interface 'DeviceGroupedConvBwdData' (#500)
* Remove interface 'DeviceGroupedConvBwdData'

* Remove no-longer needed include directive

* Rename client example folder
2022-11-09 18:32:17 -06:00
guangzlu
8a4253baaf Fused elementwise normalization (#492)
* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
2022-11-03 12:01:58 -06:00
Anthony Chang
451f1e3d65 remove atten kernel workarounds as we move over to rocm 5.3 (#496) 2022-11-02 16:56:07 -06:00
Po Yen Chen
9e57a290af Add client example of grouped conv2d backward data (data type: fp16) (#481)
* Improve example reusability

* Remove no-longer used file

* Rename folder of grouped_conv_bwd_data example

* Add normal grouped conv bwd example

* Add interface 'DeviceGroupedConvBwdData'

* Prettify comment of device op type arguments

* Add grouped conv2d/conv3d backward data fp16 instances

* Fix wrong template argument

* Add grouped_conv2d_bwd_data client example

* Use simpler expression to calculate memory size

* Fix formating

* Remove grouped_conv3d_bw_data instances

Underlying device operator is not ready to handle 3D input

* Remove no-longer necessary include directive

* Add missing include directive

* Use more realistic conv param in example
2022-11-02 16:54:41 -06:00
Rostyslav Geyyer
1a0b0e7bec Add pipeline v1/v2 selector, add more instances (#381)
* Add gridwise gemm pipeline v1/v2 selector

* Pipeline selector working, test-wise add pipeline options to one instance

* Add gemm instances

* Add debug info to DeviceGemmXdl

* Add debug info to DeviceGemmXdl_CShuffle

* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu

* Minor fix

* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm

* set up inter-wave configuration

* use defualt loop scheduling for supported gemm ops

for blanket-applying interwave scheduling for all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. this should be discouraged though as it is not covered by CI

* Add enum PipelineVersion

* Update instances

* Format

* Fix the merge conflict

* Add flags to disable added instances

* Test disable flag check

* Disable flag check

* Enable the instances

Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-11-02 16:50:48 -06:00
Adam Osewski
6d8614ee50 Softmax unit-test reduction across all and non innermost dims cases. (#406)
* Add reduction across all dims cases.

* host softmax: handle all reduce

* Test cases when reduced dim is not innermost axis.

* Fix syntax.

* Test non innermost dim for fp32 and int8

* Group test suites wrt NumReduceDim.

* Additionally test failing cases.

* Throw error when Rank or NumReduceDims doesn't match arguments.

* Check reducedDims has correct values

* Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
Instead implement own. (in fact just get rid of one check to enable
reduction across inner dimensions).

* Reorganize unit tests to better cover use scenarios.

* Test input validation
* Test reduction of inner dimensions with custom op instances.

* Refactor fp32 and int8 unit tests.

* Fix FP32 instance template parameters.

* Add more instances.

* Instances with InSrcVectorDim=0.

* Do not initialize and copy data when arg not supported.

* ckProfiler Softmax use instance factory.

* Refactor device softmax IsSupported.

* Additionally add non-polymorphic api functions

* Split softmax instances into multiple files.

* Fix profiler.

* Reorganize tests to reuse profiler and cover edge cases.

* Clang-format

* I8 Softmax instances along with UT.

* Reuse type alias definitions from instance factory header.

* Clean included headers

* Fix variable names.

* Add missing checks in Argument constructor.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-11-02 16:46:08 -06:00
rocking5566
226bc02b73 Conv perlayer int8 quantization (#471)
* Add conv2d requant example

* Fix bash error

* Rename example

* 1. Rename gemm quantization
2. shares the requantization lambda function with conv

* Refine declare type

* Add conv bias relu quantization exmaple

* clang format

* Fix compile error due to merge develop

* Fix CI error

* Extract quantization post operation into another file

* Support quantization for non piecewise linear function

* Add instance for conv quantization

* Add convolution quantization factory

* Add convolution quantization client example

* Add more instances with different template parameters

* clang format

* Sync the naming with the develop
2022-11-02 13:56:26 -06:00
ltqin
8ee36118be Add Conv Forward on Navi21 for ResNet50 (#490)
* add device of dl

* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3

* init version for dl conv

* add example(init)

* result right

* disable elementwise operation

* check parameters

* add fp32,int8 example and change check code

* change deive file and class name

* add check vector access of C

* add instance

* add to ckProfiler

* add Filter1x1Pad0 instances

* fix ignore error

* fix for CI

Co-authored-by: letaoqin <letaoqin@amd.com>
2022-10-31 10:24:25 -06:00
Anthony Chang
d8f5f717f8 fix missing -fPIC flag for conv3d_fwd instance lib (#473) 2022-10-27 15:32:46 -06:00
Anthony Chang
de37550f72 Input/output permutation for fused attention (#460)
* reopen masking att instance due to CI is upgraded

* re-enable instances previously failed on 9110

* enable ksize-kpadding pair validity test

* add non-masked attention+permute test; expose masking boolean to attention kernel handles

* disable bench

* fix test

* move files

* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute

* format

* amend rename

* disable bench in test

* add mask/no-mask test for non-permute attention kernels

* disable broken kernel instance

* example working

add non-permuted problem statement

evaluating whether overhead comes from permutation or the extra kernel arg

* interface for bias addition without implementing it

* test and profiler running

* tidy

* mask type determined by enum class

* unify example code

* move masking specialization to its own header

* align formats

* extract helper functions

* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute

* add tensor specialization to template args

since tensor spec packed shows perf parity when permutation isn't needed

remove redundant template args

comment on 'packed' tensor specialization

* grouped attention with input/output permute example

* format

* clean up

* refactor acc0 tile visitor

Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-10-27 14:58:20 -06:00
Rostyslav Geyyer
cd51732690 Fix Batched Gemm op for int8 data (#482)
* Fix for lwpck-425, update BlockTransferSrcVectorDim

* Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim"

This reverts commit fd24e280e2.

* Add Batched Gemm int8 test, expect it to fail

* Format

* Re-add the fix
2022-10-27 12:34:04 -06:00
Qianfeng
dda3a0a10b Update to the Reduction API and instances (#476)
* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances

* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces

* Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock

* Split the cpp and hpp files for reduction instances to enable more parallel compiling

* Remove the using of macros for declaring reduction instances and instance references

* Update to add_device_reduce_instance_xxxx templated functions

* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to repace the ReduceOpId in defining add_reduce_instance_xxxx() templates

* Change return format
2022-10-25 09:39:11 -06:00
guangzlu
6ea9257e9d Revert "Fused elementwise layernorm (#468)" (#491)
This reverts commit efbcc6eddc.
2022-10-25 18:37:12 +08:00
guangzlu
efbcc6eddc Fused elementwise layernorm (#468)
* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
2022-10-25 10:23:20 +08:00
Adam Osewski
3048028897 Refactor device op implementations into impl subdirectory. (#420)
* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-10-13 09:05:08 -05:00
rocking5566
1b62bfaa2a Fix bug of layernorm ckProfiler and refine code (#448)
* Fix bug of profiler for layernorm

* 1. Rename layernorm into normalization
2. Decouple softmax from normalization

* clang-format
2022-10-12 21:06:39 -05:00
Shaojie WANG
40942b9098 Optimization for gridwise group norm (#453)
* use another instance to check the efficiency

* optimize group layer norm

* 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and divison into a outer static loop

* add more instances to layernorm

* add 2 more test cases

* remove ignore in generating tuple of vector

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-10-06 21:24:13 -05:00
JD
2c6d63d031 Fix device instance libarary to include all instances (#418)
* fix device instance library to add all instances

* remove cppcheck from requirements.txt

Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-23 13:30:18 -05:00
Chao Liu
e9d4e893e5 fix build (#434)
* fix

* fix

* add instance
2022-09-22 12:32:41 -05:00
Shaojie WANG
ebab84b6f9 MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425)
* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check lef bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete used comments

* format

Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-20 12:43:53 -05:00
rocking5566
4eba345f6e Group norm (#417)
* Add groupnorm example by layernorm
1.  Reference is not ready
2. shape of gamma and beta need to be fix

* Let shape of gamma and beta can be same as x

* Modify test, instance and client example

* [What] Fix bug of layernorm for greater than 2 dimension.
[Why] We need to get upper length from merge transform instead of embed transform.

* Add reference for groupnorm

* Fuse sigmoid after groupnorm

* [What] Rename original layernorm into layernorm2d
[Why] Prepare to add groupnorm using layernorm5d

* clang-format

* Add groupnorm test

* Refine error message

* Add groupnorm ckProfiler

* Test groupnorm kernel from device_instance

* update example

* upadte profiler

* Fix test naming

* Fix argc number

* Move descriptor and sweeponce to argument for quick debugging

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-19 22:30:46 -05:00
Anthony Chang
7c788e10ce Add batched attention special kernel instances (#424)
* sanity check

* add attribution

* add irrgular k tile size for batched attention

* format
2022-09-19 19:20:54 -05:00
ltqin
370efa6c08 batched_gemm + multiple_d + gemm + multiple_d (#394)
* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first all code,not verify

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug for lunch kernel ignore parametes

* add pad and fix about bug

* mutiple d0

* add dynamic d0_element_op

* change profiler and  instance to mutiple d0

* example have 2 d0

* remove some comments not using

* change 2 d0 have self  parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>
2022-09-14 17:54:18 -05:00
Anthony Chang
868e5c555b Fused attention instances & padding tests (#395)
* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
Also added compile-time conditionals that load OOB value as NaN only after padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional error by 1e-3

Still unsure whether this kind of deterministic floating point accurary issue is expected
or not. May want to try exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around ~1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this fall back to less ideal way of handle NPadding in fused attention kernel

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
2022-09-06 14:38:56 -05:00
Anthony Chang
fe52c94c98 GemmGemm TNNT instances (#399)
* add gemm_gemm TNNT instance

* sanitize Gemm1KPack

* disable instances that failed validation on mi100
2022-09-06 13:38:01 -05:00
Adam Osewski
3da5c19e62 Softmax client example (#396)
* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-09-06 12:22:48 -05:00
zjing14
7589116121 [Hotfix] SplitK Gemm fp32 (#401)
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

* use gemm_xdl_splitK_c_shuffle into profiler

* remove device_gemm_xdl_splitk.hpp
2022-09-02 11:16:09 -05:00
Anthony Chang
f4047c9418 Implement padding and sanity checks for fused GEMM+GEMM (#376)
* GemmPadder and GemmGemmPadder

* proper padding using GemmGemmPadder

* test gemm_gemm padding

* properly check size K in IsSupportedArgument()

* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()

* comment

* format
2022-08-23 10:01:02 -05:00
Qianfeng
53ea4713af Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host refer implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove un-needed including in batchnorm example

* Renaming generic_elementwise to elementiwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device
2022-08-15 10:11:02 -05:00
cloudhan
fb1cbf025b Change all device operations to use add_instance_library (#338)
* Change all device operations to use add_instance_library to avoid duplicated cmake configuration.

* update DeviceMem

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 12:17:58 -05:00
rocking5566
0bd6b842b9 Layernorm welford (#346)
* Add threadwise and blockwise welford

* Rename gridwise op, prepare to add welford version

* implement welford and integrate welford into layernorm

* Take care of tail loop

* Fix buf when ThreadSliceK > 1

* Fix bug of merging of two empty set

* Rename clip to clamp

* 1. Fix type of count
2. Remove useless static_assert

* Do not inherit Reduction::Argument

* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)

* Add y stride

* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm

* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'

Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:43:18 -05:00
Anthony Chang
c20a75b07d Fused GEMM+GEMM (#351)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* add gemm_gemm instances and tests

* avoid LDS data hazard

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:18:58 -05:00
Anthony Chang
cac014f173 Fused attention (#345)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwsie softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment

Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 00:16:14 -05:00
rocking5566
fdfd7eb597 ckProfiler for layernorm (#330)
* Refine parameter

* Add base class for layernorm

* Add layernorm instance

* Add layernorm to ckProfiler

* Remove redundant

* Add verification

* Fix compile error due to merge
2022-08-11 17:03:54 -05:00
Chao Liu
146972f447 fix bug in gemm profiler (#344) 2022-08-07 12:23:32 -05:00
Chao Liu
75ab874e02 Update Group convolution (#341)
* add conv oddC

* update example

* update example

* fix bug in example

* fix bug in group conv example
2022-08-03 12:28:33 -05:00
Chao Liu
500fa99512 Clean up conv example, Instances, profiler and test (#324)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test
2022-07-29 18:19:25 -05:00
zjing14
7959dad566 Grouped Gemm device with multiD grid (#319)
* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* sperate examples

Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-07-21 10:07:01 -05:00
Shaojie WANG
763ca61581 add conv1d/3d bwd weight instances (#318)
* add conv1d/3d bwd weight instances

* add profiler code
2022-07-08 15:42:20 -05:00