Commit Graph

104 Commits

Author SHA1 Message Date
guangzlu
ebb0fa5b3b Add instance for elementwise normlization (#573)
* added instances for large N

* add instance for elementwise normlization

* added supported restrict in device_elementwise_normalization_impl.hpp

[ROCm/composable_kernel commit: 76d144fa7c]
2023-02-09 09:37:29 -08:00
ltqin
7cb7031d6a Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576)
* add instance for gemm bias softmax gemm

* add client example

* change CGridDesc_G_M_N to CGridDesc_G_M_O

* add gridwise

* change c grid name

* device add d0s data

* fix 08 client_example

* add example 47_fused_attention

* example output correct

* add d0 to example

* add d0 element op

* rechange instance code

* change Acc0ElementwiseOperation to C0DEElementwiseOperation

* change example name

* update instance for cdeelementwiseop

* add bhalf_t ScaleAdd

* add test

* not surport geem1 bias

* remove some ignore

* fix test bug

[ROCm/composable_kernel commit: 332ccc3367]
2023-02-08 14:34:45 -06:00
Adam Osewski
a203d3db7b Add more instances for irregular GEMM sizes. (#560)
Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: 7494c1c611]
2023-01-26 13:42:20 -06:00
Qianfeng
2c1a324b99 Batchnorm inference instances, external API, client examples and gtests (#531)
* File renaming and class renaming for device element-wise operation

* Add batchnorm-infer instances, external API and client example

* Add batchnorm-infer profiler module and gtests

* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp

* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer

* Rename class and file due to conflict from device_elementwise_2d.hpp

* Fix namespace in batcnnorm_infer_nhwc client example

[ROCm/composable_kernel commit: a1b2441f8d]
2023-01-25 17:09:04 -06:00
ltqin
c87d5b6832 Add multiD Gemm client APIs (#534)
* start add example

* fix config

* fix showinfo bug

* add an elementop

* change to padding

* add xdl example

* change elementwiseop

* add instance

* add instance to profiler

* change file name

* fix deive not support issue

* add client example

* fix client gemm_add_multiply name

* change AddMultiply elementwiseop

* fix elementwiseop

* fix client example

* fix addmultiply op

* fix comments and fun name

Co-authored-by: letaoqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: d66421fe34]
2023-01-18 11:53:56 -06:00
ltqin
26767954fd Add client API/examples for 3xGemm+Bias+Add+Permute{0, 2, 3, 1} (#550)
* add example

* fix example

* add instance for gemm permute

* add to client example

* change configs

* change instance file name

* formate

* change client example file name and remove example

[ROCm/composable_kernel commit: 55236709e2]
2023-01-18 10:52:52 -06:00
Qianfeng
46a0aceec1 Reduction external API and client examples (#493)
* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the functional call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 80e0526741]
2023-01-16 22:18:06 -06:00
zjing14
afa7c8eea1 Add MNK padding, M = 0 support into grouped_gemm (#539)
* add mnk padding, support m=0

* clean code

* clean code

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>

[ROCm/composable_kernel commit: 0345963eef]
2022-12-15 15:07:24 -06:00
Rostyslav Geyyer
9cdd223de5 Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535)
* Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances

* Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances

* Add gemm_add_fastgelu profiler impl

* Add padding device_gemm_fastgelu_xdl_c_shuffle instances

* Add gemm_fastgelu profiler impl

[ROCm/composable_kernel commit: 9a1f2475e3]
2022-12-14 18:12:09 -06:00
Rostyslav Geyyer
7355f95afe Add padding device_gemm_xdl instances (#529)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: c7a4d36147]
2022-12-07 17:46:03 -06:00
ltqin
621c12302f Add multiple d gridwise gemm on Navi21 for ResNet50 (#517)
* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add head file(was deleted before)

* fix not support device issue

* format

* remove passthrough check

Co-authored-by: letaoqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 23ecf0fa9e]
2022-12-02 11:42:31 -06:00
rocking5566
8e868bf880 gemm, conv perchannel quantization (#503)
* Use gemm_multiple_D instead

* Add gemm bias relu quantization example

* Add pure gemm quantization example

* Add quantization of perchannel conv + bias + relu example

* Refine the code

* Rename multiplier to requant_scale

* Rename the folder

* Remove redundant comment

* Rename the file. Prepare to add perchannel

* Add conv perchannel instance

* Move to quantization folder

* Add conv perchannel client example

* Apply Rangify constructor of HostTensorDescriptor & Tensor<>

* Fix merge error

[ROCm/composable_kernel commit: ad541ad6b9]
2022-11-30 14:13:04 -06:00
Qianfeng
c3bb3db252 BatchNorm backward instance/external API/profiler/tests (#519)
* Refine the device batchnorm-backward base API templates and data type assignments

* Remove duplicated kernel file

* Add batchnorm backward instances and external API

* Add batchnorm-backward profiler and tests

* Add client example which uses batchnorm backward external API

* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory

* Loose the threshold for batchnorm-backward check_err()

[ROCm/composable_kernel commit: 63af525c06]
2022-11-30 13:32:20 -06:00
Qianfeng
1a6febf03a Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test (#516)
[ROCm/composable_kernel commit: 5bf0475afd]
2022-11-28 14:33:00 -06:00
Qianfeng
144efbf9b6 BatchNorm forward instance/external api/profiler/tests/client example (#511)
* Update to device_batchnorm_forward base class to include all template parameters for problem description

* Add batchnorm forward instances and external api

* Add batchnorm forward profiler module which uses the external api

* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]

* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward

* Improvement to the batchnorm infer base API

* Add batchnorm forward client example which shows using the batchnorm forward external API

* Add test for batchnorm forward

* Tuning the batchnorm profiler initialized values and error threshold

* Add support for bhalf_t in instances/external api/tests

* Add support for int8_t in instances/external api/tests

* Add support for double in instances/external api/tests

* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances

* Checking before running best instance in batchnorm_fwd_nhwc client example

* Add checking for YElementwiseOp in batchnorm_forward external API

* Add more types in batchnorm forward profiler

* Add more test lengths

Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: 4e6a5575be]
2022-11-24 18:02:27 -06:00
Adam Osewski
0f3d9639a8 Client examples AddFastGelu and FastGelu + instances. (#509)
* FastGelu support for more data types.

* AddFastGelu & FastGelu instances.

* Client example.

* clang-format

* Remove unused stride variable.

* Add new line at EOF.

Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: 43a889b72e]
2022-11-19 22:08:26 -06:00
guangzlu
3941d1e815 Add BF16 tests for batched_gemm_softmax_gemm_permute (#504)
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm

* added bf16 tests for batched_gemm_softmax_gemm_permute

* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* aligned annotations

* modified CMakeLists for examples

* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl

* use macro to control the instances

* added macro control into instances

* clang-format some files

* changed error tolerance for bf16

* changed index for 10_elementwise_normalization

* fixed xdlops code bug in amd_xdlops.hpp

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 4c4c7328a6]
2022-11-15 16:30:23 -06:00
ltqin
3073d82b47 Add Conv Backward Data on Navi21 for ResNet50 (#499)
* start add example

* add device dl

* change launch kernel

* change init data method

* change example config

* add config valid check

* add instance for dl bwd

* add instance to ckProfiler

* reserver to profiler and cmakelist

* add instance to ckProfiler2

* change instance f32 config

* fix example return value

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: db0eb1ea9c]
2022-11-15 16:22:20 -06:00
Po Yen Chen
e418b29268 Introduce ck::accumulate_n() (#439)
We can use this template to eliminate duplicated iterator computing
logics. By providing return type to ck::accumulate_n(), we can avoid
type conversion operations.

[ROCm/composable_kernel commit: 730204eed0]
2022-11-14 19:53:39 -06:00
Po Yen Chen
44e669e0dd Add client example of grouped conv2d backward weight (data type: fp16) (#498)
* Remove redundant CMake setting

* Extract common code from files

* Rename folder 'convnd' to 'conv'

* Use std::array<> to accept compile-time kwnown # of arguments

* Fix compilation error of tuning parameter

* In example, use same setting as unit-test

* Remove no-longer used include directive

* Add interface for grouped conv bwd weight

* Add group support for conv bwd weight

* Add grouped conv bwd weight example

* Use group parameter in example

* Rename example folder

* Remove non-grouped version example source files

* Rename device op template

* Add group support to convolution backward weight

* Remove debug messages

* Use smaller group size in example

* Use named variable as loop terminate condition

* Prettify example output message

* Enlarge used grid size

* Allow real grid size exceeds expected grid size

* Rename interface file

* Add client example for grouped conv2d bwd weight

* Fix wrong include directive

* Rename client example folder

[ROCm/composable_kernel commit: 38470e0497]
2022-11-09 18:50:03 -06:00
Po Yen Chen
7af0ecfa69 Remove interface 'DeviceGroupedConvBwdData' (#500)
* Remove interface 'DeviceGroupedConvBwdData'

* Remove no-longer needed include directive

* Rename client example folder

[ROCm/composable_kernel commit: 67423a2275]
2022-11-09 18:32:17 -06:00
guangzlu
b569103030 Fused elementwise normalization (#492)
* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp

[ROCm/composable_kernel commit: 8a4253baaf]
2022-11-03 12:01:58 -06:00
Anthony Chang
619cd1fc1b remove atten kernel workarounds as we move over to rocm 5.3 (#496)
[ROCm/composable_kernel commit: 451f1e3d65]
2022-11-02 16:56:07 -06:00
Po Yen Chen
77544fdc8c Add client example of grouped conv2d backward data (data type: fp16) (#481)
* Improve example reusability

* Remove no-longer used file

* Rename folder of grouped_conv_bwd_data example

* Add normal grouped conv bwd example

* Add interface 'DeviceGroupedConvBwdData'

* Prettify comment of device op type arguments

* Add grouped conv2d/conv3d backward data fp16 instances

* Fix wrong template argument

* Add grouped_conv2d_bwd_data client example

* Use simpler expression to calculate memory size

* Fix formating

* Remove grouped_conv3d_bw_data instances

Underlying device operator is not ready to handle 3D input

* Remove no-longer necessary include directive

* Add missing include directive

* Use more realistic conv param in example

[ROCm/composable_kernel commit: 9e57a290af]
2022-11-02 16:54:41 -06:00
Rostyslav Geyyer
5de0cc39d0 Add pipeline v1/v2 selector, add more instances (#381)
* Add gridwise gemm pipeline v1/v2 selector

* Pipeline selector working, test-wise add pipeline options to one instance

* Add gemm instances

* Add debug info to DeviceGemmXdl

* Add debug info to DeviceGemmXdl_CShuffle

* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu

* Minor fix

* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm

* set up inter-wave configuration

* use defualt loop scheduling for supported gemm ops

for blanket-applying interwave scheduling for all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. this should be discouraged though as it is not covered by CI

* Add enum PipelineVersion

* Update instances

* Format

* Fix the merge conflict

* Add flags to disable added instances

* Test disable flag check

* Disable flag check

* Enable the instances

Co-authored-by: Anthony Chang <ac.chang@outlook.com>

[ROCm/composable_kernel commit: 1a0b0e7bec]
2022-11-02 16:50:48 -06:00
Adam Osewski
efd02ae56c Softmax unit-test reduction across all and non innermost dims cases. (#406)
* Add reduction across all dims cases.

* host softmax: handle all reduce

* Test cases when reduced dim is not innermost axis.

* Fix syntax.

* Test non innermost dim for fp32 and int8

* Group test suites wrt NumReduceDim.

* Additionally test failing cases.

* Throw error when Rank or NumReduceDims doesn't match arguments.

* Check reducedDims has correct values

* Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
Instead implement own. (in fact just get rid of one check to enable
reduction across inner dimensions).

* Reorganize unit tests to better cover use scenarios.

* Test input validation
* Test reduction of inner dimensions with custom op instances.

* Refactor fp32 and int8 unit tests.

* Fix FP32 instance template parameters.

* Add more instances.

* Instances with InSrcVectorDim=0.

* Do not initialize and copy data when arg not supported.

* ckProfiler Softmax use instance factory.

* Refactor device softmax IsSupported.

* Additionally add non-polymorphic api functions

* Split softmax instances into multiple files.

* Fix profiler.

* Reorganize tests to reuse profiler and cover edge cases.

* Clang-format

* I8 Softmax instances along with UT.

* Reuse type alias definitions from instance factory header.

* Clean included headers

* Fix variable names.

* Add missing checks in Argument constructor.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

[ROCm/composable_kernel commit: 6d8614ee50]
2022-11-02 16:46:08 -06:00
rocking5566
9c3ea1fcc5 Conv perlayer int8 quantization (#471)
* Add conv2d requant example

* Fix bash error

* Rename example

* 1. Rename gemm quantization
2. shares the requantization lambda function with conv

* Refine declare type

* Add conv bias relu quantization exmaple

* clang format

* Fix compile error due to merge develop

* Fix CI error

* Extract quantization post operation into another file

* Support quantization for non piecewise linear function

* Add instance for conv quantization

* Add convolution quantization factory

* Add convolution quantization client example

* Add more instances with different template parameters

* clang format

* Sync the naming with the develop

[ROCm/composable_kernel commit: 226bc02b73]
2022-11-02 13:56:26 -06:00
ltqin
04a615957e Add Conv Forward on Navi21 for ResNet50 (#490)
* add device of dl

* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3

* init version for dl conv

* add example(init)

* result right

* disable elementwise operation

* check parameters

* add fp32,int8 example and change check code

* change deive file and class name

* add check vector access of C

* add instance

* add to ckProfiler

* add Filter1x1Pad0 instances

* fix ignore error

* fix for CI

Co-authored-by: letaoqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 8ee36118be]
2022-10-31 10:24:25 -06:00
Anthony Chang
0c8306f1cd fix missing -fPIC flag for conv3d_fwd instance lib (#473)
[ROCm/composable_kernel commit: d8f5f717f8]
2022-10-27 15:32:46 -06:00
Anthony Chang
a109f73b95 Input/output permutation for fused attention (#460)
* reopen masking att instance due to CI is upgraded

* re-enable instances previously failed on 9110

* enable ksize-kpadding pair validity test

* add non-masked attention+permute test; expose masking boolean to attention kernel handles

* disable bench

* fix test

* move files

* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute

* format

* amend rename

* disable bench in test

* add mask/no-mask test for non-permute attention kernels

* disable broken kernel instance

* example working

add non-permuted problem statement

evaluating whether overhead comes from permutation or the extra kernel arg

* interface for bias addition without implementing it

* test and profiler running

* tidy

* mask type determined by enum class

* unify example code

* move masking specialization to its own header

* align formats

* extract helper functions

* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute

* add tensor specialization to template args

since tensor spec packed shows perf parity when permutation isn't needed

remove redundant template args

comment on 'packed' tensor specialization

* grouped attention with input/output permute example

* format

* clean up

* refactor acc0 tile visitor

Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: de37550f72]
2022-10-27 14:58:20 -06:00
Rostyslav Geyyer
8eaa9ee4cc Fix Batched Gemm op for int8 data (#482)
* Fix for lwpck-425, update BlockTransferSrcVectorDim

* Revert "Fix for lwpck-425, update BlockTransferSrcVectorDim"

This reverts commit fd24e280e2.

* Add Batched Gemm int8 test, expect it to fail

* Format

* Re-add the fix

[ROCm/composable_kernel commit: cd51732690]
2022-10-27 12:34:04 -06:00
Qianfeng
95dacf7c96 Update to the Reduction API and instances (#476)
* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances

* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces

* Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock

* Split the cpp and hpp files for reduction instances to enable more parallel compiling

* Remove the using of macros for declaring reduction instances and instance references

* Update to add_device_reduce_instance_xxxx templated functions

* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to repace the ReduceOpId in defining add_reduce_instance_xxxx() templates

* Change return format

[ROCm/composable_kernel commit: dda3a0a10b]
2022-10-25 09:39:11 -06:00
guangzlu
537f7af7b7 Revert "Fused elementwise layernorm (#468)" (#491)
This reverts commit ba4d3575b2.

[ROCm/composable_kernel commit: 6ea9257e9d]
2022-10-25 18:37:12 +08:00
guangzlu
ba4d3575b2 Fused elementwise layernorm (#468)
* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp

[ROCm/composable_kernel commit: efbcc6eddc]
2022-10-25 10:23:20 +08:00
Adam Osewski
c747be612f Refactor device op implementations into impl subdirectory. (#420)
* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: 3048028897]
2022-10-13 09:05:08 -05:00
rocking5566
1dcaa3991f Fix bug of layernorm ckProfiler and refine code (#448)
* Fix bug of profiler for layernorm

* 1. Rename layernorm into normalization
2. Decouple softmax from normalization

* clang-format

[ROCm/composable_kernel commit: 1b62bfaa2a]
2022-10-12 21:06:39 -05:00
Shaojie WANG
5f03000127 Optimization for gridwise group norm (#453)
* use another instance to check the efficiency

* optimize group layer norm

* 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and divison into a outer static loop

* add more instances to layernorm

* add 2 more test cases

* remove ignore in generating tuple of vector

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 40942b9098]
2022-10-06 21:24:13 -05:00
JD
9adbddfb55 Fix device instance libarary to include all instances (#418)
* fix device instance library to add all instances

* remove cppcheck from requirements.txt

Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 2c6d63d031]
2022-09-23 13:30:18 -05:00
Chao Liu
b06deda080 fix build (#434)
* fix

* fix

* add instance

[ROCm/composable_kernel commit: e9d4e893e5]
2022-09-22 12:32:41 -05:00
Shaojie WANG
a121a10a26 MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425)
* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check lef bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete used comments

* format

Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: ebab84b6f9]
2022-09-20 12:43:53 -05:00
rocking5566
f09ecf09f6 Group norm (#417)
* Add groupnorm example by layernorm
1.  Reference is not ready
2. shape of gamma and beta need to be fix

* Let shape of gamma and beta can be same as x

* Modify test, instance and client example

* [What] Fix bug of layernorm for greater than 2 dimension.
[Why] We need to get upper length from merge transform instead of embed transform.

* Add reference for groupnorm

* Fuse sigmoid after groupnorm

* [What] Rename original layernorm into layernorm2d
[Why] Prepare to add groupnorm using layernorm5d

* clang-format

* Add groupnorm test

* Refine error message

* Add groupnorm ckProfiler

* Test groupnorm kernel from device_instance

* update example

* upadte profiler

* Fix test naming

* Fix argc number

* Move descriptor and sweeponce to argument for quick debugging

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 4eba345f6e]
2022-09-19 22:30:46 -05:00
Anthony Chang
efd8aaa0df Add batched attention special kernel instances (#424)
* sanity check

* add attribution

* add irrgular k tile size for batched attention

* format

[ROCm/composable_kernel commit: 7c788e10ce]
2022-09-19 19:20:54 -05:00
ltqin
9d42129ff3 batched_gemm + multiple_d + gemm + multiple_d (#394)
* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first all code,not verify

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug for lunch kernel ignore parametes

* add pad and fix about bug

* mutiple d0

* add dynamic d0_element_op

* change profiler and  instance to mutiple d0

* example have 2 d0

* remove some comments not using

* change 2 d0 have self  parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 370efa6c08]
2022-09-14 17:54:18 -05:00
Anthony Chang
311435b21a Fused attention instances & padding tests (#395)
* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
Also added compile-time conditionals that load OOB value as NaN only after padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional error by 1e-3

Still unsure whether this kind of deterministic floating point accurary issue is expected
or not. May want to try exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around ~1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this fall back to less ideal way of handle NPadding in fused attention kernel

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected

[ROCm/composable_kernel commit: 868e5c555b]
2022-09-06 14:38:56 -05:00
Anthony Chang
2d119fda7b GemmGemm TNNT instances (#399)
* add gemm_gemm TNNT instance

* sanitize Gemm1KPack

* disable instances that failed validation on mi100

[ROCm/composable_kernel commit: fe52c94c98]
2022-09-06 13:38:01 -05:00
Adam Osewski
5643625481 Softmax client example (#396)
* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format

Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: 3da5c19e62]
2022-09-06 12:22:48 -05:00
zjing14
bac57cc4e5 [Hotfix] SplitK Gemm fp32 (#401)
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

* use gemm_xdl_splitK_c_shuffle into profiler

* remove device_gemm_xdl_splitk.hpp

[ROCm/composable_kernel commit: 7589116121]
2022-09-02 11:16:09 -05:00
Anthony Chang
1fbd80b0a0 Implement padding and sanity checks for fused GEMM+GEMM (#376)
* GemmPadder and GemmGemmPadder

* proper padding using GemmGemmPadder

* test gemm_gemm padding

* properly check size K in IsSupportedArgument()

* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()

* comment

* format

[ROCm/composable_kernel commit: f4047c9418]
2022-08-23 10:01:02 -05:00
Qianfeng
25cca7c462 Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host refer implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove un-needed including in batchnorm example

* Renaming generic_elementwise to elementiwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device

[ROCm/composable_kernel commit: 53ea4713af]
2022-08-15 10:11:02 -05:00
cloudhan
f1a2efceb5 Change all device operations to use add_instance_library (#338)
* Change all device operations to use add_instance_library to avoid duplicated cmake configuration.

* update DeviceMem

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: fb1cbf025b]
2022-08-13 12:17:58 -05:00