Commit Graph

83 Commits

Author SHA1 Message Date
guangzlu
8a4253baaf Fused elementwise normalization (#492)
* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
2022-11-03 12:01:58 -06:00
Po Yen Chen
9e57a290af Add client example of grouped conv2d backward data (data type: fp16) (#481)
* Improve example reusability

* Remove no-longer used file

* Rename folder of grouped_conv_bwd_data example

* Add normal grouped conv bwd example

* Add interface 'DeviceGroupedConvBwdData'

* Prettify comment of device op type arguments

* Add grouped conv2d/conv3d backward data fp16 instances

* Fix wrong template argument

* Add grouped_conv2d_bwd_data client example

* Use simpler expression to calculate memory size

* Fix formating

* Remove grouped_conv3d_bw_data instances

Underlying device operator is not ready to handle 3D input

* Remove no-longer necessary include directive

* Add missing include directive

* Use more realistic conv param in example
2022-11-02 16:54:41 -06:00
Adam Osewski
6d8614ee50 Softmax unit-test reduction across all and non innermost dims cases. (#406)
* Add reduction across all dims cases.

* host softmax: handle all reduce

* Test cases when reduced dim is not innermost axis.

* Fix syntax.

* Test non innermost dim for fp32 and int8

* Group test suites wrt NumReduceDim.

* Additionally test failing cases.

* Throw error when Rank or NumReduceDims doesn't match arguments.

* Check reducedDims has correct values

* Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
Instead implement own. (in fact just get rid of one check to enable
reduction across inner dimensions).

* Reorganize unit tests to better cover use scenarios.

* Test input validation
* Test reduction of inner dimensions with custom op instances.

* Refactor fp32 and int8 unit tests.

* Fix FP32 instance template parameters.

* Add more instances.

* Instances with InSrcVectorDim=0.

* Do not initialize and copy data when arg not supported.

* ckProfiler Softmax use instance factory.

* Refactor device softmax IsSupported.

* Additionally add non-polymorphic api functions

* Split softmax instances into multiple files.

* Fix profiler.

* Reorganize tests to reuse profiler and cover edge cases.

* Clang-format

* I8 Softmax instances along with UT.

* Reuse type alias definitions from instance factory header.

* Clean included headers

* Fix variable names.

* Add missing checks in Argument constructor.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-11-02 16:46:08 -06:00
rocking5566
226bc02b73 Conv perlayer int8 quantization (#471)
* Add conv2d requant example

* Fix bash error

* Rename example

* 1. Rename gemm quantization
2. shares the requantization lambda function with conv

* Refine declare type

* Add conv bias relu quantization exmaple

* clang format

* Fix compile error due to merge develop

* Fix CI error

* Extract quantization post operation into another file

* Support quantization for non piecewise linear function

* Add instance for conv quantization

* Add convolution quantization factory

* Add convolution quantization client example

* Add more instances with different template parameters

* clang format

* Sync the naming with the develop
2022-11-02 13:56:26 -06:00
ltqin
8ee36118be Add Conv Forward on Navi21 for ResNet50 (#490)
* add device of dl

* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3

* init version for dl conv

* add example(init)

* result right

* disable elementwise operation

* check parameters

* add fp32,int8 example and change check code

* change deive file and class name

* add check vector access of C

* add instance

* add to ckProfiler

* add Filter1x1Pad0 instances

* fix ignore error

* fix for CI

Co-authored-by: letaoqin <letaoqin@amd.com>
2022-10-31 10:24:25 -06:00
Qianfeng
7fa892e63e Batchnorm-forward implemented using welford method to calculate variance (#403)
* Update to the batchnorm-forward API and base class

* Fix leeked header including in gridwise_set_buffer_value.hpp

* Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction

* Update to the batchnorm-forward example to use the new batchnorm-forward device interface

* Change the batchnorm-forward reference to use sequential welford method

* Change to assign the workspace into four buffers in the host layer

* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford

* Tiny correction and remove un-used file under example/34_batchnorm

* Renaming in the kernel arguments

* Explicitly use ck::math::sqrt in batchnorm-forward kernels

* Add some comments to some kernels

* Tiny fix

* Generalize the data types in reference_batchnorm_forward_nhwc_c

* Use ck::ignore to mark un-used parameters

* Move GetReduceCountPerThread functor codes from kernel to device

* Remove some un-used codes in device_batchnorm_forward_impl.hpp

* Tiny fix in batchnorm_forward example

* Move GetReduceCountPerThread() to welford_helper.hpp

* Use seperate data type for Scale and Bias

* Renaming in device Op

* Tiny fix in forward example

* Updata to batchnorm-infer (type spliting, renaming)

* Add time and bandwidth measurement to the batchnorm-forward example

* Add support of elementwise operation for batchnorm forward output

* Reduce object copying by passing object as reference type

* Tiny change for performance

* Updates for performance again

* Some Renamings

* Add GetActualVariance template parameter for ThreadwiseWelfordMerge

* Tiny update in reference batchnorm forward nhwc/c

* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory

* Fuse mean and bias in the normalization calculation

Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
2022-10-27 18:52:54 -06:00
Anthony Chang
de37550f72 Input/output permutation for fused attention (#460)
* reopen masking att instance due to CI is upgraded

* re-enable instances previously failed on 9110

* enable ksize-kpadding pair validity test

* add non-masked attention+permute test; expose masking boolean to attention kernel handles

* disable bench

* fix test

* move files

* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute

* format

* amend rename

* disable bench in test

* add mask/no-mask test for non-permute attention kernels

* disable broken kernel instance

* example working

add non-permuted problem statement

evaluating whether overhead comes from permutation or the extra kernel arg

* interface for bias addition without implementing it

* test and profiler running

* tidy

* mask type determined by enum class

* unify example code

* move masking specialization to its own header

* align formats

* extract helper functions

* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute

* add tensor specialization to template args

since tensor spec packed shows perf parity when permutation isn't needed

remove redundant template args

comment on 'packed' tensor specialization

* grouped attention with input/output permute example

* format

* clean up

* refactor acc0 tile visitor

Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-10-27 14:58:20 -06:00
Qianfeng
dda3a0a10b Update to the Reduction API and instances (#476)
* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances

* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces

* Remove DeviceSoftmaxImpl's depending on DeviceReduceMultiblock

* Split the cpp and hpp files for reduction instances to enable more parallel compiling

* Remove the using of macros for declaring reduction instances and instance references

* Update to add_device_reduce_instance_xxxx templated functions

* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to repace the ReduceOpId in defining add_reduce_instance_xxxx() templates

* Change return format
2022-10-25 09:39:11 -06:00
guangzlu
6ea9257e9d Revert "Fused elementwise layernorm (#468)" (#491)
This reverts commit efbcc6eddc.
2022-10-25 18:37:12 +08:00
guangzlu
efbcc6eddc Fused elementwise layernorm (#468)
* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp
2022-10-25 10:23:20 +08:00
Adam Osewski
3048028897 Refactor device op implementations into impl subdirectory. (#420)
* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-10-13 09:05:08 -05:00
rocking5566
1b62bfaa2a Fix bug of layernorm ckProfiler and refine code (#448)
* Fix bug of profiler for layernorm

* 1. Rename layernorm into normalization
2. Decouple softmax from normalization

* clang-format
2022-10-12 21:06:39 -05:00
Shaojie WANG
ebab84b6f9 MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425)
* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check lef bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete used comments

* format

Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-20 12:43:53 -05:00
rocking5566
4eba345f6e Group norm (#417)
* Add groupnorm example by layernorm
1.  Reference is not ready
2. shape of gamma and beta need to be fix

* Let shape of gamma and beta can be same as x

* Modify test, instance and client example

* [What] Fix bug of layernorm for greater than 2 dimension.
[Why] We need to get upper length from merge transform instead of embed transform.

* Add reference for groupnorm

* Fuse sigmoid after groupnorm

* [What] Rename original layernorm into layernorm2d
[Why] Prepare to add groupnorm using layernorm5d

* clang-format

* Add groupnorm test

* Refine error message

* Add groupnorm ckProfiler

* Test groupnorm kernel from device_instance

* update example

* upadte profiler

* Fix test naming

* Fix argc number

* Move descriptor and sweeponce to argument for quick debugging

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-19 22:30:46 -05:00
Po Yen Chen
f584ab0c54 Add 'Permute' device op & example (#408)
* Add example folder for 'DeviceElementwise'

* Re-structure example files

* Move common parts into common.hpp

* Use more strict input

* Add more helper methods in 'DeviceElementwise'

* Use more specific method to write example

* Allow specify problem through command line argument

* Allow specify problem 'axes' through command line argument

* Add check to template type argument

* Add transpose_shape() to generalize shape permute

* Generalize transpose utility functions

* Use better name for tensor indices

* Add checks in helper functions

* Remove debug messages

* Refine error message for check_err()

* Generalize variable naming in example code

* Add device op 'DevicePermute'

This device op is clone of 'DeviceElementwise'

* Use 'DevicePermute' device op in example

* Remove 'elementwise' from identifiers

* Remove 'elementwise' from file paths

* Remove base class of 'DevicePermute'

* Let 'DevicePermute' inherit from 'BaseOperator'

* Add simple type traits to validate device op type

* Add static_assert() to check type constraints

* Create 'DevicePermuteBase' to generate methods

* Use indirect base type to generate methods

* Remove 'is_device_op<>' type traits

* Only accept single-input-single-output for 'DervicePermute'

* Simplify 'DevicePermute' interface

* Re-format 'DeviceElementwise'

* Use CRTP to generate overridden virtual method

* Remove unnecessary include directives

* Distinguish input & output shape in 'DevicePermute'

* Passing 'axes' to 'DevicePermute'

* Use more reasonable return value for Invoker::Run()

* Add 'GridwisePermute' kernel

This kernel is a clone of 'GridwiseElementwise_1D'

* Remove no-longer used type argument

* Check if input/output shape meet the requirement

* Remove no-longer used method

* Remove never-entered-if-clause

* Change problem description for 'DevicePermute'

* Transform descriptor into 3 dimensions

* Add debug code the verify result

* Add comment to indicate template argument location

* Add N/H/WPerBlock template parameter to 'DevicePermute'

* Rename 'GridwisePermute' to 'GridwiseCopy'

* Check tensor descriptor dimensions in 'GridwiseElementwise_1D'

* Add missing include directive

* Add 'BlockSize' parameter to 'DevicePermute'

* Remove no-longer used method

* Add 'BlockToTileMap' for 'GridwiseCopy'

* Use the normal Block2TileMap convention

* Rename 'BlockToTileMap' as 'Block2TileMap'

* Fix most of compilation errors

* Let 'Block2TileMap' map block to 2d coordinate

* Allow data transfer in 'GridwiseCopy'

* Fix wrong output descriptor for 2nd blockwise copy

* Rename 'GridwiseCopy' as 'GridwisePermute'

* Remove '1d' in identifiers

* Remove commented-out codes

* Remove 'MPerThread' template parameter

* Seperate template parameters

* Unify variable namming convention

* Use more verbose way to create expressions

* Add template parameter 'InBlockLdsExtraW'

* Release the constraint on In/OutGridDesc

* Use date type directly as template argument

* Re-arrange template arguments for blockwise copy

* Remove no-longer used template parameters

* Embed layout in the variable names

* Add GridwisePermute::CheckValidity()

* Extract local types as template parameters

* Rename local type alias

* Add more template parameters (vector width related)

* Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions

* Fill tensor values start from 1

* Re-formate example code

* Avoid too-large block id

* Add comment

* Make sure 'SrcVectorDim' is not same as 'DstVectorDim'

* Add check for the 'VectorDim' & 'ScalarPerVector' template params

* Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc

* Remove no-longer used template parameter 'NPerBlock'

* Fix wrong descriptor creation logics

* Specify problem in each examples

* Use better example name

* Add new example 'example_permute_NxHxW_fp32'

* Add example for demonstrating bundle multiple elems in tensor

* Add support to permute multiple elements together

* Change the default problem size

* Add span<> class template

* Use span<> to generalize check_err() interface

* Fix ambiguous ctor call

* Avoid create necessary objects

* Use helper functions to simplify example code

* Add example for 4xfp16 permute

* Disable failed-to-compile example

* Add check for the NUM_ELEMS_IN_BUNDLE

* Remove redundant parameter in helper lambda function

* Add check for the input tensor type's byte-size

* Check scalar-per-vector with padded length

* Use more verbose name to avoid name collision

* Use fixed 'VectorDim' & 'ScalarPerVector' for LDS

* Embed shape info in name of descriptor constructor

* Rename example folder '36_permute' into '37_permute'

* Avoid using too-large LDS in kernel code

* Remove redundant example

* Usw switch() to group similar codes

* Add const to the span<> type arguement

* Simply initialize tensor with floating point values

* Use fp16 as data type in all examples

* Enlarge tensor size in example

* Enalrge N-dim in example

* Add check for the bundled type in example

* Use more stricter error threshold

* Remove global load/store loop in kernel code

* Measure execution time by default

* Use faster device op config for example 'NxHxW_fp16'

* Use faster device op config for example '1xHxW_fp16'

* Use faster device op config for example 'HxWx4_fp16'

* Remove cmd arg parsing logics

* Rename functions

* Extract bundle permutation logic out

* Simplify permute bundle example

* Add Tensor<>::GetElementSpaceSizeInBytes()

* Add Tensor<>::data()

* Use new methods to simplify code

* Use type alias to replace duplicated code

* Use existing method to shorten code

* Allow FillUniformDistribution accept range arugment

* Intialize random values in range

* Add Tensor<>::size()

* Use more meaningful names in permute bundle example

* Use more meaningful names in permute element examples

* Use rangified copy() to copy elements

* Use function return value directly to eliminate variables

* Add to_array() conversion tool to eliminate more variables

* Add Tensor<>::AsSpan<>() to create view of tensor values

* Use AsSpan() to shorten check_err() calls

* Remove no-longer-used 'using' directives

* Move 'using' directive to proper code position

* Remove redudant variables

* Remove useless static_assert()

* Add check for range types

* Declare variable right before first use

* Move long return type as tailing return type

* Add BaseInvokerCRTP<> class template to generate method

* Create new base type for 'DervicePermute' implementations

* Move 'NumDim' template param to the first

* Rename 'DevicePermute' to 'DevicePermuteImpl'

* Add 'noexcept' specifier to CRTP generated method

* Move 'Block2TileMap' definition into 'GridwisePermute'

* Use type alias to reduce code

* Unify naming style in 'DevicePermute'

* Add comments in 'GridwisePermute'

* Rename permute example folder

* Use std::cerr to report error

* Use larger shape in examples

* Rename '38_permute' to '39_permute'

* Make sure we use unsigned type for shape & indices

* Remove opt-ed out assertion

* Remove template BaseInvokerCRTP<>
2022-09-19 21:30:25 -05:00
ltqin
370efa6c08 batched_gemm + multiple_d + gemm + multiple_d (#394)
* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first all code,not verify

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug for lunch kernel ignore parametes

* add pad and fix about bug

* mutiple d0

* add dynamic d0_element_op

* change profiler and  instance to mutiple d0

* example have 2 d0

* remove some comments not using

* change 2 d0 have self  parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>
2022-09-14 17:54:18 -05:00
carlushuang
efd1d25733 embedding fuse layernorm (#405)
* add gridwise/device sparse embedding

* update code

* update code

* remove useless makefile

* code fix

* workable

* work properly

* emb add

* add more instance

* format

* remove useless code

* fix format

* fix clang-tidy

* clean

* fix a compile error

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-09-09 10:41:15 -05:00
Anthony Chang
fe52c94c98 GemmGemm TNNT instances (#399)
* add gemm_gemm TNNT instance

* sanitize Gemm1KPack

* disable instances that failed validation on mi100
2022-09-06 13:38:01 -05:00
Adam Osewski
3da5c19e62 Softmax client example (#396)
* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-09-06 12:22:48 -05:00
Po Yen Chen
46a675aa6f Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380)
* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle

* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface

* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle

* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'

* Add 'TransformConvFwdToGemm<>' utility class (from Chao)

* Use 'TransformConvFwdToGemm<>' to shorten code

* Fix ill-formed method declaration

* Re-implement MakeRGridDescriptor_M() function

* Change problem description

* Use macro to define layout types

* Define K-reduced output tensor layout types

* Let user to decide R output tensor layout

* Rename variables

* Add padding to the reduced output tensor if necessary

* Extract common code as helper method

* Remove debug message

* Add missing include directive

* Add partial fp16 Conv + Reduction example

* Add example verification code for 2D Conv problem

* Use type alias to simplify code

* Share code across different-dimension Conv problems

* Rename file/functions from run_conv_fwd* to run_convnd_fwd*

* Make example code more verbose

* Add code to support 1D & 3D Conv + Reduction on host

* Add more examples for data type: bf16, fp32

* Add example for int8

* Add custom target to group examples

* Use more general custom target name

* Change the description in error message

* Disable testing for example other than fp32

* Add examplel for int4 (just copy from int8)

* Fix wrong data type

* Use larger data type for intermediate tensors

* Finish int4 example

* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use

* Use named variables to replace magic numbers

* Remove debug messages

* Use same A/B data type for host Conv in int4 example

* Add check for the 'RLayout' type argument

* Group same-dim-layouts together in 'LayoutSetting<>'

* Add 'final' specifier to utility classes

* Use different initialization method for examples

* Remove macro PP_DEFINE_LAYOUT_TYPE()

* Fix code-comment mismatch

* Use more reasonable initialization value for all data types

* Default use init_method=1 for all examples

* Remove never-used code

* Remove confusing out-of-date comments

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-08-31 16:32:17 -05:00
Chao Liu
4df6d93f60 conv+conv (1x1 only) example using gemm+gemm (#393)
* refactor conv

* add conv+conv example, 1x1 only
2022-08-31 11:27:11 -05:00
Adam Osewski
d00e6115b9 Gemm reduce examples int4/int8/fp32/bf16 (#368)
* GEMM + Reduce max fp16+fp32

* GEmm + Max bf16 + int8

* Refactor common definitions.

* Refactor common func of mean meansquare example.

* More examples for mean meansquare.

* Update int8 examples and skip them cause of random errors.

* Int4 examples.

* Fix examples for max int4/8

* Tensor conversion for int4 input data for mean meansquare example.

* Remove int4 mean_meansquare example

* Fix int8 mean_meansquare example.

-All ReductionAccData and R<N>DataType have to be F32. The INT32 data
type is giving wrong results.

* Guard int4 with ifdef

* Change int8 example to add_addsquare due to div rounding err.

* Clang format

* Change the return type of common function.

* Get back int8 example with division.

* Remove int8 mean meansquare.

* Use proper cast for BF16 data type.

* Use ck::literals.

* Use proper data type for host tensors & reference.

- Use ReduceAccDataType for reference gemm output data type.
- Cast host reference output tensor to EDataType
- Fix ifdefs for int4.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-08-30 11:38:26 -05:00
Illia Silin
9061d39bd6 Fix the slow cpu reference batched gemm kernels. (#388)
* fix the performance of the batched gemm verification

* fix tabs
2022-08-29 08:39:21 -05:00
rocking5566
e1a3fff675 layernorm external api (#379)
* Add layernorm client example

* [What] Add default make install dir to gitignore
[Why] client example need to make install
2022-08-24 18:43:43 -05:00
Po Yen Chen
88e43744d8 Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle (#378) 2022-08-24 10:12:54 -05:00
Po Yen Chen
fa2d894be1 Add examples of Gemm (data type: int4) (#367)
* Add GEMM examples for int4

Currently the source files are just copied from int8 examples

* Re-use pre-defined alias in int4 exmples

* Distinguish user-side type from kernel-side type

* Add int4_t support for check_err()

* Allow conversion between Tensor<> specializations

* Re-format source files

* Use different type for host tensors

* Re-use CopyAsType<>() to implement copy ctor

* Re-use element-wise operation type alias

* Fix typo in alias names

* Complete the int4 examples

* Add constraint to Tensor<> templated methods

* Add type traits 'is_signed_integral<>'

* Add type constraints for integer version check_err<>()

* Allow comparing different-sized integral types in check_err()

* Check converted Tensor<int4_t> with golden Tensor<int8_t>

* Remove constraint of Tensor<>::CopyAsType()

* Avoid compilation error while disabling ck::int4_t support

* Remove debug messages

* Add #error directive to prevent compile sources with wrong setting

* Simplify tensor usages in examples

* Add constraint to check_err() input reference type

* Align design with other PR

* Use ""_uz to simplify example code

* Avoid too much generalizing check_err()

* Re-format GEMM instance template arguments

* Extract int4 example common codes

* Sort include directives

* Move #include directives into new header

* Move common codes together

* Re-format template argument in example code

* Reuse same implementation code for most of GEMM examples

* Re-format common.hpp

* Unify structured comment in examples

* Use reinterpret_cast<>() for cross-type pointer conversion

* Revert "Add type traits 'is_signed_integral<>'"

This reverts commit f2c148efae.

* Allow unsigned integer arguments for check_err()

* Fix compilation error in check_err()

* Remove unnecessary copy ctor for Tensor<>

* Mark Tensor<> special member functions as 'default'

* Use more strict condition to add code in examples

* Fix wrong program return value of GEMM examples

* Handle the case while user specify all the strides

* Fix never-ran examples

* Exit successfully if GEMM instance does not support given problem

* Add missing 'else' keyword

* Re-format CMakeLists.txt

* Add wrapper function to hide value conversion while copying memory

* Add new DeviceMem API to copy memory

* Use new DeviceMem API to implement examples

* Revert "Add new DeviceMem API to copy memory"

This reverts commit 3f190b0779.

* Add conversion ctor for Tensor<>

* Write Tensor<> conversion logics explicitly in example code

* Convert Tensor<> values after transfer data to host
2022-08-23 18:25:05 -05:00
Po Yen Chen
2327f1a640 Add example of Gemm + AddAddFastGelu (data type: int4) (#369)
* Add custom target to bundle examples together

* Add int4 example conditionally (just copy from int8 example)

* Extract common code into common.hpp

* Move ref gemm type alias into data-type-specific sources

* Add #error directive to prevent compile with wrong setting

* Let AddAddFastGelu support int4 parameter type

* Let check_err() support int4 parameter type

* Add wrapper function to hide value conversion while copying memory

* Finish int4 example for GEMM + AddAddFastGelu

* Add new DeviceMem API to copy memory

* Use new DeviceMem API to implement examples

* Fix wrongly use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'

* Revert "Add new DeviceMem API to copy memory"

This reverts commit e26e7af71e.

* Add conversion ctor for Tensor<>

* Add 'const' specifier to Tensor<>::CopyAsType()

* Convert Tensor<> values before/after transfer between host & device
2022-08-23 10:38:41 -05:00
Qianfeng
53ea4713af Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host refer implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove un-needed including in batchnorm example

* Renaming generic_elementwise to elementiwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device
2022-08-15 10:11:02 -05:00
cloudhan
fb1cbf025b Change all device operations to use add_instance_library (#338)
* Change all device operations to use add_instance_library to avoid duplicated cmake configuration.

* update DeviceMem

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 12:17:58 -05:00
Anthony Chang
c20a75b07d Fused GEMM+GEMM (#351)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* add gemm_gemm instances and tests

* avoid LDS data hazard

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:18:58 -05:00
Anthony Chang
cac014f173 Fused attention (#345)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwsie softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment

Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 00:16:14 -05:00
Po Yen Chen
a670a5a092 Move literal ""_uz & ""_zu into namespace 'ck::literals' (#354)
* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'
2022-08-12 17:48:35 -05:00
Po Yen Chen
68b61504a3 Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340)
* Add always_false<> util to delay symbol resolution

* Use always_false<> to prevent trying instantiate unwanted method

* Add new specializations of AddAddFastGelu::operator() method

* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32

* Use floating point literal to simplify code

* Remove unnecessary capture in lambda expressions

* Extract fast GeLU calculation as standalone method

* Mark methods as 'constexpr'

* Add constraint for HostTensorDescriptor templated ctors

* Simplify HostTensorDescriptor ctor calls

* Add C++23 std::size_t literal suffix

* Use _uz suffix to shorten example code

* Remove unnecessary conversion to std::array<>

* Re-order include directives

* Remove C-style casting by literal suffix

* Remove unnecessary statements in main()

* Remove unused type parameter of always_false<>

* Remove unused include directive

* Exit main() by returning meaningful value

* Use 'if constexpr' to switch example flow

* Use std::is_same_v<> to shorten example code

* Add 'inline' specifier to literal functions

* Unify output methods in example

* Move common codes into .inc file

* Add type check in type_convert<>()

* Add type_convert<float>() before computation

* Merge AddAddFastGelu method specializations

* Remove always_false<>

* Add constraint to AddAddFastGelu::operator() parameter types
2022-08-11 17:31:28 -05:00
Chao Liu
146972f447 fix bug in gemm profiler (#344) 2022-08-07 12:23:32 -05:00
Adam Osewski
fb0dc35861 CGEMM examples bf16, fp32, int8 (#332)
* Add int8 specialization for elementwise Add and Subtract.

* CGEMM examples bf16, fp32, int8

* Add convert reference output to CDataType.

* Skip BF16 data type during testing.

* Lower K value to get rid of accumulation error.

* Fix merge artifact.

* Fix changed function name: GetElementSpaceSize()

* Fix merge artifact.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-08-02 14:52:27 -05:00
Chao Liu
500fa99512 Clean up conv example, Instances, profiler and test (#324)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test
2022-07-29 18:19:25 -05:00
Illia Silin
d8415a96b3 Add full QA with verification option, few other changes. (#331)
* add verify flag and update scripts

* replace old check_error function with the new check_err

* fix syntax

* remove blank spaces

* remove empty line

* add check_err for tensors

* fix syntax

* replace tensors with vectors in check_err calls

* fix syntax

* remove blank spaces

* fix syntax

* add new line at end of file

* disable conv2d_bwd_weight test, add gpu check

* set check_gpu using export

* check GPU using runShell

* add definition of runShell

* fix script syntax

* reduce the number of threads, add full qa option

* run processing scripts in bash

* fix the branch and host names in performance scripts, add chronos

* replace parameterizedCron with cron

* archive the perf log files

* try to fix git call

* pass branch and host names as arguments into scripts

* fix script arguments

* fix script arguments

* process results on master

* fix pipeline

* add definition of gpu_arch

* run processing scripts in docker

* fix the brackets

* add agent master for the processing stage

* get rid of show_node_info call on master

* try using mici label instead of master, disable MI100 tests for now

* fix syntax

* simplify container for results processing

* remove node(master) from the process_results stage

* put all stages in original order

* change the agent label from master to mici for gfx908
2022-07-21 15:25:46 -05:00
zjing14
7959dad566 Grouped Gemm device with multiD grid (#319)
* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* sperate examples

Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-07-21 10:07:01 -05:00
rocking5566
7f21662089 Standalone layernorm (#315)
* Implement layernorm kernel and deviceOp

* verify gpu kernel with host code

* 1. Separate gamma aand beta from affine
2. Check if argument is valid

* clean

* Sync the naming

* Support sweep once mode if we can put k dimension data inside one block

* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.

* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.

* Use 1d descriptor for gamma and beta

* Add accElementwiseOp

* Extract layernorm host code

* Support different YVectorDim in GridwiseLayernorm

* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp

* Gamma and beta can share the VGPR.

* Add test for fp32 and fp16

* Fix bug of concurrency and add test case which may fail orignally

* Propagate NaN for layernorm

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-13 11:16:14 -05:00
Chao Liu
4fe9c393b8 N-D Tensor Contraction example, instance, and client example (#270)
* adding contraction

* add contraction example

* update examle

* update example

* format

* update readme

* clean header

* clean header

* contraction with multiple D

* rename

* fix naming issue; add instances for contraction+bilinear

* change assumed virtual layout of contraction; add client example

* update example

* update

* contraction+scale

* use type_convert

* rename
2022-07-07 14:31:11 -05:00
Chao Liu
9e4429f9c3 Gemm+Bilinear (#316)
* refactor

* update example

* update example

* gemm bilinear

* clean

* update
2022-07-02 09:15:38 -05:00
Anthony Chang
63fd5da637 Single-kernel GEMM + layernorm (#263)
* dump lds content in appropriate precision type

* add squared add reduction op; allows sq sum

* initial stub from regular gemm impl

* layernorm example code & host verification

* initial layernorm implementation

* tidy up

* make C0 precision type consistent with C

* clang-tidy and additional comments

* tighten up example code

* account for extra flops/bytes from normalization

* clang-format

* c0 bias/beta/gamma now have its own precision type

* AccElemOp for gemm outputs prior to feeding to layernorm

* update workgroup mapping

* rename kernel template param to reflect its dual use

* use LDS mem pool for reduction workspace

* change cshuffle precision type to f16; clean up

* clang-format

* correct naming

* explicit cast

* fully implemented gemm + bias + activation + add + norm

* activation in correct order

* reflect reduction API's recent change

* amend

* clean up; add comment

* keep up with recent changes in reduction API

* format

* resolve merge conflicts

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-01 01:38:00 -05:00
Chao Liu
0dcb3496cf Improve external interface for GEMM and GEMM+add+add+fastgelu (#311)
* interface for GEMM and GEMM+add+add+fastgelu

* rename namespace

* instance factory

* fix build

* fix build; add GEMM client example

* clean
2022-06-30 22:11:00 -05:00
Anthony Chang
93c99f3d87 Standalone sweep once softmax kernel w/ ckProfiler (#295)
* use 'sweep once' softmax kernel where applicable

* threadwise copy's dst buffer can specify invalid element value

* add int8 in/out float compute softmax support

give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error

* format

* softmax inherits DeviceNormalization

* softmax profiler stub

* tighten up reference softmax interface

* example prints tensor dimension

* add fp32 to softmax profiler

* rename header

* hook with ckProfiler

* format

* resolve merge conflict

* resolve merge conflicts

* update normalization profiler help string

* resolve conflict

* typo

* remove residual

* softmax profiler: address feedback

* test for mixed precision input/output

* fully qualify ck::math::isnan

* add comment for device normalization interface

* revise wording

* constness for alpha/beta scaler pointer
2022-06-30 12:08:50 -05:00
rocking5566
12235112a1 external api for gemm + layernorm (#285)
* Extract base class for elementwise

* Refactor interface of DeviceGemmReduce. Do not use tuple in interface

* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add

* Unify base class of gemm + reduce and gemm + bias + add + reduce

* 1. Rename gemm_bias_add_reduce for external api
 2. Refine cmake

* Add normalize device operation

* [What] Reorder the argument
[Why] Because d0 is also the input of c.

* Add type string

* Add example of gemm_bias_add_layernorm  via external api

* Refactor example code

* clang-format

* Fix compile error

* clang-format

* Add external api for gemm_add_add_layernorm and normalize

* Add client example

* clang-format
2022-06-27 14:25:10 -05:00
Chao Liu
aebd211c36 External Interface (#304)
* add client example

* clean

* clean

* reorg

* clean up profiler

* reorg

* clea

* fix profiler

* function for getinstances

* update client example

* update client example

* update client example

* update

* update example

* update Jenkins file

* update cmake

* update Jenkins
2022-06-26 19:39:02 -05:00
Chao Liu
d3051d7517 add license in file (#303) 2022-06-24 23:32:43 -05:00
Chao Liu
d1db6a0c3e Absolute include path (#281)
* ad gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obselete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporary disable client test on Jenkins

* clean

* clean

* clean
2022-06-24 20:51:04 -05:00
Adam Osewski
a2edd7d802 Testing all fwd convolution specializations. (#259)
* UniforFill with integer values.

* Log tested instance type string.

* Add UT for all convolution specializations.

* debugging conv

* Fix dangling reference bug.

* Small refinements.

* Fix call to error checking function.

* Small refinements to tests.

* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.

* Add helper traits for AccumulatorDataType.

* Print first 5 errs in check_err for integral types.

* Rename FillUniform to FillUniformDistribution

* Refactor

* Do not use typed tests.
* Instead use plain fixture class with templatized member functions.
* Initialize tensors with integer values.

* Refine test instances.

* Properly set accumulator data type.
* Add another "big" instance.

* Refactor convolution tests.

* Revert "debugging conv"

This reverts commit b109516455.

* Add pragma once + format + small refinement.

* Fix some unwanted changes.

* Clang-format

* Fix profile_convnd to use renamed tensor initializer.

* Add instances for ConvFWDND kernel case 2D

* Helpers to get ConvNDFwd 2D instances.

* Refactoring.

* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.

* Refine and fix test.

* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.

* Remove AccumulatorDataType type trait.

* Update unit-tests.

* Remove operator<< overload.

* Unlock conv1d/3d nd fwd instances.

* Enable skipping calculating reference using flag.

* Fix number of channels for first ResNet50 layer.

* Clang-format.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-06-22 22:05:04 -05:00
Anthony Chang
15c89e81f0 Standalone softmax kernel (#284)
* initial stub for standalone softmax

* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m

* host softmax validates

* compiles; to implement beta scaling

* use NaN trick to efficiently ignore OOB values during sum of exponentials

* freeload device_reduce's utility functions

* clean up interface

* adding prior value (beta scaling)

* remove restriction related to perf considerations

* apply clang-format

* clean; disable diagnostics

* resolve conflicts

* add exp wrapper

* honor HostTensorDesc interface; allow implicit cast from different vector<T> type

* test softmax for fp16/fp32

* update readme

* amend commit NaN trick

* remove redundant param added during development

* format

* replace ScalarDataType with AccDataType

* separate out test programs by precision type

* move softmax sample code to its own folder

* format

* keep up with recent changes in reduction API

* remove extra header
2022-06-21 14:59:19 -05:00