Commit Graph

107 Commits

Author SHA1 Message Date
ltqin
d8b41e1c96 Example contraction splitk (#430)
* start split k

* add base device class

* add example after merge develop

* add gridwise gemm

* add b matrix split k

* split=1

* change name for kb

* not bias result right

* bias only add once

* fix register spill

* regular code

* add fp32 example

* fix for 64bit index

* fix CheckValidity of gridwise
2022-10-11 17:54:34 -05:00
Chao Liu
e9d4e893e5 fix build (#434)
* fix

* fix

* add instance
2022-09-22 12:32:41 -05:00
zjing14
01876afafe fixed G offset calc for long_index (#428) 2022-09-21 10:15:43 -05:00
Chao Liu
567f70f552 fix build (#427)
* fix build

* fix build
2022-09-20 14:56:33 -05:00
Shaojie WANG
ebab84b6f9 MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425)
* add lower triangle bmm

* init code for tile skipping

* functionality right with lower triangle mask

* add decoder lower triangular mask calculation

* use 7*13 group

* fix n2 compute error

* attention with lower triangle mask with tile skipping

* add template to distinguish masking kernel

* rename template and remove default template value

* remove lower triangle gemm reference struct

* add some comments on example

* add 10 instance for masking bmm + scale + softmax + bmm + permute kernels

* add test

* add test file

* add gtest for bmm masking scale softmax bmm permute

* clang-format

* fix compile error

* check lef bottom corner for tile skipping

* fix error: check left bottom corner for tile skipping

* add k padding

* add test and instance for MNK padding

* passing a mask struct

* fix instances

* delete used comments

* format

Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-20 12:43:53 -05:00
rocking5566
4eba345f6e Group norm (#417)
* Add groupnorm example by layernorm
1.  Reference is not ready
2. shape of gamma and beta need to be fix

* Let shape of gamma and beta can be same as x

* Modify test, instance and client example

* [What] Fix bug of layernorm for greater than 2 dimension.
[Why] We need to get upper length from merge transform instead of embed transform.

* Add reference for groupnorm

* Fuse sigmoid after groupnorm

* [What] Rename original layernorm into layernorm2d
[Why] Prepare to add groupnorm using layernorm5d

* clang-format

* Add groupnorm test

* Refine error message

* Add groupnorm ckProfiler

* Test groupnorm kernel from device_instance

* update example

* upadte profiler

* Fix test naming

* Fix argc number

* Move descriptor and sweeponce to argument for quick debugging

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-19 22:30:46 -05:00
Po Yen Chen
f584ab0c54 Add 'Permute' device op & example (#408)
* Add example folder for 'DeviceElementwise'

* Re-structure example files

* Move common parts into common.hpp

* Use more strict input

* Add more helper methods in 'DeviceElementwise'

* Use more specific method to write example

* Allow specify problem through command line argument

* Allow specify problem 'axes' through command line argument

* Add check to template type argument

* Add transpose_shape() to generalize shape permute

* Generalize transpose utility functions

* Use better name for tensor indices

* Add checks in helper functions

* Remove debug messages

* Refine error message for check_err()

* Generalize variable naming in example code

* Add device op 'DevicePermute'

This device op is clone of 'DeviceElementwise'

* Use 'DevicePermute' device op in example

* Remove 'elementwise' from identifiers

* Remove 'elementwise' from file paths

* Remove base class of 'DevicePermute'

* Let 'DevicePermute' inherit from 'BaseOperator'

* Add simple type traits to validate device op type

* Add static_assert() to check type constraints

* Create 'DevicePermuteBase' to generate methods

* Use indirect base type to generate methods

* Remove 'is_device_op<>' type traits

* Only accept single-input-single-output for 'DervicePermute'

* Simplify 'DevicePermute' interface

* Re-format 'DeviceElementwise'

* Use CRTP to generate overridden virtual method

* Remove unnecessary include directives

* Distinguish input & output shape in 'DevicePermute'

* Passing 'axes' to 'DevicePermute'

* Use more reasonable return value for Invoker::Run()

* Add 'GridwisePermute' kernel

This kernel is a clone of 'GridwiseElementwise_1D'

* Remove no-longer used type argument

* Check if input/output shape meet the requirement

* Remove no-longer used method

* Remove never-entered-if-clause

* Change problem description for 'DevicePermute'

* Transform descriptor into 3 dimensions

* Add debug code the verify result

* Add comment to indicate template argument location

* Add N/H/WPerBlock template parameter to 'DevicePermute'

* Rename 'GridwisePermute' to 'GridwiseCopy'

* Check tensor descriptor dimensions in 'GridwiseElementwise_1D'

* Add missing include directive

* Add 'BlockSize' parameter to 'DevicePermute'

* Remove no-longer used method

* Add 'BlockToTileMap' for 'GridwiseCopy'

* Use the normal Block2TileMap convention

* Rename 'BlockToTileMap' as 'Block2TileMap'

* Fix most of compilation errors

* Let 'Block2TileMap' map block to 2d coordinate

* Allow data transfer in 'GridwiseCopy'

* Fix wrong output descriptor for 2nd blockwise copy

* Rename 'GridwiseCopy' as 'GridwisePermute'

* Remove '1d' in identifiers

* Remove commented-out codes

* Remove 'MPerThread' template parameter

* Seperate template parameters

* Unify variable namming convention

* Use more verbose way to create expressions

* Add template parameter 'InBlockLdsExtraW'

* Release the constraint on In/OutGridDesc

* Use date type directly as template argument

* Re-arrange template arguments for blockwise copy

* Remove no-longer used template parameters

* Embed layout in the variable names

* Add GridwisePermute::CheckValidity()

* Extract local types as template parameters

* Rename local type alias

* Add more template parameters (vector width related)

* Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions

* Fill tensor values start from 1

* Re-formate example code

* Avoid too-large block id

* Add comment

* Make sure 'SrcVectorDim' is not same as 'DstVectorDim'

* Add check for the 'VectorDim' & 'ScalarPerVector' template params

* Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc

* Remove no-longer used template parameter 'NPerBlock'

* Fix wrong descriptor creation logics

* Specify problem in each examples

* Use better example name

* Add new example 'example_permute_NxHxW_fp32'

* Add example for demonstrating bundle multiple elems in tensor

* Add support to permute multiple elements together

* Change the default problem size

* Add span<> class template

* Use span<> to generalize check_err() interface

* Fix ambiguous ctor call

* Avoid create necessary objects

* Use helper functions to simplify example code

* Add example for 4xfp16 permute

* Disable failed-to-compile example

* Add check for the NUM_ELEMS_IN_BUNDLE

* Remove redundant parameter in helper lambda function

* Add check for the input tensor type's byte-size

* Check scalar-per-vector with padded length

* Use more verbose name to avoid name collision

* Use fixed 'VectorDim' & 'ScalarPerVector' for LDS

* Embed shape info in name of descriptor constructor

* Rename example folder '36_permute' into '37_permute'

* Avoid using too-large LDS in kernel code

* Remove redundant example

* Usw switch() to group similar codes

* Add const to the span<> type arguement

* Simply initialize tensor with floating point values

* Use fp16 as data type in all examples

* Enlarge tensor size in example

* Enalrge N-dim in example

* Add check for the bundled type in example

* Use more stricter error threshold

* Remove global load/store loop in kernel code

* Measure execution time by default

* Use faster device op config for example 'NxHxW_fp16'

* Use faster device op config for example '1xHxW_fp16'

* Use faster device op config for example 'HxWx4_fp16'

* Remove cmd arg parsing logics

* Rename functions

* Extract bundle permutation logic out

* Simplify permute bundle example

* Add Tensor<>::GetElementSpaceSizeInBytes()

* Add Tensor<>::data()

* Use new methods to simplify code

* Use type alias to replace duplicated code

* Use existing method to shorten code

* Allow FillUniformDistribution accept range arugment

* Intialize random values in range

* Add Tensor<>::size()

* Use more meaningful names in permute bundle example

* Use more meaningful names in permute element examples

* Use rangified copy() to copy elements

* Use function return value directly to eliminate variables

* Add to_array() conversion tool to eliminate more variables

* Add Tensor<>::AsSpan<>() to create view of tensor values

* Use AsSpan() to shorten check_err() calls

* Remove no-longer-used 'using' directives

* Move 'using' directive to proper code position

* Remove redudant variables

* Remove useless static_assert()

* Add check for range types

* Declare variable right before first use

* Move long return type as tailing return type

* Add BaseInvokerCRTP<> class template to generate method

* Create new base type for 'DervicePermute' implementations

* Move 'NumDim' template param to the first

* Rename 'DevicePermute' to 'DevicePermuteImpl'

* Add 'noexcept' specifier to CRTP generated method

* Move 'Block2TileMap' definition into 'GridwisePermute'

* Use type alias to reduce code

* Unify naming style in 'DevicePermute'

* Add comments in 'GridwisePermute'

* Rename permute example folder

* Use std::cerr to report error

* Use larger shape in examples

* Rename '38_permute' to '39_permute'

* Make sure we use unsigned type for shape & indices

* Remove opt-ed out assertion

* Remove template BaseInvokerCRTP<>
2022-09-19 21:30:25 -05:00
Anthony Chang
9287b7c6b3 Grouped batched attention + permute (#412)
* grouped attn without batch validates; now move toward grouped batched attn

* grouped batched attention

* working

* remove debug logging

clean up

clean up

* reintroduce g_ prefix back to host tensor variables

* format

* rename file

* restore old file

* rename

* consolidate padded/non-padded attention example

* harmonize padding specialization in attn examples
2022-09-19 16:09:44 -05:00
Shaojie WANG
27858374ac Conv bwd data multiple d (#404)
* init commit of convnd bwd data

* begin compiling example

* have a first version that produce a right result

* refine device level launch kernel code

* add more instances in example and get right results

* clang-format

* format example file

* add more instances

* fix instances

* adding conv_bwd_data multile_d

* adding conv_bwd_data multile_d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* adding conv_bwd multiple d

* refactor

* refactor

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* adding conv bwd data multiple d

* refactor

* update conv fwd's bias impl

* refactor

* reorg file

* clean up cmake

* clean

* clean

* clean

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-09-19 11:25:28 -05:00
Chao Liu
43c898f6ff disable print for group conv multiple D (#421) 2022-09-16 09:46:32 -05:00
ltqin
370efa6c08 batched_gemm + multiple_d + gemm + multiple_d (#394)
* refactor

* start

* add device gemm file

* add BatchStrideD0

* add stridd0

* add gridwise file

* add d0 parameters to gridwise gemm

* add c layout transformer

* add d0 threadwise copy

* init kernel

* init kernel

* regular code

* nm desc put to out

* kernel parameter can not use reference

* host add bias+gelu

* run right for bias+gelu

* change AddFastGelu into another file

* interface add d1 bias parameters

* add d1 parameter to argument

* add d1 parameter to gridwise

* first all code,not verify

* gelu change to relu and GetElementSpaceSize bug

* add instance

* start add to ckprofiler

* ckprofiler finish code

* change input parameter for ckProfiler

* fix host bias+gelu bug

* show help for ckProfiler

* fix bug for lunch kernel ignore parametes

* add pad and fix about bug

* mutiple d0

* add dynamic d0_element_op

* change profiler and  instance to mutiple d0

* example have 2 d0

* remove some comments not using

* change 2 d0 have self  parameters

* change d element_op name

* change class name(multiple_d)

* fix bug

* fix bug that don't find file

* update profiler

* refactor

* update profiler

* clean

* revert example change

* add gon layout

* optimize parameter for gno

* add gon to gemm+gemm

* change helping input parameters

* change to GemmPadder_v2

* using ForEach

* fix gb_per_sec

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: ltqin <letaoqin@amd.com>
2022-09-14 17:54:18 -05:00
carlushuang
efd1d25733 embedding fuse layernorm (#405)
* add gridwise/device sparse embedding

* update code

* update code

* remove useless makefile

* code fix

* workable

* work properly

* emb add

* add more instance

* format

* remove useless code

* fix format

* fix clang-tidy

* clean

* fix a compile error

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-09-09 10:41:15 -05:00
Anthony Chang
d6709dc373 Fix gemm-softmax-gemm-permute padding cases (#409)
* fix example; make padding on by default in example; fix argument checks

* fix Gemm1KPacK which has since regressed from PR #399
2022-09-08 09:27:50 -05:00
Anthony Chang
868e5c555b Fused attention instances & padding tests (#395)
* modify comment

* trim unnecessary check

* add gemm spec in kernel name

* add TNTT gemm_gemm + atten kernel instances

* refactor attention padding to better fit in unit tests

This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
Also added compile-time conditionals that load OOB value as NaN only after padding is enabled

* add adhoc padding test for atten

* shrink input value range for attention kernel validation to avoid occasional error by 1e-3

Still unsure whether this kind of deterministic floating point accurary issue is expected
or not. May want to try exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around ~1e-3.

* attention kernel proper granular padding for all 4 dims

* IsSupportedArgument checks

* test more padded cases

* block PadK specialization in attention kernels

* workaround clang crash for gfx908

(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this fall back to less ideal way of handle NPadding in fused attention kernel

* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
2022-09-06 14:38:56 -05:00
Adam Osewski
3da5c19e62 Softmax client example (#396)
* Update Softmax device operation interface.

* Update ckProfiler.

* Update Softmax UT.

* Update example.

* Client example.

* Clang format

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-09-06 12:22:48 -05:00
zjing14
7589116121 [Hotfix] SplitK Gemm fp32 (#401)
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

* use gemm_xdl_splitK_c_shuffle into profiler

* remove device_gemm_xdl_splitk.hpp
2022-09-02 11:16:09 -05:00
Po Yen Chen
46a675aa6f Add examples of Conv + reduction (data type: int4, int8, bf16, fp16, fp32) (#380)
* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle

* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface

* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle

* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'

* Add 'TransformConvFwdToGemm<>' utility class (from Chao)

* Use 'TransformConvFwdToGemm<>' to shorten code

* Fix ill-formed method declaration

* Re-implement MakeRGridDescriptor_M() function

* Change problem description

* Use macro to define layout types

* Define K-reduced output tensor layout types

* Let user to decide R output tensor layout

* Rename variables

* Add padding to the reduced output tensor if necessary

* Extract common code as helper method

* Remove debug message

* Add missing include directive

* Add partial fp16 Conv + Reduction example

* Add example verification code for 2D Conv problem

* Use type alias to simplify code

* Share code across different-dimension Conv problems

* Rename file/functions from run_conv_fwd* to run_convnd_fwd*

* Make example code more verbose

* Add code to support 1D & 3D Conv + Reduction on host

* Add more examples for data type: bf16, fp32

* Add example for int8

* Add custom target to group examples

* Use more general custom target name

* Change the description in error message

* Disable testing for example other than fp32

* Add examplel for int4 (just copy from int8)

* Fix wrong data type

* Use larger data type for intermediate tensors

* Finish int4 example

* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use

* Use named variables to replace magic numbers

* Remove debug messages

* Use same A/B data type for host Conv in int4 example

* Add check for the 'RLayout' type argument

* Group same-dim-layouts together in 'LayoutSetting<>'

* Add 'final' specifier to utility classes

* Use different initialization method for examples

* Remove macro PP_DEFINE_LAYOUT_TYPE()

* Fix code-comment mismatch

* Use more reasonable initialization value for all data types

* Default use init_method=1 for all examples

* Remove never-used code

* Remove confusing out-of-date comments

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-08-31 16:32:17 -05:00
Chao Liu
4df6d93f60 conv+conv (1x1 only) example using gemm+gemm (#393)
* refactor conv

* add conv+conv example, 1x1 only
2022-08-31 11:27:11 -05:00
Shaojie WANG
45adb736e7 Padding for attention: bmm+scale+softmax+bmm kernel (#385)
* add padding algo for bmm+scale+softmax+bmm. Version for verification

* remove verification code

* remove comments

* add padded bmm scale softmax bmm example

* format

* refactor

* add comments for usages of padding bmm+scale+softmax+bmm

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-08-30 11:01:37 -05:00
zjing14
9881625b2d Fixed splitk gemm fp32 (#384)
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean
2022-08-26 09:59:50 -05:00
Po Yen Chen
88e43744d8 Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle (#378) 2022-08-24 10:12:54 -05:00
Anthony Chang
e0d8806ca1 Attention with output permutation (#370)
* comment on specialization for TensorSpecialization::Packed

* gemm_softmax_gemm with output permutation

* scaling

* refactor MatrixPadder; rename to GemmPadder

* remove old sanity check

* restore original gemm_softmax_gemm

* revise comment in gemm_softmax_gemm example

* use GetElementSpaceSize()

* remove extra header

* typo

* remove archaic DeviceOpPtr
2022-08-23 14:52:56 -05:00
zjing14
6091458300 Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361)
* add examples into grouped/batched_gemm

* adding splitK examples

* fixed splitK

* add bfp16 int8 example into splitK

* formatting

* use static_cast

* added common for batched_gemm

* add commons for examples of splitK/batched/grouped_gemm

* return true

* adjust splitK check tol

* update example

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-08-23 14:41:56 -05:00
Anthony Chang
f4047c9418 Implement padding and sanity checks for fused GEMM+GEMM (#376)
* GemmPadder and GemmGemmPadder

* proper padding using GemmGemmPadder

* test gemm_gemm padding

* properly check size K in IsSupportedArgument()

* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()

* comment

* format
2022-08-23 10:01:02 -05:00
Qianfeng
53ea4713af Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host refer implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove un-needed including in batchnorm example

* Renaming generic_elementwise to elementiwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device
2022-08-15 10:11:02 -05:00
rocking5566
0bd6b842b9 Layernorm welford (#346)
* Add threadwise and blockwise welford

* Rename gridwise op, prepare to add welford version

* implement welford and integrate welford into layernorm

* Take care of tail loop

* Fix buf when ThreadSliceK > 1

* Fix bug of merging of two empty set

* Rename clip to clamp

* 1. Fix type of count
2. Remove useless static_assert

* Do not inherit Reduction::Argument

* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)

* Add y stride

* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm

* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'

Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:43:18 -05:00
Anthony Chang
c20a75b07d Fused GEMM+GEMM (#351)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* add gemm_gemm instances and tests

* avoid LDS data hazard

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:18:58 -05:00
ltqin
10b3278b05 Skip lds of b matrix (#326)
* start

* read for gridwise gemm

* add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1

* add thread  copy desc and register buffer

* add K0PerBlock dim

* add read global data

* finish gridwise gemm

* finish blockwise gemm

* add print data

* add smallest config

* add compare code for gridwis gemm

* fix NXdlPerWave

* fix k0perthread and gridewis gemm main loop

* remove b matrix lds alloc

* fix name

* add test code

* create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter

* add double register

* modify b_thread_desc_

* add float

* fp16 tag

* add tail for pipeline

* finish main loop

* optimize main loop

* start clear gridwise gemm

* clear code

* clear redundant code

* change file name

* change file name

* fix bug after merge develop

* fix input parameters

* using MultiK0 control b load data loop

* fix some config

* 4 buffer

* fix bug

* one can use

* change read order

* change buffer array to tuple

* change to 8 buffer

* interleave buffer load

* change to 16

* read 8 buffer

* add data buffer to template

* fix after merge develop(head file)

* format

* change to 4 buffer

* remove unnecessary lambda fun
2022-08-13 01:35:49 -05:00
rocking5566
6c3c06bf1f Gemm multiple d multiple r (#335)
* Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction

* Implement run of kernel

* Add example

* Fix parameter of typo

* Rewrite the reduceMax example

* Rewrite the reduceMean + reduceMeanSquare example

* Refine naming

* Refine folder name

* refine naming

* Rewrite the gemm + bias + relu + add + layernorm example

* Rewrite the gemm + layernorm example

* clang-format

* Fix bug if sync lds

* Fix compile error
2022-08-13 01:07:12 -05:00
Anthony Chang
cac014f173 Fused attention (#345)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in lowest two-dimensions only

* make c descriptors containing only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring an unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwsie softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment

Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 00:16:14 -05:00
zjing14
35e49f2de6 add g; fixed strides (#355) 2022-08-12 15:22:39 -05:00
rocking5566
fdfd7eb597 ckProfiler for layernorm (#330)
* Refine parameter

* Add base class for layernorm

* Add layernorm instance

* Add layernorm to ckProfiler

* Remove redundant

* Add verification

* Fix compile error due to merge
2022-08-11 17:03:54 -05:00
zjing14
e08d68d25d Add batched/grouped_gemm contraction deviceOps (#349)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

* change gemm_c_permute with contraction

* add grouped_contraction

* add contraction in group_gemm

* add example of grouped_gemm with contraction

* add example of grouped_contraction_bias_e_permute

* clean

* fixed ds

* add m3n2 m2n3 examples into gemm_bias_e_permute

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-10 12:20:29 -05:00
Chao Liu
75ab874e02 Update Group convolution (#341)
* add conv oddC

* update example

* update example

* fix bug in example

* fix bug in group conv example
2022-08-03 12:28:33 -05:00
Chao Liu
500fa99512 Clean up conv example, Instances, profiler and test (#324)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test
2022-07-29 18:19:25 -05:00
zjing14
d7d7829096 Batched Gemm with multiD (#329)
* add batched_gemm_multiD

* add ds

* rename file

* add batched_gemm_bias example

* add batch_strides into bmm_c_permute

* clean

* rename example_28 to example_29

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-22 09:33:50 -05:00
zjing14
7959dad566 Grouped Gemm device with multiD grid (#319)
* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* sperate examples

Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-07-21 10:07:01 -05:00
rocking5566
7f21662089 Standalone layernorm (#315)
* Implement layernorm kernel and deviceOp

* verify gpu kernel with host code

* 1. Separate gamma aand beta from affine
2. Check if argument is valid

* clean

* Sync the naming

* Support sweep once mode if we can put k dimension data inside one block

* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.

* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.

* Use 1d descriptor for gamma and beta

* Add accElementwiseOp

* Extract layernorm host code

* Support different YVectorDim in GridwiseLayernorm

* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp

* Gamma and beta can share the VGPR.

* Add test for fp32 and fp16

* Fix bug of concurrency and add test case which may fail orignally

* Propagate NaN for layernorm

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-13 11:16:14 -05:00
Shaojie WANG
763ca61581 add conv1d/3d bwd weight instances (#318)
* add conv1d/3d bwd weight instances

* add profiler code
2022-07-08 15:42:20 -05:00
Chao Liu
4fe9c393b8 N-D Tensor Contraction example, instance, and client example (#270)
* adding contraction

* add contraction example

* update examle

* update example

* format

* update readme

* clean header

* clean header

* contraction with multiple D

* rename

* fix naming issue; add instances for contraction+bilinear

* change assumed virtual layout of contraction; add client example

* update example

* update

* contraction+scale

* use type_convert

* rename
2022-07-07 14:31:11 -05:00
zjing14
334361cbde Batched Gemm with C Permute (#305)
* init commit

* add c_permute

* add mnk padding

* fixed comments

* Fixed comments

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-06 10:38:29 -05:00
Chao Liu
9e4429f9c3 Gemm+Bilinear (#316)
* refactor

* update example

* update example

* gemm bilinear

* clean

* update
2022-07-02 09:15:38 -05:00
guangzlu
8e374781d5 modified grouped gemm addressing method (#307)
* modified grouped gemm addressing method

* modified addressing method in device_grouped_gemm_xdl.hpp

Co-authored-by: root <root@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-01 01:38:21 -05:00
Anthony Chang
63fd5da637 Single-kernel GEMM + layernorm (#263)
* dump lds content in appropriate precision type

* add squared add reduction op; allows sq sum

* initial stub from regular gemm impl

* layernorm example code & host verification

* initial layernorm implementation

* tidy up

* make C0 precision type consistent with C

* clang-tidy and additional comments

* tighten up example code

* account for extra flops/bytes from normalization

* clang-format

* c0 bias/beta/gamma now have its own precision type

* AccElemOp for gemm outputs prior to feeding to layernorm

* update workgroup mapping

* rename kernel template param to reflect its dual use

* use LDS mem pool for reduction workspace

* change cshuffle precision type to f16; clean up

* clang-format

* correct naming

* explicit cast

* fully implemented gemm + bias + activation + add + norm

* activation in correct order

* reflect reduction API's recent change

* amend

* clean up; add comment

* keep up with recent changes in reduction API

* format

* resolve merge conflicts

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-01 01:38:00 -05:00
zjing14
1c8126a4c2 add batch_stride into batched gemm (#314)
* add batch_stride

* fixed test

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-07-01 01:35:37 -05:00
Chao Liu
0dcb3496cf Improve external interface for GEMM and GEMM+add+add+fastgelu (#311)
* interface for GEMM and GEMM+add+add+fastgelu

* rename namespace

* instance factory

* fix build

* fix build; add GEMM client example

* clean
2022-06-30 22:11:00 -05:00
zjing14
fa9a0a5cfb Gemm + bias + c_permute (#312)
* init commit

* add desc

* finished c permute

* fixed vector lens
2022-06-30 19:55:09 -05:00
Anthony Chang
93c99f3d87 Standalone sweep once softmax kernel w/ ckProfiler (#295)
* use 'sweep once' softmax kernel where applicable

* threadwise copy's dst buffer can specify invalid element value

* add int8 in/out float compute softmax support

give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error

* format

* softmax inherits DeviceNormalization

* softmax profiler stub

* tighten up reference softmax interface

* example prints tensor dimension

* add fp32 to softmax profiler

* rename header

* hook with ckProfiler

* format

* resolve merge conflict

* resolve merge conflicts

* update normalization profiler help string

* resolve conflict

* typo

* remove residual

* softmax profiler: address feedback

* test for mixed precision input/output

* fully qualify ck::math::isnan

* add comment for device normalization interface

* revise wording

* constness for alpha/beta scaler pointer
2022-06-30 12:08:50 -05:00
rocking5566
12235112a1 external api for gemm + layernorm (#285)
* Extract base class for elementwise

* Refactor interface of DeviceGemmReduce. Do not use tuple in interface

* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add

* Unify base class of gemm + reduce and gemm + bias + add + reduce

* 1. Rename gemm_bias_add_reduce for external api
 2. Refine cmake

* Add normalize device operation

* [What] Reorder the argument
[Why] Because d0 is also the input of c.

* Add type string

* Add example of gemm_bias_add_layernorm  via external api

* Refactor example code

* clang-format

* Fix compile error

* clang-format

* Add external api for gemm_add_add_layernorm and normalize

* Add client example

* clang-format
2022-06-27 14:25:10 -05:00
Chao Liu
aebd211c36 External Interface (#304)
* add client example

* clean

* clean

* reorg

* clean up profiler

* reorg

* clea

* fix profiler

* function for getinstances

* update client example

* update client example

* update client example

* update

* update example

* update Jenkins file

* update cmake

* update Jenkins
2022-06-26 19:39:02 -05:00