188 Commits

Author SHA1 Message Date
Bartłomiej Kocot
472fa029ba Enable grouped conv with small K or C (#822)
* Enable grouped conv with small K or C

* Add missing instances

* Refactor grouped conv fwd instances

* Fix fp16 instances since it supports src_per_vec %2 = 0

* Add generic instances
2023-08-09 10:40:55 -05:00
Illia Silin
08eb176929 Allow building CK for specific data types and split off last remaining DL instances. (#830)
* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softmax_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation
2023-08-07 14:56:10 -07:00
Bartlomiej Kocot
aac65a031e Change to github_issue prefix 2023-08-03 16:38:28 +02:00
Bartlomiej Kocot
e6a826d35a Rename the workaround to a proper issue name 2023-08-03 16:38:28 +02:00
Po Yen Chen
f7cc8c3b03 Update tuning parameter & compilation options of DeviceGemmXdl<> instance (layout=TT) (#819)
* Enable pipeline v2 opt for layout=TT instance

* Use better thread mapping for reading A tile

* Conditionally enable pipeline v2 opt

* Allow enabling only fp16 gemm instances in profiler

* Fix formatting error

* Fix compilation error if we enable fp32 in profiler
2023-08-02 10:32:22 -05:00
carlushuang
e7dca79d27 initial stream-k implementation with example (#699)
* initial stream-k implementation with example

* fix unexpected change in err

* improve performance a little by reorganizing the pipeline.

* improve perf a little bit by swizzling block idx

* add profiler

* update example

* fix spelling

* shrink karg for streamk

* support dynamic buffer using memory coherence glc_slc bit from template

* control memory coherence while constructing dynamic buffer

* update reduction for streamk(not ready yet)

* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting

* fix build issue

* fix several bugs

* now result is correct, everything works (but has scratch)

* remove scratch by manually reset coordinate

* update device code

* fix a bug in final reduce

* fix something in example

* update async memset

* fix enum as camel case

* modify coherence enum name

* clean code and use atomic streamk by default

* remove unused var

* throw exception if a pointer is empty

* fix format

* fix CI warning

* fix type in init

* modify CI error

* filter out on gfx10+

* restore changed example code

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2023-07-26 14:18:15 -05:00
Illia Silin
9195435c77 Disable DL kernels by default. (#816) 2023-07-26 11:06:45 -05:00
Po Yen Chen
f4ea560112 Speed-up global memory reading for GEMM instances (#813)
* Use better ThreadClusterLengths to speed up

* Update B tile reading pattern for layout=NN instance
2023-07-25 18:54:47 -05:00
ltqin
50643dd555 Add bias scalar vectorload = 1 for gemm bias gemm (#791)
* first change bias load

* add bias dim and scalar vector parameter

* make CDE0BlockTransferSrcVectorDim not work

* changes to instance

* add limit for CDE0BlockTransferSrcScalarPerVector
2023-07-24 20:08:15 -05:00
Bartłomiej Kocot
10732847e7 Grouped conv bwd wei NDHWGC/NDHWGK (#804) 2023-07-21 12:00:55 -05:00
Bartłomiej Kocot
49180fd60b Grouped 3d conv backward data support (#799)
* Grouped 3d conv backward data support

* Fix comments
2023-07-18 11:01:33 -05:00
Illia Silin
189ea3b9aa Add mechanism to build CK for select data types, add Navi3x CI. (#790)
* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line
2023-07-17 18:02:42 -07:00
Bartłomiej Kocot
1ee99dcaa6 Support NHWGC conv2d_bwd_weight (#769)
* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-07-12 08:25:02 -05:00
Po Yen Chen
850144a0d3 Split GEMM instance library & enable pipeline v2 optimization (#783)
* Move source file into sub-directories

* Add missing include directive

* Split DeviceGemmXdl<> fp16 instances

* Fix format

* Remove unnecessary CMakeLists.txt

* Add macros to toggle new features

* Remove debug message

* Turn off GEMM v2 pipeline optimization by default

* Fix format

* Extract duplicated string as list

* Enlarge indent in CMakeLists.txt
2023-07-06 10:59:35 -05:00
Adam Osewski
f4dfc060b7 Move Device Ops implementations into impl directory. (#777)
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-07-06 16:15:51 +02:00
Bartlomiej Kocot
2b0b6d9f46 Fix copyrights for DeviceBatchedGemmMultipleD_Dl 2023-07-06 15:50:27 +02:00
Bartłomiej Kocot
63388e84ab Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757)
* Support bf16/f32/f16 and NHWGC conv2d_bwd_data

* Add interface test

* clang format

* Comment fixes

* Add more friendly error message
2023-06-21 08:20:31 -05:00
Rostyslav Geyyer
f0c620c42e FP8 enablement - add a pseudorandom number generator, add conversion methods (#708)
* Add basic fp8 definitions and prn-generator

* Format

* Add fp8<->fp32 type_convert

* Format

* Split type_convert and cast_to/from_f8

* Format

* Minor fix

* Minor fix

* Move fp8 utils to a separate header

* Add elementwise ops

* Add fp8_convert_sr

* Format

* Add element op

* Eliminate magic numbers

* Split f8_convert_sr in host and device

* Format

* Add some constexpr

* Add a datatype test

* Format

* Another format

* Add fp8<->fp16 tests

* Update type_converts

* Format

* Add fp16 casting functions

* Format

* Use seed as a runtime arg

* Use element location for PRNG

* Format

* Add fp8<->fp16 to PassThrough element op

* Clean up

* Merge host and device implementations

* Add comments on rounding modes

* Remove leftover code

* Put type_converts into a separate header

* Put random number gen to a separate header

* Rearrange f8_utils' namespaces

* Refactor type_convert.hpp

* Move f8_t definition
2023-06-19 11:20:35 -05:00
rocking
341ad95665 Maxpool bwd (#750)
* Add maxpool f32 kernel and example

* Revise copyright

* Add device pool bwd device op

* Support f16 and bf16

* Add compute datatype for reference code.
Prevent error in bf16

* Fix type error

* Remove layout

* Fix bf16 error

* Add f16 and bf16 example

* Add more operations

* Implement IsSupportedArgument

* Add changelog

* Add comment

* Add comment

* Remove useless header

* Move initialize of workspace to the run

* Move set din zero to the device operator

* Save din_length_raw

* Remove useless header

* Calculate gridsize according to the number of CU

* Calculate gridSize according to the number of CU.
Remove useless header

* Add put example

* Remove useless header

* Fix CI fail
2023-06-19 09:44:22 -05:00
Qianfeng
0d9118226b Padded Generic Kernel Instance (#730)
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting

* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization

* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax

* Add testing of GetGenericInstance() in client_example of Softmax

* Revert "Add testing of GetGenericInstance() in client_example of Softmax"

This reverts commit f629cd9a93.

* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"

This reverts commit a9f0d000eb.

* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm

* Move generic kernel instance to separate tuple for elementwise op of normalization

* Remove un-used files for softmax instance

* Store generic kernel instance to separate tuple for softmax

* Add IsSupported checking for generic instance to client example of softmax

* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization

* clang-format fix

* Remove int8 from softmax instances

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-06-16 23:43:11 -05:00
zjing14
309b1c6461 Fixed Weight layout of grouped_conv 3d fwd (#743)
* Changed wei layout

* changed layout for examples

* fixed client example

---------

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
2023-06-15 10:19:33 -05:00
Rostyslav Geyyer
54b68eb343 Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (#741)
* Add generic instance gemm_add_add_fastgelu

* Add a client example for generic gemm_add_add_fastgelu

* Update CMakeLists

* Format

* Format

* Add generic instance gemm_add_fastgelu

* Format

* Add a gemm_add_fastgelu client example

* Format

* Add generic instance gemm_fastgelu

* Format

* Fix argument order

* Add gemm_fastgelu client example

* Add exceptions if argument is not supported
2023-06-14 16:06:56 -05:00
Bartłomiej Kocot
fc9f97568f Add DeviceBatchedGemmMultipleD_Dl (#732)
* Add DeviceBatchedGemmMultipleD_Dl

* Fix batched_gemm tests

* Fix comments

* test_batched_gemm_multi_d fixes

* Fix args for isSupported batchedGemmMultipleDDl

* Disable tests for gfx90a
2023-06-12 08:37:15 -05:00
ltqin
0ede66de54 Fix flash attn mask bug (#733)
* add check input parameter

* add instance for vector load = 1

* move general instance to first pos

* fix read bias code

* regular code for bias load

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-06-12 08:35:31 -05:00
Illia Silin
b94fd0b227 update copyright headers (#726) 2023-05-31 18:46:57 -05:00
Adam Osewski
70e4eb567f Multiple fixes to GroupedGemm+SplitK (#707)
* Add license header.

* Reduce number of logged output. Add constant initialization.

* Add functional tests for grouped_gemm with different kbatch value.

* Add debug log information + remove unused code.

* Don't pass kbatch to CalculateKPadded.

* Turn on logging in grouped gemm and gemm splitk profiler

* Debug: limit number of test cases to run;

* Log more information and initialize with constant value.

* Turn on DEBUG_LOG

* Add more debug log information.

* Limit the number of instances to compile.

* Use GridwiseGemmPipeline

* Use KBatch to calculate K0

* Multiple DebugLog messages.

* Unit tests for multiple KBatch values.

* Refactoring

* Disable logging
* extract out of if statement KBatch update.

* Uncomment instances.

* Disable DebugLog.

* Use Kbatch when calculate KPadded.

* Fix CGridDesc padding.

* Use available helper functions.

* Uncomment code commented for debugging.

* Remove unnecessary debug log messages.

* Uncomment previously commented code for debug purposes.

* Add KBatch info to profiler output summary log.

* Add gtests for gemm splitk using ckProfiler API.

* Add more test-cases for different data layout.

* Add more test cases for gemm splitk

* Remove old test.

* Unit tests for MKNK ggemm interface.

* Fix and add more unit-tests.

* Constexpr everything!

* Increase error threshold for fp16 and splitk.

Since we're using fp16 atomic add for splitk there's a
known precision loss.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-05-30 07:09:06 -05:00
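
Note on the fp16 split-K precision loss mentioned in 70e4eb567f: the following is a minimal sketch, not CK code, of why accumulating split-K partial results directly in fp16 loses precision compared with accumulating in fp32 and rounding once at the end. The `accumulate` helper and the `partials` values are hypothetical and chosen only for illustration. On a GPU the order of atomic adds is not fixed; this sketch uses one fixed order.

```python
import numpy as np


def accumulate(partials, dtype):
    """Add the partial sums one by one, rounding to `dtype` after every add."""
    acc = dtype(0)
    for p in partials:
        acc = dtype(acc + dtype(p))
    return acc


# Hypothetical per-split partial results for a single output element (KBatch = 8).
partials = [2048.0] + [1.0] * 7
exact = sum(partials)

# Path 1: every add is rounded to fp16, as with an fp16 atomic add on the output.
fp16_accumulated = float(accumulate(partials, np.float16))

# Path 2: accumulate in fp32, then round to fp16 once at the end.
fp32_then_rounded = float(np.float16(accumulate(partials, np.float32)))

print("exact:", exact)
print("fp16 accumulation:", fp16_accumulated)
print("fp32 accumulation, rounded once:", fp32_then_rounded)
```

Near 2048 the spacing between adjacent fp16 values is 2, so each small contribution added in fp16 is rounded away, while the fp32 path keeps all of them until the final rounding. This is one plausible reason a looser error threshold is needed for fp16 split-K tests.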
Bartłomiej Kocot
c2d7a29dec Add instances for fp16/int8 Gemm kernels (Navi21) (#717)
* Add instances for fp16/int8 Gemm kernels (Navi21)

* Extend instances with smaller tiles

* Fix SrcVectorTensor for km_kn_mn int8
2023-05-30 07:07:17 -05:00
Illia Silin
ac9e01e2cc Clean-up the headers (#713)
* fix headers for gpu instances

* remove unused headers

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-05-24 08:11:25 -07:00
rocking
76ec0089fb Pool3d fwd (#697)
* Expand the base class of pool2d, prepare to share base class with pool3d

* Add pool3d device op

* Add pool3d f16 example

* Refactor the base class. implement generic pooling in the future

* clang format

* get original index in max pooling

* Add outputindex to base class

* Fix dimension

* Add pooling instance

* Use indexType instead

* Remove useless header

* Extract IndexDataType to template

* Extract pooling reference code

* clang format

* clang format

* Fix typo

* Add tensor stride

* Add missing header

* Add index stride and output stride

* Refine naming

* Add type to base class

* Rename file

* Use proper size

* Fix typo

* Refine naming

* Modify the argument into vector.

* Add max pool profiler

* Refine naming

* Support f32 pool

* Fix typo

* Add avg pool2d fwd in profiler

* clang format

* Rename AccDatatype to ComputeDatatype

* Fix init

* test pool

* Extract variable

* Add client example

* Check the pooling dim

* clang format

* Connect argv and arg_parser

* Add found check

* Remove useless header

* Refine naming

* Adjust the order of device_pool_fwd
2023-05-24 09:05:04 -05:00
Bartłomiej Kocot
642d5e9155 Add contraction profiler and tests (#701)
* Add contraction profiler and tests

* Build and style fixes

* Allow to use any elementwise operator for ref_contraction

* Introduce profile_contraction_scale and profile_contraction_bilinear

* Make ref_contraction generic and extend interface tests

* Stylistic minor fixes

* Extend test_contraction_interface
2023-05-15 09:46:52 -05:00
Rostyslav Geyyer
b076a02ad2 Optimize bf16 conversion (#664)
* Add TypeConvert class and start refactoring

* Refactor TypeConvert as a struct

* Get back to template functions type_convert

* Add a type_convert_bf16_rtn, set rtz as default

* Clean up

* Add UnaryConvertPrecision struct for high-precision workloads

* Format

* Update type_convert to UnaryConvert on threadwise level

* Update UnaryConvertPrecision

* Format

* Fix chmod

* Add a flag to pick conversion method

* Format

* Remove the added flag

* Merge elementwise op with type conversion

* Move type_convert to elemwise op, update the op

* Update type_convert_precision -> bf16_convert_rtn

* Clean up

* Update comments

* Update the CK_WORKAROUND_DENORM_FIX flag handling

* Update the unneeded op to work but warn user

* Remove the message

* Use a PassThrough instead of ConvertBF16RTN to calculate reference

* Format

* Add missing include
2023-05-04 10:25:47 -05:00
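
Note on the rounding modes named in b076a02ad2 (rtz as default, rtn as an option): the sketch below is an illustration in Python, not the CK implementation, of how the two fp32 to bf16 conversions differ. The function names are hypothetical, and NaN and infinity handling is deliberately omitted.

```python
import struct


def f32_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bf16_rtz(x):
    """Round toward zero: keep the upper 16 bits and drop the rest."""
    return f32_bits(x) >> 16


def bf16_rtn(x):
    """Round to nearest, ties to even (NaN/inf not handled in this sketch)."""
    bits = f32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit rounds ties toward an even result.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_to_f32(h):
    """Expand a 16-bit bf16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]


# 1 + 0.75 * 2**-7 lies between the adjacent bf16 values 1.0 and 1.0078125.
x = 1.0 + 0.75 * 2.0 ** -7
print(hex(bf16_rtz(x)), bf16_to_f32(bf16_rtz(x)))
print(hex(bf16_rtn(x)), bf16_to_f32(bf16_rtn(x)))
```

Truncation is cheaper because it is a single shift, while round-to-nearest costs an extra add but halves the worst-case error, which matches the commit's split between a fast default and a higher-precision option.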
Adam Osewski
8bb2bb4a05 Grouped Gemm + SplitK + simplified Kernel Args (#669)
* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* B2C with 3D grid for KSplit

* Remove unused code.

* Use default B2C (3D grid) in grid gemm v2r4r2.

* Device gemm splitk use B2C map.

* Device GroupedGemmXdlSplitKCShuffle

* Example for GroupedGemm Xdl SplitK

* Introduce Device GroupedGemmSplitK

* Fix updating kbatch size.

* Add instance mk-nk-mn

* Enable set kbatch in profiler.

* Add GGemmSplitK mk-kn-mn instances

* Add more instances & split into multiple files.

* minor fix

* tuning

* clean

* disabled failed instances

* use pipe v2

* Ignore arg on not supported arch.

* fix warning

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
2023-04-24 15:43:36 -05:00
rocking
3eecbfb6ec Revise layout of group convolution (#675)
* [What] Remove pure conv int8 instance
[Why] We will never use pure int8 conv in AI, use int8 quantization instead

* Change layout

* Share the kernel parameter

* Support more type of NHWGC for group conv

* Revise client example of conv 2d, use NHWGC layout

* Add instance to cmake

* Revise layout of group conv quantization instance

* Revise layout of external api of group conv quantization

* Revise layout of group conv quantization client example

* Fix clang format

* Add comment to describe meaning of each parameter
2023-04-23 23:40:00 -05:00
Illia Silin
903cd19ce3 Put back the split-k gemm code. (#684)
* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2023-04-21 19:37:00 -05:00
rocking5566
fd11a4a12a Add (#677) 2023-04-17 10:12:10 -05:00
rocking5566
ed3a2e5226 Groupnorm + swish external api (#668)
* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Refactor instance generation, split into multiple cpp files

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp
2023-04-10 08:02:17 -05:00
Jun Liu
3248387bbb Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
This reverts commit bb5530af91.
2023-04-06 17:14:11 -07:00
zjing14
fde6d2742b add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
2023-03-30 13:30:43 -05:00
carlushuang
bb5530af91 simplify karg in device/grid of split-k op (#644)
* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* use name from tensor layout
2023-03-29 19:03:07 -05:00
rocking5566
389e84a83b Conv + quantization + tanh (#645)
* Rename file. Prepare to support another activation

* Add comment for quantization

* Extract out_elementop

* Add tanh example

* Add conv + bias + tanh quantization instance

* Add missing parameter

* Refine cmake

* Add external api and client example

* Extract variable in example

* Fix the comment

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-03-29 14:50:23 -05:00
ltqin
6ae12434d2 workaround 637 (#640)
* add workaround 637

* format

* change id

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-03-20 11:49:31 -05:00
rocking5566
16dc18e0f9 gemm/Conv xdlops + dlops quantization (#625)
* Add conv perlayer quantization

* Add gemm_dlops quantization

* Support int8 for innerproduct

* Refine gemm dlops int8 kernel parameter

* Support gfx908(MI100) and gfx90a(MI200)

* clang-format

* Rename example number

* Support different layout for d tensor

* Add conv dlops perchannel quantization example

* Move to example 40

* Extract the common code for different platform (dlops and xdlops)

* Move to subfolder. Prepare to add other op of quantization

* Refine the quantization instance library

* Add conv dl instances and client example

* Remove unnecessary type

* Add gemm quantization instance

* Add external api and client example

* Refine num_bytes

* Separate different layout to different cpp

* Add more xdl instances

* Revert "Remove unnecessary type"

This reverts commit 820869182f.

* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-03-15 15:29:40 -05:00
Adam Osewski
9096b1c7b2 GroupedGEMM + Gelu client example/instances/profiler (#614)
* Grouped gemm + Gelu instances.

* Device Instance Factory for GroupedGemm+Gelu

* Client example

* Rangify fill helper functions.

* Fix name clash.

* Profiler for grouped_gemm+gelu

* No need to use full namespace name.

* Add check for MRaw divisible by vector load.

* Ugly fix for big errors.

* Add grouped_gemm+gelu to profiler CMakelists.

* Store in argument additional info.

* Information about Mraw, Nraw, Kraw values.

* Use FastGelu instead of Gelu.

* Change client ex to use FastGelu

* Remove relaxed error precision.

* Remove duplicate output elementwise-op

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-03-07 22:06:56 -06:00
pmaybank
e4bf6d422e Generate output using Doxygen / Breathe (#598)
* Modify Doxygen config to pick up include directories recursively

* Add DeviceMem struct to API Reference guide

* Add classes that are used in Flash Attention kernel

* Add a reference and config for generating bibliography

Co-authored-by: Philip Maybank <Philip.Maybank@amd.com>
2023-03-06 11:39:16 -06:00
rocking5566
6a6163a3d1 Improve normalization (#580)
* Sync the order of type string with template parameter

* Add more instances

* Check the vector size and remove redundant var

* Extract var to static, prepare to separate sweep once kernel

* Separate sweeponce flow and optimize the flow

* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization

* Remove useless code

* Update naive variance kernel

* Refine string

* Fix typo

* Support naive variance for device_normalization

* Check the blocksize

* Share the VGPR of x and y

* Share the VGPR of gamma and beta

* Add more instances

* Support fp16 sqrt for experiment

* Add CHANGELOG

* Fix typo

* clang-format
2023-02-15 11:59:35 -06:00
Adam Osewski
e9fd122889 Conv3D FWD BWD WRW fp16 fp32 client examples (#559)
* Conv3d bwd weight client example.

* Update year in license

* Convolution bwd data 3D fp16/fp32 client example.

* Client example for convnd fwd fp16 fp32

* clang-format

* Review remarks.

* Fix compiler err.

* Update data layout to standard one.

* Add conv 3d fwd NDHWGC instances

* clang-format

* Conv3d fwd NDHWGC instances.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 11:16:47 -06:00
Adam Osewski
8f42780fd6 GroupedGEMM more bigger tiles. (#577)
* Adding more bigger tiles.

* Remove failing instance.

* Remove instances that don't improve perf.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-13 10:06:24 -06:00
rocking5566
f7d28f3e4b Gemm+layernorm instance, ckProfiler, client example (#568)
* Add gemm + layernorm instance

* Add ckProfiler

* Add test

* Add client example

* Detect if user forgot to set the workspace

* Use literal in the example

* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()

* check gemm validity in IsSupportedArgument

* Add more testcases

* Merge duplicated folder in client example

* Print more information

* Use better kernel parameter for MS problem size

* clang format

* Add constexpr for if condition and remove redundant include

* Remove cstdlib and add constexpr
2023-02-09 15:02:55 -06:00
guangzlu
76d144fa7c Add instance for elementwise normalization (#573)
* added instances for large N

* add instance for elementwise normalization

* added supported restrict in device_elementwise_normalization_impl.hpp
2023-02-09 09:37:29 -08:00
ltqin
332ccc3367 Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576)
* add instance for gemm bias softmax gemm

* add client example

* change CGridDesc_G_M_N to CGridDesc_G_M_O

* add gridwise

* change c grid name

* device add d0s data

* fix 08 client_example

* add example 47_fused_attention

* example output correct

* add d0 to example

* add d0 element op

* rechange instance code

* change Acc0ElementwiseOperation to C0DEElementwiseOperation

* change example name

* update instance for cdeelementwiseop

* add bhalf_t ScaleAdd

* add test

* not support gemm1 bias

* remove some ignore

* fix test bug
2023-02-08 14:34:45 -06:00