Commit Graph

173 Commits

Author SHA1 Message Date
Harisankar Sadasivan
75e622f02f Universal streamk with atomics (#1360)
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). 

* Update README.md

* fixing clang-format issues

* removed conflicts in struct members between streamk and universal streamk

* corrected arg parsing for streamk and universal streamk

* added stream-k policies for 3 tile and 4 tile

* fixed argument type issue with parsing cmd args

* changes suggested in PR review are made- removing comments and correcting copyright

* file permissions updated

* added default value support for grid_size and streamk-policy selection set to -1

* print messages for arguments

* print messages for arguments

* print messages for arguments1
2024-07-05 21:40:30 -07:00
jakpiase
e2d139201b Switch to universal gemm in grouped gemm tile loop (#1335)
* switch to universal gemm in grouped gemm tile loop

* minor fixes

* add reviewers comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-06-18 09:01:49 -05:00
Illia Silin
1274861a9d replace the ENV macro with CK_ENV (#1296) 2024-05-17 10:42:51 -07:00
Bartłomiej Kocot
0b6b5d1785 Add two stage grouped conv bwd weight kernel (#1280) 2024-05-08 09:53:24 +02:00
Illia Silin
bf42097646 Enable logging in CK with environment variable. (#1278)
* enable logging using environment variable

* update ck.hpp header

* fix typo

* fix clang format

* Update include/ck/utility/env.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-05-07 16:26:43 -07:00
Adam Osewski
b4032629e5 Grouped GEMM Multiple D tile loop. (#1247)
* Overload output stream operator for LoopScheduler and PiplineVersion

* Add Run overload accepting grid descriptors MK.

* Add __device__ keyword for CalculateGridSize

* Create device op GroupedGemmMultipleD

* Add GroupedGemm MultipleD Tile Loop implementation.

* Add an example for GroupedGemm MultipleD tile loop.

* Device Op GroupedGEMMTileLoop.

* Bunch of small changes in exmaple.

* CkProfiler

* Remove unused tparam.

* Fix include statement.

* Fix output stream overloads.

* Do not make descriptors and check validity untill we find group.

* Fix gemm desc initialization.

* Revert device op

* Fix compilation for DTYPES=FP16

* Validate tensor transfers paramters.

* Validate on host only NK dims if M is not known.

* Fix bug.

* A convenient debug func for selecting threads.

* Fix has main k block loop bug.

* Make sure that b2c has up to date tile offset.

* Output stream operator for Sequence type.

* Cmake file formatting.
2024-04-25 15:12:53 -05:00
ltqin
f448d179b7 Universal gemm flush cache (#1251)
* add flush cache to device op

* add flush cache parameter to ckProfiler

* change calculate size a and b method

* chang evaluation time method foro AVERAGE to MEDIAN

* format code

* adjust some code

* fix core dumped

* remove loop call flush icache in kernel

* remove loop(outer) call flush icache

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
2024-04-25 15:07:14 -05:00
Bartłomiej Kocot
ad1597c499 Refactor elementwise kernels (#1222)
* Refactor elementwise kernels

* Instances fixes

* Fix cmake

* Fix max pool bwd test

* Update two stage gemm split k

* Restore elementwise scale for hiptensor backward compatiblity

* Fix Acc data type check in conv fwd multiple abd

* Disable conv fp64 fwd example

* Update grouped conv weight multi d
2024-04-19 13:31:17 +02:00
Bartłomiej Kocot
fd923b6d86 Add grouped conv bwd weight multi d kernel (#1237)
* Add grouped conv bwd weight multi d kernel

* Reference fix

* Fix cmake files

* bwd weight scale only xdl

* Fixes

* Fix client conv fwd example
2024-04-18 23:35:04 +02:00
Haocong WANG
f83e9701e9 [GEMM] Gemm universal device operation (#1154)
* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk intances

* clang format + typo fix

* Fix a bug

* initial commit

* Add more instances to irregular splitk

* blkgemm pipeline v1~4 prototype

* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3

* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset

* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3, Add first gemm_universal instance into ckProfiler

* Add NN instances for gemm universal

* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal

* Fix a bug regarding padding Odd K number

* remove kernel print

* Fix KPadding bug...

* Update safety check

* another try to fix kpadding..

* Sanity checked

* new instances..

* clang format+typo fix

* remove clang format script's change

* Add non-hotloop compile option

* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150

* Some miscs.. opt and fix

* Add pipeline description docs

* Split universal gemm instance library to cut profiler compiling time

* uncomment cmakefile

* Fix a bug caused by blockwise_gemm_pipe_v2

* reduce default splitk to 1

* Add 224x256x64 tile size

* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library

* temp save

* temp save

* Permuted lds layout, sanity and function checked

* clang format

* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout

* clangformat

* A/B splitpipe scheduler for v3

* Fix two bugs

* bug fix

* fix a bug in oob check

* Example for mixed fp16_fp8 gemm

* Clean experimental code blocks

* Add mixed precision gemm into profiler

* tempsave

* optimize m/n major lds layout

* Add RRR GEMM  mixed precision instances

* Optimize f8 matrix transpose

* Add test_gemm_universal

* A/B spilt schedule for blkpip v5

* Take ds_read2 into iglp scheduling scheme

* format

* fixed cmake

* Add llvm-option into CI cmake flag

---------

Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-04-13 21:03:18 -05:00
Bartłomiej Kocot
ced5af16f7 Extend support for contraction 6D (#1207)
* Extend support for contraction up to 5D

* Extend contraction bilinear instances

* Fix interface test

* Add 6d support, remove 3d,4d,5d

* Fixes

* Fix readme

* Make defualt dim for contraction instances
2024-04-09 23:46:21 +02:00
jakpiase
c701071666 Add Grouped Gemm Multiple D SplitK TwoStage (#1212)
* Support A/B/C elementwise ops.

* First part of GGEMM multiD splitk two stage.

* WIP - changes for debuggin.

* tmp save

* working version

* added bf16@int8 version

* fixes

* add reviewers sugestions

* pre-commited missing files

* switched to ifs from elseifs

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-04-04 11:01:33 +02:00
Rostyslav Geyyer
a61e73bc56 Add instances for conv_scale with fp8@bf8->fp8 (#1220)
* Update device op api to support BComputeType

* Add example

* Add instances

* Add profiler mode

* Add client example

* Update copyright year

* Add BComputeType check

* Fix compute types
2024-04-03 09:08:08 -05:00
Bartłomiej Kocot
9a194837af Introduce combined elementwise ops (#1217)
* Introduce combined elementwise ops

* Introduce refrence elementwise
2024-04-02 17:23:49 -05:00
Bartłomiej Kocot
9c052804a7 Add elementwise with dynamic vector dim (#1198)
* Add elementwise with dynamic vector dim

* Reduce number of instaces

* Fixes

* Fixes
2024-03-22 10:40:43 +01:00
Illia Silin
d0c7b45150 Clip fp8 to +/-240 on all targets. (#1172)
* clip fp8 to +/-240 on all targets

* if inputs to fp8 conversion are +/-inf, they remain unaltered

* increase tolerance for test_elementwise_layernorm to prevent false errors

* change the input values for gemm examples to floats

* reduce gemm example float input values to prevent errors

* increase the tolerance for gemm examples
2024-02-27 14:31:05 -06:00
jakpiase
32d4be3d09 Add support for mixed precision bf16&int8 grouped gemm (#1166)
* add support for mixed precision bf16&int8 grouped gemm

* fix gfx versions and add bf16 kbatch condition

* added reviewers comments
2024-02-21 10:35:35 +01:00
Bartłomiej Kocot
66736edb95 Extend permute scale support up to 6D (#1168)
* Extend permute scale support up to 6D

* Fixes

* Fixes

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

* Update profiler/README.md

Co-authored-by: Lisa <lisajdelaney@gmail.com>

---------

Co-authored-by: Lisa <lisajdelaney@gmail.com>
2024-02-20 09:56:54 -08:00
jakpiase
ba86eadce5 Add support for mixed-precision f16bf16_int8 gemm (#1127) 2024-02-07 15:54:13 +01:00
rocking
28f68a5a99 layernorm & groupnorm bwd gamma beta (#1133)
* Add layernorm bwd gamma beta external api

* Add groupnorm external api

* Add layernorm bwd gamma beta profiler

* Add groupnorm bwd gamma beta ckProfiler

* Add layernorm & groupnorm bwd gamma beta test

* Fix groupnorm bwd gamma beta profiler bug

* Layernorm bwd weight client example

* Groupnorm bwd weight client example

* clang format

* Remove useless header

* Let inv_std be positive

* Rename to num_bytes and move this calculation outside the loop
2024-01-25 19:53:15 +08:00
zjing14
1be4706366 fixed return (#1138) 2024-01-22 08:42:26 -08:00
Haocong WANG
bb63b9732c [GEMM] Optimization for MI200/300. (#1135)
* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk intances

* clang format + typo fix

* Fix a bug
2024-01-19 07:02:22 -06:00
Illia Silin
886d9eeb99 Add an option to change the number of warm-up cycles and iterations. (#1124)
* allow setting the number of warmup cycles and iterations for profiler

* fix the gemm_splitk and grouped_gemm examples
2024-01-09 09:43:08 -08:00
arai713
aa3e2d7967 Transpose profiler fix (#1114)
* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unneccesary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

* disabled device op for MI300

* skip  elementwise_permute_2d on gfx94x

* Update CMakeLists.txt

* fixing CMake - disabling some GPU targets

* added transpose profiler to CMake

* fixed transpose profiler errors

* fixed instances for tests/profiler

* cleaned up code in transpose profiler source code

* added some comments, updated copyright

* made function arguments const where possible

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2024-01-04 10:33:19 -06:00
rocking
a69aa2a11a layernorm and groupnorm backward data (#1083)
* rename folder

* Add type string

* Remove typo

* Add deviceOp to backward x

* Add comment to describe the behavior of backward normalization

* Add kernel function, prepare to implement

* implement generic kernel

* Check vector size

* Add sweep once pipeline for small reduce size

* Fix bug of KRaw_ error

* Fix bug of dx stride

* sanity check for mean and rstd

* backward x for groupnorm

* Add bwd x instance

* add layernorm 2d bwd gamma beta instances

* Change save mean var type from f32 to f16 in f16 mode

* Change the example to f16

* Add groupnorm bwd gamma beta instance

* Add groupnorm bwd x instance

* Fix naming

* Add layernorm bwd x ckprofiler

* Add groupnorm bwd x profiler

* clang format

* Rename bwd x to bwd data

* Fix bug of verification in profiler

* Add test of layernorm and groupnorm bwd data

* Add missing cmake

* Add layernorm2d bwd data

* rename fwd example

* Add groupnorm client example

* Fix typo. replace Invarient with Invariant

* Add checking before running the best instance
2023-12-19 04:23:11 +08:00
zjing14
ae5e5181aa recover default niter (#1064) 2023-11-28 12:18:42 -08:00
zjing14
e8cddfdc3b Improve 4k gemm perf (#1047)
* improve 4k gemm perf

* add f8 instances

* format

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-11-17 07:06:24 -06:00
Bartłomiej Kocot
f2398f612d Introduce multiABD api and deprecate multiD (#1035)
* Introduce multiABD api and deprecate multiD

* Replace multiD with multiABD

* Mark structures as deprecated

* Change doxygen deprecated to note to avoid warnings
2023-11-14 17:00:40 +01:00
zjing14
600fc000ed add more instances for bfp16 gemm (#1036)
* add more instances for bfp16

* reduce the gemm input values to prevent round-off errors

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2023-11-11 07:09:32 -08:00
arai713
3af8c81a72 Transpose 3d (#984)
* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unneccesary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-11-08 19:45:07 -06:00
rocking
a3d9a2cd42 Layernorm4d (#1022)
* Rename folder

* Add layernorm 4d fwd example

* Rename original layernorm example

* Add layernorm 4d f16  test

* Add layernorm4d_fwd client example

* Support layernorm4D in ckProfiler

* Rename groupnorm to groupnorm fwd in example

* Rename layernorm and group fwd in test

* Rename normalization to normalization_fwd (instances)

* Add fwd to DeviceNormalization

* Rename external api header

* Rename folder, because we can also add bwd in this folder

* Add fwd in layernorm and groupnorm (profiler

* Fix compile error

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2023-11-09 08:34:51 +08:00
zjing14
98fd41f597 Add Gemm instances for performance improvement (#1018)
* improve kpad

* more tuning parameters

* f16_f8_fp16

* cut test time

* add f16_f8_fp16

* add f16_f8_f16

* testing instances for skinny cases

* format

* clean

* add fp16_f8_fp16

* clang-format

* add grouped gemm instalces

* fixed profile grouped_gemm

* clean

* clean

* clean

* clean

* clean

* add missing instance func

* fixed inferface

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>
2023-11-07 09:09:58 -06:00
Bartlomiej Wroblewski
4ef704d8a6 Add support for mixed precision in contraction scale and bilinear (#973)
* Add support for mixed precision in contraction scale and bilinear (#936)

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

* Make the ComputeDataType an optional argument in instance

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2023-11-02 14:26:33 -07:00
Bartłomiej Kocot
2e824c6d46 Add support for groups in Img2Col/Col2Img (#1007)
* Add support for groups in Img2Col/Col2Img

* Fix interface test

* Fix interface test G to N

* Improve performance

* Change gemm layout to 3d

* Fixes
2023-10-31 10:46:32 +01:00
Illia Silin
f46a6ffad8 Fix the fp8 gemm for large tensors on MI300. (#1011)
* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
2023-10-27 21:10:47 -07:00
rocking
3696fe1c76 Layernorm and groupnorm support to save mean and inverse std in forward (#929)
* save mean and inverse std in normalization

* Save mean and inverse std in splitK

* Vector save mean and inv std

* Modify instance for save mean and std

* simplify the layernorm example

* Save mean and std in groupnorm example

* Save mean and inv std in ckProfiler and test

* Remove compute data type from base class

* Save mean and inv std in client example

* Add changelog

* clang format

* Fix compile error

* Refine naming

* Avoid error in bf16

* revert changelog
2023-10-19 07:36:29 +08:00
Rostyslav Geyyer
fa753f27ba Add splitk gemm fp16 @ fp16 with fp8 compute instances (#983)
* Add ComputeType

* Update for compatibility

* Add instances

* Update profiler api
2023-10-13 16:27:11 -05:00
Illia Silin
4daedf8ca5 Revert "Add support for mixed precision in contraction scale and bilinear" (#967)
* Revert "Add support for mixed precision in contraction scale and bilinear (#936)"

This reverts commit f07485060e.

* revert commits #957 and #960
2023-10-05 14:58:23 -07:00
Rostyslav Geyyer
42facfc6b7 Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945)
* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Uppdate naming to f8->fp8

* Update naming

* Format

* Update naming (#937)

* Add a client example

* Add computetypes to device and gridwise ops

* Add instances, update instance factory

* Format

* Fix a flag

* Add ckProfiler mode

* Fix typos

* Add an example

* Add bf8 generator

* add bf8 mfma; fixed type_convert for bf8

* move verfication ahead of timing

* Update reference calculation

* Fix reference

* Narrow down float init range

* Fix bf8 bf8 mfma

* Add bf8 @ fp8 mfma

* Update example

* Update instances

* Update profiler api

* Update for compatibility

* Format

* Remove extra example

* Clean up

* workaround convert

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-10-04 08:19:08 -05:00
Bartlomiej Wroblewski
f07485060e Add support for mixed precision in contraction scale and bilinear (#936)
* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code
2023-09-29 10:54:31 -05:00
Bartłomiej Kocot
e2243a4d1e Add column to image kernel (#930)
* Add column to image kernel

* Minor fixes for dtypes and client examples

* Disable tests for disabled dtypes

* Disable add instances functions for disabled data types

* Minor stylistic fixes

* Revert "Disable add instances functions for disabled data types"

This reverts commit 728b869563.

* Instances reduction

* Add comments in device_column_to_image_impl

* Update changelog and Copyrights

* Improve changelog
2023-09-27 17:19:06 +02:00
Rostyslav Geyyer
94bfa50256 Add fp8 gemm instances (#920)
* Add fp8 gemm instances

* Update instance naming
2023-09-26 14:59:33 -05:00
zjing14
a66d14edf2 fixed fp8 issues (#894)
* fixed fp8 init; and reference gemm

* Update host_tensor_generator.hpp

* fixed convert

* fixed reference gemm

* fixed comments

* fixed comments

* fixed ci

* fixed computeType

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-09-12 22:17:56 -05:00
Rostyslav Geyyer
62d4af7449 Refactor f8_t, add bf8_t (#792)
* Refactor f8_t to add bf8_t

* Add check_err impl for f8_t

* Update fp8 test

* Format

* Revert the fix

* Update vector_type implementation

* Add bf8 test

* Add bf8, use BitInt types

* Add bf8 conversion methods

* Update type_convert for fp8/bf8

* Add check_err fp8/bf8 support

* Add subnorm fp8 tests

* Add subnorm bf8 tests

* Fix conversion

* Add bf8 cmake bindings

* Add macros to enable build with disabled fp8/bf8

* Remove is_native method

* Update flag combination for mixed precision instances

* Add more flag checks

* Add another flag to a client example

* Add type traits, decouple f8/bf8 casting

* Clean up

* Decouple fp8 and bf8 flags

* Remove more redundant flags

* Remove leftover comments
2023-09-12 17:04:27 -05:00
Haocong WANG
562b4cec48 [Navi3x] Add fp16/int8 wmma conv forward instances (#746)
* fix wmma gemm int8; add grouped conv int8 example

* Add int8 gemm-bilinear instances

* compile sanity check unknown

* Sanity pass + clang-format

* add int8 conv profiler instances

* solve merge conflict

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2023-09-07 21:59:26 -05:00
Bartłomiej Kocot
0077eeb3be Add image to column kernel (#867)
* Add image to column kernel

* Add instances, tests, profiler, example

* Add client example

* Several fixes of image to column

* Fix variable name in device_image_to_column_impl

* Several fixes of image to column profiler

* Fix num_btype calculation

* Make new mesaurements for correct bytes calculation
2023-09-05 10:11:40 -05:00
rocking
866377de18 MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861)
* Add maxpool instances

* Rename index pool to max pool.

* Add maxpool bwd bf16 instances

* Add avg pool bwd instances

* Rename avgpool and maxpool to avg_pool3d and max_pool

* Add bf16 pool fwd instances

* Add max pool bwd to ckProfiler

* Add avg pool3d bwd to ckProfiler

* Add avg pool bwd test

* Fix bug of reference pool fwd (dilation)

* Fix bug of max pool bwd  (dilation and initZero)

* Support bf16 compute data type

* Force compute type be f32. Because atomicAdd only support f32

* Add max pool bwd test

* Rename folder

* Rename pool

* Add max pool bwd client example

* Add avg pool bwd client example

* Add missing workspace

* clang format

* Rename macro

* remove useless header

* remove useless layout
2023-08-31 21:01:50 +08:00
zjing14
31ea132aa2 Fp16/fp8 mixed-precision Gemm with multiply+add fusion (#865)
* add compute_type

* add multiply_add ckProfiler

* add f8_fp16 support

* clean

* clean

* fixed lds size calc

* format

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-08-28 16:27:32 -05:00
zjing14
8ebea3a56e add generic instances (#858)
Co-authored-by: Jing Zhang <jizha@amd.com>
2023-08-23 09:18:10 -05:00
zjing14
ca3115e7e8 Ck profiler splitk (#857)
* updated regular gemm

* update ckProfiler

* fixed gtests

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-08-22 16:54:34 -07:00