Commit Graph

264 Commits

Author SHA1 Message Date
Bartlomiej Wroblewski
d270b881fc Fix cluster length arrange order in fp16 GEMM example (#1055)
[ROCm/composable_kernel commit: bfecc19352]
2023-11-27 11:31:14 +01:00
Bartlomiej Wroblewski
4d9c41c7f5 Add basic support for direct loads from global to LDS (#999)
* Add basic support for direct loads from global to LDS

* Clean the code and comments

* Add support for fp16

* Add comments

* Add check for thread cluster lengths

* Align non-direct-load fp16 example

* Small fixes

* Extend IsSupported to check for supported GPU gens

* Build examples only on the supported HW

* Do not throw when instance not supported in 04 example

* Review: Apply review suggestions

* Review: small fix

* Review: small fix

[ROCm/composable_kernel commit: 627054b941]
2023-11-25 13:35:22 +01:00
zjing14
e25c18aeb7 Improve 4k gemm perf (#1047)
* improve 4k gemm perf

* add f8 instances

* format

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: e8cddfdc3b]
2023-11-17 07:06:24 -06:00
Bartłomiej Kocot
c80803505f Introduce multiABD api and deprecate multiD (#1035)
* Introduce multiABD api and deprecate multiD

* Replace multiD with multiABD

* Mark structures as deprecated

* Change doxygen deprecated to note to avoid warnings

[ROCm/composable_kernel commit: f2398f612d]
2023-11-14 17:00:40 +01:00
arai713
64ab0788f6 Hip tensor permute (#1002)
* adding files for F32 example

* adding functioning implementation with scalar multiplication and unary operator support

* added fp16 type check in unary square

* updating scalar multiplication as an operator

* functioning version with scalar operator

* changing strides for col major

* updated column major implementation

* working column major implementation

* cleaned up comments, rearranged/renamed files

[ROCm/composable_kernel commit: 454cf7bd1f]
2023-11-13 11:15:48 -06:00
Bartłomiej Kocot
05d18a052b Support multi AB for grouped conv fwd xdl (#1027)
* Support multi AB for grouped conv fwd xdl

* Add instances

* Add client example

* Add example

* Add interface test

* Minor fixes

* Comment fixes

* Fixes

* Reference fix

* Test xdl fixes

* Improve multi_ab interface test

[ROCm/composable_kernel commit: 49e52bb357]
2023-11-10 15:54:44 +01:00
rocking
283920969f Backward of gamma and beta for layernorm and groupnorm (#1013)
* Add layernorm backward reference code

* Add groupnorm backward reference code

* Add example

* clang format

* Fix bug of reference layernorm and groupnorm

* Fix naming

* Refine naming

* Add device op for normalization bwd gamma and beta

* Refine template parameter

* Add bwd gamma & beta of kernel

* 1. Add groupnorm example
2. Refine layernorm naming

* Narrow down the static check for performance

* Refine variable name

[ROCm/composable_kernel commit: 1db7560365]
2023-11-10 18:02:03 +08:00
arai713
cba606adf6 Transpose 3d (#984)
* added working example for 5D input using 1D kernel

* example with 5D input tensor and 2d kernel - not working: issues with arguments

* added updated version of 3d device op - changed descriptors/dims

* added example file to check kernel

* fixed descriptor and isSupportedArgument stride problem

* added and modified kernel for 3d - updated tids/loop

* adding some more 5d example files

* fixed some issues

* changes made for testing

* working version: fixed error in stride for A, still a bit inefficient

* cleaned up formatting/comments

* updating formatting

* more formatting fixes

* fixing cmake, adding back gpu targets in cmake script

* adding client example

* added instances for client example

* fixed errors in client example

* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp

* removed extra files

* minor formatting and naming fixes

* adding test files and profiler

* fixing minor error

* minor fix

* removed unnecessary comments, renamed files

* updated instance list for client example, added different layout example

* removing instances

* fixed error in instance generation

* remove comments

* update profiler and client example tensor layouts

* fixed errors in test/profiler

* updated vector dim access to enable vector load

* updated test/profiler files

* updated example with 1d kernel

* updating profiler

* renamed files

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 3af8c81a72]
2023-11-08 19:45:07 -06:00
rocking
867bc90509 Layernorm4d (#1022)
* Rename folder

* Add layernorm 4d fwd example

* Rename original layernorm example

* Add layernorm 4d f16 test

* Add layernorm4d_fwd client example

* Support layernorm4D in ckProfiler

* Rename groupnorm to groupnorm fwd in example

* Rename layernorm and group fwd in test

* Rename normalization to normalization_fwd (instances)

* Add fwd to DeviceNormalization

* Rename external api header

* Rename folder, because we can also add bwd in this folder

* Add fwd in layernorm and groupnorm (profiler)

* Fix compile error

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: a3d9a2cd42]
2023-11-09 08:34:51 +08:00
Bartlomiej Wroblewski
a85ef2a95e Add missing ComputeDatatype in contraction_multi_ABD_xdl_fp16 (#1024)
[ROCm/composable_kernel commit: 16eb824c90]
2023-11-03 08:22:11 -07:00
Bartlomiej Wroblewski
382a513acb Add support for mixed precision in contraction scale and bilinear (#973)
* Add support for mixed precision in contraction scale and bilinear (#936)

* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

* Make the ComputeDataType an optional argument in instance

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 4ef704d8a6]
2023-11-02 14:26:33 -07:00
Bartłomiej Kocot
c63701ff60 Add ScaleAddScaleAddRelu post op for conv fwd (#1006)
* Add ScaleAddScaleAddRelu post op for conv fwd

* Fixes

* Fix instance file name

* Minor fix

[ROCm/composable_kernel commit: f27ea94ecb]
2023-11-01 18:31:30 -05:00
Bartłomiej Kocot
60a0f176d3 Add support for groups in Img2Col/Col2Img (#1007)
* Add support for groups in Img2Col/Col2Img

* Fix interface test

* Fix interface test G to N

* Improve performance

* Change gemm layout to 3d

* Fixes

[ROCm/composable_kernel commit: 2e824c6d46]
2023-10-31 10:46:32 +01:00
Illia Silin
60b282fb62 Fix the fp8 gemm for large tensors on MI300. (#1011)
* Fix the fp8 conversion

* Try clipping value before conversion

* Fix return

* Simplify with a const

* reduce the gemm input tensor values to reduce round-off error

* replace if-else with lambda

* fix syntax

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>

[ROCm/composable_kernel commit: f46a6ffad8]
2023-10-27 21:10:47 -07:00
Bartłomiej Kocot
42608c87ea Fix cmake dtype check (#989)
* Fix instances dtype check

* Fix source dtypes selector for examples and tests

* Sync with new cmakefile changes

* Remove not needed ifdefs

* Remove not needed ifdefs

[ROCm/composable_kernel commit: ac0e006766]
2023-10-21 22:19:43 +02:00
Rostyslav Geyyer
07d78c032b Fix bf8 conversion issues (#1003)
* Fix the conversion

* Add bf8 functionality

* Enable example on MI200 as well

[ROCm/composable_kernel commit: 1fd27d520f]
2023-10-20 08:00:45 -05:00
Bartłomiej Kocot
2d230d0f5c Extend available elementwise operations with conv examples (#995)
* Extend available elementwise operations with conv examples

* Fixes

* Remove not needed convert

* Update CMakeFile and dir name

[ROCm/composable_kernel commit: 82f3a835d5]
2023-10-19 17:23:19 +02:00
Bartlomiej Wroblewski
d2049bc4e7 Change 1d,2d,... to 1D,2D,... (#997)
[ROCm/composable_kernel commit: 0abc0f87db]
2023-10-19 16:53:18 +02:00
rocking
815ed3a1f9 Layernorm and groupnorm support to save mean and inverse std in forward (#929)
* save mean and inverse std in normalization

* Save mean and inverse std in splitK

* Vector save mean and inv std

* Modify instance for save mean and std

* simplify the layernorm example

* Save mean and std in groupnorm example

* Save mean and inv std in ckProfiler and test

* Remove compute data type from base class

* Save mean and inv std in client example

* Add changelog

* clang format

* Fix compile error

* Refine naming

* Avoid error in bf16

* revert changelog

[ROCm/composable_kernel commit: 3696fe1c76]
2023-10-19 07:36:29 +08:00
zjing14
dc94c20258 Clean DTYPES conditions in CMake (#974)
* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed CI

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: bf435140dc]
2023-10-18 11:14:14 -05:00
zjing14
522bbcb766 Add contraction_multi_abd (#972)
* add gridwise_multi_abd

* move element_op into RunRead

* merge element_wise op with data read

* add multiABD example

* allow packed elementwise_op

* changed example

* clean

* clean

* add is_detected

* fix

* minor fix

* add scaleAdd_vec4 example

* init commit for contraction_multi_ABD

* add examples

* add examples of multiA and broadcast

* update example

* fixed comments

* Update cmake-ck-dev.sh

* Update cmake-ck-dev.sh

* Add comments into the example

* Update CMakeLists.txt

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 1cc36ba5fb]
2023-10-17 20:17:58 -05:00
Bartłomiej Kocot
49f179f755 Add grouped conv bwd weight wmma (#985)
* Add grouped conv bwd weight wmma

* Update README, changelog, profiler

* Minor fixes

* Fix grouped conv bwd wei dl kernel

* Minor fixes

* Minor stylistic fixes

[ROCm/composable_kernel commit: 16d7c4d2f7]
2023-10-17 10:32:26 +02:00
zjing14
1271deb162 add vector_type support into thread_copy_v3r1 (#969)
* add vector_type support into thread_copy_v3r1

* remove unnecessary type_convert

* fixed datatype

* fixed dataType

* changed API with is_packx_invocable

* changed example

* add missing cmake file

* fixed ci

* fixed cmake

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 2ce9b56c64]
2023-10-13 15:11:43 -05:00
zjing14
0053cbab22 Fixed f8_gemm NaN (#975)
* workaround NaN problem by changing output to fp16

* enable f8/bf8 gemm tests on MI200

* workaround f16 to f8 conversion

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: ac9595a9f1]
2023-10-10 10:30:26 -05:00
Lauren Wrubleski
aeb94be7b4 Replace CMake return from later CMake (#970)
[ROCm/composable_kernel commit: 5913609168]
2023-10-05 14:58:58 -07:00
Illia Silin
3b3782dd07 Revert "Add support for mixed precision in contraction scale and bilinear" (#967)
* Revert "Add support for mixed precision in contraction scale and bilinear (#936)"

This reverts commit f7aff936cb9d02dc8e53a8a3ea8648e1058253a2.

* revert commits #957 and #960

[ROCm/composable_kernel commit: 4daedf8ca5]
2023-10-05 14:58:23 -07:00
zjing14
41eb6c1899 remove example 60 (#963)
Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 570ff3ddbe]
2023-10-05 09:41:01 -07:00
Rostyslav Geyyer
6a3eedbff0 Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945)
* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Update naming to f8->fp8

* Update naming

* Format

* Update naming (#937)

* Add a client example

* Add computetypes to device and gridwise ops

* Add instances, update instance factory

* Format

* Fix a flag

* Add ckProfiler mode

* Fix typos

* Add an example

* Add bf8 generator

* add bf8 mfma; fixed type_convert for bf8

* move verification ahead of timing

* Update reference calculation

* Fix reference

* Narrow down float init range

* Fix bf8 bf8 mfma

* Add bf8 @ fp8 mfma

* Update example

* Update instances

* Update profiler api

* Update for compatibility

* Format

* Remove extra example

* Clean up

* workaround convert

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 42facfc6b7]
2023-10-04 08:19:08 -05:00
zjing14
a7a9249cef changed test for grouped_gemm to be random (#959)
Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 5311d1b325]
2023-10-03 09:32:58 -05:00
zjing14
33859062bd Fixed contraction issues (#960)
* add missing ComputeType

* fixed

* Update cmake-ck-dev.sh

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: aa46039f2d]
2023-10-03 09:32:44 -05:00
Rostyslav Geyyer
28a1199b62 Add fp8 @ bf8 gemm support and example (#933)
* Add f8 bf8 gemm example

* Add element-wise ops

* Add intrinsics

* Update reference calculation

* Add an additional type option for xdlops gemm

* Fix build process

* Add bf8 to buffer addressing

* Update blockwise op, split typeA and typeB

* Update for compatibility

* Update naming to f8->fp8

* Update naming

* Format

[ROCm/composable_kernel commit: bd09b5c538]
2023-10-02 16:39:03 -05:00
Illia Silin
6a8658812a get rid of gfx900/906, set rocm5.7 as default (#958)
[ROCm/composable_kernel commit: 59dbb01fd1]
2023-10-02 12:01:11 -07:00
zjing14
50c12c6c43 Contraction multi abd (#957)
* add gridwise_multi_abd

* move element_op into RunRead

* merge element_wise op with data read

* add multiABD example

* allow packed elementwise_op

* changed example

* clean

* clean

* add is_detected

* fix

* minor fix

* add scaleAdd_vec4 example

* init commit for contraction_multi_ABD

* add examples

* add examples of multiA and broadcast

* update example

* fixed comments

* Update cmake-ck-dev.sh

* Update cmake-ck-dev.sh

* Add comments into the example

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 9d58c42103]
2023-10-02 09:18:36 -05:00
Bartlomiej Wroblewski
ce003d6493 Add support for mixed precision in contraction scale and bilinear (#936)
* Extract common functionality to separate files

* Reference contraction: Remove incorrect consts from type_converts

* Reference contraction: Add missing type_convert for dst value

* Reference contraction: Fix incorrect order of B matrix dimensions

* Add support for mixed precision in contraction scale and bilinear

* Move using statements from instances to a common file

* Move using statements from examples to a common file

* Fix the order of B matrix dimensions across examples and profiler

* Fix the computation of error threshold

* Make ComputeDataType an optional argument

* Include possible DataType -> ComputeDataType casting error in the threshold

* Remove commented code

[ROCm/composable_kernel commit: f07485060e]
2023-09-29 10:54:31 -05:00
Bartłomiej Kocot
612cbbdc54 Add grouped conv bwd data wmma (#950)
* Add grouped conv bwd data wmma

* Fix copyrights

* Add instances with smaller NPerBlock

* Update interface test

* Minor stylistic fixes

* Minor stylistic fixes

[ROCm/composable_kernel commit: cb53874002]
2023-09-28 23:10:18 +02:00
Bartłomiej Kocot
be5cb244c0 Add column to image kernel (#930)
* Add column to image kernel

* Minor fixes for dtypes and client examples

* Disable tests for disabled dtypes

* Disable add instances functions for disabled data types

* Minor stylistic fixes

* Revert "Disable add instances functions for disabled data types"

This reverts commit 728b869563.

* Instances reduction

* Add comments in device_column_to_image_impl

* Update changelog and Copyrights

* Improve changelog

[ROCm/composable_kernel commit: e2243a4d1e]
2023-09-27 17:19:06 +02:00
zjing14
fb513ac42b Add multiple A/B support (#906)
* add gridwise_multi_abd

* move element_op into RunRead

* merge element_wise op with data read

* add multiABD example

* allow packed elementwise_op

* changed example

* clean

* clean

* add is_detected

* fix

* minor fix

* add scaleAdd_vec4 example

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 11676c7e49]
2023-09-26 21:16:23 -05:00
Illia Silin
3609ff10f7 Refactoring cmake files to build data types separately. (#932)
* refactor cmake files for the tests

* refactor cmake files for examples

* fix cmake for gemm example

* fix the cmake file for all examples

* add splitting by data types in gemm_splitk instance header

* rename test to reflect only dl instances are used

* clean up CI workspace, update cmake for instances

* change the jenkinsfile syntax

* build all instances except DL on gfx11

* move workspace cleanup after stages

* clean up workspace after every stage

* isolate data types in grouped_conv_fwd header

* isolate dl instances for grouped_conv2d_fwd

* fix syntax

* fix cmake and batchnorm instances

* fix typo

* fix reduction instances

* fix grouped_conv headers

* fix syntax

* replace parsing logic for instances, replace bfp16 with bf16

* fix the client examples build

* clean up DTYPES from instances cmake files

* update the parsing logic in cmake files

* make an exception for reduction kernels

* update few remaining cmake files to handle DTYPES

* fix syntax

* fix cmake conflicts

* replace f8 with fp8 test name

* resolve conflicts for dpp instances

[ROCm/composable_kernel commit: bba085d2b5]
2023-09-20 22:15:56 -07:00
zjing14
2d384eaba7 Add fp16/fp8 support into Grouped gemm FixedNK (#874)
* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add a instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* instance and client

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

* clean

* formatting

* clean

* clean

* fixed computeType

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: f9d0eddb90]
2023-09-14 21:04:10 -05:00
Bartłomiej Kocot
f4999cd99a Add grouped conv bwd weight dl instances and new layout (#897)
* Add grouped conv bwd weight dl instances and new layout

* Add M and N padding

* Remove todo comment

* Enable grouped conv fwd dl k,c=1 generic instance

* Comment fixes

[ROCm/composable_kernel commit: 475188ca2e]
2023-09-13 10:14:31 -05:00
zjing14
5bb25a9688 fixed fp8 issues (#894)
* fixed fp8 init and reference gemm

* Update host_tensor_generator.hpp

* fixed convert

* fixed reference gemm

* fixed comments

* fixed comments

* fixed ci

* fixed computeType

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: a66d14edf2]
2023-09-12 22:17:56 -05:00
Rostyslav Geyyer
0752117077 Refactor f8_t, add bf8_t (#792)
* Refactor f8_t to add bf8_t

* Add check_err impl for f8_t

* Update fp8 test

* Format

* Revert the fix

* Update vector_type implementation

* Add bf8 test

* Add bf8, use BitInt types

* Add bf8 conversion methods

* Update type_convert for fp8/bf8

* Add check_err fp8/bf8 support

* Add subnorm fp8 tests

* Add subnorm bf8 tests

* Fix conversion

* Add bf8 cmake bindings

* Add macros to enable build with disabled fp8/bf8

* Remove is_native method

* Update flag combination for mixed precision instances

* Add more flag checks

* Add another flag to a client example

* Add type traits, decouple f8/bf8 casting

* Clean up

* Decouple fp8 and bf8 flags

* Remove more redundant flags

* Remove leftover comments

[ROCm/composable_kernel commit: 62d4af7449]
2023-09-12 17:04:27 -05:00
Haocong WANG
c2866bb432 [Navi3x] Add fp16/int8 wmma conv forward instances (#746)
* fix wmma gemm int8; add grouped conv int8 example

* Add int8 gemm-bilinear instances

* compile sanity check unknown

* Sanity pass + clang-format

* add int8 conv profiler instances

* solve merge conflict

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 562b4cec48]
2023-09-07 21:59:26 -05:00
Bartlomiej Wroblewski
02f8f707e8 Redesign the DPP8 GEMM kernel to use warp-wise component (#863)
* Redesign the DPP8 GEMM kernel to use warp-wise component

* Review: Improve error messages

* Review: Remove unnecessary empty lines

* Review: Fix M, N per thread names

* Review: Rename mfma_input_type to dpp_input_type

* Review: Fix tensor adaptor; remove unnecessary element

* Review: Remove calls to dpp_gemm's MakeCDescriptor

* Review: Add blockwise doc, change function names to include dimension names

* Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file

* Review: Add __restrict__ keywords

* Review: Use MatrixPadder for padding A, B, C matrices

* Review: Remove hardcoded datatypes

* Review: Change names from FloatX to XDataType

* Review: Introduce AK0 and BK0 instead of a single K0

* Review: Remove construction of dpp_datatypes object

* Review: Rename DppInstrRunner to DppLanegroupGemm

[ROCm/composable_kernel commit: 37a8c1f756]
2023-09-06 11:44:09 -05:00
Bartłomiej Kocot
d79b1c5dd0 Add image to column kernel (#867)
* Add image to column kernel

* Add instances, tests, profiler, example

* Add client example

* Several fixes of image to column

* Fix variable name in device_image_to_column_impl

* Several fixes of image to column profiler

* Fix num_btype calculation

* Make new measurements for correct bytes calculation

[ROCm/composable_kernel commit: 0077eeb3be]
2023-09-05 10:11:40 -05:00
zjing14
c79ecbccfb Grouped Gemm with Fixed K and N with SplitK (#818)
* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add a instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: f5ec04f091]
2023-08-31 09:22:12 -05:00
rocking
0b07461518 MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861)
* Add maxpool instances

* Rename index pool to max pool.

* Add maxpool bwd bf16 instances

* Add avg pool bwd instances

* Rename avgpool and maxpool to avg_pool3d and max_pool

* Add bf16 pool fwd instances

* Add max pool bwd to ckProfiler

* Add avg pool3d bwd to ckProfiler

* Add avg pool bwd test

* Fix bug of reference pool fwd (dilation)

* Fix bug of max pool bwd (dilation and initZero)

* Support bf16 compute data type

* Force compute type to be f32, because atomicAdd only supports f32

* Add max pool bwd test

* Rename folder

* Rename pool

* Add max pool bwd client example

* Add avg pool bwd client example

* Add missing workspace

* clang format

* Rename macro

* remove useless header

* remove useless layout

[ROCm/composable_kernel commit: 866377de18]
2023-08-31 21:01:50 +08:00
zjing14
f8bcfe60ac add an example of customized type convert - bfp16_rtn (#869)
* add an example of customized bfp16_rtn

* fixed threadwise_copy

---------

Co-authored-by: Jing Zhang <jizha@amd.com>

[ROCm/composable_kernel commit: 38ada109ea]
2023-08-29 12:31:24 -05:00
Illia Silin
d019f82b57 use correct data types in cmake conditions for splitk gemm example (#862)
[ROCm/composable_kernel commit: 7c71dc7e70]
2023-08-23 09:52:11 -07:00
Rostyslav Geyyer
9ef85a359d Add instances/ckProfiler/client example for fp8/fp16 mixed precision Gemm (#853)
* Add ComputeType arg to splitk device and gridwise ops

* Update for gridwise op compatibility

* Update bf16 and int8 splitk gemm examples with ComputeType

* Add instances

* Update ckProfiler for mixed precision cases

* Add a mixed precision splitK gemm client example

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>

[ROCm/composable_kernel commit: eac50708d9]
2023-08-22 09:34:49 -05:00