Commit Graph

828 Commits

Author SHA1 Message Date
Chao Liu
8f455615a8 Fast GeLU using built-in function (#587)
* clean up

* fast gelu using builtin function

* clean

* clean

* clean

* clean

* clean

* fix compilation

* clean

* clean

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-26 23:19:11 -06:00
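The "fast GeLU using built-in function" commit above swaps the exact erf-based GeLU for a cheaper device builtin. A minimal sketch of the common tanh-based fast-GeLU approximation this typically corresponds to (plain Python stand-in for illustration, not the actual CK device code):

```python
import math

def fast_gelu(x: float) -> float:
    # Tanh-based fast-GeLU approximation:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654 * (x + 0.044715 * x ** 3)))

def gelu_exact(x: float) -> float:
    # Exact GeLU via the Gaussian CDF, for comparison.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

On GPU the tanh (or an equivalent exp-based form) maps to a hardware builtin, which is the speedup the commit is after; the approximation error versus exact GeLU is small across typical activation ranges.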
zjing14
209baee299 disable tensor contraction f64 on MI100 (#602) 2023-02-23 16:59:37 -08:00
Rostyslav Geyyer
246ceee49e Add Grouped Conv Backward Weight on Navi21 for ResNet50. (#505)
* Add DeviceOp and examples

* Format DeviceOp template arguments

* Remove bf16 example

* Format

* Format

* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N

* Refactor argument preparation

* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl

* Rename device op file

* Update include directive in the example file

* Update descriptor preparation for grouped op

* Update the argument

* Update batch handling

* Add gridwise gemm supporting batched input

* Update blockwise indexing, working version

* Update copyright year

* Update check if argument is supported

* Refactor and make consistent with xdl examples

* Update check if argument is supported

* Add changelog entry

* Added comments on Dl op split_k>1 support

---------

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-22 11:59:53 -06:00
ltqin
830d37a7d5 Grouped conv1d client example (#589)
* add conv1d fwd client example

* change 07_grouped_conv2d_fwd to 07_grouped_convnd_fwd

* add conv1d bwd weight

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-22 11:55:21 -06:00
Illia Silin
bef0cb20db fix a bug when building for gfx1030 target. (#591)
* fix a bug while building for gfx1030 and add gfx1030 to targets

* fix syntax
2023-02-16 13:54:08 -06:00
Illia Silin
584d233cfe Build and archive deb packages. (#590)
* build and archive deb packages

* fix syntax

* run QA to test building packages

* apply cron to develop branch again
2023-02-16 13:11:23 -06:00
pmaybank
cb3fac4d2a Sphinx doc (#581)
* New docs directory with minimal config

* Based on docs directory of rocBLAS

* Config for running Doxygen then Sphinx to generate HTML

* Add minimal content - intro to doc

* Add some boilerplate sections to doc

* content still needs to be done,
* e.g., need to generate API documentation using Doxygen
* need to write contributor guide

* Start Softmax section of Support Primitives doc

* Written as a test bed for typesetting math content

* Need to decide how much detail to go into

* add doc directories to git ignore file.

* Minor edits - new line at EOF, change year in copyright notices

* Port Markdown files to ReStructuredText

* Copy Markdown files from pre-existing doc directory to docs directory

* Convert to reStructured Text (rst) - section headings, links, tables
  have a different syntax in rst

* New rst files added to index - can generate HTML with same style as
  HTML generated from rst files in previous commits

* Intention is to make all the content in doc redundant and use rst
  throughout rather than mix of md and rst

* Extend Softmax section of Primitives Guide

* rename l to z

* add material on applying softmax row-wise to matrix

* define macro for diag operator (represents diagonal matrix)

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 17:17:46 -06:00
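The Sphinx doc commit above adds a Softmax section covering row-wise softmax applied to a matrix. A minimal numerically stable sketch of that operation (plain Python, illustrative only — not the CK device primitive):

```python
import math

def softmax_rows(mat):
    # Numerically stable row-wise softmax: subtract each row's max
    # before exponentiating, then normalize by the row sum.
    out = []
    for row in mat:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

Subtracting the row maximum leaves the result unchanged mathematically but avoids overflow in `exp`, which is the standard trick the documented primitive relies on.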
Illia Silin
19490ac4f7 Clean up kernel launch output (#569)
* clean up output from kernel_launch

* set RUN_WARMUP to 0 by default

* split the warm-up into a separate issue

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 12:07:21 -06:00
zjing14
24c9ee1d22 Add contraction_fp64 example (#570)
* add contraction_bilinear

* add contraction_scale_xdl_fp64

* reduce tile size to avoid register spill

---------

Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
2023-02-15 12:00:58 -06:00
rocking5566
6a6163a3d1 Improve normalization (#580)
* Sync the order of type string with template parameter

* Add more instances

* Check the vector size and remove redundant var

* Extract var to static, prepare to separate sweep once kernel

* Separate sweeponce flow and optimize the flow

* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization

* Remove useless code

* Update naive variance kernel

* Refine string

* Fix typo

* Support naive variance for device_normalization

* Check the blocksize

* Share the VGPR of x and y

* Share the VGPR of gamma and beta

* Add more instances

* Support fp16 sqrt for experiment

* Add CHANGELOG

* Fix typo

* clang-format
2023-02-15 11:59:35 -06:00
Haocong WANG
0cfda84d05 [Navi3x] Add Device Operations (#567)
* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

* navi3x_multipleD+example

* temp save

* workable

* batchedgemm[OK], groupconv[debug]

* groupconv: Sanity check[OK], Performance[Bad]

* navi3x_groupconv_need_optimization

* format

* Add arch limitation to all wmma examples

* fix bug: example30 input conv args
2023-02-15 11:50:51 -06:00
Adam Osewski
e9fd122889 Conv3D FWD BWD WRW fp16 fp32 client examples (#559)
* Conv3d bwd weight client example.

* Update year in license

* Convolution bwd data 3D fp16/fp32 client example.

* Client example for convnd fwd fp16 fp32

* clang-format

* Review remarks.

* Fix compiler err.

* Update data layout to standard one.

* Add conv 3d fwd NDHWGC instances

* clang-format

* Conv3d fwd NDHWGC instances.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 11:16:47 -06:00
Illia Silin
06f1fc864c Remove the workaround for bf16 attention tests. (#586)
* remove workaround in bf16 attention test

* clean up another workaround
2023-02-14 18:06:24 -06:00
Adam Osewski
8f42780fd6 GroupedGEMM: add larger tiles. (#577)
* Add larger tiles.

* Remove failing instance.

* Remove instances that don't improve perf.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-13 10:06:24 -06:00
Illia Silin
0ac0f51ad6 enable batched_gemm_softmax_bf16 tests (#582) 2023-02-10 13:00:37 -06:00
rocking5566
f7d28f3e4b Gemm+layernorm instance, ckProfiler, client example (#568)
* Add gemm + layernorm instance

* Add ckProfiler

* Add test

* Add client example

* Detect if user forgot to set the workspace

* Use literal in the example

* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()

* check gemm validity in IsSupportedArgument

* Add more testcases

* Merge duplicated folder in client example

* Print more information

* Use better kernel parameter for MS problem size

* clang format

* Add constexpr for if condition and remove redundant include

* Remove cstdlib and add constexpr
2023-02-09 15:02:55 -06:00
guangzlu
76d144fa7c Add instance for elementwise normalization (#573)
* added instances for large N

* add instance for elementwise normalization

* added support restriction in device_elementwise_normalization_impl.hpp
2023-02-09 09:37:29 -08:00
Illia Silin
b63accee2b adding the first draft of changelog (#571)
* adding the first draft of changelog

* second draft of changelog
2023-02-08 17:25:53 -06:00
ltqin
332ccc3367 Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576)
* add instance for gemm bias softmax gemm

* add client example

* change CGridDesc_G_M_N to CGridDesc_G_M_O

* add gridwise

* change c grid name

* add D0s data to device op

* fix 08 client_example

* add example 47_fused_attention

* example output correct

* add d0 to example

* add d0 element op

* rechange instance code

* change Acc0ElementwiseOperation to C0DEElementwiseOperation

* change example name

* update instance for cdeelementwiseop

* add bhalf_t ScaleAdd

* add test

* do not support gemm1 bias

* remove some ignore

* fix test bug
2023-02-08 14:34:45 -06:00
Illia Silin
bb3d9546f1 Fix a couple more CI issues. (#578)
* test the QA cron parameter for compiler commit

* create separate dockers for latest and fixed amd-stg-open compiler versions

* change groovy syntax

* apply cron timers back to develop branch
2023-02-08 11:50:09 -06:00
Illia Silin
f73574ffdd Fix CI issues. (#572)
* switch to recent staging compiler as default for CI

* fix the baseline query

* roll back sqlalchemy to version 1.4.46
2023-02-06 13:15:45 -06:00
Rostyslav Geyyer
afdfef74f7 Add the markdown tutorial hello world (#563)
* Add the markdown tutorial

* Clean up

---------

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
2023-02-01 15:56:59 -06:00
who who who
ba40c2ce9d remove unused variable (#564)
* remove unused variable

* format code
2023-01-31 10:34:35 +08:00
Adam Osewski
274108d6e6 Use defined seed for deterministic test runs. (#562)
Co-authored-by: Adam Osewski <aosewski@amd.com>
2023-01-30 13:03:59 -06:00
Adam Osewski
7494c1c611 Add more instances for irregular GEMM sizes. (#560)
Co-authored-by: Adam Osewski <aosewski@amd.com>
2023-01-26 13:42:20 -06:00
Qianfeng
a1b2441f8d Batchnorm inference instances, external API, client examples and gtests (#531)
* File renaming and class renaming for device element-wise operation

* Add batchnorm-infer instances, external API and client example

* Add batchnorm-infer profiler module and gtests

* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp

* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer

* Rename class and file due to conflict from device_elementwise_2d.hpp

* Fix namespace in batchnorm_infer_nhwc client example
2023-01-25 17:09:04 -06:00
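The batchnorm inference work above wires fixed running statistics plus the learned affine transform into a device elementwise op. A scalar sketch of the underlying math (hypothetical helper name, not CK's API):

```python
import math

def batchnorm_infer(x, mean, var, gamma, beta, eps=1e-5):
    # Inference-time batchnorm: normalize with the running mean/variance
    # gathered during training, then apply the learned scale and shift.
    return gamma * (x - mean) / math.sqrt(var + eps) + beta
```

Because mean and variance are constants at inference time, the whole expression collapses to a per-channel fused multiply-add, which is why it fits naturally into an elementwise device op as the commits describe.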
Qianfeng
52abc2f371 Use double for all scaling values and floating-point constant values in the Device Op API (#557)
* Use double as alpha/beta values type in reduce device op api

* Use double as alpha/beta values type in softmax device op api

* Use double as alpha/beta values type in multiple-reduce device op api

* Use double as epsilon value type in normalization/elementwise-normalization device op api
2023-01-18 12:02:50 -06:00
Raman R jana
1cfa87608a Wavelet (inter-wave consumer-producer) GEMM (#310)
* wavelet gemm programming model support for CK

* GEMM pipeline update for wavelet programming model

* Updated wavelet programming pipeline

* fixes for global-write for math-wave

* fixed bug in global writes

* Updated comments for better readability

* fixed clang format errors

* added block_lds without barrier sync

* clean

* clean

* clean

* clean

* refactor

* prototype

4 layouts

fix default stride

all problem sizes

tidy

move file

update build script

restore old file

fix build

* refactor standalone test to use gemm test harness

* simplify gemm test

* update build script

* remove redundant

* early return when cmd arg doesn't match

* tidy

* report failure when result not validated

* tidy

* Add comment depicting B2C mapping pattern.

* Formatting & comments.

* Comparison with custom B2C mapping pattern.

* Example for wavelet gemm.

* Add wavelet to Gemm standalone test.

* Remove debug code.

* Remove dangling #endif directive.

Co-authored-by: root <Raman Jana>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2023-01-18 12:00:02 -06:00
ltqin
d66421fe34 Add multiD Gemm client APIs (#534)
* start add example

* fix config

* fix showinfo bug

* add an elementop

* change to padding

* add xdl example

* change elementwiseop

* add instance

* add instance to profiler

* change file name

* fix device-not-supported issue

* add client example

* fix client gemm_add_multiply name

* change AddMultiply elementwiseop

* fix elementwiseop

* fix client example

* fix addmultiply op

* fix comments and fun name

Co-authored-by: letaoqin <letaoqin@amd.com>
2023-01-18 11:53:56 -06:00
Illia Silin
00ff30af8c fix a bug for 6-dim kernels (#555) 2023-01-18 11:44:11 -06:00
who who who
147b7db561 add multi embeddings support (#542)
* add multi embeddings support

* fix format

* optimize sqrt

* add reduce operation

* change to elementwise op

* fix name

* rename

* run ci cd

* format example

* format code

* format code
2023-01-18 11:32:12 -06:00
ltqin
55236709e2 Add client API/examples for 3xGemm+Bias+Add+Permute{0, 2, 3, 1} (#550)
* add example

* fix example

* add instance for gemm permute

* add to client example

* change configs

* change instance file name

* format

* change client example file name and remove example
2023-01-18 10:52:52 -06:00
Qianfeng
80e0526741 Reduction external API and client examples (#493)
* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the functional call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2023-01-16 22:18:06 -06:00
rocking5566
7829d729fb Gemm layernorm welford (#413)
* Add device op of gemm layernorm

* [What] Rename F to H
[Why] F and G prepare for welford tensor

* Add gridwise gemm + welford

* Extract template parameter

* Rename kernel. Prepare to add second half kernel

* Extract var

* Add second kernel for gemm+layernorm

* Move to the gemm_layernorm folder

* Rename F and G to mean and var

* Do not use snakeCurved; it makes determining padding for welford difficult

* Rewrite the device interface and rename some var

* Add welford count

* Update interface

* Sync code, prepare to test on MI200

* Clean the code

* Implement layernorm

* Add comment to mention hipFree

* Write out e for debugging.
This could be removed and h used instead

* 1. Allocate mean, var and count by SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size

* Add gemm layernorm host code

* use reference layernorm

* Fix bug of blockwise welford for first kernel

* Fix bug of mean var padding for layernorm

* Use sgpr for shuffleM_index

* padding for GemmMeanVarCountGridDescriptor_M_NBlock

* Add layout parameter

* Check argument for gemm

* calculate max count for tail block

* Share E and H memory in device op

* Hard code the vector dim

* Refine the MakeDescriptor

* 1. Remove E parameter, because E is inside of device op
2. Check vector size

* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor

* Use 1D global memory for count

* Prevent redundant IO

* Update parameter

* Add pipeline v1/v2 selector

* Rename the example name

* Add base class for gemm layernorm

* Refine naming to distinguish naive and welford

* Add comment to explain in detail

* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1

* Rewrite the 2nd kernel, use multiple blocks along the N dimension in the layernorm kernel

* Share the vector size

* Refine var name

* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce

* Add comment

* Extract divisor out of the loop in reference layernorm

* Pad different size for E and H in layernorm kernel according to different block tile

* Refine naming

* Refine naming

* Prevent implicit cast

* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf only supports float; double would cause casting

* Cast only constant

* Change of post shuffle thread descriptor

* Add EMeanVarDataType parameter.

* Merge the mean and var threadwise copy

* Add missing index

* Fix Typo

* Sync the variable with previous if

* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2023-01-16 20:08:25 -06:00
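The gemm+layernorm commits above build the mean/variance with Welford's single-pass update (blockwise welford, then a merge across blocks) rather than a naive two-pass reduction. A minimal sketch of the per-element update rule (plain Python, illustrative only):

```python
def welford(xs):
    # Welford's online algorithm: one pass over the data, keeping a
    # running mean, sum of squared deviations (m2), and sample count.
    mean, m2, count = 0.0, 0.0, 0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count          # update running mean
        m2 += delta * (x - mean)       # accumulate squared deviations
    var = m2 / count if count else 0.0  # population variance
    return mean, var, count
```

Keeping `(mean, m2, count)` triples is also what makes the blockwise version mergeable: partial triples from different blocks (or the tail block with a smaller count, as one commit notes) can be combined without revisiting the data.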
Haocong WANG
919aeb1f52 [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541)
* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size
2023-01-16 20:06:01 -06:00
Illia Silin
715e8dd241 Add a flag to enable/disable debug output in many kernels. (#549)
* add DEBUG_LOG macro to enable/disable debug output

* fix syntax

* fix syntax again

* fix syntax one more time

* remove blank spaces

* use ifdefs

* add the Print argument

* move the definition of DEBUG_LOG to ck.hpp

* add the missing argument to Print()
2023-01-11 19:55:56 -06:00
Qianfeng
a17b041486 Remove including of cmath (#551)
* Include cmath only when compiling host code in math_v2.hpp

* Remove including of cmath in device_base.hpp and device_permute.hpp
2023-01-11 19:52:47 -06:00
zjing14
0345963eef Add MNK padding, M = 0 support into grouped_gemm (#539)
* add mnk padding, support m=0

* clean code

* clean code

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2022-12-15 15:07:24 -06:00
Illia Silin
1115117503 disable the attention test that fails on MI100 (#540) 2022-12-15 10:20:21 -06:00
Qianfeng
10c72aced8 Add interface GetTypeIdName() and GetTypeIdHashCode() for Device Op (#533) 2022-12-14 18:34:02 -06:00
Rostyslav Geyyer
9a1f2475e3 Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535)
* Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances

* Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances

* Add gemm_add_fastgelu profiler impl

* Add padding device_gemm_fastgelu_xdl_c_shuffle instances

* Add gemm_fastgelu profiler impl
2022-12-14 18:12:09 -06:00
Rostyslav Geyyer
74744cab3e Add a docker hub doc file (#538) 2022-12-14 12:17:28 -08:00
arai713
0e5c264c3e Gridwise elementwise 2d (#466)
* added 2d gridwise elementwise

* added 2d version of device elementwise

* added example file with updated device elementwise call

* added Cmake file

* changed NumDim into 2D

* fixed compiler issues

* fixed indexing for loop step

* fixed NumDim dimension error

* changed blockID to 2D

* updated Grid Desc

* updated kernel call

* fixed 2d thread indexing

* added dimensions for example file

* commented out unused code

* changed vector load

* removed extra code

* temporarily removing vector load on 2nd dim

* changed vector load back, still causing errors

* altered indexing

* changed isSupportedArgument for 2D

* changed indexing + do/while

* fixed isSupportedArgument

* changed dimension for debugging

* fixed

* added testing printouts

* testing change

* added variables to distribute threads through both dimensions

* testing changes

* integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise

* removed most of the extraneous code, testing with different dimensions

* testing

* removed debugging print statements

* moved 2d elementwise permute into elementwise permute directory

* fixed formatting

* removed debugging comments from threadwise transfer

Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2022-12-12 09:18:10 -06:00
Illia Silin
d58b7f5155 Make sure that GEMM sizes in K dimension are supported. (#527)
* apply new K-dimension check in gemm_xdl_cshuffle

* add K-dim check to gemm_xdl and batched_gemm_xdl

* fix syntax

* fix syntax

* clean-up the debug output
2022-12-08 11:48:43 -06:00
Po Yen Chen
614a7b1bb0 Fix Grouped ConvBwdWeight test case failure (#524)
* Use smaller tensor size in test

* Use an even smaller tensor size

* Touch only failing test case inputs
2022-12-07 17:46:28 -06:00
Rostyslav Geyyer
c7a4d36147 Add padding device_gemm_xdl instances (#529)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-12-07 17:46:03 -06:00
guangzlu
ce87b4f765 modified half function in math_v2.hpp (#528)
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-12-07 17:43:02 -06:00
Illia Silin
d072790fe2 Fix CI error. (#530)
* ignore .git folder when doing clang-format

* fix syntax

* add backslashes before quotes

* add path filter for several extensions
2022-12-06 15:09:51 -06:00
Anthony Chang
d156709432 Fix bug where scaling may not be applied in some code path (#526)
* fix bug where scaling may not be applied in some code path

* more test

* revert accidental example code changes
2022-12-02 11:43:34 -06:00
ltqin
23ecf0fa9e Add multiple d gridwise gemm on Navi21 for ResNet50 (#517)
* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add header file (was deleted before)

* fix device-not-supported issue

* format

* remove passthrough check

Co-authored-by: letaoqin <letaoqin@amd.com>
2022-12-02 11:42:31 -06:00