Commit Graph

941 Commits

Author SHA1 Message Date
Illia Silin
9195435c77 Disable DL kernels by default. (#816) 2023-07-26 11:06:45 -05:00
Bartłomiej Kocot
ac6d68b353 Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported (#768)
* Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported function (#765)

* Do not throw an error when GEMM problem is not supported.

---------

Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2023-07-26 07:19:55 -07:00
rocking
016bd428df Refine the dimension of host tensor. This example only requires 1D (#812) 2023-07-25 23:18:56 -05:00
Po Yen Chen
f4ea560112 Speed-up global memory reading for GEMM instances (#813)
* Use better ThreadClusterLengths to speed up

* Update B tile reading pattern for layout=NN instance
2023-07-25 18:54:47 -05:00
ltqin
50643dd555 Add bias scalar vectorload = 1 for gemm bias gemm (#791)
* first change bias load

* add bias dim and scalar-vector parameter

* make CDE0BlockTransferSrcVectorDim not work

* changes to instance

* add limit for CDE0BlockTransferSrcScalarPerVector
2023-07-24 20:08:15 -05:00
Illia Silin
844b215d92 add ninja profiling tools to the base docker (#805) 2023-07-21 15:33:17 -07:00
Illia Silin
7a29f711d4 add INSTANCES_ONLY cmake macro to build only instances (#807) 2023-07-21 15:31:19 -07:00
Bartłomiej Kocot
10732847e7 Grouped conv bwd wei NDHWGC/NDHWGK (#804) 2023-07-21 12:00:55 -05:00
Bartłomiej Kocot
49180fd60b Grouped 3d conv backward data support (#799)
* Grouped 3d conv backward data support

* Fix comments
2023-07-18 11:01:33 -05:00
Rostyslav Geyyer
f82bd59389 Remove type_convert bf16 to int32 and back (#802) 2023-07-18 09:44:51 -05:00
Illia Silin
189ea3b9aa Add mechanism to build CK for select data types, add Navi3x CI. (#790)
* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line
2023-07-17 18:02:42 -07:00
Illia Silin
4867db4290 Add check for compiler GPU target support. (#800)
* check if gpu_targets are supported by compiler

* set default list of targets and filter for them
2023-07-17 09:44:40 -07:00
arvindcheru
03d3395b3c Disable Werror to ignore xnack+ warnings (#794)
* Disable Werror to ignore xnack+ warnings
2023-07-14 20:00:20 -04:00
Bartłomiej Kocot
1ee99dcaa6 Support NHWGC conv2d_bwd_weight (#769)
* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-07-12 08:25:02 -05:00
Illia Silin
87f2bbcf5c change the build thread usage in CI (#787) 2023-07-06 20:17:25 -05:00
Adam Osewski
237f9cd3aa Add basic setup for precommit (#749) (#764)
* Add basic setup for precommit

* Update README.md with instructions on installing precommit hooks

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>
2023-07-06 11:01:06 -05:00
Po Yen Chen
850144a0d3 Split GEMM instance library & enable pipeline v2 optimization (#783)
* Move source file into sub-directories

* Add missing include directive

* Split DeviceGemmXdl<> fp16 instances

* Fix format

* Remove unnecessary CMakeLists.txt

* Add macros to toggle new features

* Remove debug message

* Turn off GEMM v2 pipeline optimization by default

* Fix format

* Extract duplicated string as list

* Enlarge indent in CMakeLists.txt
2023-07-06 10:59:35 -05:00
Qianfeng
8f5cafaf04 Batchnorm splitk single kernel (#771)
* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix copyright wording

* Remove unneeded include in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove unused code in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove unused kernel file
2023-07-06 10:58:55 -05:00
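The batchnorm commits above build on a blockwise Welford reduction whose per-block partial results are then merged across workgroups. A minimal Python sketch of the two Welford steps involved (illustrative only, not CK's `blockwise_welford` implementation; all names are assumptions):

```python
def welford_update(count, mean, m2, x):
    """One Welford step: fold sample x into a running (count, mean, M2) state."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(c_a, mean_a, m2_a, c_b, mean_b, m2_b):
    """Merge two partial Welford states, as a multiblock reduction would."""
    count = c_a + c_b
    delta = mean_b - mean_a
    mean = mean_a + delta * c_b / count
    m2 = m2_a + m2_b + delta * delta * c_a * c_b / count
    return count, mean, m2
```

Because merging per-block partial states yields exactly the mean/variance of a single pass over all data, the split reduction can complete inside one kernel instead of the old two-kernel multiblock method.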
Adam Osewski
f4dfc060b7 Move Device Ops implementations into impl directory. (#777)
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-07-06 16:15:51 +02:00
Bartlomiej Kocot
2b0b6d9f46 Fix copyrights for DeviceBatchedGemmMultipleD_Dl 2023-07-06 15:50:27 +02:00
Rostyslav Geyyer
61dc9aa932 Add the missing archs (#785) 2023-07-05 18:29:56 -05:00
Rostyslav Geyyer
1cf5003179 Add fp8 GEMM and an example for it (#767)
* Add fp8 xdl gemm

* Add example

* Use int8 intrinsics for buffer load/store

* Format

* Update cmakelists
2023-07-04 20:38:49 -06:00
Illia Silin
7797bd3d2b Upgrade default docker to ROCM5.6 release. (#778)
* upgrade default compiler to rocm5.6 release

* do daily runs with rocm5.6 instead of 5.5
2023-06-30 08:06:54 -07:00
Illia Silin
d3adc66581 Add rocm5.6 RC4 and rocm5.7 to docker build options. (#770)
* upgrade to rocm5.6 rc4

* add rocm5.7 docker
2023-06-28 08:58:28 -05:00
Illia Silin
3b18f1e38c do not build gfx941/942 targets during CI (#766) 2023-06-21 10:47:35 -07:00
Bartłomiej Kocot
63388e84ab Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757)
* Support bf16/f32/f16 and NHWGC conv2d_bwd_data

* Add interface test

* clang format

* Comment fixes

* Add more friendly error message
2023-06-21 08:20:31 -05:00
ltqin
32d2f52bf7 remove useless comments (#760) 2023-06-19 19:25:08 -07:00
zjing14
05ea6452b6 changed pipeline v1 (#763) 2023-06-19 19:24:18 -07:00
Illia Silin
645eb2f2a0 do not build gemm-gemm and conv-conv examples for gfx94* (#761)
* do not build gemm-gemm and conv-conv examples for gfx94*

* do not build gemm-gemm and conv-conv examples on navi
2023-06-19 16:55:03 -07:00
Rostyslav Geyyer
f0c620c42e FP8 enablement - add a pseudorandom number generator, add conversion methods (#708)
* Add basic fp8 definitions and PRN generator

* Format

* Add fp8<->fp32 type_convert

* Format

* Split type_convert and cast_to/from_f8

* Format

* Minor fix

* Minor fix

* Move fp8 utils to a separate header

* Add elementwise ops

* Add fp8_convert_sr

* Format

* Add element op

* Eliminate magic numbers

* Split f8_convert_sr in host and device

* Format

* Add some constexpr

* Add a datatype test

* Format

* Another format

* Add fp8<->fp16 tests

* Update type_converts

* Format

* Add fp16 casting functions

* Format

* Use seed as a runtime arg

* Use element location for PRNG

* Format

* Add fp8<->fp16 to PassThrough element op

* Clean up

* Merge host and device implementations

* Add comments on rounding modes

* Remove leftover code

* Put type_converts into a separate header

* Put random number gen to a separate header

* Rearrange f8_utils' namespaces

* Refactor type_convert.hpp

* Move f8_t definition
2023-06-19 11:20:35 -05:00
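The `f8_convert_sr` commits above implement stochastic rounding: round up with probability equal to the fractional distance to the next representable value, so the rounding error cancels in expectation. A simplified Python sketch of the idea, quantizing to a fixed step rather than an actual e4m3/e5m2 fp8 codec (names and signature are illustrative, not CK's API):

```python
import random

def convert_sr(x, step, rng=random):
    """Stochastically round x to a multiple of `step`: round up with
    probability equal to the fractional distance past the lower multiple."""
    q, rem = divmod(x, step)
    if rng.random() < rem / step:  # rem/step in [0, 1)
        q += 1
    return q * step
```

In the real implementation the commits make the PRNG a function of a runtime seed plus the element's location, so conversion is deterministic per element while still statistically unbiased.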
rocking
341ad95665 Maxpool bwd (#750)
* Add maxpool f32 kernel and example

* Revise copyright

* Add device pool bwd device op

* Support f16 and bf16

* Add compute datatype for reference code.
Prevent error in bf16

* Fix type error

* Remove layout

* Fix bf16 error

* Add f16 and bf16 example

* Add more operations

* Implement IsSupportedArgument

* Add changelog

* Add comment

* Add comment

* Remove useless header

* Move initialize of workspace to the run

* Move set din zero to the device operator

* Save din_length_raw

* Remove useless header

* Calculate gridsize according to the number of CU

* Calculate gridSize according to the number of CU.
Remove useless header

* Add put example

* Remove useless header

* Fix CI fail
2023-06-19 09:44:22 -05:00
Qianfeng
0d9118226b Padded Generic Kernel Instance (#730)
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting

* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization

* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax

* Add testing of GetGenericInstance() in client_example of Softmax

* Revert "Add testing of GetGenericInstance() in client_example of Softmax"

This reverts commit f629cd9a93.

* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"

This reverts commit a9f0d000eb.

* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm

* Move generic kernel instance to separate tuple for elementwise op of normalization

* Remove unused files for softmax instance

* Store generic kernel instance to separate tuple for softmax

* Add IsSupported checking for generic instance to client example of softmax

* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization

* clang-format fix

* Remove int8 from softmax instances

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-06-16 23:43:11 -05:00
Illia Silin
d140bdc9fa do not build gfx941/942 targets during daily QA runs (#758) 2023-06-16 12:13:16 -07:00
Illia Silin
027e46ee82 Enable gfx941 and gfx942 architectures. (#752)
* enable gfx941/942 targets

* fix clang format

* fix the cmake logic for multiple targets

* fix cmake syntax for looping over targets

* add gfx941/942 support for gemm_xdl instances
2023-06-15 08:20:59 -07:00
zjing14
309b1c6461 Fixed Weight layout of grouped_conv 3d fwd (#743)
* Changed wei layout

* changed layout for examples

* fixed client example

---------

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
2023-06-15 10:19:33 -05:00
Qianfeng
c5f6ec842c Using number of compute units to set gridSize (#754)
* Add getAvailableComputeUnitCount() interface

* Use available number of compute units to set kernel grid size
2023-06-15 10:13:59 -05:00
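This commit (like the maxpool gridsize commits above) sizes the launch grid from the hardware's available compute units rather than from the problem size, with each workgroup looping over multiple tiles. A hedged Python sketch of that sizing rule (the helper names and the occupancy factor are assumptions, not CK's `getAvailableComputeUnitCount()` API):

```python
def calculate_grid_size(num_tiles, cu_count, blocks_per_cu=2):
    """Launch at most cu_count * blocks_per_cu workgroups; never more
    workgroups than there are tiles of work."""
    return min(num_tiles, cu_count * blocks_per_cu)

def tiles_for_block(block_id, grid_size, num_tiles):
    """Grid-stride assignment: the tiles a given workgroup processes."""
    return list(range(block_id, num_tiles, grid_size))
```

Capping the grid this way avoids launching far more workgroups than the device can run concurrently, while the grid-stride loop still covers every tile exactly once.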
Illia Silin
d1838d328c Fix the daily CI job with latest staging compiler. (#753)
* fix CI builds with latest staging compiler

* remove mount flags from dockerfile
2023-06-14 16:44:13 -07:00
Rostyslav Geyyer
54b68eb343 Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (#741)
* Add generic instance gemm_add_add_fastgelu

* Add a client example for generic gemm_add_add_fastgelu

* Update CMakeLists

* Format

* Format

* Add generic instance gemm_add_fastgelu

* Format

* Add a gemm_add_fastgelu client example

* Format

* Add generic instance gemm_fastgelu

* Format

* Fix argument order

* Add gemm_fastgelu client example

* Add exceptions if argument is not supported
2023-06-14 16:06:56 -05:00
Rostyslav Geyyer
a35456a3f4 Fix arg order (#751) 2023-06-12 08:38:46 -05:00
Bartłomiej Kocot
fc9f97568f Add DeviceBatchedGemmMultipleD_Dl (#732)
* Add DeviceBatchedGemmMultipleD_Dl

* Fix batched_gemm tests

* Fix comments

* test_batched_gemm_multi_d fixes

* Fix args for isSupported batchedGemmMultipleDDl

* Disable tests for gfx90a
2023-06-12 08:37:15 -05:00
Po Yen Chen
7c24654c24 Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (#738)
* Fix wrong pointer type

* Rename type trait get_unsigned_int<> to get_carrier<>

* Add 3-bytes carrier type

* Add missing __device__ specifier

* Rename template non-type parameter

* Leave the rest byte uninitialized

* Avoid invoking (host) STL algorithms

* Remove unnecessary 'inline' specifier

* Extract common logic out as helper method

* Hide dummy member function

* Add missing __device__ specifier
2023-06-12 08:36:40 -05:00
ltqin
0ede66de54 Fix flash attn mask bug (#733)
* add check input parameter

* add instance for vector load = 1

* move general instance to first position

* fix read bias code

* regular code for bias load

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-06-12 08:35:31 -05:00
carlushuang
016ebaa7f3 support dynamic buffer using memory coherence glc_slc bit from template (#725) 2023-06-08 07:40:29 -05:00
Illia Silin
1dd455d633 Update docker (#744)
* update dockerfile to build rocm5.6 rc3

* fix couple of docker issues
2023-06-07 09:35:14 -07:00
Illia Silin
4036590401 fix clang format (#740) 2023-06-02 14:10:02 -07:00
who who who
e2ebc8e795 replace hipMemcpy with hipMemcpyWithStream (#734) 2023-06-01 16:23:41 -05:00
Po Yen Chen
9eae73df9b Simplify kernel argument of device operator Device(Batched)GemmXdl<> (#723)
* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnecessary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove trailing underscore in public attribute name

* Remove unnecessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic from device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Share same name for kernel interfaces

* Fix wrong boundary calculation logic

* Leave the third template arg for compatibility

* Remove unnecessary parameters

* Fix wrong error message (for type name)

* Create descriptor on device side

* Fix wrong debug message

* Remove no-longer used data members

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Reject unsupported argument

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Update function calls

* Reorder statements

* Re-format files

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-06-01 16:23:02 -05:00
Illia Silin
b94fd0b227 update copyright headers (#726) 2023-05-31 18:46:57 -05:00
Po Yen Chen
582e31e88d Add class type support for __builtin_amdgcn_readfirstlane() (#711)
* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Fix wrong boundary calculation logic

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Reorder statements
2023-05-31 10:25:25 -05:00
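The amd_wave_read_first_lane commits (#711 here, plus the "object size = 4n + 3" fix in #738 above) broadcast an arbitrary-size object by copying it through integer "carrier" chunks that `__builtin_amdgcn_readfirstlane()` can handle; adding a 3-byte carrier lets every size decompose into dwords plus one small remainder carrier. A Python sketch of that decomposition and of readfirstlane's broadcast semantics (the exact chunking strategy is an assumption, illustrative only):

```python
def carrier_sizes(num_bytes):
    """Split an object of num_bytes into carrier chunks: as many 4-byte
    dwords as fit, then one 1-, 2-, or 3-byte carrier for the remainder."""
    sizes = [4] * (num_bytes // 4)
    rem = num_bytes % 4
    if rem:
        sizes.append(rem)  # rem == 3 is the (4n + 3) case fixed by #738
    return sizes

def read_first_lane(wave_values):
    """Model of readfirstlane: every lane receives lane 0's value."""
    return [wave_values[0]] * len(wave_values)
```

Broadcasting each carrier chunk in turn and reassembling the bytes on every lane yields a wave-uniform copy of the whole object, which is what lets class types (not just 32-bit scalars) be passed through the intrinsic.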
Haocong WANG
6eef0755c9 fix wmma gemm int8; add grouped conv int8 example (#716) 2023-05-30 07:18:53 -05:00