Commit Graph

42 Commits

Author SHA1 Message Date
Michał Kulikowski
5c4f52a02a [CK][Examples] - fixing grouped_conv_bwd_weight command parser. (#2840)
-added parameter to change group count for grouped_gemm examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-09-17 10:39:48 -07:00
linqunAMD
321627aec5 Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724)
* Fix example

* fix build error

* update pk_i4 & moe test case

* fix all instance build (examples)

* fix batched_gemm_gemm (example)

* disable example_gemm_bias_softmax_gemm_permute on gfx11

* remove unnecessary disable gfx11

* update tests

* update tests2
2025-09-12 08:17:07 -07:00
Michał Kulikowski
ffe9775e70 [CK][EXAMPLES] (#2826)
-Added parameter to enable/disable verification and timing of kernel in various examples that missed it.
-Added parameter to change number of groups to execute in grouped_gemm_examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-09-11 12:33:00 -07:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
aledudek
9329432f6c Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments
2025-04-03 13:35:43 +02:00
aledudek
5095906975 Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remvoe maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remvoe comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue
2025-03-17 16:42:43 +01:00
Illia Silin
08d5c02c37 OCP FP8 support for gfx12. (#1710)
* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds

* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds

* (4/5) grouped conv pass

* (5/5) attention pass, todo: debug lds perf bug

* AIT Attention API refactor (#8)

* sanity pass

* sanity pass 2

* confirm significant performance regression.

* turn on all instances

* turn off instance format

* Fix bug & tunning & format

* DML meta, self_attn+cross_attn

* sanity pass

* remove useless flag

* update tile and problem size used in AIT attention

* bug fix in grouped conv supporting check

* deprecate inline asm wmma

* Bug fix: double lds skip

* clang-format

* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array

* part2 of previous commit

* clang format

* API fix of gridwisegemmpipeline

* separate array base and vector base attention tensor transformation

* fix gemm

* clang format

* add gemm fp16 instances

* Temp save

* fpAintB kernel compile pass

* Sanity pass.

* Temp save

* debug code enabled

* Fp16AInt8B_GEMM sanity

* MQA implementation

* GQA-4 example

* tempsave

* Compile pass

* New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm

* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* initial enablement of gfx950

* fix clang format

* disable examples 31 and 41 int8 on gfx950

* initial navi4x enablement

* remove extra endif

* enabled dl_gemm

* update s_barrier and s_waitcnt for gfx12

* fix the gfx12 assembly syntax

* fixed block_sync_lds

* add support for more dl kernels on navi4

* add wmma

* format

* Todo: fix gemm_bilinear_wmma instances compilation bug

* Solve a bug when K1=16

* remove unnecessary changes

* Remove tensor layout limitation to LDS usage in tesnor contraction

* fixed block_sync_lds

* merge navi3_ref

* update self-attention and cross-attention

* fix a typo of name

* fixed layout

* debugging

* Add arch limiter for fp8 gemm

* fixed wmma

* enable fp8 gemm_xdl for all gfx9 targets

* temporarily disable gemm_xdl_fp16_fp8 on MI100/200

* fix the cmake logic for gemm_xdl_fp16_fp8

* fixed c_output

* re-enable the gemm_xdl_fp16_fp8 on MI100/200

* fixed gfx12

* fixed

* fixed

* seperate gfx12 blockwise_gemm

* fixed

* enable fwd conv on navi4x

* enable gridwise

* enabled gemm

* fixed merge

* remove empty example fold

* fixed conflicts

* some small changes

* Update cmake-ck-dev.sh

* Update cmake-ck-dev.sh

* enabled other types

* fixed register loads

* test fa

* enable gfx12

* clean up

* enable some instances on gfx12

* add gfx1201 macro in amd_wmma header

* fix clang format

* enable batched_gemm_softmax_gemm_perm_wmma for gfx12

* disable instances with blocksize=256 in attention examples

* debuggging

* debug

* fixed lds_enabled

* debugging

* Fix and add limit to skiplds feature

* Enable skipLds feature and fix compilation bugs

* add ck_tile definitions for gfx12

* fix clang format and test/wmma_op

* updage instances cmake for gfx12

* disable the test_wmma_op on gfx12

* fix the builds for gfx950

* add gfx12 and gfx950 to default target list

* clean-up cmake file

* Initial introduction of OFP8 data types.

* Renamed FP8 and BF8 tests into FP8_FNUZ and BF8_FNUZ.

* Implementation of ConvertFP32Nearest in test_fp8_ocp.

* Remove dependence on possibly undeclared alias.

* Implement FP8OCP test for stochastic rounding mode.

* Implement FP8OCP tests for half_t type conversions.

* enable bf16 atomic add on gfx950

* Implement ConvertFP32Nearest test.

* Implement ConvertFP32Stochastic test.

* Implement ConvertFP16Nearest and ConvertFP16Stochastic tests.

* Refactoring. Move FP8 definitions into a separate header file.

* Enable easy switching between architectures.

* Fix compilation error for gfx942 architecture.

* only builf gfx950 branch for gfx950 target by default

* Enable OCP build of example_gemm_xdl_fp8.

* Fix formatting.

* fix the build logic for gfx950

* Improve GEMM example verbosity.

* Add constexpr where applicable.

* fix the logic of enabling XDL and WMMA instances

* Improve GEMM example verbosity.

* Enable build of example_gemm_xdl_fp8_bf8 test.

* Fix tests for gfx1101 architecture.

* Build DPP examples only on gfx103 and gfx11 architectures.

* Optionaly run either CPU or GPU verifications with GEMM examples.

* Extend GeneratorTensor_Sequential to produce values of prescribed data types.

* Add missing constructor.

* Improve infrastructure for OFP8 data type support.

* BUGFIX. Should not use FP8 as Compute/Accum data type.

* Add custom target for grouped_convnd_bwd_weight tests.

* Can build `tests` target on gfx950.

* Bugfixes on gfx1101 architecture.

* Fix dependencies.

* Provide single point of truth for FP8 INF and NAN checks

* Prevent instantiation of operators that are not supported by FP8 data types

* Add FP8 type selection into client_axample CMakeLists.txt

* Prevent sccache server from shutting down during build

* Fix test success reporting logic

* Change default verification method to CPU.

GPU verification takes too much time to complete on the emulator.

* Make sure all tests and examples are built for gfx950

* Facilitate testing of FP8 data types on the emulator

* Introduce two new tensor generators

* Enable instances built for gfx94 to be built on gfx950

* Verify 35_splitk_gemm on floating point numbers.

splitk gemm appears to be losing precision VS reference implementation when FP numbers are involved.

* Verify 04_gemm_add_add_fastgelu on floating point numbers

* Verify 20_grouped_conv_bwd_weight on floating point numbers

* Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers

* Verify more tests on floating point data

* Fix data types and improve testing verbocity.

* Upgrade to NPI 573 build docker.

* Skip on gemm_universal tests.

The tests take too long to complete on the emulator.
Need to see if it is possible to reduce the scope of the testing to just FP8 data types.

* Fix gfx1101 build

* Document test availability

* Re-enable fp8 gemms for gfx94/95

* Cherry-pick GEMM Universal tests for FP8 data types

* Cleanup

* CK_USE_GFX94 has already been set on this branch

* Address formatting issues and leftovers

* Make fail/pass logic consistent within 01_gemm folder

Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.

* Fix GPU verification reporting logic.

* Update year in copyright notice.

* Cleanup

* Use `enum class` instead of `enum`

* Remove set_property for FP8 tests

* Narrowing the scope of PR to OCP FP8 enablement only

* Add tests for OCP FP8 vector_type storage

* Enable gemm kernel on all gfx9 architectures (#227)

* clean-up

* Implement `non_native_vector_base` with `ext_vector_type` array. (#232)

* Enable support of 1, 2, 4, and 8-byte custom types in CK.

* Fix pool tests for OCP FP8 data type

* fix jenkins file

* restore cron trigger

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2024-12-03 08:42:55 -08:00
Adam Osewski
061ac0649c Polished Grouped GEMM APIs and new BF16 instances (#1600)
* Few small fixes.

* New GroupedGemm instances (BF16)

* Unify and refactor GroupedGEMM device API.

* Adapt changes to new API.

* Adapt grouped gemm profiler.

* Accept multiple kbatches for grouped gemm profiler.

- delete obsolete two stage as it is now covered by grouped gemm

* Update unit test for grouped gemm.

* Fix thresholds for BF16 and F8. Unblock tests.

* Fix few instances.

* Multiple small fixes.

* Adapt to new API, check dynamic casting.

* Uncomment few data types in grouped gemm profiler.

* Fix call to SetDeviceArgs.

* Fix profile grouped gemm multiply tile loop.

* Fix grouped gemm tile loop kernel args in client examples.

* Review comments.
2024-11-27 13:02:44 +01:00
jakpiase
e2d139201b Switch to universal gemm in grouped gemm tile loop (#1335)
* switch to universal gemm in grouped gemm tile loop

* minor fixes

* add reviewers comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-06-18 09:01:49 -05:00
Illia Silin
566b6480a2 Code clean-up (#1285)
* code clean-up

* remove the profiling output samples
2024-05-10 09:41:39 -07:00
Adam Osewski
a0ae1c6133 Fix MakeArgument (#1284) 2024-05-09 09:42:41 -07:00
Adam Osewski
b4032629e5 Grouped GEMM Multiple D tile loop. (#1247)
* Overload output stream operator for LoopScheduler and PiplineVersion

* Add Run overload accepting grid descriptors MK.

* Add __device__ keyword for CalculateGridSize

* Create device op GroupedGemmMultipleD

* Add GroupedGemm MultipleD Tile Loop implementation.

* Add an example for GroupedGemm MultipleD tile loop.

* Device Op GroupedGEMMTileLoop.

* Bunch of small changes in exmaple.

* CkProfiler

* Remove unused tparam.

* Fix include statement.

* Fix output stream overloads.

* Do not make descriptors and check validity untill we find group.

* Fix gemm desc initialization.

* Revert device op

* Fix compilation for DTYPES=FP16

* Validate tensor transfers paramters.

* Validate on host only NK dims if M is not known.

* Fix bug.

* A convenient debug func for selecting threads.

* Fix has main k block loop bug.

* Make sure that b2c has up to date tile offset.

* Output stream operator for Sequence type.

* Cmake file formatting.
2024-04-25 15:12:53 -05:00
jakpiase
c701071666 Add Grouped Gemm Multiple D SplitK TwoStage (#1212)
* Support A/B/C elementwise ops.

* First part of GGEMM multiD splitk two stage.

* WIP - changes for debuggin.

* tmp save

* working version

* added bf16@int8 version

* fixes

* add reviewers sugestions

* pre-commited missing files

* switched to ifs from elseifs

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-04-04 11:01:33 +02:00
zjing14
303d4594f4 improved zeroing (#1221) 2024-04-02 11:02:52 -05:00
zjing14
49df1dc595 Fixed GroupedGemmFixedNK with hipGraph (#1065)
* fixed examples; add async_mem_set

* add stream to all deviceOp using SetWorkspace

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-11-30 15:09:27 -06:00
zjing14
bf435140dc Clean DTYPES conditions in CMake (#974)
* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed cit

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
2023-10-18 11:14:14 -05:00
zjing14
5311d1b325 changed test for grouped_gemm to be random (#959)
Co-authored-by: Jing Zhang <jizha@amd.com>
2023-10-03 09:32:58 -05:00
Illia Silin
bba085d2b5 Refactoring cmake files to build data types separately. (#932)
* refactor cmake files for the tests

* refactor cmake files for examples

* fix cmake for gemm example

* fix the cmake file for all examples

* add splitting by data types in gemm_splitk instance header

* rename test to reflect only dl instances are used

* clean up CI workspace, update cmake for instances

* change the jenkinsfile syntax

* build all instances except DL on gfx11

* move workspace cleanup after stages

* clean up workspace after every stage

* isolate data types in grouped_conv_fwd header

* isolate dl instances for grouped_conv2d_fwd

* fix syntax

* fix cmake and batchnorm instances

* fix typo

* fix reduction instances

* fix grouped_conv headers

* fix syntax

* replace parsing logic for instances, replace bfp16 with bf16

* fix the client examples build

* clean up DTYPES from instances cmake files

* update the parsing logic in cmake files

* make an exception for reduction kernels

* update few remaining cmake files to handle DTYPES

* fix syntax

* fix cmake conflicts

* replace f8 with fp8 test name

* resolve conflicts for dpp instances
2023-09-20 22:15:56 -07:00
zjing14
f9d0eddb90 Add fp16/fp8 support into Grouped gemm FixedNK (#874)
* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add a instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* instance and client

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

* clean

* formatting

* clean

* clean

* fixed computeType

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-09-14 21:04:10 -05:00
zjing14
f5ec04f091 Grouped Gemm with Fixed K and N with SplitK (#818)
* move all arguments into device

* add b2c_tile_map

* add examples

* add SetDeviceKernelArgs

* dedicated fixed_nk solution

* init client api

* add grouped_gemm_bias example

* add a instance

* add instances

* formatting

* fixed cmake

* Update EnableCompilerWarnings.cmake

* Update cmake-ck-dev.sh

* clean; fixed comments

* fixed comment

* add instances for fp32 output

* add instances for fp32 output

* add fp32 out client example

* fixed CI

* init commit for kbatch

* add splitk gridwise

* format

* fixed

* clean deviceop

* clean code

* finish splitk

* fixed instances

* change m_loops to tile_loops

* add setkbatch

* clean code

* add splitK+bias

* add instances

* opt mk_nk instances

* clean examples

* fixed CI

* remove zero

* finished non-zero

* clean

* clean code

* optimized global_barrier

* fixed ci

* fixed CI

* removed AddBias

* format

* fixed CI

* fixed CI

* move 20_grouped_gemm to 21_grouped_gemm

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
2023-08-31 09:22:12 -05:00
Illia Silin
08eb176929 Allow building CK for specific data types and split off last remaining DL instances. (#830)
* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation
2023-08-07 14:56:10 -07:00
Illia Silin
b94fd0b227 update copyright headers (#726) 2023-05-31 18:46:57 -05:00
Illia Silin
4a51d2da9d Fix grouped_gemm_splitk kernels on MI300. (#694)
* replace amd_buffer_atomic_add with hip_atomic_add

* fix grouped_gemm_splitk kernels on mi300

* fix syntax

* revert experimental atomic_add changes

---------

Co-authored-by: Jing Zhang <jizhan@amd.com>
2023-05-03 08:25:25 -07:00
Adam Osewski
8bb2bb4a05 Grouped Gemm + SplitK + simplified Kernel Args (#669)
* simplify karg in device/grid split-k op

* fix mk_kn_mn instances

* add more instances

* B2C with 3D grid for KSplit

* Remove unused code.

* Use default B2C (3D grid) in grid gemm v2r4r2.

* Device gemm splitk use B2C map.

* Device GroupedGemmXdlSplitKCShuffle

* Example for GroupedGemm Xdl SplitK

* Introduce Device GroupedGemmSplitK

* Fix updating kbatch size.

* Add instance mk-nk-mn

* Enable set kbatch in profiler.

* Add GGemmSplitK mk-kn-mn instances

* Add more instances & split into multiple files.

* minor fix

* tuning

* clean

* disabled failed instances

* use pipe v2

* Ignore arg on not supported arch.

* fix warning

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
2023-04-24 15:43:36 -05:00
Adam Osewski
a2d5ca8e95 Device Op GroupedGemmMultipleD + example fp16 (#633)
* Pass shared mem pointer as pointer to void.

* Device Op GroupedGEMM Multiple D

* Example for grouped gemm multiple d.

* Add MI200 to supported archs.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-03-15 11:22:59 -05:00
Po Yen Chen
4a2a56c22f Rangify constructor of HostTensorDescriptor & Tensor<> (#445)
* Rangify STL algorithms

This commit adapts rangified std::copy(), std::fill() & std::transform()

* Rangify check_err()

By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have same value
type.

* Allow constructing Tensor<> like a HostTensorDescriptor

* Simplify Tensor<> object construction logics

* Remove more unnecessary 'HostTensorDescriptor' objects

* Re-format example code

* Re-write more HostTensorDescriptor ctor call
2022-11-11 11:36:01 -06:00
Adam Osewski
3048028897 Refactor device op implementations into impl subdirectory. (#420)
* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-10-13 09:05:08 -05:00
Adam Osewski
3ab20fd753 GEMM batched/splitK/cgemm/grouped int4 examples (#383)
* Grouped GEmm int4.

* Formatting + fix K dimension for int8.

* Batched Gemm int4 example.

* CGEMM int4 example.

* Include inc filese in clang-format.

* SplitK int4 example

* Refactoring of performance measurement.

* Fix #ifdef statements.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-08-25 17:19:15 -05:00
zjing14
6091458300 Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361)
* add examples into grouped/batched_gemm

* adding splitK examples

* fixed splitK

* add bfp16 int8 example into splitK

* formatting

* use static_cast

* added common for batched_gemm

* add commons for examples of splitK/batched/grouped_gemm

* return true

* adjust splitK check tol

* update example

Co-authored-by: Chao Liu <lc.roy86@gmail.com>
2022-08-23 14:41:56 -05:00
Chao Liu
500fa99512 Clean up conv example, Instances, profiler and test (#324)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating refernce conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshhold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* upate contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test
2022-07-29 18:19:25 -05:00
zjing14
7959dad566 Grouped Gemm device with multiD grid (#319)
* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* sperate examples

Co-authored-by: Anthony Chang <ac.chang@outlook.com>
2022-07-21 10:07:01 -05:00
Chao Liu
d3051d7517 add license in file (#303) 2022-06-24 23:32:43 -05:00
Chao Liu
d1db6a0c3e Absolute include path (#281)
* ad gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obselete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporary disable client test on Jenkins

* clean

* clean

* clean
2022-06-24 20:51:04 -05:00
zjing14
b6eaf3eb7e Pass gemm_descs for grouped gemm via __constant__ buff (#232)
* moved gemm_descs_args into const buff

* use CK_CONSTANT_ADDRESS_SPACE instead of global constant

* clean

* moved hipMemAlloc outside of deviceOp

* add SetWorkSpacePointer

* fix ignore
2022-05-31 17:00:43 -05:00
ltqin
3e6c2610ae Add FP64 XDL GEMM built-in function (#199)
* add intrin_mfma_f64_16x16x4f64

* add example

* gemm reference add double data type

* chang init data

* fix M N PerXdlops

* fix ifdef

* add comparsion config

* add conv fwd example

* format log out

* change rc matrix egister layout

* reorganize example

* reorganize example 2

* format,because merge develop

* fix call impl adding acc data type

* lost ;

* add compiler warning

* change example tunning parameters

* add test for fp64

* add instance

* add test/gemm/gemm_fp64.cpp

* fix get name issue

* remove some tunning parameter

* fix conflict

* format

* use integer value for GEMM test

* add acc data type

* remove typeid because fp16

* fix streamconfig etc bug from merging develop

* format

* remove test_gemm_xdl_fp64

* add AccDataType

* AccDataType problem

Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-05-26 14:48:57 -05:00
Anthony Chang
9f71ff48e2 Validate examples in CI (#233)
* validate examples in ctest runs

* format

* fix usage of check_err

* amend

* add example codes to custom target 'check'

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-05-13 16:54:44 -05:00
JD
cec69bc3bc Add host API (#220)
* Add host API

* manually rebase on develop

* clean

* manually rebase on develop

* exclude tests from all target

* address review comments

* update client app name

* fix missing lib name

* clang-format update

* refactor

* refactor

* refactor

* refactor

* refactor

* fix test issue

* refactor

* refactor

* refactor

* upate cmake and readme

Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-05-12 09:21:01 -05:00
myamlak
f03a1738d9 Resolution of issue #153: Add compiler warning on comparing int and size_t (#212)
* Turning compare warnings on

* Cleaning part I

* Cleaning part II

* Explicit static_cast to ck::type_convert

* Resolving large tensor size issue.

* format

* revert change to tensor descriptor; promote lementSpaceSize to 64bit

* use integer value for GEMM test

* Review remarks

* Review remarks + issues with (un)signed arithmetic

* Format fix

* Format

* Clang-format.

* fix 2gb limit issue

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-05-09 15:06:49 -05:00
Adam Osewski
abf4bdb9a9 Common forward convolution utility refactor. (#141)
* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensio.

Needed to support various conv dimensions

* Rename conv nd example director to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.

* 3D cases for convolution utility functions.

* 3D reference convolution.

* Add support for 3D convolution.

* Check for inputs bigger than  2GB.

* Formatting

* Support for bf16/f16/f32/i8 - conv instances + UT.

* Use check_err from test_util.hpp.

* Split convnd test into separate files for each dim.

* Fix data generation and use proper instances.

* Formatting

* Skip tensor initialization if not necessary.

* Fix CMakefiles.

* Remove redundant conv2d_fwd test.

* Lower problem size for conv3D UT.

* 3D case for convnd example.

* Remove leftovers after merge.

* Add Conv Specialization string to GetTypeString

* Skip instance causing numerical errors.

* Small fixes.

* Remove redundant includes.

* Fix namespace name error.

* Script for automatic testing and logging convolution fwd UTs

* Comment out numactl cmd.

* Refine weights initalization and relax rtol for fp16

* Move test_util.hpp to check_err.hpp

* Refine weights initalization and relax rtol for fp16

* Refactor common part of test conv utils.

* Move utility function to single common place.

* Add additional common functions to utility.

* Refactor convnd_fwd_xdl examples.

* Remove redundant files.
* Unify structure.

* Add constructor to ConvParams.

* And add input parameters validation.

* Modify conv examples to use single utility file.

* Remove check_error from host_tensor.hpp

* Get rid of check_indices function.

* Remove bf16_to_f32 function overload for scalars.

* Fix namespace.

* Add half_float::half for check_err.

* Fix conv params size in UT.

* Fix weights initialization for int8.

* Fix weights initialization for int8.

* Add type_convert when store output in ref conv 1D.

* Get back old conv2d_fwd_xdl operation.

* Silence conv debug print.

* format

* clean

* clean

* Fix merge.

* Fix namespace for check_err

* Formatting.

* Fix merge artifacts.

* Remove deleted header.

* Fix some includes and use ck::utils::check_err.

* Remove unused check_indices restored by previous merge.

* Fix namespaces after merge.

* Fix compilation error.

* Small fixes.

* Use common functions.
* Fix filename
* Fix namespaces.

* Fix merge artifact - retrieve removed by accident fun.

* Fix ConvForwardSpecialization.

* Adhere to coding style rules.

* Fix merge artifacts.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-04-05 15:16:59 -05:00
Chao Liu
cd167e492a Compile for gfx908 and gfx90a (#130)
* adding compilation for multiple targets

* fix build

* clean

* update Jekinsfile

* update readme

* update Jenkins

* use ck::half_t instead of ushort for bf16

* rename enum classes

* clean

* rename

* clean
2022-03-31 12:33:34 -05:00
zjing14
12f4cfce96 fixed alloc mem size (#145) 2022-03-23 22:19:38 -05:00
zjing14
716f1c7fb1 Grouped GEMM for fp16 (#126)
* init of grouped_gemm

* 2 gemm test

* perf test

* clean

* wrap desc into a struct

* test cast static_arr to pointer

* add ptr to GemmDesc

* add grouped gemm profiler

* fixed mem issue with unique_ptr

* clean

* clean

* finished ckprofiler

* Update README.md

* readme

* fixed readme

* add example

* improve code

* fixed comments: reserve, seperate ptr and gemm_shapes

* merge group and non-group

* fixed comments: replace push_back with emplace_back to avoid copy constructor

* fixed comments: unified blk2ctile; add test

* ci fix

* fixed ci

* fixed ci

* fixed ci
2022-03-22 18:18:18 -05:00