Commit Graph

1329 Commits

Author SHA1 Message Date
Illia Silin
4c3107fdcb [ASAN builds] Modify the list of default targets for ASAN builds. (#1389)
* add a build parameter to build only XNACK targets

* use ENABLE_ASAN_PACKAGING flag to set targets for ASAN builds

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-07-16 09:19:23 -07:00
Andriy Roshchenko
802a8a1df1 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386)
* Add CMakePresets configurations.

* Add ConvScale+ReLU Functor and an Example

* Account for ReLU FLOPs.

* Add instances of 3D convolutions with ConvscaleRelu operation.

* Implement Client Example

* Cleanup
2024-07-16 08:51:49 -07:00
Haocong WANG
1ff4f25138 Disbale failed instance in rocm6.2 rel (#1388) 2024-07-16 08:46:48 -07:00
Illia Silin
eca39050c6 add Rosty and Bartek to code owners (#1392) 2024-07-16 23:44:46 +08:00
Bartłomiej Kocot
82e8a78a3f Support access per groups and filter3x3 in grouped conv fwd (#1382)
* Support access per groups and filter3x3 in grouped conv fwd

* Fixes for large cases

* Fixes for large tensors
2024-07-12 11:08:42 -07:00
zjing14
13c1e64daa add gemm_bias_add example (#1361)
* add gemm_bias_add example

* changed strideD

* clang-format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-07-11 18:08:07 -07:00
Rostyslav Geyyer
7a46a91c84 Add instances for grouped conv fwd 3d with ConvScale for bf8@fp8->fp8 (#1369)
* Add an example

* Add instances

* Add a client example
2024-07-11 13:31:39 -07:00
Illia Silin
98a01bbc72 Add CK_TILE tests to daily CI builds. (#1381)
* add ck_tile tests to CI

* build and run ck_tile tests on gfx90a and gfx942 in parallel

* fix groovy syntax

* turn ck_tile tests OFF by default

* skip creating the build folder

* build ck_tile examples with 64 threads

* build ck_tile examples with cmake-ck-dev.sh script

* add video group to docker on mi300

* do not retry to rebuild the early CI stages

* help prevent jenkins false failure

* restore cron trigger
2024-07-11 13:22:40 -07:00
Illia Silin
f914c228c6 [Jenkins] restore cron jobs (#1380)
* test the cron trigger

* fix the cron jobs

* restore the list of cron jobs
2024-07-11 10:28:11 -07:00
Illia Silin
a8eb872055 [gfx12] add gfx12 to the default target list (#1379) 2024-07-10 14:54:04 -07:00
Sam Wu
860f957c22 Update changelog release headers (#1378)
* Update doc codeowner syntax

* Add doc link to changelog

* Update changelog formatting for markdownlint

Also change headings for releases
2024-07-10 09:36:10 -06:00
dependabot[bot]
da42a88964 Bump rocm-docs-core from 1.4.1 to 1.5.0 in /docs/sphinx (#1374)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.1 to 1.5.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.1...v1.5.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
2024-07-09 12:48:23 -07:00
carlushuang
ccfdc53022 update owner (#1377)
* remove zjing14, add poyenc

* remove yigex
2024-07-09 20:30:07 +08:00
Illia Silin
a328df25a1 Fix the cmake logic when building with INSTANCES_ONLY=ON. (#1376)
* fix the cmake logic when building for various targets

* another minor fix
2024-07-08 21:21:16 -07:00
carlushuang
8182976c37 [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm

* macro for set tile

* ignore unused param if no kernel instances in host API

* fix more prec issue

* cache buffer resource

* fix

* support pre-nop

* clear tile by vector type members

* add workaround to reduce scratch memory

* conditionally enable workaround code

* enable workaround start from certain build version

* fallback set_tile() implementation from certain build version

* undo template argument changes

* put dummy asm in load_raw()

* fix comments, refactor s_nop inside buffer_load

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-07-08 11:09:55 -07:00
Andriy Roshchenko
eb44e0472a Add ckProfiler support for forward 3D convolutions with OUT element-wise operations. (#1354) 2024-07-08 10:55:54 -07:00
Harisankar Sadasivan
75e622f02f Universal streamk with atomics (#1360)
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). 

* Update README.md

* fixing clang-format issues

* removed conflicts in struct members between streamk and universal streamk

* corrected arg parsing for streamk and universal streamk

* added stream-k policies for 3 tile and 4 tile

* fixed argument type issue with parsing cmd args

* changes suggested in PR review are made- removing comments and correcting copyright

* file permissions updated

* added default value support for grid_size and streamk-policy selection set to -1

* print messages for arguments

* print messages for arguments

* print messages for arguments1
2024-07-05 21:40:30 -07:00
jakpiase
eaa870a1ab Add structural sparsity xdlops (#1363)
* Implemented smfmac xdlops

* add reviewer comments
2024-07-04 12:00:14 +02:00
Jun Liu
959073842c Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372) 2024-07-03 23:34:38 -07:00
Illia Silin
497ccb872b fix the optional ckProfiler grouped_gemm arguments (#1368) 2024-06-28 06:50:46 -07:00
dependabot[bot]
614ebd050a Bump rocm-docs-core from 1.4.0 to 1.4.1 in /docs/sphinx (#1367)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.0...v1.4.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-27 22:14:36 -07:00
Ruturaj Vaidya
2525864fda Update CMakeLists.txt (#1364)
It is a good practice to check if the file CMakeLists.txt is in fact in the directory.
2024-06-27 12:34:25 -07:00
Illia Silin
fafa567b3c Adding a private docker for ROCm6.2 release candidate. (#1365)
* add private docker for rocm6.2_rc1

* update dockerfile
2024-06-27 11:09:00 -07:00
alexxu-amd
3bb0fe6c7e remove PR trigger for now due to high cost (#1329) 2024-06-27 09:57:58 -04:00
jakpiase
ed21948bcd Add structural sparsity gemm instruction tests (#1309)
* first version of smfmac test

* add reviewer comments

* add reviewer suggestions
2024-06-27 11:30:32 +02:00
Illia Silin
941d1f7ce0 Merging the gfx12 code into public repo. (#1362) 2024-06-27 00:33:34 -07:00
Po Yen Chen
a32b1bc647 Replace hipDeviceSynchronize() by hipStreamSynchronize(stream) calls (#1359) 2024-06-26 22:04:52 +08:00
Po Yen Chen
0cb2e06ddc [CK_TILE] fmha forward split-kv + combine kernels (#1338)
* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

* Add num_splits option and dummy split-kv api method

* Generate fmha_fwd_splitkv()

* Add SplitKV kernel codegen logics

* Add SplitKV combine kernel codegen logics

* Fix mismatched return type

* Clean-up code

* Replace sentinel value before storing

* Fix wrong layout of LSE/LSEacc/Oacc

* Format codes

* Fix o_acc memory error

* Fix wrong kBlockSize used in policy

* Reduce # of combine kernels

* Fix split-kv combine kernel name

* Fix wrong LDS indexing logics

* Fix wrong loop counter step logic

* Undo vector size changes

* Remove no-longer used field

* Remove in-consistent comment

* Remove debug statements in example

* Remove more debug statements

* Add constness to local variables

* Clearn up generate.py

* Fix unstable clang-format comment

* Remove unused include directive

* Use shorter template parameter name

* Enable non-split-kv blobs

* Update license date

* Print num_splits conditionally

* Undo disabling data types

* Remove unnessary tile size for fp8

* Fix wrong pipeline args for fp8

* Fix example output format

* Remove more debug code in combine pipeline

* Add stride kernel arguments for LSE/O acc workspace

* Re-order split-kv pipeline call operator arguments

* Pass LSE/O strides in kernel argument

* Re-order pipeline call operator arguments

* Use tensor_descriptor to locate LSEacc elements

* Support providing invalid element for tensor view

* Set invalid element value for LSEacc tensor view

* Remove hand-written store_tile() code

* Remove necessary value-overwrite logic

* Add transposed lds descriptor

* Support load_tile() for tile_window_with_static_lengths<>

* Undo removing necessary value-overwrite logic

* Use read descriptor to locate lds elements

* Simplify pipeline source code

* Add constraint to kMaxSplits

* Default use kMaxSplits=64 in generate.py

* Revert "Add constraint to kMaxSplits"

This reverts commit 0a2132d758.

* Revert "Default use kMaxSplits=64 in generate.py"

This reverts commit c7d9c80b77.

* Decide alignment by the padding parameter

* Remove no-longer used utility functions

* Remove not-working code

* Add comment & remove no-longer used code

* Fix computation errors

* Add heuristic to override num_splits option

* Add constraint to kMaxSplits

* Fix compilation error

* Clean up pipeline code

* Wrap pointer access as lambda function

* Rename confusing methods

* Use kLogMasSplits as template parameter

* Finish splitkv combine kernel codegen

* Update kMaxSplits limit

* Use smaller kM0 for splitkv combine kernel

* Ignore droupout flag in splitkv pipeline

* Unify flag usage

* Add back flag kStoreLSE

* Merge lambda calls in pipeline

* Fix compilation errors

* Avoid all empty splits

* Always check for empty loop in splitkv pipelines

* Re-order parameters

* Remove redundant p_drop option check

* Add traits/problem for fwd splitkv kernel

* Conditionally enable uneven split boundary checks

* Add comment for the splitkv traits field

* Change even split criteria

* Re-order statements

* Refine occupancy value for hdim=128&256

* Refine occupancy value for hdim=32&64

* Remove redundant kernel argument

* Separate fmha bwd codegen logics

* Separate fmha fwd codegen logics

* Remove redundant direction parameter in fwd&bwd codegen logics

* Support generate multiple APIs for an example

* Let 'api' an alias of 'direction' option

* Remove choices for the 'direction' option

* Use dictionary to config all the functions

* Move fmha splitkv codegen logics to other file

* Add fwd_splitkv api for tile_example_fmha_fwd

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-26 17:41:15 +08:00
arai713
3e9711f0cb CK Instance Gen (#1145)
* Format

* Format

* Format

* Remove const

* Use the right template

* Format

* Format

* add row/col instances

* Add missing file

* fixed

* fixing block to etile error

* Format

* Updates

* Format

* fixed rrr layout

* generating a sample JSON file: currently contains includes, prologue/epilogue and instances

* version where the json is passed into the instances to generate a key

* updated run function to just launch kernel

* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer

* adding in testing files

* cleaned up comments, still need to work on including header files

* removed unneeded files

* removed/commented out JSON implementation

* added fusion(prologue/epilogue) into instance generation

* working on instance selection

* added instance selection, need to fix instance validation

* removed block2etile map validity check for testing purposes

* test running: failing due to incorrect files/input

* all grid descs/ptrs completed, but device file not found

* Update test and embed modules

* Restore older version

* added convolution operation, written test, debugging generated code for compilation

* attempting to include CK in host directory: _Float16 error

* CK header file issues

* slight fix

* don't crash when hip can't report total memory

* dump generated code to a file

* changing sizes

* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed

* some fixes to call the device code

* separating test files for conv and gemm

* completed arg ptr, now have linking errors

* clang format fix

* resolved linker issues in conv test

* remove dependency on libutility from ck

* resolved num dim error

* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration

* undo the commenting of device function

* hand created kernel code to find rtc issues

* dump the full src to file

* resolved redeclaration errors, cleaned up errors for Amber's kernel code

* debugging purposes: redeclaration error

* config files

* resolved errors for NumTensor and redeclaration, formatted version.h

* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type

* WIP: close to getting kernel compiled

* WIP: fixing rtc errors

* fixed sequence errors, formatting, still one error with run fcn

* yay: kernel compiles and runs

* updated templated/generated version to run and compile

* minor fixes

* working generated example, resolved memory access error due to padding

* adding in reference kernel, validation failing against reference

* debugging: printing kernel argsz

* reduced error in results

* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues

* working validation (using reference convolution) with prologue function for both hard-coded and generated version

* WIP: create an alt version that creates Argument on the device

* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments

* wip: making necessary methods device code

* added grid descs, working on grid pointers, errors with stl numerics

* wip: updating kernel args - issue, replacing some std functions

* replaced std::accumulate call with temp hardcoded version

* wip: args causing memory issue

* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes

* adding object file dump

* temporary hardcoding of grid size, can remove device op inst + arg ptr

* minor fix for grid size

* added modified example where arg ptr is created on the device for generated version as well

* removed device op instance and arg ptr from modified examples

* moving device op file for testing purposes and to properly build CK

* commenting out print-outs

* adjust compiler args to produce a valid ELF file

* temporary removal of validation

* reverting compiler args back for working example

* retrieve necessary arguments from generated template parameters in correct format

* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly

* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example

* temporary change to generate ELF format binary object file

* removed unecessary code, added comments

* formatting fix

* cleaned up code, added new tests, restructured library: move helper into CK

* refactored launch parameter calculation to be more concise

* renamed files and variables for more clarity/uniformity

* more code cleaning, removed debug statements

* moved majority of my files into codegen directory, running properly

* updated Embed.cmake(string_view) in codegen directory

* updated host directory to match Embed.cmake as well

* added old tests in

* updated instance generation methods to be more concise

* removed layout from launch parameter calculation

* working test

* fixed issue with verification, all instances working

* updated verification in other tests

* removed duplicate matrix padder file, removed code dumps

* removed old hard-coded tests

* removed old host directory, all files in codegen directory now

* fixed copyright in files

* commenting out validation

* renamed files

* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code

* updated headers

* removing duplicate file for fwd conv to gemm, merging with original file

* fix building codegen with clang++ directly

* resolving build error from conv_fwd_to_gemm

* fix for previous error

* renaming tests

* created common test file

* cleaned up code, added comments

* renamed device op

* fixed typos in comments

* removed extra space

* code cleanup: resolving Amber's comments

* removed wrapper struct for matrix padder, fixed template

* cleaned up if statements for better readability

---------

Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-06-25 16:37:35 -05:00
rocking
cb13839425 layernorm2d forward (#1339)
* Add layernorm2d forward

* Refind file path

* clang format

* Exclude ck_tile op from all

* use add_executable instead

* refactor layernorm2d_fwd example

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-06-24 08:45:52 +08:00
Andriy Roshchenko
05b10e0e5a Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326)
We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation.
This commit handles bf8@bf8->fp8 data types combination.

* Included an example.
* Added instances.
* Added a client example.

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-06-21 19:02:57 -06:00
carlushuang
fa129c1a5d WA for rocm-6.2+ s constrait for buffer resource (#1346)
* WA for rocm-6.2+ s constrait for buffer resource

* add missing memory clobber
2024-06-21 11:00:13 -05:00
Bartłomiej Kocot
510325a468 Fix cmake warnings (#1342)
* Cmake add -Wno-nvcc-compt

* Remove template without initialization list

* dpp remove template without init list

* Fixes
2024-06-21 09:47:58 +02:00
Dan Yao
1da802bdf2 Fix FA bwd alibi+causal NaN errors (#1352)
* fix bwd alibi nan error

* fix datatype

---------

Co-authored-by: danyao12 <danyao12>
2024-06-20 09:50:53 -05:00
ThruptiRajLakshmanaGowda
0162a5f6ba Adding Missed Activation Functions for Grouped 2D/3D Convolutions (#1348)
* Initial Push

* First Push

* Fixed Clang format

* Resolve merge conflict

* Addressed review comments

* Addressed review comments

* Addressed review comments
2024-06-20 09:24:54 -05:00
Qianfeng
e3f44659cf Fix in dropout lambda to avoid the compiling issue on some docker/compiler envs (#1350) 2024-06-20 11:36:42 +08:00
zjing14
8db331a511 Remove gfx900 and gfx906 from default target device to reduce package size (#1351) 2024-06-19 11:47:18 -07:00
Qianfeng
1973903f49 Hacking ck_tile fmha Dropout facility (#1344)
* Add NullBlockDropout to be used when kHasDropout is false

* Change to BlockDropout::Run() for forward to reduce conditional checkings

* Re-format files

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-06-19 10:37:22 +08:00
Bartłomiej Kocot
8faec23cb4 Add read_first_lane function for int64 (#1347) 2024-06-18 15:05:30 -05:00
jakpiase
e2d139201b Switch to universal gemm in grouped gemm tile loop (#1335)
* switch to universal gemm in grouped gemm tile loop

* minor fixes

* add reviewers comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-06-18 09:01:49 -05:00
Bartłomiej Kocot
933951ed48 Fix continous dim selection in contraction (#1336)
* Fix continous dim selection in contraction

* Fixes
2024-06-18 10:26:49 +02:00
carlushuang
17ed368f58 [CK_TILE][FA] using pk f16_f32 (#1343)
* [CK_TILE][FA] using pk f16_f32

* correct a error
2024-06-17 17:16:46 +08:00
zjing14
e02103168a disabled lds direct load inline asm (#1331) 2024-06-16 20:33:47 -05:00
Bartłomiej Kocot
dc1e9c5df9 Support large tensors in grouped conv fwd (#1332)
* Support large tensors in grouped conv fwd

* Multi ABD fixes

* Fix calculate element space size
2024-06-14 09:53:03 -05:00
Qianfeng
37a347e380 Fix to the using of static_for in amd_buffer_addressing.hpp (#1337)
* Add insert_dummy_dep_per_dword over-loading for length 64

* Fix insert_dummy_dep_per_dword and remove over-loading for length 64

* Remove blank lines

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-06-13 16:12:20 +08:00
Rostyslav Geyyer
acda4c5a3c Add instances for grouped conv fwd 3d with ConvScale for fp8@bf8->fp8 (#1325)
* Add fp8 bf8 conv example

* Add instances

* Add client example

* Add random scale values

* Format
2024-06-12 14:41:56 -05:00
Bartłomiej Kocot
5fc1bee4c5 Fix nhwgc f16 wmma instances (#1328) 2024-06-11 09:52:38 +02:00
Rostyslav Geyyer
ce66277a76 Add a convinvscale op, related instances and examples (#1307)
* Update the element op

* Add an example

* Add instances

* Add a client example

* make sure new instances only build on gfx9

* Update element op and its handling

* Format

* Update instances to take element op as an argument

* Update examples to use random scale values

* Format

* Update client example with random scales

* Format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-06-10 14:48:49 -05:00
dependabot[bot]
8f5690c4bb Bump rocm-docs-core from 1.3.0 to 1.4.0 in /docs/sphinx (#1327)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.3.0 to 1.4.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.3.0...v1.4.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-06 22:38:26 -07:00
Bartłomiej Kocot
ac58cc5d1d Integrate universal gemm with conv forward (#1320)
* Integrate universal gemm with conv fwd

* Fix conv fwd wmma test

* Fix instances

* Remove direct load check
2024-06-05 13:01:29 -05:00