Bartłomiej Kocot
73b67f290f
Add support for NGCHW in grouped conv bwd wei ( #1491 )
* Add support for NGCHW in grouped conv bwd wei
* Comments fixes
* navi fixes
* Update function names
2024-09-03 10:52:03 +02:00
Illia Silin
19d22e60c1
Enable daily ninja build traces. ( #1487 )
* add ninja trace to CI builds
* fix ninja trace logic
* update the ninja trace logic in jenkins file
* limit the number of threads to run ninja build
* use ninja for installation after build
* update the path to ninjatracing tool
* use ninja to run check when using build trace
* fix jenkins logic
* fix typos
* set proper setup_args for all stages
* fix ninja syntax
* replace ninja check with ninja test
* enable ninja tracing with mainline and staging compilers
2024-08-26 13:27:01 -07:00
Illia Silin
0056e0bf4b
disable bad fp8 test on gfx12 ( #1481 )
2024-08-22 15:05:20 -07:00
Rostyslav Geyyer
e20f20efbf
Set RNE fp8 conversion as a default ( #1458 )
* Set RNE fp8 conversion as a default
* Update f8 tests
* Disable failing test on gfx11
* Update bf8 tests
* Add a flag
* Fix the flag
* Raise flag for gfx10 as well
* Temp commit for tolerance testing
* Update tolerances
2024-08-21 09:09:48 -07:00
Bartłomiej Kocot
2581727d2a
Add performance and large tensor tests for grouped conv ( #1456 )
* Add performance and large tensor tests for grouped conv
* Resize tests
* Resize tests
* update the python script to parse the grouped_conv results
* Remove int8 tests
* change bwd wei layout
---------
Co-authored-by: illsilin <Illia.Silin@amd.com >
2024-08-16 07:48:30 -07:00
Haocong WANG
3049b5467c
[GEMM] gemm_universal related optimization ( #1453 )
* replace buffer_atomic with global_atomic
* fixed global_atomic_add
* added bf16 atomic_add
* format
* clang-format-12
* clean
* clean
* add guards
* Update gtest.cmake
* enabled splitk_gemm_multi_d
* format
* add ckProfiler
* format
* fixed naming
* format
* clean
* clean
* add guards
* fix clang format
* format
* add kbatch printout
* clean
* Add rocm6.2 related gemm optimization
* Limit bf16 atomic usage
* remove redundant RCR gemm_universal instance
* Add RRR fp8 gemm universal instance
* Bug fix
* Add GPU_TARGET guard to FP8/BF8 target
* bug fix
* update cmake
* remove all fp8/bf8 example if arch not support
* Enable fp8 RRR support in ckProfiler
* limit greedy-reverse flag to gemm_universal in ckProfiler
---------
Co-authored-by: Jing Zhang <jizhan@fb.com >
Co-authored-by: Jing Zhang <jizhan@meta.com >
Co-authored-by: zjing14 <zhangjing14@gmail.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin <Illia.Silin@amd.com >
2024-08-14 10:42:30 +08:00
Illia Silin
cbb6f2ab8c
Disable inapplicable xdl and mha instances for gfx12 ( #1464 )
2024-08-12 15:11:58 -07:00
Mateusz Ozga
ab60b390f8
Rewrite *sh reduce unit tests to gtest: part 1 ( #1407 )
* Rewrite .sh test to Gtest
* review changes
* Remove unused comments
* Review v2
* Typo
* Separate UT: AMAX, MAX, MIN; added template params to trigger them
* Update test/reduce/reduce_no_index.cpp
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
2024-08-12 16:28:10 +02:00
Bartłomiej Kocot
4ec5c52a0c
Add Grouped Conv Fwd Large Tensor kernel ( #1432 )
* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove not needed instances
* examples fixes
* Remove not needed ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
2024-08-06 10:06:10 +02:00
Illia Silin
d311c95396
Add compiler flags for ROCm versions 6.2+ ( #1429 )
* add compiler flags to fix compiler issues
* fix typo.
* disable test_smfmac_op on all devices except gfx942
* specify full path to compiler in CI
2024-08-01 08:27:52 -07:00
Haocong WANG
8c90f25be3
[GEMM] F8 GEMM, performance optimized. ( #1384 )
* add ab_scale init support
* enabled interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
---------
Co-authored-by: root <jizhan@amd.com >
2024-07-19 22:06:52 +08:00
Bartłomiej Kocot
82e8a78a3f
Support access per groups and filter3x3 in grouped conv fwd ( #1382 )
* Support access per groups and filter3x3 in grouped conv fwd
* Fixes for large cases
* Fixes for large tensors
2024-07-12 11:08:42 -07:00
Illia Silin
a328df25a1
Fix the cmake logic when building with INSTANCES_ONLY=ON. ( #1376 )
* fix the cmake logic when building for various targets
* another minor fix
2024-07-08 21:21:16 -07:00
Jun Liu
959073842c
Fix issue with multiple targets and remove smfmac tests from unsupported test targets ( #1372 )
2024-07-03 23:34:38 -07:00
jakpiase
ed21948bcd
Add structural sparsity gemm instruction tests ( #1309 )
* first version of smfmac test
* add reviewer comments
* add reviewer suggestions
2024-06-27 11:30:32 +02:00
Illia Silin
941d1f7ce0
Merging the gfx12 code into public repo. ( #1362 )
2024-06-27 00:33:34 -07:00
Bartłomiej Kocot
510325a468
Fix cmake warnings ( #1342 )
* Cmake add -Wno-nvcc-compt
* Remove template without initialization list
* dpp remove template without init list
* Fixes
2024-06-21 09:47:58 +02:00
Bartłomiej Kocot
933951ed48
Fix continuous dim selection in contraction ( #1336 )
* Fix continuous dim selection in contraction
* Fixes
2024-06-18 10:26:49 +02:00
Bartłomiej Kocot
dc1e9c5df9
Support large tensors in grouped conv fwd ( #1332 )
* Support large tensors in grouped conv fwd
* Multi ABD fixes
* Fix calculate element space size
2024-06-14 09:53:03 -05:00
Bartłomiej Kocot
ac58cc5d1d
Integrate universal gemm with conv forward ( #1320 )
* Integrate universal gemm with conv fwd
* Fix conv fwd wmma test
* Fix instances
* Remove direct load check
2024-06-05 13:01:29 -05:00
carlushuang
5055b3bdcb
[CK_TILE] support group from cmdline ( #1295 )
* support cmdline seqlen decode
* silent print
* update readme
* update kernel launch 3d
* update tile partitioner
* fix spill for bf16
* modify based on comment
* modify payload_t
* fix bug for alibi mode
* fix alibi test err
* refactor kernel launch, support select timer
* add missing file
* remove useless code
* add some comments
2024-05-28 11:13:21 +08:00
Bartłomiej Kocot
fd72380aeb
Optimize grouped conv bwd weight for small M and N ( #1303 )
* Optimize grouped conv bwd weight for small M and N
* Fixes
2024-05-22 21:01:01 +02:00
Illia Silin
7b027d5643
Select appropriate GPU targets for instances, tests, and examples. ( #1304 )
* set individual gpu targets for instances, examples, tests
* fix path to hip compiler
* fix path to hip compiler once more
* aggregate device macros in ck_tile config header
* fix the cmake logic for instances
* fix clang format
* add gfx900 and gfx906 to default set of targets
2024-05-22 11:45:27 -07:00
jakpiase
3e3471d5d2
Add unit tests for grouped gemm two stage ( #1256 )
* add unit tests for grouped gemm two stage
* add reviewers suggestions
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
2024-05-15 10:03:39 +02:00
Illia Silin
566b6480a2
Code clean-up ( #1285 )
* code clean-up
* remove the profiling output samples
2024-05-10 09:41:39 -07:00
Bartłomiej Kocot
8346af9c68
Change output gemm type to AccDataType in two stage conv bwd wei ( #1283 )
2024-05-10 10:57:42 +02:00
carlushuang
851c3ed157
[CK_TILE] support alibi ( #1269 )
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
2024-05-07 22:32:54 +08:00
Haocong WANG
764164b488
[GEMM] UniversalGemm update ( #1262 )
* Add bf16 instances
* Add bf16 gemm universal example
* tempsave
* Add guard to navi compilation
* workaround for a specific mixed gemm instance (bring it back when the compiler fix is uploaded)
* fix formatting condition statement issue
* solve conflict
---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com >
2024-04-26 12:56:07 -05:00
Adam Osewski
b4032629e5
Grouped GEMM Multiple D tile loop. ( #1247 )
* Overload output stream operator for LoopScheduler and PiplineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in example.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity until we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfers parameters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
2024-04-25 15:12:53 -05:00
Bartłomiej Kocot
fd923b6d86
Add grouped conv bwd weight multi d kernel ( #1237 )
* Add grouped conv bwd weight multi d kernel
* Reference fix
* Fix cmake files
* bwd weight scale only xdl
* Fixes
* Fix client conv fwd example
2024-04-18 23:35:04 +02:00
Haocong WANG
f83e9701e9
[GEMM] Gemm universal device operation ( #1154 )
* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk instances
* clang format + typo fix
* Fix a bug
* initial commit
* Add more instances to irregular splitk
* blkgemm pipeline v1~4 prototype
* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3
* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset
* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3. Add first gemm_universal instance into ckProfiler
* Add NN instances for gemm universal
* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal
* Fix a bug regarding padding Odd K number
* remove kernel print
* Fix KPadding bug...
* Update safety check
* another try to fix kpadding..
* Sanity checked
* new instances..
* clang format+typo fix
* remove clang format script's change
* Add non-hotloop compile option
* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150
* Some miscs.. opt and fix
* Add pipeline description docs
* Split universal gemm instance library to cut profiler compiling time
* uncomment cmakefile
* Fix a bug caused by blockwise_gemm_pipe_v2
* reduce default splitk to 1
* Add 224x256x64 tile size
* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library
* temp save
* temp save
* Permuted lds layout, sanity and function checked
* clang format
* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout
* clangformat
* A/B splitpipe scheduler for v3
* Fix two bugs
* bug fix
* fix a bug in oob check
* Example for mixed fp16_fp8 gemm
* Clean experimental code blocks
* Add mixed precision gemm into profiler
* tempsave
* optimize m/n major lds layout
* Add RRR GEMM mixed precision instances
* Optimize f8 matrix transpose
* Add test_gemm_universal
* A/B split schedule for blkpip v5
* Take ds_read2 into iglp scheduling scheme
* format
* fixed cmake
* Add llvm-option into CI cmake flag
---------
Co-authored-by: Jing Zhang <jizhan@amd.com >
2024-04-13 21:03:18 -05:00
Bartłomiej Kocot
ced5af16f7
Extend support for contraction 6D ( #1207 )
* Extend support for contraction up to 5D
* Extend contraction bilinear instances
* Fix interface test
* Add 6d support, remove 3d,4d,5d
* Fixes
* Fix readme
* Make default dim for contraction instances
2024-04-09 23:46:21 +02:00
Illia Silin
7e5c81fed2
fix the latest errors with staging compiler ( #1229 )
2024-04-04 11:33:29 -07:00
Illia Silin
ae57e5938e
Split the instances by architecture. ( #1223 )
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only link test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* simplify the main cmake file
* add conv_fwd_bf8 instance declaration
2024-04-02 09:42:17 -07:00
Bartłomiej Kocot
9c052804a7
Add elementwise with dynamic vector dim ( #1198 )
* Add elementwise with dynamic vector dim
* Reduce number of instances
* Fixes
* Fixes
2024-03-22 10:40:43 +01:00
zjing14
1837040a9c
Navi3 rel ( #1176 )
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* navi3x_multipleD+example
* temp save
* workable
* batchedgemm[OK], groupconv[debug]
* groupconv: Sanity check[OK], Performance[Bad]
* navi3x_groupconv_need_optimization
* create necessary files
* save progress
* Add Inter-Row thread transfer
* save progress
* save debugging progress
* sanity check pass
* fix a host tensor bug and clean up flash-attn code
* format
* cancel unnecessary change
* cancel unnecessary change
* cancel unnecessary change
* temp save, add asm backend flag to amd_wmma
* Mat-A LDS Bypass sanity pass
* temp save
* gemm sanity fix
* Porting new blockwise gemm to flash attention
* Example branch provide to compiler team
* tempsave
* Fix a bug
* batched gemm ported
* conv A-skip lds ported
* Skip B-Lds real gemm
* Skip B Lds Gemm + MulD
* batched gemm, conv, skip b lds
* format
* Attn, skip b lds
* Change GridwiseOp name
* fix a typo caused bug
* Skip A_Lds sanity pass, Skip B_Lds scratch occurred
* Bug found, intra-row permute off caused
* bug found
* a fix
* disable buffer load due to incorrect 3rd dword
* update fmha config, no scratch generated
* update 3rd dword
* fmha config update
* FMHA, add support to gfx1101/gfx1102
* Merge origin dev (#2 )
* [Navi3x] Fix Gridwise_multiple_d operation (#649 )
* Add CMake Option "USE_OPT_NAVI3X"
* fix bug
* standardize docs (#655 )
* Separate bibtex requirement from rocm-docs-core (#656 )
* separate bibtex requirement from rocm-docs-core
* point requirements to source rocm-docs-core repo
* Add CMake Option "USE_OPT_NAVI3X" (#647 )
* Add CMake Option "USE_OPT_NAVI3X"
* remove navi3x opt compile option from cmake script
* Conv + quantization + tanh (#645 )
* Rename file. Prepare to support another activation
* Add comment for quantization
* Extract out_elementop
* Add tanh example
* Add conv + bias + tanh quantization instance
* Add missing parameter
* Refine cmake
* Add external api and client example
* Extract variable in example
* Fix the comment
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com >
* Add a denorm test fix (#603 )
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
Co-authored-by: zjing14 <zhangjing14@gmail.com >
* simplify karg in device/grid of split-k op (#644 )
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
* fix 3rd dword of buffer source descriptor (#659 )
* add fp64 instances (#658 )
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
* Issue #666 : Revert "simplify karg in device/grid of split-k op (#644 )" (#665 )
This reverts commit bb5530af91 .
* Groupnorm + swish external api (#668 )
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* add a macro to turn on/off denorm fix (off by default) (#673 )
* add a macro to turn off denorm fix by default
* expose the macro
---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
* fixed quant example (#672 )
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
* Add dependabot config and pin rocm-docs-core (#663 )
* [gtest] suppress unsafe buffer warn (#670 )
ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
* Add memory index guard in wmma device ops (#667 )
* Add more macros to turn on/off denorm fix (#678 )
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
* Fix a typo (#676 )
* Add (#677 )
* Allow using ROCm release candidate compilers. (#679 )
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* add vector load check
* solve conflicts
---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca >
Co-authored-by: Sam Wu <sam.wu2@amd.com >
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com >
Co-authored-by: zjing14 <zhangjing14@gmail.com >
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com >
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
Co-authored-by: carlushuang <carlus.huang@amd.com >
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
Co-authored-by: Jun Liu <Liu.Jun@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
* Disable SkipLDS & Align AIT api (#3 )
* fix layernorm, reduction Ops (#4 )
* [Navi3x] Fix Gridwise_multiple_d operation (#649 )
* Add CMake Option "USE_OPT_NAVI3X"
* fix bug
* standardize docs (#655 )
* Separate bibtex requirement from rocm-docs-core (#656 )
* separate bibtex requirement from rocm-docs-core
* point requirements to source rocm-docs-core repo
* Add CMake Option "USE_OPT_NAVI3X" (#647 )
* Add CMake Option "USE_OPT_NAVI3X"
* remove navi3x opt compile option from cmake script
* Conv + quantization + tanh (#645 )
* Rename file. Prepare to support another activation
* Add comment for quantization
* Extract out_elementop
* Add tanh example
* Add conv + bias + tanh quantization instance
* Add missing parameter
* Refine cmake
* Add external api and client example
* Extract variable in example
* Fix the comment
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com >
* Add a denorm test fix (#603 )
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
Co-authored-by: zjing14 <zhangjing14@gmail.com >
* simplify karg in device/grid of split-k op (#644 )
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
* fix 3rd dword of buffer source descriptor (#659 )
* add fp64 instances (#658 )
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
* Issue #666 : Revert "simplify karg in device/grid of split-k op (#644 )" (#665 )
This reverts commit bb5530af91 .
* Groupnorm + swish external api (#668 )
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* add a macro to turn on/off denorm fix (off by default) (#673 )
* add a macro to turn off denorm fix by default
* expose the macro
---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
* fixed quant example (#672 )
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
* Add dependabot config and pin rocm-docs-core (#663 )
* [gtest] suppress unsafe buffer warn (#670 )
ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
* Add memory index guard in wmma device ops (#667 )
* Add more macros to turn on/off denorm fix (#678 )
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
* Fix a typo (#676 )
* Add (#677 )
* Allow using ROCm release candidate compilers. (#679 )
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* Disable SkipLDS & Align AIT api
* Update dependabot config (#682 )
Co-authored-by: samjwu <samjwu@users.noreply.github.com >
* update attn api
* solve type_convert bug + enable
---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca >
Co-authored-by: Sam Wu <sam.wu2@amd.com >
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com >
Co-authored-by: zjing14 <zhangjing14@gmail.com >
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com >
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
Co-authored-by: carlushuang <carlus.huang@amd.com >
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
Co-authored-by: Jun Liu <Liu.Jun@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: samjwu <samjwu@users.noreply.github.com >
Co-authored-by: haocwang <Haocong.WANG@amd.com >
* fix typo
* Fix attention with causal mask
* multiple fix, try ait compile
* Add A/B not use LDS pipeline
* Clang format, Add gfx1101, gfx1102 support of FMHA example
* cancel change of format script
* 1. Enable 2-stage global Prefetch ( May cause VGPR spilling)
2. Enable FP16 accumulator blockwise_gemm
* clang-format
* 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement)
2. change kernel timing mode to 50 warmup + 50 timed repeat
* Update low level abstraction of blockwise gemm wmma
* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (4/5) grouped conv pass
* (5/5) attention pass, todo: debug lds perf bug
* AIT Attention API refactor (#8 )
* sanity pass
* sanity pass 2
* confirm significant performance regression.
* turn on all instances
* turn off instance format
* Fix bug & tuning & format
* DML meta, self_attn+cross_attn
* sanity pass
* remove useless flag
* update tile and problem size used in AIT attention
* bug fix in grouped conv supporting check
* deprecate inline asm wmma
* Bug fix: double lds skip
* clang-format
* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array
* part2 of previous commit
* clang format
* API fix of gridwisegemmpipeline
* separate array base and vector base attention tensor transformation
* fix gemm
* clang format
* add gemm fp16 instances
* Temp save
* fpAintB kernel compile pass
* Sanity pass.
* Temp save
* debug code enabled
* Fp16AInt8B_GEMM sanity
* MQA implementation
* GQA-4 example
* tempsave
* Compile pass
* New implementation of fp16Aint8B Gemm, achieves math throughput similar to native fp16 Gemm
* format
* Todo: fix gemm_bilinear_wmma instances compilation bug
* Solve a bug when K1=16
* remove unnecessary changes
* Remove tensor layout limitation to LDS usage in tensor contraction
* update self-attention and cross-attention
* fix a typo of name
* Add arch limiter for fp8 gemm
* enable fp8 gemm_xdl for all gfx9 targets
* temporarily disable gemm_xdl_fp16_fp8 on MI100/200
* fix the cmake logic for gemm_xdl_fp16_fp8
* re-enable the gemm_xdl_fp16_fp8 on MI100/200
---------
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: Sam Wu <sjwu@ualberta.ca >
Co-authored-by: Sam Wu <sam.wu2@amd.com >
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com >
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com >
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com >
Co-authored-by: carlushuang <carlus.huang@amd.com >
Co-authored-by: root <root@ctr-ubbsmc15.amd.com >
Co-authored-by: Jun Liu <Liu.Jun@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: samjwu <samjwu@users.noreply.github.com >
Co-authored-by: haocwang <Haocong.WANG@amd.com >
Co-authored-by: illsilin <Illia.Silin@amd.com >
2024-03-08 17:11:51 -08:00
Bartłomiej Kocot
66736edb95
Extend permute scale support up to 6D ( #1168 )
* Extend permute scale support up to 6D
* Fixes
* Fixes
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
* Update profiler/README.md
Co-authored-by: Lisa <lisajdelaney@gmail.com >
---------
Co-authored-by: Lisa <lisajdelaney@gmail.com >
2024-02-20 09:56:54 -08:00
Bartłomiej Kocot
1e73adbc28
Add optimized blockwise gemm using ck wrapper ( #1157 )
* Add optimized blockwise gemm using ck wrapper
* Add basic gemm example
* Update docs
* Add tutorial for gemm using ck wrapper
* Add perf note
* edits
* Fix cmake
* Fixes
---------
Co-authored-by: Lisa Delaney <lisa.delaney@amd.com >
2024-02-13 17:04:36 +01:00
jakpiase
ba86eadce5
Add support for mixed-precision f16bf16_int8 gemm ( #1127 )
2024-02-07 15:54:13 +01:00
Illia Silin
180f16f9ac
Add support for more Navi2x and Navi3x models. ( #1152 )
* add support for navi2x and navi3x models
* fix syntax
* use common macro for different mi300 architectures
2024-02-02 11:35:26 -08:00
Bartłomiej Kocot
171ca260b5
Extend gemm traits number for ck wrapper ( #1153 )
2024-02-02 11:25:54 -08:00
Bartłomiej Kocot
f3b6c23ac5
Add blockwise gemm to ck wrapper ( #1139 )
* Add blockwise gemm to ck wrapper
* Add blockwise gemm traits
* Disable test_gemm for non xdl devices
* Fixes
* Add c layout descriptions
2024-01-31 21:24:40 +01:00
rocking
28f68a5a99
layernorm & groupnorm bwd gamma beta ( #1133 )
* Add layernorm bwd gamma beta external api
* Add groupnorm external api
* Add layernorm bwd gamma beta profiler
* Add groupnorm bwd gamma beta ckProfiler
* Add layernorm & groupnorm bwd gamma beta test
* Fix groupnorm bwd gamma beta profiler bug
* Layernorm bwd weight client example
* Groupnorm bwd weight client example
* clang format
* Remove useless header
* Let inv_std be positive
* Rename to num_bytes and move this calculation outside the loop
2024-01-25 19:53:15 +08:00
Illia Silin
180e572076
Fixing most of the cppcheck errors. ( #1142 )
* fix cppcheck errors, first pass
* fix format
* fix returned value in examples
* add macro definitions for cppcheck
* fix the profile_gemm logic
* update the gemm profiler logic
* add more definitions to cppcheck, fix a couple more errors
* replace runtime error with message in device function
* fix a couple of int4 issues
* no return for fill function
* fix errors in data_types.hpp
* fix format
* fix few remaining errors
* fix errors in data_types.hpp
* fix last couple of errors in data_types.hpp
2024-01-24 13:47:48 -08:00
Bartłomiej Kocot
7e4eb4b800
Add optimized copy to ck wrapper ( #1126 )
* Add optimized copy to ck wrapper
* Example optimizations
* Fixes
* Move img2col test to client example
* Refactor example
* Fix docs
* Fixes
* Fix
* Fixes
* Fixes
* Fixes
* Fixes
* Fixes
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com >
2024-01-19 11:29:00 +01:00
Illia Silin
886d9eeb99
Add an option to change the number of warm-up cycles and iterations. ( #1124 )
* allow setting the number of warmup cycles and iterations for profiler
* fix the gemm_splitk and grouped_gemm examples
2024-01-09 09:43:08 -08:00
arai713
aa3e2d7967
Transpose profiler fix ( #1114 )
* added working example for 5D input using 1D kernel
* example with 5D input tensor and 2d kernel - not working: issues with arguments
* added updated version of 3d device op - changed descriptors/dims
* added example file to check kernel
* fixed descriptor and isSupportedArgument stride problem
* added and modified kernel for 3d - updated tids/loop
* adding some more 5d example files
* fixed some issues
* changes made for testing
* working version: fixed error in stride for A, still a bit inefficient
* cleaned up formatting/comments
* updating formatting
* more formatting fixes
* fixing cmake, adding back gpu targets in cmake script
* adding client example
* added instances for client example
* fixed errors in client example
* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
* removed extra files
* minor formatting and naming fixes
* adding test files and profiler
* fixing minor error
* minor fix
* removed unnecessary comments, renamed files
* updated instance list for client example, added different layout example
* removing instances
* fixed error in instance generation
* remove comments
* update profiler and client example tensor layouts
* fixed errors in test/profiler
* updated vector dim access to enable vector load
* updated test/profiler files
* updated example with 1d kernel
* updating profiler
* renamed files
* disabled device op for MI300
* skip elementwise_permute_2d on gfx94x
* Update CMakeLists.txt
* fixing CMake - disabling some GPU targets
* added transpose profiler to CMake
* fixed transpose profiler errors
* fixed instances for tests/profiler
* cleaned up code in transpose profiler source code
* added some comments, updated copyright
* made function arguments const where possible
---------
Co-authored-by: Jing Zhang <jizha@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
Co-authored-by: zjing14 <zhangjing14@gmail.com >
2024-01-04 10:33:19 -06:00
Bartłomiej Kocot
4234b3a691
Add tensor partition and generic copy for ck wrapper ( #1108 )
* Add tensor partition and generic copy for ck wrapper
* Update changelog
* Stylistic fixes
* Change shape/strides logic to descriptor transforms
* Fixes
* Fix client example
* Fix comments
2024-01-03 01:10:57 +01:00
Bartłomiej Kocot
20b1ae7ced
Fix results verify in test_tensor ( #1109 )
2023-12-23 22:12:49 +01:00
Artur Wojcik
fb5bd51b42
enable compilation of INSTANCES_ONLY for Windows ( #1082 )
* enable compilation of INSTANCES_ONLY for Windows
* suppress ROCMChecks warnings on GoogleTests
* suppress -Wfloat-equal warning on GoogleTests
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
2023-12-20 14:34:53 -08:00