Po Yen Chen
c6eac9746f
Fix type errors in composes<>
2024-04-09 13:18:17 +00:00
rocking
83b8a99018
Merge branch 'ck_tile/refactor' into ck_tile/elementwise
2024-04-09 19:45:43 +08:00
carlushuang
89a75a97fa
fix some bugs in group-mode masking and codegen. update README
2024-04-09 19:01:25 +00:00
Po Yen Chen
ecc64bce12
Generalize the composes<> template
2024-04-09 10:14:56 +00:00
Po Yen Chen
6ed739f913
Fix wrong value produced by saturating
2024-04-09 09:27:58 +00:00
Po Yen Chen
5d0ebdbfe4
Re-use already-existing scales<> functor template
2024-04-09 08:06:38 +00:00
Po Yen Chen
db0d7c6a99
Use conditional_t<> to simplify code
2024-04-09 06:52:54 +00:00
Po Yen Chen
a9adfbe54a
Small refinements in C++ source files
2024-04-09 06:45:03 +00:00
Po Yen Chen
7c95464799
Remove more not-in-use elementwise function kargs
2024-04-09 06:20:50 +00:00
rocking
b64d3f6eec
prevent warning in filter mode
2024-04-08 21:43:35 +00:00
rocking
525b89e538
1. codegen the f8 api and kernel
...
2. f8 host code
2024-04-08 21:36:23 +00:00
rocking
5860f3134a
Merge branch 'ck_tile/refactor' into ck_tile/elementwise
2024-04-09 02:37:42 +08:00
Po Yen Chen
e49498f616
Set fp8 rounding error for check_err()
2024-04-08 12:39:37 +00:00
rocking
5c3fdeb0b8
Remove f8 pipeline, we should share the same pipeline even in f8
2024-04-08 09:56:23 +00:00
rocking
f7d81364f3
To prevent a compiler issue, remove the elementwise functions we have not used.
2024-04-08 09:44:21 +00:00
carlushuang
42ebffe822
1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
2024-04-07 23:30:34 +00:00
carlushuang
8050921512
Merge branch 'develop' into ck_tile/refactor
2024-04-05 20:49:13 +08:00
jakpiase
c701071666
Add Grouped Gemm Multiple D SplitK TwoStage (#1212)
...
* Support A/B/C elementwise ops.
* First part of GGEMM multiD splitk two stage.
* WIP - changes for debugging.
* tmp save
* working version
* added bf16@int8 version
* fixes
* add reviewers' suggestions
* pre-committed missing files
* switched to ifs from elseifs
---------
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-04-04 11:01:33 +02:00
rocking
68153dea0b
Let generate.py generate different elementwise functions
2024-04-04 03:59:38 +00:00
Rostyslav Geyyer
a61e73bc56
Add instances for conv_scale with fp8@bf8->fp8 (#1220)
...
* Update device op api to support BComputeType
* Add example
* Add instances
* Add profiler mode
* Add client example
* Update copyright year
* Add BComputeType check
* Fix compute types
2024-04-03 09:08:08 -05:00
carlushuang
06f1cabd78
Merge branch 'develop' into ck_tile/refactor
2024-04-03 20:51:01 +08:00
Bartłomiej Kocot
9a194837af
Introduce combined elementwise ops (#1217)
...
* Introduce combined elementwise ops
* Introduce reference elementwise
2024-04-02 17:23:49 -05:00
Illia Silin
ae57e5938e
Split the instances by architecture. (#1223)
...
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only link test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* simplify the main cmake file
* add conv_fwd_bf8 instance declaration
2024-04-02 09:42:17 -07:00
zjing14
303d4594f4
improved zeroing (#1221)
2024-04-02 11:02:52 -05:00
rocking
cf57626c07
Merge branch 'ck_tile/refactor' into ck_tile/elementwise
2024-04-01 16:07:27 +08:00
carlushuang
855a264b72
remove ck_tile example from default cmake target like all/install/check
2024-03-30 23:58:48 +00:00
rocking
286c74468d
Add element function to fmha api
2024-03-29 18:05:36 -04:00
carlushuang
076da565dd
set the minimum Python version to 3.8
2024-03-29 17:07:23 +00:00
carlushuang
f236a13d1b
fix several issues
2024-03-28 22:00:11 +00:00
carlushuang
b0b8a5ad46
update README of ck_tile example
2024-03-26 18:57:29 +00:00
carlushuang
97902de98c
sync 22
2024-03-26 16:30:50 +00:00
carlushuang
f955af6ff7
sync upstream again
2024-03-26 16:25:32 +00:00
carlushuang
1c92c5d83d
sync with upstream
2024-03-26 16:05:54 +00:00
carlushuang
04ee01191a
fix merge from upstream
2024-03-26 14:09:54 +00:00
Bartłomiej Kocot
9c052804a7
Add elementwise with dynamic vector dim (#1198)
...
* Add elementwise with dynamic vector dim
* Reduce number of instances
* Fixes
* Fixes
2024-03-22 10:40:43 +01:00
Rostyslav Geyyer
fd0d093e78
Add instances for conv_scale with bf8 in / fp8 out (#1200)
...
* Add bf8 conv fwd instances
* Add example
* Add profiler mode
* Add client example
* Fix copyright headers
* Format
2024-03-21 13:57:34 -05:00
carlushuang
f55c7629bc
not using custom data type by default, now we can have ISA-level same code as opt_padding
2024-03-17 23:23:32 +00:00
Rostyslav Geyyer
e626d5202a
Add instances for conv_scale with fp8 in/out (#1193)
...
* Add fp8 conv instances and client example
* Format
* Add example
* Update cmakelists
* Add profiler mode
* Format
* Fix copyright headers
2024-03-15 09:50:03 -07:00
Po-Yen, Chen
0bd76de8a6
Update executable name in test scripts
2024-03-11 01:54:48 -04:00
zjing14
1837040a9c
Navi3 rel (#1176)
...
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* navi3x_multipleD+example
* temp save
* workable
* batchedgemm[OK], groupconv[debug]
* groupconv: Sanity check[OK], Performance[Bad]
* navi3x_groupconv_need_optimization
* create necessary files
* save progress
* Add Inter-Row thread transfer
* save progress
* save debugging progress
* sanity check pass
* fix a host tensor bug and clean up flash-attn code
* format
* cancel unnecessary change
* cancel unnecessary change
* cancel unnecessary change
* temp save, add asm backend flag to amd_wmma
* Mat-A LDS Bypass sanity pass
* temp save
* gemm sanity fix
* Porting new blockwise gemm to flash attention
* Example branch provide to compiler team
* tempsave
* Fix a bug
* batched gemm ported
* conv A-skip lds ported
* Skip B-Lds real gemm
* Skip B Lds Gemm + MulD
* batched gemm, conv, skip b lds
* format
* Attn, skip b lds
* Change GridwiseOp name
* fix a typo caused bug
* Skip A_Lds sanity pass, Skip B_Lds scratch occurred
* Bug found, intra-row permute off caused
* bug found
* a fix
* disable buffer load due to incorrect 3rd dword
* update fmha config, no scratch generated
* update 3rd dword
* fmha config update
* FMHA, add support to gfx1101/gfx1102
* Merge origin dev (#2)
* [Navi3x] Fix Gridwise_multiple_d operation (#649)
* Add CMake Option "USE_OPT_NAVI3X"
* fix bug
* standardize docs (#655)
* Separate bibtex requirement from rocm-docs-core (#656)
* separate bibtex requirement from rocm-docs-core
* point requirements to source rocm-docs-core repo
* Add CMake Option "USE_OPT_NAVI3X" (#647)
* Add CMake Option "USE_OPT_NAVI3X"
* remove navi3x opt compile option from cmake script
* Conv + quantization + tanh (#645)
* Rename file. Prepare to support another activation
* Add comment for quantization
* Extract out_elementop
* Add tanh example
* Add conv + bias + tanh quantization instance
* Add missing parameter
* Refine cmake
* Add external api and client example
* Extract variable in example
* Fix the comment
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add a denorm test fix (#603)
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* simplify karg in device/grid of split-k op (#644)
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
* fix 3rd dword of buffer source descriptor (#659)
* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
This reverts commit bb5530af91.
* Groupnorm + swish external api (#668)
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* add a macro to turn on/off denorm fix (off by default) (#673)
* add a macro to turn off denorm fix by default
* expose the macro
---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Add dependabot config and pin rocm-docs-core (#663)
* [gtest] suppress unsafe buffer warn (#670)
ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
* Add memory index guard in wmma device ops (#667)
* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Fix a typo (#676)
* Add (#677)
* Allow using ROCm release candidate compilers. (#679)
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* add vector load check
* solve conflicts
---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Disable SkipLDS & Align AIT api (#3)
* fix layernorm, reduction Ops (#4)
* [Navi3x] Fix Gridwise_multiple_d operation (#649)
* Add CMake Option "USE_OPT_NAVI3X"
* fix bug
* standardize docs (#655)
* Separate bibtex requirement from rocm-docs-core (#656)
* separate bibtex requirement from rocm-docs-core
* point requirements to source rocm-docs-core repo
* Add CMake Option "USE_OPT_NAVI3X" (#647)
* Add CMake Option "USE_OPT_NAVI3X"
* remove navi3x opt compile option from cmake script
* Conv + quantization + tanh (#645)
* Rename file. Prepare to support another activation
* Add comment for quantization
* Extract out_elementop
* Add tanh example
* Add conv + bias + tanh quantization instance
* Add missing parameter
* Refine cmake
* Add external api and client example
* Extract variable in example
* Fix the comment
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add a denorm test fix (#603)
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* simplify karg in device/grid of split-k op (#644)
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
* fix 3rd dword of buffer source descriptor (#659)
* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
This reverts commit bb5530af91.
* Groupnorm + swish external api (#668)
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* add a macro to turn on/off denorm fix (off by default) (#673)
* add a macro to turn off denorm fix by default
* expose the macro
---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Add dependabot config and pin rocm-docs-core (#663)
* [gtest] suppress unsafe buffer warn (#670)
ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
* Add memory index guard in wmma device ops (#667)
* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Fix a typo (#676)
* Add (#677)
* Allow using ROCm release candidate compilers. (#679)
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* Disable SkipLDS & Align AIT api
* Update dependabot config (#682)
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
* update attn api
* solve type_convert bug + enable
---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
* fix typo
* Fix attention with causal mask
* multiple fix, try ait compile
* Add A/B not use LDS pipeline
* Clang format, Add gfx1101, gfx1102 support of FMHA example
* cancel change of format script
* 1. Enable 2-stage global Prefetch (may cause VGPR spilling)
2. Enable FP16 accumulator blockwise_gemm
* clang-format
* 1. change blockwise gemm loopover direction from kmn to mnk (~1% improvement)
2. change kernel timing mode to 50 warmup + 50 timed repeat
* Update low level abstraction of blockwise gemm wmma
* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (4/5) grouped conv pass
* (5/5) attention pass, todo: debug lds perf bug
* AIT Attention API refactor (#8)
* sanity pass
* sanity pass 2
* confirm significant performance regression.
* turn on all instances
* turn off instance format
* Fix bug & tuning & format
* DML meta, self_attn+cross_attn
* sanity pass
* remove useless flag
* update tile and problem size used in AIT attention
* bug fix in grouped conv supporting check
* deprecate inline asm wmma
* Bug fix: double lds skip
* clang-format
* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array
* part2 of previous commit
* clang format
* API fix of gridwisegemmpipeline
* separate array base and vector base attention tensor transformation
* fix gemm
* clang format
* add gemm fp16 instances
* Temp save
* fpAintB kernel compile pass
* Sanity pass.
* Temp save
* debug code enabled
* Fp16AInt8B_GEMM sanity
* MQA implementation
* GQA-4 example
* tempsave
* Compile pass
* New implementation of fp16Aint8B Gemm, Achieve similar math throughput with native fp16 Gemm
* format
* Todo: fix gemm_bilinear_wmma instances compilation bug
* Solve a bug when K1=16
* remove unnecessary changes
* Remove tensor layout limitation to LDS usage in tensor contraction
* update self-attention and cross-attention
* fix a typo of name
* Add arch limiter for fp8 gemm
* enable fp8 gemm_xdl for all gfx9 targets
* temporarily disable gemm_xdl_fp16_fp8 on MI100/200
* fix the cmake logic for gemm_xdl_fp16_fp8
* re-enable the gemm_xdl_fp16_fp8 on MI100/200
---------
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-03-08 17:11:51 -08:00
Rostyslav Geyyer
363feb482d
Refactor tolerances for correctness check in gemm op (#1188)
...
* Refactor tolerances for correctness check
* Update tolerances
* Update host-side gemm
* Update reference gemm call
2024-03-08 12:05:05 -08:00
carlushuang
7df3947819
fix macro for exp2; fix warpgemm a/b in transposedC
2024-03-06 15:59:21 +00:00
carlushuang
0e7df1999f
wip fix
2024-03-06 14:31:36 +00:00
carlushuang
f549bb5d39
minor fix
2024-03-04 21:11:53 +00:00
carlushuang
a83c181bb2
naming
2024-03-04 20:49:02 +00:00
carlushuang
a67473fff8
now can build
2024-03-04 20:45:51 +00:00
carlushuang
112d521b09
fix xx
2024-03-03 23:48:31 +00:00
Rostyslav Geyyer
9ce18b045d
Fix example_gemm_xdl_fp8 (#1183)
2024-03-01 16:42:15 -08:00
carlushuang
fbd25cea35
fix build wip
2024-02-29 22:27:31 +00:00
carlushuang
f69356b1d7
add code
2024-02-28 22:57:19 +00:00