Po Yen Chen
abc7e7ed30
Merge branch 'develop' into ck_tile/fa_train
2024-06-04 16:03:01 +08:00
danyao12
327074c3f8
fix error in WarpGemm
2024-06-04 11:42:33 +08:00
danyao12
bdd4a87199
format
2024-06-04 08:26:53 +08:00
zjing14
6fb1f4e03f
Post-merge fix of PR 1300 (#1313)
...
* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
* post-merge fix
* fix
* reduce init range
2024-05-31 22:46:41 -07:00
root
c70662a92e
format
2024-06-01 01:42:45 +00:00
Jing Zhang
09e9f10f97
format
2024-05-31 13:59:47 +00:00
root
60b328d597
Merge branch 'ck_tile/fa_train' of github.com:ROCm/composable_kernel into ck_tile/fa_train
2024-05-31 13:51:37 +00:00
Jing Zhang
0d7f71779b
format
2024-05-31 13:51:28 +00:00
Po Yen Chen
ff31c6a70c
Merge branch 'develop' into ck_tile/fa_train
2024-05-31 15:52:47 +08:00
danyao12
87f73f30e8
Transpose -> transpose
2024-05-29 16:54:26 +08:00
danyao12
58f61716b5
CK_TILE_HOST_DEVICE in philox
2024-05-29 16:20:34 +08:00
zjing14
80db62f08d
add f8 gemm multiD with both row/col wise scale (#1300)
...
* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
2024-05-28 12:04:22 -05:00
danyao12
1c511b3e7d
update bwd kernel launch
2024-05-28 23:14:18 +08:00
danyao12
ba6437868b
Merge branch 'develop' into ck_tile/fa_train
2024-05-28 11:42:38 +08:00
carlushuang
5055b3bdcb
[CK_TILE] support group from cmdline (#1295)
...
* support cmdline seqlen decode
* silent print
* update readme
* update kernel launch 3d
* update tile partitioner
* fix spill for bf16
* modify based on comment
* modify payload_t
* fix bug for alibi mode
* fix alibi test err
* refactor kernel launch, support select timer
* add missing file
* remove useless code
* add some comments
2024-05-28 11:13:21 +08:00
Bartłomiej Kocot
fd72380aeb
Optimize grouped conv bwd weight for small M and N (#1303)
...
* Optimize grouped conv bwd weight for small M and N
* Fixes
2024-05-22 21:01:01 +02:00
Illia Silin
06b891c5c2
aggregate device macros in ck_tile config header (#1297)
2024-05-20 08:34:45 -07:00
Illia Silin
1274861a9d
replace the ENV macro with CK_ENV (#1296)
2024-05-17 10:42:51 -07:00
rocking
aaa8dfdae9
Fix compile error (#1292)
...
error: no viable conversion from returned value of type '__half' to function return type 'fp16_hip_t' (aka '_Float16')
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-05-17 17:19:17 +08:00
Illia Silin
c44137838e
remove wrong use of nonexistent class members (#1290)
2024-05-15 08:08:17 -07:00
carlushuang
dd0dd13d4e
remove operator-deref (#1291)
2024-05-15 08:06:50 -07:00
danyao12
826a894335
support bwd alibi
2024-05-15 21:55:02 +08:00
jakpiase
3e3471d5d2
Add unit tests for grouped gemm two stage (#1256)
...
* add unit tests for grouped gemm two stage
* add reviewers' suggestions
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-05-15 10:03:39 +02:00
danyao12
a84009f83b
bwd alibi
2024-05-13 10:39:44 +08:00
carlushuang
35f59c04e6
Merge remote-tracking branch 'origin/develop' into ck_tile/fa_train
2024-05-12 23:03:10 +00:00
carlushuang
bd9cd53885
now fwd/bwd can build
2024-05-12 22:33:22 +00:00
carlushuang
90700dbefa
[CK_TILE] support alibi (#1269)
...
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-05-11 10:43:56 +00:00
Illia Silin
566b6480a2
Code clean-up (#1285)
...
* code clean-up
* remove the profiling output samples
2024-05-10 09:41:39 -07:00
Bartłomiej Kocot
8346af9c68
Change output gemm type to AccDataType in two stage conv bwd wei (#1283)
2024-05-10 10:57:42 +02:00
danyao12
15187df456
epilogue reuse
2024-05-10 10:57:53 +08:00
Adam Osewski
a0ae1c6133
Fix MakeArgument (#1284)
2024-05-09 09:42:41 -07:00
Adam Osewski
3c043cd10b
Add vector instruction coherency bits for gfx94 targets. (#1268)
2024-05-09 07:30:17 -07:00
danyao12
e1a21655ae
FA bwd
2024-05-09 17:08:08 +08:00
Illia Silin
fdbf8ccbd7
fix the output formatting (#1282)
2024-05-08 16:11:54 -07:00
Bartłomiej Kocot
0b6b5d1785
Add two stage grouped conv bwd weight kernel (#1280)
2024-05-08 09:53:24 +02:00
Illia Silin
bf42097646
Enable logging in CK with environment variable. (#1278)
...
* enable logging using environment variable
* update ck.hpp header
* fix typo
* fix clang format
* Update include/ck/utility/env.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-05-07 16:26:43 -07:00
carlushuang
851c3ed157
[CK_TILE] support alibi (#1269)
...
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-05-07 22:32:54 +08:00
Illia Silin
08d51d9bc4
add missing vector header (#1275)
2024-05-02 11:27:59 -07:00
Rostyslav Geyyer
6ced3c12ff
Mark unneeded instances as "getting deprecated" (#1265)
...
* Add a flag
* Add flag check and messages
---------
Co-authored-by: root <root@aus-g7-rogeyyer.amd.com>
2024-04-29 12:00:55 -07:00
danyao12
bbd2e1eae3
FA fwd dropout
2024-04-29 14:13:00 +08:00
Haocong WANG
764164b488
[GEMM] UniversalGemm update (#1262)
...
* Add bf16 instances
* Add bf16 gemm universal example
* tempsave
* Add guard to navi compilation
* workaround for a specific mixed gemm instance (bring it back when the compiler fix is uploaded)
* fix formatting condition statement issue
* solve conflict
---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
2024-04-26 12:56:07 -05:00
Rostyslav Geyyer
f044ff71fb
Add element op (#1259)
2024-04-26 12:55:45 -05:00
zjing14
0d0150db20
bf16A_Int8B with fastgelu/bias (#1264)
...
* changed the copy function to v7r2
* adding multi_abd
* in-progress
* add post-load oob check
* debugging
* adjust instances
* add run_lds
* add elementwise_op
* replace multi_abd_device with v3
* clean up
* clean
* clean
* Added LDSType
* profiling
* adjust oobcheck
* add missing file
* refactor
* clean
* add examples
2024-04-26 07:26:30 -05:00
Adam Osewski
b4032629e5
Grouped GEMM Multiple D tile loop. (#1247)
...
* Overload output stream operator for LoopScheduler and PipelineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in example.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity until we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfer parameters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
2024-04-25 15:12:53 -05:00
ltqin
f448d179b7
Universal gemm flush cache (#1251)
...
* add flush cache to device op
* add flush cache parameter to ckProfiler
* change the method for calculating the sizes of a and b
* change evaluation time method from AVERAGE to MEDIAN
* format code
* adjust some code
* fix core dumped
* remove loop call flush icache in kernel
* remove loop(outer) call flush icache
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
2024-04-25 15:07:14 -05:00
Bartłomiej Kocot
b1f8ae379b
Fix contraction IsSupported checks (#1257)
2024-04-23 22:59:39 +02:00
rocking
43879b89e4
Small refactor (#1246)
...
* Remove kIsFp8
* Extract alias
* Fix K, V and corresponding acc type
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-04-22 20:28:49 +08:00
Bartłomiej Kocot
ad1597c499
Refactor elementwise kernels (#1222)
...
* Refactor elementwise kernels
* Instances fixes
* Fix cmake
* Fix max pool bwd test
* Update two stage gemm split k
* Restore elementwise scale for hiptensor backward compatibility
* Fix Acc data type check in conv fwd multiple abd
* Disable conv fp64 fwd example
* Update grouped conv weight multi d
2024-04-19 13:31:17 +02:00
Bartłomiej Kocot
fd923b6d86
Add grouped conv bwd weight multi d kernel (#1237)
...
* Add grouped conv bwd weight multi d kernel
* Reference fix
* Fix cmake files
* bwd weight scale only xdl
* Fixes
* Fix client conv fwd example
2024-04-18 23:35:04 +02:00
zjing14
12865fbf28
Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978)
...
* added an example grouped_gemm_multi_abd
* fixed ci
* add setElementwiseOp
* changed API
* clean code: add multiA into example
* fixed v7r2 copy
* add transpose
* clean
* fixed vector_load check
* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* add reduce
* testing
* add example_b16_i8
* refactor example
* clean
* add mpadding
* disable reduce for kbatch = 1
* separate reduce device op
* add reduce op
* add guard for workspace_size
* add instances
* format
* fixed
* add client example
* add a colmajor
* add instances
* Update cmake-ck-dev.sh
* Update profile_gemm_splitk.cpp
* Update gridwise_gemm_xdlops_v2r4r2.hpp
* format
* Update profile_gemm_splitk.cpp
* fixed
* fixed
* adjust test
* adjust precision loss
* adjust test
* fixed
* add bf16_i8 scale bias
* fixed scale
* fixed scale elementwise_op
* revert contraction deviceop changes
* fixed
* Add AddFastGelu
* Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
This reverts commit 3b5d001efd, reversing
changes made to 943199a991.
* add Scales into elementwise
* add gemm_multi_abd client example
* add client examples
* add rcr and crr
* add grouped gemm client example
* add grouped gemm client example
* add instance for rcr crr
* format
* fixed
* fixed cmake
* fixed
* fixed client_example
* format
* fixed contraction isSupport
* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update device_reduce_threadwise.hpp
* clean
* Fixes
* Fix example
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-04-15 21:09:45 -05:00