composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-13 09:45:56 +00:00

Author	SHA1	Message	Date
Illia Silin	a328df25a1	Fix the cmake logic when building with INSTANCES_ONLY=ON. (#1376 ) * fix the cmake logic when building for various targets * another minor fix	2024-07-08 21:21:16 -07:00
Harisankar Sadasivan	75e622f02f	Universal streamk with atomics (#1360 ) * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1	2024-07-05 21:40:30 -07:00
Illia Silin	941d1f7ce0	Merging the gfx12 code into public repo. (#1362 )	2024-06-27 00:33:34 -07:00
Andriy Roshchenko	05b10e0e5a	Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326 ) We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation. This commit handles bf8@bf8->fp8 data types combination. * Included an example. * Added instances. * Added a client example. --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-06-21 19:02:57 -06:00
jakpiase	e2d139201b	Switch to universal gemm in grouped gemm tile loop (#1335 ) * switch to universal gemm in grouped gemm tile loop * minor fixes * add reviewers comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-06-18 09:01:49 -05:00
Rostyslav Geyyer	acda4c5a3c	Add instances for grouped conv fwd 3d with ConvScale for fp8@bf8->fp8 (#1325 ) * Add fp8 bf8 conv example * Add instances * Add client example * Add random scale values * Format	2024-06-12 14:41:56 -05:00
Rostyslav Geyyer	ce66277a76	Add a convinvscale op, related instances and examples (#1307 ) * Update the element op * Add an example * Add instances * Add a client example * make sure new instances only build on gfx9 * Update element op and its handling * Format * Update instances to take element op as an argument * Update examples to use random scale values * Format * Update client example with random scales * Format --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-06-10 14:48:49 -05:00
Bartłomiej Kocot	ac58cc5d1d	Integrate universal gemm with conv forward (#1320 ) * Integrate universal gemm with conv fwd * Fix conv fwd wmma test * Fix instances * Remove direct load check	2024-06-05 13:01:29 -05:00
Rostyslav Geyyer	cb0645bedc	Add a scale op, related instances and examples (#1242 ) * Add a scale op * Update the element op * Add instances * Add an example * Add a client example * Add a flag check * Revert flag check addition * Fix flag check * Update d strides in example * Update d strides in client example * Apply suggestions from code review Update copyright header Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Move the example * Move the client example * Update element op * Update example with the new element op * Add scalar layout * Update example * Update kernel for scalar Ds * Revert kernel changes * Update element op * Update example to use scales' pointers * Format * Update instances * Update client example * Move element op to unary elements * Update element op to work with values instead of pointers * Update instances to take element op as an argument * Update examples to use random scale values --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-06-04 19:28:15 -05:00
Illia Silin	ec2bae27ff	Split the gemm_multi_abd instances. (#1306 ) * split the gemm_multi_abd instances * update the dates	2024-05-23 09:17:02 -07:00
Bartłomiej Kocot	fd72380aeb	Optimize grouped conv bwd weight for small M and N (#1303 ) * Optimize grouped conv bwd weight for small M and N * Fixes	2024-05-22 21:01:01 +02:00
Illia Silin	7b027d5643	Select appropriate GPU targets for instances, tests, and examples. (#1304 ) * set individual gpu targets for instances, examples, tests * fix path to hip compiler * fix path to hip compiler once more * aggregate device macros in ck_tile config header * fix the cmake logic for instances * fix clang format * add gfx900 and gfx906 to default set of targets	2024-05-22 11:45:27 -07:00
Bartłomiej Kocot	0b6b5d1785	Add two stage grouped conv bwd weight kernel (#1280 )	2024-05-08 09:53:24 +02:00
Rostyslav Geyyer	a2d0bdd5a9	Add an ignore (#1270 )	2024-04-30 20:45:22 -07:00
Rostyslav Geyyer	6ced3c12ff	Mark unneeded instances as "getting deprecated" (#1265 ) * Add a flag * Add flag check and messages --------- Co-authored-by: root <root@aus-g7-rogeyyer.amd.com>	2024-04-29 12:00:55 -07:00
Haocong WANG	764164b488	[GEMM] UniversalGemm update (#1262 ) * Add bf16 instances * Add bf16 gemm universal example * tempsave * Add guard to navi compilation * workground on a specific mixed gemm instance ( bring back it when compiler fix upload) * fix formatting condition statement issue * solve conflict --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com>	2024-04-26 12:56:07 -05:00
zjing14	5ae893c0d3	ggemm tile_loop multD bf16 int8 (#1258 ) * Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Added LDSType * profiling * adjust oobcheck * add missing file * Validate on host only NK dims if M is not known. * add * clean * refactor * clean * add examples * add fuse * add fusion and client example * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting. * clean --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2024-04-26 10:37:49 -05:00
zjing14	0d0150db20	bf16A_Int8B with fastgelu/bias (#1264 ) * changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Added LDSType * profiling * adjust oobcheck * add missing file * refactor * clean * add examples	2024-04-26 07:26:30 -05:00
Adam Osewski	b4032629e5	Grouped GEMM Multiple D tile loop. (#1247 ) * Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Validate on host only NK dims if M is not known. * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting.	2024-04-25 15:12:53 -05:00
Bartłomiej Kocot	ad1597c499	Refactor elementwise kernels (#1222 ) * Refactor elementwise kernels * Instances fixes * Fix cmake * Fix max pool bwd test * Update two stage gemm split k * Restore elementwise scale for hiptensor backward compatiblity * Fix Acc data type check in conv fwd multiple abd * Disable conv fp64 fwd example * Update grouped conv weight multi d	2024-04-19 13:31:17 +02:00
jakpiase	e0f3f918f1	Add bf16 and bf16@int8 mk_nk_mn instances for grouped gemm two stage (#1228 ) * added bf16 and bf16@int8 mk_nk_mn instances * fix preprocessor guards	2024-04-19 13:16:10 +02:00
Bartłomiej Kocot	fd923b6d86	Add grouped conv bwd weight multi d kernel (#1237 ) * Add grouped conv bwd weight multi d kernel * Reference fix * Fix cmake files * bwd weight scale only xdl * Fixes * Fix client conv fwd example	2024-04-18 23:35:04 +02:00
zjing14	12865fbf28	Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978 ) * added an example grouped_gemm_multi_abd * fixed ci * add setElementwiseOp * changed API * clean code: add multiA into example * fixed v7r2 copy * add transpose * clean * fixed vector_load check * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * add reduce * testing * add example_b16_i8 * refactor example * clean * add mpading * disable reduce for kbatch = 1 * seperate reduce device op * add reduce op * add guard for workspace_size * add instances * format * fixed * add client example * add a colmajor * add instances * Update cmake-ck-dev.sh * Update profile_gemm_splitk.cpp * Update gridwise_gemm_xdlops_v2r4r2.hpp * format * Update profile_gemm_splitk.cpp * fixed * fixed * adjust test * adjust precision loss * adjust test * fixed * add bf16_i8 scale bias * fixed scale * fixed scale elementwise_op * revert contraction deviceop changes * fixed * Add AddFastGelu * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example" This reverts commit `3b5d001efd`, reversing changes made to `943199a991`. * add Scales into elementwise * add gemm_multi_abd client example * add client examples * add rcr and crr * add grouped gemm client example * add grouped gemm client example * add instance for rcr crr * format * fixed * fixed cmake * fixed * fixed client_example * format * fixed contraction isSupport * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update device_reduce_threadwise.hpp * clean * Fixes * Fix example --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-04-15 21:09:45 -05:00
Haocong WANG	f83e9701e9	[GEMM] Gemm universal device operation (#1154 ) * Optimize GEMM on MI200/300: 1. Add new blockwise gemm pipeline 2. Add irregular splitk intances * clang format + typo fix * Fix a bug * initial commit * Add more instances to irregular splitk * blkgemm pipeline v1~4 prototype * Sanity Checked. Known issue: 1. Poor performance of splitk 2. Register spill on blkgemmpipeline v3 * Sanity and Performance fix: 1. fix a bug related to sanity in grouped b2c mapping 2. fix a bug related to sanity and performance in splitk offset * Sanity and API update: 1. Remove prefetch stage 2. Fix valid check bug 3, Add first gemm_universal instance into ckProfiler * Add NN instances for gemm universal * 1. Add NT instances for gemm_universal 2. Fix a bug about Kpadding in gemm_universal * Fix a bug regarding padding Odd K number * remove kernel print * Fix KPadding bug... * Update safety check * another try to fix kpadding.. * Sanity checked * new instances.. * clang format+typo fix * remove clang format script's change * Add non-hotloop compile option * 1. Add fp16xfp8 example 2. pull packed convert f8 from pr1150 * Some miscs.. opt and fix * Add pipeline description docs * Split universal gemm instance library to cut profiler compiling time * uncomment cmakefile * Fix a bug caused by blockwise_gemm_pipe_v2 * reduce default splitk to 1 * Add 224x256x64 tile size * update, including: 1. Experiment pipeline 5~7 2. Optimization for pipeline 4 3. Organized instance library * temp save * temp save * Permuted lds layout, sanity and function checked * clang format * Move OOB check from RunRead to RunWrite, for better software pipeline. TODO: agpr spill when NN layout * clangformat * A/B splitpipe scheduler for v3 * Fix two bugs * bug fix * fix a bug in oob check * Example for mixed fp16_fp8 gemm * Clean experimental code blocks * Add mixed precision gemm into profiler * tempsave * optimize m/n major lds layout * Add RRR GEMM mixed precision instances * Optimize f8 matrix transpose * Add test_gemm_universal * A/B spilt schedule for blkpip v5 * Take ds_read2 into iglp scheduling scheme * format * fixed cmake * Add llvm-option into CI cmake flag --------- Co-authored-by: Jing Zhang <jizhan@amd.com>	2024-04-13 21:03:18 -05:00
Rostyslav Geyyer	bbefc12a26	Add instances for conv_scale with bf8@fp8->fp8 (#1231 ) * Add instances * Add example * Add profiler mode * Add client example	2024-04-11 10:35:00 -05:00
Bartłomiej Kocot	ced5af16f7	Extend support for contraction 6D (#1207 ) * Extend support for contraction up to 5D * Extend contraction bilinear instances * Fix interface test * Add 6d support, remove 3d,4d,5d * Fixes * Fix readme * Make defualt dim for contraction instances	2024-04-09 23:46:21 +02:00
jakpiase	c701071666	Add Grouped Gemm Multiple D SplitK TwoStage (#1212 ) * Support A/B/C elementwise ops. * First part of GGEMM multiD splitk two stage. * WIP - changes for debuggin. * tmp save * working version * added bf16@int8 version * fixes * add reviewers sugestions * pre-commited missing files * switched to ifs from elseifs --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2024-04-04 11:01:33 +02:00
Rostyslav Geyyer	a61e73bc56	Add instances for conv_scale with fp8@bf8->fp8 (#1220 ) * Update device op api to support BComputeType * Add example * Add instances * Add profiler mode * Add client example * Update copyright year * Add BComputeType check * Fix compute types	2024-04-03 09:08:08 -05:00
Illia Silin	ae57e5938e	Split the instances by architecture. (#1223 ) * parse examples inside the add_example_executable function * fix the example 64 cmake file * add xdl flag to the gemm_bias_softmax_gemm_permute example * add filtering of tests based on architecture type * enable test_grouped_gemm for gfx9 only * enable test_transpose only for gfx9 * only linnk test_transpose if it gets built * split the gemm instances by architectures * split gemm_bilinear,grouped_conv_bwd_weight instances by targets * split instances by architecture * split grouped_conv instances by architecture * fix clang format * fix the if-else logic in group_conv headers * small fix for grouped convolution instances * fix the grouped conv bwd weight dl instances * fix client examples * only enable client examples 3 and 4 on gfx9 * set the gfx9 macro * make sure the architecture macros are set by cmake * use separate set of xdl/wmma flags for host code * sinmplify the main cmake file * add conv_fwd_bf8 instance declaration	2024-04-02 09:42:17 -07:00
Bartłomiej Kocot	9c052804a7	Add elementwise with dynamic vector dim (#1198 ) * Add elementwise with dynamic vector dim * Reduce number of instaces * Fixes * Fixes	2024-03-22 10:40:43 +01:00
Rostyslav Geyyer	fd0d093e78	Add instances for conv_scale with bf8 in / fp8 out (#1200 ) * Add bf8 conv fwd instances * Add example * Add profiler mode * Add client example * Fix copyright headers * Format	2024-03-21 13:57:34 -05:00
Rostyslav Geyyer	e626d5202a	Add instances for conv_scale with fp8 in/out (#1193 ) * Add fp8 conv instances and client example * Format * Add example * Update cmakelists * Add profiler mode * Format * Fix copyright headers	2024-03-15 09:50:03 -07:00
Bartłomiej Kocot	285251768e	Add conv fwd/bwd data scale instances, extend bilinear instances (#1178 ) * Add conv fwd/bwd data scale instances * Fix cmake client example file --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-03-13 23:09:08 +01:00
zjing14	1837040a9c	Navi3 rel (#1176 ) * wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size * navi3x_multipleD+example * temp save * workable * batchedgemm[OK], groupconv[debug] * groupconv: Sanity check[OK], Performance[Bad] * navi3x_groupconv_need_optimization * create necessary files * save progress * Add Inter-Row thread transfer * save progress * save debugging progress * sanity check pass * fix a host tensor bug and clean up flash-attn code * format * cancel unnecessary change * cancel unnecessary change * cancel unnecessary change * temp save, add asm backend flag to amd_wmma * Mat-A LDS Bypass sanity pass * temp save * gemm sanity fix * Porting new blockwise gemm to flash attention * Example branch provide to compiler team * tempsave * Fix a bug * batched gemm ported * conv A-skip lds ported * Skip B-Lds real gemm * Skip B Lds Gemm + MulD * batched gemm, conv, skip b lds * format * Attn, skip b lds * Change GridwiseOp nam * fix a typo caused bug * Skip A_Lds sanity pass, Skip B_Lds scratch occured * Bug found, intra-row permute off caused * bug found * a fix * disable buffer load due to incorrect 3rd dword * update fmha config, no scratch generated * update 3rd dword * fmha config update * FMHA, add support to gfx1101/gfx1102 * Merge origin dev (#2) * [Navi3x] Fix Gridwise_multiple_d operation (#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (#655) * Separate bibtex requirement from rocm-docs-core (#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <zhangjing14@gmail.com> * Add a denorm test fix (#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> * simplify karg in device/grid of split-k op (#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (#659) * add fp64 instances (#658) Co-authored-by: root <root@ctr-ubbsmc15.amd.com> * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) This reverts commit `bb5530af91`. * Groupnorm + swish external api (#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a marco to turn on/off denorm fix (off by default) (#673) * add a marco to turn off denorm fix by default * expose the marco --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com> * fixed quant example (#672) Co-authored-by: root <root@ctr-ubbsmc15.amd.com> * Add dependabot config and pin rocm-docs-core (#663) * [gtest] suppress unsafe buffer warn (#670) ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912 * Add memory index guard in wmma device ops (#667) * Add more macros to turn on/off denorm fix (#678) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> * Fix a typo (#676) * Add (#677) * Allow using ROCm release candidate compilers. (#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * add vector load check * solve conflicts --------- Co-authored-by: Sam Wu <sjwu@ualberta.ca> Co-authored-by: Sam Wu <sam.wu2@amd.com> Co-authored-by: rocking5566 <ChunYu.Lai@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: root <root@ctr-ubbsmc15.amd.com> Co-authored-by: Jun Liu <Liu.Jun@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> * Disable SkipLDS & Align AIT api (#3) * fix layernorm, reduction Ops (#4) * [Navi3x] Fix Gridwise_multiple_d operation (#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (#655) * Separate bibtex requirement from rocm-docs-core (#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <zhangjing14@gmail.com> * Add a denorm test fix (#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> * simplify karg in device/grid of split-k op (#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (#659) * add fp64 instances (#658) Co-authored-by: root <root@ctr-ubbsmc15.amd.com> * Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665) This reverts commit `bb5530af91`. * Groupnorm + swish external api (#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a marco to turn on/off denorm fix (off by default) (#673) * add a marco to turn off denorm fix by default * expose the marco --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com> * fixed quant example (#672) Co-authored-by: root <root@ctr-ubbsmc15.amd.com> * Add dependabot config and pin rocm-docs-core (#663) * [gtest] suppress unsafe buffer warn (#670) ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912 * Add memory index guard in wmma device ops (#667) * Add more macros to turn on/off denorm fix (#678) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> * Fix a typo (#676) * Add (#677) * Allow using ROCm release candidate compilers. (#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * Disable SkipLDS & Align AIT api * Update dependabot config (#682) Co-authored-by: samjwu <samjwu@users.noreply.github.com> * update attn api * solve type_convert bug + enable --------- Co-authored-by: Sam Wu <sjwu@ualberta.ca> Co-authored-by: Sam Wu <sam.wu2@amd.com> Co-authored-by: rocking5566 <ChunYu.Lai@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: root <root@ctr-ubbsmc15.amd.com> Co-authored-by: Jun Liu <Liu.Jun@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: samjwu <samjwu@users.noreply.github.com> Co-authored-by: haocwang <Haocong.WANG@amd.com> * fix typo * Fix attention with causal mask * multiple fix, try ait compile * Add A/B not use LDS pipeline * Clang format, Add gfx1101, gfx1102 support of FMHA example * cancel change of format script * 1. Enable 2-stage global Prefetch ( May cause VGPR spilling) 2. Enable FP16 accumulator blockwise_gemm * clang-format * 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement) 2. change kernel timing mode to 50 warmup + 50 timed repeat * Update low level abstration of blockwise gemm wmma * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds * (4/5) grouped conv pass * (5/5) attention pass, todo: debug lds perf bug * AIT Attention API refactor (#8) * sanity pass * sanity pass 2 * confirm significant performance regression. * turn on all instances * turn off instance format * Fix bug & tunning & format * DML meta, self_attn+cross_attn * sanity pass * remove useless flag * update tile and problem size used in AIT attention * bug fix in grouped conv supporting check * deprecate inline asm wmma * Bug fix: double lds skip * clang-format * Fix errors in 1. example, fmha 2. gridwise pipeline 3. deviceop, fmha, change some containers from vector to array * part2 of previous commit * clang format * API fix of gridwisegemmpipeline * separate array base and vector base attention tensor transformation * fix gemm * clang format * add gemm fp16 instances * Temp save * fpAintB kernel compile pass * Sanity pass. * Temp save * debug code enabled * Fp16AInt8B_GEMM sanity * MQA implementation * GQA-4 example * tempsave * Compile pass * New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm * format * Todo: fix gemm_bilinear_wmma instances compilation bug * Solve a bug when K1=16 * remove unnecessary changes * Remove tensor layout limitation to LDS usage in tesnor contraction * update self-attention and cross-attention * fix a typo of name * Add arch limiter for fp8 gemm * enable fp8 gemm_xdl for all gfx9 targets * temporarily disable gemm_xdl_fp16_fp8 on MI100/200 * fix the cmake logic for gemm_xdl_fp16_fp8 * re-enable the gemm_xdl_fp16_fp8 on MI100/200 --------- Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: Sam Wu <sjwu@ualberta.ca> Co-authored-by: Sam Wu <sam.wu2@amd.com> Co-authored-by: rocking5566 <ChunYu.Lai@amd.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: root <root@ctr-ubbsmc15.amd.com> Co-authored-by: Jun Liu <Liu.Jun@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: samjwu <samjwu@users.noreply.github.com> Co-authored-by: haocwang <Haocong.WANG@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-03-08 17:11:51 -08:00
jakpiase	32d4be3d09	Add support for mixed precision bf16&int8 grouped gemm (#1166 ) * add support for mixed precision bf16&int8 grouped gemm * fix gfx versions and add bf16 kbatch condition * added reviewers comments	2024-02-21 10:35:35 +01:00
Bartłomiej Kocot	66736edb95	Extend permute scale support up to 6D (#1168 ) * Extend permute scale support up to 6D * Fixes * Fixes * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> * Update profiler/README.md Co-authored-by: Lisa <lisajdelaney@gmail.com> --------- Co-authored-by: Lisa <lisajdelaney@gmail.com>	2024-02-20 09:56:54 -08:00
Bartłomiej Kocot	bf98b47697	Add bilinear conv fwd and bwd data instances (#1164 )	2024-02-13 11:49:05 +01:00
zjing14	602c4cc0d9	Optimizing fp8_fp16 mixedprec gemm (#1150 ) * add delayed cvt * extend fp16 gemm_splitk instances for fp8_fp16 gemm * add f8 example * add 128 kperblk instances for fp8 * add kpb128 instance * added more instances into kpb128 * clean code * clean code * fix * fix * fixed * Update example/35_splitK_gemm/splitK_gemm_xdl_fp16_fp8.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update library/src/tensor_operation_instance/gpu/gemm_splitk/device_gemm_xdl_splitk_f16_fp8_f16_mk_nk_mn_kpb128_instance.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-02-12 09:45:42 -08:00
zjing14	94fbaac002	add generic instances for DeviceGemm_Xdl_CShuffle (#1161 ) * add generic instances * clean code	2024-02-09 10:20:53 -06:00
Illia Silin	1b0fbaebbb	Split-up instances to improve build times. (#1159 ) * split up splitk-gemm instances * clean up some unused variables * split the mk_kn_mn interwave splitk-gemm instances * split up f16_f16_f16 mk_nk_mn splitk gemm instances * fix clang format * fix function names * fix typo * split up the 2 largest fp16fp8 splitk gemm instances get rid of unused variables * split up the largest splitk-gemm fp8fp16 instance file split up the instances for xdl fp8 gemms * split the headers for f16 and i8 for wmmma convolution instances	2024-02-07 12:47:12 -08:00
jakpiase	ba86eadce5	Add support for mixed-precision f16bf16_int8 gemm (#1127 )	2024-02-07 15:54:13 +01:00
Bartlomiej Wroblewski	6951858221	Implement direct loads split-K GEMM kernel (#1137 ) * WIP: Implement direct loads split-K GEMM kernel * Clean the review --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-02-07 01:08:34 +01:00
rocking	28f68a5a99	layernorm & groupnorm bwd gamma beta (#1133 ) * Add layernorm bwd gamma beta external api * Add groupnorm external api * Add layernorm bwd gamma beta profiler * Add groupnorm bwd gamma beta ckProfiler * Add layernorm & groupnorm bwd gamma beta test * Fix groupnorm bwd gamma beta profiler bug * Layernorm bwd weight client example * Groupnorm bwd weight client example * clang format * Remove useless header * Let inv_std be positive * Rename to num_bytes and move this calculation outside the loop	2024-01-25 19:53:15 +08:00
Haocong WANG	bb63b9732c	[GEMM] Optimization for MI200/300. (#1135 ) * Optimize GEMM on MI200/300: 1. Add new blockwise gemm pipeline 2. Add irregular splitk intances * clang format + typo fix * Fix a bug	2024-01-19 07:02:22 -06:00
Bartłomiej Kocot	a35e466c86	Revert "[SWDEV-435347] disable instances failed with mainlien compiler (#1077 )" (#1101 ) This reverts commit `ff24b537cb`.	2024-01-02 11:36:45 +01:00
Artur Wojcik	fb5bd51b42	enable compilation of INSTANCES_ONLY for Windows (#1082 ) * enable compilation of INSTANCES_ONLY for Windows * suppress ROCMChecks warnings on GoogleTests * suppress -Wfloat-equal warning on GoogleTests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2023-12-20 14:34:53 -08:00
arai713	12a8883c48	Hip tensor permute unit test (#1068 ) * adding files for F32 example * adding functioning implementation with scalar multiplication and unary operator support * added fp 16 type check in unary square * updating scalar multiplication as an operator * functioning version with scalar operator * changing strides for col major * updated column major implementation * working column major implementation * cleaned up comments, rearranged/renamed files * small edits to 3d transpose profiler * adding test/profiler/instance files for hipTensor permute unit test * added more test instances * cleaned up errors, randomized input tensor, added more instances * turned off time printouts * removed conflicting transpose profiler * rearranged some files	2023-12-18 21:35:00 -06:00
rocking	a69aa2a11a	layernorm and groupnorm backward data (#1083 ) * rename folder * Add type string * Remove typo * Add deviceOp to backward x * Add comment to describe the behavior of backward normalization * Add kernel function, prepare to implement * implement generic kernel * Check vector size * Add sweep once pipeline for small reduce size * Fix bug of KRaw_ error * Fix bug of dx stride * sanity check for mean and rstd * backward x for groupnorm * Add bwd x instance * add layernorm 2d bwd gamma beta instances * Change save mean var type from f32 to f16 in f16 mode * Change the example to f16 * Add groupnorm bwd gamma beta instance * Add groupnorm bwd x instance * Fix naming * Add layernorm bwd x ckprofiler * Add groupnorm bwd x profiler * clang format * Rename bwd x to bwd data * Fix bug of verification in profiler * Add test of layernorm and groupnorm bwd data * Add missing cmake * Add layernorm2d bwd data * rename fwd example * Add groupnorm client example * Fix typo. replace Invarient with Invariant * Add checking before running the best instance	2023-12-19 04:23:11 +08:00
Bartlomiej Wroblewski	ad0a8e4cd2	Optimize fp16 direct load GEMM instances (#1086 ) This PR optimizes fp16 instances of direct load GEMM kernel introduced in #999 and #1052. Measured the performance of new instances on CDNA2 GPU and compared it against the performance of the best non-direct-load GEMM instances. Used 76 different GEMM problems. On average, this change improves the performance of the tested problems by 47%. For cases known as latency-bound, the speedup is around 126%.	2023-12-18 11:09:10 +01:00
Illia Silin	c004e0d990	disabling some fp8 gemm instances to reduce build time (#1084 ) * disabling some fp8 gemm instances to reduce build time * disable fp8 gemm instances to reduce build time * remove the unused variable * build fp8 gemm default and padded instances separately * fix include pathsc	2023-12-11 17:49:27 -08:00

1 2 3 4 5 ...

256 Commits