composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-19 20:40:07 +00:00

Author	SHA1	Message	Date
Illia Silin	beb7927f52	Select appropriate GPU targets for instances, tests, and examples. (#1304 ) * set individual gpu targets for instances, examples, tests * fix path to hip compiler * fix path to hip compiler once more * aggregate device macros in ck_tile config header * fix the cmake logic for instances * fix clang format * add gfx900 and gfx906 to default set of targets [ROCm/composable_kernel commit: `7b027d5643`]	2024-05-22 11:45:27 -07:00
carlushuang	9be788a9aa	[CK_TILE] support alibi (#1269 ) * add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `851c3ed157`]	2024-05-07 22:32:54 +08:00
Haocong WANG	ceaecc86ca	[GEMM] Gemm universal device operation (#1154 ) * Optimize GEMM on MI200/300: 1. Add new blockwise gemm pipeline 2. Add irregular splitk intances * clang format + typo fix * Fix a bug * initial commit * Add more instances to irregular splitk * blkgemm pipeline v1~4 prototype * Sanity Checked. Known issue: 1. Poor performance of splitk 2. Register spill on blkgemmpipeline v3 * Sanity and Performance fix: 1. fix a bug related to sanity in grouped b2c mapping 2. fix a bug related to sanity and performance in splitk offset * Sanity and API update: 1. Remove prefetch stage 2. Fix valid check bug 3, Add first gemm_universal instance into ckProfiler * Add NN instances for gemm universal * 1. Add NT instances for gemm_universal 2. Fix a bug about Kpadding in gemm_universal * Fix a bug regarding padding Odd K number * remove kernel print * Fix KPadding bug... * Update safety check * another try to fix kpadding.. * Sanity checked * new instances.. * clang format+typo fix * remove clang format script's change * Add non-hotloop compile option * 1. Add fp16xfp8 example 2. pull packed convert f8 from pr1150 * Some miscs.. opt and fix * Add pipeline description docs * Split universal gemm instance library to cut profiler compiling time * uncomment cmakefile * Fix a bug caused by blockwise_gemm_pipe_v2 * reduce default splitk to 1 * Add 224x256x64 tile size * update, including: 1. Experiment pipeline 5~7 2. Optimization for pipeline 4 3. Organized instance library * temp save * temp save * Permuted lds layout, sanity and function checked * clang format * Move OOB check from RunRead to RunWrite, for better software pipeline. TODO: agpr spill when NN layout * clangformat * A/B splitpipe scheduler for v3 * Fix two bugs * bug fix * fix a bug in oob check * Example for mixed fp16_fp8 gemm * Clean experimental code blocks * Add mixed precision gemm into profiler * tempsave * optimize m/n major lds layout * Add RRR GEMM mixed precision instances * Optimize f8 matrix transpose * Add test_gemm_universal * A/B spilt schedule for blkpip v5 * Take ds_read2 into iglp scheduling scheme * format * fixed cmake * Add llvm-option into CI cmake flag --------- Co-authored-by: Jing Zhang <jizhan@amd.com> [ROCm/composable_kernel commit: `f83e9701e9`]	2024-04-13 21:03:18 -05:00
Illia Silin	2977f7d7de	fix the latest errors with staging compiler (#1229 ) [ROCm/composable_kernel commit: `7e5c81fed2`]	2024-04-04 11:33:29 -07:00
Illia Silin	1f4d13b2b5	Split the instances by architecture. (#1223 ) * parse examples inside the add_example_executable function * fix the example 64 cmake file * add xdl flag to the gemm_bias_softmax_gemm_permute example * add filtering of tests based on architecture type * enable test_grouped_gemm for gfx9 only * enable test_transpose only for gfx9 * only linnk test_transpose if it gets built * split the gemm instances by architectures * split gemm_bilinear,grouped_conv_bwd_weight instances by targets * split instances by architecture * split grouped_conv instances by architecture * fix clang format * fix the if-else logic in group_conv headers * small fix for grouped convolution instances * fix the grouped conv bwd weight dl instances * fix client examples * only enable client examples 3 and 4 on gfx9 * set the gfx9 macro * make sure the architecture macros are set by cmake * use separate set of xdl/wmma flags for host code * sinmplify the main cmake file * add conv_fwd_bf8 instance declaration [ROCm/composable_kernel commit: `ae57e5938e`]	2024-04-02 09:42:17 -07:00
jakpiase	b41cf51951	Add support for mixed-precision f16bf16_int8 gemm (#1127 ) [ROCm/composable_kernel commit: `ba86eadce5`]	2024-02-07 15:54:13 +01:00
rocking	133d8e762f	layernorm & groupnorm bwd gamma beta (#1133 ) * Add layernorm bwd gamma beta external api * Add groupnorm external api * Add layernorm bwd gamma beta profiler * Add groupnorm bwd gamma beta ckProfiler * Add layernorm & groupnorm bwd gamma beta test * Fix groupnorm bwd gamma beta profiler bug * Layernorm bwd weight client example * Groupnorm bwd weight client example * clang format * Remove useless header * Let inv_std be positive * Rename to num_bytes and move this calculation outside the loop [ROCm/composable_kernel commit: `28f68a5a99`]	2024-01-25 19:53:15 +08:00
Artur Wojcik	e9ec2910a0	enable compilation of INSTANCES_ONLY for Windows (#1082 ) * enable compilation of INSTANCES_ONLY for Windows * suppress ROCMChecks warnings on GoogleTests * suppress -Wfloat-equal warning on GoogleTests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `fb5bd51b42`]	2023-12-20 14:34:53 -08:00
arai713	b5aae16948	Hip tensor permute unit test (#1068 ) * adding files for F32 example * adding functioning implementation with scalar multiplication and unary operator support * added fp 16 type check in unary square * updating scalar multiplication as an operator * functioning version with scalar operator * changing strides for col major * updated column major implementation * working column major implementation * cleaned up comments, rearranged/renamed files * small edits to 3d transpose profiler * adding test/profiler/instance files for hipTensor permute unit test * added more test instances * cleaned up errors, randomized input tensor, added more instances * turned off time printouts * removed conflicting transpose profiler * rearranged some files [ROCm/composable_kernel commit: `12a8883c48`]	2023-12-18 21:35:00 -06:00
rocking	53eab49062	layernorm and groupnorm backward data (#1083 ) * rename folder * Add type string * Remove typo * Add deviceOp to backward x * Add comment to describe the behavior of backward normalization * Add kernel function, prepare to implement * implement generic kernel * Check vector size * Add sweep once pipeline for small reduce size * Fix bug of KRaw_ error * Fix bug of dx stride * sanity check for mean and rstd * backward x for groupnorm * Add bwd x instance * add layernorm 2d bwd gamma beta instances * Change save mean var type from f32 to f16 in f16 mode * Change the example to f16 * Add groupnorm bwd gamma beta instance * Add groupnorm bwd x instance * Fix naming * Add layernorm bwd x ckprofiler * Add groupnorm bwd x profiler * clang format * Rename bwd x to bwd data * Fix bug of verification in profiler * Add test of layernorm and groupnorm bwd data * Add missing cmake * Add layernorm2d bwd data * rename fwd example * Add groupnorm client example * Fix typo. replace Invarient with Invariant * Add checking before running the best instance [ROCm/composable_kernel commit: `a69aa2a11a`]	2023-12-19 04:23:11 +08:00
Bartłomiej Kocot	f1a25ca168	Introduce wrapper library (#1071 ) * Introduce wrapper library * Update cmake files * Revert "Update cmake files" This reverts commit `c27f88b565`. * Fix comments [ROCm/composable_kernel commit: `836b7e557d`]	2023-12-06 11:58:59 +01:00
arai713	38b596215d	Transpose 3d (#984 ) * added working example for 5D input using 1D kernel * example with 5D input tensor and 2d kernel - not working: issues with arguments * added updated version of 3d device op - changed descriptors/dims * added example file to check kernel * fixed descriptor and isSupportedArgument stride problem * added and modified kernel for 3d - updated tids/loop * adding some more 5d example files * fixed some issues * changes made for testing * working version: fixed error in stride for A, still a bit inefficient * cleaned up formatting/comments * updating formatting * more formatting fixes * fixing cmake, adding back gpu targets in cmake script * adding client example * added instances for client example * fixed errors in client example * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp * removed extra files * minor formatting and naming fixes * adding test files and profiler * fixing minor error * minor fix * removed unneccesary comments, renamed files * updated instance list for client example, added different layout example * removing instances * fixed error in instance generation * remove comments * update profiler and client example tensor layouts * fixed errors in test/profiler * updated vector dim access to enable vector load * updated test/profiler files * updated example with 1d kernel * updating profiler * renamed files --------- Co-authored-by: Jing Zhang <jizha@amd.com> [ROCm/composable_kernel commit: `3af8c81a72`]	2023-11-08 19:45:07 -06:00
rocking	b64f30e733	Layernorm4d (#1022 ) * Rename folder * Add layernorm 4d fwd example * Rename original layernorm example * Add layernorm 4d f16 test * Add layernorm4d_fwd client example * Support layernorm4D in ckProfiler * Rename groupnorm to groupnorm fwd in example * Rename layernorm and group fwd in test * Rename normalization to normalization_fwd (instances) * Add fwd to DeviceNormalization * Rename external api header * Rename folder, because we can also add bwd in this folder * Add fwd in layernorm and groupnorm (profiler * Fix compile error --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `a3d9a2cd42`]	2023-11-09 08:34:51 +08:00
Bartłomiej Kocot	3a3b3a822c	Fix cmake dtype check (#989 ) * Fix instances dtype check * Fix source dtypes seletor for examples and tests * Sync with new cmakefile changes * Remove not needed ifdefs * Remove not needed ifdefs [ROCm/composable_kernel commit: `ac0e006766`]	2023-10-21 22:19:43 +02:00
zjing14	c1372eb34f	Clean DTYPES conditions in CMake (#974 ) * Add a condition to build fp8 instances * simplified buffer_load/store * add bfp8/fp8 * fixed * remove all f8/bf8 condition include folder * fixed cmake conditions * fixed DTYPES=fp16/bfp16 * fix * fixed buffer_load * fixed buffer_store * fix * clean example cmake files * fixed ci * fixed cit --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com> Co-authored-by: Jing Zhang <jizha@amd.com> [ROCm/composable_kernel commit: `bf435140dc`]	2023-10-18 11:14:14 -05:00
Lauren Wrubleski	144a3ef093	Replace CMake `return` from later CMake (#970 ) [ROCm/composable_kernel commit: `5913609168`]	2023-10-05 14:58:58 -07:00
Bartłomiej Kocot	9bc92adde3	Add column to image kernel (#930 ) * Add column to image kernel * Minor fixes for dtypes and client examples * Disable tests for disabled dtypes * Disable add instances functions for disabled data types * Minor stylistic fixes * Revert "Disable add instances functions for disabled data types" This reverts commit `728b869563`. * Instances reduction * Add comments in device_column_to_image_impl * Update changelog and Copyrights * Improve changelog [ROCm/composable_kernel commit: `e2243a4d1e`]	2023-09-27 17:19:06 +02:00
Illia Silin	43098d2a23	Refactoring cmake files to build data types separately. (#932 ) * refactor cmake files for the tests * refactor cmake files for examples * fix cmake for gemm example * fix the cmake file for all examples * add splitting by data types in gemm_splitk instance header * rename test to reflect only dl instances are used * clean up CI workspace, update cmake for instances * change the jenkinsfile syntax * build all instances except DL on gfx11 * move workspace cleanup after stages * clean up workspace after every stage * isolate data types in grouped_conv_fwd header * isolate dl instances for grouped_conv2d_fwd * fix syntax * fix cmake and batchnorm instances * fix typo * fix reduction instances * fix grouped_conv headers * fix syntax * replace parsing logic for instances, replace bfp16 with bf16 * fix the client examples build * clean up DTYPES from instances cmake files * update the parsing logic in cmake files * make an exception for reduction kernels * update few remaining cmake files to handle DTYPES * fix syntax * fix cmake conflicts * replace f8 with fp8 test name * resolve conflicts for dpp instances [ROCm/composable_kernel commit: `bba085d2b5`]	2023-09-20 22:15:56 -07:00
Bartłomiej Kocot	2ec7a9084a	Add image to column kernel (#867 ) * Add image to column kernel * Add instances, tests, profiler, example * Add client example * Several fixes of image to column * Fix variable name in device_image_to_column_impl * Several fixes of image to column profiler * Fix num_btype calculation * Make new mesaurements for correct bytes calculation [ROCm/composable_kernel commit: `0077eeb3be`]	2023-09-05 10:11:40 -05:00
rocking	e422d088a3	MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861 ) * Add maxpool instances * Rename index pool to max pool. * Add maxpool bwd bf16 instances * Add avg pool bwd instances * Rename avgpool and maxpool to avg_pool3d and max_pool * Add bf16 pool fwd instances * Add max pool bwd to ckProfiler * Add avg pool3d bwd to ckProfiler * Add avg pool bwd test * Fix bug of reference pool fwd (dilation) * Fix bug of max pool bwd (dilation and initZero) * Support bf16 compute data type * Force compute type be f32. Because atomicAdd only support f32 * Add max pool bwd test * Rename folder * Rename pool * Add max pool bwd client example * Add avg pool bwd client example * Add missing workspace * clang format * Rename macro * remove useless header * remove useless layout [ROCm/composable_kernel commit: `866377de18`]	2023-08-31 21:01:50 +08:00
Illia Silin	4bea06a519	Allow building CK for specific data types and split off last remaining DL instances. (#830 ) * properly split conv_nd_bwd_data instances * split conv2d_fwd instance data types * split the gemm, conv2d_fwd and batched_gemm_softamx_gemm * split the tests by data types where possible * filter examples by DTYPES * split few remaining examples by DTYPES * filter most instances by DTYPES * add new lines at end of headers, fix grouped_gemm profiler * fix syntax * split the ckprofiler instances by DTYPES * split the conv2d and quantization DL and XDL instances * fix the splitting of conv2d DL instances * split softmax and pool_fwd tests for fp16 and fp32 types * fix syntax * fix the dl_int8 quantization instances isolation [ROCm/composable_kernel commit: `08eb176929`]	2023-08-07 14:56:10 -07:00
Bartłomiej Kocot	1dde9f03de	Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757 ) * Support bf16/f32/f16 and NHWGC conv2d_bwd_data * Add interface test * clang format * Comment fixes * Add more friendly error message [ROCm/composable_kernel commit: `63388e84ab`]	2023-06-21 08:20:31 -05:00
Bartłomiej Kocot	a404cc8faf	Add DeviceBatchedGemmMultipleD_Dl (#732 ) * Add DeviceBatchedGemmMultipleD_Dl * Fix batched_gemm tests * Fix comments * test_batched_gemm_multi_d fixes * Fix args for isSupported batchedGemmMultipleDDl * Disable tests for gfx90a [ROCm/composable_kernel commit: `fc9f97568f`]	2023-06-12 08:37:15 -05:00
rocking	84cbb3af35	Pool3d fwd (#697 ) * Expand the base class of pool2d, prepare to share base class with pool3d * Add pool3d device op * Add pool3d f16 example * Refactor the base class. implement generic pooling in the future * clang format * get original index in max pooling * Add outputindex to base class * Fix dimension * Add pooling instance * Use indexType instead * Remove useless header * Extract IndexDataType to template * Extract pooling reference code * clang format * clang format * Fix typo * Add tensor stride * Add missing header * Add index stride and output stride * Refine naming * Add type to base class * Rename file * Use proper size * Fix typo * Refine naming * Modify the argument into vector. * Add max pool profiler * Refine naming * Support f32 pool * Fix typo * Add avg pool2d fwd in profiler * clang format * Rename AccDatatype to ComputeDatatype * Fix init * test pool * Extract variable * Add client example * Check the pooling dim * clang format * Connect argv and arg_parser * Add found check * Remove useless header * Refine naming * Adjust the order of device_pool_fwd [ROCm/composable_kernel commit: `76ec0089fb`]	2023-05-24 09:05:04 -05:00
Bartłomiej Kocot	993c671395	Add contraction profiler and tests (#701 ) * Add contraction profiler and tests * Build and style fixes * Allow to use any elementwise operator for ref_contraction * Introduce profile_contraction_scale and profile_contraction_bilinear * Make ref_contraction generic and extend interface tests * Stylistic minor fixes * Extend test_contraction_interface [ROCm/composable_kernel commit: `642d5e9155`]	2023-05-15 09:46:52 -05:00
rocking5566	16c383aa2a	Gemm+layernorm instance, ckProfiler, client example (#568 ) * Add gemm + layernorm instance * Add ckProfiler * Add test * Add client example * Detect if user forger to set the workrspace * Use literal in the example * [What] use builtin function for sqrt [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt() * check gemm vaildity in IsSupportedArgument * Add more testcases * Merge duplicated folder in client example * Print more infomation * Use better kernel parameter for MS problem size * clang format * Add constexpr for if condition and remove redundant include * Remove cstdlib and add constexpr [ROCm/composable_kernel commit: `f7d28f3e4b`]	2023-02-09 15:02:55 -06:00
Haocong WANG	3baad464d0	[Navi3x-LWPCK-449] wmma_op + unit test (#484 ) * wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * Remote int4 related * delete deprecated test Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `abf9cc6c5c`]	2022-12-02 11:41:13 -06:00
Po Yen Chen	02db748e74	Modularize ckProfiler operations (#514 ) * Re-structure ckProfiler source files * Rename profiler.cpp to main.cpp * Modularize ckProfiler operations * Add description for profiler operations * Use longer name to avoid name collision * Use macro to delay expansion * Use std::move() to avoid object copying * Prohibit users from calling dtor * Use macro to eliminate redundant code * Make friend function hidden * Add missing include directive <iostream> * Fix wrong include directives * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com> [ROCm/composable_kernel commit: `8784a72e23`]	2022-12-01 15:15:02 -06:00
Qianfeng	c3bb3db252	BatchNorm backward instance/external API/profiler/tests (#519 ) * Refine the device batchnorm-backward base API templates and data type assignments * Remove duplicated kernel file * Add batchnorm backward instances and external API * Add batchnorm-backward profiler and tests * Add client example which uses batchnorm backward external API * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory * Loose the threshold for batchnorm-backward check_err() [ROCm/composable_kernel commit: `63af525c06`]	2022-11-30 13:32:20 -06:00
Qianfeng	144efbf9b6	BatchNorm forward instance/external api/profiler/tests/client example (#511 ) * Update to device_batchnorm_forward base class to include all template parameters for problem description * Add batchnorm forward instances and external api * Add batchnorm forward profiler module which uses the external api * Add some comments in batchnorm_forward example to explain the dimensions in lengths[] * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward * Improvement to the batchnorm infer base API * Add batchnorm forward client example which shows using the batchnorm forward external API * Add test for batchnorm forward * Tuning the batchnorm profiler initialized values and error threshold * Add support for bhalf_t in instances/external api/tests * Add support for int8_t in instances/external api/tests * Add support for double in instances/external api/tests * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances * Checking before running best instance in batchnorm_fwd_nhwc client example * Add checking for YElementwiseOp in batchnorm_forward external API * Add more types in batchnorm forward profiler * Add more test lengths Co-authored-by: rocking5566 <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `4e6a5575be`]	2022-11-24 18:02:27 -06:00
Po Yen Chen	44e669e0dd	Add client example of grouped conv2d backward weight (data type: fp16) (#498 ) * Remove redundant CMake setting * Extract common code from files * Rename folder 'convnd' to 'conv' * Use std::array<> to accept compile-time kwnown # of arguments * Fix compilation error of tuning parameter * In example, use same setting as unit-test * Remove no-longer used include directive * Add interface for grouped conv bwd weight * Add group support for conv bwd weight * Add grouped conv bwd weight example * Use group parameter in example * Rename example folder * Remove non-grouped version example source files * Rename device op template * Add group support to convolution backward weight * Remove debug messages * Use smaller group size in example * Use named variable as loop terminate condition * Prettify example output message * Enlarge used grid size * Allow real grid size exceeds expected grid size * Rename interface file * Add client example for grouped conv2d bwd weight * Fix wrong include directive * Rename client example folder [ROCm/composable_kernel commit: `38470e0497`]	2022-11-09 18:50:03 -06:00
guangzlu	b569103030	Fused elementwise normalization (#492 ) * add fused addition lyernorm * add fused addition lyernorm * changed CMakelist * removed annotates * modified descriptor of C * fixed bug in gridwise add layernorm * format the files * modified name from add&layernorm into elementwise&layernorm * created fused elementwise layernorm branch * change input into tuple type * add sweep once to reduce load & read of C from global memory * modified Argument api * modified way to malloc c in global memory * changed gamma and beta to m_k_desc * fixed bug when sweep once and move CDataType when define device level struct * add src dim for gamma and beta * implement optimization for coalesced * delete a annotation line * fixed some bug to meet the requirements of ck * add bandwidth computing in example, and fixed the time unit * move device_elementwise_layernorm_impl.hpp into device/impl * fixed bug in device_elementwise_layernorm_impl.hpp * changed name from layernorm into normalization * clang-format the changed files * changed the names * moved immidiate results into lds, it become faster in non-sweeponce cases * changed naming of C into X to make the defination more clear * changed naming in example * add tests for elementwise normalization * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization * move test_elementwise_layernorm_fp16 into new folder * move elementwise_normalization_instances into a new folder * add more tests in test_elementwise_layernorm_fp16.cpp * added some corner cases in test * fixed method to compute lds size for matrix X * changed name of 44_elementwise_normalization into 45_elementwise_normalization * modified some comments * modified some other confused comments * reduce redundant tests in test_elementwise_layernorm_fp16.cpp [ROCm/composable_kernel commit: `8a4253baaf`]	2022-11-03 12:01:58 -06:00
Anthony Chang	bc16c34989	Disable gtest discovery to run tests per-program not per-case (#432 ) * disable gtest discovery to run tests per-program not per-case * register cmake target to ctest [ROCm/composable_kernel commit: `79aa3fb179`]	2022-11-02 16:58:08 -06:00
Anthony Chang	a109f73b95	Input/output permutation for fused attention (#460 ) * reopen masking att instance due to CI is upgraded * re-enable instances previously failed on 9110 * enable ksize-kpadding pair validity test * add non-masked attention+permute test; expose masking boolean to attention kernel handles * disable bench * fix test * move files * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute * format * amend rename * disable bench in test * add mask/no-mask test for non-permute attention kernels * disable broken kernel instance * example working add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg * interface for bias addition without implementing it * test and profiler running * tidy * mask type determined by enum class * unify example code * move masking specialization to its own header * align formats * extract helper functions * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute * add tensor specialization to template args since tensor spec packed shows perf parity when permutation isn't needed remove redundant template args comment on 'packed' tensor specialization * grouped attention with input/output permute example * format * clean up * refactor acc0 tile visitor Co-authored-by: shaojiewang <wsjmessi@163.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `de37550f72`]	2022-10-27 14:58:20 -06:00
guangzlu	537f7af7b7	Revert "Fused elementwise layernorm (#468 )" (#491 ) This reverts commit `ba4d3575b2`. [ROCm/composable_kernel commit: `6ea9257e9d`]	2022-10-25 18:37:12 +08:00
guangzlu	ba4d3575b2	Fused elementwise layernorm (#468 ) * add fused addition lyernorm * add fused addition lyernorm * changed CMakelist * removed annotates * modified descriptor of C * fixed bug in gridwise add layernorm * format the files * modified name from add&layernorm into elementwise&layernorm * created fused elementwise layernorm branch * change input into tuple type * add sweep once to reduce load & read of C from global memory * modified Argument api * modified way to malloc c in global memory * changed gamma and beta to m_k_desc * fixed bug when sweep once and move CDataType when define device level struct * add src dim for gamma and beta * implement optimization for coalesced * delete a annotation line * fixed some bug to meet the requirements of ck * add bandwidth computing in example, and fixed the time unit * move device_elementwise_layernorm_impl.hpp into device/impl * fixed bug in device_elementwise_layernorm_impl.hpp * changed name from layernorm into normalization * clang-format the changed files * changed the names * moved immidiate results into lds, it become faster in non-sweeponce cases * changed naming of C into X to make the defination more clear * changed naming in example * add tests for elementwise normalization * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization * move test_elementwise_layernorm_fp16 into new folder * move elementwise_normalization_instances into a new folder * add more tests in test_elementwise_layernorm_fp16.cpp * added some corner cases in test * fixed method to compute lds size for matrix X * changed name of 44_elementwise_normalization into 45_elementwise_normalization * modified some comments * modified some other confused comments * reduce redundant tests in test_elementwise_layernorm_fp16.cpp [ROCm/composable_kernel commit: `efbcc6eddc`]	2022-10-25 10:23:20 +08:00
rocking5566	1dcaa3991f	Fix bug of layernorm ckProfiler and refine code (#448 ) * Fix bug of profiler for layernorm * 1. Rename layernorm into normalization 2. Decouple softmax from normalization * clang-format [ROCm/composable_kernel commit: `1b62bfaa2a`]	2022-10-12 21:06:39 -05:00
Shaojie WANG	a121a10a26	MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425 ) * add lower triangle bmm * init code for tile skipping * functionality right with lower triangle mask * add decoder lower triangular mask calculation * use 713 group fix n2 compute error * attention with lower triangle mask with tile skipping * add template to distinguish masking kernel * rename template and remove default template value * remove lower triangle gemm reference struct * add some comments on example * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels * add test * add test file * add gtest for bmm masking scale softmax bmm permute * clang-format * fix compile error * check lef bottom corner for tile skipping * fix error: check left bottom corner for tile skipping * add k padding * add test and instance for MNK padding * passing a mask struct * fix instances * delete used comments * format Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `ebab84b6f9`]	2022-09-20 12:43:53 -05:00
Adam Osewski	038fb75356	int4 data type (#364 ) * Introduce int4 data type. * Add unit-tests for int4 * Compile int4 UT only when int4 enabled. * clang-format Co-authored-by: Adam Osewski <aosewski@amd.com> [ROCm/composable_kernel commit: `e00149ac67`]	2022-08-18 14:53:47 -05:00
Anthony Chang	bd30eaf33b	Fused GEMM+GEMM (#351 ) * initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring an unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * add gemm_gemm instances and tests * avoid LDS data hazard * fix build Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `c20a75b07d`]	2022-08-13 09:18:58 -05:00
Anthony Chang	72b7716744	Fused attention (#345 ) * initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring an unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * attention host validation * add blockwsie softmax v1 * iteratively update softmax+gemm * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum * add init method for easier debugging * do away with manual thread cluster calculation * generalize blockwise softmax interface * row-wise softmax sum & max * format * rename to DeviceBatchedGemmSoftmaxGemm * add gemm_softmax_gemm instances and tests * comment Co-authored-by: ltqin <letao.qin@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `cac014f173`]	2022-08-13 00:16:14 -05:00
Chao Liu	236f946292	Clean up conv example, Instances, profiler and test (#324 ) * convnd_fwd fp16 example * update example * update example * update instance * updating refernce conv * update reference conv * update conv fwd profiler * update conv 1d and 3d instance * update include path * clean * update profiler for conv bwd data and weight * update conv bwd weight * clean * update conv example * update profiler for conv bwd weight * update ckprofiler for conv bwd data * fix reference conv bwd data bug; update conv bwd data test * update examples * fix initialization issue * update test for conv fwd * clean * clean * remove test case too sensitive to error threshhold * fix test * clean * fix build * adding conv multiple d * adding conv multiple D * add matrix padder * add gemm padding to convnd * adding group conv * update gemm multi-d * refactor * refactor * refactor * clean * clean * refactor * refactor * reorg * add ds * add bias * clean * add G * adding group * adding group * adding group * update Tensor * clean * update example * update DeviceGemmMultipleD_Xdl_CShuffle * update conv bwd-data and bwd-weight * upate contraction example * update gemm and batch gemm with e permute * fix example build * instance for grouped conv1d * update example * adding group conv instance * update gemm bilinear instance * update gemm+add+add+fastgelu instance * update profiler * update profiler * update test * update test and client example * clean * add grouped conv into profiler * update profiler * clean * add test grouped conv, update all conv test to gtest * update test [ROCm/composable_kernel commit: `500fa99512`]	2022-07-29 18:19:25 -05:00
rocking5566	546f5e0ee6	Standalone layernorm (#315 ) * Implement layernorm kernel and deviceOp * verify gpu kernel with host code * 1. Separate gamma aand beta from affine 2. Check if argument is valid * clean * Sync the naming * Support sweep once mode if we can put k dimension data inside one block * [What] Get length from upper length. [Why] if we get length directly, we may get length after padding. * We only use one block in K dimension. Hence, we can simplify the indexing of global R/W. * Use 1d descriptor for gamma and beta * Add accElementwiseOp * Extract layernorm host code * Support different YVectorDim in GridwiseLayernorm * Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp * Gamma and beta can share the VGPR. * Add test for fp32 and fp16 * Fix bug of concurrency and add test case which may fail orignally * Propagate NaN for layernorm Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `7f21662089`]	2022-07-13 11:16:14 -05:00
Shaojie WANG	4513fb2e77	add conv1d/3d bwd weight instances (#318 ) * add conv1d/3d bwd weight instances * add profiler code [ROCm/composable_kernel commit: `763ca61581`]	2022-07-08 15:42:20 -05:00
Liam Wrubleski	73af96b913	Switch to standard ROCm packaging (#301 ) * Switch to standard ROCm packaging * Revert .gitignore changes * install new rocm-cmake version * update readme Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `b653c5eb2e`]	2022-06-25 09:35:16 -05:00
Chao Liu	4a27f120ea	Absolute include path (#281 ) * ad gelu and fast_gelu * added GeLU and fast GeLU * clean up * add gemm+fastgelu example * add gemm+gelu instances * update profiler * clean up * clean up * adding gemm+bias+activation * clean * adding bias * clean * adding gemm multiple d * debugging * add gemm bias add fastgelu * rename, clean * refactoring; add readme * refactor * refactor * refactor * refactor * refactor * refactor * fix * fix * update example * update example * rename * update example * add ckProfiler * clean * clean * clean * clean * add client app example * update readme * delete obselete files * remove old client app * delete old file * cleaning * clean * remove half * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path for all examples * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * fix header path * revert client app example * clean build * fix build * temporary disable client test on Jenkins * clean * clean * clean [ROCm/composable_kernel commit: `d1db6a0c3e`]	2022-06-24 20:51:04 -05:00
Anthony Chang	08b531ae32	Standalone softmax kernel (#284 ) * initial stub for standalone softmax * start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m * host softmax validates * compiles; to implement beta scaling * use NaN trick to efficiently ignore OOB values during sum of exponentials * freeload device_reduce's utility functions * clean up interface * adding prior value (beta scaling) * remove restriction related to perf considerations * apply clang-format * clean; disable diagnostics * resolve conflicts * add exp wrapper * honor HostTensorDesc interface; allow implicit cast from different vector<T> type * test softmax for fp16/fp32 * update readme * amend commit NaN trick * remove redundant param added during development * format * replace ScalarDataType with AccDataType * separate out test programs by precision type * move softmax sample code to its own folder * format * keep up with recent changes in reduction API * remove extra header [ROCm/composable_kernel commit: `15c89e81f0`]	2022-06-21 14:59:19 -05:00
Jianfeng Yan	81f473ff41	Navi21 gemm (#197 ) * start adding navi21 GEMM * navi_gemm_km_kn_mn_fp32 compiles and passes one test. * rename variables and functions in gridwise_gemm_dlops_v1r3 * add other 3 layouts; format instance * adding more tuning parameters add tuning parameters for other 3 layouts * add gemm_dlops_f16 * tmp * add dependence of DeviceGemm::IsSupportedArg() on arch * minor changes * minor changes * minor changes * minor changes * minor changes * minor changes * minor changes * push gemm_dlops into profiler * minor changes * if using xdl or dlops is moved into profiler_gemm_impl * minor changes * minor changes * remove is_xdl from profile_gemm_impl * make IsSupportedArg dependent on arch for other device_gemm * minor changes * minor changes * fix a bug in f_generate_tensor_value * add 64x64x64 for gemm_dlops_int8 * add 64x64x64 for gemm_dlops_int8 * comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1 * fix * start fixing tuning parameters * monir * minor changes * minor changes * minor changes * fixing * adding example * adding example * adding example * add gemm fp32 example * clean up * use 128x128x16 as MNK tile in navi21 gemm example * bug fix * fix test * use new block c tile * clean * fix build Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: shaojiewang <wsjmessi@163.com> [ROCm/composable_kernel commit: `40b59a63cc`]	2022-05-24 12:19:27 -05:00
Anthony Chang	1ec2ac091a	Refactor block to C tile map (#235 ) * refactor block-to-ctile-map * gridwise gemm block2ctile generic validity check * format * amend split-k gemm block2ctile map refactor * add test * format * amend * revert to calculating batch index in kernel instead of passing as block_id_z * move file * add valid ctile index check to gridwise v2r4 [ROCm/composable_kernel commit: `a054f7d604`]	2022-05-20 12:40:51 -05:00
Anthony Chang	4085385498	Validate examples in CI (#233 ) * validate examples in ctest runs * format * fix usage of check_err * amend * add example codes to custom target 'check' Co-authored-by: Chao Liu <chao.liu2@amd.com> [ROCm/composable_kernel commit: `9f71ff48e2`]	2022-05-13 16:54:44 -05:00

72 Commits