composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-19 02:01:01 +00:00

Author	SHA1	Message	Date
Illia Silin	ffac25fffa	Fix grouped_gemm_splitk kernels on MI300. (#694 ) * replace amd_buffer_atomic_add with hip_atomic_add * fix grouped_gemm_splitk kernels on mi300 * fix syntax * revert experimental atomic_add changes --------- Co-authored-by: Jing Zhang <jizhan@amd.com> [ROCm/composable_kernel commit: `4a51d2da9d`]	2023-05-03 08:25:25 -07:00
Illia Silin	dda83a196e	Syncing up from internal repo to enable MI300. (#690 ) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `4feebedd41`]	2023-04-28 18:22:59 -05:00
Haocong WANG	f6c5a1727a	add vector load check (#680 ) Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `54c90aae13`]	2023-04-26 15:58:57 -05:00
Adam Osewski	db4216c421	Grouped Gemm + SplitK + simplified Kernel Args (#669 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * B2C with 3D grid for KSplit * Remove unused code. * Use default B2C (3D grid) in grid gemm v2r4r2. * Device gemm splitk use B2C map. * Device GroupedGemmXdlSplitKCShuffle * Example for GroupedGemm Xdl SplitK * Introduce Device GroupedGemmSplitK * Fix updating kbatch size. * Add instance mk-nk-mn * Enable set kbatch in profiler. * Add GGemmSplitK mk-kn-mn instances * Add more instances & split into multiple files. * minor fix * tuning * clean * disabled failed instances * use pipe v2 * Ignore arg on not supported arch. * fix warning --------- Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Jing Zhang <jizhan@amd.com> Co-authored-by: root <root@ctr-ubbsmc15.amd.com> [ROCm/composable_kernel commit: `8bb2bb4a05`]	2023-04-24 15:43:36 -05:00
Illia Silin	64dc32a54b	Put back the split-k gemm code. (#684 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout --------- Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `903cd19ce3`]	2023-04-21 19:37:00 -05:00
Jun Liu	d32add6de2	Issue #666 : Revert "simplify karg in device/grid of split-k op (#644 )" (#665 ) This reverts commit 469cce884ed93ab0e59e793df5b3c00d7657bf7a. [ROCm/composable_kernel commit: `3248387bbb`]	2023-04-06 17:14:11 -07:00
carlushuang	0755fc355d	simplify karg in device/grid of split-k op (#644 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout [ROCm/composable_kernel commit: `bb5530af91`]	2023-03-29 19:03:07 -05:00
Rostyslav Geyyer	35a515d0f3	Add a denorm test fix (#603 ) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `dbd8f94bef`]	2023-03-29 15:05:32 -05:00
Illia Silin	f84ebeac6b	Get rid of XDL parameters in WMMA kernel string. (#646 ) * remove XDL parameters from WMMA kernel string * get rid f two more parameters [ROCm/composable_kernel commit: `36750a5763`]	2023-03-22 08:05:48 -07:00
rocking5566	a235ffef27	gemm/Conv xdlops + dlops quantization (#625 ) * Add conv perlayer quantization * Add gemm_dlops quantization * Support int8 for innerproduct * Refine gemm dlops int8 kernel parameter * Support gfx908(MI100) and gfx90a(MI200) * clang-format * Rename example number * Support different layout for d tensor * Add conv dlops perchannel quantization example * Move to example 40 * Extract the common code for different platform (dlops and xdlops) * Move ot subfolder. Prepare to add other op of quantization * Refine the quantization instance library * Add conv dl instances and client example * Remove unnecessary type * Add gemm quantization instance * Add external api and client example * Refine num_bytes * Separete different layout to different cpp * Add more xdl instances * Revert "Remove unnecessary type" This reverts commit `820869182f`. * Remove CShuffleDataType in dlops Let acc and CShuffleDataType be the same in xdlops --------- Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `16dc18e0f9`]	2023-03-15 15:29:40 -05:00
Adam Osewski	96e6154eb6	Device Op GroupedGemmMultipleD + example fp16 (#633 ) * Pass shared mem pointer as pointer to void. * Device Op GroupedGEMM Multiple D * Example for grouped gemm multiple d. * Add MI200 to supported archs. --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `a2d5ca8e95`]	2023-03-15 11:22:59 -05:00
Rostyslav Geyyer	3e6e867dae	Add layout check to IsSupportedArgument (#627 ) * Add layout check to IsSupportedArgument * Format --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `c10a6e8293`]	2023-03-15 11:12:12 -05:00
Illia Silin	a21c09c1f1	Update GetTypeString function to generate unique kernel IDs. (#638 ) * make conv_fwd_bias_activation kernel id unique * add more parameters to conv and gemm kernel names * update GetTypeString for conv and gemm kernels * fix two more kernel strings [ROCm/composable_kernel commit: `14b3504d95`]	2023-03-15 10:44:42 -05:00
Rostyslav Geyyer	f52a062900	Remove debug asserts (#629 ) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> [ROCm/composable_kernel commit: `5b57ab96a8`]	2023-03-10 17:34:44 -06:00
Haocong WANG	271236466d	[Navi3x] Multiple issue fix (#612 ) * Change gridwise gemm mD blockwise gemm to naive * RRR Gemm fix * Fix RCR gemm bug * Isolate wmma instructions * Update amd_inline_asm.hpp * Update amd_wmma.hpp * Update amd_wmma.hpp * fix syntax and update Jenkinsfile --------- Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `087e310589`]	2023-03-10 17:04:28 -06:00
Illia Silin	0202ecce60	[gfx110x] support Navi3x architectures. (#628 ) * enable building on Nav31 * fix syntax * replace GPU_TARGETS with offload-arch * add gfx1102 rachitecture * fix typo * update changelog [ROCm/composable_kernel commit: `0ccecc7c31`]	2023-03-09 07:56:40 -06:00
Adam Osewski	50707cbb13	GroupedGEMM + Gelu client example/instances/profiler (#614 ) * Grouped gemm + Gelu instances. * Device Instance Factory for GroupedGemm+Gelu * Client example * Rangify fill helper functions. * Fix name clash. * Profiler for grouped_gemm+gelu * No need to use full namespace name. * Add check for MRaw divisible by vector load. * Ugly fix for big errors. * Add grouped_gemm+gelu to profiler CMakelists. * Store in argument additional info. * Information about Mraw, Nraw, Kraw values. * Use FastGelu instead of Gelu. * Change client ex to use FastGelu * Remove relaxed error precision. * Remove duplicate output elementwise-op --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `9096b1c7b2`]	2023-03-07 22:06:56 -06:00
Haocong WANG	45ae1a62bd	[Navi3x Bug Fix] fix typo to accept MNKPadding flag correctly. (#597 ) * fix a bug blocking wmma_gemm_multipleD * Utilize matrix padder in device_wmma_op * cosmetic change for gemmpadding format * clang format * Change gridwise gemm from FIFO to KMN loop fashion [ROCm/composable_kernel commit: `68dbf40a79`]	2023-03-01 12:07:42 -06:00
zjing14	84a4731c15	disable tensor contraction f64 on MI100 (#602 ) [ROCm/composable_kernel commit: `209baee299`]	2023-02-23 16:59:37 -08:00
Rostyslav Geyyer	76c23de0a8	Add Grouped Conv Backward Weight on Navi21 for ResNet50. (#505 ) * Add DeviceOp and examples * Format DeviceOp template arguments * Remove bf16 example * Format * Format * Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N * Refactor argument preparation * Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl * Rename device op file * Update include directive in the example file * Update descriptor preparation for grouped op * Update the argument * Update batch handling * Add gridwise gemm supporting batched input * Update blockwise indexing, working version * Update copyright year * Update check if argument is supported * Refactor and make consistent with xdl examples * Update check if argument is supported * Add changelog entry * Added comments on Dl op split_k>1 support --------- Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `246ceee49e`]	2023-02-22 11:59:53 -06:00
Illia Silin	c3840f179d	fix a bug when building for gfx1030 target. (#591 ) * fix a bug while building for gfx1030 and add gfx1030 to targets * fix syntax [ROCm/composable_kernel commit: `bef0cb20db`]	2023-02-16 13:54:08 -06:00
rocking5566	d5062679f1	Improve normalization (#580 ) * Sync the order of type string with template parameter * Add more instances * Check the vector size and remove redundant var * Extract var to static, prepare to separate sweep once kernel * Separate sweeponce flow and optimize the flow * 1. Rename AccDatatype in normalization to computeData 2. Rename AccElementwiseOperation to YElementwiseOperation in normalization * Remove useless code * Update naive variance kernel * Refine string * Fix typo * Support naive variance for device_normalization * Check the blocksize * Share the VGPR of x and y * Share the VGPR of gamma and beta * Add more instances * Support fp16 sqrt for experiment * Add CHANGELOG * Fix typo * clang-format [ROCm/composable_kernel commit: `6a6163a3d1`]	2023-02-15 11:59:35 -06:00
Haocong WANG	d6de9bdcbe	[Navi3x] Add Device Operations (#567 ) * wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size * navi3x_multipleD+example * temp save * workable * batchedgemm[OK], groupconv[debug] * groupconv: Sanity check[OK], Performance[Bad] * navi3x_groupconv_need_optimization * format * Add arch limitation to all wmma examples * fix bug: example30 input conv args [ROCm/composable_kernel commit: `0cfda84d05`]	2023-02-15 11:50:51 -06:00
rocking5566	329678b636	Gemm+layernorm instance, ckProfiler, client example (#568 ) * Add gemm + layernorm instance * Add ckProfiler * Add test * Add client example * Detect if user forger to set the workrspace * Use literal in the example * [What] use builtin function for sqrt [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt() * check gemm vaildity in IsSupportedArgument * Add more testcases * Merge duplicated folder in client example * Print more infomation * Use better kernel parameter for MS problem size * clang format * Add constexpr for if condition and remove redundant include * Remove cstdlib and add constexpr [ROCm/composable_kernel commit: `f7d28f3e4b`]	2023-02-09 15:02:55 -06:00
guangzlu	6caec3d429	Add instance for elementwise normlization (#573 ) * added instances for large N * add instance for elementwise normlization * added supported restrict in device_elementwise_normalization_impl.hpp [ROCm/composable_kernel commit: `76d144fa7c`]	2023-02-09 09:37:29 -08:00
ltqin	32525bff35	Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576 ) * add instance for gemm bias softmax gemm * add client example * change CGridDesc_G_M_N to CGridDesc_G_M_O * add gridwise * change c grid name * device add d0s data * fix 08 client_example * add example 47_fused_attention * example output correct * add d0 to example * add d0 element op * rechange instance code * change Acc0ElementwiseOperation to C0DEElementwiseOperation * change example name * update instance for cdeelementwiseop * add bhalf_t ScaleAdd * add test * not surport geem1 bias * remove some ignore * fix test bug [ROCm/composable_kernel commit: `332ccc3367`]	2023-02-08 14:34:45 -06:00
Qianfeng	b148acfaaa	Batchnorm inference instances, external API, client examples and gtests (#531 ) * File renaming and class renaming for device element-wise operation * Add batchnorm-infer instances, external API and client example * Add batchnorm-infer profiler module and gtests * Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp * Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer * Rename class and file due to conflict from device_elementwise_2d.hpp * Fix namespace in batcnnorm_infer_nhwc client example [ROCm/composable_kernel commit: `a1b2441f8d`]	2023-01-25 17:09:04 -06:00
Qianfeng	4180697c64	Use double for all scaling values and float-point constant values at the Device Op API (#557 ) * Use double as alpha/beta values type in reduce device op api * Use double as alpha/beta values type in softmax device op api * Use double as alpha/beta values type in multiple-reduce device op api * Use double as epsilon value type in normalization/elementwise-normalization device op api [ROCm/composable_kernel commit: `52abc2f371`]	2023-01-18 12:02:50 -06:00
Raman R jana	de63c222fb	Wavelet (inter-wave consumer-producer) GEMM (#310 ) * wavelet gemm programming model support for CK * GEMM pipeline update for wavelet progrmmaing model * Updated wavelet programming pipeline * fixes for global-write for math-wave * fixed bug in global writes * Updated comments for better readability * fixed clang format errors * added block_lds without barrier sync * clean * clean * clean * clean * refactor * prototype 4 layouts fix default stride all problem sizes tidy move file update build script restore old file fix build * refactor standalone test to use gemm test harness * simplify gemm test * update build script * remove redundant * early return when cmd arg doesn't match * tidy * report failure when result not validated * tidy * Add comment depicting B2C mapping pattern. * Formatting & comments. * Comparison with custom B2C mapping pattern. * Example for wavelet gemm. * Add wavelet to Gemm standalone test. * Remove debug code. * Remove dangling #endif directive. Co-authored-by: root <Raman Jana> Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: Anthony Chang <ac.chang@outlook.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `1cfa87608a`]	2023-01-18 12:00:02 -06:00
ltqin	ebdb392f09	Add multiD Gemm client APIs (#534 ) * start add example * fix config * fix showinfo bug * add an elementop * change to padding * add xdl example * change elementwiseop * add instance * add instance to profiler * change file name * fix deive not support issue * add client example * fix client gemm_add_multiply name * change AddMultiply elementwiseop * fix elementwiseop * fix client example * fix addmultiply op * fix comments and fun name Co-authored-by: letaoqin <letaoqin@amd.com> [ROCm/composable_kernel commit: `d66421fe34`]	2023-01-18 11:53:56 -06:00
who who who	75c29a7c90	add multi embeddings support (#542 ) * add multi embeddings support * fix format * optimize sqrt * add reduce operation * change to elementwise op * fix name * rename * run ci cd * format example * format code * format code [ROCm/composable_kernel commit: `147b7db561`]	2023-01-18 11:32:12 -06:00
Qianfeng	2ca8512f48	Reduction external API and client examples (#493 ) * Change to the DeviceReduce base class template to include all problem description information * Add external api for reduction * Add client example to test the reduction external api * Spelling correction * Re-implement the host_reduction to follow the DeviceReduce base API format * Change the reduce profiler to call the external API for collecting device instances * Rename reduce client example directory from 08_reduce to 12_reduce * Remove (void) before the functional call * Tiny update in reduce client example * Tiny update in profile_reduce_impl.hpp * Rename the reduce client example directory Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `80e0526741`]	2023-01-16 22:18:06 -06:00
rocking5566	540e76a1bd	Gemm layernorm welford (#413 ) * Add device op of gemm layernorm * [What] Rename F to H [Why] F and G prepare for welford tensor * Add gridwise gemm + welford * Extract template parameter * Rename kernel. Prepare to add second half kernel * Extract var * Add second kernel for gemm+layernorm * Move to the gemm_layernorm folder * Rename F and G to mean and var * Do not use snakeCurved, it makes determination of padding for welford difficult * Rewrite the device interface and rename some var * Add welford count * Update interface * Sync code, prepare to test on MI200 * Clean the code * Implement layernorm * Add comment to mension hipFree * Wrtie out the e for debug. This could be remove and use h for instead * 1. Allocate mean, var and count into by SetWorkSpacePointer. 2. Add GetWorkSpaceSize to calculate the space size * Add gemm layernorm host code * use reference layernorm * Fix bug of blockwise welford for first kernel * Fix bug of mean var padding for layernorm * Use sgpr for shuffleM_index * padding for GemmMeanVarCountGridDescriptor_M_NBlock * Add layout parameter * Check argument for gemm * calculate max count for tail block * Share E and H memory in device op * Hard code the vector dim * Refine the MakeDescriptor * 1. Remove E parameter, because E is inside of device op 2. Check vector size * [What] Rename MakeMeanVarDescriptor_M_N [Why] Prepare to add count version of make descriptor * Use 1D global memory for count * Prevent redundant IO * Update parameter * Add pipeline v1/v2 selector * Rename the example name * Add base class for gemm layernorm * Refine naming to distinguish naive and welford * Add comment to explan in detail * We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1 * Rewrite the 2st kernel, use multiple block along N dimension in layernorm kernel * Share the vector size * Refine var name * [What] Force LayernormThreadSliceSize_N = vector size. [Why] Memory coalesce * Add comment * Extract divisor out of the loop in reference layernorm * Pad different size for E and H in layernorm kernel according to different block tile * Refine naming * Refine naming * Prevent implicit cast * [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf [Why] __builtin_amdgcn_sqrtf is only support float, double will cause casting * Cast only constant * Change of post shuffle thread descriptor * Add EMeanVarDataType parameter. * Merge the mean and var threadwise copy * Add missing index * Fix Typo * Sync the variable with previous if * 1. Declare e inside the host_gemm_layernorm() 2. Prevent implicit cast in reference code Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `7829d729fb`]	2023-01-16 20:08:25 -06:00
Haocong WANG	e328adfbe4	[Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541 ) * wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size [ROCm/composable_kernel commit: `919aeb1f52`]	2023-01-16 20:06:01 -06:00
Illia Silin	345d2ec3db	Add a flag to enable/disable debug output in many kernels. (#549 ) * add DEBUG_LOG macro to enable/disable debug output * fix syntax * fix syntax again * fix syntax one more time * remove balnk spaces * use ifdefs * add the Print argument * move the definition of DEBUG_LOG to ck.hpp * add the missign argument to Print() [ROCm/composable_kernel commit: `715e8dd241`]	2023-01-11 19:55:56 -06:00
zjing14	ac9c43d666	Add MNK padding, M = 0 support into grouped_gemm (#539 ) * add mnk padding, support m=0 * clean code * clean code Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> [ROCm/composable_kernel commit: `0345963eef`]	2022-12-15 15:07:24 -06:00
Illia Silin	ae5803e284	Make sure that GEMM sizes in K dimension are supported. (#527 ) * apply new K-dimension check in gemm_xdl_cshuffle * add K-dim check to gemm_xdl and batched_gemm_xdl * fix syntax * fix syntax * clean-up the debug output [ROCm/composable_kernel commit: `d58b7f5155`]	2022-12-08 11:48:43 -06:00
Qianfeng	0b8096b485	BatchNorm backward instance/external API/profiler/tests (#519 ) * Refine the device batchnorm-backward base API templates and data type assignments * Remove duplicated kernel file * Add batchnorm backward instances and external API * Add batchnorm-backward profiler and tests * Add client example which uses batchnorm backward external API * Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory * Loose the threshold for batchnorm-backward check_err() [ROCm/composable_kernel commit: `63af525c06`]	2022-11-30 13:32:20 -06:00
fsx950223	cb87e544a3	fix GetTypeString [ROCm/composable_kernel commit: `0e9c88cecf`]	2022-11-29 14:18:10 +08:00
Qianfeng	0ef7876035	BatchNorm backward implementation (#461 ) * Implemented batchnorm-backward Blockwise and Multiblock kernels * Add batchnorm-backward device op * Add batchnorm-backward host-reference op * Add batchnorm-backward example * Parameters renaming in batchnorm backward kernels and device op * Change in the example to loose the threshold for ScaleDiff checking * Add comments to explain the implementation of batchnorm-backward * Parameters renaming again in batchnorm backward kernels * Improve the expression calculation for performance * Add batchnorm backward to README * Add comments to explain inv-variance in batchnorm forward and backward * Renaming the batchnorm forward training and inferring examples * Add/update the comments for batchnorm-backward kernels * Renaming again * Add block_sync_lds between two consecutive blockwise reductions * Move common expression 1/N out of the static_for loops * Add dy_elementwise_op * Renaming in backward example again * Add checking for reduceDims in reference_batchnorm_backward * Update to comments and codes format * Rename in the comments * Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c * Add block_sync_lds() between blockwise reduction again * Fix comments again * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test [ROCm/composable_kernel commit: `44789d992a`]	2022-11-28 20:51:10 -06:00
Qianfeng	52d082bade	BatchNorm forward instance/external api/profiler/tests/client example (#511 ) * Update to device_batchnorm_forward base class to include all template parameters for problem description * Add batchnorm forward instances and external api * Add batchnorm forward profiler module which uses the external api * Add some comments in batchnorm_forward example to explain the dimensions in lengths[] * Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward * Improvement to the batchnorm infer base API * Add batchnorm forward client example which shows using the batchnorm forward external API * Add test for batchnorm forward * Tuning the batchnorm profiler initialized values and error threshold * Add support for bhalf_t in instances/external api/tests * Add support for int8_t in instances/external api/tests * Add support for double in instances/external api/tests * Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances * Checking before running best instance in batchnorm_fwd_nhwc client example * Add checking for YElementwiseOp in batchnorm_forward external API * Add more types in batchnorm forward profiler * Add more test lengths Co-authored-by: rocking5566 <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `4e6a5575be`]	2022-11-24 18:02:27 -06:00
ltqin	32b187963d	Add Conv Backward Data on Navi21 for ResNet50 (#499 ) * start add example * add device dl * change launch kernel * change init data method * change example config * add config valid check * add instance for dl bwd * add instance to ckProfiler * reserver to profiler and cmakelist * add instance to ckProfiler2 * change instance f32 config * fix example return value Co-authored-by: letaoqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `db0eb1ea9c`]	2022-11-15 16:22:20 -06:00
Po Yen Chen	a28f26cd0a	Avoid reporting unused member function error (#507 ) [ROCm/composable_kernel commit: `7038723a46`]	2022-11-14 19:54:37 -06:00
Po Yen Chen	ff9f244625	Introduce ck::accumulate_n() (#439 ) We can use this template to eliminate duplicated iterator computing logics. By providing return type to ck::accumulate_n(), we can avoid type conversion operations. [ROCm/composable_kernel commit: `730204eed0`]	2022-11-14 19:53:39 -06:00
Po Yen Chen	ad513a0fee	Fix build errors on CI server (#506 ) * Add missing ignore expression * Add missing include directive [ROCm/composable_kernel commit: `4382b41469`]	2022-11-11 11:36:55 -06:00
Po Yen Chen	93f036f2c3	Add client example of grouped conv2d backward weight (data type: fp16) (#498 ) * Remove redundant CMake setting * Extract common code from files * Rename folder 'convnd' to 'conv' * Use std::array<> to accept compile-time kwnown # of arguments * Fix compilation error of tuning parameter * In example, use same setting as unit-test * Remove no-longer used include directive * Add interface for grouped conv bwd weight * Add group support for conv bwd weight * Add grouped conv bwd weight example * Use group parameter in example * Rename example folder * Remove non-grouped version example source files * Rename device op template * Add group support to convolution backward weight * Remove debug messages * Use smaller group size in example * Use named variable as loop terminate condition * Prettify example output message * Enlarge used grid size * Allow real grid size exceeds expected grid size * Rename interface file * Add client example for grouped conv2d bwd weight * Fix wrong include directive * Rename client example folder [ROCm/composable_kernel commit: `38470e0497`]	2022-11-09 18:50:03 -06:00
guangzlu	d1c5fef7d1	Fused elementwise normalization (#492 ) * add fused addition lyernorm * add fused addition lyernorm * changed CMakelist * removed annotates * modified descriptor of C * fixed bug in gridwise add layernorm * format the files * modified name from add&layernorm into elementwise&layernorm * created fused elementwise layernorm branch * change input into tuple type * add sweep once to reduce load & read of C from global memory * modified Argument api * modified way to malloc c in global memory * changed gamma and beta to m_k_desc * fixed bug when sweep once and move CDataType when define device level struct * add src dim for gamma and beta * implement optimization for coalesced * delete a annotation line * fixed some bug to meet the requirements of ck * add bandwidth computing in example, and fixed the time unit * move device_elementwise_layernorm_impl.hpp into device/impl * fixed bug in device_elementwise_layernorm_impl.hpp * changed name from layernorm into normalization * clang-format the changed files * changed the names * moved immidiate results into lds, it become faster in non-sweeponce cases * changed naming of C into X to make the defination more clear * changed naming in example * add tests for elementwise normalization * move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization * move test_elementwise_layernorm_fp16 into new folder * move elementwise_normalization_instances into a new folder * add more tests in test_elementwise_layernorm_fp16.cpp * added some corner cases in test * fixed method to compute lds size for matrix X * changed name of 44_elementwise_normalization into 45_elementwise_normalization * modified some comments * modified some other confused comments * reduce redundant tests in test_elementwise_layernorm_fp16.cpp [ROCm/composable_kernel commit: `8a4253baaf`]	2022-11-03 12:01:58 -06:00
rocking5566	fe367bc917	Refine layernorm naming and test code (#497 ) * Sync the naming * Sync the test of layernorm with groupnorm * Sync the naming * Minor change for comment and log * [What] Add saveMean and SaveInvVariance in the interface. [Why] These can optimize the backward [ROCm/composable_kernel commit: `d4d1147f0a`]	2022-11-02 16:57:28 -06:00
Rostyslav Geyyer	deb5b07204	Add pipeline v1/v2 selector, add more instances (#381 ) * Add gridwise gemm pipeline v1/v2 selector * Pipeline selector working, test-wise add pipeline options to one instance * Add gemm instances * Add debug info to DeviceGemmXdl * Add debug info to DeviceGemmXdl_CShuffle * Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu * Minor fix * Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm * set up inter-wave configuration * use defualt loop scheduling for supported gemm ops for blanket-applying interwave scheduling for all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. this should be discouraged though as it is not covered by CI * Add enum PipelineVersion * Update instances * Format * Fix the merge conflict * Add flags to disable added instances * Test disable flag check * Disable flag check * Enable the instances Co-authored-by: Anthony Chang <ac.chang@outlook.com> [ROCm/composable_kernel commit: `1a0b0e7bec`]	2022-11-02 16:50:48 -06:00
Adam Osewski	561c1d76c5	Softmax unit-test reduction across all and non innermost dims cases. (#406 ) * Add reduction across all dims cases. * host softmax: handle all reduce * Test cases when reduced dim is not innermost axis. * Fix syntax. * Test non innermost dim for fp32 and int8 * Group test suites wrt NumReduceDim. * Additionally test failing cases. * Throw error when Rank or NumReduceDims doesn't match arguments. * Check reducedDims has correct values * Move don't reuse DeviceReduceMultiblock IsSupportedArgument method. Instead implement own. (in fact just get rid of one check to enable reduction across inner dimensions). * Reorganize unit tests to better cover use scenarios. * Test input validation * Test reduction of inner dimensions with custom op instances. * Refactor fp32 and int8 unit tests. * Fix FP32 instance template parameters. * Add more instances. * Instances with InSrcVectorDim=0. * Do not initialize and copy data when arg not supported. * ckProfiler Softmax use instance factory. * Refactor device softmax IsSupported. * Additionally add non-polymorphic api functions * Split softmax instances into multiple files. * Fix profiler. * Reorganize tests to reuse profiler and cover edge cases. * Clang-format * I8 Softmax instances along with UT. * Reuse type alias definitions from instance factory header. * Clean included headers * Fix variable names. * Add missing checks in Argument constructor. Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: Anthony Chang <ac.chang@outlook.com> [ROCm/composable_kernel commit: `6d8614ee50`]	2022-11-02 16:46:08 -06:00

1 2

61 Commits