Commit Graph

828 Commits

Author SHA1 Message Date
Chao Liu
8f455615a8 Fast GeLU using built-in function (#587)
* clean up

* fast gelu using builtin function

* clean

* clean

* clean

* clean

* clean

* fix compilation

* clean

* clean

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-26 23:19:11 -06:00
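The "fast GeLU using built-in function" commit above swaps the exact erf-based GeLU for a cheaper device builtin. A minimal sketch of the common tanh-based fast-GeLU approximation this typically corresponds to (plain Python stand-in for illustration, not the actual CK device code):

```python
import math

def fast_gelu(x: float) -> float:
    # Tanh-based fast-GeLU approximation:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654 * (x + 0.044715 * x ** 3)))

def gelu_exact(x: float) -> float:
    # Exact GeLU via the Gaussian CDF, for comparison.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

On GPU the tanh (or an equivalent exp-based form) maps to a hardware builtin, which is the speedup the commit is after; the approximation error versus exact GeLU is small across typical activation ranges.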
zjing14
209baee299 disable tensor contraction f64 on MI100 (#602) 2023-02-23 16:59:37 -08:00
Rostyslav Geyyer
246ceee49e Add Grouped Conv Backward Weight on Navi21 for ResNet50. (#505)
* Add DeviceOp and examples

* Format DeviceOp template arguments

* Remove bf16 example

* Format

* Format

* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N

* Refactor argument preparation

* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl

* Rename device op file

* Update include directive in the example file

* Update descriptor preparation for grouped op

* Update the argument

* Update batch handling

* Add gridwise gemm supporting batched input

* Update blockwise indexing, working version

* Update copyright year

* Update check if argument is supported

* Refactor and make consistent with xdl examples

* Update check if argument is supported

* Add changelog entry

* Added comments on Dl op split_k>1 support

---------

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-22 11:59:53 -06:00
ltqin
830d37a7d5 Grouped conv1d client example (#589)
* add conv1d fwd client example

* change 07_grouped_conv2d_fwd to 07_grouped_convnd_fwd

* add conv1d bwd weight

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-22 11:55:21 -06:00
Illia Silin
bef0cb20db fix a bug when building for gfx1030 target. (#591)
* fix a bug while building for gfx1030 and add gfx1030 to targets

* fix syntax
2023-02-16 13:54:08 -06:00
Illia Silin
584d233cfe Build and archive deb packages. (#590)
* build and archive deb packages

* fix syntax

* run QA to test building packages

* apply cron to develop branch again
2023-02-16 13:11:23 -06:00
pmaybank
cb3fac4d2a Sphinx doc (#581)
* New docs directory with minimal config

* Based on docs directory of rocBLAS

* Config for running Doxygen then Sphinx to generate HTML

* Add minimal content - intro to doc

* Add some boilerplate sections to doc

* content still needs to be done,
* e.g., need to generate API documentation using Doxygen
* need to write contributor guide

* Start Softmax section of Support Primitives doc

* Written as a test bed for typesetting math content

* Need to decide how much detail to go into

* add doc directories to git ignore file.

* Minor edits - new line at EOF, change year in copyright notices

* Port Markdown files to ReStructuredText

* Copy Markdown files from pre-existing doc directory to docs directory

* Convert to reStructured Text (rst) - section headings, links, tables
  have a different syntax in rst

* New rst files added to index - can generate HTML with same style as
  HTML generated from rst files in previous commits

* Intention is to make all the content in doc redundant and use rst
  throughout rather than mix of md and rst

* Extend Softmax section of Primitives Guide

* rename l to z

* add material on applying softmax row-wise to matrix

* define macro for diag operator (represents diagonal matrix)

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 17:17:46 -06:00
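The Sphinx doc commit above adds a Softmax section covering row-wise softmax applied to a matrix. A minimal numerically stable sketch of that operation (plain Python, illustrative only — not the CK device primitive):

```python
import math

def softmax_rows(mat):
    # Numerically stable row-wise softmax: subtract each row's max
    # before exponentiating, then normalize by the row sum.
    out = []
    for row in mat:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

Subtracting the row maximum leaves the result unchanged mathematically but avoids overflow in `exp`, which is the standard trick the documented primitive relies on.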
Illia Silin
19490ac4f7 Clean up kernel launch output (#569)
* clean up output from kernel_launch

* set RUN_WARMUP to 0 by default

* split the warm-up into a separate issue

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 12:07:21 -06:00
zjing14
24c9ee1d22 Add contraction_fp64 example (#570)
* add contraction_bilinear

* add contraction_scale_xdl_fp64

* reduce tile size to avoid register spill

---------

Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
2023-02-15 12:00:58 -06:00
rocking5566
6a6163a3d1 Improve normalization (#580)
* Sync the order of type string with template parameter

* Add more instances

* Check the vector size and remove redundant var

* Extract var to static, prepare to separate sweep once kernel

* Separate sweeponce flow and optimize the flow

* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization

* Remove useless code

* Update naive variance kernel

* Refine string

* Fix typo

* Support naive variance for device_normalization

* Check the blocksize

* Share the VGPR of x and y

* Share the VGPR of gamma and beta

* Add more instances

* Support fp16 sqrt for experiment

* Add CHANGELOG

* Fix typo

* clang-format
2023-02-15 11:59:35 -06:00
Haocong WANG
0cfda84d05 [Navi3x] Add Device Operations (#567)
* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

* navi3x_multipleD+example

* temp save

* workable

* batchedgemm[OK], groupconv[debug]

* groupconv: Sanity check[OK], Performance[Bad]

* navi3x_groupconv_need_optimization

* format

* Add arch limitation to all wmma examples

* fix bug: example30 input conv args
2023-02-15 11:50:51 -06:00
Adam Osewski
e9fd122889 Conv3D FWD BWD WRW fp16 fp32 client examples (#559)
* Conv3d bwd weight client example.

* Update year in license

* Convolution bwd data 3D fp16/fp32 client example.

* Client example for convnd fwd fp16 fp32

* clang-format

* Review remarks.

* Fix compiler err.

* Update data layout to standard one.

* Add conv 3d fwd NDHWGC instances

* clang-format

* Conv3d fwd NDHWGC instances.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-15 11:16:47 -06:00
Illia Silin
06f1fc864c Remove the workaround for bf16 attention tests. (#586)
* remove workaround in bf16 attention test

* clean up another workaround
2023-02-14 18:06:24 -06:00
Adam Osewski
8f42780fd6 GroupedGEMM: add larger tiles. (#577)
* Add larger tiles.

* Remove failing instance.

* Remove instances that don't improve perf.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2023-02-13 10:06:24 -06:00
Illia Silin
0ac0f51ad6 enable batched_gemm_softmax_bf16 tests (#582) 2023-02-10 13:00:37 -06:00
rocking5566
f7d28f3e4b Gemm+layernorm instance, ckProfiler, client example (#568)
* Add gemm + layernorm instance

* Add ckProfiler

* Add test

* Add client example

* Detect if user forgot to set the workspace

* Use literal in the example

* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()

* check gemm validity in IsSupportedArgument

* Add more testcases

* Merge duplicated folder in client example

* Print more information

* Use better kernel parameter for MS problem size

* clang format

* Add constexpr for if condition and remove redundant include

* Remove cstdlib and add constexpr
2023-02-09 15:02:55 -06:00
guangzlu
76d144fa7c Add instance for elementwise normalization (#573)
* added instances for large N

* add instance for elementwise normalization

* added support restriction in device_elementwise_normalization_impl.hpp
2023-02-09 09:37:29 -08:00
Illia Silin
b63accee2b adding the first draft of changelog (#571)
* adding the first draft of changelog

* second draft of changelog
2023-02-08 17:25:53 -06:00
ltqin
332ccc3367 Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576)
* add instance for gemm bias softmax gemm

* add client example

* change CGridDesc_G_M_N to CGridDesc_G_M_O

* add gridwise

* change c grid name

* add D0s data to device op

* fix 08 client_example

* add example 47_fused_attention

* example output correct

* add d0 to example

* add d0 element op

* rechange instance code

* change Acc0ElementwiseOperation to C0DEElementwiseOperation

* change example name

* update instance for cdeelementwiseop

* add bhalf_t ScaleAdd

* add test

* do not support gemm1 bias

* remove some ignore

* fix test bug
2023-02-08 14:34:45 -06:00
Illia Silin
bb3d9546f1 Fix a couple more CI issues. (#578)
* test the QA cron parameter for compiler commit

* create separate dockers for latest and fixed amd-stg-open compiler versions

* change groovy syntax

* apply cron timers back to develop branch
2023-02-08 11:50:09 -06:00
Illia Silin
f73574ffdd Fix CI issues. (#572)
* switch to recent staging compiler as default for CI

* fix the baseline query

* roll back sqlalchemy to version 1.4.46
2023-02-06 13:15:45 -06:00
Rostyslav Geyyer
afdfef74f7 Add the markdown tutorial hello world (#563)
* Add the markdown tutorial

* Clean up

---------

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
2023-02-01 15:56:59 -06:00
who who who
ba40c2ce9d remove unused variable (#564)
* remove unused variable

* format code
2023-01-31 10:34:35 +08:00
Adam Osewski
274108d6e6 Use defined seed for deterministic test runs. (#562)
Co-authored-by: Adam Osewski <aosewski@amd.com>
2023-01-30 13:03:59 -06:00
Adam Osewski
7494c1c611 Add more instances for irregular GEMM sizes. (#560)
Co-authored-by: Adam Osewski <aosewski@amd.com>
2023-01-26 13:42:20 -06:00
Qianfeng
a1b2441f8d Batchnorm inference instances, external API, client examples and gtests (#531)
* File renaming and class renaming for device element-wise operation

* Add batchnorm-infer instances, external API and client example

* Add batchnorm-infer profiler module and gtests

* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp

* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer

* Rename class and file due to conflict from device_elementwise_2d.hpp

* Fix namespace in batchnorm_infer_nhwc client example
2023-01-25 17:09:04 -06:00
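The batchnorm inference work above wires fixed running statistics plus the learned affine transform into a device elementwise op. A scalar sketch of the underlying math (hypothetical helper name, not CK's API):

```python
import math

def batchnorm_infer(x, mean, var, gamma, beta, eps=1e-5):
    # Inference-time batchnorm: normalize with the running mean/variance
    # gathered during training, then apply the learned scale and shift.
    return gamma * (x - mean) / math.sqrt(var + eps) + beta
```

Because mean and variance are constants at inference time, the whole expression collapses to a per-channel fused multiply-add, which is why it fits naturally into an elementwise device op as the commits describe.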
Qianfeng
52abc2f371 Use double for all scaling values and floating-point constant values in the Device Op API (#557)
* Use double as alpha/beta values type in reduce device op api

* Use double as alpha/beta values type in softmax device op api

* Use double as alpha/beta values type in multiple-reduce device op api

* Use double as epsilon value type in normalization/elementwise-normalization device op api
2023-01-18 12:02:50 -06:00
Raman R jana
1cfa87608a Wavelet (inter-wave consumer-producer) GEMM (#310)
* wavelet gemm programming model support for CK

* GEMM pipeline update for wavelet programming model

* Updated wavelet programming pipeline

* fixes for global-write for math-wave

* fixed bug in global writes

* Updated comments for better readability

* fixed clang format errors

* added block_lds without barrier sync

* clean

* clean

* clean

* clean

* refactor

* prototype

4 layouts

fix default stride

all problem sizes

tidy

move file

update build script

restore old file

fix build

* refactor standalone test to use gemm test harness

* simplify gemm test

* update build script

* remove redundant

* early return when cmd arg doesn't match

* tidy

* report failure when result not validated

* tidy

* Add comment depicting B2C mapping pattern.

* Formatting & comments.

* Comparison with custom B2C mapping pattern.

* Example for wavelet gemm.

* Add wavelet to Gemm standalone test.

* Remove debug code.

* Remove dangling #endif directive.

Co-authored-by: root <Raman Jana>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2023-01-18 12:00:02 -06:00
ltqin
d66421fe34 Add multiD Gemm client APIs (#534)
* start add example

* fix config

* fix showinfo bug

* add an elementop

* change to padding

* add xdl example

* change elementwiseop

* add instance

* add instance to profiler

* change file name

* fix device-not-supported issue

* add client example

* fix client gemm_add_multiply name

* change AddMultiply elementwiseop

* fix elementwiseop

* fix client example

* fix addmultiply op

* fix comments and fun name

Co-authored-by: letaoqin <letaoqin@amd.com>
2023-01-18 11:53:56 -06:00
Illia Silin
00ff30af8c fix a bug for 6-dim kernels (#555) 2023-01-18 11:44:11 -06:00
who who who
147b7db561 add multi embeddings support (#542)
* add multi embeddings support

* fix format

* optimize sqrt

* add reduce operation

* change to elementwise op

* fix name

* rename

* run ci cd

* format example

* format code

* format code
2023-01-18 11:32:12 -06:00
ltqin
55236709e2 Add client API/examples for 3xGemm+Bias+Add+Permute{0, 2, 3, 1} (#550)
* add example

* fix example

* add instance for gemm permute

* add to client example

* change configs

* change instance file name

* format

* change client example file name and remove example
2023-01-18 10:52:52 -06:00
Qianfeng
80e0526741 Reduction external API and client examples (#493)
* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the functional call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2023-01-16 22:18:06 -06:00
rocking5566
7829d729fb Gemm layernorm welford (#413)
* Add device op of gemm layernorm

* [What] Rename F to H
[Why] F and G prepare for welford tensor

* Add gridwise gemm + welford

* Extract template parameter

* Rename kernel. Prepare to add second half kernel

* Extract var

* Add second kernel for gemm+layernorm

* Move to the gemm_layernorm folder

* Rename F and G to mean and var

* Do not use snakeCurved; it makes determining padding for welford difficult

* Rewrite the device interface and rename some var

* Add welford count

* Update interface

* Sync code, prepare to test on MI200

* Clean the code

* Implement layernorm

* Add comment to mention hipFree

* Write out e for debugging.
This could be removed and h used instead

* 1. Allocate mean, var and count by SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size

* Add gemm layernorm host code

* use reference layernorm

* Fix bug of blockwise welford for first kernel

* Fix bug of mean var padding for layernorm

* Use sgpr for shuffleM_index

* padding for GemmMeanVarCountGridDescriptor_M_NBlock

* Add layout parameter

* Check argument for gemm

* calculate max count for tail block

* Share E and H memory in device op

* Hard code the vector dim

* Refine the MakeDescriptor

* 1. Remove E parameter, because E is inside of device op
2. Check vector size

* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor

* Use 1D global memory for count

* Prevent redundant IO

* Update parameter

* Add pipeline v1/v2 selector

* Rename the example name

* Add base class for gemm layernorm

* Refine naming to distinguish naive and welford

* Add comment to explain in detail

* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1

* Rewrite the 2nd kernel, use multiple blocks along the N dimension in the layernorm kernel

* Share the vector size

* Refine var name

* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce

* Add comment

* Extract divisor out of the loop in reference layernorm

* Pad different size for E and H in layernorm kernel according to different block tile

* Refine naming

* Refine naming

* Prevent implicit cast

* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf only supports float; double would cause casting

* Cast only constant

* Change of post shuffle thread descriptor

* Add EMeanVarDataType parameter.

* Merge the mean and var threadwise copy

* Add missing index

* Fix Typo

* Sync the variable with previous if

* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2023-01-16 20:08:25 -06:00
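The gemm+layernorm commits above build the mean/variance with Welford's single-pass update (blockwise welford, then a merge across blocks) rather than a naive two-pass reduction. A minimal sketch of the per-element update rule (plain Python, illustrative only):

```python
def welford(xs):
    # Welford's online algorithm: one pass over the data, keeping a
    # running mean, sum of squared deviations (m2), and sample count.
    mean, m2, count = 0.0, 0.0, 0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count          # update running mean
        m2 += delta * (x - mean)       # accumulate squared deviations
    var = m2 / count if count else 0.0  # population variance
    return mean, var, count
```

Keeping `(mean, m2, count)` triples is also what makes the blockwise version mergeable: partial triples from different blocks (or the tail block with a smaller count, as one commit notes) can be combined without revisiting the data.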
Haocong WANG
919aeb1f52 [Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541)
* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size
2023-01-16 20:06:01 -06:00
Illia Silin
715e8dd241 Add a flag to enable/disable debug output in many kernels. (#549)
* add DEBUG_LOG macro to enable/disable debug output

* fix syntax

* fix syntax again

* fix syntax one more time

* remove blank spaces

* use ifdefs

* add the Print argument

* move the definition of DEBUG_LOG to ck.hpp

* add the missing argument to Print()
2023-01-11 19:55:56 -06:00
Qianfeng
a17b041486 Remove including of cmath (#551)
* Include cmath only when compiling host code in math_v2.hpp

* Remove including of cmath in device_base.hpp and device_permute.hpp
2023-01-11 19:52:47 -06:00
zjing14
0345963eef Add MNK padding, M = 0 support into grouped_gemm (#539)
* add mnk padding, support m=0

* clean code

* clean code

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2022-12-15 15:07:24 -06:00
Illia Silin
1115117503 disable the attention test that fails on MI100 (#540) 2022-12-15 10:20:21 -06:00
Qianfeng
10c72aced8 Add interface GetTypeIdName() and GetTypeIdHashCode() for Device Op (#533) 2022-12-14 18:34:02 -06:00
Rostyslav Geyyer
9a1f2475e3 Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances to enable arbitrary problem size (#535)
* Add padding device_gemm_add_add_fastgelu_xdl_c_shuffle instances

* Add padding device_gemm_add_fastgelu_xdl_c_shuffle instances

* Add gemm_add_fastgelu profiler impl

* Add padding device_gemm_fastgelu_xdl_c_shuffle instances

* Add gemm_fastgelu profiler impl
2022-12-14 18:12:09 -06:00
Rostyslav Geyyer
74744cab3e Add a docker hub doc file (#538) 2022-12-14 12:17:28 -08:00
arai713
0e5c264c3e Gridwise elementwise 2d (#466)
* added 2d gridwise elementwise

* added 2d version of device elementwise

* added example file with updated device elementwise call

* added Cmake file

* changed NumDim into 2D

* fixed compiler issues

* fixed indexing for loop step

* fixed NumDim dimension error

* changed blockID to 2D

* updated Grid Desc

* updated kernel call

* fixed 2d thread indexing

* added dimensions for example file

* commented out unused code

* changed vector load

* removed extra code

* temporarily removing vector load on 2nd dim

* changed vector load back, still causing errors

* altered indexing

* changed isSupportedArgument for 2D

* changed indexing + do/while

* fixed isSupportedArgument

* changed dimension for debugging

* fixed

* added testing printouts

* testing change

* added variables to distribute threads through both dimensions

* testing changes

* integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise

* removed most of the extraneous code, testing with different dimensions

* testing

* removed debugging print statements

* moved 2d elementwise permute into elementwise permute directory

* fixed formatting

* removed debugging comments from threadwise transfer

Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2022-12-12 09:18:10 -06:00
Illia Silin
d58b7f5155 Make sure that GEMM sizes in K dimension are supported. (#527)
* apply new K-dimension check in gemm_xdl_cshuffle

* add K-dim check to gemm_xdl and batched_gemm_xdl

* fix syntax

* fix syntax

* clean-up the debug output
2022-12-08 11:48:43 -06:00
Po Yen Chen
614a7b1bb0 Fix Grouped ConvBwdWeight test case failure (#524)
* Use smaller tensor size in test

* Use an even smaller tensor size

* Touch only failing test case inputs
2022-12-07 17:46:28 -06:00
Rostyslav Geyyer
c7a4d36147 Add padding device_gemm_xdl instances (#529)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-12-07 17:46:03 -06:00
guangzlu
ce87b4f765 modified half function in math_v2.hpp (#528)
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-12-07 17:43:02 -06:00
Illia Silin
d072790fe2 Fix CI error. (#530)
* ignore .git folder when doing clang-format

* fix syntax

* add backslashes before quotes

* add path filter for several extensions
2022-12-06 15:09:51 -06:00
Anthony Chang
d156709432 Fix bug where scaling may not be applied in some code path (#526)
* fix bug where scaling may not be applied in some code path

* more test

* revert accidental example code changes
2022-12-02 11:43:34 -06:00
ltqin
23ecf0fa9e Add multiple d gridwise gemm on Navi21 for ResNet50 (#517)
* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add header file (was deleted before)

* fix device-not-supported issue

* format

* remove passthrough check

Co-authored-by: letaoqin <letaoqin@amd.com>
2022-12-02 11:42:31 -06:00