composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-06 05:55:39 +00:00

Author	SHA1	Message	Date
Po-Yen, Chen	f7288bc2b1	Reuse same implementation code for most of GEMM examples	2022-08-19 14:47:09 -04:00
Po-Yen, Chen	ed51c0638b	Re-format template argument in example code	2022-08-19 14:31:46 -04:00
Po-Yen, Chen	5931c7ebe6	Move common codes together	2022-08-19 13:49:22 -04:00
Po-Yen, Chen	68a57e71e6	Move #include directives into new header	2022-08-19 13:24:00 -04:00
Po-Yen, Chen	42d75f356c	Sort include directives	2022-08-19 12:59:46 -04:00
Po-Yen, Chen	dd5b139401	Extract int4 example common codes	2022-08-19 12:57:36 -04:00
Po-Yen, Chen	3e2f37a148	Re-format GEMM instance template arguments	2022-08-19 12:02:57 -04:00
Po-Yen, Chen	c1fbabea04	Avoid too much generalizing check_err()	2022-08-19 11:59:21 -04:00
Po-Yen, Chen	4d4a659cd6	Use ""_uz to simplify example code	2022-08-19 11:54:51 -04:00
Po-Yen, Chen	3e2371c554	Align design with other PR	2022-08-19 11:44:08 -04:00
Po-Yen, Chen	503f07c1e0	Add constraint to check_err() input reference type	2022-08-19 11:34:19 -04:00
Po-Yen, Chen	2fb766e852	Simplify tensor usages in examples	2022-08-19 11:33:25 -04:00
Po-Yen, Chen	0d5025befe	Add #error directive to prevent compile sources with wrong setting	2022-08-19 10:51:30 -04:00
Po-Yen, Chen	625f95ade4	Remove debug messages	2022-08-19 10:05:44 -04:00
Po-Yen, Chen	84843aa36f	Avoid compilation error while disabling ck::int4_t support	2022-08-19 09:54:03 -04:00
Po-Yen, Chen	51d0c6794c	Remove constraint of Tensor<>::CopyAsType()	2022-08-19 05:31:04 -04:00
Po-Yen, Chen	c34f8411c4	Check converted Tensor<int4_t> with golden Tensor<int8_t>	2022-08-19 04:40:13 -04:00
Po-Yen, Chen	a83c006098	Allow comparing different-sized integral types in check_err()	2022-08-19 04:39:20 -04:00
Po-Yen, Chen	726c115393	Add type constraints for integer version check_err<>()	2022-08-19 03:48:20 -04:00
Po-Yen, Chen	f2c148efae	Add type traits 'is_signed_integral<>'	2022-08-19 03:47:22 -04:00
Po-Yen, Chen	463d15f9b5	Add constraint to Tensor<> templated methods	2022-08-19 03:27:41 -04:00
Po-Yen, Chen	f3f61f836b	Complete the int4 examples	2022-08-19 02:19:50 -04:00
Po-Yen, Chen	2dc3357a20	Fix typo in alias names	2022-08-19 01:41:20 -04:00
Po-Yen, Chen	79480f0aee	Re-use element-wise operation type alias	2022-08-19 01:39:46 -04:00
Po-Yen, Chen	dd849a8736	Re-use CopyAsType<>() to implement copy ctor	2022-08-19 01:02:36 -04:00
Po-Yen, Chen	e03cece9c4	Use different type for host tensors	2022-08-19 00:32:57 -04:00
Po-Yen, Chen	89a827cab9	Re-format source files	2022-08-19 00:32:24 -04:00
Po-Yen, Chen	cbbe2485b2	Allow conversion between Tensor<> specializations	2022-08-19 00:30:53 -04:00
Po-Yen, Chen	30ed3e218c	Add int4_t support for check_err()	2022-08-19 00:30:28 -04:00
Po-Yen, Chen	194faf7837	Distinguish user-side type from kernel-side type	2022-08-18 23:43:19 -04:00
Po-Yen, Chen	70c87970ec	Re-use pre-defined alias in int4 exmples	2022-08-18 23:29:38 -04:00
Po-Yen, Chen	4b153bd974	Add GEMM examples for int4. Currently the source files are just copied from int8 examples	2022-08-18 23:03:36 -04:00
Illia Silin	9efd033bee	restart the stages on MI200 in case of failures (#366 ) * restart the stages on MI200 * fix the docker image storage issue	2022-08-18 14:54:47 -05:00
Adam Osewski	e00149ac67	int4 data type (#364 ) * Introduce int4 data type. * Add unit-tests for int4 * Compile int4 UT only when int4 enabled. * clang-format Co-authored-by: Adam Osewski <aosewski@amd.com>	2022-08-18 14:53:47 -05:00
Chao Liu	bac7df8faf	use scale (#363 )	2022-08-17 10:38:00 -05:00
Anthony Chang	c961ce9226	Hotfix LDS data hazard in fused attention (#360 ) * avoid LDS data hazard in gemm_softmax_gemm pipeline * trivial refactors * comments * shrink blockwise gemm v2 thread buffer size * reclaim A block lds space when during 2nd gemm * amend * amend	2022-08-15 12:04:20 -05:00
Qianfeng	53ea4713af	Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320 ) * Implement multiple-reduction in one kernel (kernels, device ops, examples) * Add generic elementwise kernel and device interface * Add generator for normal-distributed data initialization * Add host refer implementation of batchnorm-forward and batchnorm-infer * Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels * Remove un-needed including in batchnorm example * Renaming generic_elementwise to elementiwise in kernel and device classes/functions * Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise * Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise * Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise * Add DeviceElementwiseBase and use it in device_normalize_instance.cpp * Removing and renaming files * Update to synchronize gemm_layernorm client example to the generic element-wise device op API * Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming * Merge two static member functions in device_elementwise.hpp * Remove unary_elementwise_1d kernel and device	2022-08-15 10:11:02 -05:00
Chao Liu	5ee304595c	fix build issue (#357 ) * fix build * excludeexample_gemm_max_xdl_fp16 from testing due to random failure on gfx908	2022-08-13 15:58:31 -05:00
cloudhan	fb1cbf025b	Change all device operations to use add_instance_library (#338 ) * Change all device operations to use add_instance_library to avoid duplicated cmake configuration. * update DeviceMem Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-08-13 12:17:58 -05:00
rocking5566	0bd6b842b9	Layernorm welford (#346 ) * Add threadwise and blockwise welford * Rename gridwise op, prepare to add welford version * implement welford and integrate welford into layernorm * Take care of tail loop * Fix buf when ThreadSliceK > 1 * Fix bug of merging of two empty set * Rename clip to clamp * 1. Fix type of count 2. Remove useless static_assert * Do not inherit Reduction::Argument * [What] replace __syncthreads() with block_sync_lds() [Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0) * Add y stride * Rename. DeviceLayernorm -> DeviceLayernormImpl DeviceNormalization2 -> DeviceLayernorm * Move literal ""_uz & ""_zu into namespace 'literals' * Move namespace 'literals' as 'ck::literals' Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-08-13 09:43:18 -05:00
Anthony Chang	c20a75b07d	Fused GEMM+GEMM (#351 ) * initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring an unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * add gemm_gemm instances and tests * avoid LDS data hazard * fix build Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-08-13 09:18:58 -05:00
ltqin	10b3278b05	Skip lds of b matrix (#326 ) * start * read for gridwise gemm * add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1 * add thread copy desc and register buffer * add K0PerBlock dim * add read global data * finish gridwise gemm * finish blockwise gemm * add print data * add smallest config * add compare code for gridwis gemm * fix NXdlPerWave * fix k0perthread and gridewis gemm main loop * remove b matrix lds alloc * fix name * add test code * create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter * add double register * modify b_thread_desc_ * add float * fp16 tag * add tail for pipeline * finish main loop * optimize main loop * start clear gridwise gemm * clear code * clear redundant code * change file name * change file name * fix bug after merge develop * fix input parameters * using MultiK0 control b load data loop * fix some config * 4 buffer * fix bug * one can use * change read order * change buffer array to tuple * change to 8 buffer * interleave buffer load * change to 16 * read 8 buffer * add data buffer to template * fix after merge develop(head file) * format * change to 4 buffer * remove unnecessary lambda fun	2022-08-13 01:35:49 -05:00
Qianfeng	14932e8de3	Add examples for reduction fp16/fp32/bp16/int8/fp64 for 3d/4d/5d (#342 ) * Update the reduce_blockwise example to support user specified data type and input+reducing dimensions * Add examples for using reduce_multiblock_atomic_add * Add more running examples to the default command-line * Remove un-necessary header including * Update to the example README.md	2022-08-13 01:10:01 -05:00
rocking5566	6c3c06bf1f	Gemm multiple d multiple r (#335 ) * Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction * Implement run of kernel * Add example * Fix parameter of typo * Rewrite the reduceMax example * Rewrite the reduceMean + reduceMeanSquare example * Refine naming * Refine folder name * refine naming * Rewrite the gemm + bias + relu + add + layernorm example * Rewrite the gemm + layernorm example * clang-format * Fix bug if sync lds * Fix compile error	2022-08-13 01:07:12 -05:00
Anthony Chang	cac014f173	Fused attention (#345 ) * initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring an unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * attention host validation * add blockwsie softmax v1 * iteratively update softmax+gemm * transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum * add init method for easier debugging * do away with manual thread cluster calculation * generalize blockwise softmax interface * row-wise softmax sum & max * format * rename to DeviceBatchedGemmSoftmaxGemm * add gemm_softmax_gemm instances and tests * comment Co-authored-by: ltqin <letao.qin@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-08-13 00:16:14 -05:00
Po Yen Chen	a670a5a092	Move literal ""_uz & ""_zu into namespace 'ck::literals' (#354 ) * Move literal ""_uz & ""_zu into namespace 'literals' * Move namespace 'literals' as 'ck::literals'	2022-08-12 17:48:35 -05:00
Rostyslav Geyyer	0c6ef7c14e	Add example of conv_fwd_bias_relu_add for int4, int8, bfp16, fp16, and fp32 (#343 ) * [LWPCK-359] Initial commit * Working version for fp16, add results to readme * Update according to PR #341 * Update results in readme * Add fp32 example * Add bf16 example * Update fp16 and fp32 examples * Add int8 example * Add separate lengths and strides tensors for D tensors Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>	2022-08-12 15:30:27 -05:00
zjing14	35e49f2de6	add g; fixed strides (#355 )	2022-08-12 15:22:39 -05:00
Illia Silin	de60d290b6	Build docker only once in CI, fix conv_bwd logfile names. (#353 ) * build docker in separate stage * build docker with only one prefix * add parallel statement * add docker repo url * fix the name of perf_conv_bwd_data log file	2022-08-12 12:30:37 -05:00
Po Yen Chen	68b61504a3	Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340 ) * Add always_false<> util to delay symbol resolution * Use always_false<> to prevent trying instantiate unwanted method * Add new specializations of AddAddFastGelu::operator() method * Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32 * Use floating point literal to simplify code * Remove unnecessary capture in lambda expressions * Extract fast GeLU calculation as standalone method * Mark methods as 'constexpr' * Add constraint for HostTensorDescriptor templated ctors * Simplify HostTensorDescriptor ctor calls * Add C++23 std::size_t literal suffix * Use _uz suffix to shorten example code * Remove unnecessary conversion to std::array<> * Re-order include directives * Remove C-style casting by literal suffix * Remove unnecessary statements in main() * Remove unused type parameter of always_false<> * Remove unused include directive * Exit main() by returning meaningful value * Use 'if constexpr' to switch example flow * Use std::is_same_v<> to shorten example code * Add 'inline' specifier to literal functions * Unify output methods in example * Move common codes into .inc file * Add type check in type_convert<>() * Add type_convert<float>() before computation * Merge AddAddFastGelu method specializations * Remove always_false<> * Add constraint to AddAddFastGelu::operator() parameter types	2022-08-11 17:31:28 -05:00


702 Commits