composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

Author	SHA1	Message	Date
Illia Silin	bba085d2b5	Refactoring cmake files to build data types separately. (#932 ) * refactor cmake files for the tests * refactor cmake files for examples * fix cmake for gemm example * fix the cmake file for all examples * add splitting by data types in gemm_splitk instance header * rename test to reflect only dl instances are used * clean up CI workspace, update cmake for instances * change the jenkinsfile syntax * build all instances except DL on gfx11 * move workspace cleanup after stages * clean up workspace after every stage * isolate data types in grouped_conv_fwd header * isolate dl instances for grouped_conv2d_fwd * fix syntax * fix cmake and batchnorm instances * fix typo * fix reduction instances * fix grouped_conv headers * fix syntax * replace parsing logic for instances, replace bfp16 with bf16 * fix the client examples build * clean up DTYPES from instances cmake files * update the parsing logic in cmake files * make an exception for reduction kernels * update few remaining cmake files to handle DTYPES * fix syntax * fix cmake conflicts * replace f8 with fp8 test name * resolve conflicts for dpp instances	2023-09-20 22:15:56 -07:00
Illia Silin	08eb176929	Allow building CK for specific data types and split off last remaining DL instances. (#830 ) * properly split conv_nd_bwd_data instances * split conv2d_fwd instance data types * split the gemm, conv2d_fwd and batched_gemm_softamx_gemm * split the tests by data types where possible * filter examples by DTYPES * split few remaining examples by DTYPES * filter most instances by DTYPES * add new lines at end of headers, fix grouped_gemm profiler * fix syntax * split the ckprofiler instances by DTYPES * split the conv2d and quantization DL and XDL instances * fix the splitting of conv2d DL instances * split softmax and pool_fwd tests for fp16 and fp32 types * fix syntax * fix the dl_int8 quantization instances isolation	2023-08-07 14:56:10 -07:00
Adam Osewski	f4dfc060b7	Move Device Ops implementations into impl directory. (#777 ) Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-07-06 16:15:51 +02:00
Illia Silin	b94fd0b227	update copyright headers (#726 )	2023-05-31 18:46:57 -05:00
guangzlu	4c4c7328a6	Add BF16 tests for batched_gemm_softmax_gemm_permute (#504 ) * fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm * added bf16 tests for batched_gemm_softmax_gemm_permute * changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp * aligned annotations * modified CMakeLists for examples * add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl * use macro to control the instances * added macro control into instances * clang-format some files * changed error tolerance for bf16 * changed index for 10_elementwise_normalization * fixed xdlops code bug in amd_xdlops.hpp Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2022-11-15 16:30:23 -06:00
Po Yen Chen	4a2a56c22f	Rangify constructor of HostTensorDescriptor & Tensor<> (#445 ) * Rangify STL algorithms This commit adapts rangified std::copy(), std::fill() & std::transform() * Rangify check_err() By rangifying check_err(), we can not only compare values between std::vector<>s, but also compare any ranges which have same value type. * Allow constructing Tensor<> like a HostTensorDescriptor * Simplify Tensor<> object construction logics * Remove more unnecessary 'HostTensorDescriptor' objects * Re-format example code * Re-write more HostTensorDescriptor ctor call	2022-11-11 11:36:01 -06:00
Anthony Chang	de37550f72	Input/output permutation for fused attention (#460 ) * reopen masking att instance due to CI is upgraded * re-enable instances previously failed on 9110 * enable ksize-kpadding pair validity test * add non-masked attention+permute test; expose masking boolean to attention kernel handles * disable bench * fix test * move files * bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute * format * amend rename * disable bench in test * add mask/no-mask test for non-permute attention kernels * disable broken kernel instance * example working add non-permuted problem statement evaluating whether overhead comes from permutation or the extra kernel arg * interface for bias addition without implementing it * test and profiler running * tidy * mask type determined by enum class * unify example code * move masking specialization to its own header * align formats * extract helper functions * experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute * add tensor specialization to template args since tensor spec packed shows perf parity when permutation isn't needed remove redundant template args comment on 'packed' tensor specialization * grouped attention with input/output permute example * format * clean up * refactor acc0 tile visitor Co-authored-by: shaojiewang <wsjmessi@163.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-10-27 14:58:20 -06:00
Adam Osewski	3048028897	Refactor device op implementations into `impl` subdirectory. (#420 ) * Move kernel implementation files under impl directory. * Update examples paths. * Update device kernel impl include paths. * Update tensor operation instances include paths. * Update profiler and tests include paths. * Clang-format * Update include paths for batched gemm reduce * Refactor UnitTest ConvNDBwdWeight. * Refactor fwd and bwd data convND UT. * Fix used test macro. * Fix include path. * Fix include paths. * Fix include paths in profiler and tests. * Fix include paths. Co-authored-by: Adam Osewski <aosewski@amd.com>	2022-10-13 09:05:08 -05:00
Shaojie WANG	ebab84b6f9	MNKO padding support on bmm+masking+scale+softmax+bmm+premute (#425 ) * add lower triangle bmm * init code for tile skipping * functionality right with lower triangle mask * add decoder lower triangular mask calculation * use 713 group fix n2 compute error * attention with lower triangle mask with tile skipping * add template to distinguish masking kernel * rename template and remove default template value * remove lower triangle gemm reference struct * add some comments on example * add 10 instance for masking bmm + scale + softmax + bmm + permute kernels * add test * add test file * add gtest for bmm masking scale softmax bmm permute * clang-format * fix compile error * check lef bottom corner for tile skipping * fix error: check left bottom corner for tile skipping * add k padding * add test and instance for MNK padding * passing a mask struct * fix instances * delete used comments * format Co-authored-by: danyao12 <yaodan@dc-smc-13.amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-09-20 12:43:53 -05:00
Anthony Chang	9287b7c6b3	Grouped batched attention + permute (#412 ) * grouped attn without batch validates; now move toward grouped batched attn * grouped batched attention * working * remove debug logging clean up clean up * reintroduce g_ prefix back to host tensor variables * format * rename file * restore old file * rename * consolidate padded/non-padded attention example * harmonize padding specialization in attn examples	2022-09-19 16:09:44 -05:00
Anthony Chang	d6709dc373	Fix gemm-softmax-gemm-permute padding cases (#409 ) * fix example; make padding on by default in example; fix argument checks * fix Gemm1KPacK which has since regressed from PR #399	2022-09-08 09:27:50 -05:00
Anthony Chang	868e5c555b	Fused attention instances & padding tests (#395 ) * modify comment * trim unnecessary check * add gemm spec in kernel name * add TNTT gemm_gemm + atten kernel instances * refactor attention padding to better fit in unit tests This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op. Also added compile-time conditionals that load OOB value as NaN only after padding is enabled * add adhoc padding test for atten * shrink input value range for attention kernel validation to avoid occasional error by 1e-3 Still unsure whether this kind of deterministic floating point accurary issue is expected or not. May want to try exact same approach as the GPU kernel in the host reference GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then, shrink the input value range as it is less likely to produce errors of around ~1e-3. * attention kernel proper granular padding for all 4 dims * IsSupportedArgument checks * test more padded cases * block PadK specialization in attention kernels * workaround clang crash for gfx908 (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class VGPR_32: Cannot scavenge register without an emergency spill slot!" this fall back to less ideal way of handle NPadding in fused attention kernel * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected	2022-09-06 14:38:56 -05:00
Shaojie WANG	45adb736e7	Padding for attention: bmm+scale+softmax+bmm kernel (#385 ) * add padding algo for bmm+scale+softmax+bmm. Version for verification * remove verification code * remove comments * add padded bmm scale softmax bmm example * format * refactor * add comments for usages of padding bmm+scale+softmax+bmm Co-authored-by: Chao Liu <lc.roy86@gmail.com>	2022-08-30 11:01:37 -05:00
Anthony Chang	e0d8806ca1	Attention with output permutation (#370 ) * comment on specialization for TensorSpecialization::Packed * gemm_softmax_gemm with output permutation * scaling * refactor MatrixPadder; rename to GemmPadder * remove old sanity check * restore original gemm_softmax_gemm * revise comment in gemm_softmax_gemm example * use GetElementSpaceSize() * remove extra header * typo * remove archaic DeviceOpPtr	2022-08-23 14:52:56 -05:00
Chao Liu	bac7df8faf	use scale (#363 )	2022-08-17 10:38:00 -05:00

15 Commits