Commit Graph

118 Commits

Author SHA1 Message Date
Shaojie WANG
e26256cd7d Padding for attention: bmm+scale+softmax+bmm kernel (#385)
* add padding algo for bmm+scale+softmax+bmm. Version for verification

* remove verification code

* remove comments

* add padded bmm scale softmax bmm example

* format

* refactor

* add comments for usages of padding bmm+scale+softmax+bmm

Co-authored-by: Chao Liu <lc.roy86@gmail.com>

[ROCm/composable_kernel commit: 45adb736e7]
2022-08-30 11:01:37 -05:00
Anthony Chang
65e451c3ca Try to workaround flaky GemmSoftmaxGemm tests (#386)
* avoid potential hazard; flaky test issue persists

* pin down the random seed to avoid flakiness

[ROCm/composable_kernel commit: 138faf3961]
2022-08-29 08:40:25 -05:00
zjing14
80c26d544c Fixed splitk gemm fp32 (#384)
* add scripts

* fixed splitK_gemm_fp32

* clean

* clean

[ROCm/composable_kernel commit: 9881625b2d]
2022-08-26 09:59:50 -05:00
Adam Osewski
af86bff9d4 More int4 tests. (#374)
* More int4 UT.

* Disable BitwiseRepresentation UT.

* Add UT with static_cast

* Surround cout statements with #if

Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: 57fadf6fb9]
2022-08-25 17:20:23 -05:00
Rostyslav Geyyer
3190273630 Add int4 example for convnd_fwd_bias_relu_add (#375)
* Add int4 example for convnd_fwd_bias_relu_add

* Fix AddReluAdd for building without int4 support

* Update CMakeLists.txt

* Format

* Convert int4 tensors for int8 kernel

* Fix device memory allocation

* Format

* Format

[ROCm/composable_kernel commit: b73ae24234]
2022-08-25 17:08:43 -05:00
Po Yen Chen
7a6b0d3996 Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle (#378)
[ROCm/composable_kernel commit: 88e43744d8]
2022-08-24 10:12:54 -05:00
Anthony Chang
183312116d Attention with output permutation (#370)
* comment on specialization for TensorSpecialization::Packed

* gemm_softmax_gemm with output permutation

* scaling

* refactor MatrixPadder; rename to GemmPadder

* remove old sanity check

* restore original gemm_softmax_gemm

* revise comment in gemm_softmax_gemm example

* use GetElementSpaceSize()

* remove extra header

* typo

* remove archaic DeviceOpPtr

[ROCm/composable_kernel commit: e0d8806ca1]
2022-08-23 14:52:56 -05:00
zjing14
7566771bdd Add examples of batched/grouped/SplitK Gemm for int8/bfp16/fp16/fp32 (#361)
* add examples into grouped/batched_gemm

* adding splitK examples

* fixed splitK

* add bfp16 int8 example into splitK

* formatting

* use static_cast

* added common for batched_gemm

* add commons for examples of splitK/batched/grouped_gemm

* return true

* adjust splitK check tol

* update example

Co-authored-by: Chao Liu <lc.roy86@gmail.com>

[ROCm/composable_kernel commit: 6091458300]
2022-08-23 14:41:56 -05:00
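The split-K scheme these examples exercise partitions the K (reduction) dimension across workgroups and sums the partial products afterwards. A minimal host-side sketch of the idea (illustrative only, not CK's device implementation):

```python
def splitk_gemm(a, b, k_batch):
    # Split-K GEMM sketch: partition the K dimension into k_batch chunks,
    # compute a partial product per chunk, then reduce (sum) the partials.
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    chunk = (k + k_batch - 1) // k_batch
    for s in range(0, k, chunk):  # one "split" per workgroup in the real kernel
        for i in range(m):
            for j in range(n):
                c[i][j] += sum(a[i][t] * b[t][j] for t in range(s, min(s + chunk, k)))
    return c
```

Because each split accumulates independently, floating-point summation order differs from a plain GEMM, which is why validation uses a relaxed tolerance (the "adjust splitK check tol" commit above).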
Po Yen Chen
277be4385b Add example of Gemm + AddAddFastGelu (data type: int4) (#369)
* Add custom target to bundle examples together

* Add int4 example conditionally (just copy from int8 example)

* Extract common code into common.hpp

* Move ref gemm type alias into data-type-specific sources

* Add #error directive to prevent compile with wrong setting

* Let AddAddFastGelu support int4 parameter type

* Let check_err() support int4 parameter type

* Add wrapper function to hide value conversion while copying memory

* Finish int4 example for GEMM + AddAddFastGelu

* Add new DeviceMem API to copy memory

* Use new DeviceMem API to implement examples

* Fix wrong use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'

* Revert "Add new DeviceMem API to copy memory"

This reverts commit e26e7af71e.

* Add conversion ctor for Tensor<>

* Add 'const' specifier to Tensor<>::CopyAsType()

* Convert Tensor<> values before/after transfer between host & device

[ROCm/composable_kernel commit: 2327f1a640]
2022-08-23 10:38:41 -05:00
Anthony Chang
7171bd452d Implement padding and sanity checks for fused GEMM+GEMM (#376)
* GemmPadder and GemmGemmPadder

* proper padding using GemmGemmPadder

* test gemm_gemm padding

* properly check size K in IsSupportedArgument()

* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()

* comment

* format

[ROCm/composable_kernel commit: f4047c9418]
2022-08-23 10:01:02 -05:00
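The GemmPadder/GemmGemmPadder utilities referenced above exist because tile-based kernels need each GEMM dimension to be a multiple of the block tile size; sizes that are not are rounded up. A tiny sketch of the size rounding involved (hypothetical helper names, not CK's actual API):

```python
def pad_to_multiple(length, tile):
    # GemmPadder-style padding: round a dimension up to the next tile multiple.
    return (length + tile - 1) // tile * tile
```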
rocking5566
18fd5307d7 [What] Fix bug of verification fail on E Matrix (#371)
[Why] We need to sync LDS even in the first loop because Gemm also uses the same LDS.

[ROCm/composable_kernel commit: c366de553e]
2022-08-22 07:50:28 -05:00
Adam Osewski
2332d3657d int4 data type (#364)
* Introduce int4 data type.

* Add unit-tests for int4

* Compile int4 UT only when int4 enabled.

* clang-format

Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: e00149ac67]
2022-08-18 14:53:47 -05:00
Chao Liu
f04f06158e use scale (#363)
[ROCm/composable_kernel commit: bac7df8faf]
2022-08-17 10:38:00 -05:00
Anthony Chang
46ce027d9e Hotfix LDS data hazard in fused attention (#360)
* avoid LDS data hazard in gemm_softmax_gemm pipeline

* trivial refactors

* comments

* shrink blockwise gemm v2 thread buffer size

* reclaim A block lds space during the 2nd gemm

* amend

* amend

[ROCm/composable_kernel commit: c961ce9226]
2022-08-15 12:04:20 -05:00
Qianfeng
d6dd154286 Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host refer implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove un-needed including in batchnorm example

* Renaming generic_elementwise to elementwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device

[ROCm/composable_kernel commit: 53ea4713af]
2022-08-15 10:11:02 -05:00
rocking5566
8ff56b19d5 Layernorm welford (#346)
* Add threadwise and blockwise welford

* Rename gridwise op, prepare to add welford version

* implement welford and integrate welford into layernorm

* Take care of tail loop

* Fix bug when ThreadSliceK > 1

* Fix bug of merging of two empty set

* Rename clip to clamp

* 1. Fix type of count
2. Remove useless static_assert

* Do not inherit Reduction::Argument

* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)

* Add y stride

* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm

* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'

Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 0bd6b842b9]
2022-08-13 09:43:18 -05:00
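Welford's algorithm, which this PR moves layernorm onto, computes mean and variance in a single pass and, crucially for the blockwise version, lets partial results from different threads be merged. A generic sketch of the threadwise update and blockwise merge steps (standard formulas, not CK's implementation):

```python
def welford_update(count, mean, m2, x):
    # Streaming (threadwise) Welford update for one new element x.
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(c_a, mean_a, m2_a, c_b, mean_b, m2_b):
    # Combine two partial (count, mean, M2) results -- the blockwise reduction step.
    count = c_a + c_b
    delta = mean_b - mean_a
    mean = mean_a + delta * c_b / count
    m2 = m2_a + m2_b + delta * delta * c_a * c_b / count
    return count, mean, m2
```

Variance is then `m2 / count`, so layernorm never needs a separate second pass over the data.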
Anthony Chang
37f61cbba5 Fused GEMM+GEMM (#351)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in the lowest two dimensions only

* make c descriptors contain only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring a unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* add gemm_gemm instances and tests

* avoid LDS data hazard

* fix build

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: c20a75b07d]
2022-08-13 09:18:58 -05:00
ltqin
6c7e2d3a7e Skip lds of b matrix (#326)
* start

* read for gridwise gemm

* add MakeBGridDescriptor_K0_N0_N1_N2_N3_K1

* add thread copy desc and register buffer

* add K0PerBlock dim

* add read global data

* finish gridwise gemm

* finish blockwise gemm

* add print data

* add smallest config

* add compare code for gridwise gemm

* fix NXdlPerWave

* fix k0perthread and gridwise gemm main loop

* remove b matrix lds alloc

* fix name

* add test code

* create b_grid_desc_k0_k1_k2_n0_n1_n2_n3_k3 from parameter

* add double register

* modify b_thread_desc_

* add float

* fp16 tag

* add tail for pipeline

* finish main loop

* optimize main loop

* start clear gridwise gemm

* clear code

* clear redundant code

* change file name

* change file name

* fix bug after merge develop

* fix input parameters

* use MultiK0 to control b load data loop

* fix some config

* 4 buffer

* fix bug

* one can use

* change read order

* change buffer array to tuple

* change to 8 buffer

* interleave buffer load

* change to 16

* read 8 buffer

* add data buffer to template

* fix after merge develop(head file)

* format

* change to 4 buffer

* remove unnecessary lambda fun

[ROCm/composable_kernel commit: 10b3278b05]
2022-08-13 01:35:49 -05:00
rocking5566
f4589d00d9 Gemm multiple d multiple r (#335)
* Imitate XXX_gemm_multiple_d, add XXX_gemm_multiple_d_multiple_r for gemm + reduction

* Implement run of kernel

* Add example

* Fix parameter typo

* Rewrite the reduceMax example

* Rewrite the reduceMean + reduceMeanSquare example

* Refine naming

* Refine folder name

* refine naming

* Rewrite the gemm + bias + relu + add + layernorm example

* Rewrite the gemm + layernorm example

* clang-format

* Fix bug in LDS sync

* Fix compile error

[ROCm/composable_kernel commit: 6c3c06bf1f]
2022-08-13 01:07:12 -05:00
Anthony Chang
cc5140ac96 Fused attention (#345)
* initial stub for gemm_gemm_xdl_cshuffle

* set up example code

* compiles

* prevent integer overflow

* harmonize interface between ref_gemm and ref_batched_gemm

* batched_gemm_gemm

* fix example

* host tensor gen: diagonal pattern in the lowest two dimensions only

* make c descriptors contain only integral constants

* clean up

* add BlockwiseGemmXdlops_v2 while exploring a unified approach

* implement proper interface

* tidy up example

* fix compilation warnings

* coarsely controlled 2nd gemm padding

* remove rocm-cmake's hard requirement for certain revision

* clang-format

* resolve merge conflict

* fix compilation error on gfx10

* adds acc0 elementwise op to interface

* attention host validation

* add blockwise softmax v1

* iteratively update softmax+gemm

* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum

* add init method for easier debugging

* do away with manual thread cluster calculation

* generalize blockwise softmax interface

* row-wise softmax sum & max

* format

* rename to DeviceBatchedGemmSoftmaxGemm

* add gemm_softmax_gemm instances and tests

* comment

Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: cac014f173]
2022-08-13 00:16:14 -05:00
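The fused kernel validated here computes O = softmax(scale * Q * K^T) * V, with a row-wise max and sum for numerically stable softmax, matching the "row-wise softmax sum & max" commit above. A plain-Python reference of that computation (an illustrative host-side check, not the device kernel):

```python
import math

def matmul(a, b):
    # Naive reference matmul: a is MxK, b is KxN.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax_rows(s):
    # Numerically stable row-wise softmax: subtract the row max before exp.
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        t = sum(e)
        out.append([x / t for x in e])
    return out

def ref_gemm_softmax_gemm(q, kt, v, scale=1.0):
    # O = softmax_rows(scale * (Q @ K^T)) @ V; kt is K already transposed.
    s = [[scale * x for x in row] for row in matmul(q, kt)]
    return matmul(softmax_rows(s), v)
```

Keeping only a per-row max and sum is what lets the device pipeline update the softmax iteratively instead of materializing the full score matrix.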
Rostyslav Geyyer
3b77a801c6 Add example of conv_fwd_bias_relu_add for int4, int8, bfp16, fp16, and fp32 (#343)
* [LWPCK-359] Initial commit

* Working version for fp16, add results to readme

* Update according to PR #341

* Update results in readme

* Add fp32 example

* Add bf16 example

* Update fp16 and fp32 examples

* Add int8 example

* Add separate lengths and strides tensors for D tensors

Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>

[ROCm/composable_kernel commit: 0c6ef7c14e]
2022-08-12 15:30:27 -05:00
zjing14
d83a702186 add g; fixed strides (#355)
[ROCm/composable_kernel commit: 35e49f2de6]
2022-08-12 15:22:39 -05:00
Po Yen Chen
d832dc4354 Add examples for GEMM + AddAddFastGelu (data type: int8, bf16, fp32) (#340)
* Add always_false<> util to delay symbol resolution

* Use always_false<> to prevent trying instantiate unwanted method

* Add new specializations of AddAddFastGelu::operator() method

* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32

* Use floating point literal to simplify code

* Remove unnecessary capture in lambda expressions

* Extract fast GeLU calculation as standalone method

* Mark methods as 'constexpr'

* Add constraint for HostTensorDescriptor templated ctors

* Simplify HostTensorDescriptor ctor calls

* Add C++23 std::size_t literal suffix

* Use _uz suffix to shorten example code

* Remove unnecessary conversion to std::array<>

* Re-order include directives

* Remove C-style casting by literal suffix

* Remove unnecessary statements in main()

* Remove unused type parameter of always_false<>

* Remove unused include directive

* Exit main() by returning meaningful value

* Use 'if constexpr' to switch example flow

* Use std::is_same_v<> to shorten example code

* Add 'inline' specifier to literal functions

* Unify output methods in example

* Move common codes into .inc file

* Add type check in type_convert<>()

* Add type_convert<float>() before computation

* Merge AddAddFastGelu method specializations

* Remove always_false<>

* Add constraint to AddAddFastGelu::operator() parameter types

[ROCm/composable_kernel commit: 68b61504a3]
2022-08-11 17:31:28 -05:00
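AddAddFastGelu, as the specializations above suggest, applies an elementwise e = fastgelu(c + d0 + d1) to the GEMM output and two extra operands. A sketch using the common tanh-based GeLU approximation (the exact polynomial CK's fast GeLU uses may differ):

```python
import math

def fast_gelu(x):
    # Common tanh-based GeLU approximation; CK's exact formula may differ.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def add_add_fast_gelu(c, d0, d1):
    # The fused elementwise op: two adds followed by the activation.
    return fast_gelu(c + d0 + d1)
```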
rocking5566
01a7af65bb ckProfiler for layernorm (#330)
* Refine parameter

* Add base class for layernorm

* Add layernorm instance

* Add layernorm to ckProfiler

* Remove redundant

* Add verification

* Fix compile error due to merge

[ROCm/composable_kernel commit: fdfd7eb597]
2022-08-11 17:03:54 -05:00
zjing14
e4cc81b8ba Add batched/grouped_gemm contraction deviceOps (#349)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating reference conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* update contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

* change gemm_c_permute with contraction

* add grouped_contraction

* add contraction in group_gemm

* add example of grouped_gemm with contraction

* add example of grouped_contraction_bias_e_permute

* clean

* fixed ds

* add m3n2 m2n3 examples into gemm_bias_e_permute

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: e08d68d25d]
2022-08-10 12:20:29 -05:00
Chao Liu
159f0bc1b4 Update Group convolution (#341)
* add conv oddC

* update example

* update example

* fix bug in example

* fix bug in group conv example

[ROCm/composable_kernel commit: 75ab874e02]
2022-08-03 12:28:33 -05:00
Adam Osewski
dee2696501 CGEMM examples bf16, fp32, int8 (#332)
* Add int8 specialization for elementwise Add and Subtract.

* CGEMM examples bf16, fp32, int8

* Add convert reference output to CDataType.

* Skip BF16 data type during testing.

* Lower K value to get rid of accumulation error.

* Fix merge artifact.

* Fix changed function name: GetElementSpaceSize()

* Fix merge artifact.

Co-authored-by: Adam Osewski <aosewski@amd.com>

[ROCm/composable_kernel commit: fb0dc35861]
2022-08-02 14:52:27 -05:00
Chao Liu
5a1aa3f702 Clean up conv example, Instances, profiler and test (#324)
* convnd_fwd fp16 example

* update example

* update example

* update instance

* updating reference conv

* update reference conv

* update conv fwd profiler

* update conv 1d and 3d instance

* update include path

* clean

* update profiler for conv bwd data and weight

* update conv bwd weight

* clean

* update conv example

* update profiler for conv bwd weight

* update ckprofiler for conv bwd data

* fix reference conv bwd data bug; update conv bwd data test

* update examples

* fix initialization issue

* update test for conv fwd

* clean

* clean

* remove test case too sensitive to error threshold

* fix test

* clean

* fix build

* adding conv multiple d

* adding conv multiple D

* add matrix padder

* add gemm padding to convnd

* adding group conv

* update gemm multi-d

* refactor

* refactor

* refactor

* clean

* clean

* refactor

* refactor

* reorg

* add ds

* add bias

* clean

* add G

* adding group

* adding group

* adding group

* update Tensor

* clean

* update example

* update DeviceGemmMultipleD_Xdl_CShuffle

* update conv bwd-data and bwd-weight

* update contraction example

* update gemm and batch gemm with e permute

* fix example build

* instance for grouped conv1d

* update example

* adding group conv instance

* update gemm bilinear instance

* update gemm+add+add+fastgelu instance

* update profiler

* update profiler

* update test

* update test and client example

* clean

* add grouped conv into profiler

* update profiler

* clean

* add test grouped conv, update all conv test to gtest

* update test

[ROCm/composable_kernel commit: 500fa99512]
2022-07-29 18:19:25 -05:00
zjing14
c67c8e3968 Batched Gemm with multiD (#329)
* add batched_gemm_multiD

* add ds

* rename file

* add batched_gemm_bias example

* add batch_strides into bmm_c_permute

* clean

* rename example_28 to example_29

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: d7d7829096]
2022-07-22 09:33:50 -05:00
zjing14
cdb627bf1b Grouped Gemm device with multiD grid (#319)
* replace gridwise_v2r3 with multiD

* adjust parameters

* add instances

* fixed test_grouped_gemm

* fix standalone softmax race condition around blockwise reduction

* fixed ci

* fixed comment: remove redundant workspace

* use instanceFactory

* add test layout

* add empty Ds

* add bias example

* use array

* separate examples

Co-authored-by: Anthony Chang <ac.chang@outlook.com>

[ROCm/composable_kernel commit: 7959dad566]
2022-07-21 10:07:01 -05:00
Anthony Chang
6c157aa076 fix standalone softmax race condition around blockwise reduction (#323)
[ROCm/composable_kernel commit: a11680cce6]
2022-07-14 22:52:45 -05:00
rocking5566
88e9bfd4da Standalone layernorm (#315)
* Implement layernorm kernel and deviceOp

* verify gpu kernel with host code

* 1. Separate gamma and beta from affine
2. Check if argument is valid

* clean

* Sync the naming

* Support sweep once mode if we can put k dimension data inside one block

* [What] Get length from upper length.
[Why] If we get the length directly, we may get the length after padding.

* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.

* Use 1d descriptor for gamma and beta

* Add accElementwiseOp

* Extract layernorm host code

* Support different YVectorDim in GridwiseLayernorm

* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp

* Gamma and beta can share the VGPR.

* Add test for fp32 and fp16

* Fix bug of concurrency and add test case which may fail originally

* Propagate NaN for layernorm

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 7f21662089]
2022-07-13 11:16:14 -05:00
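The layernorm implemented here normalizes each row over the K dimension and then applies the per-element gamma/beta affine. A host-reference sketch of that math (in the spirit of the "Extract layernorm host code" step above, not CK's actual code):

```python
import math

def layernorm(x, gamma, beta, eps=1e-5):
    # Host-style reference: normalize one row over its K dimension,
    # then apply the per-element affine gamma/beta.
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]
```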
Po Yen Chen
41aed137e0 GEMM pipeline v2 (#317)
* format

* improving pipeline

* fix typo

* format

* adding thread group

* adding thread group

* adding thread group

* adding gemm pipeline

* tweak

* refactor

* refactor

* add missing type convert

* refactor

* refactor

* refactor

* clean

* fix build

* refactor

* format

* clean up

* use remove_cvref_t

* clean

* use pipeline_v2 for gemm kernel

* Remove inconsistent indent

* Fix compilation errors due to incomplete merge process

* Add missing include directives

* Fix compilation errors in currently unused files

* Add license in newly added files

* Re-format touched files by clang-format-10

* Fix wrong template argument count of DeviceGemm<>

* Use language construct to choose between types

* Use language construct to choose GEMM example instance

* Fix compilation error due to interface change

* Re-use type alias to avoid duplication

* Unify type alias usage in source file

* Only use v2 pipeline in one gridwise GEMM type

* Remove no-longer used include directives

* Add static_assert() to check pipeline type requirements

* Revert "Add static_assert() to check pipeline type requirements"

This reverts commit f0985f0a13.

* clean

* clean

* clean

* clean

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>

[ROCm/composable_kernel commit: 639147432b]
2022-07-08 15:55:14 -05:00
Shaojie WANG
2ec81dc784 add conv1d/3d bwd weight instances (#318)
* add conv1d/3d bwd weight instances

* add profiler code

[ROCm/composable_kernel commit: 763ca61581]
2022-07-08 15:42:20 -05:00
Chao Liu
82745bffde N-D Tensor Contraction example, instance, and client example (#270)
* adding contraction

* add contraction example

* update example

* update example

* format

* update readme

* clean header

* clean header

* contraction with multiple D

* rename

* fix naming issue; add instances for contraction+bilinear

* change assumed virtual layout of contraction; add client example

* update example

* update

* contraction+scale

* use type_convert

* rename

[ROCm/composable_kernel commit: 4fe9c393b8]
2022-07-07 14:31:11 -05:00
zjing14
288e654664 Batched Gemm with C Permute (#305)
* init commit

* add c_permute

* add mnk padding

* fixed comments

* Fixed comments

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 334361cbde]
2022-07-06 10:38:29 -05:00
Chao Liu
4be57e5afa Gemm+Bilinear (#316)
* refactor

* update example

* update example

* gemm bilinear

* clean

* update

[ROCm/composable_kernel commit: 9e4429f9c3]
2022-07-02 09:15:38 -05:00
guangzlu
32da6c1591 modified grouped gemm addressing method (#307)
* modified grouped gemm addressing method

* modified addressing method in device_grouped_gemm_xdl.hpp

Co-authored-by: root <root@dc-smc-13.amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 8e374781d5]
2022-07-01 01:38:21 -05:00
Anthony Chang
94c9902903 Single-kernel GEMM + layernorm (#263)
* dump lds content in appropriate precision type

* add squared add reduction op; allows sq sum

* initial stub from regular gemm impl

* layernorm example code & host verification

* initial layernorm implementation

* tidy up

* make C0 precision type consistent with C

* clang-tidy and additional comments

* tighten up example code

* account for extra flops/bytes from normalization

* clang-format

* c0 bias/beta/gamma now have its own precision type

* AccElemOp for gemm outputs prior to feeding to layernorm

* update workgroup mapping

* rename kernel template param to reflect its dual use

* use LDS mem pool for reduction workspace

* change cshuffle precision type to f16; clean up

* clang-format

* correct naming

* explicit cast

* fully implemented gemm + bias + activation + add + norm

* activation in correct order

* reflect reduction API's recent change

* amend

* clean up; add comment

* keep up with recent changes in reduction API

* format

* resolve merge conflicts

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 63fd5da637]
2022-07-01 01:38:00 -05:00
zjing14
668891d234 add batch_stride into batched gemm (#314)
* add batch_stride

* fixed test

Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: 1c8126a4c2]
2022-07-01 01:35:37 -05:00
Chao Liu
74b6e85eaf Improve external interface for GEMM and GEMM+add+add+fastgelu (#311)
* interface for GEMM and GEMM+add+add+fastgelu

* rename namespace

* instance factory

* fix build

* fix build; add GEMM client example

* clean

[ROCm/composable_kernel commit: 0dcb3496cf]
2022-06-30 22:11:00 -05:00
zjing14
7094d7c910 Gemm + bias + c_permute (#312)
* init commit

* add desc

* finished c permute

* fixed vector lens

[ROCm/composable_kernel commit: fa9a0a5cfb]
2022-06-30 19:55:09 -05:00
Anthony Chang
bd01797525 Standalone sweep once softmax kernel w/ ckProfiler (#295)
* use 'sweep once' softmax kernel where applicable

* threadwise copy's dst buffer can specify invalid element value

* add int8 in/out float compute softmax support

give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error

* format

* softmax inherits DeviceNormalization

* softmax profiler stub

* tighten up reference softmax interface

* example prints tensor dimension

* add fp32 to softmax profiler

* rename header

* hook with ckProfiler

* format

* resolve merge conflict

* resolve merge conflicts

* update normalization profiler help string

* resolve conflict

* typo

* remove residual

* softmax profiler: address feedback

* test for mixed precision input/output

* fully qualify ck::math::isnan

* add comment for device normalization interface

* revise wording

* constness for alpha/beta scaler pointer

[ROCm/composable_kernel commit: 93c99f3d87]
2022-06-30 12:08:50 -05:00
rocking5566
f9163ae575 external api for gemm + layernorm (#285)
* Extract base class for elementwise

* Refactor interface of DeviceGemmReduce. Do not use tuple in interface

* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add

* Unify base class of gemm + reduce and gemm + bias + add + reduce

* 1. Rename gemm_bias_add_reduce for external api
 2. Refine cmake

* Add normalize device operation

* [What] Reorder the argument
[Why] Because d0 is also the input of c.

* Add type string

* Add example of gemm_bias_add_layernorm  via external api

* Refactor example code

* clang-format

* Fix compile error

* clang-format

* Add external api for gemm_add_add_layernorm and normalize

* Add client example

* clang-format

[ROCm/composable_kernel commit: 12235112a1]
2022-06-27 14:25:10 -05:00
Chao Liu
675e7b7956 External Interface (#304)
* add client example

* clean

* clean

* reorg

* clean up profiler

* reorg

* clean

* fix profiler

* function for getinstances

* update client example

* update client example

* update client example

* update

* update example

* update Jenkins file

* update cmake

* update Jenkins

[ROCm/composable_kernel commit: aebd211c36]
2022-06-26 19:39:02 -05:00
Chao Liu
2ef299e0ad add license in file (#303)
[ROCm/composable_kernel commit: d3051d7517]
2022-06-24 23:32:43 -05:00
Chao Liu
9df0a11a51 Absolute include path (#281)
* add gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obsolete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporary disable client test on Jenkins

* clean

* clean

* clean

[ROCm/composable_kernel commit: d1db6a0c3e]
2022-06-24 20:51:04 -05:00
Chao Liu
170b534aae update license (#297)
* update license

* update license

* update license

* update license

[ROCm/composable_kernel commit: a49115b95e]
2022-06-23 01:27:30 -05:00
Adam Osewski
86c2b55b2a Testing all fwd convolution specializations. (#259)
* UniformFill with integer values.

* Log tested instance type string.

* Add UT for all convolution specializations.

* debugging conv

* Fix dangling reference bug.

* Small refinements.

* Fix call to error checking function.

* Small refinements to tests.

* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.

* Add helper traits for AccumulatorDataType.

* Print first 5 errs in check_err for integral types.

* Rename FillUniform to FillUniformDistribution

* Refactor

* Do not use typed tests.
* Instead use plain fixture class with templatized member functions.
* Initialize tensors with integer values.

* Refine test instances.

* Properly set accumulator data type.
* Add another "big" instance.

* Refactor convolution tests.

* Revert "debugging conv"

This reverts commit b109516455.

* Add pragma once + format + small refinement.

* Fix some unwanted changes.

* Clang-format

* Fix profile_convnd to use renamed tensor initializer.

* Add instances for ConvFWDND kernel case 2D

* Helpers to get ConvNDFwd 2D instances.

* Refactoring.

* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.

* Refine and fix test.

* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.

* Remove AccumulatorDataType type trait.

* Update unit-tests.

* Remove operator<< overload.

* Unlock conv1d/3d nd fwd instances.

* Enable skipping calculating reference using flag.

* Fix number of channels for first ResNet50 layer.

* Clang-format.

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

[ROCm/composable_kernel commit: a2edd7d802]
2022-06-22 22:05:04 -05:00
Shaojie WANG
1b65fd9584 fix Issue 291 (#294)
* rename for typeconvert functor

* refine code

[ROCm/composable_kernel commit: 4634b12043]
2022-06-21 17:10:56 -05:00