composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
Chao Liu	a4f24233e5	manually apply bug fix changes in pr #63 (#64 ) * Bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2() * Bug in ThreadwiseTensorSliceTransfer_v1r3 logic for calculating "forward_sweep"	2021-12-12 18:05:51 -06:00
Chao Liu	fd3d907a80	fix ReLU formula (#61 ) * fix relu * clean up * clean up	2021-12-04 16:05:29 -06:00
Chao Liu	41cdd3801a	GEMM/Conv+BiasAdd+ReLU+Add (#55 ) * gemm+activation * move C pointwise operation into threadwise copy * add pointwise operation to A/B matrix * update ckProfiler * adding bias add * adding bias add * adding bias add * added bias add; worked around compiler issues * clean up * clean up * Update README.md * Update README.md * Update README.md * clean up * add conv_xdl example * adding conv_xdl_bias_relu_add example * add conv+bias+relu+add, but has register spill issue * tweak * tweak * refactor * Update README.md update readme for example/2_gemm_xdl_bias_relu_add * clean up * Update README.md update readme for example/3_conv_xdl * Update README.md	2021-12-02 20:07:37 -06:00
Jing Zhang	d7a0a3f94c	renaming/comments	2021-12-02 23:37:57 +00:00
Jing Zhang	2cbb897638	add static_buffer_v2 zero out	2021-12-02 05:54:19 +00:00
Jing Zhang	d798c9b8c6	fixed c_buffer alloc	2021-12-02 05:03:03 +00:00
Chao Liu	4041850f11	fix layout naming convention (#56 )	2021-11-30 09:10:55 -06:00
Chao Liu	237d4ca03f	added test for magic number division (#58 )	2021-11-30 09:09:28 -06:00
zjing14	567f5e9c5f	add args for packed gemm (#54 )	2021-11-24 12:33:55 -06:00
Chao Liu	64350affc5	Use __builtin_memcpy to implement bit_cast and for accessing vector from pointer of scalars (#53 ) * reworking vector_type * use __builtin_memcpy for bit_cast and vector access of scalar pointer * clean up	2021-11-18 09:11:15 -06:00
zjing14	970fa3e92e	v5r1 fusion kernels for inference (#49 ) * init * refactor for 1x1 * rename e0_e1 * add e1 with bugs * debug * fixed * fixed e1 * add timer * imprve threadwise gemm with dot2 * add e2 * tuning * seperate c2 * add nhwc * restore nchwc * clean * opt * fixed; tuning * add BGlobalMoveSliceWindowStepHacks{} * tuning * repeat running * adjust * merge v5r1 nchwc * add adaptors * split k0 k1 in c_thread_grid * split h and w * remove v5r1 nhwc * clean for pr * remove host_conv_add * clean code * clean * add dynamic support * static mode * test static * add conv+add fusion * fixed validation * naming fix * use activ_enum * make static * refactor conv_add for InMem::add * add bias * add conv_out * add configurable makeddesc * add maxpool fusion * add maxpool host for validation * enable static desc * conv-only use v5r1_add * test * test * for binary dumps * fixed incorrect results due to typo * clean * debugging maxpool * workaround with offset trick * clean code * modularize ops of fusion * add gridwise_gemm_v3 * create seperate fusion fun * enable dynamic mode of conv and conv+resize_add * add dynamic mode of maxpool * add pass by point * add activ_type as arguments * merge develop * clean * reset config to old default Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-11-18 08:34:07 -06:00
zjing14	a651ea4f7a	Fixed bfp16 host_conv_fwd (#52 ) * fixed bfloat16 issues * refactor type_convert * fixed host_convolution_forward for ushort Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-11-18 08:10:56 -06:00
zjing14	0a66c54e95	fixed multiple definition issue of bfp16/fp32 conversion function when building ckProfiler (#51 ) * fixed bfloat16 issues * refactor type_convert Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-11-16 15:44:17 -06:00
Jing Zhang	89e1ebd4d5	updated bfloat16_to_float	2021-11-16 18:01:25 +00:00
zjing14	3737bb039a	Add bfp16/int8 support into XDL GEMM operator (#50 ) * init StaticBufferV2 * clean * adopt old output stage for staticBufferV2 * clean * remove hack * clean * clean * add parameters * clean code * move c_buffer alloc into blockwise gemm * add adaptors for m/n_thread_data_on_grid * tweak gemm * adjust blockwise_gemm_xdlops * tweak * update conv * update script * adding bwd 1x1 * update script * adding 1x1 bwd * debugging bwd 1x1 failure * update script * update script * test * test v100 * add bf16_1k * clang-format * clean * add bfp16 for gfx908 * add verification * clean up * clean code * restore bfl16 * clean * add bfp16 support into gemm_driver * apply new generator to other drivers * add int8 support * cleanb * clean * clean * clean Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Chao Liu <lc.roy86@gmail.com> Co-authored-by: root <root@hayabusa6111.amd.com>	2021-11-15 10:24:39 -06:00
Chao Liu	b491ebf384	FP16 data in-register transpose (#41 ) * start fixing 16bit data packing * adding StaticTensor * adding StaticTensor * adding StaticTensor * add missing constexpr * adding static tensor * adding static tensor * adding transpose * add inline asm for transpose 2x2 of half_t * add general transpose_vectors(), but have unnecessary register initialization using v_mov * fix unnecessary register initialization in transpose_vector by using more pass-by-reference * add hardcoded logic for NHWC wrw * improve asm for v_pack * make ThreadwiseTensorSliceTransfer_v3r2 support any tensor * tweak * reorganize file	2021-11-15 10:05:58 -06:00
Chao Liu	e823d518cb	ckProfiler and device-level XDL GEMM operator (#48 ) * add DeviceGemmXdl * update script * fix naming issue * fix comment * output HostTensorDescriptor * rename * padded GEMM for fwd v4r4r4 nhwc * refactor * refactor * refactor * adding ckProfiler * adding ckProfiler * refactor * fix tuning parameter bug * add more gemm instances * add more fp16 GEMM instances * fix profiler driver * fix bug in tuning parameter * add fp32 gemm instances * small fix * refactor * rename * refactor gemm profiler; adding DeviceConv and conv profiler * refactor * fix * add conv profiler * refactor * adding more GEMM and Conv instance * Create README.md Add build instruction for ckProfiler * Create README.md Add Readme for gemm_xdl example * Update README.md Remove build instruction from top most folder * Update README.md * clean up	2021-11-14 11:28:32 -06:00
ltqin	6014185ac6	[Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44 ) * change method computering kpad * remove unusing variable: batchlen * change KPerBlock to K0PerBlock * fix bug for k0 == k0perblock * fix bug for get k0 index * use math::integer_divide_ceil Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-10-27 09:39:18 -05:00
Chao Liu	3e9113707f	Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all update ck from miopen ck_upstream	2021-10-27 09:07:38 -05:00
ltqin	211dae8229	Merge branch 'develop' into miopen_downstream_all	2021-10-27 13:34:19 +08:00
Jun Liu	5890e30076	[Composable Kernel] update develop branch code to ck_upstream Merge pull request #1236 from ROCmSoftwarePlatform/develop	2021-10-25 19:49:17 -07:00
Chao Liu	d5297abae9	fix bug in gridwise gemm xdlops v2r3 (#45 )	2021-10-21 16:42:24 -05:00
Chao Liu	c3018794b4	bug fix (#39 )	2021-10-19 18:43:10 -05:00
ltqin	fd49ff8080	add nchw atomic , nhwc and nhwc atomic method for backward weight (#30 ) * add add new algorithm from v4r4r2 * program once issue * add split k functiion * redefine code * add a matrix unmerge * add b matrix unmerge k0 * trans a and b to gridegemm * nhwc init * no hacks and vector load * add hacks * modify some parameter * fix tuning prometer for fp32 * fix tuning prometer for fp16 * start change gridwise k split * init ok * revome a b matrix k0mk1 desc in grid * carewrite lculate gridsize * add kbatch to CalculateBottomIndex * remove some unused funtion * add clear data function before call kernel * out hacks * in hacks * rename device convolution file and function name * modify kBatch value * fix some tuning code * start from v4r4 nhwc * nhwc atomic is able to run * just for fp32 * enable nchw atomic * tweak * tweak * re-arrange gridwise gemm hot loop for wrw * add wrw v4r5 * v4r4r5 fp16 * v4r4r4 fp16 * v4r4r2 fp16 * V4R4R4XDLNHWC fp16 * V4R4R2XDLATOMICNCHW fp16 * adjust for fp16 * input gridsize * change kbatch to gridsize * testing wrw * clean up * k_batch to gridsize * fix bug * wrw v4r4r4 kbatch change to gride size * wrw v4r4r2 kbatch change to gride size * after merge , change gridwise gemm v2r4 * change MakeCBlockClusterAdaptor * other method use new gridwise gemm * clean up * chapad method nge to make_right_pad_transform * kbatch out from transform function * clean up and fix bug * fix bug * using function type reduce template parameters * using auto replace define fuction type * clean up Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Jing Zhang <jizhan@amd.com>	2021-10-19 18:42:34 -05:00
Qianfeng	b2dc55f82c	[MIOpen Downstream] Fix Reduction Kernel (#34 ) * Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel * Fix with regard to implementing GetZeroVal() in both kernel and host * Avoid convert to compType from dstDataType before writting the output value * Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator * Add CONSTANT decorator for descriptor read buffer * Use get_thread_local_1d_id() for thread local Id * Rename GetZeroVal() to GetReductionZeroVal() in the kernels * Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp * Occasional tiny simplification and update in the kernel files * Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers * Update to remove OpenCL tidy checking failures * Update for better readability * Remove unused codes and not-needed template parameters in the kernel wrappers Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-10-06 14:43:17 -05:00
Chao Liu	b3e8d57d51	Tweak GEMM kernel (#38 ) * add parameters * tweak gemm * tweak * update conv * update script * adding bwd 1x1 * update script * adding 1x1 bwd * debugging bwd 1x1 failure * update script * update script * test * test v100 * clean up	2021-10-06 11:12:36 -05:00
zjing14	846f462bd4	Add VectorType support into StaticBuffer (#27 ) * init StaticBufferV2 * clean * adopt old output stage for staticBufferV2 * clean * remove hack * clean * clean * clean code * move c_buffer alloc into blockwise gemm * add adaptors for m/n_thread_data_on_grid * adjust blockwise_gemm_xdlops * reorder ops in GEMM hot loop Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-10-06 10:13:52 -05:00
Qianfeng	dfb80c4e39	[Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156 ) * Squashed 'src/composable_kernel/' content from commit `f6edda611` git-subtree-dir: src/composable_kernel git-subtree-split: `f6edda6119` * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files * Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c `5781adf5c` Update develop (#5) (#6) `97e6d514f` Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile `7b1ec41e5` refactor `49c33aaea` refactor `54b3e73d1` rename git-subtree-dir: src/composable_kernel git-subtree-split: `5781adf5cf` * fix * refactor * remove online compilation from CK * refactor * fix * add ctest * tidy * add tidy * tidy * tidy * tidy * tidy * tidy * tidy * tidy * tidy * tidy * add c-style pointer cast * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast * fix clang warning suppression * tidy * suppress cppcheck * fix enum issue * revert chagnes to hip build * fix kernel filename * update CK build script * rename * rename * make innner product compatiable on gfx900 * Update src/include/miopen/solver/ck_utility_common.hpp Co-authored-by: JD <Jehandad.Khan@amd.com> * compiler parameter use stream * use int instead of index_t in kernel wrapper * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element * refactor * refactor * change cmakelist * change ck common utility * fix * Squashed 'src/composable_kernel/' changes from 5781adf5c..31b403526 `31b403526` Merge pull request #16 from ROCmSoftwarePlatform/develop `b62bf8c3f` Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration `ccc4a1d36` Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration `67ad47e7c` refactor `16effa767` refactor `a91b68dfc` DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element `2cbabbba5` use int instead of index_t in kernel wrapper `0834bc763` compiler parameter use stream `f2ac7832c` make innner product compatiable on gfx900 `4e57b30a6` rename `c03045ce2` rename `b2589957f` update CK build script `2c48039d0` fix kernel filename `d626dccc9` fix enum issue `643ebd4f3` tidy `ddd49ec9e` fix clang warning suppression `4f566c622` vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast `172036d72` add c-style pointer cast `76f313193` tidy `d18428901` tidy `f885c131d` tidy `80120f0a0` tidy `c3efeb5e2` tidy `56fc0842b` tidy `54fba515b` tidy `e62bae7a4` tidy `24c872894` add tidy `61487e0a0` fix `ae98b52ad` remove online compilation from CK `cb9542131` refactor `73ca97015` Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3 `3b8664611` Merge pull request #7 from ROCmSoftwarePlatform/master `d09ea4f4e` Update develop (#5) `3d32ae940` add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files git-subtree-dir: src/composable_kernel git-subtree-split: `31b403526e` * Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel * Fix with regard to implementing GetZeroVal() in both kernel and host * Avoid convert to compType from dstDataType before writting the output value * Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator * Add CONSTANT decorator for descriptor read buffer * Use get_thread_local_1d_id() for thread local Id * Rename GetZeroVal() to GetReductionZeroVal() in the kernels * Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp * Occasional tiny simplification and update in the kernel files * Update in src/reducetensor.cpp for consistent IDs passing to the kernel * Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers * Update to remove OpenCL tidy checking failures * Small updates in src/reducetensor.cpp * Update for better readability * Remove unused codes and not-needed template parameters in the kernel wrappers Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: JD <Jehandad.Khan@amd.com>	2021-09-29 08:12:11 -07:00
Jun Liu	8557901d02	Merge pull request #1165 from ROCmSoftwarePlatform/develop Merge develop into CK_upstream (Please don't squash when merging)	2021-09-21 15:52:12 -07:00
Chao Liu	f305bebdc3	Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr [MIOpen Downstream] Dynamic Reduction PR	2021-09-21 11:59:23 -05:00
Chao Liu	b725e3fc84	Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr	2021-09-21 11:55:26 -05:00
Chao Liu	df0d68106e	:Merge remote-tracking branch 'origin/develop' into CK_upstream	2021-09-20 20:44:01 -05:00
Chao Liu	f3acd2510b	Add a version of Merge transform that use integerdivision and mod (#25 ) * add Merg_v3_division_mod * refactor	2021-09-05 12:57:57 -05:00
Chao Liu	19613902b5	GEMM driver and kernel (#29 ) * add gemm driver * tweak * add gemm kernel: mk_kn_mn and km_kn_mn * tweak * add GEMM km_nk_mn * fix comment	2021-09-05 12:41:28 -05:00
ltqin	627d8ef35a	Backward weight v4r4r2 with xdlops (#18 ) * start * modify transformat * modify device convolutiion * modify host * added host conv bwd and wrw * remove bwd, seperate wrw * clean * hacall k to zero * out log * fixed * fixed * change to (out in wei) * input hack * hack to out * format * fix by comments * change wei hacks(wei transform has not merge) * fix program once issue * fix review comment * fix vector load issue * tweak Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Jing Zhang <jizhan@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-08-30 22:49:17 -05:00
Chao Liu	10bb811060	Misc fixes (#24 ) * use cast_pointer_to_generic_address_space() in v6r1 kernel wrapper, DynamcBuffer and buffer_load take customized invalid-element-value, add buffer_load/store for fp64 * use remove_cvref_t	2021-08-26 20:05:19 -05:00
Qianfeng	9e80cdceb7	[SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108 ) * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files * make inner product compatible on gfx900 * Update src/include/miopen/solver/ck_utility_common.hpp * compiler parameter use stream * use int instead of index_t in kernel wrapper * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element * Add dynamic generic reduction kernel layer (kernel wrappers, kernel implementations and utilities) * Some updates to dynamic composable kernel facility for the need of dynamic generic reduction * Update to generic reduction C++ host interface layer to support dynamic generic reduction * Update to remove tidy complaints in host interface layer * Change the unary operator form from void op(T &x) to T op(T x) * Update to pass single workspace pointer for all kernels (fix for OpenCL backend) * Use cppcheck-suppress to prevent some strange warnings * Re-use operator [] and () for DynamicBuffer and update to depending codes * Remove useless codes in first call threadwise/warpwise/blockwise kernel wrappers * [performance] Remove un-needed local buffer initialization Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: JD <Jehandad.Khan@amd.com>	2021-08-26 18:04:55 -07:00
zjing14	a7a758d8ce	GlobalAtomicAdd for fp32/int32 (#23 ) * add f32/i32 atomicAdd support into dynamicBuffer, and enable it in v1r3 * fixed * fixed * update comment Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-08-25 10:55:55 -05:00
zjing14	9d3f634a3c	Xdlops refactor fix (#22 ) * added constexpr ahead of adptor; clean unused driver; rename M/NPerWave to M/NPerXDL * fixed bwd * fixed comment	2021-08-23 11:22:10 -05:00
Chao Liu	c6f26bb480	magic division use __umulhi() (#19 )	2021-08-23 10:40:27 -05:00
Chao Liu	6fe3627a9e	Composable kernel init integration v3 (#1097 ) * Squashed 'src/composable_kernel/' content from commit `f6edda611` git-subtree-dir: src/composable_kernel git-subtree-split: `f6edda6119` * add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files * Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c `5781adf5c` Update develop (#5) (#6) `97e6d514f` Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile `7b1ec41e5` refactor `49c33aaea` refactor `54b3e73d1` rename git-subtree-dir: src/composable_kernel git-subtree-split: `5781adf5cf` * fix * refactor * remove online compilation from CK * refactor * fix * add ctest * add c-style pointer cast * vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast * fix clang warning suppression * tidy * suppress cppcheck * fix enum issue * revert chagnes to hip build * fix kernel filename * update CK build script * rename * rename * make innner product compatiable on gfx900 * Update src/include/miopen/solver/ck_utility_common.hpp Co-authored-by: JD <Jehandad.Khan@amd.com> * compiler parameter use stream * use int instead of index_t in kernel wrapper * DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element * refactor * refactor * change cmakelist * change ck common utility * fix Co-authored-by: JD <Jehandad.Khan@amd.com>	2021-08-19 10:55:03 -05:00
zjing14	a2ad6d3531	refactor dynamic xdlops iGemm (#13 ) * xdlops refactor * fixed commnt * clean xdlops_gemm * add make c into xldops-gemm * change mfma_info * refactor xdlops, hide c desc * clean * clean * clean * apply hacks changes to v4r4r4_nhwc * rename hacks and use single stage adapter * enable fp16 mfma	2021-08-19 09:54:10 -05:00
zjing14	ba6f79a75e	Added host_conv_wrw for verification (#15 ) * added host conv wrw	2021-08-19 01:00:41 -05:00
Chao Liu	31b403526e	Merge pull request #16 from ROCmSoftwarePlatform/develop Merge develop into master	2021-08-18 11:22:34 -05:00
Chao Liu	b62bf8c3f8	Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration MIOpen Downstream: Initial integration 2nd PR	2021-08-16 16:39:40 -05:00
Chao Liu	ccc4a1d365	Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration	2021-08-16 16:28:53 -05:00
Chao Liu	67ad47e7c1	refactor	2021-08-16 21:01:33 +00:00
Chao Liu	16effa767c	refactor	2021-08-16 20:36:47 +00:00
Chao Liu	a91b68dfcb	DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element	2021-08-13 23:40:19 +00:00
Chao Liu	2cbabbba54	use int instead of index_t in kernel wrapper	2021-08-13 20:55:39 +00:00

486 Commits