composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
Chao Liu	b2589957f3	update CK build script	2021-08-10 22:19:13 +00:00
Chao Liu	f63a23acb1	[MIOpen Downstream] Initial MIOpen integration (#52 ) * update online kernel wrapper bundle all descriptors in a tuple * change __CONSTANT__ to CONSTANT * rename * adding tuning * added IsValidCompileParameter * reorginze * adding tunable for fp16 and int8 * fix kernel compile warning and bug fixes * suppress warning about cast CONSTANT (address space 4) pointer * fix building issue	2021-07-27 00:02:27 -05:00
Chao Liu	1264925422	reorganize files to prepare for MIOpen integration (#51 ) * change olc cmake * adding online compile to fwd-v4r5r2 * update scripts * remane fwd-v4r5r2 to fwd-v6r1 * clean up	2021-07-18 00:43:05 -05:00
Chao Liu	81c942cd7e	Deprecate static kernel (#42 ) * deprecate static kernels	2021-07-08 10:40:00 -05:00
zjing14	3835318cc3	xdlops_v4r4_fwd fp32/fp16 (#34 ) * create files for xdlops * working on blockwise_gemm_xdlops * add KReduction * add m/n repeats * add 2x2 pipeline * added 128x128 wavegemm * use StaticBuffer of vector_type * break vector type to blk_size * add kpack into xldops_gemm and blockwise_gemm * abroadcast only * add fp32 mfma instructions * adding fp16 mfma * pack half4_t * rename kperwave to kpack * add 32x32x8fp16 * add fp16 mfma * clean code * clean code * V4r4 xdlops kpack (#35) * add kpack with incorrect results * bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2 * add 1x1 kernel * add gridwise_gemm_v2 - single_buffer * enabled dwordx4 for fp16 Co-authored-by: Chao Liu <chao.liu2@amd.com> * refactor fwd-v4r4-xdlops * add v4r4-nhwc-xdlop * improve some perf of nhwc and nchw by tuning parameters, and change scheuduling in gridwise-gemm loop * tweak scheduling in gridwise gemm * add v4r3 with a single output copy * init commit: output with slice win * adding sliceWin * add multiple repeats pattern * starting adding bwd-v4r1-xdlops * use tuple as SrcBuffer * adding bwd-data v4r1 nhwc xdlops * fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2() * fix bug in host bwd-data conv * initial implementation of bwd-data v4r1 nhwc xdlops * add launch bound flags * enable launch bound * add m/nrepeat=4 * tweak bwd-data v4r1 nhwc xdlops * added bwd-data v4r1 nhwc xlops with output A and weight B * add fwd-v4r4 nhwc xdlops, A input, B weight, C output Co-authored-by: Chao Liu <chao.liu2@amd.com>	2021-07-01 14:33:00 -05:00
Qianfeng	1685048a67	Add online compilation for dynamic kernels (#37 ) * Add online-compiling facility * Synchronize from fwd-v4r5 and implement host interfaces to call conv-fwd v4r4/v4r5 using on-line compiling method * Tiny adjustment to time reporting * Use object assignment to replace explicit bytes copying in the first kernel of v4r4/v4r5 * Use single thread to assign descriptor object to device memory * Adjust to the workload assignment of the two kernels of v4r4 (experimental) * Revert "Adjust to the workload assignment of the two kernels of v4r4 (experimental)" This reverts commit eb38461456bb0c82b6c0d32cdd616e181907e20c. * Update to make constexpr for generating descriptor types in kernel 2 of dynamic conv-fwd v4r4 * Update to dynamic conv-fwd v4r4 online-compiling * Update to dynamic conv-fwd v4r5 online-compiling (result not accurate) * Tiny update to driver/CMakeLists.txt * clang-format * Tiny comments change * Add env OLC_DUMP_SAVE_TMP_DIR to support saving of temperary dir * Fwd v4r5 olc perf (#39) * added hip-clang flags that fix perf issue of online compilation * fix bug for olc fwd-v4r5-nchw * Move constexpr and type reference statements out of the function body in conv-fwd v4r4/v4r5 kernel wrapper * Remove printing in hip_build_utils.cpp * Update to root CMakeLists.txt * Revert "Move constexpr and type reference statements out of the function body in conv-fwd v4r4/v4r5 kernel wrapper" This reverts commit 3d2c5d8ecdd8298b72d127110500ed5b38d9835c. Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Chao Liu <lc.roy86@gmail.com> Co-authored-by: root <root@dc-smc-18.amd.com>	2021-06-24 08:34:19 -05:00
Chao Liu	30072aec37	Restructure gridwise and blockwise GEMM, add tensor contraction and FWD-v4r5 (#36 ) * experimenting magic number division * overhauling fwd-v4r4 to clearly reflect transformation graph * added fwd-v4r5 * bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2 * bug fix and added sanity-check in transform_dynamic_tensor_descriptor * added conv_driver_v2	2021-06-09 23:53:08 -05:00
Chao Liu	78b987fbd6	Use DynamicBuffer instead of raw pointer (#32 ) * Use DynamicBuffer to hold raw pointer (to global and LDS memory) * add workaround for compiler issue (inefficient ISA) of ds_write for int8x4, int8x8, int8x16	2021-05-12 13:10:42 -05:00
Chao Liu	01055d95d9	No raw index calculation (#31 ) * Replace most raw index calculation to coordinate transformation * Overhaul blockwise and threadwise GEMM * Overhaul driver for gridwies GEMM kernel Co-authored-by: Jing Zhang <jizhan@amd.com>	2021-05-11 00:09:25 -05:00
Chao Liu	fcbb978828	Dynamic tensor descriptor (#24 ) * support dynamic tensor descriptor * use buffer load OOB feature for padding case * add navi support * add int8x4 inference kernel Co-authored-by: Chao Liu <chao@ixt-rack-81.local.lan> Co-authored-by: Jing Zhang <jizhan@amd.com>	2021-03-25 13:51:11 -05:00
Chao Liu	5c7cec1115	Code clean up (#20 ) * tuning para, * testing on v100 * add fp16 * remove deprecated tensor descriptor * sync with miopen * update build script Co-authored-by: Jing Zhang <jizhan@amd.com>	2020-06-23 20:31:27 -05:00
Chao Liu	c5da0377fb	Added bwd data v3r1 v4r1, tweaking v1 (#10 ) * Added bwd data v3r1: breaking down compute into a series of load balanced GEMM, and launch in a single kernel * Added bwd data v4r1: like v3r1, but launch GEMMs in multiple kernels * Tweaked v1r1 and v1r2 (atomic) on AMD GPU	2020-01-20 10:20:03 -06:00
Chao Liu	8f5f64960e	backward data (#7 ) * enabled atomic add in tensor copy * added gridwise GEMM * added backward data conv using GEMM + atomic * added backward data conv using GEMM, no atomic	2019-12-03 01:16:12 -06:00
Chao Liu	52c3fe05be	Refactor for MIOpen integration (#4 ) Refactor, so can bring multi-index transformation and padding support into MIOpen	2019-10-11 11:37:31 -05:00
Chao Liu	f58bf38445	enable hip compiler flag: -amdgpu-enable-global-sgpr-addr	2019-09-17 17:34:39 -05:00
Chao Liu	0c83df668f	add script for doing Jack's ISA injection hack	2019-08-21 14:29:13 -05:00
Chao Liu	284e7bb317	refactored implicit gemm v1r3	2019-07-29 15:25:38 -05:00
Chao Liu	efd419ecbe	refactored implicit gemm v1r3	2019-07-29 15:01:01 -05:00
Chao Liu	c15ff3c825	update compile script	2019-07-03 16:03:12 -05:00
Chao Liu	dab2938937	tested on P100	2019-06-27 15:46:09 -05:00
Chao Liu	c82b833d8e	change build	2019-06-12 10:47:25 -05:00
Jing Zhang	49d5af1002	ds_read_offset	2019-04-26 15:55:26 -05:00
Chao Liu	e624df922d	enabled ds_read_b128 and ds_write_b128 on hip c++	2019-04-09 19:05:44 -05:00
Chao Liu	605afd0fb6	Merge branch 'master' into inline_asm_v2	2019-04-04 18:40:23 -05:00
Chao Liu	6166233e05	add script to extrac asm on hip	2019-04-03 10:36:18 -05:00
Chao Liu	e6c86f81b5	add cuda extract_asm script	2019-04-02 20:26:58 -05:00
Chao Liu	bdbc0eaad1	cleaning up dead code	2019-04-02 17:58:44 -05:00

27 Commits