Commit Graph

203 Commits

Author SHA1 Message Date
Chao Liu    4a99f54c31  remove dead code  2019-05-02 11:09:42 -05:00
Chao Liu    d2436a58c4  v1r3 nchw*cyxk=nkhw lds double buffer  2019-04-27 15:06:15 -05:00
Chao Liu    63cdc6d2a4  fix v1r3 output reorder bug  2019-04-27 00:58:49 -05:00
Chao Liu    c138e2126d  nchw*cyxk*nkhw on AMD  2019-04-26 18:03:55 -05:00
Jing Zhang  49d5af1002  ds_read_offset  2019-04-26 15:55:26 -05:00
Chao Liu    3ce77700b6  debugging ds_read asm  2019-04-26 15:34:55 -05:00
Chao Liu    b93d2e1b57  fix batch gemm asm bug  2019-04-26 14:40:19 -05:00
Chao Liu    46a0aec185  trying gemm asm  2019-04-26 10:20:02 -05:00
Chao Liu    2603bb0fe3  tuning on vega 20  2019-04-25 17:28:59 -05:00
Chao Liu    a903146427  implicit gemm v1r3 nchw_cyxk_nkhw  2019-04-25 15:14:39 -05:00
Chao Liu    569ad66e2a  added implicit gemm v1r3 lds_double_buffer NCHW * CYXK = KNHW, reworked static functionals  2019-04-23 17:51:14 -05:00
Chao Liu    87d8740bf5  added lds double buffer (on C dimension) for implicit gemm v1r3, as a result, it should achieve 90% of peak for all filter sizes, on CHWN format  2019-04-19 17:49:25 -05:00
Chao Liu    6d066ede00  added implicit gemm v1r3, refactored decomposition of wei tensor (loop over y, x first, and C second) to allow easy lds double buffer on C  2019-04-19 16:46:29 -05:00
Chao Liu    5ce19234a4  added GridwiseConvolutionImplicitGemm_v1r2_nchw_cyxk_khwn  2019-04-19 14:22:02 -05:00
Chao Liu    19f17df47a  implicit gemm v1r2: adding support for nchw  2019-04-18 11:49:09 -05:00
Chao Liu    17f3d2d4bc  refactor ConstantTensorDescriptor and functional  2019-04-16 17:36:18 -05:00
Chao Liu    a2cf803c7e  refactor  2019-04-13 16:21:55 -05:00
Chao Liu    3ffb824ed3  refactor  2019-04-13 16:21:44 -05:00
Chao Liu    7d8daba741  tuning  2019-04-13 15:53:03 -05:00
Chao Liu    00899f191b  implicit gemm v1r2: only load 1d filter  2019-04-13 11:19:17 -05:00
Chao Liu    96ee9571e2  tuned implicit gemm v1 for 3x3 on AMD to 82%. Fixed a bug in 4d tensor blockwise copy.  2019-04-10 18:10:18 -05:00
Chao Liu    edc89778c3  update flops calculation  2019-04-10 15:35:00 -05:00
Chao Liu    5696c81ffd  simplify blockwise batched GEMM  2019-04-10 15:29:35 -05:00
Chao Liu    71434918bf  add 3x3 28x28 case  2019-04-10 11:33:57 -05:00
Chao Liu    d86a5e4b47  clean up  2019-04-09 19:12:16 -05:00
Chao Liu    e624df922d  enabled ds_read_b128 and ds_write_b128 on hip c++  2019-04-09 19:05:44 -05:00
Chao Liu    471830a052  tidy yp  2019-04-09 18:07:36 -05:00
Chao Liu    1bd880a674  refactor  2019-04-09 16:13:14 -05:00
Chao Liu    796f72e26e  load smaller weight tensor  2019-04-08 23:57:13 -05:00
Chao Liu    5b36aeadeb  refactor  2019-04-08 16:13:17 -05:00
Chao Liu    cc0fa73acd  added implicit_gemm_v1 lds double_buffer  2019-04-08 14:11:55 -05:00
Chao Liu    c075d3f7d9  add more assertion  2019-04-08 12:02:56 -05:00
Chao Liu    268d1c717c  tidy up  2019-04-08 10:48:29 -05:00
Chao Liu    c9fa46af0b  debugging implicit gemm v1: use 10d tensor output  2019-04-08 10:27:32 -05:00
Chao Liu    90abf42799  refactor  2019-04-06 19:39:58 -05:00
Chao Liu    b57d60c0b7  refactor  2019-04-06 19:01:59 -05:00
Chao Liu    bd0098afb3  use dedicated threadwise_copy for 1x1, perf at 80%  2019-04-06 18:40:54 -05:00
Chao Liu    5245a0162b  clean up  2019-04-06 16:27:07 -05:00
Chao Liu    7a251a0922  debugged: CUDA should use its own float4 definition  2019-04-06 15:44:53 -05:00
Chao Liu    f6cb5b846d  debugging  2019-04-06 15:10:40 -05:00
Chao Liu    0983d205ad  debugging  2019-04-05 18:54:26 -05:00
Chao Liu    bae2333791  preload don't wait  2019-04-05 02:42:54 -05:00
Chao Liu    dabfa77fc6  clipboard float4 copy and paste C++ code  2019-04-05 02:13:29 -05:00
Chao Liu    605afd0fb6  Merge branch 'master' into inline_asm_v2  2019-04-04 18:40:23 -05:00
Chao Liu    66edb2590d  Merge branch 'inline_asm_v2' of github.com:asroy/modular_convolution into inline_asm_v2  2019-04-04 17:37:13 -05:00
Chao Liu    19b4179798  unroll even-odd loop  2019-04-04 17:36:02 -05:00
Jing Zhang  62c4d5dff3  clean code  2019-04-04 14:52:42 -05:00
Jing Zhang  313f3c07d2  unroll k  2019-04-04 11:43:37 -05:00
Jing Zhang  0f620a9018  add debug  2019-04-04 11:20:43 -05:00
Chao Liu    fbc7817bbb  add asm into lds_double_buffer version  2019-04-04 10:38:49 -05:00