composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Author	SHA1	Message	Date
danyao12	70514fd899	bwd rtn	2024-08-14 12:32:12 +00:00
danyao12	17c97f5814	solve lds read&write conflicts	2024-08-12 13:13:55 +00:00
danyao12	4cc514f803	fix unpadded lse issue in fwd splitkv	2024-08-07 11:00:33 +00:00
rocking	e148767a9b	Support unpad lse layout for splitkv	2024-08-06 14:57:58 +00:00
rocking	e6c489df49	Support unpad layout for group lse	2024-08-06 14:25:22 +00:00
danyao12	3efdb593b8	unpadded lse&d for group mode	2024-08-06 14:47:55 +00:00
danyao12	3d5b0755ef	non-iglp pipeline for headdim padding cases	2024-08-02 10:59:52 +00:00
danyao12	5d2a5a1131	more strides for fa integration	2024-07-30 10:57:22 +00:00
danyao12	ad3e94bbaa	fwd dropout revert	2024-07-28 17:51:48 +08:00
danyao12	7e9d2390cc	dq_acc stride stuff	2024-07-27 16:23:16 +08:00
danyao12	99ed2c1ae3	code sync up	2024-07-26 18:41:18 +08:00
danyao12	3552041a70	Merge branch 'develop' into ck_tile/fa_bwd_opt	2024-07-26 18:05:24 +08:00
Dan Yao	e892711040	Merge pull request #1419 from ROCm/ck_tile/fa_bwd_opt_clean Remove duplicated codes for creating WarpGemm	2024-07-26 12:20:48 +08:00
danyao12	ed8ef7e58f	dropout patch for mrepeat 16*16	2024-07-26 12:10:43 +08:00
Qianfeng Zhang	5a561b5e1d	Remove duplicated WarpGemm definitions in the policy file	2024-07-25 16:36:37 +00:00
zjing14	105bd708c7	Add rotating buff for gemm_multi_d (#1411 ) * add rotating_buff for gemm_multi_d * format * Update flush_cache.hpp * Update gtest.cmake --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: Haocong WANG <haocwang@amd.com>	2024-07-25 23:21:21 +08:00
danyao12	94c957b3db	revert	2024-07-25 20:51:02 +08:00
danyao12	dcc3593fe4	fix hd32 error and boost performance	2024-07-25 16:16:30 +08:00
Andriy Roshchenko	4a8a1befd5	Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412 ) * Add CMakePresets configurations. * Add binary elementwise ConvScaleAdd and an example. * Numerical verification of results. Observed significant irregularities in F8 to F32 type conversions: ```log ConvScaleAdd: float=145.000000 f8_t=160.000000 e=144.000000 ConvScaleAdd: float=97.000000 f8_t=96.000000 e=104.000000 ConvScaleAdd: float=65.000000 f8_t=64.000000 e=72.000000 ``` * Implemented ConvScaleAdd + Example. * Add ConvScale+Bias Instances * Add Client Example for ConvScale+Bias * Fix number of bytes in an example.. * Cleanup.	2024-07-24 15:49:55 -05:00
Bartłomiej Kocot	ffabd70a15	Add support for half_t and bfloat to reduction operations (#1395 ) * Add support for half_t and bfloat to reduction operations * Fix bhalf convert * Next fix bf16	2024-07-24 12:12:37 -05:00
danyao12	b2510c0541	fix dq alignment	2024-07-24 12:28:41 +08:00
danyao12	da2dce1866	group convert_dq opt	2024-07-23 14:16:21 +08:00
danyao12	b69499b933	fix fwd dropout	2024-07-23 10:20:12 +08:00
Bartłomiej Kocot	5d8c3d8190	Revert Support access per groups and filter2x3 in grouped conv fwd (#1382 ) (#1406 )	2024-07-22 14:21:24 +02:00
danyao12	260ace4b78	code cleanup	2024-07-22 11:35:34 +08:00
danyao12	9b4b4622c1	code cleanup	2024-07-22 11:06:04 +08:00
danyao12	06f575a3b7	refactor dropout	2024-07-21 17:34:40 +08:00
danyao12	99436cd4c6	save clear_tile	2024-07-20 17:21:12 +08:00
danyao12	b3100b6f43	remove FmhaBwdTilePartitioner	2024-07-20 16:09:14 +08:00
danyao12	9d78a6c506	comments	2024-07-20 15:00:02 +08:00
danyao12	42a7240a19	do{}while() -> while(){}	2024-07-20 14:52:35 +08:00
Haocong WANG	8c90f25be3	[GEMM] F8 GEMM, performance optimized. (#1384 ) * add ab_scale init support * enabled interwave * add scale type; update isSupport * adjust example * clean * enable f8 pure gemm rcr ckprofiler * Add gemm_multiply_multiply instances * clang format * Optimize for ScaleBlockMNK=128 * enable abscale f8 gemm ck profiler * Add pure f8 gemm test suite * Reverting to the state of project at `f60fd77` * update copyright * clang format * update copyright --------- Co-authored-by: root <jizhan@amd.com>	2024-07-19 22:06:52 +08:00
ltqin	c544eb4da0	Universal gemm splitk using reduce (with multi-d) (#1341 ) * init for reduce_threadwise multi_d * add reduce_threadwise_multi_d * add reduce_multi_d * clean * start add an other splitk device op * add reduce template parameter to SplitKBatchOffset * add reduce c matrix * clean up code * change example data type to bf16 * add bf16Ai8B example * remove reduce template parameter * add splitk atomic status to v4 * example add multi d parameters * device op add multi-d parameters * add multi-d to reduce * fix kbach=1 bug * change B layout to col in bf16Ai8B example * remove float adding struct * change multi-d interface * change file and class name * remove multi-d of bf16Ai8B example * change IsReduce function to IsReduceAdd * change example layout to RRR from RCR * according layout to set ds stride * reset parameter layout * add gemm universal reduce instance * add reduce factory * add profile_gemm_universal_reduce * add reduce to profiler * fix reduce instance * fix profiler reduce compiling bug * format * format library instance code * add mem instance for reduce library * fix call instance names * add workspace for reduce in ckProfiler * format * add mnpading to reduce library instance * add fp16 instance to reduce of profiler * change copyright time * restore profiler cmake file * add reduce text to instances * add DsLayout and DsDataType to instances template parameter * fixed gemm_reduce_multi_d * add an example without multi_d * Update common.hpp * Update gtest.cmake * Update gemm_xdl_splitk_reduce_bf16.cpp * clean * Update gtest.cmake * format * fixe api * format * default parameter change to RRR * add vector_len for multi_d * format * Update gtest.cmake * fix bf16A iBB elementwiseop * add ReduceDataType * move ReduceDataType to end position * format * remove googletest git method address * fix copyright time * update init data --------- Co-authored-by: root <jizhan@amd.com> Co-authored-by: letaoqin <letaoqin@amd.com> Co-authored-by: Jing Zhang <jizhan@meta.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2024-07-19 22:01:22 +08:00
Bartłomiej Kocot	70a814f163	Refactor transform conv to gemm fwd (#1391 ) * Refactor transform conv to gemm fwd * fixes codegen * wmma fixes * fix wmma * Fix copyright	2024-07-19 09:29:25 +02:00
danyao12	1d7099b6c4	fix hd256 dropout scratch	2024-07-19 15:08:45 +08:00
danyao12	a67bdd6349	simplify convert dq	2024-07-18 09:55:52 +08:00
Qianfeng	ee768148f0	Replace the using of __expf by __ocml_exp_f32 to work-around the test_softmax_rank4 failure (#1394 )	2024-07-17 09:15:05 -07:00
Andriy Roshchenko	802a8a1df1	Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386 ) * Add CMakePresets configurations. * Add ConvScale+ReLU Functor and an Example * Account for ReLU FLOPs. * Add instances of 3D convolutions with ConvscaleRelu operation. * Implement Client Example * Cleanup	2024-07-16 08:51:49 -07:00
danyao12	fb26ec5d01	hd256 bias support	2024-07-16 13:48:56 +08:00
danyao12	237c93c85b	bias support	2024-07-15 12:23:27 +08:00
Bartłomiej Kocot	82e8a78a3f	Support access per groups and filter3x3 in grouped conv fwd (#1382 ) * Support access per groups and filter3x3 in grouped conv fwd * Fixes for large cases * Fixes for large tensors	2024-07-12 11:08:42 -07:00
danyao12	ca4a9f008b	reorder files	2024-07-12 18:10:20 +08:00
danyao12	39fc3d4b2e	fix group deterministic bugs	2024-07-11 17:34:59 +08:00
danyao12	8c967d76d1	fix batch deterministic bugs	2024-07-11 16:33:34 +08:00
danyao12	74f1516c5b	tmp save	2024-07-10 18:05:30 +08:00
carlushuang	8182976c37	[CK_TILE] wa prec, remove sgpr offset for inline asm (#1356 ) * wa prec, remove sgpr offset for inline asm * macro for set tile * ignore unused param if no kernel instances in host API * fix more prec issue * cache buffer resource * fix * support pre-nop * clear tile by vector type members * add workaround to reduce scratch memory * conditionally enable workaround code * enable workaround start from certain build version * fallback set_tile() implementation from certain build version * undo template argument changes * put dummy asm in load_raw() * fix comments, refactor s_nop inside buffer_load --------- Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>	2024-07-08 11:09:55 -07:00
Harisankar Sadasivan	75e622f02f	Universal streamk with atomics (#1360 ) * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1	2024-07-05 21:40:30 -07:00
jakpiase	eaa870a1ab	Add structural sparsity xdlops (#1363 ) * Implemented smfmac xdlops * add reviewer comments	2024-07-04 12:00:14 +02:00
Jun Liu	959073842c	Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372 )	2024-07-03 23:34:38 -07:00
jakpiase	ed21948bcd	Add structural sparsity gemm instruction tests (#1309 ) * first version of smfmac test * add reviewer comments * add reviewer suggestions	2024-06-27 11:30:32 +02:00

509 Commits