composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-07 00:04:37 +00:00

Author	SHA1	Message	Date
PoYen, Chen	8ac6bacf26	Unify CMakeLists.txt coding style	2024-07-08 10:19:31 +00:00
PoYen, Chen	5d21b4d736	Merge branch 'feature/cond-add-splitkv' into feature/fmha-fwd-appendkv	2024-07-08 10:18:28 +00:00
PoYen, Chen	6ca3910199	Show message if we are ignoring option	2024-07-08 10:17:55 +00:00
PoYen, Chen	fe4ae5dcd9	Early return if 0 < s_k_new is not supported	2024-07-08 10:09:36 +00:00
PoYen, Chen	be076db91c	Merge branch 'feature/cond-add-splitkv' into feature/fmha-fwd-appendkv	2024-07-08 10:03:58 +00:00
PoYen, Chen	aba46cd655	Regsiter API handlers automatically	2024-07-08 09:39:15 +00:00
PoYen, Chen	3aefb560e0	Remove "EXAMPLE_" prefix of cmake variables	2024-07-08 07:17:24 +00:00
PoYen, Chen	1c070380fa	Merge branch 'feature/cond-add-splitkv' into feature/fmha-fwd-appendkv	2024-07-08 07:13:34 +00:00
PoYen, Chen	82f3b3d0a0	Conditionally add call to fmha_fwd_splitkv()	2024-07-08 06:40:18 +00:00
PoYen, Chen	efd18fa887	Conditionally add fwd_splitkv API in fmha_fwd example	2024-07-08 06:27:44 +00:00
Harisankar Sadasivan	75e622f02f	Universal streamk with atomics (#1360 ) * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1	2024-07-05 21:40:30 -07:00
Jun Liu	959073842c	Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372 )	2024-07-03 23:34:38 -07:00
Ruturaj Vaidya	2525864fda	Update CMakeLists.txt (#1364 ) It is a good practice to check if the file CMakeLists.txt is in fact in the directory.	2024-06-27 12:34:25 -07:00
Illia Silin	941d1f7ce0	Merging the gfx12 code into public repo. (#1362 )	2024-06-27 00:33:34 -07:00
PoYen, Chen	8fb567c286	Fix vnew append errro	2024-06-26 17:00:07 +00:00
Po Yen Chen	0cb2e06ddc	[CK_TILE] fmha forward split-kv + combine kernels (#1338 ) * FA fwd dropout * FA bwd * epilogue reuse * CMakeLists update * [CK_TILE] support alibi (#1269) * add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> * now fwd/bwd can build * bwd alibi * add bwd validation stream_config * update generated filenames * update bwd kernel launch * CK_TILE_HOST_DEVICE in philox * Transpose -> transpose * format * format * format * Generate the instance for FA required * format * fix error in WarpGemm * Add num_splits option and dummy split-kv api method * Generate fmha_fwd_splitkv() * Add SplitKV kernel codegen logics * Add SplitKV combine kernel codegen logics * Fix mismatched return type * Clean-up code * Replace sentinel value before storing * Fix wrong layout of LSE/LSEacc/Oacc * Format codes * Fix o_acc memory error * Fix wrong kBlockSize used in policy * Reduce # of combine kernels * Fix split-kv combine kernel name * Fix wrong LDS indexing logics * Fix wrong loop counter step logic * Undo vector size changes * Remove no-longer used field * Remove in-consistent comment * Remove debug statements in example * Remove more debug statements * Add constness to local variables * Clearn up generate.py * Fix unstable clang-format comment * Remove unused include directive * Use shorter template parameter name * Enable non-split-kv blobs * Update license date * Print num_splits conditionally * Undo disabling data types * Remove unnessary tile size for fp8 * Fix wrong pipeline args for fp8 * Fix example output format * Remove more debug code in combine pipeline * Add stride kernel arguments for LSE/O acc workspace * Re-order split-kv pipeline call operator arguments * Pass LSE/O strides in kernel argument * Re-order pipeline call operator arguments * Use tensor_descriptor to locate LSEacc elements * Support providing invalid element for tensor view * Set invalid element value for LSEacc tensor view * Remove hand-written store_tile() code * Remove necessary value-overwrite logic * Add transposed lds descriptor * Support load_tile() for tile_window_with_static_lengths<> * Undo removing necessary value-overwrite logic * Use read descriptor to locate lds elements * Simplify pipeline source code * Add constraint to kMaxSplits * Default use kMaxSplits=64 in generate.py * Revert "Add constraint to kMaxSplits" This reverts commit `0a2132d758`. * Revert "Default use kMaxSplits=64 in generate.py" This reverts commit `c7d9c80b77`. * Decide alignment by the padding parameter * Remove no-longer used utility functions * Remove not-working code * Add comment & remove no-longer used code * Fix computation errors * Add heuristic to override num_splits option * Add constraint to kMaxSplits * Fix compilation error * Clean up pipeline code * Wrap pointer access as lambda function * Rename confusing methods * Use kLogMasSplits as template parameter * Finish splitkv combine kernel codegen * Update kMaxSplits limit * Use smaller kM0 for splitkv combine kernel * Ignore droupout flag in splitkv pipeline * Unify flag usage * Add back flag kStoreLSE * Merge lambda calls in pipeline * Fix compilation errors * Avoid all empty splits * Always check for empty loop in splitkv pipelines * Re-order parameters * Remove redundant p_drop option check * Add traits/problem for fwd splitkv kernel * Conditionally enable uneven split boundary checks * Add comment for the splitkv traits field * Change even split criteria * Re-order statements * Refine occupancy value for hdim=128&256 * Refine occupancy value for hdim=32&64 * Remove redundant kernel argument * Separate fmha bwd codegen logics * Separate fmha fwd codegen logics * Remove redundant direction parameter in fwd&bwd codegen logics * Support generate multiple APIs for an example * Let 'api' an alias of 'direction' option * Remove choices for the 'direction' option * Use dictionary to config all the functions * Move fmha splitkv codegen logics to other file * Add fwd_splitkv api for tile_example_fmha_fwd --------- Co-authored-by: danyao12 <danyao12> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: Jing Zhang <jizhan@amd.com>	2024-06-26 17:41:15 +08:00
PoYen, Chen	4e6c28522c	Fix wrong K values after appending	2024-06-25 10:12:13 +00:00
PoYen, Chen	1ac17dae50	Add knew/vnew tensors to the kernel argument	2024-06-25 07:56:36 +00:00
PoYen, Chen	344902732a	Sync kernel name with the codegen	2024-06-24 14:50:25 +00:00
PoYen, Chen	eee035ade5	Setup meaningfull arguments	2024-06-24 14:34:31 +00:00
PoYen, Chen	bace0e5df0	Add init codegen logic for fmha fwd appendkv	2024-06-24 12:33:51 +00:00
rocking	cb13839425	layernorm2d forward (#1339 ) * Add layernorm2d forward * Refind file path * clang format * Exclude ck_tile op from all * use add_executable instead * refactor layernorm2d_fwd example --------- Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-06-24 08:45:52 +08:00
PoYen, Chen	4060416c97	Use dictionary to config all the functions	2024-06-23 19:22:25 +00:00
PoYen, Chen	d0b9fd0c5c	Merge branch 'develop' into feature/refactor-fmha-codegen	2024-06-23 18:54:08 +00:00
Andriy Roshchenko	05b10e0e5a	Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326 ) We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation. This commit handles bf8@bf8->fp8 data types combination. * Included an example. * Added instances. * Added a client example. --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-06-21 19:02:57 -06:00
PoYen, Chen	9fe165fe1d	Remove choices for the 'direction' option	2024-06-21 12:45:28 +00:00
PoYen, Chen	2d57c34348	Let 'api' an alias of 'direction' option	2024-06-21 10:38:25 +00:00
PoYen, Chen	51487f238a	Support generate multiple APIs for an example	2024-06-21 10:04:17 +00:00
PoYen, Chen	925d25ff47	Remove redundant direction parameter in fwd&bwd codegen logics	2024-06-21 09:41:06 +00:00
PoYen, Chen	0ef4fb2a16	Separate fmha fwd codegen logics	2024-06-21 09:22:15 +00:00
PoYen, Chen	070e89d7ad	Separate fmha bwd codegen logics	2024-06-21 08:35:15 +00:00
ThruptiRajLakshmanaGowda	0162a5f6ba	Adding Missed Activation Functions for Grouped 2D/3D Convolutions (#1348 ) * Initial Push * First Push * Fixed Clang format * Resolve merge conflict * Addressed review comments * Addressed review comments * Addressed review comments	2024-06-20 09:24:54 -05:00
jakpiase	e2d139201b	Switch to universal gemm in grouped gemm tile loop (#1335 ) * switch to universal gemm in grouped gemm tile loop * minor fixes * add reviewers comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-06-18 09:01:49 -05:00
Rostyslav Geyyer	acda4c5a3c	Add instances for grouped conv fwd 3d with ConvScale for fp8@bf8->fp8 (#1325 ) * Add fp8 bf8 conv example * Add instances * Add client example * Add random scale values * Format	2024-06-12 14:41:56 -05:00
Rostyslav Geyyer	ce66277a76	Add a convinvscale op, related instances and examples (#1307 ) * Update the element op * Add an example * Add instances * Add a client example * make sure new instances only build on gfx9 * Update element op and its handling * Format * Update instances to take element op as an argument * Update examples to use random scale values * Format * Update client example with random scales * Format --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-06-10 14:48:49 -05:00
Rostyslav Geyyer	cb0645bedc	Add a scale op, related instances and examples (#1242 ) * Add a scale op * Update the element op * Add instances * Add an example * Add a client example * Add a flag check * Revert flag check addition * Fix flag check * Update d strides in example * Update d strides in client example * Apply suggestions from code review Update copyright header Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Move the example * Move the client example * Update element op * Update example with the new element op * Add scalar layout * Update example * Update kernel for scalar Ds * Revert kernel changes * Update element op * Update example to use scales' pointers * Format * Update instances * Update client example * Move element op to unary elements * Update element op to work with values instead of pointers * Update instances to take element op as an argument * Update examples to use random scale values --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-06-04 19:28:15 -05:00
Dan Yao	2cab8d39e3	CK Tile FA Training kernels (#1286 ) * FA fwd dropout * FA bwd * epilogue reuse * CMakeLists update * [CK_TILE] support alibi (#1269) * add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> * now fwd/bwd can build * bwd alibi * add bwd validation stream_config * update generated filenames * update bwd kernel launch * CK_TILE_HOST_DEVICE in philox * Transpose -> transpose * format * format * format * Generate the instance for FA required * format * fix error in WarpGemm --------- Co-authored-by: danyao12 <danyao12> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Jing Zhang <jizhan@amd.com>	2024-06-04 13:12:45 -05:00
zjing14	6fb1f4e03f	Post-merge fix of PR 1300 (#1313 ) * add f8 gemm with multiD for both row/col wise * change compute_type to fp8 * changed tuning parameters in the example * add rcr example * post-merge fix * fix * reduce init range	2024-05-31 22:46:41 -07:00
zjing14	80db62f08d	add f8 gemm multiD with both row/col wise scale (#1300 ) * add f8 gemm with multiD for both row/col wise * change compute_type to fp8 * changed tuning parameters in the example * add rcr example	2024-05-28 12:04:22 -05:00
carlushuang	5055b3bdcb	[CK_TILE] support group from cmdline (#1295 ) * support cmdline seqlen decode * silent print * update readme * update kernel launch 3d * update tile partitioner * fix spill for bf16 * modify based on comment * modify payload_t * fix bug for alibi mode * fix alibi test err * refactor kernel launch, support select timer * add missing file * remove useless code * add some comments	2024-05-28 11:13:21 +08:00
Illia Silin	7b027d5643	Select appropriate GPU targets for instances, tests, and examples. (#1304 ) * set individual gpu targets for instances, examples, tests * fix path to hip compiler * fix path to hip compiler once more * aggregate device macros in ck_tile config header * fix the cmake logic for instances * fix clang format * add gfx900 and gfx906 to default set of targets	2024-05-22 11:45:27 -07:00
Illia Silin	7843a8a7fb	re-enable convnd_fwd_xdl_fp64 testing (#1289 )	2024-05-10 22:48:28 -07:00
Illia Silin	566b6480a2	Code clean-up (#1285 ) * code clean-up * remove the profiling output samples	2024-05-10 09:41:39 -07:00
carlushuang	fcba889ef4	[CK_TILE] fix some rand number init (#1287 ) * add random norm * normalized default to 0/3 * change squant->auto	2024-05-10 09:03:39 -07:00
Adam Osewski	a0ae1c6133	Fix MakeArgument (#1284 )	2024-05-09 09:42:41 -07:00
carlushuang	851c3ed157	[CK_TILE] support alibi (#1269 ) * add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by: rocking <ChunYu.Lai@amd.com>	2024-05-07 22:32:54 +08:00
Adam Osewski	0f7e8ec485	Fix example CMakeLists.txt (#1267 ) Add proper dependency target.	2024-04-30 08:28:19 -07:00
Haocong WANG	764164b488	[GEMM] UniversalGemm update (#1262 ) * Add bf16 instances * Add bf16 gemm universal example * tempsave * Add guard to navi compilation * workground on a specific mixed gemm instance ( bring back it when compiler fix upload) * fix formatting condition statement issue * solve conflict --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com>	2024-04-26 12:56:07 -05:00
zjing14	0d0150db20	bf16A_Int8B with fastgelu/bias (#1264 ) * changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Added LDSType * profiling * adjust oobcheck * add missing file * refactor * clean * add examples	2024-04-26 07:26:30 -05:00
Adam Osewski	b4032629e5	Grouped GEMM Multiple D tile loop. (#1247 ) * Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Validate on host only NK dims if M is not known. * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting.	2024-04-25 15:12:53 -05:00

1 2 3 4 5 ...

349 Commits