composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-07 00:04:37 +00:00

Author	SHA1	Message	Date
apoorva	9b64da2298	Added wrapper and renamed the wmma_v3 instances	2025-07-08 11:26:01 +00:00
apoorva	86ca6b827d	Removed the old wmma instances.	2025-07-08 11:23:33 +00:00
apoorva	f5843dd22b	Added v3 instances for gemm_add_relu	2025-07-01 12:37:46 +00:00
apoorva	6a116fa958	Modified the template parameters to make the instances work.	2025-07-01 11:23:44 +00:00
apoorva	cdaff7f210	Added instances to Cmake	2025-07-01 11:23:43 +00:00
apoorva	6f89183179	Added f16 wmma instance and corrected bf16 instance errors.	2025-07-01 11:23:17 +00:00
apoorva	35aab35d96	Added bf16 wmma instance for add_relu	2025-07-01 11:23:17 +00:00
Zoltán Lakatos	686df332e2	Resolve "Implement `device_gemm_bilinear` for RDNA4"	2025-06-26 06:48:38 +00:00
Zoltan Lakatos	7424b4a0f8	fixed ONLY_XDL_AND_WMMA_KERNELS tag	2025-06-17 19:50:07 +00:00
Zoltan Lakatos	b4d3e4112c	added tests for all splitk fastgelus	2025-06-12 11:43:26 +00:00
Zoltan Lakatos	264e1b238a	created fastegelu instances	2025-06-11 14:14:43 +00:00
Zoltan Lakatos	4c8ea9517d	created gemm_add_add_fastgelu instances	2025-06-11 12:46:42 +00:00
Zoltan Lakatos	aeca8efdea	revert unintended change in xdl add_fastgelu	2025-06-11 07:50:53 +00:00
Zoltan Lakatos	ea9805b580	added add_fastgelu instances	2025-06-10 17:23:44 +00:00
Anton Gorenko	381c02d06a	Add gemm_add_fastgelu instances and test	2025-06-04 12:34:26 +05:00
Anton Gorenko	52b4860a30	WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230) * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip * Added support for F8xF16xF16 to gemm_wmma_universal * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Ammending changes for adding support for padding instances for f16xf16xf16 * Fixes for padding instances for f16xf16xf16 * Added padding instances for bf16xbf16, f8xf8 * Added packed instances for bf16xi4xbf16 * Added padding instances for f8xf16xf16 * Added padding instances for f16xf8xf16, f16xi4xf16 * Fixed typos for bf16xbf16xbf16 padding instances * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure.
* Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. 
* Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Fix int8 DTYPES check for gemm_bilinear --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>	2025-06-04 12:22:33 +06:00
Illia Silin	b76fdbe47f	Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274) * upgrade to rocm6.4.1 and use gfx1x-generic targets * add rocm version parsing * fix the gfx10-3-generic syntax in cmake	2025-06-03 07:17:35 -07:00
Bartłomiej Kocot	e7906dd644	Change relu to clamp for grouped conv fwd instances (#2249)	2025-05-29 00:51:25 +02:00
Adam Dickin	6df1c56ad6	Changes to allow MIOpen to build CK as part of its build. (#2247) * tweaks to the miopen specific build. add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has. * move the tidy if statment --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-05-28 13:51:15 -07:00
BrianHarrisonAMD	e91be7d96a	Add option to disable offload compress for CK builds (#2250) * Add option to disable offload compress for CK builds * Remove gemm exe offload compress flag conditional	2025-05-28 13:47:56 -07:00
Bartłomiej Kocot	b1ed92b131	Revert "Remove not needed bwd wei merged groups instances (#2218)" (#2235) This reverts commit `4583aeffad`.	2025-05-26 23:26:04 +02:00
Bartłomiej Kocot	4583aeffad	Remove not needed bwd wei merged groups instances (#2218) * Grouped conv bwd wei add two stage instances for larger filter and Merge Groups * Fix * fix * Revert "Restore oddc instances (#2201)" This reverts commit `6342f6b5e8`. * fix --------- Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-05-26 22:46:18 +02:00
Illia Silin	bc2551ac3b	disable building device_mha_operations by default (#2225)	2025-05-22 14:03:04 -07:00
Adam Dickin	417a6b65b6	Add MIOPEN_REQ_LIBS_ONLY option for cmake to build only the libs MIOpen requires (#2224) * cut out anything we dont need for MIOpen to test * refactor exclusion code to be more streamlined.	2025-05-22 11:14:33 -07:00
Bartłomiej Kocot	ebc5a6ef87	Grouped conv bwd wei add for larger filter and Merge Groupes optimization (#2197) * Grouped conv bwd wei add two stage instances for larger filter and Merge Groups * Fix * fix * Restore removed instances --------- Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-05-21 22:47:34 +02:00
Thomas Ning	1386924749	Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212) * Add small instance, add the bug fix, & improve the example CMake * clang format	2025-05-20 15:05:08 -07:00
Bartłomiej Kocot	6342f6b5e8	Restore oddc instances (#2201)	2025-05-16 18:42:02 -07:00
Illia Silin	40668c9a99	Build and store CK library deb package for all targets daily. (#2196) * generate and store library package for all targets * use ninja to build packages for all targets * make sure to use ftime-trace when using ninja * make sure build trace only runs on gfx9 * archive lib package and stash only library package	2025-05-16 07:40:53 -07:00
Thomas Ning	b49f7de81f	Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166) * make the work compiled * Solved the example code, but still have the profiler error * Finished the feature * Clang format and update the CHANGELOG * solve the preshuffle v1 & v2 problem * Comment Addressed * Comment Addressed	2025-05-12 09:52:58 -07:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
Bartłomiej Kocot	397b9080a2	Move 16x16 grouped conv fwd instances from comp header (#2165) * Move 16x16 grouped conv fwd instances from comp header * Improvements	2025-05-07 08:04:31 -07:00
Andriy Roshchenko	79beaacdd1	Restrict MX GEMM instantiation to GFX950 arch (#2157)	2025-05-05 08:18:22 -07:00
Andriy Roshchenko	79b0bfeb41	MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152) * Add gemm_mx_fp8_bf8 example with row-major B * Add more overloads of MX MFMA instructions * Add MK_KN (RRR) tests * Add KM_NK (CCR) tests * Add more problem sizes to Large tests * Add test_gemm_mx to the list of regression tests	2025-05-01 11:55:48 -06:00
Bartłomiej Kocot	23de234dbe	Add grouped conv fwd 16x16 mfma instruction instances (#2140) * Add grouped conv fwd 16x16 mfma instruction instances * fix * remove oddc * fix * fix	2025-04-30 09:49:37 +02:00
Bartłomiej Kocot	4094ad158a	Integrate universal gemm with conv bwd data and add SplitK (#1315) * Integrate universal gemm with conv bwd data * Fix multi d kernel * Add splitK support * instances refactor * instances refactor * refactor * fixeS * fixes * 16x16 instnaces * Fixes * Fix * Fix * Fix * Fix * Fix * Fixes * fix * fix	2025-04-28 23:54:49 +02:00
Anton Gorenko	edd92fc546	DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096) * Prepare files for DeviceGemm_Wmma_CShuffleV3 * Implement main part of CShuffleV3 with block pipeline v3 for WMMA * Remove unused functions and template params for A/B descriptors * Support both gfx11 and gfx12 * Enable SplitK for gfx12 and disable for gfx11 * Added RowColRow layout for DeviceGemmV2 fp16 * Added more instances for Row, Col, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout * Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout * Fix formatting * Add documentation Based on `e5ad48a784` * Enable gemm_universal profiling for gfx11/12 * Add WMMA intrinsics for F8/BF8 * Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances * Add BF16 instances and tests * Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8 --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com>	2025-04-28 10:14:21 +05:00
Andriy Roshchenko	213b203a3c	MX GEMM - Parameterized Test Template (#2088) * Tests for MX FP8 GEMM * Improve documentation	2025-04-16 19:56:00 -06:00
aledudek	7c32652e03	Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 (#2069) * Part1 * Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 * Add missing coma * Add missing cpp instance files * Fix 3d layout * Add missing closing bracket * Add missing comp x2 and part2 instances * Fix typo in instance name * fix * Fix --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-04-16 11:00:55 +02:00
Illia Silin	3e6d21adeb	enable gfx115x support (#2065)	2025-04-09 10:06:42 -07:00
Khushbu Agarwal	263ff689e0	New instances for gemm_multiply_multiply_weightpreshuffle operator (#2061) * Add new instances for weight_preshuffle for f8->bf16 * Add new instances for weight_preshuffle for f8->f16 * clang formatted --------- Co-authored-by: Khushbu Agarwal <khuagar@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-04-08 15:14:53 -07:00
Bartłomiej Kocot	2ccf914888	Add support for GKCYX grouped conv weight (#2023) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix	2025-04-02 23:59:49 +02:00
Bartłomiej Kocot	8c0ab61ece	Grouped conv backward data GKCYX support (#2029) * Grouped conv backward data GKCYX support * profiler * Converter * split instances	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	6355ee7ca5	Improve compilation time for grouped conv fwd (#2039) * Improve compilation time for grouped conv fwd * Fix	2025-04-01 07:11:42 -07:00
Muhammed Emin Ozturk	dd4c12b155	f8/bf16 GEMM Stream-K (#1879)	2025-03-31 20:30:17 -06:00
jefyang1	16b15e336a	Fix gemm universal and grouped_conv_fwd test failures on gfx950 (#2031)	2025-03-31 09:20:52 -07:00
Bartłomiej Kocot	54c81a1fcf	Add support for GKCYX grouped conv fwd (#2015) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes	2025-03-26 21:13:38 +01:00
Bartłomiej Kocot	c2e4898b4b	Grouped conv bwd data NGCHW (#1967) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example	2025-03-17 13:32:00 +01:00
Haocong WANG	cbd74c2d12	[Block Scale GEMM] Optimized block scale gemm (#1950) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-11 10:11:21 -07:00
Thomas Ning	c954bd0cfa	Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955) * tempsave, not selected * finish the feature and merge with develop --------- Co-authored-by: aska-0096 <haocwang@amd.com>	2025-03-07 13:44:06 -08:00

386 Commits