composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-13 17:55:48 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	050cad09b5	Grouped Convolution Backward Weight Explicit GEMM (#2282) * Grouped conv bwd weight explicit gemm * 3d * cmake fixes * fix test * fix	2025-06-06 10:30:08 +02:00
Andriy Roshchenko	00247e3c29	Optimized GEMMs for MX FP4/8 (#2294 ) Adds V3 GEMM pipeline for MX FP4 and MX FP8 Adds V3 GEMM pipeline for MX FP4 with preshuffling Adds MXFP4 GEMM tests (#2275) Adds MXFP4 GEMM examples Adds MXFP4 GEMMs to ckProfiler Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-06-05 13:54:15 -06:00
Anton Gorenko	52b4860a30	WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 ) * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip * Added support for F8xF16xF16 to gemm_wmma_universal * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Ammending changes for adding support for padding instances for f16xf16xf16 * Fixes for padding instances for f16xf16xf16 * Added padding instances for bf16xbf16, f8xf8 * Added packed instances for bf16xi4xbf16 * Added padding instances for f8xf16xf16 * Added padding instances for f16xf8xf16, f16xi4xf16 * Fixed typos for bf16xbf16xbf16 padding instances * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. 
* Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. 
* Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Fix int8 DTYPES check for gemm_bilinear --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>	2025-06-04 12:22:33 +06:00
Bartłomiej Kocot	6e5acee0f9	Add Clamp/Relu bf16/fp16 cast fixes (#2279) * Add Clamp/Relu bf16/fp16 fixes * fix	2025-06-03 18:31:46 +02:00
Illia Silin	b76fdbe47f	Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274) * upgrade to rocm6.4.1 and use gfx1x-generic targets * add rocm version parsing * fix the gfx10-3-generic syntax in cmake	2025-06-03 07:17:35 -07:00
Bartłomiej Kocot	e7906dd644	Change relu to clamp for grouped conv fwd instances (#2249)	2025-05-29 00:51:25 +02:00
Adam Dickin	6df1c56ad6	Changes to allow MIOpen to build CK as part of its build. (#2247 ) * tweaks to the miopen specific build. add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has. * move the tidy if statment --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-05-28 13:51:15 -07:00
BrianHarrisonAMD	e91be7d96a	Add option to disable offload compress for CK builds (#2250) * Add option to disable offload compress for CK builds * Remove gemm exe offload compress flag conditional	2025-05-28 13:47:56 -07:00
Bartłomiej Kocot	b1ed92b131	Revert "Remove not needed bwd wei merged groups instances (#2218)" (#2235) This reverts commit `4583aeffad`.	2025-05-26 23:26:04 +02:00
Bartłomiej Kocot	4583aeffad	Remove not needed bwd wei merged groups instances (#2218 ) * Grouped conv bwd wei add two stage instances for larger filter and Merge Groups * Fix * fix * Revert "Restore oddc instances (#2201)" This reverts commit `6342f6b5e8`. * fix --------- Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-05-26 22:46:18 +02:00
Illia Silin	bc2551ac3b	disable building device_mha_operations by default (#2225)	2025-05-22 14:03:04 -07:00
Adam Dickin	417a6b65b6	Add MIOPEN_REQ_LIBS_ONLY option for cmake to build only the libs MIOpen requires (#2224) * cut out anything we dont need for MIOpen to test * refactor exclusion code to be more streamlined.	2025-05-22 11:14:33 -07:00
Bartłomiej Kocot	ebc5a6ef87	Grouped conv bwd wei add for larger filter and Merge Groupes optimization (#2197 ) * Grouped conv bwd wei add two stage instances for larger filter and Merge Groups * Fix * fix * Restore removed instances --------- Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-05-21 22:47:34 +02:00
Thomas Ning	1386924749	Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212) * Add small instance, add the bug fix, & improve the example CMake * clang format	2025-05-20 15:05:08 -07:00
Andriy Roshchenko	57e0f5df29	MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199 ) * Unify test interface for different layouts. * WIP: Introducing FP4/FP6/FP8 abstractions * WIP: Introducing packed storage abstraction * WIP: Introducing packed storage abstraction * WIP: Improved support for FP6 data type * Refactor packed storage for f6_t * WIP: FP6 MFMA test * Test if we correctly represent all FP6/FP4 numbers * Additional output for failed FP4 test. * More failing conversion tests * Even more failing conversion tests * Working FP6 MFMA tests * Expand MX MFMA testing to BF8/6 * Update and verify MX MFMA test for packed types * Fix fp4 and fp6 conversions on host * Working MX MFMA tests for FP8/6/4 * Cleanup * Add missing type * Cleanup * Final cleanup * Restrict FP6/4 values output to CK_LOGGING=1 * Use CHAR_BIT instead of number 8 * Fix typo * Remove FP6 and FP4 from the list of native types --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>	2025-05-19 16:52:51 -05:00
Bartłomiej Kocot	6342f6b5e8	Restore oddc instances (#2201)	2025-05-16 18:42:02 -07:00
Illia Silin	40668c9a99	Build and store CK library deb package for all targets daily. (#2196 ) * generate and store library package for all targets * use ninja to build packages for all targets * make sure to use ftime-trace when using ninja * make sure build trace only runs on gfx9 * archive lib package and stash only library package	2025-05-16 07:40:53 -07:00
Bartłomiej Kocot	7c0e29cc0f	Extend 64x64 with 4 waves instances for grouped conv bwd wei (#2187 ) * Extend 64x64 with 4 waves instnaces for grouped conv bwd wei * Fix * fix * fix	2025-05-15 16:21:34 +02:00
Thomas Ning	b49f7de81f	Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166 ) * make the work compiled * Solved the example code, but still have the profiler error * Finished the feature * Clang format and update the CHANGELOG * solve the preshuffle v1 & v2 problem * Comment Addressed * Comment Addressed	2025-05-12 09:52:58 -07:00
Bartłomiej Kocot	6fddb5708c	Add grouped conv fwd bias relu instances (#2179) * Add grouped conv fwd bias relu instances * fixes * fix	2025-05-09 22:52:34 +02:00
Andriy Roshchenko	cb27e7c77f	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures	2025-05-08 13:26:03 -06:00
Bartłomiej Kocot	397b9080a2	Move 16x16 grouped conv fwd instances from comp header (#2165) * Move 16x16 grouped conv fwd instances from comp header * Improvements	2025-05-07 08:04:31 -07:00
Rostyslav Geyyer	8a0d659f92	Add FP4 MX MFMA tests (#2151 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add ops * Add tests * Add missing utils * Update reference mx gemm * Add f4x2 init mode * Update host tensor utils * Update chunk size for f4x2 * Add non scaled ops * Add a type utility * Update non scaled reference kernel * Add non scaled tests * Debug mfma arguments * Add more debug info * Update chunk size * Update data layout * Add more debugging * Fix B stride * Fix reference gemm * Fix build * One more reference fix * Add more debug info * Disable some tests * Enable tests * Add fp4 dimensions * Update reference kernels * Temp edits * Remove leftovers * Fix conflicts * Clean up * More clean up * Revert "More clean up" This reverts commit `d8d35a0846`. * Add layouts to tests --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-06 09:24:00 -05:00
Andriy Roshchenko	79beaacdd1	Restrict MX GEMM instantiation to GFX950 arch (#2157)	2025-05-05 08:18:22 -07:00
Andriy Roshchenko	79b0bfeb41	MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152 ) * Add gemm_mx_fp8_bf8 example with row-major B * Add more overloads of MX MFMA instructions * Add MK_KN (RRR) tests * Add KM_NK (CCR) tests * Add more problem sizes to Large tests * Add test_gemm_mx to the list of regression tests	2025-05-01 11:55:48 -06:00
Bartłomiej Kocot	23de234dbe	Add grouped conv fwd 16x16 mfma instruction instances (#2140 ) * Add grouped conv fwd 16x16 mfma instruction instances * fix * remove oddc * fix * fix	2025-04-30 09:49:37 +02:00
Bartłomiej Kocot	4094ad158a	Integrate universal gemm with conv bwd data and add SplitK (#1315 ) * Integrate universal gemm with conv bwd data * Fix multi d kernel * Add splitK support * instances refactor * instances refactor * refactor * fixeS * fixes * 16x16 instnaces * Fixes * Fix * Fix * Fix * Fix * Fix * Fixes * fix * fix	2025-04-28 23:54:49 +02:00
Anton Gorenko	edd92fc546	DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096 ) * Prepare files for DeviceGemm_Wmma_CShuffleV3 * Implement main part of CShuffleV3 with block pipeline v3 for WMMA * Remove unused functions and template params for A/B descriptors * Support both gfx11 and gfx12 * Enable SplitK for gfx12 and disable for gfx11 * Added RowColRow layout for DeviceGemmV2 fp16 * Added more instances for Row, Col, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout * Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout * Fix formatting * Add documentation Based on `e5ad48a784` * Enable gemm_universal profiling for gfx11/12 * Add WMMA intrinsics for F8/BF8 * Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances * Add BF16 instances and tests * Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8 --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com>	2025-04-28 10:14:21 +05:00
lalala-sh	39ba03f25d	Moe gemm activation (#2026 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * impl int64 but result not correct * int64 index ok now * input output all ok * fix uint32 * revert v1 test * use uint32 * mork to support 13w tokens * moe sorting fix moebuf * fix merge * update moe api fix aiter build * fix buid * fuse silu * silu ok * acale ok * add silu * change code * gemm2 ok * gufusion compatible ok, fix warnings * gu fusion for m32 m64 ok * support bf16 cshuffle * i4 gemm2 ok * i4 gemm2 ok and i4 gemm1 build * 16x16 run ok * change flops; change cshuffle dtype * fuse gelu silu act in moe gemm1 * fp8 with act ready * int4 act ready * remove useless changes * remove useless code change * fix clang format * add the arch limit of int4 moe gemm * fuse moe activation * fix fp8 16x16 * fix no quant case * fix bugs * fix fp8 gufusion bug * remove useless comments * refine activation code & complete moe example * fix int8 bugs * merge tkw1 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: root <root@hjbog-srdc-51.amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-04-23 10:35:34 +08:00
lalala-sh	bcf5bb41be	enable do top k weights in moe stage1 gemm (#2094) * add switch for mul topk weights * fix bf16/f16 bugs * complete	2025-04-18 10:45:49 +08:00
Andriy Roshchenko	213b203a3c	MX GEMM - Parameterized Test Template (#2088) * Tests for MX FP8 GEMM * Improve documentation	2025-04-16 19:56:00 -06:00
aledudek	7c32652e03	Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 (#2069 ) * Part1 * Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 * Add missing coma * Add missing cpp instance files * Fix 3d layout * Add missing closing bracket * Add missing comp x2 and part2 instances * Fix typo in instance name * fix * Fix --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-04-16 11:00:55 +02:00
Illia Silin	3e6d21adeb	enable gfx115x support (#2065)	2025-04-09 10:06:42 -07:00
Khushbu Agarwal	263ff689e0	New instances for gemm_multiply_multiply_weightpreshuffle operator (#2061 ) * Add new instances for weight_preshuffle for f8->bf16 * Add new instances for weight_preshuffle for f8->f16 * clang formatted --------- Co-authored-by: Khushbu Agarwal <khuagar@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-04-08 15:14:53 -07:00
Bartłomiej Kocot	2ccf914888	Add support for GKCYX grouped conv weight (#2023) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix	2025-04-02 23:59:49 +02:00
Bartłomiej Kocot	8c0ab61ece	Grouped conv backward data GKCYX support (#2029) * Grouped conv backward data GKCYX support * profiler * Converter * split instances	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	6355ee7ca5	Improve compilation time for grouped conv fwd (#2039) * Improve compilation time for grouped conv fwd * Fix	2025-04-01 07:11:42 -07:00
Muhammed Emin Ozturk	dd4c12b155	f8/bf16 GEMM Stream-K (#1879)	2025-03-31 20:30:17 -06:00
jefyang1	16b15e336a	Fix gemm universal and grouped_conv_fwd test failures on gfx950 (#2031)	2025-03-31 09:20:52 -07:00
Bartłomiej Kocot	54c81a1fcf	Add support for GKCYX grouped conv fwd (#2015) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes	2025-03-26 21:13:38 +01:00
Bartłomiej Kocot	5b0873c31a	Fix split N for large images in groupd conv fwd (#2004) * Fix split N for large images in groupd conv fwd * Fix comments	2025-03-22 23:19:49 +01:00
Bartłomiej Kocot	fdaff5603e	Add grouped conv bwd wei merged grouped instance for larger filter (#1984) * Add grouped conv bwd wei merged grouped instance for larger filter * Update readme	2025-03-18 16:16:24 +01:00
Bartłomiej Kocot	c2e4898b4b	Grouped conv bwd data NGCHW (#1967) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example	2025-03-17 13:32:00 +01:00
Haocong WANG	cbd74c2d12	[Block Scale GEMM] Optimized block scale gemm (#1950 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-11 10:11:21 -07:00
Thomas Ning	c954bd0cfa	Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955 ) * tempsave, not selected * finish the feature and merge with develop --------- Co-authored-by: aska-0096 <haocwang@amd.com>	2025-03-07 13:44:06 -08:00
Illia Silin	0e8e711ec8	add missing headers (#1959)	2025-03-07 11:11:30 -08:00
Illia Silin	9b51c08bf7	remove support for gfx940 and gfx941 targets (#1944) * remove support for gfx940 and gfx941 targets * update changelog	2025-03-05 11:07:33 -08:00
feli	3786e16375	ck moe gemm implement (#1936 ) * port all moe changes from ck_moe_gemm branch * refine codes in the pr * fix tail odd * fix clang format * fix clang format2 * make hot loop scheduler compatible with 16x16 and 32x32 * clang format * fix per token quant * rename moe example * clang format --------- Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-05 15:56:55 +08:00
jefyang1	c95bda93ba	Remove CK_USE_AMD_MFMA_GFX950 (#1935 ) * Add runtime check in example_gemm_xdl_streamk for gfx950 * Add runtime check in grouped conv fwd examples for gfx950 * Disable CK_USE_AMD_MFMA_GFX950 * Add new instances for gfx950 * Fix test_gemm_universal on gfx950	2025-03-04 10:32:25 -08:00
asleepzzz	ef16010273	Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)" (#1933) This reverts commit `020148d0f7`.	2025-03-03 07:17:39 -08:00


456 Commits