composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-18 09:38:17 +00:00

Author	SHA1	Message	Date
Aviral Goel	54ded8c52f	Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301 ) * - elevate important build messages to log level STATUS - comment out the rest (temporarily) * - marked all low importance build messages as log_level=DEBUG [ROCm/composable_kernel commit: `aed0f5880c`]	2025-06-10 10:46:47 -07:00
Max Podkorytov	2b4f8a85b2	Convert CK (GeMM MulMul Weight Preshuffle) instances to use 16x16 xdl tile (#2229 ) * compile profiler only for gemm-mulmul-weight-preshuffle * m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block * process all p1 instances * process all p2 instances * process all p3 instances * convert p4 instance * modify compute p1 instances * modify compute p2 instances * relax p4 instance c block transfer cluster len * fix c block transfer cluster lengths comment * add mfma (without 16x16) instances to the profiler * roll back profiling cmakelists change * clang-format * re-add (now unused) 32x32 xdl-tile instances * clang-format * add more instances * fit c block transfer lengths into block * copy and write over the instance definitions from bf16 to fp16 * add instances to profiler * unify instance tuple alias [ROCm/composable_kernel commit: `e6b5e31c20`]	2025-06-10 09:37:14 -07:00
Eisuke Kawashima	808cc61307	chore: unset executable permission (#2303 ) Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com> [ROCm/composable_kernel commit: `4e586ca958`]	2025-06-10 09:13:59 -07:00
Illia Silin	c1fb1b74a2	fix headers (#2321 ) [ROCm/composable_kernel commit: `1ac5eeaea9`]	2025-06-10 07:26:32 -07:00
Bartłomiej Kocot	dbec4063f1	Grouped conv bwd wei explicit GEMM for odd C/K (#2306 ) [ROCm/composable_kernel commit: `7a83f1d510`]	2025-06-10 11:17:12 +02:00
Bartłomiej Kocot	d9dd0aa254	Grouped Convolution Backward Weight Explicit GEMM (#2282 ) * Grouped conv bwd weight explicit gemm * 3d * cmake fixes * fix test * fix [ROCm/composable_kernel commit: `050cad09b5`]	2025-06-06 10:30:08 +02:00
Andriy Roshchenko	ab0540c5db	Optimized GEMMs for MX FP4/8 (#2294 ) Adds V3 GEMM pipeline for MX FP4 and MX FP8 Adds V3 GEMM pipeline for MX FP4 with preshuffling Adds MXFP4 GEMM tests (#2275) Adds MXFP4 GEMM examples Adds MXFP4 GEMMs to ckProfiler Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> [ROCm/composable_kernel commit: `00247e3c29`]	2025-06-05 13:54:15 -06:00
Anton Gorenko	780cb29a42	WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 ) * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip * Added support for F8xF16xF16 to gemm_wmma_universal * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Ammending changes for adding support for padding instances for f16xf16xf16 * Fixes for padding instances for f16xf16xf16 * Added padding instances for bf16xbf16, f8xf8 * Added packed instances for bf16xi4xbf16 * Added padding instances for f8xf16xf16 * Added padding instances for f16xf8xf16, f16xi4xf16 * Fixed typos for bf16xbf16xbf16 padding instances * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. 
* Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. 
* Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Fix int8 DTYPES check for gemm_bilinear --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com> [ROCm/composable_kernel commit: `52b4860a30`]	2025-06-04 12:22:33 +06:00
Bartłomiej Kocot	dbdf79d541	Add Clamp/Relu bf16/fp16 cast fixes (#2279 ) * Add Clamp/Relu bf16/fp16 fixes * fix [ROCm/composable_kernel commit: `6e5acee0f9`]	2025-06-03 18:31:46 +02:00
Illia Silin	d5d10f8e88	Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274 ) * upgrade to rocm6.4.1 and use gfx1x-generic targets * add rocm version parsing * fix the gfx10-3-generic syntax in cmake [ROCm/composable_kernel commit: `b76fdbe47f`]	2025-06-03 07:17:35 -07:00
Bartłomiej Kocot	f1ec9e0be5	Change relu to clamp for grouped conv fwd instances (#2249 ) [ROCm/composable_kernel commit: `e7906dd644`]	2025-05-29 00:51:25 +02:00
Adam Dickin	fe30e881d6	Changes to allow MIOpen to build CK as part of its build. (#2247 ) * tweaks to the miopen specific build. add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has. * move the tidy if statment --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `6df1c56ad6`]	2025-05-28 13:51:15 -07:00
BrianHarrisonAMD	67c4eb2e99	Add option to disable offload compress for CK builds (#2250 ) * Add option to disable offload compress for CK builds * Remove gemm exe offload compress flag conditional [ROCm/composable_kernel commit: `e91be7d96a`]	2025-05-28 13:47:56 -07:00
Bartłomiej Kocot	d2fe046545	Revert "Remove not needed bwd wei merged groups instances (#2218 )" (#2235 ) This reverts commit 14d9d42a6ec0e07553b3bdadc0d785a9ab6c2375. [ROCm/composable_kernel commit: `b1ed92b131`]	2025-05-26 23:26:04 +02:00
Bartłomiej Kocot	525b200a33	Remove not needed bwd wei merged groups instances (#2218 ) * Grouped conv bwd wei add two stage instances for larger filter and Merge Groups * Fix * fix * Revert "Restore oddc instances (#2201)" This reverts commit 1590272e3f15dd147b9ff60422ad83b6cec6b2ac. * fix --------- Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> [ROCm/composable_kernel commit: `4583aeffad`]	2025-05-26 22:46:18 +02:00
Illia Silin	d0602443ec	disable building device_mha_operations by default (#2225 ) [ROCm/composable_kernel commit: `bc2551ac3b`]	2025-05-22 14:03:04 -07:00
Adam Dickin	0e7c84be7d	Add MIOPEN_REQ_LIBS_ONLY option for cmake to build only the libs MIOpen requires (#2224 ) * cut out anything we dont need for MIOpen to test * refactor exclusion code to be more streamlined. [ROCm/composable_kernel commit: `417a6b65b6`]	2025-05-22 11:14:33 -07:00
Bartłomiej Kocot	c30af6a16a	Grouped conv bwd wei add for larger filter and Merge Groupes optimization (#2197 ) * Grouped conv bwd wei add two stage instances for larger filter and Merge Groups * Fix * fix * Restore removed instances --------- Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> [ROCm/composable_kernel commit: `ebc5a6ef87`]	2025-05-21 22:47:34 +02:00
Thomas Ning	f969f4d798	Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212 ) * Add small instance, add the bug fix, & improve the example CMake * clang format [ROCm/composable_kernel commit: `1386924749`]	2025-05-20 15:05:08 -07:00
Andriy Roshchenko	9128d5e5cc	MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199 ) * Unify test interface for different layouts. * WIP: Introducing FP4/FP6/FP8 abstractions * WIP: Introducing packed storage abstraction * WIP: Introducing packed storage abstraction * WIP: Improved support for FP6 data type * Refactor packed storage for f6_t * WIP: FP6 MFMA test * Test if we correctly represent all FP6/FP4 numbers * Additional output for failed FP4 test. * More failing conversion tests * Even more failing conversion tests * Working FP6 MFMA tests * Expand MX MFMA testing to BF8/6 * Update and verify MX MFMA test for packed types * Fix fp4 and fp6 conversions on host * Working MX MFMA tests for FP8/6/4 * Cleanup * Add missing type * Cleanup * Final cleanup * Restrict FP6/4 values output to CK_LOGGING=1 * Use CHAR_BIT instead of number 8 * Fix typo * Remove FP6 and FP4 from the list of native types --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com> [ROCm/composable_kernel commit: `57e0f5df29`]	2025-05-19 16:52:51 -05:00
Bartłomiej Kocot	531652604d	Restore oddc instances (#2201 ) [ROCm/composable_kernel commit: `6342f6b5e8`]	2025-05-16 18:42:02 -07:00
Illia Silin	949ba112e6	Build and store CK library deb package for all targets daily. (#2196 ) * generate and store library package for all targets * use ninja to build packages for all targets * make sure to use ftime-trace when using ninja * make sure build trace only runs on gfx9 * archive lib package and stash only library package [ROCm/composable_kernel commit: `40668c9a99`]	2025-05-16 07:40:53 -07:00
Bartłomiej Kocot	f566cea050	Extend 64x64 with 4 waves instances for grouped conv bwd wei (#2187 ) * Extend 64x64 with 4 waves instnaces for grouped conv bwd wei * Fix * fix * fix [ROCm/composable_kernel commit: `7c0e29cc0f`]	2025-05-15 16:21:34 +02:00
Thomas Ning	ab918009ed	Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166 ) * make the work compiled * Solved the example code, but still have the profiler error * Finished the feature * Clang format and update the CHANGELOG * solve the preshuffle v1 & v2 problem * Comment Addressed * Comment Addressed [ROCm/composable_kernel commit: `b49f7de81f`]	2025-05-12 09:52:58 -07:00
Bartłomiej Kocot	fa024cca43	Add grouped conv fwd bias relu instances (#2179 ) * Add grouped conv fwd bias relu instances * fixes * fix [ROCm/composable_kernel commit: `6fddb5708c`]	2025-05-09 22:52:34 +02:00
Andriy Roshchenko	abd554cfe6	Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171 ) * Re-enable MX GEMM instances * Fix compilation error when building MX GEMM for multiple architectures [ROCm/composable_kernel commit: `cb27e7c77f`]	2025-05-08 13:26:03 -06:00
Bartłomiej Kocot	311de597ac	Move 16x16 grouped conv fwd instances from comp header (#2165 ) * Move 16x16 grouped conv fwd instances from comp header * Improvements [ROCm/composable_kernel commit: `397b9080a2`]	2025-05-07 08:04:31 -07:00
Rostyslav Geyyer	789ed57662	Add FP4 MX MFMA tests (#2151 ) * Add conversion tests * Fix ctor * Fix nan logic * Fix conversion logic * Permute packed f4_t values * Fix conversion to float, repack vector elements * Fix device tests * Permute elements in a vector * Add a repro test * Add a conversion for a repro test * Update test vectors * Update conversion * Fix the test * Update test vector generator * Fix vector sr conversion * Permute conversion args * Update conversion * Test * Fix packing * Simplify conversion function * Pack conversion in a loop * Pack conversion in a loop * Pack another conversion in a loop * Pack one more conversion in a loop * Pack the last conversion in a loop * Clean up * Add ops * Add tests * Add missing utils * Update reference mx gemm * Add f4x2 init mode * Update host tensor utils * Update chunk size for f4x2 * Add non scaled ops * Add a type utility * Update non scaled reference kernel * Add non scaled tests * Debug mfma arguments * Add more debug info * Update chunk size * Update data layout * Add more debugging * Fix B stride * Fix reference gemm * Fix build * One more reference fix * Add more debug info * Disable some tests * Enable tests * Add fp4 dimensions * Update reference kernels * Temp edits * Remove leftovers * Fix conflicts * Clean up * More clean up * Revert "More clean up" This reverts commit `d8d35a0846`. * Add layouts to tests --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> [ROCm/composable_kernel commit: `8a0d659f92`]	2025-05-06 09:24:00 -05:00
Andriy Roshchenko	bde3b6c5ad	Restrict MX GEMM instantiation to GFX950 arch (#2157 ) [ROCm/composable_kernel commit: `79beaacdd1`]	2025-05-05 08:18:22 -07:00
Andriy Roshchenko	e70035cf5a	MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152 ) * Add gemm_mx_fp8_bf8 example with row-major B * Add more overloads of MX MFMA instructions * Add MK_KN (RRR) tests * Add KM_NK (CCR) tests * Add more problem sizes to Large tests * Add test_gemm_mx to the list of regression tests [ROCm/composable_kernel commit: `79b0bfeb41`]	2025-05-01 11:55:48 -06:00
Bartłomiej Kocot	a78cbfd6fc	Add grouped conv fwd 16x16 mfma instruction instances (#2140 ) * Add grouped conv fwd 16x16 mfma instruction instances * fix * remove oddc * fix * fix [ROCm/composable_kernel commit: `23de234dbe`]	2025-04-30 09:49:37 +02:00
Bartłomiej Kocot	7942bb905b	Integrate universal gemm with conv bwd data and add SplitK (#1315 ) * Integrate universal gemm with conv bwd data * Fix multi d kernel * Add splitK support * instances refactor * instances refactor * refactor * fixeS * fixes * 16x16 instnaces * Fixes * Fix * Fix * Fix * Fix * Fix * Fixes * fix * fix [ROCm/composable_kernel commit: `4094ad158a`]	2025-04-28 23:54:49 +02:00
Anton Gorenko	420de0f22d	DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096 ) * Prepare files for DeviceGemm_Wmma_CShuffleV3 * Implement main part of CShuffleV3 with block pipeline v3 for WMMA * Remove unused functions and template params for A/B descriptors * Support both gfx11 and gfx12 * Enable SplitK for gfx12 and disable for gfx11 * Added RowColRow layout for DeviceGemmV2 fp16 * Added more instances for Row, Col, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout * Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout * Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout * Fix formatting * Add documentation Based on cc666c6a19dabc2cce8141e7ae23bd460ceef331 * Enable gemm_universal profiling for gfx11/12 * Add WMMA intrinsics for F8/BF8 * Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances * Add BF16 instances and tests * Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8 --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> [ROCm/composable_kernel commit: `edd92fc546`]	2025-04-28 10:14:21 +05:00
lalala-sh	8f426b1216	Moe gemm activation (#2026 ) * fix useless code and remove usless oob * clang format * fix coredump in e2e test * fix2 * fix clang format * fix output oob * impl int64 but result not correct * int64 index ok now * input output all ok * fix uint32 * revert v1 test * use uint32 * mork to support 13w tokens * moe sorting fix moebuf * fix merge * update moe api fix aiter build * fix buid * fuse silu * silu ok * acale ok * add silu * change code * gemm2 ok * gufusion compatible ok, fix warnings * gu fusion for m32 m64 ok * support bf16 cshuffle * i4 gemm2 ok * i4 gemm2 ok and i4 gemm1 build * 16x16 run ok * change flops; change cshuffle dtype * fuse gelu silu act in moe gemm1 * fp8 with act ready * int4 act ready * remove useless changes * remove useless code change * fix clang format * add the arch limit of int4 moe gemm * fuse moe activation * fix fp8 16x16 * fix no quant case * fix bugs * fix fp8 gufusion bug * remove useless comments * refine activation code & complete moe example * fix int8 bugs * merge tkw1 --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: feli <felix.li@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: root <root@hjbog-srdc-51.amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `39ba03f25d`]	2025-04-23 10:35:34 +08:00
lalala-sh	dc58110b06	enable do top k weights in moe stage1 gemm (#2094 ) * add switch for mul topk weights * fix bf16/f16 bugs * complete [ROCm/composable_kernel commit: `bcf5bb41be`]	2025-04-18 10:45:49 +08:00
Andriy Roshchenko	7972a39081	MX GEMM - Parameterized Test Template (#2088 ) * Tests for MX FP8 GEMM * Improve documentation [ROCm/composable_kernel commit: `213b203a3c`]	2025-04-16 19:56:00 -06:00
aledudek	d02e727b51	Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 (#2069 ) * Part1 * Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 * Add missing coma * Add missing cpp instance files * Fix 3d layout * Add missing closing bracket * Add missing comp x2 and part2 instances * Fix typo in instance name * fix * Fix --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `7c32652e03`]	2025-04-16 11:00:55 +02:00
Illia Silin	9cc561987f	enable gfx115x support (#2065 ) [ROCm/composable_kernel commit: `3e6d21adeb`]	2025-04-09 10:06:42 -07:00
Khushbu Agarwal	8a65c29ffb	New instances for gemm_multiply_multiply_weightpreshuffle operator (#2061 ) * Add new instances for weight_preshuffle for f8->bf16 * Add new instances for weight_preshuffle for f8->f16 * clang formatted --------- Co-authored-by: Khushbu Agarwal <khuagar@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `263ff689e0`]	2025-04-08 15:14:53 -07:00
Bartłomiej Kocot	49565538fe	Add support for GKCYX grouped conv weight (#2023 ) * Grouped conv bwd weight GKCYX support * fix and changelog * fix * fix * fixes * comments * fix [ROCm/composable_kernel commit: `2ccf914888`]	2025-04-02 23:59:49 +02:00
Bartłomiej Kocot	50fb390d6f	Grouped conv backward data GKCYX support (#2029 ) * Grouped conv backward data GKCYX support * profiler * Converter * split instances [ROCm/composable_kernel commit: `8c0ab61ece`]	2025-04-01 13:24:38 -07:00
Bartłomiej Kocot	a9ec282dba	Improve compilation time for grouped conv fwd (#2039 ) * Improve compilation time for grouped conv fwd * Fix [ROCm/composable_kernel commit: `6355ee7ca5`]	2025-04-01 07:11:42 -07:00
Muhammed Emin Ozturk	30e5c8cb49	f8/bf16 GEMM Stream-K (#1879 ) [ROCm/composable_kernel commit: `dd4c12b155`]	2025-03-31 20:30:17 -06:00
jefyang1	92f6e02b96	Fix gemm universal and grouped_conv_fwd test failures on gfx950 (#2031 ) [ROCm/composable_kernel commit: `16b15e336a`]	2025-03-31 09:20:52 -07:00
Bartłomiej Kocot	f967fd7296	Add support for GKCYX grouped conv fwd (#2015 ) * Add support for GKCYX grouped conv fwd * fixes * fix * changelog * Fixes [ROCm/composable_kernel commit: `54c81a1fcf`]	2025-03-26 21:13:38 +01:00
Bartłomiej Kocot	61b5e8318a	Fix split N for large images in groupd conv fwd (#2004 ) * Fix split N for large images in groupd conv fwd * Fix comments [ROCm/composable_kernel commit: `5b0873c31a`]	2025-03-22 23:19:49 +01:00
Bartłomiej Kocot	71c6106947	Add grouped conv bwd wei merged grouped instance for larger filter (#1984 ) * Add grouped conv bwd wei merged grouped instance for larger filter * Update readme [ROCm/composable_kernel commit: `fdaff5603e`]	2025-03-18 16:16:24 +01:00
Bartłomiej Kocot	36f9cc5fb0	Grouped conv bwd data NGCHW (#1967 ) * Grouped conv bwd data NGCHW * fixes * fix * Improvements * Fix * Fix * add client example [ROCm/composable_kernel commit: `c2e4898b4b`]	2025-03-17 13:32:00 +01:00
Haocong WANG	b31ac9d14d	[Block Scale GEMM] Optimized block scale gemm (#1950 ) * Added two kernel for M=32 problem * Comment the first one * Enable multiply_multiply for Scale_Block_M = 1 for deepseek * Modify the a_thread offset since the A data load is different from B. * edit fp8 ab scale for Scale_Block_M=1 * edit GemmSpec to MNKPadding * enable blockwise pipelie v1 and v2. v1 is work for small K. * add instance for gemm_ab_scale * fix cmakelist of ckProfiler * optimize blockscale gemm. todo: reduce vgpr usage * fix a correctness bug * sanity checked * revert ckprofiler cmake changes * clang format * revert unnecessary changes. * remove commented codes. * split weight preshuffle library targets * bring back enable-post-misched=0 * fix build issues for gemm_multiply_multiply_fp8 instances * fix clang format * add verbose build flag when building for all targets * reduce path names for new instances * fix paths in cmake * refactor gemm_multiply_multiply library target * fix a bug in example * fix example 65 cmake * reduce the number of threads when building libs for all targets to 50 * use ninja to build for all targets * reduce teh number of threads when building for all targets * reduce the number of threads to 32 when building libs for all targets to 50 --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `cbd74c2d12`]	2025-03-11 10:11:21 -07:00
Thomas Ning	66663bf2d9	Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955 ) * tempsave, not selected * finish the feature and merge with develop --------- Co-authored-by: aska-0096 <haocwang@amd.com> [ROCm/composable_kernel commit: `c954bd0cfa`]	2025-03-07 13:44:06 -08:00

461 Commits