composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 08:50:17 +00:00

Author	SHA1	Message	Date
Max Podkorytov	e6b5e31c20	Convert CK (GeMM MulMul Weight Preshuffle) instances to use 16x16 xdl tile (#2229 ) * compile profiler only for gemm-mulmul-weight-preshuffle * m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block * process all p1 instances * process all p2 instances * process all p3 instances * convert p4 instance * modify compute p1 instances * modify compute p2 instances * relax p4 instance c block transfer cluster len * fix c block transfer cluster lengths comment * add mfma (without 16x16) instances to the profiler * roll back profiling cmakelists change * clang-format * re-add (now unused) 32x32 xdl-tile instances * clang-format * add more instances * fit c block transfer lengths into block * copy and write over the instance definitions from bf16 to fp16 * add instances to profiler * unify instance tuple alias	2025-06-10 09:37:14 -07:00
Eisuke Kawashima	4e586ca958	chore: unset executable permission (#2303 ) Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>	2025-06-10 09:13:59 -07:00
John Afaganis	6635d1bb88	Remove usage of 'warpSize' variable as it has been deprecated (#2295 ) * SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround * SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo --------- Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>	2025-06-10 07:34:54 -07:00
dependabot[bot]	3d9f5eafaf	Bump rocm-docs-core[api_reference] from 1.20.0 to 1.20.1 in /docs/sphinx (#2317 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.20.0 to 1.20.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.20.1/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.20.0...v1.20.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.20.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-06-10 07:27:26 -07:00
Illia Silin	1ac5eeaea9	fix headers (#2321 )	2025-06-10 07:26:32 -07:00
carlushuang	2e0536269e	hot fix (#2315 )	2025-06-10 20:35:28 +08:00
Bartłomiej Kocot	7a83f1d510	Grouped conv bwd wei explicit GEMM for odd C/K (#2306 )	2025-06-10 11:17:12 +02:00
MHYangAMD	9fcf21a4ec	Fix fmha fwd precision issue on MI3XX series (#2285 ) * Fix fmha fwd precision issue on MI3XX series For fmha fwd fp16 cases, we found that using impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero. For examaple, fixing K,V to be all 1, and Q is random, which outputs are expected to be all 1. But we found that it would have some incorrect outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 = 0.0005 < 0.001) Thus, ck do not report this error. * Add option to switch rtn/rtz for fmha fwd	2025-06-10 15:03:23 +08:00
carlushuang	65835c0bbb	MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! (#2305 )	2025-06-10 10:40:54 +08:00
Aviral Goel	5a0bd157db	Code Refactor for check_err.hpp (#2284 ) * refactor & add documentation * removed return datatype from doxygen comments * Update include/ck_tile/host/check_err.hpp Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: John Afaganis <john.afaganis@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-06-08 13:41:27 -07:00
Max Podkorytov	aece3c6700	Add a python script for running ckProfiler and processing the results (#2288 ) * add profiler script * add comments * generalize and add some input validation * format * refactor * Rename run_ck_profiler.py to run_ck_profiler_gemm_with_csv_shapes.py rename script file	2025-06-08 12:41:57 -07:00
Sami Remes	1c6f83df6c	[CK_TILE] Tileloop persistent gemm - resubmit (#2299 ) * Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit `233e274077`. * Add missing header for kentry --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-06 14:18:49 -07:00
valarLip	8482977a37	extend buffer load to support load 32 bf16/fp16 at same time (#2291 )	2025-06-06 17:21:19 +08:00
Bartłomiej Kocot	050cad09b5	Grouped Convolution Backward Weight Explicit GEMM (#2282 ) * Grouped conv bwd weight explicit gemm * 3d * cmake fixes * fix test * fix	2025-06-06 10:30:08 +02:00
Andriy Roshchenko	00247e3c29	Optimized GEMMs for MX FP4/8 (#2294 ) Adds V3 GEMM pipeline for MX FP4 and MX FP8 Adds V3 GEMM pipeline for MX FP4 with preshuffling Adds MXFP4 GEMM tests (#2275) Adds MXFP4 GEMM examples Adds MXFP4 GEMMs to ckProfiler Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-06-05 13:54:15 -06:00
Illia Silin	233e274077	Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 ) This reverts commit `ffb52783d0`.	2025-06-05 09:24:00 -07:00
Sami Remes	7ea1508b59	[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222 ) * Add TailHandler for V3, V4 and Mem pipelines * Adapt examples and tests to use TailHandler * move tail-handling logic to pipeline in persistent grouped gemm * Fix Mem pipeline dispatching, add CompV4 dispatching * Use a macro for handling the many tails of Mem pipeline * Fix formatting again * Use const-ref RunFunction, remove unnecessary try_run	2025-06-04 11:50:21 +03:00
Sami Remes	ffb52783d0	[CK_TILE] Tile loop persistent gemm kernel (#2191 ) * Implement tile loop persistent gemm kernel * Enable timing * Add tests for persistent gemm * Fix formatting * Fix gemm_basic * Rename True/False to Persistent/NonPersistent * Use only one set of layouts for persistent tests * Fix gemm example persistent template parameter * Fix formatting	2025-06-04 11:46:28 +03:00
Anton Gorenko	52b4860a30	WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230 ) * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip * Added support for F8xF16xF16 to gemm_wmma_universal * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Ammending changes for adding support for padding instances for f16xf16xf16 * Fixes for padding instances for f16xf16xf16 * Added padding instances for bf16xbf16, f8xf8 * Added packed instances for bf16xi4xbf16 * Added padding instances for f8xf16xf16 * Added padding instances for f16xf8xf16, f16xi4xf16 * Fixed typos for bf16xbf16xbf16 padding instances * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build erros * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler don't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Fix int8 DTYPES check for gemm_bilinear --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>	2025-06-04 12:22:33 +06:00
Khushbu Agarwal	59a85cb4bc	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262 ) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted	2025-06-03 20:16:10 -07:00
Khushbu Agarwal	1037b21cfe	Update changelog for Rotating buffer (#2283 ) * Updating changelog for Rotating buffer * address review comment.	2025-06-03 20:14:22 -07:00
Bartłomiej Kocot	6e5acee0f9	Add Clamp/Relu bf16/fp16 cast fixes (#2279 ) * Add Clamp/Relu bf16/fp16 fixes * fix	2025-06-03 18:31:46 +02:00
Xiaodong Wang	7f9eef40b0	Move pragma ahead (#2231 )	2025-06-03 07:27:51 -07:00
Aviral Goel	11f6c14e03	Add 0 as an acceptable arguement for strides in CK GEMM example (Issue 2037) (#2268 ) * add 0 as valid default arguement for strides * add 0 as valid default arguement for strides # Conflicts: # example/01_gemm/common.hpp	2025-06-03 07:26:58 -07:00
Illia Silin	b76fdbe47f	Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274 ) * upgrade to rocm6.4.1 and use gfx1x-generic targets * add rocm version parsing * fix the gfx10-3-generic syntax in cmake	2025-06-03 07:17:35 -07:00
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257 ) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
dependabot[bot]	cffe8fa2a4	Bump rocm-docs-core[api_reference] from 1.19.1 to 1.20.0 in /docs/sphinx (#2272 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.19.1 to 1.20.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.19.1...v1.20.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.20.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-06-02 06:44:10 -07:00
valarLip	0fdbf6bcd1	extend buffer load for fp16/bf16x16 (#2270 ) * extend buffer load for fp16/bf16x16 * format	2025-06-02 10:29:54 +08:00
Kiefer van Teutem	2215a9edf0	Explicitly set the LINKER_LANGUAGE for the gemm_template_instances target to avoid Ninja build config failure. (#2265 ) Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2025-05-30 13:32:28 -07:00
Illia Silin	654956bb02	Add a daily CI build on GFX950. (#2261 ) * add CI build for gfx950 * make sure gfx950 CI always uses special docker and compiler * enable codegen tests by default	2025-05-30 12:50:08 -07:00
Mirza Halilčević	fbce6c7bb6	Define CHAR_BIT during hipRTC (#2264 ) * Fix failing codegen tests. * fix clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2025-05-30 08:23:44 -07:00
dependabot[bot]	61e6c382c6	Bump rocm-docs-core[api_reference] from 1.19.0 to 1.19.1 in /docs/sphinx (#2263 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.19.0 to 1.19.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.19.0...v1.19.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.19.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-05-30 05:56:59 -07:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252 ) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Illia Silin	4e561af18c	Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185 )" (#2260 ) This reverts commit `fd6a859b44`.	2025-05-29 16:22:16 -07:00
Paul Fultz II	306f4c537e	Export codegen targets (#2259 )	2025-05-29 11:03:51 -07:00
joyeamd	fd6a859b44	add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues	2025-05-29 14:31:14 +02:00
Po Yen Chen	28cd0dffc9	[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251 ) * Add constraint on traits/tile/pipeline * Use kM0=128 if max_seqlen_q == 8192 * Re-format codegen script * Remove redundant attr name postix * Fix import error: default field in dataclass * Use kK0=64 & kK1=64 to hide latency * Use CU utilization to decide tile size	2025-05-29 18:36:33 +09:00
Bartłomiej Kocot	e7906dd644	Change relu to clamp for grouped conv fwd instances (#2249 )	2025-05-29 00:51:25 +02:00
Adam Dickin	6df1c56ad6	Changes to allow MIOpen to build CK as part of its build. (#2247 ) * tweaks to the miopen specific build. add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has. * move the tidy if statment --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-05-28 13:51:15 -07:00
BrianHarrisonAMD	e91be7d96a	Add option to disable offload compress for CK builds (#2250 ) * Add option to disable offload compress for CK builds * Remove gemm exe offload compress flag conditional	2025-05-28 13:47:56 -07:00
Casey-Shi	29574f05f7	change from ninja to make (#2253 )	2025-05-28 09:25:05 -07:00
Illia Silin	bbdaf79a52	Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200 )" (#2256 ) This reverts commit `99857e10e6`.	2025-05-28 09:46:52 -06:00
Casey-Shi	4286eae09a	fix type hint (#2254 )	2025-05-28 08:43:58 -07:00
Sami Remes	9bd01b624e	Remove extra if from CMakeLists.txt of gemm tests (#2213 )	2025-05-28 15:25:09 +02:00
Khushbu Agarwal	99857e10e6	[CK_tile] Add rotating buffer feature for universal gemm (#2200 ) * Add rotating buffer feature for universal gemm * adding changes in tile_engine * Updated code to merge kernel_launch * removing comments * Enable rotating buffer changes to flatmm * Created diff launch_kernel function for rotating buffer * Simplfied calculation using macros * merge code with new changes in tile_engine * clang formatted * Redefine macros	2025-05-27 23:00:58 -07:00
Aviral Goel	c52649ad57	Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234 ) * added catch statements to examples * clang format	2025-05-27 22:32:42 -07:00
dependabot[bot]	132bd5b874	Bump rocm-docs-core[api_reference] from 1.18.4 to 1.19.0 in /docs/sphinx (#2237 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.18.4 to 1.19.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.4...v1.19.0) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.19.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-05-27 06:53:12 -07:00
Casey-Shi	128f5a1eab	[Tile Engine] Add benchmark for tile engine gemm. (#2193 ) * initial commit -m benchmark * only support profile * fix * fix doc * add default config * add ci * fix cmake * tmp save for gen blobs * fix bug * merge * range config * test success * fix * fix * move struct * remove config property * fix config * remove comment * add cmake option & modify * add changelog * fix * format * add pydantic module to the docker image * fix * add benchmark for cold and warmp up * python format * add asm cache control * fix README * remove pydantic module * modify changelog * fix config * recover benchmark_gemm and fix * format python * refactor profiler * fix csv bug * fix codegen bug * add kernel instance object * add benchmark gemm executable * fix jenkins & delete extra header * disable warning output & enable default config * Disable sparsity for invalid warp tile combinations * fix gemm host template func * refactor gemm profiler * filter out some inmstances * default config test & fix codegen bug * add sparse flag to gen more instances --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: khuagarw <khuagarw@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-05-26 22:32:36 -07:00
Po Yen Chen	c42b957d65	[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask (#2209 ) * Assign block indices reversely if kHasMask=true * Assign block indices reversely for splitkv kernel	2025-05-27 10:58:58 +08:00
Yi DING	5727af98d1	Add operator/instance filters to ckProfiler (#2233 )	2025-05-27 09:51:20 +08:00

1 2 3 4 5 ...

1973 Commits