composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-16 16:51:26 +00:00

Author	SHA1	Message	Date
Muhammed Emin Ozturk	70d6dcb230	Stream-K CkProfiler Update ( Replace CPU Validation with GPU Validation and Add Dynamic Grid Size Calculation for Stream-K GEMM Profiler) (#2333 ) * Stream-K Ckprofiler Update * new grid list based on sm number * clang * update for review * Update profile_gemm_universal_streamk.cpp --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> [ROCm/composable_kernel commit: `bfb33bc1e9`]	2025-06-18 07:49:22 -07:00
joyeamd	fdfcee3b98	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * 
update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `a2f01141aa`]	2025-06-18 01:28:34 -07:00
Thomas Ning	ffafdec4d8	Revert "Fix default epilogue (#2358 )" (#2364 ) This reverts commit f1e842455a66c955a3df38a9fc383141c4bcc62f. [ROCm/composable_kernel commit: `64a2fda713`]	2025-06-17 22:43:05 -07:00
linqunAMD	0f3ee13842	[CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd (#2243 ) * [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd * fix error with the latest develop code. [ROCm/composable_kernel commit: `7aeec9a901`]	2025-06-17 21:37:59 -07:00
carlushuang	8660f6ef22	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example [ROCm/composable_kernel commit: `a4e1248dba`]	2025-06-18 10:49:43 +08:00
Kiefer van Teutem	5ff790eb05	Fix argument order for calls to profile_batched_gemm_impl() (#2277 ) * Fix argument order for calls to profile_batched_gemm_impl() * Revert previous and swap the order of the profile_batched_gemm_impl() function arguments instead. * Revert copyright years for unchanged files. * Remove test_batched_gemm from REGRESSION_TESTS since it no longer takes more than 30 seconds to run. --------- Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `c7c6a0ccb3`]	2025-06-17 19:29:09 -07:00
Max Podkorytov	9a342df600	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cd606f72c1`]	2025-06-17 17:30:21 -07:00
linqunAMD	ca0c6fb660	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile config. - It is useful when use you need compare gemm perf with different tile/pipeline config - we also can use it simplify the code for wmma support in the furture. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error [ROCm/composable_kernel commit: `0eb8974502`]	2025-06-17 17:27:46 -07:00
John Afaganis	b127e8aef5	Add missing copyright headers (#2359 ) * Add missing copyright headers * empty commit [ROCm/composable_kernel commit: `df54667102`]	2025-06-17 14:29:45 -07:00
Illia Silin	fcb07b7311	Revert "Shard several of the most costly targets. (#2266 )" (#2361 ) This reverts commit 33c4b3be9d77ee5932c88a27d4364c4aab774de0. [ROCm/composable_kernel commit: `cdfd7722bf`]	2025-06-17 13:56:30 -07:00
Bartłomiej Kocot	f0d44c77d7	Fix Add in dynamic buffer for fp32/i8 (#2351 ) * Fix Add in dynamic buffer for fp32/i8 * fixes * Fix [ROCm/composable_kernel commit: `cc98a41f46`]	2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi	a4517b0a9d	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
Aviral Goel	af88547b60	add script to pre commit hooks for checking file permissions (#2322 ) [ROCm/composable_kernel commit: `3af66e99ab`]	2025-06-17 07:07:08 -07:00
Thomas Ning	cb631cd5b1	Fix the CK Tile related operators (#2356 ) * fix the flatmm * Fix the pipeline * address the comment [ROCm/composable_kernel commit: `3c4cdfac4f`]	2025-06-16 17:38:52 -07:00
rahjain-amd	00dd0e56cc	Add cmake flag to enable Assembly dump (#2347 ) This flag makes it easy to dump assembly for the example kernels. [ROCm/composable_kernel commit: `6589f50bc9`]	2025-06-16 09:29:35 -07:00
Illia Silin	fba2ac98ae	Revert "fix the flatmm (#2349 )" (#2352 ) This reverts commit 37f25427f2ec6c907bf12030338f69945691f54c. [ROCm/composable_kernel commit: `5523df4b2d`]	2025-06-16 07:54:55 -07:00
Bartłomiej Kocot	b5b0797513	Grouped convolution forward with clamp (#2334 ) * Grouped convolution forward with clamp * Optimize clamp * unary fixes * test gk bias * Revert "test gk bias" This reverts commit `8e42e29d7b`. * Revert "Revert "test gk bias"" This reverts commit `e73c0550ce`. * workaround comment [ROCm/composable_kernel commit: `f6c2ff9dce`]	2025-06-16 15:36:53 +02:00
Thomas Ning	6f158242b6	fix the flatmm (#2349 ) [ROCm/composable_kernel commit: `d996bc78be`]	2025-06-16 02:17:53 -07:00
ruanjm	1f77d58ae9	Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332 ) * Add support for specifying valid flag when fetching elements for tile_scatter_gather Add constexpr for operator[] of TrueGenerator * Use different path when valid is enabled [ROCm/composable_kernel commit: `b34c234f51`]	2025-06-16 17:17:03 +08:00
carlushuang	d68fdea428	hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348 ) [ROCm/composable_kernel commit: `fb97f75099`]	2025-06-15 22:49:04 -07:00
Illia Silin	d335828f23	Fix direct lds load for gfx950 and clang20 (#2346 ) * fix direct lds load for gfx950 and clang20 * Update include/ck/utility/amd_buffer_addressing_builtins.hpp * Fix format --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> [ROCm/composable_kernel commit: `2d8a804152`]	2025-06-15 15:22:34 -07:00
Illia Silin	23d7007455	Limit the threads to builf ck_tile engine, use ninja. (#2342 ) * limit the threads to builf ck_tile engine, use ninja * disable ck_tile engine until it can be built safely [ROCm/composable_kernel commit: `56f654a826`]	2025-06-13 14:13:07 -07:00
Illia Silin	201ba3edb6	check for if misched-bottomup flag is valid (#2341 ) [ROCm/composable_kernel commit: `a0f4db8d9c`]	2025-06-13 13:34:22 -07:00
Mateusz Ozga	044a8560f7	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a refrence function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails [ROCm/composable_kernel commit: `bd96ac9742`]	2025-06-13 19:39:11 +02:00
John Shumway	ea36ae016e	Shard several of the most costly targets. (#2266 ) * Shard several of the most costly targets. Introduces a filter_tuple_by_modulo to break up tuples. Drops build time of target from 21 minutes to under 14 minutes with 64 build processes, or 11 minutes with 128 build processes. time ninja -j 64 device_grouped_conv3d_fwd_instance * fix clang format * Fix build errors in instantiation code. I wasn't sure how to test the header-only instantiation code on my initial commit. From Jenkins CI test results, I see that there is a test target that depends on these headers: ninja -j 128 test_grouped_convnd_fwd This allowed me to test the build locally. I found three mistakes I made, mostly related to early experiments on I tried on the code. This was hard to find earlier because this PR is really too large. I also discovered that there are five 2D convolution targets that now dominate the compilation time. I will likely address those in a later PR, rather than adding even more changes to this PR. * Fix link errors from mismatched declarations. Our pattern for instantiating MIOpen templates uses duplicate declarations (instead of headers). This is fragile, and I didn't notice that my last commit had a bunch of link errors. I fixed these mistakes, and the bin/test_grouped_conv_fwd test target binary now links correctly. * Migrate the design to a code-generation approach. Use a CMake function with template files to generate the source files for the intantiating the kerenels and to generate the calling function. * Shard the longest 2D convolution builds Now that we have automated the shard instantiation, we can shard the 2D convolution targets that take the longest to build. The target test_grouped_conv2d_fwd now compiles in 15 minutes. * Use PROJECT_SOURCE_DIR for submodule compatibility I used CMAKE_SOURCE_DIR to refer to the top-level source directory in the ShardInstantiation.cmake file, but this can cause issues with git submodules. 
Instead, we should use PROJECT_SOURCE_DIR to ensure compatibility when this project is used as a submodule in another project. --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `3a0cb27966`]	2025-06-13 03:58:50 -07:00
kylasa	10498656ef	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build. [ROCm/composable_kernel commit: `5f1ad09b61`]	2025-06-12 18:24:02 -07:00
Aviral Goel	2475910145	fix(gemm_universal): Update gemm_utils.hpp so it builds successfully for memory pipeline (#2336 ) [ROCm/composable_kernel commit: `e5ece14467`]	2025-06-12 15:27:14 -07:00
Thomas Ning	592ef9daaf	OCP FP8 Macro restructure (#2331 ) * solved the problem [ROCm/composable_kernel commit: `f59b8c7d3d`]	2025-06-12 09:46:33 -07:00
Bartłomiej Kocot	b7fc080539	Grouped conv bwd weight with grouped gemm (#2304 ) * Grouped conv bwd weight with grouped gemm * fixes * fix * Fixes * test comments * restore atol * fix [ROCm/composable_kernel commit: `bb4f471b09`]	2025-06-12 10:15:07 +02:00
carlushuang	5d7302c240	[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327 ) * refactor subtoken logic to let more kernel pickup mp kernel * typo [ROCm/composable_kernel commit: `8aff45a8af`]	2025-06-12 11:44:22 +08:00
Yi DING	66c79a4045	Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 ) * [fix] align v3 gufusion pipeline * fix device kernel selection. * Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE * experimental optimization for scale load in blkscale gemm * Add asm for no-loop v3_128x128x128 * fix bugs * tune fp8 example * Update v1_128x128x128 to 2x2 instead of 4x1 * wip * add warmup to asm launch * wip2 * 16x16 function merged to moe * temp save, a performant version. * wip3 * Update .co binary to 16x16 * 16x16x128 correct; 64x64x128 failed * update * use mem_op::set when topk=1 * add mx fp8 b_preshuffle support, function not yet tested. * Spilt the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints * some fixes * fix update * remove some unnecessary hacky; enable 256x256x256 tilesize * update for function debug * Add pipeline v3. Have some runtime issue and register spill * Fix pipe v3 correctness issue * remove unnecessary hacky * clang format * fix a bug * fix the bug, functional test passed * tempsave; buggy at passed 4 e8m0 to scaled mfma * added fp4_bpreshuffle example, build failures * fixed some bugs * implement shuffled scale mxfp4gemm, blocker: opsel not effect * hotfix * fix bugs, build passed * (M, N, K)=(128, 128, 128) function failed. * temp save for gemm1. Function not ready * fix compile error. Gemm2 pass. Gemm1 WIP * fix bug for a lds read * update moe * Compile pass. Gemm1 function WIP * update moe * fix fp8; fix even/odd * tempsave * update moe * Revert "update" This reverts commit c7d79dcb672616d9bc0fd9958f714fc80e7c84fd. * Revert "use mem_op::set when topk=1" This reverts commit 8c7772860735001a51421e7b6d0a28f6676d6c40. * Add v3 128x128x128_4x4_16x16.co for gfx950 * temp cmake flag suppression for aiter test * add code for mxfp4 gemm, blockscale not supported yet * gemm1 up-only pass. 
GU WIP * function pass with inline asm hacky * revert unexpected file change * updated and build passed * update CE elementOP * added code for debug * Gemm1 GUFusion function pass. Perf WIP * Fix fp8/bf8; remove duplicated code * disable the scheduler in v3; bring it back when compiler feature ready. * update moe v1 pipeline * Add gemm1 v1 32x128x128 * remove schedule barrier * updated * Fix fp8/bf8 B-row * mfma using asm, device result correct, host result need to check * gemm1 v3 64x128x128 debug * fix cpu ref * a/b thread_desc stride fix * Use random scale for init1 * 16x16x128 input size blockscale function passed * fix blockscale gemm bug * tempsave. Almost all instances passed. * v1 fix for mi350. * temp save * debug save * update debug * fix the bug, 128x128x256 tile function passed * v3 * rename moe block selector and pipeline * Add gemm1 v1 * Add gemm1 v1 to selector * added mx moe block v3 support, function passed * compile error fix * Improve the pipeline * Pack e8m0 as int32_t * v1 compile pass. Function not ready * debug synchronize issue over different GPU/ROCm * minor fix * Add profiler filter * Add f4 ckProfiler * Fix example compile error * Add f4 profiler examples * tempsave * v1 function pass. * v3 function pass * align file and function name * mx_moe_fp4 ready for aiter with clang-format. * modify the way we represent fp4 * generalize the pipeline scheduling. * init moe mx f4 scale shuffle * Cmakelist diable compiler-bound flags * mx_fp4 default parameter change * Moe blockscale gemm1&gemm2 asm support for aiter. Suppression cmkae flag til new compler. * update code * tempsave; modify the way we represent fp4 * generalize the pipeline scheduling. * Add gemm1 gfx942 .co support * updated code, build passed. 
* Update gemm2 asm with latest compiler flag * Fix mx f4 ckProfiler * Fix blockwise gemm mx v1 * lds conflict free + buffer load lds * Add gemm2 v3 64x128x128 * fix a, b scale loading bugs, a, b scale loading now correctly * Add gemm2 v3 64x128x128 * commit with debug info * fix fp4 profiler * Add mx fp4 pileline v1 instances * Fix v2 topk_weight cal. Add silu asm. * v2 tok_weight WIP * init mx fp4 B no preshuffle version * tempsave. compile pass, function wrong * enable fp4 moe no weigth preshuffle, function pass * update the TFlops calculation in the example * Add gemm2 64x128x128 asm. Fix BF16 ref. * fix 2 typos in fp4_preshuffle * Better kernel selection in device classes * correct preShuffleBuffer we should used packed k to do shuffle. * lds conflict free + buffer load lds * optimize offset math in dma * Fix fp4 ckProfiler * Fix MX MFMA tests * fix f4 pipeline issues * gemm1 func pass * update mx moe gemm1_bns tile size to 64x128x256 * update mx moe gemm1 gemm2 TF and BW calculation * fix typo * temp save * Fix example_gemm_mx build * rename the block pipeline * correct a typo in tail * Add rotating to mx examples * fix the correctness issue * Fix v1; use M padding * Add NT flag to B/BScale buffer * Merge gemm_mx_common.hpp * temp save, 4.4~4.5 * Fix 'Merge gemm_mx_common.hpp' * refactor the pipeline * Pad the M for scale buffer unconditionaly * update MX moe GEMM1 hotloopscheduling * change the gemm1 tile from 64x128x128 to 128x64x128 * Unconditional Ascale padding * Pad shuffled a scale only * pad ascale * add vmcnt guard for async copy * Profiler add f4 wp * Merge preshuffle device * Add more fp4 wp instances * Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format * Clang-format after 2 merges * Remove rocm6.3 workaround flags and macro * Fix fp8 config * Fix bf8 config * flag and barrier fix for copmiler branch MainOpSelV3 * Add fp8 profiler instances * Remove debug infos; Enable flags for blockscale f8 * No asm ver. 
for merging moe blocksale fp8 into mainline * update the flag name for f8blockscale * recover example * fix performance bug of bpreshuffle f8 gemm * clang format, remove single rate mfma restriction for f8 * remove single rate mfma restriction for f8 blockscale gemm * Fix moe blockscale gemm1 barrier 0x800 for new compiler * add pipeline v1 for MOE Gemm2 * Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns * Fix OOB; add MB96 instances * remove unnecessary files * fix the cmake issue * Enable splitk for mxfp4; clang format; * Generate random tensor values with multiple threads * Use packed_size_v for A/BPackedSize * Fix warning * Fix target_compile_options for disabled target on gfx942 * fix moe pki4 on gfx950 * doc the kGroup definition * Fix ThreadwiseTensorSliceTransfer_v4::Run (Fuse scale) * Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example * Fix unknown compiler flag * fix two failed examples. * fix some failure tile size in gfx950 universal gemm. 
fix test_gemm_fp16 * workaround fix for test_gemm_f32; * We have very limited support for lds direct load if input matrix is not K major * fix test_gemm_splitk; * Fix compile for mx_mfma_op * add mfma selection logic for multipled_v3 * Clean up * Fix device gemm mx link error * improve the global atomic pattern * Revert unnecessary copyright updates * restore minimum_occupancy logic * Avoid data race in moe gemm2 ref * Build fp8 gemm_multiply_multiply and moe only on gfx94/95 * update the instance in device_mx_gemm * Resolve comments * Copyright 2025 * Remove unused code * fix library linking issue --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `37554c31e8`]	2025-06-12 09:25:59 +08:00
Bartłomiej Kocot	1223f507c0	Move SetZero functions inside the kernels for Grouped Conv (#2255 ) * Disable SetZero before launch kernel for grouped conv fwd * Move set zero to kernel * wmma fix * fix --------- Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com> [ROCm/composable_kernel commit: `8c1ed6f4c1`]	2025-06-11 23:41:03 +02:00
Muhammed Emin Ozturk	a59a60bfde	Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) (#2145 ) * reduction is passed as runtime parameter * clang * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_streamk_v3.hpp Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp * remove comment --------- [ROCm/composable_kernel commit: `6fad1c4874`]	2025-06-11 10:59:44 -07:00
Thomas Ning	2350191009	Epilogue cshuffle Improvement (#2312 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues * solve the previous PR error, refine the code * Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Comment addressed * handling tile_engine failing case * handling tile_engine failing case --------- Co-authored-by: joyeamd <John.Ye@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: khushbu agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `06e0b8436c`]	2025-06-10 22:44:50 -07:00
Thomas Ning	a0af2eca3f	fix on the typo (#2326 ) [ROCm/composable_kernel commit: `14d229d6c8`]	2025-06-10 16:34:33 -07:00
Khushbu Agarwal	bdb185a509	fix flatmm kernel for bigger size for fp16 datatype (#2302 ) [ROCm/composable_kernel commit: `bd270fe4bc`]	2025-06-10 11:13:40 -07:00
Aviral Goel	54ded8c52f	Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301 ) * - elevate important build messages to log level STATUS - comment out the rest (temporarily) * - marked all low importance build messages as log_level=DEBUG [ROCm/composable_kernel commit: `aed0f5880c`]	2025-06-10 10:46:47 -07:00
Max Podkorytov	2b4f8a85b2	Convert CK (GeMM MulMul Weight Preshuffle) instances to use 16x16 xdl tile (#2229 ) * compile profiler only for gemm-mulmul-weight-preshuffle * m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block * process all p1 instances * process all p2 instances * process all p3 instances * convert p4 instance * modify compute p1 instances * modify compute p2 instances * relax p4 instance c block transfer cluster len * fix c block transfer cluster lengths comment * add mfma (without 16x16) instances to the profiler * roll back profiling cmakelists change * clang-format * re-add (now unused) 32x32 xdl-tile instances * clang-format * add more instances * fit c block transfer lengths into block * copy and write over the instance definitions from bf16 to fp16 * add instances to profiler * unify instance tuple alias [ROCm/composable_kernel commit: `e6b5e31c20`]	2025-06-10 09:37:14 -07:00
Eisuke Kawashima	808cc61307	chore: unset executable permission (#2303 ) Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com> [ROCm/composable_kernel commit: `4e586ca958`]	2025-06-10 09:13:59 -07:00
John Afaganis	42ea095d98	Remove usage of 'warpSize' variable as it has been deprecated (#2295 ) * SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround * SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo --------- Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com> [ROCm/composable_kernel commit: `6635d1bb88`]	2025-06-10 07:34:54 -07:00
dependabot[bot]	66767bf11b	Bump rocm-docs-core[api_reference] from 1.20.0 to 1.20.1 in /docs/sphinx (#2317 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.20.0 to 1.20.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.20.1/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.20.0...v1.20.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.20.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `3d9f5eafaf`]	2025-06-10 07:27:26 -07:00
Illia Silin	c1fb1b74a2	fix headers (#2321 ) [ROCm/composable_kernel commit: `1ac5eeaea9`]	2025-06-10 07:26:32 -07:00
carlushuang	6e9b89c349	hot fix (#2315 ) [ROCm/composable_kernel commit: `2e0536269e`]	2025-06-10 20:35:28 +08:00
Bartłomiej Kocot	dbec4063f1	Grouped conv bwd wei explicit GEMM for odd C/K (#2306 ) [ROCm/composable_kernel commit: `7a83f1d510`]	2025-06-10 11:17:12 +02:00
MHYangAMD	7cb7aa8e75	Fix fmha fwd precision issue on MI3XX series (#2285 ) * Fix fmha fwd precision issue on MI3XX series For fmha fwd fp16 cases, we found that using impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero. For examaple, fixing K,V to be all 1, and Q is random, which outputs are expected to be all 1. But we found that it would have some incorrect outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 = 0.0005 < 0.001) Thus, ck do not report this error. * Add option to switch rtn/rtz for fmha fwd [ROCm/composable_kernel commit: `9fcf21a4ec`]	2025-06-10 15:03:23 +08:00
carlushuang	0544dfad87	MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! (#2305 ) [ROCm/composable_kernel commit: `65835c0bbb`]	2025-06-10 10:40:54 +08:00
Aviral Goel	8abeabd2ff	Code Refactor for check_err.hpp (#2284 ) * refactor & add documentation * removed return datatype from doxygen comments * Update include/ck_tile/host/check_err.hpp Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: John Afaganis <john.afaganis@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `5a0bd157db`]	2025-06-08 13:41:27 -07:00
Max Podkorytov	088b21921d	Add a python script for running ckProfiler and processing the results (#2288 ) * add profiler script * add comments * generalize and add some input validation * format * refactor * Rename run_ck_profiler.py to run_ck_profiler_gemm_with_csv_shapes.py rename script file [ROCm/composable_kernel commit: `aece3c6700`]	2025-06-08 12:41:57 -07:00
Sami Remes	c964eb1186	[CK_TILE] Tileloop persistent gemm - resubmit (#2299 ) * Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit 1d9fd3b6a8f8e84a407b8e59b63b17c258f4fb78. * Add missing header for kentry --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `1c6f83df6c`]	2025-06-06 14:18:49 -07:00
valarLip	a5e36a2494	extend buffer load to support load 32 bf16/fp16 at same time (#2291 ) [ROCm/composable_kernel commit: `8482977a37`]	2025-06-06 17:21:19 +08:00

2001 Commits