composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 11:47:48 +00:00

Author	SHA1	Message	Date
Ville Pietilä	5d7a0487f8	Refactor conv profiler to produce statistics for analysing split-K autodeduction performance.	2025-06-30 14:20:10 +00:00
Ville Pietilä	196b65d1d6	Simple score based model to predict best split-K.	2025-06-27 12:52:24 +00:00
Ville Pietilä	fb5afe5d1d	Add new split-K strategy.	2025-06-26 15:06:32 +00:00
Ville Pietilä	4760c17d23	Analysis script improvements.	2025-06-25 08:57:30 +00:00
Ville Pietilä	3b09175716	Fix benchmarked subscription values.	2025-06-23 13:04:17 +00:00
Ville Pietilä	3410628696	Adjust subscription factor tuning values.	2025-06-23 12:45:04 +00:00
Ville Pietilä	5084311980	Refactor conv profiler and implicit gemm solvers to enable tuning over subscription factors.	2025-06-23 11:48:13 +00:00
Ville Pietilä	385defc8cd	WIP: Oversubscription factor.	2025-06-19 15:12:21 +00:00
Ville Pietilä	e18e53ed16	Fixed GemmK size calculation.	2025-06-19 05:38:51 +00:00
Ville Pietilä	9d2f58de3a	Analysis script improvements.	2025-06-17 10:52:59 +00:00
Ville Pietilä	92af6c85c8	Profiler script improvements.	2025-06-17 09:05:03 +00:00
Ville Pietilä	195c22151a	Analysis script improvements.	2025-06-17 09:01:33 +00:00
Ville Pietilä	6cbefab7c7	Convolution profiler improvements.	2025-06-17 09:00:35 +00:00
Ville Pietilä	550cfb939d	Add split-K autodeduction to grouped conv bwd weight XDL CShuffle V3.	2025-06-16 11:31:12 +00:00
Ville Pietilä	282bca6e14	Fix disabled op handling in conv profiler.	2025-06-16 11:30:13 +00:00
Ville Pietilä	bc14ad554b	Improve test scripts.	2025-06-16 11:29:10 +00:00
Ville Pietilä	3b8dfae5bc	Add script to generate CK profiler commands from the KTN data.	2025-06-13 13:49:38 +00:00
Ville Pietilä	e314e6faf0	Fix handling convolution group size.	2025-06-13 13:49:02 +00:00
Ville Pietilä	38340e795b	Merge remote-tracking branch 'origin/develop' into vpietila/split-k-param-auto-deduce	2025-06-12 09:45:16 +00:00
Ville Pietilä	8e965d07ab	Fix a bug in occupancy estimation.	2025-06-12 09:41:17 +00:00
Bartłomiej Kocot	bb4f471b09	Grouped conv bwd weight with grouped gemm (#2304 ) * Grouped conv bwd weight with grouped gemm * fixes * fix * Fixes * test comments * restore atol * fix	2025-06-12 10:15:07 +02:00
carlushuang	8aff45a8af	[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327 ) * refactor subtoken logic to let more kernel pickup mp kernel * typo	2025-06-12 11:44:22 +08:00
Yi DING	37554c31e8	Add MoE & FP8 Blockscale WP Kernels for GFX950 (#2297 ) * [fix] align v3 gufusion pipeline * fix device kernel selection. * Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE * experimental optimization for scale load in blkscale gemm * Add asm for no-loop v3_128x128x128 * fix bugs * tune fp8 example * Update v1_128x128x128 to 2x2 instead of 4x1 * wip * add warmup to asm launch * wip2 * 16x16 function merged to moe * temp save, a performant version. * wip3 * Update .co binary to 16x16 * 16x16x128 correct; 64x64x128 failed * update * use mem_op::set when topk=1 * add mx fp8 b_preshuffle support, function not yet tested. * Spilt the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints * some fixes * fix update * remove some unnecessary hacky; enable 256x256x256 tilesize * update for function debug * Add pipeline v3. Have some runtime issue and register spill * Fix pipe v3 correctness issue * remove unnecessary hacky * clang format * fix a bug * fix the bug, functional test passed * tempsave; buggy at passed 4 e8m0 to scaled mfma * added fp4_bpreshuffle example, build failures * fixed some bugs * implement shuffled scale mxfp4gemm, blocker: opsel not effect * hotfix * fix bugs, build passed * (M, N, K)=(128, 128, 128) function failed. * temp save for gemm1. Function not ready * fix compile error. Gemm2 pass. Gemm1 WIP * fix bug for a lds read * update moe * Compile pass. Gemm1 function WIP * update moe * fix fp8; fix even/odd * tempsave * update moe * Revert "update" This reverts commit `960b2bce1c`. * Revert "use mem_op::set when topk=1" This reverts commit `def952a178`. * Add v3 128x128x128_4x4_16x16.co for gfx950 * temp cmake flag suppression for aiter test * add code for mxfp4 gemm, blockscale not supported yet * gemm1 up-only pass. GU WIP * function pass with inline asm hacky * revert unexpected file change * updated and build passed * update CE elementOP * added code for debug * Gemm1 GUFusion function pass. 
Perf WIP * Fix fp8/bf8; remove duplicated code * disable the scheduler in v3; bring it back when compiler feature ready. * update moe v1 pipeline * Add gemm1 v1 32x128x128 * remove schedule barrier * updated * Fix fp8/bf8 B-row * mfma using asm, device result correct, host result need to check * gemm1 v3 64x128x128 debug * fix cpu ref * a/b thread_desc stride fix * Use random scale for init1 * 16x16x128 input size blockscale function passed * fix blockscale gemm bug * tempsave. Almost all instances passed. * v1 fix for mi350. * temp save * debug save * update debug * fix the bug, 128x128x256 tile function passed * v3 * rename moe block selector and pipeline * Add gemm1 v1 * Add gemm1 v1 to selector * added mx moe block v3 support, function passed * compile error fix * Improve the pipeline * Pack e8m0 as int32_t * v1 compile pass. Function not ready * debug synchronize issue over different GPU/ROCm * minor fix * Add profiler filter * Add f4 ckProfiler * Fix example compile error * Add f4 profiler examples * tempsave * v1 function pass. * v3 function pass * align file and function name * mx_moe_fp4 ready for aiter with clang-format. * modify the way we represent fp4 * generalize the pipeline scheduling. * init moe mx f4 scale shuffle * Cmakelist diable compiler-bound flags * mx_fp4 default parameter change * Moe blockscale gemm1&gemm2 asm support for aiter. Suppression cmkae flag til new compler. * update code * tempsave; modify the way we represent fp4 * generalize the pipeline scheduling. * Add gemm1 gfx942 .co support * updated code, build passed. * Update gemm2 asm with latest compiler flag * Fix mx f4 ckProfiler * Fix blockwise gemm mx v1 * lds conflict free + buffer load lds * Add gemm2 v3 64x128x128 * fix a, b scale loading bugs, a, b scale loading now correctly * Add gemm2 v3 64x128x128 * commit with debug info * fix fp4 profiler * Add mx fp4 pileline v1 instances * Fix v2 topk_weight cal. Add silu asm. 
* v2 tok_weight WIP * init mx fp4 B no preshuffle version * tempsave. compile pass, function wrong * enable fp4 moe no weigth preshuffle, function pass * update the TFlops calculation in the example * Add gemm2 64x128x128 asm. Fix BF16 ref. * fix 2 typos in fp4_preshuffle * Better kernel selection in device classes * correct preShuffleBuffer we should used packed k to do shuffle. * lds conflict free + buffer load lds * optimize offset math in dma * Fix fp4 ckProfiler * Fix MX MFMA tests * fix f4 pipeline issues * gemm1 func pass * update mx moe gemm1_bns tile size to 64x128x256 * update mx moe gemm1 gemm2 TF and BW calculation * fix typo * temp save * Fix example_gemm_mx build * rename the block pipeline * correct a typo in tail * Add rotating to mx examples * fix the correctness issue * Fix v1; use M padding * Add NT flag to B/BScale buffer * Merge gemm_mx_common.hpp * temp save, 4.4~4.5 * Fix 'Merge gemm_mx_common.hpp' * refactor the pipeline * Pad the M for scale buffer unconditionaly * update MX moe GEMM1 hotloopscheduling * change the gemm1 tile from 64x128x128 to 128x64x128 * Unconditional Ascale padding * Pad shuffled a scale only * pad ascale * add vmcnt guard for async copy * Profiler add f4 wp * Merge preshuffle device * Add more fp4 wp instances * Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format * Clang-format after 2 merges * Remove rocm6.3 workaround flags and macro * Fix fp8 config * Fix bf8 config * flag and barrier fix for copmiler branch MainOpSelV3 * Add fp8 profiler instances * Remove debug infos; Enable flags for blockscale f8 * No asm ver. 
for merging moe blocksale fp8 into mainline * update the flag name for f8blockscale * recover example * fix performance bug of bpreshuffle f8 gemm * clang format, remove single rate mfma restriction for f8 * remove single rate mfma restriction for f8 blockscale gemm * Fix moe blockscale gemm1 barrier 0x800 for new compiler * add pipeline v1 for MOE Gemm2 * Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns * Fix OOB; add MB96 instances * remove unnecessary files * fix the cmake issue * Enable splitk for mxfp4; clang format; * Generate random tensor values with multiple threads * Use packed_size_v for A/BPackedSize * Fix warning * Fix target_compile_options for disabled target on gfx942 * fix moe pki4 on gfx950 * doc the kGroup definition * Fix ThreadwiseTensorSliceTransfer_v4::Run (Fuse scale) * Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example * Fix unknown compiler flag * fix two failed examples. * fix some failure tile size in gfx950 universal gemm. 
fix test_gemm_fp16 * workaround fix for test_gemm_f32; * We have very limited support for lds direct load if input matrix is not K major * fix test_gemm_splitk; * Fix compile for mx_mfma_op * add mfma selection logic for multipled_v3 * Clean up * Fix device gemm mx link error * improve the global atomic pattern * Revert unnecessary copyright updates * restore minimum_occupancy logic * Avoid data race in moe gemm2 ref * Build fp8 gemm_multiply_multiply and moe only on gfx94/95 * update the instance in device_mx_gemm * Resolve comments * Copyright 2025 * Remove unused code * fix library linking issue --------- Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: Your Name <you@example.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-12 09:25:59 +08:00
Bartłomiej Kocot	8c1ed6f4c1	Move SetZero functions inside the kernels for Grouped Conv (#2255 ) * Disable SetZero before launch kernel for grouped conv fwd * Move set zero to kernel * wmma fix * fix --------- Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com>	2025-06-11 23:41:03 +02:00
Muhammed Emin Ozturk	6fad1c4874	Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) (#2145 ) * reduction is passed as runtime parameter * clang * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_streamk_v3.hpp Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp * remove comment ---------	2025-06-11 10:59:44 -07:00
Ville Pietilä	1e1917c5bc	WIP: Find optimal multiplier for split-K value.	2025-06-11 15:28:21 +00:00
Ville Pietilä	e45a8251bc	Improve analysis script.	2025-06-11 15:27:42 +00:00
Ville Pietilä	9a4bcd19bd	Added analysis script.	2025-06-11 09:28:28 +00:00
Thomas Ning	06e0b8436c	Epilogue cshuffle Improvement (#2312 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues * solve the previous PR error, refine the code * Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Comment addressed * handling tile_engine failing case * handling tile_engine failing case --------- Co-authored-by: joyeamd <John.Ye@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: khushbu agarwal <khuagarw@amd.com>	2025-06-10 22:44:50 -07:00
Thomas Ning	14d229d6c8	fix on the typo (#2326)	2025-06-10 16:34:33 -07:00
Khushbu Agarwal	bd270fe4bc	fix flatmm kernel for bigger size for fp16 datatype (#2302)	2025-06-10 11:13:40 -07:00
Aviral Goel	aed0f5880c	Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301 ) * - elevate important build messages to log level STATUS - comment out the rest (temporarily) * - marked all low importance build messages as log_level=DEBUG	2025-06-10 10:46:47 -07:00
Max Podkorytov	e6b5e31c20	Convert CK (GeMM MulMul Weight Preshuffle) instances to use 16x16 xdl tile (#2229 ) * compile profiler only for gemm-mulmul-weight-preshuffle * m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block * process all p1 instances * process all p2 instances * process all p3 instances * convert p4 instance * modify compute p1 instances * modify compute p2 instances * relax p4 instance c block transfer cluster len * fix c block transfer cluster lengths comment * add mfma (without 16x16) instances to the profiler * roll back profiling cmakelists change * clang-format * re-add (now unused) 32x32 xdl-tile instances * clang-format * add more instances * fit c block transfer lengths into block * copy and write over the instance definitions from bf16 to fp16 * add instances to profiler * unify instance tuple alias	2025-06-10 09:37:14 -07:00
Eisuke Kawashima	4e586ca958	chore: unset executable permission (#2303 ) Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>	2025-06-10 09:13:59 -07:00
John Afaganis	6635d1bb88	Remove usage of 'warpSize' variable as it has been deprecated (#2295 ) * SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround * SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo --------- Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>	2025-06-10 07:34:54 -07:00
dependabot[bot]	3d9f5eafaf	Bump rocm-docs-core[api_reference] from 1.20.0 to 1.20.1 in /docs/sphinx (#2317 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.20.0 to 1.20.1. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.20.1/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.20.0...v1.20.1) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.20.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-06-10 07:27:26 -07:00
Illia Silin	1ac5eeaea9	fix headers (#2321)	2025-06-10 07:26:32 -07:00
carlushuang	2e0536269e	hot fix (#2315)	2025-06-10 20:35:28 +08:00
Ville Pietilä	98e506a358	Add option to suppress verbose output in test script.	2025-06-10 12:03:29 +00:00
Ville Pietilä	3e3a73eee3	Fix grid size calculation.	2025-06-10 12:02:41 +00:00
Bartłomiej Kocot	7a83f1d510	Grouped conv bwd wei explicit GEMM for odd C/K (#2306)	2025-06-10 11:17:12 +02:00
MHYangAMD	9fcf21a4ec	Fix fmha fwd precision issue on MI3XX series (#2285) * Fix fmha fwd precision issue on MI3XX series For fmha fwd fp16 cases, we found that using impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision issues, since it uses __builtin_amdgcn_cvt_pkrtz, which rounds toward zero. For example, with K and V fixed to all 1 and Q random, the outputs are expected to be all 1, but some outputs come out as 0.9995. That error is smaller than the atol of 0.001 (1 - 0.9995 = 0.0005 < 0.001), so CK does not report it. * Add option to switch rtn/rtz for fmha fwd	2025-06-10 15:03:23 +08:00
carlushuang	65835c0bbb	MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! (#2305)	2025-06-10 10:40:54 +08:00
Ville Pietilä	692d71d466	Add runner script for the Fremont shapes.	2025-06-09 16:40:48 +00:00
Ville Pietilä	b9432f6ef3	Bug fix.	2025-06-09 12:29:12 +00:00
Ville Pietilä	e8c1ea5fc3	Improve reporting perf results. Enable filtering out specific ops.	2025-06-09 12:24:10 +00:00
Ville Pietilä	edffee026e	Introduced argument 'all' to run all possible split-k values in convolution profiler.	2025-06-09 08:11:33 +00:00
Ville Pietilä	7fdfbd79a7	Move ArgumentSplitK definition into a separate header.	2025-06-09 06:59:26 +00:00
Aviral Goel	5a0bd157db	Code Refactor for check_err.hpp (#2284 ) * refactor & add documentation * removed return datatype from doxygen comments * Update include/ck_tile/host/check_err.hpp Co-authored-by: John Afaganis <john.afaganis@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck_tile/host/check_err.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> --------- Co-authored-by: John Afaganis <john.afaganis@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-06-08 13:41:27 -07:00
Max Podkorytov	aece3c6700	Add a python script for running ckProfiler and processing the results (#2288 ) * add profiler script * add comments * generalize and add some input validation * format * refactor * Rename run_ck_profiler.py to run_ck_profiler_gemm_with_csv_shapes.py rename script file	2025-06-08 12:41:57 -07:00

2015 Commits