composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 10:09:41 +00:00

Author	SHA1	Message	Date
jakpiase	bdb86fee78	[CK_TILE] Grouped Convolution Backward Weight Kernel (#2357 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * resolved conflicts * rebase fixes * fixes * tmp * add working support for splitk * minor fix * fixes * fixes * minor fix * small fix * Split K and preprocessing fixes --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `6681593864`]	2025-07-24 10:41:35 +02:00
Cong Ma	baf244000e	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. 
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `e62710e461`]	2025-07-23 00:10:16 -07:00
Yi DING	bb1a05cc87	[CK_TILE] Fix tile_example_moe_sorting broke in #2436 (#2525 ) [ROCm/composable_kernel commit: `f0a8c18017`]	2025-07-17 22:50:58 -07:00
Mateusz Ozga	96e2d5f86a	Fix CI clang-format (#2521 ) [ROCm/composable_kernel commit: `7fc000d7b3`]	2025-07-17 14:41:29 +02:00
Haocong WANG	894750fd59	fix mfma32x32 dispatch (#2490 ) [ROCm/composable_kernel commit: `28072adc3a`]	2025-07-17 15:24:12 +08:00
Yi DING	cee5776046	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> [ROCm/composable_kernel commit: `f1d8ad2818`]	2025-07-16 23:56:22 -07:00
Po Yen Chen	e75bbb6a13	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit c5014a86765c598215bf2b58f89c3a7c70f92bac. [ROCm/composable_kernel commit: `722c22fb15`]	2025-07-17 10:09:01 +08:00
linqunAMD	348dec0d0c	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `6e76b82059`]	2025-07-16 07:58:23 -07:00
MHYangAMD	ff8d3d0d13	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409 ) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com> [ROCm/composable_kernel commit: `3499fe67ff`]	2025-07-16 14:05:26 +08:00
carlushuang	34ca5f6a68	[CK_TILE] moe sorting optimize local_token (#2469 ) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example [ROCm/composable_kernel commit: `cfe211cc60`]	2025-07-15 09:42:18 +08:00
Khushbu Agarwal	f3120e7526	Merge flatmm Operator with universal gemm (#2434 ) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example * some more renames * clang formatted [ROCm/composable_kernel commit: `d239b91fd5`]	2025-07-11 08:27:55 -07:00
Qianfeng	337126469c	Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487 ) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com> [ROCm/composable_kernel commit: `45904b8fd7`]	2025-07-11 18:14:47 +08:00
shay-li77	4f08a02dae	support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338 ) * mask support ratio for y axis * format code * add notes for param y_ratio * fix comments error * support template and mdiv for ratio mask * refactor y-ratio mask constructor * optimize coordinate calculation * add SimplifiedRatioAttentionMask [ROCm/composable_kernel commit: `d814fefe18`]	2025-07-09 23:18:55 +08:00
Haocong WANG	7c04d93083	[CK TILE] Fix FA build filter (#2369 ) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `5557eadce6`]	2025-07-08 10:42:07 +08:00
Po Yen Chen	a71dc1245f	Eliminate warning caused by failed to meet occupancy requirement (#2389 ) Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `b2dea90116`]	2025-07-08 09:17:25 +08:00
Thomas Ning	1129c9dc4e	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com> [ROCm/composable_kernel commit: `f240ae3248`]	2025-07-07 10:08:49 -07:00
ltqin	ba133fe9b7	ck tile pagedkv prefill (#2405 ) * add prefetching physical block id for pagedkv * start add pagedkv prefill * rename pipeline * add kernel for pagedkv * add an init version pagedkv prefill * fix redefine issue * add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args * generate dispatch code * add body generating code * comipling pass * remove dropout from pagedkv * set lse to false in generating code * start changing qr kernel to pagedkv * init version of kernerl with pagedkv * change names of file that are generated * chang host validation for pagedkv prefill * using iglp to change blockgemm * add kernel files to op head file * show parameters * rewrite print parameter fun * add fwd * remove default parameter of GridSize * format * fix nhead issue and add seqlen_k_ptr to batch mode * format code * remove no-longer used code * format * fix some comments --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `9f4c5d7372`]	2025-07-07 16:16:54 +08:00
Thomas Ning	90add28587	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `e03293ebce`]	2025-06-25 08:20:35 -07:00
linqunAMD	d2ec53a74e	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp defintion to a separate struct * fix clang format * fix clang format * keep default bhavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `37e1a27537`]	2025-06-25 01:07:45 -07:00
Po Yen Chen	b62e551ccb	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs() [ROCm/composable_kernel commit: `50fad03524`]	2025-06-25 15:19:21 +08:00
Yi DING	c4ba466332	Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393 ) [ROCm/composable_kernel commit: `c5d9181e1b`]	2025-06-24 16:35:54 -07:00
Yi DING	cba904aeff	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `b8212864cf`]	2025-06-24 01:33:31 +08:00
Po Yen Chen	4456552543	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors [ROCm/composable_kernel commit: `7d669440a6`]	2025-06-23 12:29:15 +08:00
Max Podkorytov	eb96164495	Update for xformers (#2372 ) * update api * update kernel api * clang-format [ROCm/composable_kernel commit: `0366fb2abc`]	2025-06-22 00:28:30 -07:00
Bartłomiej Kocot	2567f5e538	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes [ROCm/composable_kernel commit: `cebdee4d9e`]	2025-06-20 15:44:36 -07:00
Max Podkorytov	1beaead90c	Reland fix default epilogue (#2367 ) * Revert "Revert "Fix default epilogue (#2358)" (#2364)" This reverts commit c0a58748ade1dc4d5405894c41ca33f923e480c6. * add operator() with old signature [ROCm/composable_kernel commit: `11eb9f1c77`]	2025-06-19 10:39:30 -07:00
Thomas Ning	ffafdec4d8	Revert "Fix default epilogue (#2358 )" (#2364 ) This reverts commit f1e842455a66c955a3df38a9fc383141c4bcc62f. [ROCm/composable_kernel commit: `64a2fda713`]	2025-06-17 22:43:05 -07:00
carlushuang	8660f6ef22	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example [ROCm/composable_kernel commit: `a4e1248dba`]	2025-06-18 10:49:43 +08:00
Max Podkorytov	9a342df600	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cd606f72c1`]	2025-06-17 17:30:21 -07:00
linqunAMD	ca0c6fb660	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile config. - It is useful when use you need compare gemm perf with different tile/pipeline config - we also can use it simplify the code for wmma support in the furture. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error [ROCm/composable_kernel commit: `0eb8974502`]	2025-06-17 17:27:46 -07:00
Satyanvesh Dittakavi	a4517b0a9d	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
Thomas Ning	cb631cd5b1	Fix the CK Tile related operators (#2356 ) * fix the flatmm * Fix the pipeline * address the comment [ROCm/composable_kernel commit: `3c4cdfac4f`]	2025-06-16 17:38:52 -07:00
Illia Silin	fba2ac98ae	Revert "fix the flatmm (#2349 )" (#2352 ) This reverts commit 37f25427f2ec6c907bf12030338f69945691f54c. [ROCm/composable_kernel commit: `5523df4b2d`]	2025-06-16 07:54:55 -07:00
Thomas Ning	6f158242b6	fix the flatmm (#2349 ) [ROCm/composable_kernel commit: `d996bc78be`]	2025-06-16 02:17:53 -07:00
carlushuang	d68fdea428	hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348 ) [ROCm/composable_kernel commit: `fb97f75099`]	2025-06-15 22:49:04 -07:00
Mateusz Ozga	044a8560f7	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a refrence function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails [ROCm/composable_kernel commit: `bd96ac9742`]	2025-06-13 19:39:11 +02:00
kylasa	10498656ef	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build. [ROCm/composable_kernel commit: `5f1ad09b61`]	2025-06-12 18:24:02 -07:00
carlushuang	5d7302c240	[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327 ) * refactor subtoken logic to let more kernel pickup mp kernel * typo [ROCm/composable_kernel commit: `8aff45a8af`]	2025-06-12 11:44:22 +08:00
Thomas Ning	2350191009	Epilogue cshuffle Improvement (#2312 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues * solve the previous PR error, refine the code * Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Comment addressed * handling tile_engine failing case * handling tile_engine failing case --------- Co-authored-by: joyeamd <John.Ye@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: khushbu agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `06e0b8436c`]	2025-06-10 22:44:50 -07:00
Thomas Ning	a0af2eca3f	fix on the typo (#2326 ) [ROCm/composable_kernel commit: `14d229d6c8`]	2025-06-10 16:34:33 -07:00
Khushbu Agarwal	bdb185a509	fix flatmm kernel for bigger size for fp16 datatype (#2302 ) [ROCm/composable_kernel commit: `bd270fe4bc`]	2025-06-10 11:13:40 -07:00
Eisuke Kawashima	808cc61307	chore: unset executable permission (#2303 ) Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com> [ROCm/composable_kernel commit: `4e586ca958`]	2025-06-10 09:13:59 -07:00
MHYangAMD	7cb7aa8e75	Fix fmha fwd precision issue on MI3XX series (#2285 ) * Fix fmha fwd precision issue on MI3XX series For fmha fwd fp16 cases, we found that using impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero. For examaple, fixing K,V to be all 1, and Q is random, which outputs are expected to be all 1. But we found that it would have some incorrect outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 = 0.0005 < 0.001) Thus, ck do not report this error. * Add option to switch rtn/rtz for fmha fwd [ROCm/composable_kernel commit: `9fcf21a4ec`]	2025-06-10 15:03:23 +08:00
Sami Remes	c964eb1186	[CK_TILE] Tileloop persistent gemm - resubmit (#2299 ) * Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit 1d9fd3b6a8f8e84a407b8e59b63b17c258f4fb78. * Add missing header for kentry --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `1c6f83df6c`]	2025-06-06 14:18:49 -07:00
Andriy Roshchenko	ab0540c5db	Optimized GEMMs for MX FP4/8 (#2294 ) Adds V3 GEMM pipeline for MX FP4 and MX FP8 Adds V3 GEMM pipeline for MX FP4 with preshuffling Adds MXFP4 GEMM tests (#2275) Adds MXFP4 GEMM examples Adds MXFP4 GEMMs to ckProfiler Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com> [ROCm/composable_kernel commit: `00247e3c29`]	2025-06-05 13:54:15 -06:00
Illia Silin	4fba4073d3	Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 ) This reverts commit 6b2a12ae04a22188acd1444e69d89b270525b79e. [ROCm/composable_kernel commit: `233e274077`]	2025-06-05 09:24:00 -07:00
Sami Remes	84ba4164c9	[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222 ) * Add TailHandler for V3, V4 and Mem pipelines * Adapt examples and tests to use TailHandler * move tail-handling logic to pipeline in persistent grouped gemm * Fix Mem pipeline dispatching, add CompV4 dispatching * Use a macro for handling the many tails of Mem pipeline * Fix formatting again * Use const-ref RunFunction, remove unnecessary try_run [ROCm/composable_kernel commit: `7ea1508b59`]	2025-06-04 11:50:21 +03:00
Sami Remes	47d599c8e3	[CK_TILE] Tile loop persistent gemm kernel (#2191 ) * Implement tile loop persistent gemm kernel * Enable timing * Add tests for persistent gemm * Fix formatting * Fix gemm_basic * Rename True/False to Persistent/NonPersistent * Use only one set of layouts for persistent tests * Fix gemm example persistent template parameter * Fix formatting [ROCm/composable_kernel commit: `ffb52783d0`]	2025-06-04 11:46:28 +03:00
Khushbu Agarwal	c395db8926	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262 ) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted [ROCm/composable_kernel commit: `59a85cb4bc`]	2025-06-03 20:16:10 -07:00
Illia Silin	3eaca9f232	Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185 )" (#2260 ) This reverts commit cdec424edde58fe081fb7f63fa8e247f3975b8b7. [ROCm/composable_kernel commit: `4e561af18c`]	2025-05-29 16:22:16 -07:00

220 Commits