Commit Graph

220 Commits

Author SHA1 Message Date
jakpiase
bdb86fee78 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 6681593864]
2025-07-24 10:41:35 +02:00
Cong Ma
baf244000e ck_tile kernel for gemm with groupwise quantized A tensor (#2473)
* ck_tile kernel for gemm with groupwise quantized A or B tensor.

This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.

Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32

Group size can go down to as low as K length of underlying WarpGemm primitive.

For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers.

* [Block Scale Gemm] Only run gemm quant examples on __gfx94__

- Only run gemm quant examples on __gfx94__ for usage of
  `v_cvt_pk_fp8_f32`
- Format the code

* [Block Scale Gemm] Remove Bquant Gemm BlockScale

This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.

* [Block Scale Gemm] Format code with clang-format-12

The latest clang-format (v19) in ROCm 7.0 generate different result than
clang-format-12 which is used in CK CI.

Format code with clang-format-12 for consistency.

* [Block Scale Gemm] Split the k direction loop

- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
   to make the logic clearer.
- Disable C transposition.

* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm

* [Block Scale Gemm] Update copyright

* test

* Add TailHandler

* Move TileDistributionEncodingPatternAQ

* Refactor

* refactor

* fix bug

* fix bug

* help solve the PR comment

* Format the code

* [Block Scale Gemm] Add unit tests

* [Block Scale Gemm] Add support to 16x16x32 MFMA

- Add support to 16x16x32 MFMA
- Fix a bug when exchange data crossing lanes

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e62710e461]
2025-07-23 00:10:16 -07:00
Yi DING
bb1a05cc87 [CK_TILE] Fix tile_example_moe_sorting broke in #2436 (#2525)
[ROCm/composable_kernel commit: f0a8c18017]
2025-07-17 22:50:58 -07:00
Mateusz Ozga
96e2d5f86a Fix CI clang-format (#2521)
[ROCm/composable_kernel commit: 7fc000d7b3]
2025-07-17 14:41:29 +02:00
Haocong WANG
894750fd59 fix mfma32x32 dispatch (#2490)
[ROCm/composable_kernel commit: 28072adc3a]
2025-07-17 15:24:12 +08:00
Yi DING
cee5776046 [CK_TILE] Use read_tr in universal gemm (#2436)
* Use read_tr in universal gemm

* Enable all instances back

* Revert example37 changes

* Resolve comments

* resolve comments 2

* Fix assertion msg

* fix the gemm basic

* change index_t to bool for preshuffle variable

* Solve the comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>

[ROCm/composable_kernel commit: f1d8ad2818]
2025-07-16 23:56:22 -07:00
Po Yen Chen
e75bbb6a13 Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389)" (#2514)
This reverts commit c5014a86765c598215bf2b58f89c3a7c70f92bac.

[ROCm/composable_kernel commit: 722c22fb15]
2025-07-17 10:09:01 +08:00
linqunAMD
348dec0d0c Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 6e76b82059]
2025-07-16 07:58:23 -07:00
MHYangAMD
ff8d3d0d13 [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>

[ROCm/composable_kernel commit: 3499fe67ff]
2025-07-16 14:05:26 +08:00
carlushuang
34ca5f6a68 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix 15 example

[ROCm/composable_kernel commit: cfe211cc60]
2025-07-15 09:42:18 +08:00
Khushbu Agarwal
f3120e7526 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example

* some more renames

* clang formatted

[ROCm/composable_kernel commit: d239b91fd5]
2025-07-11 08:27:55 -07:00
Qianfeng
337126469c Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* i_nhead_ conversion type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 45904b8fd7]
2025-07-11 18:14:47 +08:00
shay-li77
4f08a02dae support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask

[ROCm/composable_kernel commit: d814fefe18]
2025-07-09 23:18:55 +08:00
Haocong WANG
7c04d93083 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid change fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>

[ROCm/composable_kernel commit: 5557eadce6]
2025-07-08 10:42:07 +08:00
Po Yen Chen
a71dc1245f Eliminate warning caused by failed to meet occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: b2dea90116]
2025-07-08 09:17:25 +08:00
Thomas Ning
1129c9dc4e Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment some not gfx950 codes

* comment some not gfx950 codes

* fix a error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo error

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compile host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>

[ROCm/composable_kernel commit: f240ae3248]
2025-07-07 10:08:49 -07:00
ltqin
ba133fe9b7 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* comipling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of  kernerl with pagedkv

* change names of file that are generated

* chang host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 9f4c5d7372]
2025-07-07 16:16:54 +08:00
Thomas Ning
90add28587 [CK Tile] Int8 Support on CK Tile GEMM (#2267)
* updates to support int8 in 03_gemm example

* added comments, using aliases, helper functions

* test(gemm_universal): add test cases for int8 gemm pipeline

* fix(test_gemm): fix for failing test unit test for int8

* test(ck_tile): add int8 unit test for gemm universal

* refactor(gemm_universal): GPU reference verification for GEMM code improved

* style(gemm_universal): removed extra comments and did clang format

* merging recent changes to universal gemm to tile_engine

* ck tile engine integration work

* feat(tile_engine): add int8 support to tile engine ops/gemm

* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8

* style: Format code with clang-format-12

* refactor(tile_engine): address review comments

* style: removed unhelpful comments & unused variables.

* build: tile engine uses default config

* feat: add int8 support for CK_TILE GEMM

* style: added trailing commas to codegen_utils.py

* refactor: tile engine

* refactor: formatting and code review

* refactor: code formatting for python files

* fix: suppress build warning

* add support for gfx950

* refactor:KWarpTile size in gemms util

* Fix the branch and wrap up the k warp tile

* Add bf8 integration

* refactor: clang format and rebase

---------

Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>

[ROCm/composable_kernel commit: e03293ebce]
2025-06-25 08:20:35 -07:00
linqunAMD
d2ec53a74e [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp defintion to a separate struct

* fix clang format

* fix clang format

* keep default bhavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 37e1a27537]
2025-06-25 01:07:45 -07:00
Po Yen Chen
b62e551ccb [CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403)
* Rename batch_prerfill interface

* Add min_seqlen_q parameter in MakeKargs()

[ROCm/composable_kernel commit: 50fad03524]
2025-06-25 15:19:21 +08:00
Yi DING
c4ba466332 Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393)
[ROCm/composable_kernel commit: c5d9181e1b]
2025-06-24 16:35:54 -07:00
Yi DING
cba904aeff [CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114)
* 160+192

* Add splitkv d160

* cleanup

* fix

* Add change log

* Fix CHANGELOG

* Use static_cast

* Update ignored instance

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: b8212864cf]
2025-06-24 01:33:31 +08:00
Po Yen Chen
4456552543 [CK_TILE] Fix compilation errors introduced in #2320, #2219 and #2214 (#2388)
* Fix compilation errors

* Fix more ck_tile example compilation errors

[ROCm/composable_kernel commit: 7d669440a6]
2025-06-23 12:29:15 +08:00
Max Podkorytov
eb96164495 Update for xformers (#2372)
* update api

* update kernel api

* clang-format

[ROCm/composable_kernel commit: 0366fb2abc]
2025-06-22 00:28:30 -07:00
Bartłomiej Kocot
2567f5e538 [CK TILE] Grouped Convolution Forward Kernel (#2188)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* rebase fixes

* fixes

* fixes

[ROCm/composable_kernel commit: cebdee4d9e]
2025-06-20 15:44:36 -07:00
Max Podkorytov
1beaead90c Reland fix default epilogue (#2367)
* Revert "Revert "Fix default epilogue  (#2358)" (#2364)"

This reverts commit c0a58748ade1dc4d5405894c41ca33f923e480c6.

* add operator() with old signature

[ROCm/composable_kernel commit: 11eb9f1c77]
2025-06-19 10:39:30 -07:00
Thomas Ning
ffafdec4d8 Revert "Fix default epilogue (#2358)" (#2364)
This reverts commit f1e842455a66c955a3df38a9fc383141c4bcc62f.

[ROCm/composable_kernel commit: 64a2fda713]
2025-06-17 22:43:05 -07:00
carlushuang
8660f6ef22 [CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335)
* support local_token for hipgraph

* update README

* fix comment

* fix fmoe example

[ROCm/composable_kernel commit: a4e1248dba]
2025-06-18 10:49:43 +08:00
Max Podkorytov
9a342df600 Fix default epilogue (#2358)
* [ck-tile] fix default epilogue in gemm universal

* argument validation needs vector size D

* operator() needs to specify dram windows

* copy/paste from cshuffle epilogue

* clang-format

* mark unused argument

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: cd606f72c1]
2025-06-17 17:30:21 -07:00
linqunAMD
ca0c6fb660 [CK_TILE] Support multi-config in tile_example_gemm_universal (#2240)
* [CK_TILE] Support multi-config in tile_example_gemm_universal

Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when use you need compare gemm perf with different tile/pipeline config
- we also can use it simplify the code for wmma support in the furture.

* [CK_TILE] Support multi-config in tile_example_gemm_universal

Address review comments

* rebase code and fix clang format.

* fix clang format

* support pipeline v5.

* fix merge conflict

* address review comment

* add missing file

* address review comment v2

* fix build error

[ROCm/composable_kernel commit: 0eb8974502]
2025-06-17 17:27:46 -07:00
Satyanvesh Dittakavi
a4517b0a9d Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 4c57157d50]
2025-06-17 11:54:30 -07:00
Thomas Ning
cb631cd5b1 Fix the CK Tile related operators (#2356)
* fix the flatmm

* Fix the pipeline

* address the comment

[ROCm/composable_kernel commit: 3c4cdfac4f]
2025-06-16 17:38:52 -07:00
Illia Silin
fba2ac98ae Revert "fix the flatmm (#2349)" (#2352)
This reverts commit 37f25427f2ec6c907bf12030338f69945691f54c.

[ROCm/composable_kernel commit: 5523df4b2d]
2025-06-16 07:54:55 -07:00
Thomas Ning
6f158242b6 fix the flatmm (#2349)
[ROCm/composable_kernel commit: d996bc78be]
2025-06-16 02:17:53 -07:00
carlushuang
d68fdea428 hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348)
[ROCm/composable_kernel commit: fb97f75099]
2025-06-15 22:49:04 -07:00
Mateusz Ozga
044a8560f7 [CK_TILE] Multiple-D GEMM example (#2219)
* Multiple d, initial commit

* Check Ds Layout

* Readme and clang format

* Update branch & conflicts

* Multiple D - fix clang-formatter

* Rename elemetwise_op

* Fix CI

* Code review part1

* Remove printf

* Remove unnecessary comment

* Add new tests with Col layout

* Review part 2

* Added support for Multiple D GEMM

* Update comment

* Remove maybe_unused

* Clang-format

* Review part 3

* Add comment to function

* Add comment to function: another

* Take number of params for a refrence function

* Remove additional d param for 0 tensor

* Change name of function

* Fix CI fails

[ROCm/composable_kernel commit: bd96ac9742]
2025-06-13 19:39:11 +02:00
kylasa
10498656ef Code drop for 2 warp ping pong scheduler along K dimension. (#2276)
* Code drop for 2 warp ping pong scheduler along K dimension.

* Addressing code review comments.

* Addressing Clang formatting issues.

* Addressing build issues.

* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.

* Fix for LDS memory size for GEMM pipelines.

* Addressing code review feedback comments.

* Change log update.

* Addressing code review comments and build issues.

* Added new policy for pipeline specific logic about LDS needs.

* Clang Fix during build.

[ROCm/composable_kernel commit: 5f1ad09b61]
2025-06-12 18:24:02 -07:00
carlushuang
5d7302c240 [CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327)
* refactor subtoken logic to let more kernel pickup mp kernel

* typo

[ROCm/composable_kernel commit: 8aff45a8af]
2025-06-12 11:44:22 +08:00
Thomas Ning
2350191009 Epilogue cshuffle Improvement (#2312)
* add cshuffle's mxdlperwavepershuffle support, not finished

* add epilogue functions

* add cshuffle's mxdlperwavepershuffle support, not finished

* add epilogue functions

* update cshuffle logic

* update cshuffle_logics

* add some change within review

* update some codes following the code review

* update epilogue logic

* remove from problem

* update codes following review.

* fix some issues

* solve the previous PR error, refine the code

* Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Comment addressed

* handling tile_engine failing case

* handling tile_engine failing case

---------

Co-authored-by: joyeamd <John.Ye@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: khushbu agarwal <khuagarw@amd.com>

[ROCm/composable_kernel commit: 06e0b8436c]
2025-06-10 22:44:50 -07:00
Thomas Ning
a0af2eca3f fix on the typo (#2326)
[ROCm/composable_kernel commit: 14d229d6c8]
2025-06-10 16:34:33 -07:00
Khushbu Agarwal
bdb185a509 fix flatmm kernel for bigger size for fp16 datatype (#2302)
[ROCm/composable_kernel commit: bd270fe4bc]
2025-06-10 11:13:40 -07:00
Eisuke Kawashima
808cc61307 chore: unset executable permission (#2303)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>

[ROCm/composable_kernel commit: 4e586ca958]
2025-06-10 09:13:59 -07:00
MHYangAMD
7cb7aa8e75 Fix fmha fwd precision issue on MI3XX series (#2285)
* Fix fmha fwd precision issue on MI3XX series

For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero.

For examaple, fixing K,V to be all 1, and Q is random, which outputs are
expected to be all 1. But we found that it would have some incorrect
outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 =
0.0005 < 0.001) Thus, ck do not report this error.

* Add option to switch rtn/rtz for fmha fwd

[ROCm/composable_kernel commit: 9fcf21a4ec]
2025-06-10 15:03:23 +08:00
Sami Remes
c964eb1186 [CK_TILE] Tileloop persistent gemm - resubmit (#2299)
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)

This reverts commit 1d9fd3b6a8f8e84a407b8e59b63b17c258f4fb78.

* Add missing header for kentry

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 1c6f83df6c]
2025-06-06 14:18:49 -07:00
Andriy Roshchenko
ab0540c5db Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8 
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler




Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>

[ROCm/composable_kernel commit: 00247e3c29]
2025-06-05 13:54:15 -06:00
Illia Silin
4fba4073d3 Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
This reverts commit 6b2a12ae04a22188acd1444e69d89b270525b79e.

[ROCm/composable_kernel commit: 233e274077]
2025-06-05 09:24:00 -07:00
Sami Remes
84ba4164c9 [CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222)
* Add TailHandler for V3, V4 and Mem pipelines

* Adapt examples and tests to use TailHandler

* move tail-handling logic to pipeline in persistent grouped gemm

* Fix Mem pipeline dispatching, add CompV4 dispatching

* Use a macro for handling the many tails of Mem pipeline

* Fix formatting again

* Use const-ref RunFunction, remove unnecessary try_run

[ROCm/composable_kernel commit: 7ea1508b59]
2025-06-04 11:50:21 +03:00
Sami Remes
47d599c8e3 [CK_TILE] Tile loop persistent gemm kernel (#2191)
* Implement tile loop persistent gemm kernel

* Enable timing

* Add tests for persistent gemm

* Fix formatting

* Fix gemm_basic

* Rename True/False to Persistent/NonPersistent

* Use only one set of layouts for persistent tests

* Fix gemm example persistent template parameter

* Fix formatting

[ROCm/composable_kernel commit: ffb52783d0]
2025-06-04 11:46:28 +03:00
Khushbu Agarwal
c395db8926 [CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262)
* debugging issue

* debugging issue

* debugging

* debugging

* reverting debugging code

* clang formatted

* updating default_config.json

* fix ci failure

* clang formatted

[ROCm/composable_kernel commit: 59a85cb4bc]
2025-06-03 20:16:10 -07:00
Illia Silin
3eaca9f232 Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)" (#2260)
This reverts commit cdec424edde58fe081fb7f63fa8e247f3975b8b7.

[ROCm/composable_kernel commit: 4e561af18c]
2025-05-29 16:22:16 -07:00