Commit Graph

2034 Commits

Author SHA1 Message Date
linqunAMD
511f170dab [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp defintion to a separate struct

* fix clang format

* fix clang format

* keep default bhavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 37e1a27537]
2025-06-25 01:07:45 -07:00
Po Yen Chen
b86c92c84e [CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403)
* Rename batch_prerfill interface

* Add min_seqlen_q parameter in MakeKargs()

[ROCm/composable_kernel commit: 50fad03524]
2025-06-25 15:19:21 +08:00
Xiao Li
b3b4aa8d57 Fix amd_ck_fp8.hpp macro definitions (#2325)
* Fix amd_ck_fp8.hpp macro definitions

1. Define CK_USE_FNUZ_FP8 and CK_USE_OCP_FP8 definitions only if they were not defined before.
2. Prefix __assert_fnuz_support and __assert_ocp_support with namespace
   fp8_impl to avoid redefined error when building with rocm 6.4+
   (rocm/6.4.0/include/hip/amd_detail/amd_hip_fp8.h)


Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

[ROCm/composable_kernel commit: bac51b6ec0]
2025-06-24 22:46:15 -06:00
Yi DING
820ba182a0 Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393)
[ROCm/composable_kernel commit: c5d9181e1b]
2025-06-24 16:35:54 -07:00
JiaLuo-CAN
d07dd533b3 add a mx_fp8 client example (#2380)
* add a mx_fp8 client example

* remove verify code and fix date

* remove verify code and fix date, type

---------

Co-authored-by: root <root@bg-1w300-e1-2a.mkm.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

[ROCm/composable_kernel commit: 778ac24376]
2025-06-24 12:13:18 -04:00
Anton Gorenko
e156b5aebb Improve fmha_bwd tests performance (#2376)
* Avoid passing indices (std::vector) by value to host tensor's operator()

Each access requires 2 allocations and copies of the vector.

* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification

* Compute ds_hp_host_ref in parallel

This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.

* Do not use ForEach for simple copy and conversion of large tensors

These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.

[ROCm/composable_kernel commit: 77123600ee]
2025-06-24 07:45:24 -07:00
JonathanLichtnerAMD
1c8b1cee57 Do not build "other" library for MIOpen (#2382)
MIOpen only needs the static CK library for convolutions.

[ROCm/composable_kernel commit: 87fdb368a7]
2025-06-24 07:32:16 -07:00
JonathanLichtnerAMD
79c30fbb3b Fix build error when building with MIOPEN_REQ_LIBS_ONLY=ON (#2383)
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>

[ROCm/composable_kernel commit: 42e246e90f]
2025-06-24 07:30:42 -07:00
Kiefer van Teutem
eb4b7c65ff Implement batched gemm wmma (RDNA batched gemm) based on wmma cshuffle v3 (#2319)
* Some prep work for adding batched_gemm_wmma_universal. Moved batched_gemm in general to gfx11 and gfx12 categories, and split existing batched_gemm test into xdl and wmma versions. Updated profiler and instance factory. For now only adding f16-row-row-row-GemmDefault. For now actual device instance list is empty.

* Add DeviceBatchedGemm_Wmma_CShuffleV3 based on DeviceGemm_Wmma_CShuffleV3 and make sure it's used in the instance factory and tests. Currently the new batched device level struct cannot actually handle batching, but it does pass tests with a trivial batch size of 1, meaning that the overall structure is good.

* Add custom kernel and Argument type to DeviceBatchedGemm_Wmma_CShuffleV3. Batching arguments not passed to kernel yet.

* Implement kernel-level batching logic for DeviceBatchedGemm_Wmma_CShuffleV3.  In principle the whole thing works now, just need to add other data types and perhaps do some cleanup.

* Add other layouts for batched gemm wmma chufflev3 f16 f16 f16. Now matching XDL (for f16).

* Add bf16 bf16 bf16 support for batched gemm wmma cshuffle v3 for all layouts.

* Fixup comments and TODOs

* Expand test cases for batched gemm wmma cshuffle v3 with more unusual shapes. Some of the original test cases for batched gemm do not work based on cshuffle v3 because the dimensions are too small.

* Fix argument order for calls to profile_batched_gemm_impl() ONLY in wmma tests.

* Take batching into account when using rotating memory or clearing the C tensor.

* Implement small refactors / comments etc. from review.

* Port recent gemm wmma updates to batched gemm wmma: V1 pipeline, non-main-k-block-loop, check compute type, packed buffer size calc. Ported new instance lists.

* Add MNKPadding instances to batched gemm wmma cshuffle v3, remove incompatible test problems.

* Put clearing the C matrix in a pre-process lambda for the non-flush case + small fixups.

* Once again switch order of strides and batch strides in calls to profile_batched_gemm_impl() from test_batched_gemm_wmma to match latest definition of that function.

---------

Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: 9e74ae7c89]
2025-06-24 07:28:13 -07:00
lalala-sh
b6c780fc7f fix moe i4 bug from aiter (#2339)
[ROCm/composable_kernel commit: bb571a0330]
2025-06-24 14:51:29 +08:00
Yi DING
9f0d3497c3 [CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114)
* 160+192

* Add splitkv d160

* cleanup

* fix

* Add change log

* Fix CHANGELOG

* Use static_cast

* Update ignored instance

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: b8212864cf]
2025-06-24 01:33:31 +08:00
Rostyslav Geyyer
de3cfbab9a Add accelerated stochastic rounding on gfx950 (#2355)
* Add native prand generation support for gfx950

* Update seed calculation

[ROCm/composable_kernel commit: dbfe70e72a]
2025-06-23 09:31:46 -05:00
John Shumway
7c57c4f045 Shard several of the most costly targets. (#2373)
* Shard several of the most costly targets.

Introduces a filter_tuple_by_modulo to break up tuples.

Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.

time ninja -j 64 device_grouped_conv3d_fwd_instance

* fix clang format

* Fix build errors in instantiation code.

I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:

ninja -j 128 test_grouped_convnd_fwd

This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.

I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.

* Fix link errors from mismatched declarations.

Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Shard the longest 2D convolution builds

Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.

* Use PROJECT_SOURCE_DIR for submodule compatibility

I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules.  Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Remove accidental copy of a file

* Remove accidental copies of template files.

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 47ae4b0955]
2025-06-23 07:24:36 -07:00
Linjun-AMD
17346f2c91 update the way to compute fmha fwd tflop, include mask type (#2386)
* update the way to compute fwd tflop, include mask type

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* remove unneccessary comment

* add necessary comment

* remove some comment

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu>

[ROCm/composable_kernel commit: 61eb622e85]
2025-06-23 15:53:58 +08:00
Po Yen Chen
7001322416 [CK_TILE] Fix compilation errors introduced in #2320, #2219 and #2214 (#2388)
* Fix compilation errors

* Fix more ck_tile example compilation errors

[ROCm/composable_kernel commit: 7d669440a6]
2025-06-23 12:29:15 +08:00
Max Podkorytov
0bb4daa71b Update for xformers (#2372)
* update api

* update kernel api

* clang-format

[ROCm/composable_kernel commit: 0366fb2abc]
2025-06-22 00:28:30 -07:00
Bartłomiej Kocot
29cfe38b42 [CK TILE] Grouped Convolution Forward Kernel (#2188)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* rebase fixes

* fixes

* fixes

[ROCm/composable_kernel commit: cebdee4d9e]
2025-06-20 15:44:36 -07:00
Illia Silin
bc61ff620d update code owners list (#2381)
[ROCm/composable_kernel commit: 7378a51b4c]
2025-06-20 14:03:20 -07:00
Thomas Ning
5c2009c852 fix the mi350 error (#2378)
[ROCm/composable_kernel commit: df6023e305]
2025-06-20 12:50:13 -07:00
Illia Silin
3d10c98abe Introduce dependency-based CI test selection. (#2377)
* Selective test filter initial commit.

* Expanded folder paths for parsing ninja dependencies.

* Fixing default branch name in the test evaluation script.

* Fixing paths for robustness and adding ctest command to the launch script.

* change jenkins file and few tests to upgrade CI

* Setting ninja build path.

* Fixing typo in Jenkinsfile, and wrong paths.

* Fixing typo in launch script.

* add few more tests to check CI logic

* Fixing header for shell script.

* turn off performance test by default, add option to run all unit tests

* revert dummy changes in source code to trigger tests

* make sure develop branch runs all unit tests

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>

[ROCm/composable_kernel commit: c3c8c6a10f]
2025-06-20 12:48:00 -07:00
Thomas Ning
3414888f92 Transpose builtin macro defense (#2374)
* add the macro defense

* add the static assert check

[ROCm/composable_kernel commit: 107e3623c7]
2025-06-20 11:24:54 -07:00
Bartłomiej Kocot
9e27236fb7 Grouped conv bias clamp fp32/fp16 support (#2366)
[ROCm/composable_kernel commit: 663992e99b]
2025-06-20 11:41:04 +02:00
Max Podkorytov
7c10189a27 Reland fix default epilogue (#2367)
* Revert "Revert "Fix default epilogue  (#2358)" (#2364)"

This reverts commit f85c70b31e.

* add operator() with old signature

[ROCm/composable_kernel commit: 11eb9f1c77]
2025-06-19 10:39:30 -07:00
dependabot[bot]
83fa5f32c6 Bump sphinxcontrib-bibtex from 2.6.3 to 2.6.4 in /docs/sphinx (#2365)
Bumps [sphinxcontrib-bibtex](https://github.com/mcmtroffaes/sphinxcontrib-bibtex) from 2.6.3 to 2.6.4.
- [Changelog](https://github.com/mcmtroffaes/sphinxcontrib-bibtex/blob/develop/CHANGELOG.rst)
- [Commits](https://github.com/mcmtroffaes/sphinxcontrib-bibtex/compare/2.6.3...2.6.4)

---
updated-dependencies:
- dependency-name: sphinxcontrib-bibtex
  dependency-version: 2.6.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: c8b247c55c]
2025-06-18 08:15:59 -07:00
Muhammed Emin Ozturk
9c035fb203 Stream-K CkProfiler Update ( Replace CPU Validation with GPU Validation and Add Dynamic Grid Size Calculation for Stream-K GEMM Profiler) (#2333)
* Stream-K Ckprofiler Update

* new grid list based on sm number

* clang

* update for review

* Update profile_gemm_universal_streamk.cpp

---------

Co-authored-by: root <root@ctr-ubbsmc16.amd.com>

[ROCm/composable_kernel commit: bfb33bc1e9]
2025-06-18 07:49:22 -07:00
joyeamd
3cb0dd8506 transpose load api development (#2177)
* add transpose load; no real logic

* fix some compile errors

* fix some issues

* update transpose load logic

* add some fixes

* fix a distribution issue

* update some codes

* add some fix

* can pass; but no logic

* transpose load enable

* update tile transpose

* miss output tile distribution mapping

* hack for transpose 16x16

* update output tensor distribution

* delete unused variables

* fix transpose related codes

* update transpose load example

* exchange the iteration order

* fix 16x16 related dimension transpose

* fix a transpose index issue

* fix a transpose index issue

* fix clang format check

* update load tile transpose related codes

* fix compile errors and pass 16x16 tests

* fix a typo

* update logic

* check other data types

* add transpose load api

* update transpose load api

* fix clang format check

* change file name

* refactor codes

* update code name

* delete some unused codes

* delete the unused oob flag for transpose load

* update tensor view api for transpose load

* update for testing

* fix a typo error

* move transpose ops to example directory

* update transpose api

* update include file

* fix for pr review

* fix compile errors

* add transpose load; no real logic

* fix some compile errors

* fix some issues

* update transpose load logic

* add some fixes

* fix a distribution issue

* update some codes

* add some fix

* can pass; but no logic

* transpose load enable

* update tile transpose

* miss output tile distribution mapping

* hack for transpose 16x16

* update output tensor distribution

* delete unused variables

* fix transpose related codes

* update transpose load example

* exchange the iteration order

* fix 16x16 related dimension transpose

* fix a transpose index issue

* fix a transpose index issue

* fix clang format check

* update load tile transpose related codes

* fix compile errors and pass 16x16 tests

* fix a typo

* update logic

* check other data types

* add transpose load api

* update transpose load api

* fix clang format check

* change file name

* refactor codes

* update code name

* delete some unused codes

* delete the unused oob flag for transpose load

* update tensor view api for transpose load

* update for testing

* fix a typo error

* move transpose ops to example directory

* update transpose api

* update include file

* fix for pr review

* fix compile errors

* change directory name

* delete the duplicated directory

* update cmakelists file

* delete the unused codes

* update function names

* update transpose policy

* update code after remod.py

* update codes

* add some comment

* Polish the instr infrastructure

* build up the fixed instr

* redesign the transpose api, currently it has numerical error

* add the bf16 transpose

* fix some issues

* add some comments

* update document

* Finished the refactor of API and pass through the verification

* fix the merging issue

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: a2f01141aa]
2025-06-18 01:28:34 -07:00
Thomas Ning
f85c70b31e Revert "Fix default epilogue (#2358)" (#2364)
This reverts commit b29e3830a6.

[ROCm/composable_kernel commit: 64a2fda713]
2025-06-17 22:43:05 -07:00
linqunAMD
cd0bf60645 [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd (#2243)
* [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd

* fix error with the latest develop code.

[ROCm/composable_kernel commit: 7aeec9a901]
2025-06-17 21:37:59 -07:00
carlushuang
f540c6ccb4 [CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335)
* support local_token for hipgraph

* update README

* fix comment

* fix fmoe example

[ROCm/composable_kernel commit: a4e1248dba]
2025-06-18 10:49:43 +08:00
Kiefer van Teutem
609cb2c3ad Fix argument order for calls to profile_batched_gemm_impl() (#2277)
* Fix argument order for calls to profile_batched_gemm_impl()

* Revert previous and swap the order of the profile_batched_gemm_impl() function arguments instead.

* Revert copyright years for unchanged files.

* Remove test_batched_gemm from REGRESSION_TESTS since it no longer takes more than 30 seconds to run.

---------

Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: c7c6a0ccb3]
2025-06-17 19:29:09 -07:00
Max Podkorytov
b29e3830a6 Fix default epilogue (#2358)
* [ck-tile] fix default epilogue in gemm universal

* argument validation needs vector size D

* operator() needs to specify dram windows

* copy/paste from cshuffle epilogue

* clang-format

* mark unused argument

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: cd606f72c1]
2025-06-17 17:30:21 -07:00
linqunAMD
af00674037 [CK_TILE] Support multi-config in tile_example_gemm_universal (#2240)
* [CK_TILE] Support multi-config in tile_example_gemm_universal

Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when use you need compare gemm perf with different tile/pipeline config
- we also can use it simplify the code for wmma support in the furture.

* [CK_TILE] Support multi-config in tile_example_gemm_universal

Address review comments

* rebase code and fix clang format.

* fix clang format

* support pipeline v5.

* fix merge conflict

* address review comment

* add missing file

* address review comment v2

* fix build error

[ROCm/composable_kernel commit: 0eb8974502]
2025-06-17 17:27:46 -07:00
John Afaganis
3ef7712ee3 Add missing copyright headers (#2359)
* Add missing copyright headers

* empty commit

[ROCm/composable_kernel commit: df54667102]
2025-06-17 14:29:45 -07:00
Illia Silin
073bb8d588 Revert "Shard several of the most costly targets. (#2266)" (#2361)
This reverts commit c1285aaada.

[ROCm/composable_kernel commit: cdfd7722bf]
2025-06-17 13:56:30 -07:00
Bartłomiej Kocot
d9316dfbeb Fix Add in dynamic buffer for fp32/i8 (#2351)
* Fix Add in dynamic buffer for fp32/i8

* fixes

* Fix

[ROCm/composable_kernel commit: cc98a41f46]
2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi
bde406245a Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 4c57157d50]
2025-06-17 11:54:30 -07:00
Aviral Goel
66afddf431 add script to pre commit hooks for checking file permissions (#2322)
[ROCm/composable_kernel commit: 3af66e99ab]
2025-06-17 07:07:08 -07:00
Thomas Ning
bc6af0fa49 Fix the CK Tile related operators (#2356)
* fix the flatmm

* Fix the pipeline

* address the comment

[ROCm/composable_kernel commit: 3c4cdfac4f]
2025-06-16 17:38:52 -07:00
rahjain-amd
456719c9bc Add cmake flag to enable Assembly dump (#2347)
This flag makes it easy to dump assembly for the example kernels.

[ROCm/composable_kernel commit: 6589f50bc9]
2025-06-16 09:29:35 -07:00
Illia Silin
0f4d68633b Revert "fix the flatmm (#2349)" (#2352)
This reverts commit fc65195605.

[ROCm/composable_kernel commit: 5523df4b2d]
2025-06-16 07:54:55 -07:00
Bartłomiej Kocot
4ae33b454f Grouped convolution forward with clamp (#2334)
* Grouped convolution forward with clamp

* Optimize clamp

* unary fixes

* test gk bias

* Revert "test gk bias"

This reverts commit 8e42e29d7b.

* Revert "Revert "test gk bias""

This reverts commit e73c0550ce.

* workaround comment

[ROCm/composable_kernel commit: f6c2ff9dce]
2025-06-16 15:36:53 +02:00
Thomas Ning
fc65195605 fix the flatmm (#2349)
[ROCm/composable_kernel commit: d996bc78be]
2025-06-16 02:17:53 -07:00
ruanjm
1fdac8b8fe Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332)
* Add support for specifying valid flag when fetching elements for tile_scatter_gather

Add constexpr for operator[] of TrueGenerator

* Use different path when valid is enabled

[ROCm/composable_kernel commit: b34c234f51]
2025-06-16 17:17:03 +08:00
carlushuang
370dd01230 hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348)
[ROCm/composable_kernel commit: fb97f75099]
2025-06-15 22:49:04 -07:00
Illia Silin
7eaa398458 Fix direct lds load for gfx950 and clang20 (#2346)
* fix direct lds load for gfx950 and clang20

* Update include/ck/utility/amd_buffer_addressing_builtins.hpp

* Fix format

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>

[ROCm/composable_kernel commit: 2d8a804152]
2025-06-15 15:22:34 -07:00
Illia Silin
d15d80a68b Limit the threads to builf ck_tile engine, use ninja. (#2342)
* limit the threads to builf ck_tile engine, use ninja

* disable ck_tile engine until it can be built safely

[ROCm/composable_kernel commit: 56f654a826]
2025-06-13 14:13:07 -07:00
Illia Silin
6c4973f734 check for if misched-bottomup flag is valid (#2341)
[ROCm/composable_kernel commit: a0f4db8d9c]
2025-06-13 13:34:22 -07:00
Mateusz Ozga
6b3ddd0e23 [CK_TILE] Multiple-D GEMM example (#2219)
* Multiple d, initial commit

* Check Ds Layout

* Readme and clang format

* Update branch & conflicts

* Multiple D - fix clang-formatter

* Rename elemetwise_op

* Fix CI

* Code review part1

* Remove printf

* Remove unnecessary comment

* Add new tests with Col layout

* Review part 2

* Added support for Multiple D GEMM

* Update comment

* Remove maybe_unused

* Clang-format

* Review part 3

* Add comment to function

* Add comment to function: another

* Take number of params for a refrence function

* Remove additional d param for 0 tensor

* Change name of function

* Fix CI fails

[ROCm/composable_kernel commit: bd96ac9742]
2025-06-13 19:39:11 +02:00
John Shumway
c1285aaada Shard several of the most costly targets. (#2266)
* Shard several of the most costly targets.

Introduces a filter_tuple_by_modulo to break up tuples.

Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.

time ninja -j 64 device_grouped_conv3d_fwd_instance

* fix clang format

* Fix build errors in instantiation code.

I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:

ninja -j 128 test_grouped_convnd_fwd

This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments on I tried on the code.
This was hard to find earlier because this PR is really too large.

I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.

* Fix link errors from mismatched declarations.

Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.

* Migrate the design to a code-generation approach.

Use a CMake function with template files to generate the source files for the
intantiating the kerenels and to generate the calling function.

* Shard the longest 2D convolution builds

Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.

* Use PROJECT_SOURCE_DIR for submodule compatibility

I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules.  Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 3a0cb27966]
2025-06-13 03:58:50 -07:00
kylasa
afbc0625f4 Code drop for 2 warp ping pong scheduler along K dimension. (#2276)
* Code drop for 2 warp ping pong scheduler along K dimension.

* Addressing code review comments.

* Addressing Clang formatting issues.

* Addressing build issues.

* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.

* Fix for LDS memory size for GEMM pipelines.

* Addressing code review feedback comments.

* Change log update.

* Addressing code review comments and build issues.

* Added new policy for pipeline specific logic about LDS needs.

* Clang Fix during build.

[ROCm/composable_kernel commit: 5f1ad09b61]
2025-06-12 18:24:02 -07:00