Commit Graph

2493 Commits

Author SHA1 Message Date
Johannes Graner
671f2686c0 Fix race conditions in ck_tile remod (#3061)
[ROCm/composable_kernel commit: 4043401db1]
2025-10-21 09:35:04 +02:00
Max Podkorytov
7d1d0565d9 refine
[ROCm/composable_kernel commit: ff6efa2fb1]
2025-10-20 23:13:58 -04:00
Max Podkorytov
6200ea9dfc update build instructions
[ROCm/composable_kernel commit: b9e966e574]
2025-10-20 23:13:58 -04:00
Yi DING
698810c92f [CK_TILE] Add fmt: skip to FMHA codegen scripts for readability (#3057)
* fmt: skip for fmha_bwd.py

* more fmt: skip

* thank you, copilot

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: e20923f384]
2025-10-21 10:15:04 +08:00
Max Podkorytov
1d7e4157c5 [CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042)
fix transpose_vectors logic for 2x2 8-bit tiles

    add a test which goes through this code path.

    factor out constexpr'd cases into smaller functions.

    add inline docs about the data movement

    impact: gemms with 8-bit non-rcr inputs on gfx942


[ROCm/composable_kernel commit: 2570462ecf]
2025-10-20 13:40:44 -07:00
Thrupti Raj Lakshmana Gowda
61dbfdb27b [CK TILE ENGINE] Code changes to finding GPU id from TARGET (#3055)
* Reading gpuname from target for gemm in ck tile engine

* Reading gpuname from target for gemm preshuffle in ck tile engine

* Get GPU changes for GEMM Multi D in TILE ENGINE

* Addressing errors for gpu name in cktileengine

[ROCm/composable_kernel commit: 9f77061094]
2025-10-20 09:02:18 -07:00
John Shumway
5891e2ae79 [CK_BUILDER] Add experimental builder directory and configuration for composable_kernel (#3043)
Add experimental builder infrastructure for composable_kernel

- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.

This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.


[ROCm/composable_kernel commit: f18b79f328]
2025-10-20 07:54:09 -07:00
Gino Lu
b7e5da5e83 [CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044)
* Patch for pk_fp4_raw_t buffer load and ref check

[ROCm/composable_kernel commit: fb1d090f3c]
2025-10-20 14:47:04 +08:00
BrianHarrisonAMD
0fb13588bb Add dvc pull step (#3056)
* Add dvc pull step

* Remove CD

* Add details about LOGNAME and fail if dvc isn't installed

[ROCm/composable_kernel commit: af3786fe08]
2025-10-19 16:09:21 -07:00
Illia Silin
525ea9dd3f disable aiter test gemm_a8w8_blockscale (#3049)
[ROCm/composable_kernel commit: d88ea05c84]
2025-10-17 19:52:22 -07:00
AviralGoelAMD
48b0e60e14 docs: add inline comments about flush_cache and rotating buffer
[ROCm/composable_kernel commit: b03764ca5a]
2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal
c5eda13381 fix identity values in Max and AbsMax (#3048)
- The identity value method returned the minimum positive number while
we need the lowest number for Max and AbsMax operations

[ROCm/composable_kernel commit: 889ffc0b1d]
2025-10-17 09:49:21 -07:00
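The distinction behind the Max/AbsMax fix above (c5eda13381) can be illustrated with the C++ standard library. This is a minimal sketch, not CK's code: `reduce_max`, `wrong_identity`, and `right_identity` are hypothetical names introduced here, and only the Max case is shown. For floating-point types, `std::numeric_limits<T>::min()` is the smallest positive normal value, while `lowest()` is the most negative finite value, so only the latter leaves a max-reduction over negative inputs unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical helper (not CK's API): max-reduction seeded with an identity value.
inline float reduce_max(const std::vector<float>& xs, float identity)
{
    float acc = identity;
    for(float x : xs)
        acc = std::max(acc, x);
    return acc;
}

// Smallest positive normal float: a poor seed, because it beats every negative input.
inline float wrong_identity() { return std::numeric_limits<float>::min(); }

// Most negative finite float: any finite input is >= this value.
inline float right_identity() { return std::numeric_limits<float>::lowest(); }

// Small self-check showing the difference on an all-negative input.
inline void check_identities()
{
    const std::vector<float> xs{-3.0f, -1.5f, -7.0f};
    assert(reduce_max(xs, right_identity()) == -1.5f);
    assert(reduce_max(xs, wrong_identity()) > 0.0f); // the seed leaks into the result
}
```

With an all-negative input, seeding with `min()` returns the seed itself instead of the true maximum, which matches the symptom the commit message describes.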
Emily Martins
6157673c39 Fix CK Tile Stream-K BF16 Validation Errors (#3039)
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.

This change uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.

[ROCm/composable_kernel commit: 352dee5225]
2025-10-17 09:33:38 -07:00
Johannes Graner
f7ffb12123 Pre-commit in CI (#3029)
* Pre-commit in CI

* Specify python version, and install dos2unix for remod

* Refactor remod hook to correctly install dependencies

* Run pre-commit

[ROCm/composable_kernel commit: 8a4cd32d86]
2025-10-17 09:28:38 -07:00
Ville Pietilä
71ecd257a4 Fixed handling of split-K autodeduce argument for grouped convolution (#3024)
* Fix handling of split-K autodeduce argument.

* Fix clang formatting.

* Test fix.

* Fix clang formatting.

[ROCm/composable_kernel commit: 7e44b845b5]
2025-10-17 15:36:39 +03:00
Johannes Graner
8af66c65d0 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: d40b50b9d5]
2025-10-16 15:29:17 -07:00
Enrico Degregori
1d9320c8f3 Wave Tile Transfer supporting global load with transpose (#3027)
* Initial implementation:

 - add new thread group transfer supporting transpose instruction
 - refactor AB transfer to switch between thread and wave tiles methods

* Add some comments and remove explicit wave and lane calculations

* Remove compiler option for performance

* fp16 example: use tuned instance

* Missing cleanup

* Integrate wave transfer in existing gemm and batched gemm instances

* Add fast instances

* extend implementation for 8 bit datatypes

packed types not supported

* Address review comments

* Optimize pipeline v1 and re-introduce compiler option

* Disable wave tile approach for b scale gemm

* Fix for clang20

* Avoid code duplication of amd_global_load_transpose_to_vgpr function

[ROCm/composable_kernel commit: 440358c168]
2025-10-16 11:33:56 -07:00
kabrahamAMD
06d76b160e implement device batched gemm b scale for wmma (#2825)
* rebased on top of develop

* fixed missing shuffling and wrong indexing

* added tests for batched_b_scale

* added missing files

* fixed wrong stride computation and removed k batching (for now) due to precision issues

* reinstated k-batching with PRNG constrained to -1..1

* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow

* added k-batching to reference and increased tolerances for test

* changed gemm_b_scale and gemm_universal tests to use correct parameters

* addressed review comments

* ported fixes back to non-batched version of b_scale

* addressed review comments

* run clang-format on older commits

* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior

* added newline at end of file

* reflected changes from multi-abd branch in batched b_scale

* fixed gfx11 issue

* changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed)

* run clang format

* set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested.

* reduced range for pk_i4 even further to 0..0

* removed failing xld instances. The failures were uncovered once the tests were fixed

* removed generation of int4 values entirely

* divide B buffer by BPackedSize

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

[ROCm/composable_kernel commit: c4b2da9cbd]
2025-10-16 11:00:42 -07:00
Geo Min
62afd9eb14 [TheRock CI] Updating SHA for build image and TheRock SHA (#3033)
* Updating SHA for build image

* Adding test exclusions

[ROCm/composable_kernel commit: d7278cc664]
2025-10-16 08:13:10 -07:00
Emily Martins
95805e674a Style updates and cleanup
The following changes were made:
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused includes


[ROCm/composable_kernel commit: cb83d52301]
2025-10-16 08:47:06 -06:00
Astha
47e002fc27 Addition of the derived structs for the new Stream-K TilePartitioner
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are persistent. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.

The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.

Unit tests for the derived structs are included.


[ROCm/composable_kernel commit: 8f75d7cea6]
2025-10-16 08:47:06 -06:00
Emily Martins
48ebcbe898 Stream-K Tile Partitioner Base Class with Tests
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.

This change also includes unit tests for the base tile partitioner.


[ROCm/composable_kernel commit: f87f768d16]
2025-10-16 08:47:06 -06:00
Illia Silin
b4edf184a7 Revert "Enable storelse for fmha_fwd_trload kernel (#3023)" (#3037)
This reverts commit 8925d9b1a3.

[ROCm/composable_kernel commit: 2d1c9e28e2]
2025-10-16 07:19:34 -07:00
Max Podkorytov
79f0324752 re-enable batched transpose test on gfx942 (#3035)
* re-enable batched transpose test on gfx942

* try also enabling on gfx11/12

* roll back to original 'gfx9'

[ROCm/composable_kernel commit: e980d4351c]
2025-10-16 03:12:15 -07:00
Vidyasagar Ananthan
b26a7f9b84 [DOCS] Documentation Addition (Readme updates) (#2495)
* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 92c67a824f]
2025-10-16 03:10:57 -07:00
Haocong WANG
8925d9b1a3 Enable storelse for fmha_fwd_trload kernel (#3023)
[ROCm/composable_kernel commit: 013ba3c737]
2025-10-16 13:51:23 +08:00
Emily Martins
8ac10a4aca Fix compiler noreturn error for ck tile permute test (#3036)
[ROCm/composable_kernel commit: 0dbd173500]
2025-10-15 19:42:02 -07:00
Aviral Goel
18dd814ab3 docs: add quant mode comparison to readme (#3032)
* docs: add quant mode comparison to readme

* Update example/ck_tile/38_block_scale_gemm/README.md

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

[ROCm/composable_kernel commit: 232523d9fa]
2025-10-15 18:35:06 -07:00
Illia Silin
2daa8b642a use branch develop to test hipTensor (#3034)
[ROCm/composable_kernel commit: 87d0a3ac17]
2025-10-15 15:40:34 -07:00
Illia Silin
0b8e15dc2e re-enable clang-format by default (#3030)
* re-enable clang-format by default

* fix clang format

[ROCm/composable_kernel commit: 3348f01e6f]
2025-10-15 07:43:11 -07:00
Christopher Millette
d540b5244c Disable streamk extended regression tests for now (#3016)
[ROCm/composable_kernel commit: bde5f26db3]
2025-10-15 09:05:47 -05:00
felix
7b584fd2d2 Felix/opt sorting (#2902)
* merge felix/sorting
* opt moe sorting (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 4c826abfff]
2025-10-15 09:24:03 +08:00
AviralGoelAMD
c1670a80db test(grouped_gemm_multi_d): add unit test for bf16 support
[ROCm/composable_kernel commit: ca1ab083a7]
2025-10-14 18:00:43 -04:00
AviralGoelAMD
e8653f314d feat(grouped_gemm_multi_d): add support for bf16
[ROCm/composable_kernel commit: 8d8b49dec2]
2025-10-14 18:00:43 -04:00
Geo Min
50aee57dea fixing group id (#3002)
[ROCm/composable_kernel commit: 706c2b281c]
2025-10-14 08:51:52 -07:00
joyeamd
2592957760 update s_barrier's logic in gfx12 architecture (#3003)
change s_waitcnt's logic in gfx1250

update comment

[ROCm/composable_kernel commit: b9d74e7746]
2025-10-14 08:49:34 -07:00
Illia Silin
fc0f6be56d Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit 985b9f98a2.

[ROCm/composable_kernel commit: e4298e55c7]
2025-10-14 08:43:14 -07:00
jakpiase
4643cdd962 [CK_TILE] Switch into universal gemms for conv bwds (#2981)
* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments

[ROCm/composable_kernel commit: 6deaaa92cc]
2025-10-14 16:09:16 +02:00
msaffari-amd
2ad7ba80c9 Fix: Handle JSON boolean values (pad_m, pad_n, pad_k and persistent) in gemm_instance_builder (#3008)
[ROCm/composable_kernel commit: 589e242eda]
2025-10-14 13:20:25 +02:00
ClementLinCF
6a423df526 [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formatting issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e1b0bdfbfa]
2025-10-13 11:52:37 -07:00
John Shumway
784f68e831 Enable GMock and improve gtest configuration (#2976)
Our current cmake/gtest.cmake file does not enable gmock. Gmock is needed for matchers that are needed for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake:

* Enable BUILD_GMOCK by default (was previously disabled)
* Patch gtest-src/googlemock/CMakeLists.txt for broken include path.
* Add configuration to gmock if the target is used.

No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code.

[ROCm/composable_kernel commit: fc2a121c44]
2025-10-13 08:11:51 -07:00
Sami Remes
985b9f98a2 [CK_TILE] Non-K Major from old CK to CK-Tile (#2442)
* Enable the adapted LDS B layout for Row-Major

* fix formatting

* Implement specialized col-major A LDS block descriptor

* Fix formatting

* Use VecLoadSize for AK1/BK1

* Fix some thread access pattern values

* Use GetVectorSizeA for A

* Fix formatting

* Add extra condition to avoid division by zero

* disable layout for wave32

* remove extra else

* fix formatting

* Fix formatting

* Rename one remaining TileDistributionEncodingPattern2D

* Use integer ceil division

* revert remod.py changes

* also revert utility.hpp

* use getA/BTileAccessPattern everywhere

* use integer_divide_ceil for AK0 too

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: d2bbca3eca]
2025-10-13 14:27:02 +02:00
aledudek
ab7f67488c [CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955)
* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: 634634f5c0]
2025-10-13 13:57:37 +02:00
aledudek
e42f27e42e [CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860)
* Add valid check batched gemm part1

* [CK_TILE] Add batched gemm kernel IsSupported func checks

* revert broken pre-commit hook changes

* revert broken pre-commit hook changes v2

* Clarify error messages

[ROCm/composable_kernel commit: 3021604213]
2025-10-13 13:55:23 +02:00
damien-lejeune
b904c41e44 Update include path to break the remod's cyclic dep issue (#2978)
* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 46c10c316d]
2025-10-13 13:24:47 +02:00
msaffari-amd
b9f7381f95 [CK Tile] contraction multi d - kernel & example (#2901)
* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for multi dimensional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for calculating the offsets from flatten batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: e9f0cc83a8]
2025-10-13 12:30:28 +02:00
Yi DING
2ff24ef58c [CK_TILE] FMHA BWD Add Instance for D48 on GFX950 (#2866)
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 95bdc7410c]
2025-10-13 15:03:46 +08:00
Christopher Millette
7df03bedec Streamk functional tests (#2974)
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite

* Account for stride when computing K offsets for A and B tensor

This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correctly regardless of
whether A and B are row or column major.

* Move helper code to test_gemm_streamk_util.hpp

* Separate tests into smoke/regression/extended. Add bf16 datatype

* Run clang-format

* Refactor combinatorial macro expansion and naming

* Adjust the initialization values to account for better tolerance on bf16

* Correct BF16 datatypes in comments

* Move the extended tests under the REGRESSION_TESTS label

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: f5708882a3]
2025-10-11 07:53:40 -05:00
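The stride handling described in the Stream-K functional tests commit above (7df03bedec) can be sketched as follows. This is an illustrative assumption about the general idea, not the kernel's actual code: `Layout`, `a_k_offset`, and `b_k_offset` are hypothetical names, A is taken to be M x K and B to be K x N, and the stride is the leading dimension in elements. Stepping `k` elements along K moves by `k` in memory only when K is the contiguous dimension; otherwise the step must be scaled by the stride.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout tag (not CK's tensor_layout types).
enum class Layout
{
    RowMajor,
    ColumnMajor
};

// A is M x K. Row-major: element (m, k) sits at m * stride + k, so K is contiguous.
// Column-major: element (m, k) sits at k * stride + m, so each K step skips a column.
inline std::size_t a_k_offset(std::size_t k, std::size_t stride_a, Layout layout)
{
    return layout == Layout::RowMajor ? k : k * stride_a;
}

// B is K x N. Row-major: element (k, n) sits at k * stride + n, so each K step skips a row.
// Column-major: element (k, n) sits at n * stride + k, so K is contiguous.
inline std::size_t b_k_offset(std::size_t k, std::size_t stride_b, Layout layout)
{
    return layout == Layout::RowMajor ? k * stride_b : k;
}

// Small self-check: the same K position maps to different offsets per layout.
inline void check_k_offsets()
{
    assert(a_k_offset(32, 128, Layout::RowMajor) == 32);
    assert(a_k_offset(32, 128, Layout::ColumnMajor) == 32 * 128);
}
```

Ignoring the stride gives the right answer for only one of the two layouts per operand, which is consistent with the commit's note about row- and column-major inputs.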
John Shumway
7a3d9b8a40 Fix GCC 7 CTAD compilation error in test_fmha_bwd.cpp (#3001)
Fixes compilation error on SLES15 with GCC 7 for gfx942 builds:

error: 'vector' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]

Changes:

- Explicitly specify template argument for `std::vector<mode_enum>` instead of relying on C++17 CTAD
- Maintains compatibility with both older (GCC 7) and newer compilers

[ROCm/composable_kernel commit: 0843815db7]
2025-10-10 19:13:34 -07:00
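The change described in the CTAD fix above (7a3d9b8a40) amounts to spelling out the element type instead of letting C++17 class template argument deduction infer it. A minimal sketch, assuming a stand-in enum: the `mode_enum` defined here and the `all_modes` helper are hypothetical and only mirror the name mentioned in the commit message.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the enum named in the commit message.
enum class mode_enum
{
    batch,
    group
};

inline std::vector<mode_enum> all_modes()
{
    // CTAD form, which can trigger -Wctad-maybe-unsupported under -Werror:
    //   std::vector modes{mode_enum::batch, mode_enum::group};
    // Explicit form, valid with or without CTAD support:
    return std::vector<mode_enum>{mode_enum::batch, mode_enum::group};
}

// Small self-check: both forms would produce the same two-element vector.
inline void check_all_modes()
{
    const auto modes = all_modes();
    assert(modes.size() == 2);
    assert(modes.front() == mode_enum::batch);
}
```

The explicit form costs a few extra characters but does not depend on deduction guides, so it compiles the same way on older and newer toolchains.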
Khushbu Agarwal
fdb397b2c9 supporting prefill shapes for preshuffle block scale gemm (#2975)
* debugging

* debugging for prefill shapes

* comment unused code

* fix for prefill shapes

* clearing up the code

* add int4 to universal gemm example

* clang formatted

* adding test for prefill shapes in block scale gemm

* lil improv on the block pipeline

* Address Review Comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 3c39d279ab]
2025-10-10 15:36:24 -07:00