Commit Graph

2494 Commits

Author SHA1 Message Date
Rostyslav Geyyer
1df6f6af8e Rearrange pointers to fix the reinterpret_cast issue (#3077)
[ROCm/composable_kernel commit: 6df69abeef]
2025-10-23 10:54:13 -07:00
Qianfeng
6ad906b040 [CK_TILE] Fix in set_slice_tile (#2232)
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: fbd101b1ac]
2025-10-23 10:34:02 -07:00
Michal Kulikowski
c37371e3ef [CK][Examples] Fixing stride issues in ck examples by workaround - Bypassing hostTensor validation.
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>


[ROCm/composable_kernel commit: b9789a0742]
2025-10-23 08:46:02 +02:00
Haocong WANG
895983c816 [CKTILE] FMHA fwd trload lse fix (#3046)
* enable storelse for fmha_fwd_trload kernel

* fix lse in trload

* fix the mask related bug

[ROCm/composable_kernel commit: 0d3860dfdb]
2025-10-23 09:33:33 +08:00
spolifroni-amd
7c14d97d0e updated the changelog with 7.1 and beyond info
[ROCm/composable_kernel commit: 1b95803431]
2025-10-22 13:35:45 -06:00
lalala-sh
0329d71fb9 [CK_TILE] Update flatmm related kernels (#3022)
---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: 211d64e18a]
2025-10-22 22:36:11 +08:00
Johannes Graner
a6c3252766 [CK_TILE] Conv bwd splitN support (#3047)
* Conv bwd splitN support

* Adjust splitting calculations to lengths format

* Prepare indexing for future splitK support

[ROCm/composable_kernel commit: cbd1279ae6]
2025-10-22 13:34:06 +02:00
MHYangAMD
f23b8cde7b Introduce tree reduction for BlockReduce2dCrossWarpSync (#2588)
* Introduce tree reduction for BlockReduce2dCrossWarpSync

* Rename original impl to BlockReduce2dLinearCrossWarpSync

* Replace warp_size with get_warp_size()

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 5a27a97391]
2025-10-22 14:41:35 +08:00
John Shumway
a488126d3e [CK_BUILDER] Add compile-time reflection for a convolution instance (#3065)
* [CK_BILDER] Add compile-time reflection for a convolution instance

Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template
pattern matching. In particular, the GetInstanceString method returns a string that uniquely idenitfies the kernel, by explicitly serializing all template paramter values.

This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template paramters.

The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface.

Key architectural decisions:

- Forward-declare device kernels in instance_traits.hpp to avoid  circular dependencies, since device implementation headers will  include the reflection headers

- Use compile-time constants and type aliases to expose kernel  parameters, enabling zero-overhead introspection

- Provide a templated instance_string() function that generates human-readable  kernel configuration strings by serializing all template parameters  in order, useful for debugging and kernel identification

- Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep  it opt-in until the API stabilizes

- Add GetInstanceString() virtual method to BaseOperator, allowing  runtime polymorphic access to compile-time kernel information

This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically.

Includes unit tests validating both the trait extraction accuracy and the string generation format.

[ROCm/composable_kernel commit: 37dff024c1]
2025-10-21 21:10:19 -07:00
Bartłomiej Kocot
ebd8495721 Gridwise gemm conv v3 force padded layout on gfx950 (#2961)
* Gridwise gemm conv v3 force padded layout on gfx950

* fix bug in other gridwise

* fix

* Update gridwise_gemm_wmma_cshuffle_v3_common.hpp

[ROCm/composable_kernel commit: 3a28632b20]
2025-10-21 15:41:02 +02:00
Yashvardhan Agarwal
12e9bcd7e2 fix identity value of AbsMax (#3058)
* fix identity value of AbsMax

- Identity value of AbsMax should be 0 not numeric<T>::lowest()

* Update include/ck_tile/core/utility/reduce_operator.hpp

resolved comment

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

[ROCm/composable_kernel commit: 35754d2ec8]
2025-10-21 14:42:08 +02:00
Johannes Graner
0be14218d4 Fix race conditions in ck_tile remod (#3061)
[ROCm/composable_kernel commit: 4043401db1]
2025-10-21 09:35:04 +02:00
Max Podkorytov
eecc99e83d refine
[ROCm/composable_kernel commit: ff6efa2fb1]
2025-10-20 23:13:58 -04:00
Max Podkorytov
983a221831 update build instructions
[ROCm/composable_kernel commit: b9e966e574]
2025-10-20 23:13:58 -04:00
Yi DING
0c61d0da8d [CK_TILE] Add fmt: skip to FMHA codegen scripts for readability (#3057)
* fmt: skip for fmha_bwd.py

* more fmt: skip

* thank you, copilot

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: e20923f384]
2025-10-21 10:15:04 +08:00
Max Podkorytov
df3f347a27 [CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042)
fix transpose_vectors logic for 2x2 8-bit tiles

    add a test which goes through this code path.

    factor out constexpr'd cases into smaller functions.

    add inline docs about the data movement

    impact: gemms with 8-bit non-rcr inputs on gfx942


[ROCm/composable_kernel commit: 2570462ecf]
2025-10-20 13:40:44 -07:00
Thrupti Raj Lakshmana Gowda
09acf06d06 [CK TILE ENGINE] Code changes to finding GPU id from TARGET (#3055)
* Reading gpuname from target for gemm in ck tile engine

* Reading gpuname from target for gemm preshuffle in ck tile engine

* Reading gpuname from target for gemm preshuffle in ck tile engine

* Get GPU changes for GEMM Muti D in TILE ENGINE

* Addressing errors for gpu name in cktileengine

[ROCm/composable_kernel commit: 9f77061094]
2025-10-20 09:02:18 -07:00
John Shumway
f57d4937c6 [CK_BUILDER] Add experimental builder directory and configuration for composable_kernel (#3043)
Add experimental builder infrastructure for composable_kernel

- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.

This establishes the directory structure  for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.


[ROCm/composable_kernel commit: f18b79f328]
2025-10-20 07:54:09 -07:00
Gino Lu
182c4404b5 [CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044)
* Patch for pk_fp4_raw_t buffer load and ref check

[ROCm/composable_kernel commit: fb1d090f3c]
2025-10-20 14:47:04 +08:00
BrianHarrisonAMD
fcd2b00c83 Add dvc pull step (#3056)
* Add dvc pull step

* Remove CD

* Add details about LOGNAME and fail if dvc isn't installed

[ROCm/composable_kernel commit: af3786fe08]
2025-10-19 16:09:21 -07:00
Illia Silin
17261b3fb8 disable aiter test gemm_a8w8_blockscale (#3049)
[ROCm/composable_kernel commit: d88ea05c84]
2025-10-17 19:52:22 -07:00
AviralGoelAMD
21e65bbb22 docs: add inline comments about flush_cache and rotating buffer
[ROCm/composable_kernel commit: b03764ca5a]
2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal
89fb435ce2 fix identity values in Max and AbsMax (#3048)
- The identity value method returned the minimum positive number while
we need the lowest number for Max and AbsMax operations

[ROCm/composable_kernel commit: 889ffc0b1d]
2025-10-17 09:49:21 -07:00
Emily Martins
cdb6bd372b Fix CK Tile Stream-K BF16 Validation Errors (#3039)
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.

This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.

[ROCm/composable_kernel commit: 352dee5225]
2025-10-17 09:33:38 -07:00
Johannes Graner
7fec9695d2 Pre-commit in CI (#3029)
* Pre-commit in CI

* Specify python version, and install dos2unix for remod

* Refactor remod hook to correctly install dependencies

* Run pre-commit

[ROCm/composable_kernel commit: 8a4cd32d86]
2025-10-17 09:28:38 -07:00
Ville Pietilä
bc3a91d23f Fixed handling of split-K autodeduce argument for grouped convolution (#3024)
* Fix handling of split-K autodeduce argument.

* Fix clang formatting.

* Test fix.

* Fix clang formatting.

[ROCm/composable_kernel commit: 7e44b845b5]
2025-10-17 15:36:39 +03:00
Johannes Graner
580a54b400 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: d40b50b9d5]
2025-10-16 15:29:17 -07:00
Enrico Degregori
6066662785 Wave Tile Transfer supporting global load with transpose (#3027)
* Initial implementation:

 - add new thread group transfer supporting transpose instruction
 - refactor AB transfer to switch between thread and wave tiles methods

* Add some comments and remove explicit wave and lane calculations

* Remove compiler option for performance

* fp16 example: use tuned instance

* Missing cleanup

* Integrate wave transfer in existing gemm and batched gemm instances

* Add fast instances

* extend implementation for 8 bit datatypes

packed types not supported

* Address review comments

* Optimize pipeline v1 and re-introduce compiler option

* Disable wave tile approach for b scale gemm

* Fix for clang20

* Avoid code duplication of amd_global_load_transpose_to_vgpr function

[ROCm/composable_kernel commit: 440358c168]
2025-10-16 11:33:56 -07:00
kabrahamAMD
b085a51b44 implement device batched gemm b scale for wmma (#2825)
* rebased on top of develop

* fixed missing shuffeling and wrong indexing

* added tests for batched_b_scale

* added missing files

* fixed wrong stride computation and removed k batching (for now) due to precision issues

* reinstated k-batching with PRNG constrained to -1..1

* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow

* added k-batching to reference and increased tolerances for test

* changed gemm_b_scale and gemm_universal tests to use correct parameters

* adressed review commentsd

* ported fixes back to non-batched version of b_scale

* adressed review comments

* run clang-format on older commits

* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior

* added newline at end of file

* reflected changes from muitl-abd branch in batched b_scale

* fixed gfx11 issue

* changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed

* run clang format

* set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested.

* reduced range for pk_i4 even further to 0..0

* removed failing xld instances. Failure now uncovered now that tests were fixed

* removed generation of int4 values entierly

* divide B buffer by BPackedSize

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

[ROCm/composable_kernel commit: c4b2da9cbd]
2025-10-16 11:00:42 -07:00
Geo Min
2b0c25dc28 [TheRock CI] Updating SHA for build image and TheRock SHA (#3033)
* Updating SHA for build image

* Adding test exclusions

[ROCm/composable_kernel commit: d7278cc664]
2025-10-16 08:13:10 -07:00
Emily Martins
36020b389c Style updates and cleanup
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused inlcudes


[ROCm/composable_kernel commit: cb83d52301]
2025-10-16 08:47:06 -06:00
Astha
1d1f8af58b Addition of the derived structs for the new Stream-K TilePartitioner
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are data parallel. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.

The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.

Unit tests for the derived structs are included.


[ROCm/composable_kernel commit: 8f75d7cea6]
2025-10-16 08:47:06 -06:00
Emily Martins
64e6fef4ba Stream-K Tile Partitioner Base Class with Tests
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.

This change also includes unit tests for the base tile partitioner.


[ROCm/composable_kernel commit: f87f768d16]
2025-10-16 08:47:06 -06:00
Illia Silin
d4f55309a2 Revert "Enable storelse for fmha_fwd_trload kernel (#3023)" (#3037)
This reverts commit 0b45561c068c839b8478a88cbd822f23c5bb80d0.

[ROCm/composable_kernel commit: 2d1c9e28e2]
2025-10-16 07:19:34 -07:00
Max Podkorytov
c6f40b6147 re-enable batched transpose test on gfx942 (#3035)
* re-enable batched transpose test on gfx942

* try also enabling on gfx11/12

* roll back to original 'gfx9'

[ROCm/composable_kernel commit: e980d4351c]
2025-10-16 03:12:15 -07:00
Vidyasagar Ananthan
8654f7be8d [DOCS] Documentation Addition (Readme updates) (#2495)
* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 92c67a824f]
2025-10-16 03:10:57 -07:00
Haocong WANG
cbae1418f7 Enable storelse for fmha_fwd_trload kernel (#3023)
[ROCm/composable_kernel commit: 013ba3c737]
2025-10-16 13:51:23 +08:00
Emily Martins
80527458b0 Fix compiler noreturn error for ck tile permute test (#3036)
[ROCm/composable_kernel commit: 0dbd173500]
2025-10-15 19:42:02 -07:00
Aviral Goel
43ccad2bcf docs: add quant mode comparison to readme (#3032)
* docs: add quant mode comparison to readme

* Update example/ck_tile/38_block_scale_gemm/README.md

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

[ROCm/composable_kernel commit: 232523d9fa]
2025-10-15 18:35:06 -07:00
Illia Silin
57a340cc4c use branch develop to test hipTensor (#3034)
[ROCm/composable_kernel commit: 87d0a3ac17]
2025-10-15 15:40:34 -07:00
Illia Silin
fd12e33f27 re-enable clang-format by default (#3030)
* re-enable clang-format by default

* fix clang format

[ROCm/composable_kernel commit: 3348f01e6f]
2025-10-15 07:43:11 -07:00
Christopher Millette
fe50f6e177 Disable streamk extended regression tests for now (#3016)
[ROCm/composable_kernel commit: bde5f26db3]
2025-10-15 09:05:47 -05:00
felix
b6f6b7cd2a Felix/opt sorting (#2902)
* merge felix/sorting
* opt moe sorting  (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 4c826abfff]
2025-10-15 09:24:03 +08:00
AviralGoelAMD
f0b0b1e838 test(grouped_gemm_multi_d): add unit test for bf16 support
[ROCm/composable_kernel commit: ca1ab083a7]
2025-10-14 18:00:43 -04:00
AviralGoelAMD
49286aab3f feat(grouped_gemm_multi_d): add support for bf16
[ROCm/composable_kernel commit: 8d8b49dec2]
2025-10-14 18:00:43 -04:00
Geo Min
266ab45f7a fixing group id (#3002)
[ROCm/composable_kernel commit: 706c2b281c]
2025-10-14 08:51:52 -07:00
joyeamd
ed83bcb9a2 update s_barrier's logic in gfx12 architecture (#3003)
change s_waitcnt's logic in gfx1250

change s_waitcnt's logic in gfx1250

update comment

[ROCm/composable_kernel commit: b9d74e7746]
2025-10-14 08:49:34 -07:00
Illia Silin
3a9bd7c1ff Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit 3653a0d01edd715f5c9759eaf547a7644c762b8e.

[ROCm/composable_kernel commit: e4298e55c7]
2025-10-14 08:43:14 -07:00
jakpiase
72a1a1ca59 [CK_TILE] Switch into universal gemms for conv bwds (#2981)
* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments

[ROCm/composable_kernel commit: 6deaaa92cc]
2025-10-14 16:09:16 +02:00
msaffari-amd
6b4d770179 Fix: Handle JSON boolean values (pad_m, pad_n, pad_k and persistent) in gemm_instance_builder (#3008)
[ROCm/composable_kernel commit: 589e242eda]
2025-10-14 13:20:25 +02:00