Commit Graph

2477 Commits

Author SHA1 Message Date
Enrico Degregori
440358c168 Wave Tile Transfer supporting global load with transpose (#3027)
* Initial implementation:

 - add new thread group transfer supporting transpose instruction
 - refactor AB transfer to switch between thread and wave tiles methods

* Add some comments and remove explicit wave and lane calculations

* Remove compiler option for performance

* fp16 example: use tuned instance

* Missing cleanup

* Integrate wave transfer in existing gemm and batched gemm instances

* Add fast instances

* extend implementation for 8 bit datatypes

packed types not supported

* Address review comments

* Optimize pipeline v1 and re-introduce compiler option

* Disable wave tile approach for b scale gemm

* Fix for clang20

* Avoid code duplication of amd_global_load_transpose_to_vgpr function
2025-10-16 11:33:56 -07:00
kabrahamAMD
c4b2da9cbd implement device batched gemm b scale for wmma (#2825)
* rebased on top of develop

* fixed missing shuffling and wrong indexing

* added tests for batched_b_scale

* added missing files

* fixed wrong stride computation and removed k batching (for now) due to precision issues

* reinstated k-batching with PRNG constrained to -1..1

* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow

* added k-batching to reference and increased tolerances for test

* changed gemm_b_scale and gemm_universal tests to use correct parameters

* addressed review comments

* ported fixes back to non-batched version of b_scale

* addressed review comments

* run clang-format on older commits

* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior

* added newline at end of file

* reflected changes from multi-abd branch in batched b_scale

* fixed gfx11 issue

* changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor_3 until now, this passed)

* run clang format

* set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested.

* reduced range for pk_i4 even further to 0..0

* removed failing xdl instances. The failures were uncovered once the tests were fixed

* removed generation of int4 values entirely

* divide B buffer by BPackedSize

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
2025-10-16 11:00:42 -07:00
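The "type-conversion to AccDataType and then to CDataType" step in c4b2da9cbd can be illustrated with a minimal sketch. The function name `convert_like_device` and the choice of a `double` host value are assumptions made for illustration; the actual reference code is not shown in this log.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round a host-side reference value the same way a
// device kernel would, first to the accumulator type, then to the output type.
// Converting directly from double to CDataType can give a different result,
// because the intermediate rounding to AccDataType is skipped.
template <typename AccDataType, typename CDataType>
CDataType convert_like_device(double reference_value)
{
    const AccDataType acc = static_cast<AccDataType>(reference_value);
    return static_cast<CDataType>(acc);
}
```

With `AccDataType = float` and `CDataType = std::int32_t`, the value 16777217.0 (2^24 + 1) is not representable in `float`, so the two-step conversion yields 16777216 while a direct conversion from `double` would yield 16777217.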
Geo Min
d7278cc664 [TheRock CI] Updating SHA for build image and TheRock SHA (#3033)
* Updating SHA for build image

* Adding test exclusions
2025-10-16 08:13:10 -07:00
Emily Martins
cb83d52301 Style updates and cleanup
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused includes
2025-10-16 08:47:06 -06:00
Astha
8f75d7cea6 Addition of the derived structs for the new Stream-K TilePartitioner
There are 2 derived structs, depending on whether Stream-K is persistent or not.
If it is persistent, both the data parallel and Stream-K sections
are persistent. If it is non-persistent, only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.

The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.

Unit tests for the derived structs are included.
2025-10-16 08:47:06 -06:00
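The structure described in 8f75d7cea6 (a shared base plus two derived structs selected by a Persistent template argument, differing in GridSize()) might look roughly like the sketch below. All names here are hypothetical, and the split of tiles into data parallel (DP) and Stream-K (SK) parts is a simplifying assumption, not the partitioning the commit implements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical base: state and helpers shared by both variants.
struct StreamKPartitionerBaseSketch
{
    std::uint32_t num_tiles;      // output tiles in the whole problem
    std::uint32_t num_workgroups; // workgroups assigned to the Stream-K section

    StreamKPartitionerBaseSketch(std::uint32_t tiles, std::uint32_t workgroups)
        : num_tiles(tiles), num_workgroups(workgroups)
    {
        assert(workgroups > 0);
    }

    // Assumption for this sketch: full waves of tiles are data parallel,
    // the remaining tiles form the Stream-K section.
    std::uint32_t DpTiles() const { return (num_tiles / num_workgroups) * num_workgroups; }
    std::uint32_t SkTiles() const { return num_tiles - DpTiles(); }
};

template <bool Persistent>
struct StreamKPartitionerSketch;

// Persistent DP: the same workgroups process the DP tiles and the SK tiles,
// so no extra workgroups are launched.
template <>
struct StreamKPartitionerSketch<true> : StreamKPartitionerBaseSketch
{
    using StreamKPartitionerBaseSketch::StreamKPartitionerBaseSketch;
    std::uint32_t GridSize() const { return num_workgroups; }
};

// Non-persistent DP: one extra workgroup is allocated per DP tile.
template <>
struct StreamKPartitionerSketch<false> : StreamKPartitionerBaseSketch
{
    using StreamKPartitionerBaseSketch::StreamKPartitionerBaseSketch;
    std::uint32_t GridSize() const { return num_workgroups + DpTiles(); }
};
```

Under these assumptions, 10 tiles on 4 workgroups give 8 DP tiles and 2 SK tiles; the persistent variant launches 4 workgroups and the non-persistent variant launches 12.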
Emily Martins
f87f768d16 Stream-K Tile Partitioner Base Class with Tests
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.

This change also includes unit tests for the base tile partitioner.
2025-10-16 08:47:06 -06:00
Illia Silin
2d1c9e28e2 Revert "Enable storelse for fmha_fwd_trload kernel (#3023)" (#3037)
This reverts commit 013ba3c737.
2025-10-16 07:19:34 -07:00
Max Podkorytov
e980d4351c re-enable batched transpose test on gfx942 (#3035)
* re-enable batched transpose test on gfx942

* try also enabling on gfx11/12

* roll back to original 'gfx9'
2025-10-16 03:12:15 -07:00
Vidyasagar Ananthan
92c67a824f [DOCS] Documentation Addition (Readme updates) (#2495)
* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-16 03:10:57 -07:00
Haocong WANG
013ba3c737 Enable storelse for fmha_fwd_trload kernel (#3023) 2025-10-16 13:51:23 +08:00
Emily Martins
0dbd173500 Fix compiler noreturn error for ck tile permute test (#3036) 2025-10-15 19:42:02 -07:00
Aviral Goel
232523d9fa docs: add quant mode comparison to readme (#3032)
* docs: add quant mode comparison to readme

* Update example/ck_tile/38_block_scale_gemm/README.md

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-10-15 18:35:06 -07:00
Illia Silin
87d0a3ac17 use branch develop to test hipTensor (#3034) 2025-10-15 15:40:34 -07:00
Illia Silin
3348f01e6f re-enable clang-format by default (#3030)
* re-enable clang-format by default

* fix clang format
2025-10-15 07:43:11 -07:00
Christopher Millette
bde5f26db3 Disable streamk extended regression tests for now (#3016) 2025-10-15 09:05:47 -05:00
felix
4c826abfff Felix/opt sorting (#2902)
* merge felix/sorting
* opt moe sorting (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-10-15 09:24:03 +08:00
AviralGoelAMD
ca1ab083a7 test(grouped_gemm_multi_d): add unit test for bf16 support 2025-10-14 18:00:43 -04:00
AviralGoelAMD
8d8b49dec2 feat(grouped_gemm_multi_d): add support for bf16 2025-10-14 18:00:43 -04:00
Geo Min
706c2b281c fixing group id (#3002) 2025-10-14 08:51:52 -07:00
joyeamd
b9d74e7746 update s_barrier's logic in gfx12 architecture (#3003)
change s_waitcnt's logic in gfx1250

update comment
2025-10-14 08:49:34 -07:00
Illia Silin
e4298e55c7 Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit d2bbca3eca.
2025-10-14 08:43:14 -07:00
jakpiase
6deaaa92cc [CK_TILE] Switch into universal gemms for conv bwds (#2981)
* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments
2025-10-14 16:09:16 +02:00
msaffari-amd
589e242eda Fix: Handle JSON boolean values (pad_m, pad_n, pad_k and persistent) in gemm_instance_builder (#3008) 2025-10-14 13:20:25 +02:00
ClementLinCF
e1b0bdfbfa [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formatting issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-13 11:52:37 -07:00
John Shumway
fc2a121c44 Enable GMock and improve gtest configuration (#2976)
Our current cmake/gtest.cmake file does not enable gmock. Gmock is needed for matchers that are needed for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake:

* Enable BUILD_GMOCK by default (was previously disabled)
* Patch gtest-src/googlemock/CMakeLists.txt for broken include path.
* Add configuration to gmock if the target is used.

No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code.
2025-10-13 08:11:51 -07:00
Sami Remes
d2bbca3eca [CK_TILE] Non-K Major from old CK to CK-Tile (#2442)
* Enable the adapted LDS B layout for Row-Major

* fix formatting

* Implement specialized col-major A LDS block descriptor

* Fix formatting

* Use VecLoadSize for AK1/BK1

* Fix some thread access pattern values

* Use GetVectorSizeA for A

* Fix formatting

* Add extra condition to avoid division by zero

* disable layout for wave32

* remove extra else

* fix formatting

* Fix formatting

* Rename one remaining TileDistributionEncodingPattern2D

* Use integer ceil division

* revert remod.py changes

* also revert utility.hpp

* use getA/BTileAccessPattern everywhere

* use integer_divide_ceil for AK0 too

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2025-10-13 14:27:02 +02:00
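The "integer ceil division" and "avoid division by zero" items in d2bbca3eca refer to a common idiom, sketched below. The names `integer_divide_ceil_sketch` and `blocks_needed` are stand-ins written for this note and are not the library's own functions.

```cpp
#include <cassert>

// Ceiling division for non-negative integers: the smallest q with q * y >= x.
// Plain x / y would truncate and drop a partial last chunk.
template <typename T>
constexpr T integer_divide_ceil_sketch(T x, T y)
{
    return (x + y - 1) / y;
}

// Hypothetical use: how many chunks of size k1 are needed to cover k elements.
// The assert mirrors the idea of guarding against a zero divisor.
inline int blocks_needed(int k, int k1)
{
    assert(k1 > 0);
    return integer_divide_ceil_sketch(k, k1);
}
```

For example, covering 33 elements with chunks of 8 needs 5 chunks, where truncating division would report 4.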
aledudek
634634f5c0 [CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955)
* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2025-10-13 13:57:37 +02:00
aledudek
3021604213 [CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860)
* Add valid check batched gemm part1

* [CK_TILE] Add batched gemm kernel IsSupported func checks

* revert broken pre-commit hook changes

* revert broken pre-commit hook changes v2

* Clarify error messages
2025-10-13 13:55:23 +02:00
damien-lejeune
46c10c316d Update include path to break the remod's cyclic dep issue (#2978)
* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-13 13:24:47 +02:00
msaffari-amd
e9f0cc83a8 [CK Tile] contraction multi d - kernel & example (#2901)
* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for multi dimensional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for calculating the offsets from flatten batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-13 12:30:28 +02:00
Yi DING
95bdc7410c [CK_TILE] FMHA BWD Add Instance for D48 on GFX950 (#2866)
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-10-13 15:03:46 +08:00
Christopher Millette
f5708882a3 Streamk functional tests (#2974)
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite

* Account for stride when computing K offsets for A and B tensor

This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correctly regardless of
whether A and B are row or column major.

* Move helper code to test_gemm_streamk_util.hpp

* Separate tests into smoke/regression/extended. Add bf16 datatype

* Run clang-format

* Refactor combinatorial macro expansion and naming

* Adjust the initialization values to account for better tolerance on bf16

* Correct BF16 datatypes in comments

* Move the extended tests under the REGRESSION_TESTS label

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-11 07:53:40 -05:00
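The stride fix described in f5708882a3 can be pictured with a small sketch. It assumes A is an M x K matrix and B is a K x N matrix, each with a single leading-dimension stride; the enum and function names are hypothetical and do not come from the kernel.

```cpp
#include <cassert>
#include <cstddef>

enum class Layout { RowMajor, ColumnMajor };

// Element offset of column k in A (M x K) relative to the start of a row.
// Row-major A is contiguous along K, so the offset is k itself.
// Column-major A steps by the stride for every k.
inline std::size_t a_k_offset(std::size_t k, std::size_t stride_a, Layout layout)
{
    assert(stride_a > 0);
    return layout == Layout::RowMajor ? k : k * stride_a;
}

// Element offset of row k in B (K x N) relative to the start of a column.
// Row-major B steps by the stride for every k.
// Column-major B is contiguous along K, so the offset is k itself.
inline std::size_t b_k_offset(std::size_t k, std::size_t stride_b, Layout layout)
{
    assert(stride_b > 0);
    return layout == Layout::RowMajor ? k * stride_b : k;
}
```

Using a plain `k` for every layout would only be correct for row-major A and column-major B, which is the kind of error a stride-aware offset avoids.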
John Shumway
0843815db7 Fix GCC 7 CTAD compilation error in test_fmha_bwd.cpp (#3001)
Fixes compilation error on SLES15 with GCC 7 for gfx942 builds:

error: 'vector' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]

Changes:

- Explicitly specify template argument for `std::vector<mode_enum>` instead of relying on C++17 CTAD
- Maintains compatibility with both older (GCC 7) and newer compilers
2025-10-10 19:13:34 -07:00
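The change in 0843815db7 amounts to spelling out the element type instead of letting the compiler deduce it. The sketch below shows the pattern; the enumerator names and the `all_modes` helper are made up for illustration, since the log only names `std::vector<mode_enum>`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical enumerators; only the type name appears in the commit message.
enum class mode_enum { batch, group };

// C++17 class template argument deduction (CTAD) would allow:
//   const std::vector modes{mode_enum::batch, mode_enum::group};
// Older compilers, or builds with -Werror and -Wctad-maybe-unsupported,
// can reject that form. Naming the element type works everywhere.
inline std::vector<mode_enum> all_modes()
{
    return std::vector<mode_enum>{mode_enum::batch, mode_enum::group};
}
```

The explicit form costs a few characters and compiles under C++11 and later, which is why it is a common choice for code that must build on older toolchains.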
Khushbu Agarwal
3c39d279ab supporting prefill shapes for preshuffle block scale gemm (#2975)
* debugging

* debugging for prefill shapes

* comment unused code

* fix for prefill shapes

* clearing up the code

* add int4 to universal gemm example

* clang formatted

* adding test for prefill shapes in block scale gemm

* lil improv on the block pipeline

* Address Review Comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-10 15:36:24 -07:00
Max Podkorytov
9d060d3e3c [CK-Tile] functional support for transposed inputs in compute-bound double-lds-buffer pipeline with async loads from global memory to LDS (#2984)
* reuse local prefetch logic from compute v4 pipeline

add single-tile test

explicit lambda capture

reuse lds block descriptors from base policy for the transposed case

match the test case kernel configuration with compute v4

* add comments
2025-10-10 12:57:50 -07:00
yinglu
fada1a3cae Conv:TF32: add more instances - 2 (#2879)
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances
* add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances
* add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances
* tf32:conv:add instances for base class DeviceConvFwd
* tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD
* tf32:conv:add instances for base class DeviceGroupedConvBwdWeight
* add tf32 in profiler
* remove gnhwc/ngchw/ngcdhw instances
* remove non-ndhwgc/nhwgc/nhwc instances
* add check in IsSupportedArgument()
2025-10-10 15:28:17 +08:00
Bartłomiej Kocot
ad7a215aba Fix splitK for grouped conv bwd data (#2991) 2025-10-10 09:24:21 +02:00
Yi DING
b6036bc76a [CK_TILE] FMHA Tests Enhancement (#2945)
* fmha-gtest-wip

* Thanks Copilot!
2025-10-10 11:34:47 +08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
fb66b4f5e4 [CK_TILE] fix pk_fp4 compilation for non-gfx950 GPUs (#2983)
See build error log from
https://github.com/ROCm/composable_kernel/issues/2271#issuecomment-3150218542

This PR makes vector element access constexpr-safe by avoiding operator[] on
ext_vector_type(2) and replacing those sites in the pk_fp4 conversions so they
can be used in constant expressions. The operator[] on ext_vector_type(2)
isn't allowed in constant expressions, which caused "constexpr function never
produces a constant expression" with a note at x[0]. Using `bit_cast` to a
trivial array representation keeps it constexpr-compatible.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-09 07:43:41 -07:00
Yashvardhan Agarwal
7b6451b68e [CK_TILE] Pooling FWD (Lwpck 3683) (#2956)
* Pooling 2D/3D with reference

* Tests & cleanup

- added test for pooling
- cleanup
- removed 2d example

* Comment resolution

- README added
- example target name rectified
- appropriate arg description and comments added

* clang-format

* appropriate blocksize calc

* modifications for future indexing addition

- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future

* some basic fixes

* comment resolutions

* comment resolutions

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-09 16:13:26 +02:00
Sami Remes
9d4bfe3932 Add KBatch support for gemm_ab_scale (#2740)
* Add KBatch support for gemm_ab_scale

* Revert kernel parameters change

* Remove printing

* fix formatting

* fix check

* Use {} in if

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-09 08:33:16 +02:00
Aviral Goel
e99356dabc Add Memory pipeline for AQuant Block Scale GEMM (#2987)
* WIP: add memory pipeline boiler plate code that compiles and works for one block

* WIP: tail handling works for memory pipeline

* WIP: numerical errors appear to have gone by adding block_sync_lds()

* fix: numerical error with memory pipeline by adding block_sync_lds() and new tail handler

* refactor: remove debug print statements and lints

* fix: remove redundant sync barriers

* chore: remove lint

* fix: remove unused code from tile handler and remove redundant block_sync_lds()

* fix: correct parent struct name for memory pipeline

* fix: remove static assert check from parent struct and add it to child struct because not all child structs need to static assert

* fix: defer block sync lds to just before prefill
2025-10-08 17:22:30 -07:00
JC
e29151b533 [CI] Enable ccache w/ namespace for external use (#2988)
* Enable ccache w/ namespace for external use

* Add TheRock parent directory to log path

* Fix typo for TheRock
2025-10-08 16:03:22 -07:00
andrew clark
0a4c45b4d3 CI Skip and Status Checks Fix (#2952)
* Update Jenkinsfile

Adding logic to skip CI checks when a commit contains changes to non-relevant files like docs, .md, licenses, and .github workflow files.

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

Testing skip env var

* Update Jenkinsfile

Fixing syntax

* Update Jenkinsfile

Simplifying CI check logic

* Update Jenkinsfile

Testing skipping logic on stages.

* Update Jenkinsfile

Removing post block. The status for skipped stages is already reported.

* Testing Docs

Testing that modifications to files in the docs folder do not trigger the build and test stages.

* Testing Multifile Trigger

Removed Jenkinsfile from the skip patterns. Reversed change to docs file. This test should not skip CI checks.

* Clean code

Renamed setup stage to be more descriptive.
Added pipeline env variable for consistency.
Moved performance test results stage conditional up a level so the parent stage appropriately reports the status if it is skipped.

* Fixing syntax error

* Updated CRON Flags

Added the FORCE_CI flag to the CRON instructions. This will ensure CI does not skip the job.

* Updating logging

Making logs more explicit.

* Comment update

Cleaning comments.

* Update Jenkinsfile

Reverting performance reports when condition.

* Parallel Test

Testing stage status with parallel stages

* Update Jenkinsfile

* Update Jenkinsfile

Removing stages for quick testing

* Update Jenkinsfile

* Testing skipped parallel stages

Testing the addition of a coordination stage to always pass and give an update to skipped parent stages with parallel sub-stages.

* Testing parallel stages

Adding coordination stage to test if parent check status is correctly updated.

* Simplified performance results stage

Removed parent stage as there are no other parallel stages to execute (yet).

* Testing final clean up stage

* Testing check status update

Testing - forcing status to update after a stage skip.

* Testing results stage skip

* Removing test stage

* Testing pipeline

* Testing post status updates

* Process Test Results Post Event Update

The stage will report success when it skips or is successful.

* Testing non-relevant file change

This should skip build and test in CI

* Reverting test

updating regex file patterns to use strings instead of regex literal syntax.

* Fixing file matching regex

* Testing docs modification

* Fixing default env var value

* Correcting env var assignment

* Pipeline test

Updating docs file. Should skip ci.

* Testing Pipeline

Setting default run ci state.

* Adding debugging

* Removing debugging

* Pipeline test

Should skip pipeline

* Pipeline Test

Mixed files to trigger a CI run

* Adding additional status updates

The parent stage sometimes remains in pending even if the child stage completes when skipped. Added an additional status update for the parent stage.

* Fixing variable name

* Moving stage names

Moved the performance stage names to a single location because they are referenced multiple times. This reduces errors with typos in the future.

* Revert "Moving stage names"

This reverts commit 7cf6743e54.

* Update Jenkinsfile

Handle both truly empty arrays and arrays containing only empty strings.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-08 15:48:08 -06:00
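The skip decision that 0a4c45b4d3 adds to the Jenkinsfile can be summarized as: skip CI only when every changed file matches a non-relevant pattern, and never skip when CI is forced. The sketch below restates that rule in C++ purely for illustration. The real logic lives in the Jenkinsfile and is not shown in this log, so the patterns, the function names, and the choice to run CI when the file list is empty are all assumptions.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical patterns for files that do not need a build and test run.
inline bool is_non_relevant(const std::string& path)
{
    static const std::vector<std::regex> patterns = {
        std::regex(R"(^docs/.*)"),
        std::regex(R"(.*\.md$)"),
        std::regex(R"(^LICENSE.*)"),
        std::regex(R"(^\.github/.*)"),
    };
    for (const auto& pattern : patterns)
    {
        if (std::regex_match(path, pattern))
            return true;
    }
    return false;
}

// Returns true when the build and test stages can be skipped.
inline bool should_skip_ci(const std::vector<std::string>& changed_files, bool force_ci)
{
    if (force_ci)
        return false; // e.g. scheduled jobs that set a force flag
    bool saw_file = false;
    for (const auto& file : changed_files)
    {
        if (file.empty())
            continue; // a list holding only empty strings counts as empty
        saw_file = true;
        if (!is_non_relevant(file))
            return false; // one relevant file is enough to run CI
    }
    // Assumption: with no changed files at all, default to running CI.
    return saw_file;
}
```

A docs-only change is skipped under this rule, while a change that mixes docs with source files still triggers a full run, matching the "Mixed files to trigger a CI run" test described in the commit.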
Michal Kulikowski
2444c44895 [CK][Examples] Extending support for rdna3/4 part 3:
-example_gemm_xdl_int8
-example_gemm_xdl_fp8
-example_gemm_xdl_fp8_bf8
-example_gemm_xdl_fp16_fp8
-example_gemm_add_add_fastgelu_xdl_int8
-example_grouped_gemm_xdl_int8
-example_grouped_conv_bwd_weight_xdl_bf16
-example_cgemm_xdl_fp32
-example_cgemm_xdl_int8

fixing cmdlines for:
-example_22_cgemm
-example_24_batched_gemm
-example_batched_gemm_xdl_fp16int4_b_scale_v3

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-10-08 18:14:38 +02:00
Michal Kulikowski
7259b9c4db [CK][Examples] Extending support for rdna3/4 part 2:
-example_batched_gemm_xdl_int8
-example_batched_gemm_xdl_fp8_rowwise_v3
-example_batched_gemm_xdl_fp32
-example_batched_gemm_xdl_bf16
-example_batched_gemm_xdl_bf16_v3
-example_batched_gemm_xdl_fp16
-example_splitk_gemm_bias_e_permute_xdl_fp32
*fixing return value to return 0 as success in above examples.

Fixing cmdline parameters in:
-example_sparse_embedding3_forward_layernorm
-example_elementwise_binary_4D_fp16
-elementwise_scale_permute_amax_2D_fp16_fp8

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-10-08 18:14:38 +02:00
Cong Ma
1d4db30af9 [CK TILE GEMM] Refactor the code of transposeC and quantpreshuffle of AQuant Gemm (#2965)
Refactor the code of transposeC and quantpreshuffle of AQuant Gemm to make it easier to maintain.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-10-08 00:05:38 -07:00
Thomas Ning
ae9f29b7d5 add the sync barrier for persistent kernel (#2977) 2025-10-07 11:54:04 -07:00
Aviral Goel
19415d0b6f fix: nil performance results for gemm examples (#2950) 2025-10-06 12:43:23 -07:00
Geo Min
d4761d7807 Fixing hash (#2973) 2025-10-06 08:38:38 -07:00