Commit Graph

2266 Commits

Author SHA1 Message Date
linqunAMD
84d188b137 Fix a typo in intrin_wmma_bf16_16x16x16_bf16_w32 (#2727)
__builtin_amdgcn_wmma_bf16_16x16x16_bf16_w32 is only available in gfx11.

[ROCm/composable_kernel commit: 00fd72b2d4]
2025-09-03 08:07:09 +08:00
Thomas Ning
16a878c572 fix the errors (#2771)
[ROCm/composable_kernel commit: 8d43155bce]
2025-09-02 11:04:21 -07:00
Po Yen Chen
d175d5b28a [CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765)
* Fix Default2DEpilogue usage

* Fix Default2DEpilogue usage for batch_prefill

[ROCm/composable_kernel commit: 9f35cde374]
2025-09-02 09:51:56 -07:00
Vidyasagar Ananthan
e0187eab41 Fixing python backward compatibility issue in benchmarking script.
[ROCm/composable_kernel commit: 0e322200e5]
2025-09-02 12:33:30 -04:00
Yi DING
c453d7661f Fix typo in profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp (#2767)
[ROCm/composable_kernel commit: bab747b017]
2025-09-03 00:12:24 +08:00
Sami Remes
4b690a85e0 Fix formatting problem (#2768)
[ROCm/composable_kernel commit: 4419fc34a2]
2025-09-02 14:14:10 +03:00
Michael Mcminn
53129b5add Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGem… (#2751)
* Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16

* Adding support for offload target gfx9-4-generic

* This duplication here isn't ideal

[ROCm/composable_kernel commit: 022f369deb]
2025-09-02 10:35:07 +02:00
Haocong WANG
1b9b8084bc Fix naming issue (#2762)
[ROCm/composable_kernel commit: 33418b201f]
2025-09-02 11:18:53 +08:00
Po Yen Chen
884ecea071 [CK_TILE] Add FAv3 fwd pipeline (#2731)
* Add FAv3 fwd pipeline

* Unpack v_pk_mul to hide v_mov

* Avoid compiler moving l compute across phase

* Sync sched_group_barrier() setting for masking cases

[ROCm/composable_kernel commit: d876e87fe4]
2025-09-01 09:16:45 +08:00
Thomas Ning
06fad18aaf Restructure the Tile Engine to have faster build time and clear config report (#2747)
* Making edits to identify individual compilation issues.

* Minor fix for blob txt files not being created.

* Fixing compilation issues.

* Fixing ordering bug.

* Adding python profiling functionality.

* Setting individual build as default.

* Setting gpu target filtering for tile engine to gfx90a, gfx942 and gfx950.

* update the default running parameters and settings

* Fixing bug with benchmarking, shifting file generation to build instead of config.

* Updating fixes.

* Fixing json output and parsing.

* Disable ccache for tile engine gemm ops because we dont need it.

* Removing duplicate type definition.

* Improving json printing.

* Add the flexibility of different layout and more warp tile support

* Fix extra flag in name of individual kernels.

* Fixing bug with booleans.

* Solve the first patch of the post merge conflict

* Compilation fixes, and cosmetic improvements.

* Yet again compilation fixes after latest changes from develop.

* Fixing python benchmarking script.

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
Co-authored-by: Vidyasagar Ananthan <vanantha@amd.com>

[ROCm/composable_kernel commit: 705804d9bf]
2025-08-30 06:54:18 -07:00
Aviral Goel
f71b16afc0 chore(gemm): clang format to pass CI (#2758)
[ROCm/composable_kernel commit: fcff0043ae]
2025-08-29 00:38:46 -07:00
Vijay Krish
bad7262507 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.

Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>

[ROCm/composable_kernel commit: 4208e28988]
2025-08-28 23:43:02 -07:00
Cong Ma
16e29b0fe3 Support transposed C tile in Aquant (#2679)
The performance of Aquant has increased after enabling transposed C.

Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.

[ROCm/composable_kernel commit: 428090f749]
2025-08-28 13:28:09 -07:00
Mateusz Ozga
d9f42f3777 [CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752)
* Init commit

* Quick fix, CI fails

* Remove CDElementWise

* Add CDEELementWise

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 0758883fa4]
2025-08-28 12:45:50 -07:00
asleepzzz
154b02423f Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit 4eb18aab71ddb0cf7b63fb161424034615e2bdf5.

[ROCm/composable_kernel commit: 038ea82315]
2025-08-28 22:50:42 +08:00
linqunAMD
cd6d731322 [Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
* fix copy basic build error

* fix other ck tile test build error

[ROCm/composable_kernel commit: 4a49dac7c6]
2025-08-28 20:30:30 +08:00
Yi DING
8b537fb883 [CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192

* Use buffer_load_lds for lse/d

* Dispatch & cleanup

* Avoid zeroing dq & fix

* fix

[ROCm/composable_kernel commit: ead4447b20]
2025-08-28 18:54:18 +08:00
msaffari-amd
146a41f0d8 Ck tile gemm low prec data types int4 int8 unit tests (#2718)
* add gemm unit tests for int4, int8 datatypes

* minor changes based on reviews

---------

Co-authored-by: msaffari-amd <msaffari@banff-cyxtera-s78-2.ctr.dcgpu>

[ROCm/composable_kernel commit: b951416cdb]
2025-08-28 10:47:16 +02:00
Linjun-AMD
16d37f9d67 use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711)
* add k_lds padding and iglp to improve dim256 fmha fwd

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update  block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp

* clang format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* use same naming style

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: bf7b458e6e]
2025-08-28 11:39:39 +08:00
Aviral Goel
5256e754cc feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691)
* feat(check_err): add a variable to adjust number of incorrect values to print

* feat(host_tensor): add printing capability for fp8 bf8 int8 int4

* fix(gemm_utils): update acceptable data type

* fix(host_tensor): print both 4 bit ints in pk_int4_t

* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp

* feat(host_tensor): add print first n elements functions

[ROCm/composable_kernel commit: f5f795c4d6]
2025-08-27 18:17:24 -07:00
John Shumway
ad9378b5a3 Replace auto with function template for c++17. (#2754)
In #2443, a helper function was added test new print functionality, but it used auto for the function parameter types.

We need to support c++17 for downstream libraries, so we cannot use auto there. This PR replaces it witht the equivalent function template implementation.

[ROCm/composable_kernel commit: 9751583f95]
2025-08-27 17:10:39 -07:00
Illia Silin
566b5351a3 Add a daily CI cron job to build pytorch. (#2755)
* add a stage to builf pytorch

* add docker file for pytorch stage

* call build scripts fro mthe default path

* add a daily chron build for pytorcn stage

[ROCm/composable_kernel commit: 0ac908fb57]
2025-08-27 16:57:43 -07:00
Cong Ma
73657622b1 [CK TILE GEMM] Fix a merge conflict (#2753)
* Fixed a merge conflict in 43e7d549
* Foramt the code

[ROCm/composable_kernel commit: cd53e2e57e]
2025-08-27 11:08:09 -07:00
Bartłomiej Kocot
380f0196b4 Fix splitk autodeduce for grouped conv bwd weight (#2742)
[ROCm/composable_kernel commit: cfe5e448db]
2025-08-27 12:35:42 +02:00
Cong Ma
a6d612d20c [CK TILE] Fix bugs in AQuant preshuffle (#2700)
* [CK TILE] Fix bugs in AQuant preshuffle

- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.

* [CK TILE] Rename Preshuffle to PreshuffleQuant

The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.

* [CK TILE Block Scale] Use GemmConfig to save tile properties

- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
  coded tile properties in `gemm_calc_aquant()`

* [CK TILE Block Scale] Rename GemmConfig used in block scale

    - Remove unused GemmConfig
    - Rename GemmConfig used in block scale

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 245467f359]
2025-08-27 00:05:54 -07:00
linqunAMD
bca698bebf Fix merge mfma_wmma (part 1) regression (#2749)
root cause: a typo in GetGfx11InputBlkIdx, const ia added by mistake.

[ROCm/composable_kernel commit: 95e4a4efcb]
2025-08-26 22:49:34 -07:00
JH-Leon-KIM-AMD
7d6f0107bd Test comprehensive dataset (#2685)
* Add CSV-driven convolution test pipeline

- Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality
- Add complete dataset generation toolchain in test_data/
- Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter
- Ready for comprehensive convolution testing with scalable datasets

* Update convolution test dataset generation pipeline

* add 2d, 3d dataset csv files

* Remove CSV test dataset files from repository

* Update generate_test_dataset.sh

* Fix channel division for MIOpen to CK conversion

* Remove unnecessary test files

* Fix clang-format-18 formatting issues

* TEST: Enable comprehensive dataset tests by default

* Fix test_data path in Jenkins - build runs from build directory

* Add Python dependencies and debug output for CSV generation

* Remove Python package installation - not needed

* Add better debugging for generate_test_dataset.sh execution

* Fix Jenkinsfile syntax error - escape dollar signs

* Add PyTorch to Docker image for convolution test dataset generation

- Install PyTorch CPU version for lightweight model execution
- Fixes Jenkins CI failures where CSV files were empty due to missing PyTorch
- Model generation scripts require PyTorch to extract convolution parameters

* Add debugging to understand Jenkins directory structure and CSV file status

- Print current working directory
- List CSV files in test_data directory
- Show line counts of CSV files
- Will help diagnose why tests fail in Jenkins

* Fix clang-format-18 formatting issues

- Applied clang-format-18 to test file
- Fixed brace placement and whitespace issues

* Add detailed debugging for CSV dataset investigation

- Check generated_datasets directory contents
- List all CSV files with line counts
- Show first 5 lines of main CSV file
- Applied clang-format-18 formatting
- This will help identify why CSV files are empty in Jenkins

* keep testing add pytorch installation in shell script

* Use virtual environment for PyTorch installation

- Jenkins user doesn't have permission to write to /.local
- Create virtual environment in current directory (./pytorch_venv)
- Install PyTorch in virtual environment to avoid permission issues
- Use PYTHON_CMD variable to run all Python scripts with correct interpreter
- Virtual environment will be reused if it already exists

* Remove debug code and reduce verbose logging in Jenkins

- Remove bash -x and debug commands from Jenkinsfile execute_args
- Remove all debug system() calls and getcwd from C++ test file
- Remove unistd.h include that was only needed for getcwd
- Remove debug print in CSV parser
- Add set +x to generate_test_dataset.sh to disable command echo
- Redirect Python script stdout to /dev/null for cleaner output

This makes Jenkins logs much cleaner while still showing progress messages.

* install gpu torch

* Clean up and optimize comprehensive dataset test pipeline

- Reorder Jenkinsfile execution: build -> generate data -> run test
- Remove commented-out debug code from generate_test_dataset.sh
- Ensure all files end with proper newline character (POSIX compliance)
- Keep useful status messages while removing development debug prints
- Set MAX_ITERATIONS=0 for unlimited test generation in production

* Add configuration modes to reduce test execution time

- Add --mode option (half/full) to generate_model_configs.py
  - half mode (default): ~278 configs (224 2D + 54 3D) -> ~1,058 total tests
  - full mode: ~807 configs (672 2D + 135 3D) -> ~3,093 total tests
- Update generate_test_dataset.sh to use CONFIG_MODE environment variable
- Keeps all model types but reduces parameter combinations intelligently
- Fixes Jenkins timeout issue (was running 3,669 tests taking 17+ hours)
- Default half mode should complete in ~4-5 hours instead of 17+ hours

* Add small mode for quick testing of comprehensive dataset

* jenkins pipeline test done

* jenkins test done

* Trigger CI build

* remove test comment and update data generation option as half

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 19d5327c45]
2025-08-26 22:18:05 +02:00
John Afaganis
c8047ebb8b Revert "[CK-TILE] Default epilogue, adding support for D (#2629)" (#2746)
This reverts commit 66774c8a1ccd0631a15dd994fdfeb1f020d3ecc6.

[ROCm/composable_kernel commit: 508e7912f9]
2025-08-26 09:48:49 -07:00
SamiAario-AMD
4f649d35b9 Lwpck 3548 gemm test cleanups (#2717)
* Remove some unnecessary calls to create_args in basic and universal GEMM tests

* Remove unnecessary include statements in universal GEMM tests

* Improve compilation time of basic GEMM tests by only compiling the precision variants that we need

* Universal GEMM PrecType should be the same as CDataType

* Improve compilation time of universal GEMM tests by only compiling the precision variants that we need

* Revert to constexpr when defining some constants

[ROCm/composable_kernel commit: 5e85c38d7d]
2025-08-26 13:25:48 +03:00
Mateusz Ozga
50f7068579 [CK-TILE] Default epilogue, adding support for D (#2629)
* Extend 2d-epilogue, D support

* Added tests & update

* Remove unused attribute

* Extend tests

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: d43228fbca]
2025-08-25 19:29:35 -07:00
John Shumway
a25eb35712 Add a CMake property for c++ standard (17 or 20) (#2736)
Configure C++ standard with a CMake variable.

Defaults to C++20, but can be set to C++17  to test backwards compatibility.

* Add validation for allowed C++ standards.

* build CK in rehl8 docker with std=c++17

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 99d27aca17]
2025-08-25 18:56:58 -07:00
Aviral Goel
2b2da54d9e fix(gemm_universal): define CK_TILE_USE_WMMA with default value to stop compilation error (#2737)
[ROCm/composable_kernel commit: c88e24ebe5]
2025-08-25 18:53:40 -07:00
Tianyuan Wu
802a5e7373 [CK_TILE] Fix the Wrong Output Generated by Gemm Examples on GFX11/12 (#2713)
* Introduce macro CK_TILE_USE_WMMA

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Make CK_TILE_USE_WMMA global for all examples

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Remove CK_TILE_USE_WMMA from config.hpp

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

---------

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

[ROCm/composable_kernel commit: e9605ed36d]
2025-08-25 12:55:35 -07:00
spolifroni-amd
af76fb5d64 removed the blog posts as as these are broken links (#2732)
[ROCm/composable_kernel commit: 1d4a3341f0]
2025-08-25 14:16:57 -04:00
Aviral Goel
ceb2877c25 build!: Update composable kernel version to 1.2.0 for rocm 7.0 release (#2734)
* build!: Update composable kernel version to 1.2.0 for rocm 7.0 release

[ROCm/composable_kernel commit: bb6132116f]
2025-08-25 13:48:51 -04:00
Illia Silin
4f611a6bc4 Resolve issues with performance logs in CI. (#2733)
* update the performance test logic

* fix unstash perf logs logic

* untangle unstashing fmha logs for different archs

* run process stage after running fmha tests

* fix the processing of perf logs

* fix arguments for run_performance scripts

[ROCm/composable_kernel commit: 6180685688]
2025-08-25 09:51:29 -07:00
Yi DING
a29eac5da0 [CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715)
* FMHA avoid unnecessary vmcnt0

Squashed commit of the following:

commit 61f5a8d4ef2cb74c0bd4caac359708d6fdb50de7
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:51 2025 +0000

    merge develop and solve conflicts

commit ed7d18e306e16e6f39170a8ae4202d5df7b4045c
Merge: 2dac61a4f 13a6816fb
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:21 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue

commit 2dac61a4f8d28fde9c466ae3ce56435fb679a140
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 02:17:43 2025 +0000

    update bwd

commit 281bfa9cc94eb08effdcdb6e8028bccc1d166682
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Mon Aug 18 19:36:38 2025 +0000

    add restrict to applicable functions

commit 45534dee5bcbe532da46fc5cd6601cde10d84387
Author: Ding, Yi <yi.ding@amd.com>
Date:   Mon Aug 18 02:07:03 2025 +0000

    bwd filter

commit 7abd7b372b82cba94a457238b6b4a81d093e7280
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Sat Aug 16 08:15:23 2025 +0000

    remove noinline attr as it causes a lot more s_waitcnt's

commit 89c29746a09255c1d26038171157e91d1b68d14a
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 12:11:17 2025 +0000

    remove innerloop, move restrict parameters to mainloop and add noinline attribute.

commit 6f61b3a5c80011411aa3aebf7983602f7c117566
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 07:06:51 2025 +0000

    Create inner lambda with restrict parameters, add restrict to some parameters

commit 4e17551191980ea7a7e71e9798946cf1dc9f1a1a
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Aug 14 03:43:54 2025 +0000

    save for debug

commit 5f2c3cfa86c6951208a1cc227fa556704a885a88
Merge: 25f067b4f 165a2723c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:15:22 2025 +0000

    Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 25f067b4f09d6909a05e252c7621124046dfda57
Merge: 447c1c5d6 2ad2f97b7
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:14:26 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 165a2723c557420b48891cc1ce3434e3675aef5d
Merge: 447c1c5d6 4491739ab
Author: asleepzzz <hanwen.chang@amd.com>
Date:   Wed Aug 13 00:34:11 2025 +0800

    Merge branch 'develop' into wip-async-tr-fa

commit 447c1c5d6ef0474f9a54c06eea68d65b0346f9b6
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 14:25:50 2025 +0000

    refactor blockgemm change, isolate to v2;

commit 8f67083511ff77d31c880f4427d3bdf53a179568
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:26:13 2025 +0000

    clang format

commit 3f28caa88b9ac9d84029948a7bacf1175cc5a965
Merge: c84662c34 245071bcf
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:04:41 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit c84662c345755ec5f3d524fdde4aa951c8f86298
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 08:46:06 2025 +0000

    Fix the bug

commit e0647ffa5646f8132529b152af02750c4010013d
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 04:02:41 2025 +0000

    fix conflict. disable all v-col instance for fmha fwd

commit 781f98236c376f57591a6d481cc2ee04b36a148b
Merge: 241f3d7dc 6e03d9607
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 03:52:34 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 241f3d7dc35b2d1cca4eca8ba714581e84f5725e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:53:31 2025 +0000

    clang format

commit 8ee83f1c492ae9600a947c4cfe5f7cd25156138f
Merge: 1a629c098 3639befe9
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:52:52 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1a629c09876cc05f0750db7eade1d527dc32a1d3
Merge: f65874e5b b34a029cd
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:59:40 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit f65874e5b07579d5b734b4c68877679a3ee04dac
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:37:37 2025 +0000

    change the warp setting for hdim32 fmha fwd

commit 7c5f5e65e97486c074ef9a138900ed9aafea547e
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 14:21:09 2025 +0000

    tempsave, update the blocksync functions

commit beb0950ad8c6b0366a77f5b82e7d5c5f8663b915
Author: aska-0096 <haocwang@amd.com>
Date:   Sun Aug 10 06:00:51 2025 +0000

    fix bug in pki4

commit 073db2e18af21f1ed1fb3d1f1c15830838df986f
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Aug 9 03:25:12 2025 +0000

    fix bugs in gemm

commit 01f2d7bd763f64f19861b8a2a861b50bd0aed70a
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 18:35:53 2025 +0000

    fix bug on non-gfx950

commit 9a9ca06d59cb1721b4fa70a0d3253fb6b252b37e
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 17:53:19 2025 +0000

    fix bug

commit 30de97f473685e0bd5b82f15eee2493d9a05cffd
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 15:42:15 2025 +0000

    fix bugs

commit f449cb85a3cfb27bf86525e9c11a2ecf4f7a73a7
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:31:01 2025 +0000

    fix clangformat with 18.1.3

commit e4cb185c41586d018771a5413efd909d8d53a8c5
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:07:40 2025 +0000

    remove non-necessary change

commit 498f0d44cfba17287cce8d10855cce5c5de263db
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:04:02 2025 +0000

    bug fix, clang format;

commit 3cb648cbc4883e6889340d85f48d803a21b9c805
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 08:08:03 2025 +0000

    Remove unnecessary changes

commit 9e7ff3b611b7933b65973907a0cae312a15d31c6
Merge: a3c1bfe6d 7f14bd1df
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 07:50:12 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit a3c1bfe6dd64572e4371c7b1b8b5a809aad90c71
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 06:19:31 2025 +0000

    remove unnecessary files; rename some files

commit 6c257fa27729c005d539b5b71deeba3703031089
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 05:46:18 2025 +0000

    merge fa_decode pipeline into fmha_fwd api

commit 26c911b4e5e43aa78fadc5b7c7880421b94d9449
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 6 05:58:43 2025 +0000

    add __restrict__ to tr load

commit bbad2b979b701533b74f43452ffe0f775e019139
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 5 07:23:51 2025 +0000

    Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

commit d7fabd5f765e2a573ddbaf0857ce6f691407e562
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:27:42 2025 +0000

    Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

commit 9f2c1c5baddaa3a2aa9cd70c4a62401df3c29fd9
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:02:17 2025 +0000

    add vmcnt guard before load ktile

commit f9772f8b6035bc92aa08fb4d092fc21b6b24445c
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 06:49:01 2025 +0000

    Load Q through lds, implement xor;

commit 62bb9f05177dfb8280d6c2be67a88492d6be4838
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 1 10:44:54 2025 +0000

    small refactor

commit 7cb83c2ab6a87d161259eeb8d5ac3e27ce9587af
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 10:25:37 2025 +0000

    upgrade prefill pipeline; simple iglp; consistent data produce and consume order

commit 3a85dee389c424490a5101f05c3f4aa3a1ea70be
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 05:13:27 2025 +0000

    enable larger tile size; upgrade xor pattern

commit a468e59a01d6dd85c105ca30ac249491256c5915
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 12:25:33 2025 +0000

    remove all lds bankconflict with xor layouts

commit 39ff55cdc377311112100fb24bc013adfd8960c0
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 03:51:06 2025 +0000

    enable prefill overload operator().

commit a7b152a788e8035c93f8e4cbf317863182665d8f
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 25 07:10:01 2025 +0000

    fix the lds alignment caused performance regression

commit c4e99bc8f502cd019a754cc9e0043e3d8b9d0f3e
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 23 09:05:57 2025 +0000

    remove unnecessary features

commit 9758750801c7fd5a80f654eb982f43b87d674fa3
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 22 08:04:05 2025 +0000

    tempsave. asynccopy+trload sanity checked

commit 1c4c04d725047357224ebf8a2b94d9010a5651a6
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Jul 21 05:55:55 2025 +0000

    tempsave, trload+asyncload done

commit 75e68f91fc5a1f35cd5d96901efe15c346a1bd5c
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 10:04:34 2025 +0000

    compile pass

commit d41b5eace939909084d32281710fb81142ad5fec
Merge: 3f86a81ee 33204e15f
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:17:27 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 3f86a81eee75256a78df02032d50814aaa42b038
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:16:39 2025 +0000

    tempsave

commit 7d43f7446a9a20773f70e08462393f6c9afb7280
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 10:06:09 2025 +0000

    temp save

commit 727629cd9115f1be9c1800bb65a8ea84ff06c250
Merge: aa5da19c9 94bceebc9
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 07:24:32 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline

commit 94bceebc96ef4885e0ac861b7793e2e2897481bd
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 03:10:46 2025 +0000

    move test_copy into test

commit 8f8bfe7f33884f1588bb7aa1a1d599521f40a30e
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:41:31 2025 +0000

    remove unnecessary output

commit b1dbcacb1832560c6cc967a079dffce558228f0b
Merge: 5b0d311e6 0eaf3325a
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:13 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit 5b0d311e649257557a7014c28fcfac0c327b77b5
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:10 2025 +0000

    add input validation and bug fix

commit 0eaf3325a8e019402ff12a2402f446f8471f584f
Merge: a66e1d29a 08c5df68a
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Date:   Wed Jul 16 11:23:57 2025 -0700

    Merge branch 'develop' into test_copy_fix

commit a66e1d29a8cccc17cc8958d970ec7b1281ec8291
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:55:50 2025 +0000

    fix vmcnt shift

commit 197bdcb4827dae6d8460ed375e6265c2c9ddaef0
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:37:07 2025 +0000

    Improve s_waitcnt_imm calculation

commit 3b59e26cf8e0ba573a99a6caa0f37296b23b8bd2
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 05:39:50 2025 +0000

    fix the s_waitcnt_imm calculation

commit 1c0870089a0e7c78ed71a278bf52d98fc780e482
Merge: d6ee05e36 d9de58c66
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:57:57 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit d6ee05e360dc8426ed2a08a8d6877ebf5cabbd32
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:54:33 2025 +0000

    Add block_sync_lds_direct_load utility

commit c037a72040217471f52ee76bed9c07bf5b22aef4
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 15 09:39:03 2025 +0000

    fix async copytest bug

commit aa5da19c94022449b027e7a57668f2e219f0f171
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 10 04:29:33 2025 +0000

    temp save, change all instance to 1wave

commit ddd172feb9eb2cb783420a8db6f44d51b350c370
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 8 08:37:20 2025 +0000

    tempsave, fmha_decode

commit fd90531f4eafdfdbf7df0f3731018fc57dcf4a33
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Jun 21 15:02:57 2025 +0000

    temp save, waiting for debug

commit 71dd31f15bca01995c8cb0be9e903103f4657181
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jun 19 05:11:52 2025 +0000

    save an example for __bf16 type

commit cdf33e079fa7d7d5b03b06550df2356b02041d7b
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 07:27:24 2025 +0000

    fix bwd code

commit d630998dc6751f44097b1e9a239bb5063a793736
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 06:37:16 2025 +0000

    Fix for fwd/bwd kernel build filter

commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 08:13:18 2025 +0000

    Add restrict to avoid unnecessary vmcnt

---------

Co-authored-by: aska-0096 <haocwang@amd.com>

* Add comments for c-stype cast

* Better comments

---------

Co-authored-by: aska-0096 <haocwang@amd.com>

[ROCm/composable_kernel commit: de61e55493]
2025-08-25 20:55:12 +08:00
John Shumway
1ec340cc3b Remove unsupported use of c++20 concept. (#2719)
Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.

[ROCm/composable_kernel commit: c71d7ddd74]
2025-08-24 21:29:23 -07:00
linqunAMD
8acd47f5ff Extend XDL kernel to Support RDNA3/4 - Part 1 (#2606)
[ROCm/composable_kernel commit: d6e49c5fde]
2025-08-22 17:46:30 -04:00
Po Yen Chen
3c57e29cdc [CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669)
* Allow return VGPR get_warp_id() value

* Avoid using SALU in async_load_raw()

[ROCm/composable_kernel commit: 0db21053e6]
2025-08-22 10:17:05 +08:00
Po Yen Chen
a2b87dd39d [CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549)
* Support async load dwordx4

* Enlarge load size on gfx950

[ROCm/composable_kernel commit: 4a7ecce096]
2025-08-22 10:13:47 +08:00
Yi DING
d732e1e379 [CK_TILE] FMHA BWD Fix Compilation with Bias (#2682)
* [CK_TILE] FMHA BWD Fix Compilation with Bias

* Fix appendkv kApplyRoPE

[ROCm/composable_kernel commit: 4cfa2c7158]
2025-08-22 10:01:10 +08:00
dnovakovic-dxc
9a39d6ac51 Script for generating list of files not referenced in tests (#2696)
* script for generating list of not referenced files in tests, list is in json format

* script comment added

* added empty line at the end of the script

* format changes

[ROCm/composable_kernel commit: 49c6b05c72]
2025-08-20 08:22:51 -07:00
Bartłomiej Kocot
3e8a6dfb9c [CK Tile] Grouped convolution backward data (#2652)
* base working version for single groupped conv bwd data

* Fix 2d descriptor

* fix groups

* Add 3d support

* fixes

* fixes

* fixes

---------

Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>

[ROCm/composable_kernel commit: 4212bbc170]
2025-08-20 05:29:57 -07:00
Haocong WANG
2f6735906e [CK_TILE] Update the fmhafwd dispatch logic (#2698)
* update the fmhafwd dispatch  logic

* Fix fmha test scripts

* Fix bash

---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>

[ROCm/composable_kernel commit: 81b265cf91]
2025-08-20 16:24:43 +08:00
Vidyasagar Ananthan
0e945358c8 Setting gpu target filtering for tile engine to gfx90a, gfx942 and gfx950. (#2709)
[ROCm/composable_kernel commit: bf3e719c16]
2025-08-19 18:12:06 -07:00
Emily Martins
329b680447 [CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes (#2681)
* CK Tile Stream K Device Ops

Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The
StreamKKernel class injects Universal Gemm and includes functions to
facilitate kernel preparation for the GPU.

* Stream K Device Ops Fixes

- Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to
  ensure we get size needed for accumulation buffers and semaphores.
- Pass in num_sk_blocks into TilePartitioner constructor
- Update documentation

* Add WarpTile dimensions to GetName function in StreamKKernel class

* Fix typos in StreamKHostArgs class description.

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

* Apply clang format on updated comment for StreamKHostArgs

* Explicitly specify type for StreamKReductionStrategy enum

* Remove unecessary scopes

* Unify the commenting style to inline comments

* Add explicit casts for occupancy and num_cu in MakeKernelArgs function

Both the static functions Occupancy and NumCU in the StreamKKernel class
use functions from the HIP API that result in the returned occupancy and
num_cu types being type int. The TilePartitioner interface for stream K will
have occupancy and num_cu being type ck_tile::index_t which is int32_t.
Thus, to be safe, this change ensures that both occupancy and num_cu are
cast to int32_t.

* Fix use of kentry due to interface update
PR #2594 updated the interface for the kentry function in
include/ck_tile/host/kernel_launch.hpp. As a result, the static function
Occupancy was updated to work correctly with the new interface.
PR #2594 also changed UniversalGemmKernel's KernelBlockSize static
variable to kBlockSize, so the StreamKKernel class was updated to
reflect this change.

* Switch type of num_sk_blocks from uint32_t to int32_t

This change switches the type of num_sk_blocks to type ck_tile::index_t
which is int32_t. This was done because parallel work for the CK Tile
StreamK TilePartitioner's constructor will have num_sk_blocks as
ck_tile::index_t. Thus, this change will help unify the interfaces to
avoid any type conversion errors.

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

[ROCm/composable_kernel commit: 071165919f]
2025-08-19 15:08:52 -06:00
jefyang1
29863f7381 Fix pk i4 v3 example test regression on gfx942 (#2706)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 6ba9289b26]
2025-08-19 09:58:28 -07:00
joyeamd
188e5dc70b fix grouped gemm example when wave32 enabled (#2707)
1, delete some unused variables
2, fix BlockSize when wave32 enabled

[ROCm/composable_kernel commit: a1589a9667]
2025-08-19 16:20:43 +08:00
mirchen-amd
b31811538f Mirchen/gemm blockscale wp segfault fix (#2638)
* Add stride validation to prevent segfault in blockscale GEMM

* run clang-format

* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* added stride length checking to more gemm examples in ckprofiler

* ran clang format

* added validation header and implement in core gemm operations

* remove ck_tile transpose and gemm stages from CI (#2646)

* update CK build instruction step 4 (#2563)

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

* Fixes to  "General 2D Reduction Kernel" (#2535) (#2656)

* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format

* enable aiter test_mha in daily CI (#2659)

* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)

* feat(copy_kernel): add basic copy kernel example with documentation

* docs(CHANGELOG): Updated changelog

* chore: performed clang format

* Update example/ck_tile/39_copy/copy_basic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fix(terminology): follow amd terms

* extract elementwise copy to a new kernel

* fix(copy_kernel): bug in verification

* add comments about vgpr usage

* lint and nits

* add notes and comments

* print hostTensor via stream

* print hostTensor via stream

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)

* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

* Fix num_byte calculations to use nhead_k for K & V size (#2653)

Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.

Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```

After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```

* [CK_TILE] FMHA BWD Decode Pipeline (#2643)

* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2

* fix (#2668)

* Optimize fmha fwd decode & prefill for gfx950 (#2641)

* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)

This reverts commit 327bf408dd05b4e4bfb7b72f63f8710f35efa9a4.

* added batch stride checking to batched gemm ops in profiler

* removed batch stride validation

* removed batched stride validation again

* Update include/ck/library/utility/profiler_validation_common.hpp

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

* refactor function names

* added gemm stride checking to more profiler gemm operations

* run clang format

* add stride checkign to 01 gemm example

* rename from profiler to validation common, used for examples and profiler

* build of ckProfiler success

* update file headers

---------

Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com>
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 60320e90c1]
2025-08-19 01:19:17 -07:00