Commit Graph

2160 Commits

Author SHA1 Message Date
Thomas Ning
f466e04554 delete all slp compilation flag in CK Tile (#2625)
[ROCm/composable_kernel commit: 07469142cb]
2025-08-06 00:34:39 -07:00
Illia Silin
69cfe33716 Revert "Reduce build time tile engine (#2579)" (#2623)
This reverts commit 19caeff665a8d9c499e28ff4e1703d1c87602162.

[ROCm/composable_kernel commit: 833ae1d051]
2025-08-05 09:27:55 -07:00
Enrico Degregori
41f3e48f9c Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)
* Add padding 1x1Stride1Pad0 conv specialization

* Add gridwise checks for conv cshufflev3

* Merge padding with previous transforms

* Apply transform changes for padding to default specialization as well

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 2203b0ddfe]
2025-08-05 15:23:19 +02:00
Thomas Ning
7c9d0d0435 Persistent grouped gemm CompV4 Enablement & Polish (#2605)
* enable the persistent kernel for CompV4

* polish the example and clang format

* fix the non-persistent kernel error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: cbfecf8d7a]
2025-08-04 23:43:01 -07:00
Max Podkorytov
beff75a8ff fix build for test_ck_tile_fp8 on rhel8 (#2615)
[ROCm/composable_kernel commit: 2a78da4708]
2025-08-04 17:43:15 -07:00
Illia Silin
7e796b7861 fix test_mx_mfma errors (#2614)
[ROCm/composable_kernel commit: fb96b49666]
2025-08-04 11:43:47 -07:00
rahjain-amd
0bc4627b73 Fix Debug Build for ckProfiler (#2609)
Problem
=======
relocation R_X86_64_32 out of range: 5405348154 is not in [0, 4294967295]

Solution
========
The problem was caused due the limitation comes from the 32 bit offsets
used in original DWARF standard.
We have the option to switch to 64bit offset for your libs which free
us from 4G size boundary.

add -gdwarf64 and -Og to avoid this limit.

[ROCm/composable_kernel commit: 59245df46d]
2025-08-04 11:28:09 -07:00
Jinchao Xu
d2525548c6 Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611)
Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances
by splitting debug information into separate .dwo files.

Add explicit cast to avoid signed/unsigned comparison warning.

[ROCm/composable_kernel commit: 15eb493152]
2025-08-04 11:26:08 -07:00
Bartłomiej Kocot
c903850b65 Mark non-grouped convolutions instances as deprecated (#2595)
* Mark non-grouped convolutions instances as deprecated

* Update CHANGELOG.md

Co-authored-by: John Afaganis <john.afaganis@amd.com>

* Update library/src/tensor_operation_instance/gpu/conv1d_bwd_data/device_conv1d_bwd_data_xdl_nwc_kxc_nwk_bf16_instance.cpp

Co-authored-by: John Afaganis <john.afaganis@amd.com>

---------

Co-authored-by: John Afaganis <john.afaganis@amd.com>

[ROCm/composable_kernel commit: 8655ba989c]
2025-08-04 16:49:55 +02:00
Max Podkorytov
b13f01345d remove std::format (#2604)
[ROCm/composable_kernel commit: 0d9439760f]
2025-08-01 19:22:07 -07:00
Illia Silin
8703165b46 remove std=c++17 compiler flag (#2603)
[ROCm/composable_kernel commit: b786d12e56]
2025-08-01 16:18:16 -07:00
Max Podkorytov
dd4a259904 [CK-tile] remove old ck-tile transpose test (#2591)
* remove old ck-tile transpose test

* rename test exe for consistency

* replace batched transpose regression test

[ROCm/composable_kernel commit: f36cb5b2aa]
2025-08-01 14:50:09 -07:00
Thomas Ning
aa2f9b4c73 Reduce build time tile engine (#2579)
* Modify CMakeLists to allow for splitting.

* Modify CMakeLists for data and layout logic.

* Run tests and get build artifact.

* Test new Cmakelists for speedup.

* Further improvements for speedup.

* turn off the FMHA

* turn off the automatic tile engine gemm

* minor fix

* disable the transpose test first

* Address the comment

* Jenkinsfile

* change the make thread to 64

* change the compile thread to 32

* Try to use with less OS memory space

* Have the Unity build batch size to 2

* reduce the chunk size

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>

[ROCm/composable_kernel commit: e5b79b26fa]
2025-08-01 14:42:33 -07:00
Illia Silin
6fc0a709dc update the switch condition for buffer built-ins (#2602)
[ROCm/composable_kernel commit: 788e8a878e]
2025-08-01 14:30:07 -07:00
Thomas Ning
9325ae50e8 Fix the GFX 950 Universal GEMM (#2597)
* solve the gfx950 error

* clang format

* fix a typo error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: 7c44a763fa]
2025-08-01 09:32:24 -07:00
Illia Silin
6c5d1d39b8 Add a daily CI stage to test AITER with latest CK. (#2598)
* add a CI stage for AITER testing

[ROCm/composable_kernel commit: e6104daecc]
2025-08-01 07:55:51 -07:00
lalala-sh
def677fbc7 fix weight index out of range (#2414)
[ROCm/composable_kernel commit: bb5c478295]
2025-08-01 17:50:02 +08:00
Aviral Goel
4f9a78fbf0 Integration of a new pipeline for weight preshuffle into gemm examples (#2516)
* something khushbu can help with

* v1 v2 works with flatmm develop

* v0 v1 v2 numerical error gone

* Fixing numerical error, and interchange preshuffle configs to match with flatmm

* Refactor GEMM pipeline configurations and integrate preshuffle support

- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments

* clang format

* fix vectorloadsize bug

* add the Preshuffle3

* update kwarp calculation in gemm utils

* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch

* fix: add CK_GFX950_SUPPORT macro for gfx950 detection

* default disable rotating buffer

* docs(CHANGELOG): update changelog for rocm 7.0

* Revert "docs(CHANGELOG): update changelog for rocm 7.0"

This reverts commit 2bc16fff84.

* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.

* revert example/ck_tile/flatmm to its original state

* remove comment added by second author

* switch to xor ALDSDescriptor

* modify the MakeALdsDescriptor()

* temporary profiling script

* getting rid of line marker compiler error

* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy

* add a minor fix for the config

* typo fix

* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2

* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp

* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp

* reenable the GemmSpatiallyLocalTilePartitioner

* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline

* remove hardcoded true for preshuffle bool template argument

* rename script

* remove gemm_profilie.sh script

* merge conflict resolve

* clang formatted

* typo fix

* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp

* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy

* Fix missing newline at end of file in run_gemm_example.inc

* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1

* addressing review comments

* removing debug code

* addressing review comments

* Revert "addressing review comments"

This reverts commit 29c45192ba.

* updating tile_engine code

* addressing review comments

---------

Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 1441a0a7ee]
2025-08-01 00:04:54 -07:00
Khushbu Agarwal
464c6f459e [CK_Tile] Updating gpu timer when doing flush cache (#2593)
* Missed updating function names in example

* updating timer

* code cleanup

* addressing review comments

* updating tile_engine code

* addressing review comments

[ROCm/composable_kernel commit: 88d72178d6]
2025-07-31 16:43:33 -07:00
Aviral Goel
8bea96ab32 Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt

* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt

* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

---------

Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>

[ROCm/composable_kernel commit: 546ef78d1d]
2025-07-31 12:18:02 -07:00
Ville Pietilä
e91e8e7908 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeducted split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>

[ROCm/composable_kernel commit: e962a41638]
2025-07-31 12:08:45 +02:00
Anton Gorenko
2ef590ab43 [CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)
* Add tests for host convesion f32/f16 to f8

* Add tests for host convesion from f8 to f32/f16

* Fix UB and corner cases in f32/f16 to/from f8 conversion

* There are UBs when very small values are converted to f8: bitshifts
  can be larger that type width. Using unsigned long long does not help
  because exponent_diff >= 64 in such cases. This causes that values
  like 2.117582368e-22 are converted to non-zero f8 in host validation
  of FMHA tests, test_f8 crashes with segfault in completely irrelevant
  code like GTest internals or produces non-deterministic results etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
  conversion.

* Replace some magic numbers with values from numeric_traits

* Build tests only on devices supporting the type

[ROCm/composable_kernel commit: 7b074249f4]
2025-07-31 09:54:17 +05:00
Illia Silin
64f8c28b42 upgrade clang-format version in install_precommit.sh (#2589)
[ROCm/composable_kernel commit: e8709c24f4]
2025-07-30 08:02:25 -07:00
Max Podkorytov
e18a5f6859 [CK-tile] add gtest for ck-tile batched transpose kernels (#2585)
* add a dummy test file

* add kernel launch logic to the test

* transfer all test cases into gtest params

* factor kernel out into test config

* add load transpose pipeline tests

* add padded tests and skip invalid kernels at runtime

* enum class for pipeline type

* add multiwarp test cases

* fix type

* try to solve the problem

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: de0cdb4c31]
2025-07-30 07:31:05 -07:00
Gino Lu
80a9d720ca add constexpr to pk_fp4::pack/unpack() (#2586)
[ROCm/composable_kernel commit: b25d512e8a]
2025-07-30 10:29:04 -04:00
Khushbu Agarwal
7af5e043b1 Update to gpu_timer for rotating_buffer (#2524)
* update gpu_timer for rotating buffer as hipblasLt's implementation

* timing fix

* Updating gpu timer for old ck as well

* Revert "Updating gpu timer for old ck as well"

This reverts commit 958cd1bc99.

* code clean up with runtime argument; function rename

* code cleanup

* general timer fixes

* bug fix

* clang formatted

* addressing reveiew comments

* clang formatted

* Addressing review comments

* CI fix

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 61e21f5567]
2025-07-29 15:21:05 -07:00
Illia Silin
c810dda70e Revert "Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)" (#2584)
This reverts commit a111f65475de7974fe89070963d21a2741341ef7.

[ROCm/composable_kernel commit: b80099cc5f]
2025-07-29 13:04:51 -07:00
Thomas Ning
69b117ae73 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: 9d4b494f07]
2025-07-28 23:56:53 -07:00
rocking
0fb487ad63 set default optdim (#2580)
[ROCm/composable_kernel commit: 01642ca8b1]
2025-07-29 13:44:10 +08:00
Illia Silin
6704e5303f fix the clang-format (#2578)
[ROCm/composable_kernel commit: 49723e94bb]
2025-07-28 20:49:55 -07:00
Yi DING
a592107cb9 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 1926cd0cb8]
2025-07-29 09:31:14 +08:00
Andres Lugo
53b36035f2 Remove filter for only batch on receipt 4 (#2574)
Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA

[ROCm/composable_kernel commit: 7fe50dc3da]
2025-07-28 14:53:24 -07:00
Bartłomiej Kocot
f25da17c36 Enable multiple D for grouped conv fwd large tensors (#2572)
[ROCm/composable_kernel commit: 5b244105d9]
2025-07-28 22:39:07 +02:00
linqunAMD
d3a1842669 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 0782ee8eb3]
2025-07-28 13:01:07 -07:00
Illia Silin
3345f5f417 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config

[ROCm/composable_kernel commit: 504b101da3]
2025-07-28 11:34:07 -07:00
Illia Silin
61ff984dcd use ninja to build packages (#2575)
[ROCm/composable_kernel commit: 9786087010]
2025-07-28 11:04:12 -07:00
jefyang1
8d7d85be00 Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)
[ROCm/composable_kernel commit: c64a0c65b9]
2025-07-28 09:03:54 -07:00
rocking
4c9de4fbb4 [CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539)
* uncomment all the headdim, use optdim to control

* change default back to -1

* uncomment splitkv instance

* Fix typo in receipt 4 for appendkv

* support optdim for bwd, splitkv and appendkv

* Fix 192 key error

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>

[ROCm/composable_kernel commit: b36e0b029f]
2025-07-28 17:16:32 +08:00
shay-li77
d143f1466f fix mha bwd dbias random mismatch (#2570)
* fix mha bwd dbias random mismatch

* formatting code

[ROCm/composable_kernel commit: 8ae528a1b4]
2025-07-28 14:39:31 +08:00
Bartłomiej Kocot
23ee2ddb5c Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix

[ROCm/composable_kernel commit: 685771b875]
2025-07-28 00:47:17 +02:00
Gheorghe-Teodor Bercea
457e3c8b08 Refactor async loads to work on all GPUs (#2545)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: cbfa62e4b6]
2025-07-26 22:04:59 -07:00
Max Podkorytov
98bad686a9 [CK-Tile] Merge transpose examples (#2450)
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26c13]
2025-07-26 21:51:54 -07:00
liang
d055927b83 reorder grid dim schedule (#2533)
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: d2459878cf]
2025-07-26 02:46:55 +08:00
Bartłomiej Kocot
69c8415c05 Fix clang format (#2567)
* clean

* clang format fix

[ROCm/composable_kernel commit: 5741edf761]
2025-07-25 09:54:34 -07:00
rahjain-amd
1e3243ff59 Fixing 0ms and inf GB/s issue in img2col (#2565)
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

[ROCm/composable_kernel commit: 78082855d8]
2025-07-25 21:15:50 +05:30
Adam Osewski
63d239d406 Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: c8eb2f995c]
2025-07-25 10:34:31 +02:00
Enrico Degregori
2d68b3f9c0 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: b01a27ff22]
2025-07-24 18:49:58 -07:00
Cong Ma
c721559117 [CK TILE] Apply CK_GFX950_SUPPORT macro on ck tile GEMM unit tests (#2560)
cherry-pick c68687e30 and apply CK_GFX950_SUPPORT macro on ck tile GEMM unit tests

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 2addf05b91]
2025-07-24 16:06:32 -07:00
Illia Silin
12f5978e20 remove repetitive code (#2562)
[ROCm/composable_kernel commit: 9c04a55626]
2025-07-24 14:52:46 -07:00
Aviral Goel
9097832e16 Revamp TERMINOLOGY.md (#2522)
* Add comprehensive terminology reference for Composable Kernel, including glossary, hardware and memory hierarchy, execution model, programming model, memory access, tile-based computing, kernel operations, and optimizations.

* Refine terminology in documentation for clarity and consistency.

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Enhance TERMINOLOGY.md by adding definitions for CUDA and HIP

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update TERMINOLOGY.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* address review comments

* refine pipeline and tile partitioner

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: 963dfa680b]
2025-07-24 12:38:24 -07:00