Commit Graph

1987 Commits

Author SHA1 Message Date
aska-0096
78d0fd4e65 add vmcnt guard for async copy 2025-05-28 03:47:46 +00:00
Ding, Yi
b99c50a5d5 pad ascale 2025-05-28 03:35:33 +00:00
Ding, Yi
cf5b4c11a2 Pad shuffled a scale only 2025-05-28 02:37:14 +00:00
aska-0096
65255e12fb Unconditional Ascale padding 2025-05-28 01:55:23 +00:00
aska-0096
63c9388881 Pad the M for scale buffer unconditionaly 2025-05-27 11:52:12 +00:00
aska-0096
9da2995163 Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4 2025-05-27 10:23:21 +00:00
aska-0096
04f7265c19 refactor the pipeline 2025-05-27 10:14:45 +00:00
Ding, Yi
d3015785cb Fix 'Merge gemm_mx_common.hpp' 2025-05-27 09:08:02 +00:00
aska-0096
71e7346bf4 Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4 2025-05-27 07:32:16 +00:00
aska-0096
137e28d151 temp save, 4.4~4.5 2025-05-27 07:31:16 +00:00
Ding, Yi
85ac576109 Merge gemm_mx_common.hpp 2025-05-27 06:13:03 +00:00
Ding, Yi
123053b685 Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4 2025-05-27 03:36:38 +00:00
aska-0096
61748eddba Add NT flag to B/BScale buffer 2025-05-27 02:26:43 +00:00
Ding, Yi
91eb136937 Fix v1; use M padding 2025-05-26 10:32:26 +00:00
aska-0096
d1d56e89ef fix the correctness issue 2025-05-26 09:29:36 +00:00
Ding, Yi
40af523e2c Add rotating to mx examples 2025-05-26 05:05:54 +00:00
aska-0096
4a3205f94a Merge branch 'wip-f4-wp' of https://github.com/ROCm/composable_kernel into wip-f4-wp 2025-05-26 02:22:09 +00:00
Lin, Qun
d5e7580473 correct a typo in tail 2025-05-25 19:22:47 -05:00
Andriy Roshchenko
fdfc9c6fd8 Merge remote-tracking branch 'origin/develop' into andriy/wip-f4 2025-05-23 23:02:43 +00:00
Andriy Roshchenko
f03da29b65 Merge branch origin/wip-f4 into andriy/wip-f4 2025-05-23 22:14:30 +00:00
Andriy Roshchenko
1c91f6bf1e Fix example_gemm_mx build 2025-05-23 22:00:07 +00:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
aska-0096
574d65efed temp save 2025-05-23 14:51:24 +00:00
joye
8afac88f89 fix f4 pipeline issues 2025-05-23 17:13:10 +08:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Andriy Roshchenko
715ad01bf2 Fix MX MFMA tests 2025-05-22 21:51:36 +00:00
Illia Silin
bc2551ac3b disable building device_mha_operations by default (#2225) 2025-05-22 14:03:04 -07:00
Adam Dickin
417a6b65b6 Add MIOPEN_REQ_LIBS_ONLY option for cmake to build only the libs MIOpen requires (#2224)
* cut out anything we dont need for MIOpen to test

* refactor exclusion code to be more streamlined.
2025-05-22 11:14:33 -07:00
Ding, Yi
ce50d4bd62 Fix fp4 ckProfiler 2025-05-22 09:39:22 +00:00
aska-0096
a4dae9eb86 optimize offset math in dma 2025-05-22 08:15:31 +00:00
aska-0096
7f7c4d35c7 lds conflict free + buffer load lds 2025-05-22 08:04:52 +00:00
Andriy Roshchenko
e302ab8f0c Merge branch origin/develop into wip-fp4 2025-05-22 06:31:47 +00:00
Aviral Goel
534d4594d0 Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy (#2214)
* window_origin variable now in base class

* abstracted more functions

* consolidated tile_window_static_distribution and tile_window_static_lengths

* clang format

* skeleton code for tile_window and tile_window_linear consolidation

* more abstraction

* moved variables from child to parent

* clang format

* removed comments

* removed debug code

* removed debug code

* abstracting traits WIP

* consolidated traits

* removed comments and clang formatted
2025-05-21 23:28:00 -07:00
Lin, Qun
97709c4aa1 correct preShuffleBuffer
we should used packed k to do shuffle.
2025-05-22 01:09:13 -05:00
Ding, Yi
352542c49e Better kernel selection in device classes 2025-05-22 06:05:10 +00:00
Lin, Qun
6f8e643629 fix 2 typos in fp4_preshuffle 2025-05-21 23:21:00 -05:00
Bartłomiej Kocot
ebc5a6ef87 Grouped conv bwd wei add for larger filter and Merge Groupes optimization (#2197)
* Grouped conv bwd wei add two stage instances for larger filter and Merge Groups

* Fix

* fix

* Restore removed instances

---------

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-05-21 22:47:34 +02:00
Aviral Goel
fa39c4e798 Add Doxygen Documentation for HostTesnor, HostTensorDescriptor, DeviceMem, FillUniformDistribution (#2160)
* added documentation for HostTensorDescriptor

* added documentation for DeviceMem and FillUniformDistribution

* fixed merging error

* fixed host_tensor_descriptor error

* clang format
2025-05-21 10:34:30 -07:00
Aviral Goel
990d645578 added gemm universal example in readme (#2216) 2025-05-20 15:35:07 -07:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
aska-0096
e1084fe7d6 tempsave. compile pass, function wrong 2025-05-20 10:57:26 +00:00
Ding, Yi
667a356cc3 Add mx fp4 pileline v1 instances 2025-05-20 03:28:24 +00:00
Ding, Yi
0c21ae4ead fix fp4 profiler 2025-05-20 03:06:12 +00:00
Aviral Goel
c4929225f6 remove debug statements from CMakeLists (#2204) 2025-05-19 17:31:04 -07:00
Jan Patrick Lehr
0970f22221 [CMake] Disable newly added compiler warning -Wnrvo (#2210)
Recently a new warning was added to Clang to warn when no copy-elision
on return happens. That prevents our CK build. This disables the
warning.
2025-05-19 17:30:15 -07:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
Andriy Roshchenko
57e0f5df29 MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199)
* Unify test interface for different layouts.

* WIP: Introducing FP4/FP6/FP8 abstractions

* WIP: Introducing packed storage abstraction

* WIP: Introducing packed storage abstraction

* WIP: Improved support for FP6 data type

* Refactor packed storage for f6_t

* WIP: FP6 MFMA test

* Test if we correctly represent all FP6/FP4 numbers

* Additional output for failed FP4 test.

* More failing conversion tests

* Even more failing conversion tests

* Working FP6 MFMA tests

* Expand MX MFMA testing to BF8/6

* Update and verify MX MFMA test for packed types

* Fix fp4 and fp6 conversions on host

* Working MX MFMA tests for FP8/6/4

* Cleanup

* Add missing type

* Cleanup

* Final cleanup

* Restrict FP6/4 values output to CK_LOGGING=1

* Use CHAR_BIT instead of number 8

* Fix typo

* Remove FP6 and FP4 from the list of native types

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
2025-05-19 16:52:51 -05:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00