Commit Graph

975 Commits

Author SHA1 Message Date
Ville Pietilä
9d5e5f7188 Remove obsolete packed cast tensor slice transfers. 2025-09-03 11:00:41 +00:00
Ville Pietilä
70d57ca8b9 Remove separate packed cast step. 2025-09-03 10:57:16 +00:00
Ville Pietilä
d56d7bc821 Use fused packed cast. 2025-09-02 12:00:21 +00:00
Ville Pietilä
d22ec6633a Fused packed cast improvements. 2025-09-01 14:57:42 +00:00
Ville Pietilä
c69539fe3c Optimize LDS write order for packed cast. 2025-08-28 15:01:55 +00:00
Ville Pietilä
9f66d9fbca Packed cast improvement. 2025-08-27 14:32:00 +00:00
Ville Pietilä
b82e68c45f Bug fix. 2025-08-27 14:30:13 +00:00
Ville Pietilä
f1a7cfba26 Packed cast improvement. 2025-08-27 14:23:05 +00:00
Ville Pietilä
90b90dff08 Vectorized packed cast. 2025-08-27 13:29:38 +00:00
Ville Pietilä
54302c6f77 Remove obsolete version of the packed cast. 2025-08-27 11:28:55 +00:00
Ville Pietilä
481df169f2 Add packed cast to gridwise gemm multi d. 2025-08-27 11:27:03 +00:00
Ville Pietilä
6092643e9b Performant packed cast implementation. 2025-08-27 10:10:18 +00:00
Ville Pietilä
cfbd669455 WIP: Vectorized access. 2025-08-26 12:43:38 +00:00
Ville Pietilä
2302ea9bc6 Add the vectorized option for packed cast. 2025-08-26 12:31:01 +00:00
Ville Pietilä
905cfb6623 Use thread scratch buffer in bf16 conversion. 2025-08-26 09:33:50 +00:00
Ville Pietilä
1b858936cb WIP: Packed cast using threadwise scratch. 2025-08-25 15:31:50 +00:00
Ville Pietilä
4e7f9f7908 Fix packed cast tensor slice transfer. 2025-08-22 10:26:59 +00:00
Ville Pietilä
de93a48b04 Add back the separate packed cast step. 2025-08-20 11:31:07 +00:00
Ville Pietilä
6fbe1895f1 Code clean-up. 2025-08-20 08:19:13 +00:00
Ville Pietilä
0d34572f20 Code clean-up. 2025-08-19 18:37:01 +00:00
Ville Pietilä
b48ae7e447 Add perf test. Fix packed bf16 cast implementation. 2025-08-18 11:37:14 +00:00
Ville Pietilä
6b2b5e7c7c Small optimization to the packed cast. 2025-08-18 06:39:26 +00:00
Ville Pietilä
c0b8f66674 Add packed cast pipeline into gridwise gemm xdlops bwd weight. 2025-08-18 06:10:14 +00:00
Ville Pietilä
00a3ce734a Integrate new packed cast threadwise tensor slice transfer into gridwise gemm pipelines. 2025-08-15 12:06:44 +00:00
Ville Pietilä
51af3d7bac Fix a bug in the packed cast threadwise transfer. 2025-08-15 10:41:06 +00:00
Ville Pietilä
62c66a7d9c WIP: packed bf16 cast v3. 2025-08-14 12:39:18 +00:00
Ville Pietilä
938ff298b4 Add more unit tests. 2025-08-14 11:33:02 +00:00
Ville Pietilä
3ecc8aae74 Add unit test for vectorized packed cast. 2025-08-14 08:41:35 +00:00
Ville Pietilä
ade741dd45 WIP: PackedCast v3. 2025-08-13 15:13:35 +00:00
Ville Pietilä
50e318e072 Fix logging. 2025-08-12 15:53:00 +00:00
Ville Pietilä
ae4c727bc5 Add packed bf16 cast for universal GEMM. 2025-08-12 15:52:49 +00:00
Ville Pietilä
cee7644c85 Working version 2 of the packed cast. 2025-08-12 12:46:01 +00:00
Ville Pietilä
6148d1c75f WIP: Packed cast v2. 2025-08-11 15:18:30 +00:00
Ville Pietilä
c675563468 Add logging and specific unit tests for bf16 and gfx950. 2025-08-08 08:59:12 +00:00
Ville Pietilä
c47b80580d Fix build issues when __gfx950__ macro is enabled. 2025-08-08 08:01:42 +00:00
Ville Pietilä
4b8a559da9 Fixed packed_cast implementation for slice access. 2025-08-06 11:04:29 +00:00
Ville Pietilä
44202b9d32 WIP: Integration of packed cast into gridwise_gemm_xdl_cshuffle_conv_v3. 2025-08-05 15:12:36 +00:00
Ville Pietilä
e92c0bf68e Initial integration of packed cast. 2025-08-04 15:34:35 +00:00
Ville Pietilä
e06548675f Fix a bug in packed cast asm. Add more unit tests. 2025-08-04 09:16:12 +00:00
Ville Pietilä
9769fa68a7 Packed BF16 cast with asm volatile. 2025-08-01 13:22:41 +00:00
Ville Pietilä
b2b991d431 Added conversion of two floats into a packed bf16 value. 2025-07-31 15:13:37 +00:00
Ville Pietilä
e962a41638 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeducted split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>
2025-07-31 12:08:45 +02:00
Anton Gorenko
7b074249f4 [CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)
* Add tests for host conversion f32/f16 to f8

* Add tests for host conversion from f8 to f32/f16

* Fix UB and corner cases in f32/f16 to/from f8 conversion

* There are UBs when very small values are converted to f8: bitshifts
  can be larger than the type width. Using unsigned long long does not help
  because exponent_diff >= 64 in such cases. As a result, values
  like 2.117582368e-22 are converted to non-zero f8 in host validation
  of FMHA tests, and test_f8 crashes with a segfault in completely unrelated
  code like GTest internals, produces non-deterministic results, etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
  conversion.

* Replace some magic numbers with values from numeric_traits

* Build tests only on devices supporting the type
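
The shift-width problem described in this commit message can be illustrated with a minimal sketch. It is not taken from the commit: the helper name `safe_shift_right` and the 32-bit operand are assumptions chosen for illustration. In C++, shifting an unsigned integer by an amount greater than or equal to its bit width is undefined behavior, so a conversion routine that shifts a mantissa by a large exponent difference needs an explicit guard.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (illustration only, not CK code): a right shift that
// stays well defined for any shift amount.
// - shift <= 0     : the value is returned unchanged
// - shift >= width : every bit is shifted out, so the result is 0
// - otherwise      : ordinary right shift
constexpr std::uint32_t safe_shift_right(std::uint32_t value, int shift)
{
    constexpr int width = std::numeric_limits<std::uint32_t>::digits; // 32 bits
    if(shift <= 0)
    {
        return value;
    }
    if(shift >= width)
    {
        return 0u; // avoids the undefined behavior of `value >> shift`
    }
    return value >> shift;
}
```

With such a guard, a mantissa shifted by an exponent difference of 64 or more flushes to zero instead of yielding an arbitrary non-zero result.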
2025-07-31 09:54:17 +05:00
Gino Lu
b25d512e8a add constexpr to pk_fp4::pack/unpack() (#2586) 2025-07-30 10:29:04 -04:00
Khushbu Agarwal
61e21f5567 Update to gpu_timer for rotating_buffer (#2524)
* update gpu_timer for rotating buffer to match hipblasLt's implementation

* timing fix

* Updating gpu timer for old ck as well

* Revert "Updating gpu timer for old ck as well"

This reverts commit 958cd1bc99.

* code clean up with runtime argument; function rename

* code cleanup

* general timer fixes

* bug fix

* clang formatted

* addressing review comments

* clang formatted

* Addressing review comments

* CI fix

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-29 15:21:05 -07:00
Thomas Ning
9d4b494f07 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-28 23:56:53 -07:00
Illia Silin
49723e94bb fix the clang-format (#2578) 2025-07-28 20:49:55 -07:00
Yi DING
1926cd0cb8 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-29 09:31:14 +08:00
Bartłomiej Kocot
5b244105d9 Enable multiple D for grouped conv fwd large tensors (#2572) 2025-07-28 22:39:07 +02:00
linqunAMD
0782ee8eb3 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-28 13:01:07 -07:00