Commit Graph

917 Commits

Author SHA1 Message Date
joye
37a4a9aae4 fix a CRTP issue 2025-05-21 14:13:09 +08:00
joye
88601c8a05 update async policy 2025-05-21 13:31:13 +08:00
joye
ad32373ff1 fix clang format 2025-05-21 11:56:35 +08:00
joyeamd
8470702ac0 change async pipeline's tile distribution pattern from thread to warp 2025-05-21 11:53:13 +08:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
joye
5079e3f3a2 update lds descriptor 2025-05-20 17:14:29 +08:00
joye
fee156a37d fix a descriptor issue 2025-05-20 16:58:18 +08:00
joye
33c643001e fix some compiling errors 2025-05-20 15:52:33 +08:00
joye
55f3632901 fix a compiling error 2025-05-20 15:16:53 +08:00
joye
c71b3840e8 fix lds descriptor 2025-05-20 15:00:32 +08:00
joye
3500259bdc update async load apis 2025-05-20 13:26:49 +08:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
Andriy Roshchenko
57e0f5df29 MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199)
* Unify test interface for different layouts.

* WIP: Introducing FP4/FP6/FP8 abstractions

* WIP: Introducing packed storage abstraction

* WIP: Introducing packed storage abstraction

* WIP: Improved support for FP6 data type

* Refactor packed storage for f6_t

* WIP: FP6 MFMA test

* Test if we correctly represent all FP6/FP4 numbers

* Additional output for failed FP4 test.

* More failing conversion tests

* Even more failing conversion tests

* Working FP6 MFMA tests

* Expand MX MFMA testing to BF8/6

* Update and verify MX MFMA test for packed types

* Fix fp4 and fp6 conversions on host

* Working MX MFMA tests for FP8/6/4

* Cleanup

* Add missing type

* Cleanup

* Final cleanup

* Restrict FP6/4 values output to CK_LOGGING=1

* Use CHAR_BIT instead of number 8

* Fix typo

* Remove FP6 and FP4 from the list of native types

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
2025-05-19 16:52:51 -05:00
joye
91113f6464 comment some not gfx950 codes 2025-05-19 10:13:49 +08:00
joye
4a544c4084 comment some not gfx950 codes 2025-05-19 10:11:49 +08:00
joye
5c936dc8f3 add ignore header file 2025-05-19 09:34:26 +08:00
joye
0f2dfa8d38 fix async inline assembly 2025-05-19 09:33:17 +08:00
joye
2a9f0fff5d fix async inline assembly 2025-05-19 09:30:53 +08:00
joye
f0e1dbca49 fix some issues in async load 2025-05-19 08:56:03 +08:00
joye
7a03addc12 fix some compiling errors 2025-05-19 08:28:29 +08:00
arai713
5b3430b868 Narrowing error fix for codegen compilation (#2194)
* removed comment with special characters

* fix for arg/template change after merge from develop

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-05-16 11:11:54 -07:00
Mateusz Ozga
fa3c6811d8 Disable conv for Filter1x1Stride1Pad0 when K or C is even (#2186) 2025-05-16 10:18:47 +02:00
Po Yen Chen
791802b381 [CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198)
* Write soft-sign in inline asm

* Change tile idx computation

* Add macro to turn off soft-sign asm opt

* Use simple for loop to avoid register spill

* Only do block id transform for masking cases
2025-05-16 15:14:46 +08:00
joye
9c429ad0cc add a new pipeline for async load 2025-05-16 09:25:16 +08:00
Khushbu Agarwal
3d8d6e75e4 Adding validation for tile sizes in Tile Engine (#2189)
* Adding validation for tile sizes

* Add architecture in config, and shuffle lines of code in warp_gemm.hpp

* Enable MFMA for gfx950, and invalid tile handling
2025-05-15 10:28:31 -07:00
joye
acdef41575 add a pipeline which copies from v4 2025-05-15 18:00:30 +08:00
joye
0c15891dea fix some compiling errors 2025-05-15 14:54:13 +08:00
joye
8c09b31f72 fix a compiling error 2025-05-15 14:39:21 +08:00
joye
2fdcb55cf1 fix some compiling errors 2025-05-15 14:36:02 +08:00
joye
726ae62113 add async load api 2025-05-15 13:36:46 +08:00
joye
4dd9e8c0c8 add for async load builtin 2025-05-15 13:29:39 +08:00
BingYuan.Zhou
41c17d0a95 fix moe sorting build fail (#2190)
* fix moe sorting build fail

* refile code

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-05-14 09:31:26 +08:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
Bartłomiej Kocot
c53b7bd22e Switch to v2 pipeline for grouped conv bwd data (#2181)
* Change to old pipeline for grouped conv bwd data

* fix

* fix

* fix

* fix

* fix

* fix

* Fix
2025-05-13 10:14:30 +02:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to slience the compilation error

* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Khushbu Agarwal
f05e45ba59 Disable SMFMA gfx90a (#2184)
* sparsity fix for gfx90a

* reverting tile_engine changes
2025-05-12 09:56:23 -07:00
Thomas Ning
b49f7de81f Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled

* Solved the example code, but still have the profiler error

* Finished the feature

* Clang format and update the CHANGELOG

* solve the preshuffle v1 & v2 problem

* Comment Addressed

* Comment Addressed
2025-05-12 09:52:58 -07:00
Thomas Ning
9d1e44e56a Vectorized Transpose for Batched Transpose CK Tile Operator (#2131)
* Shared Memory for single data point

* CKTile Transpose vectorize CP1

* CKTile Transpose vectorize CP2

* CKTile Transpose vectorize CP2.1

* fixed the compile error of the transpose tile 2d

* Have the correct result for the current test sample

* Changes to printing tensor

* fp8 support added

* Debugging for transpose

* solving the corner issue

* Changed padding flag

* Intermideate Debugging

* Intermidiate Debugging

* Intermediate Debugging

* Finished debugging of the transpose op

* Code Cleanup

* Adding edge case smoke tests

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Addressing Review Comment

* Addressing Comments

* Addressing Comments

* Measuring Perf Tests

* Code Cleanup

* Changlog

* Added the running iterations

* clang format

* Fix the changelog

* Fix the compilation error

* change the printing factor

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2025-05-12 00:41:45 -07:00
Khushbu Agarwal
d8faf1c6a1 Support for swizzle and transpose for MFMA_16x16x32_F16/BF16 (#2172)
* Changes for updating tile distribution for shuffle and transpose

* Fixed swizzle and transpose, removed comments

* clang formatted

* Adding support for bf16 type

* Addressing review comments
2025-05-10 22:40:05 -07:00
Bartłomiej Kocot
6fddb5708c Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances

* fixes

* fix
2025-05-09 22:52:34 +02:00
jefyang1
6b1a339b6f Fix grouped conv bwd data tests on gfx950 (#2173) 2025-05-09 09:01:06 -07:00
Khushbu Agarwal
ef72a4b9bc Disable SMFMA for gfx90a (#2182) 2025-05-09 00:18:07 -07:00
Andriy Roshchenko
cb27e7c77f Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171)
* Re-enable MX GEMM instances

* Fix compilation error when building MX GEMM for multiple architectures
2025-05-08 13:26:03 -06:00
Thomas Ning
c757046d49 Revert "Disable the SMFMA instruction for gfx90a. (#2174)" (#2175)
This reverts commit a32d907771.
2025-05-08 00:07:03 -07:00
Khushbu Agarwal
a32d907771 Disable the SMFMA instruction for gfx90a. (#2174)
* remove smfma for gfx90a

* clang formatted
2025-05-07 23:09:22 -07:00
BingYuan.Zhou
6a3960c1e1 Flatmm merge (#2168)
* sync with function interface of cshuffleepiloge,fix flatmm build fail

* move code from solin/flatmm which add mfma16*16*32fp8 and optimize flatmm

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-05-08 12:59:57 +08:00
jakpiase
cb07ad84d5 fix for default epilogue (#2167) 2025-05-07 10:46:53 -07:00
Aviral Goel
769336b640 [CK_TILE] Add type traits to detect tile window types at compile time (#2158)
* added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline

* added type traits instead of enum to tile_window() and tile_window_linear() with debug comments

* removed comments, added documentation and clang format
2025-05-07 00:00:39 -07:00
Rostyslav Geyyer
8a0d659f92 Add FP4 MX MFMA tests (#2151)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add ops

* Add tests

* Add missing utils

* Update reference mx gemm

* Add f4x2 init mode

* Update host tensor utils

* Update chunk size for f4x2

* Add non scaled ops

* Add a type utility

* Update non scaled reference kernel

* Add non scaled tests

* Debug mfma arguments

* Add more debug info

* Update chunk size

* Update data layout

* Add more debugging

* Fix B stride

* Fix reference gemm

* Fix build

* One more reference fix

* Add more debug info

* Disable some tests

* Enable tests

* Add fp4 dimensions

* Update reference kernels

* Temp edits

* Remove leftovers

* Fix conflicts

* Clean up

* More clean up

* Revert "More clean up"

This reverts commit d8d35a0846.

* Add layouts to tests

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-06 09:24:00 -05:00