667 Commits

Author SHA1 Message Date
YC Lin
ae275aa105 [GEMM] Refactor block gemm, pipeline, and policy of instruction schedule opt 2025-07-28 14:54:51 -04:00
YC Lin
6113ca8062 [Add] Add build option for generating assembly 2025-07-28 14:54:51 -04:00
YC Lin
97a960042b [GEMM] Refactor block gemm and pipeline policy of instruction schedule 2025-07-28 14:54:51 -04:00
Clement Lin
8785e6599e Add flash_attention_fwd toy_example 2025-07-28 14:54:51 -04:00
mhYang
a949b82c9f Update tile size and use slc 2025-07-28 14:54:51 -04:00
mhYang
9158612a9f Fix add flops calculation 2025-07-28 14:54:51 -04:00
ClementLinCF
88a4c7414f Create README.md 2025-07-28 14:54:51 -04:00
mhYang
ac972bfd11 Use mfma 16x16x32 2025-07-28 14:54:51 -04:00
mhYang
5326d403e4 Fix KERNEL_D config 2025-07-28 14:54:51 -04:00
YC Lin
fe319b97ae [GEMM] Add pragma message for different MFMA options 2025-07-28 14:54:51 -04:00
YC Lin
76751567b5 [GEMM] Fix print typos 2025-07-28 14:54:51 -04:00
Clement Lin
4c526ab140 Fix indentation typo 2025-07-28 14:54:51 -04:00
Clement Lin
5b10e9f3dd [GEMM] Fix MFMA condition checks 2025-07-28 14:54:51 -04:00
Clement Lin
a95665a6af [GEMM] Add new macro options check 2025-07-28 14:54:51 -04:00
Clement Lin
1099762267 [GEMM] Add macros for multiple optimization options 2025-07-28 14:54:51 -04:00
YC Lin
890a159877 [GEMM] default MFMA config 2025-07-28 14:54:51 -04:00
YC Lin
8d75ae7c96 git push test 2025-07-28 14:54:51 -04:00
root
a36d246cc0 [GEMM] fix MFMA configurations 2025-07-28 14:54:51 -04:00
mhYang
15e6f36f66 Adjust mfma schedule order 2025-07-28 14:54:51 -04:00
Clement Lin
e9f7c9bf42 [GEMM] Replace const auto with constexpr index_t 2025-07-28 14:54:51 -04:00
Clement Lin
cef77c1dcb [GEMM] Update cache-aware wg schedule 2025-07-28 14:54:51 -04:00
bobofang
127e742e96 Add MFMA M16N16K16 and M16N16K32 methods
these two methods are default off
2025-07-28 14:54:51 -04:00
YC Lin
e866f814f9 [GEMM] remove a_col_major/b_row_major case 2025-07-28 14:54:51 -04:00
root
bf69235cfb [GEMM] modify if-else locations 2025-07-28 14:54:51 -04:00
mhYang
ba8b5112c4 Fix AccDataType and CDataType
1. Fix AccDataType and CDataType
2. Remove indent
3. Align merge_transform for tutorial
2025-07-28 14:54:51 -04:00
mhYang
d6fd468603 Fix build error 2025-07-28 14:54:51 -04:00
root
b3986c32a6 [GEMM] disable/enable instruction scheduling 2025-07-28 14:54:51 -04:00
mhYang
42f2e21865 Fix missing message 2025-07-28 14:54:51 -04:00
mhYang
38ce4dd8c3 Fix xor transform dim. 2025-07-28 14:54:51 -04:00
Clement Lin
b03668fe8a [GEMM] Add cache-aware WG schedule and adjust block tile
113 -> 121.7 TFLOPS
2025-07-28 14:54:51 -04:00
mhYang
39ca852330 Add LDS bank conflict solutions 2025-07-28 14:54:51 -04:00
bobofang
22147ace51 Fix add accuracy issue
2673 GB/s -> 3271 GB/s
Perf: 0.0512898 ms, 3271.06 GB/s
2025-07-28 14:54:51 -04:00
root
d7d9fdaf1b [GEMM] use mfma k8 warp gemm 2025-07-28 14:54:51 -04:00
root
1b8d7cd1b9 [GEMM] disable/enable prefetch 2025-07-28 14:54:50 -04:00
Clement Lin
6a2036015e [CK TILE] Toy example - basic gemm 2025-07-28 14:54:50 -04:00
Clement Lin
077056b32d Adjust block shape
2673 GB/s -> 3647 GB/s
2025-07-28 14:54:50 -04:00
Clement Lin
2ff691f3f2 Utilize vectorized memory access
1998.24 GB/s -> 2673 GB/s
2025-07-28 14:54:50 -04:00
Clement Lin
078b5c68a0 Adjust the size of thread block
1968.42 GB/s -> 1998.24 GB/s
2025-07-28 14:54:50 -04:00
Clement Lin
8d205a9298 [CK TILE] Toy example - basic add 2025-07-28 14:54:50 -04:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
rocking
b36e0b029f [CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539)
* uncomment all the headdim, use optdim to control

* change default back to -1

* uncomment splitkv instance

* Fix typo in receipt 4 for appendkv

* support optdim for bwd, splitkv and appendkv

* Fix 192 key error

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
2025-07-28 17:16:32 +08:00
Max Podkorytov
821cd26c13 [CK-Tile] Merge transpose examples (#2450)
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
The problem occurred because config.time_kernel is false by default.
If it is false, there is no need to calculate perf; just print a proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
2025-07-26 21:51:54 -07:00
Bartłomiej Kocot
5741edf761 Fix clang format (#2567)
* clean

* clang format fix
2025-07-25 09:54:34 -07:00
rahjain-amd
78082855d8 Fixing 0ms and inf GB/s issue in img2col (#2565)
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
The problem occurred because config.time_kernel is false by default.
If it is false, there is no need to calculate perf; just print a proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
2025-07-25 21:15:50 +05:30
Enrico Degregori
b01a27ff22 Support b_scale: (#2350)
- extend pipeline v1 and v3
- add instances
- add tests
- add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-07-24 18:49:58 -07:00
Mateusz Ozga
b507d889c1 [CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520)
* Init commit new API

* apply clang-format

* PreShuffle preparing

* Apply Preshuffle condition to universal_gemm

* Fix: convert size_t to index_t

* Review changes

* Mode 100755 -> 100644

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-07-24 20:39:56 +02:00
Yi DING
4338346b10 Use filename but not path to filter compilation (#2556) 2025-07-24 17:38:14 +08:00
Yashvardhan Agarwal
606b0cc947 [CK_TILE] Support for elementwise kernel (#2246)
* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example

* Adding output strides to the kernel, transposition example and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference

* Fix comments in elementwise_example.cpp

- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able to benchmark against old CK

* Fix Clang formatting

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view

* Add isSupportedArg to Elementwise kernel + adapt example and unit tests

* Fix clang-format on the unit test file

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2025-07-24 11:21:45 +02:00
jakpiase
6681593864 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-07-24 10:41:35 +02:00
Illia Silin
1b6f024836 refactor fmha_bwd.py (#2546) 2025-07-23 09:09:56 -07:00