Commit Graph

2172 Commits

Author SHA1 Message Date
Bartlomiej Kocot
2fa55c9497 universal gemm support 2025-08-09 10:25:23 +00:00
Bartlomiej Kocot
87529070fd Problem descriptor 2025-08-08 15:05:13 +00:00
Bartlomiej Kocot
1b47cea817 Implementation descriptor 2025-08-08 14:00:05 +00:00
Bartlomiej Kocot
55a2a4cac7 Fixes problem descritpor 2025-08-07 10:28:54 +00:00
Bartlomiej Kocot
4d2ecedab2 Conv builder and solution descriptor 2025-08-06 13:45:02 -04:00
John Shumway
594858dd6e Add test code for filling random numbers and a simple GEMM.
Introduces ck_tile::test, with kernels for a simple GEMM and also initializing bf16 memory.
2025-08-05 16:46:35 +00:00
John Shumway
663b7a703c Clean up namespaces and locations of code.
Adds a new ck_tile::runtime for runtime utilities.

Also moves functionality into the builder. Kernel is accessed with the
ckb::Kernel function that is templated by the builder class.
2025-08-05 15:58:30 +00:00
John Shumway
fda9a28e18 Improve comments for single GEMM kernel args. 2025-08-04 23:10:17 +00:00
John Shumway
bac470446c Remove old explanation comment. 2025-08-04 22:59:48 +00:00
John Shumway
6a031b3663 Cleanup example source file. 2025-08-04 22:57:51 +00:00
John Shumway
e416d4ec66 Add code to launch kernel.
The Gemm kernel now executes, but still needs numerical checks.
2025-08-04 22:51:59 +00:00
John Shumway
e49ceff3f5 Fix builder code and choices to compile a GEMM kernel.
I've only verfied that the kernel compiles.

Some of my choices, like float32 types and having the epilogue set the member, are not valid template parameters. I now have this indentical to
a default GEMM universal kernel.

I also fixed some other small logical mistakes I made.

The code currently outputs the GetName results for some of the classes:

```
Kernel name: gemm_bf16_pipeline_AgBgCrCompV3_16x64x128_256_1x4_0x0x0
Shape:       tile_gemm_shape_16x64x128x4_1x4x1_16x16x32
Problem:     gemm_problem_256_0x0x0_Intrawave
Pipeline:    pipeline_AgBgCrCompV3_16x64x128_256_1x4_0x0x0
```
2025-08-04 16:50:37 +00:00
John Shumway
79d34d53dd Add a lot of GEMM template, commiting to save progress.
This commit has a lot of template code extracted from the gemm universal example, but it needs more debugging before we can compile the kernel object.
2025-08-02 00:32:50 +00:00
John Shumway
95bf30511d Move gemm builder to experimental subdirectory. 2025-08-01 18:12:07 +00:00
John Shumway
24f738596a Merge remote-tracking branch 'origin/develop' into jshumway/builder-experiment 2025-08-01 18:06:20 +00:00
Thomas Ning
7c44a763fa Fix the GFX 950 Universal GEMM (#2597)
* solve the gfx950 error

* clang format

* fix a typo error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>
2025-08-01 09:32:24 -07:00
Illia Silin
e6104daecc Add a daily CI stage to test AITER with latest CK. (#2598)
* add a CI stage for AITER testing
2025-08-01 07:55:51 -07:00
lalala-sh
bb5c478295 fix weight index out of range (#2414) 2025-08-01 17:50:02 +08:00
Aviral Goel
1441a0a7ee Integration of a new pipeline for weight preshuffle into gemm examples (#2516)
* something khushbu can help with

* v1 v2 works with flatmm develop

* v0 v1 v2 numerical error gone

* Fixing numerical error, and interchange preshuffle configs to match with flatmm

* Refactor GEMM pipeline configurations and integrate preshuffle support

- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments

* clang format

* fix vectorloadsize bug

* add the Preshuffle3

* update kwarp calculation in gemm utils

* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch

* fix: add CK_GFX950_SUPPORT macro for gfx950 detection

* default disable rotating buffer

* docs(CHANGELOG): update changelog for rocm 7.0

* Revert "docs(CHANGELOG): update changelog for rocm 7.0"

This reverts commit 2bc16fff84.

* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.

* revert example/ck_tile/flatmm to its original state

* remove comment added by second author

* switch to xor ALDSDescriptor

* modify the MakeALdsDescriptor()

* temporary profiling script

* getting rid of line marker compiler error

* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy

* add a minor fix for the config

* typo fix

* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2

* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp

* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp

* reenable the GemmSpatiallyLocalTilePartitioner

* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline

* remove hardcoded true for preshuffle bool template argument

* rename script

* remove gemm_profilie.sh script

* merge conflict resolve

* clang formatted

* typo fix

* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp

* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy

* Fix missing newline at end of file in run_gemm_example.inc

* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1

* addressing review comments

* removing debug code

* addressing review comments

* Revert "addressing review comments"

This reverts commit 29c45192ba.

* updating tile_engine code

* addressing review comments

---------

Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-08-01 00:04:54 -07:00
Khushbu Agarwal
88d72178d6 [CK_Tile] Updating gpu timer when doing flush cache (#2593)
* Missed updating function names in example

* updating timer

* code cleanup

* addressing review comments

* updating tile_engine code

* addressing review comments
2025-07-31 16:43:33 -07:00
Aviral Goel
546ef78d1d Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt

* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt

* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

---------

Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
2025-07-31 12:18:02 -07:00
Ville Pietilä
e962a41638 Automatic deduction of split-K value for grouped convolution (#2491)
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Use simple best occupancy model to calculate the split-K.

* Handle split-K autodeduction in explicit gemm conv.

* Add unit tests for split-K autodeduction.

* Remove oversubscription.

* Small fixes.

* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.

* Run clang formatting.

* Fix error handling in the conv profiler.

* Add missing documentation for the autodeducted split-K values.

* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.

* Fix clang formatting and split-K profiler documentation.

* Rename max_occupancy value variable.

* Calculate grid size for split-K autodeduction directly from input array shapes and template params.

---------

Co-authored-by: Ville Pietilä <>
2025-07-31 12:08:45 +02:00
John Shumway
4f7d3f7761 Clean up names and README.md and add comments.
The intention and direction of this work is more clear.
No functional changes.
2025-07-31 05:19:56 +00:00
Anton Gorenko
7b074249f4 [CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)
* Add tests for host convesion f32/f16 to f8

* Add tests for host convesion from f8 to f32/f16

* Fix UB and corner cases in f32/f16 to/from f8 conversion

* There are UBs when very small values are converted to f8: bitshifts
  can be larger that type width. Using unsigned long long does not help
  because exponent_diff >= 64 in such cases. This causes that values
  like 2.117582368e-22 are converted to non-zero f8 in host validation
  of FMHA tests, test_f8 crashes with segfault in completely irrelevant
  code like GTest internals or produces non-deterministic results etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
  conversion.

* Replace some magic numbers with values from numeric_traits

* Build tests only on devices supporting the type
2025-07-31 09:54:17 +05:00
John Shumway
1e70e5b3cf Add initial implementation of GemmBuilder
This is a prototype for building a GEMM using templates and C++20 features, especially concepts and consteval.

This PR only creates a dummy GEMM class, but tests some of the design patterns.

It also includes a utilty to alloc host buffers with RAII.
2025-07-31 04:35:11 +00:00
Illia Silin
e8709c24f4 upgrade clang-format version in install_precommit.sh (#2589) 2025-07-30 08:02:25 -07:00
Max Podkorytov
de0cdb4c31 [CK-tile] add gtest for ck-tile batched transpose kernels (#2585)
* add a dummy test file

* add kernel launch logic to the test

* transfer all test cases into gtest params

* factor kernel out into test config

* add load transpose pipeline tests

* add padded tests and skip invalid kernels at runtime

* enum class for pipeline type

* add multiwarp test cases

* fix type

* try to solve the problem

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-07-30 07:31:05 -07:00
Gino Lu
b25d512e8a add constexpr to pk_fp4::pack/unpack() (#2586) 2025-07-30 10:29:04 -04:00
Khushbu Agarwal
61e21f5567 Update to gpu_timer for rotating_buffer (#2524)
* update gpu_timer for rotating buffer as hipblasLt's implementation

* timing fix

* Updating gpu timer for old ck as well

* Revert "Updating gpu timer for old ck as well"

This reverts commit 958cd1bc99.

* code clean up with runtime argument; function rename

* code cleanup

* general timer fixes

* bug fix

* clang formatted

* addressing reveiew comments

* clang formatted

* Addressing review comments

* CI fix

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-29 15:21:05 -07:00
Illia Silin
b80099cc5f Revert "Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558)" (#2584)
This reverts commit c64a0c65b9.
2025-07-29 13:04:51 -07:00
Thomas Ning
9d4b494f07 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-28 23:56:53 -07:00
rocking
01642ca8b1 set default optdim (#2580) 2025-07-29 13:44:10 +08:00
Illia Silin
49723e94bb fix the clang-format (#2578) 2025-07-28 20:49:55 -07:00
Yi DING
1926cd0cb8 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-29 09:31:14 +08:00
Andres Lugo
7fe50dc3da Remove filter for only batch on receipt 4 (#2574)
Re-enable group mode instances for the Pytorch receipt and resolve linker errors for torch SDPA
2025-07-28 14:53:24 -07:00
Bartłomiej Kocot
5b244105d9 Enable multiple D for grouped conv fwd large tensors (#2572) 2025-07-28 22:39:07 +02:00
linqunAMD
0782ee8eb3 Remove !defined(__HIP_DEVICE_COMPILE__) in CK kernel (#2564)
* Remove HIP_COMPILE_DEVICE

* add missing files

* fix clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-28 13:01:07 -07:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
Illia Silin
9786087010 use ninja to build packages (#2575) 2025-07-28 11:04:12 -07:00
jefyang1
c64a0c65b9 Add gemm universal f8 f8 bf16 mk nk instances on gfx950 (#2558) 2025-07-28 09:03:54 -07:00
rocking
b36e0b029f [CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539)
* uncomment all the headdim, use optdim to control

* change default back to -1

* uncomment splitkv instance

* Fix typo in receipt 4 for appendkv

* support optdim for bwd, splitkv and appendkv

* Fix 192 key error

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
2025-07-28 17:16:32 +08:00
shay-li77
8ae528a1b4 fix mha bwd dbias random mismatch (#2570)
* fix mha bwd dbias random mismatch

* formatting code
2025-07-28 14:39:31 +08:00
Bartłomiej Kocot
685771b875 Enable bf16 RNE on gfx950 (#2542)
* Enable bf16 RNE for gfx950

* test bhalf

* fix

* fix

* Comments fixes

* fixes

* clean

* fix
2025-07-28 00:47:17 +02:00
Gheorghe-Teodor Bercea
cbfa62e4b6 Refactor async loads to work on all GPUs (#2545)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-26 22:04:59 -07:00
Max Podkorytov
821cd26c13 [CK-Tile] Merge transpose examples (#2450)
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
2025-07-26 21:51:54 -07:00
liang
d2459878cf reorder grid dim schedule (#2533)
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-26 02:46:55 +08:00
Bartłomiej Kocot
5741edf761 Fix clang format (#2567)
* clean

* clang format fix
2025-07-25 09:54:34 -07:00
rahjain-amd
78082855d8 Fixing 0ms and inf GB/s issue in img2col (#2565)
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
2025-07-25 21:15:50 +05:30
Adam Osewski
c8eb2f995c Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-07-25 10:34:31 +02:00
Enrico Degregori
b01a27ff22 Support b_scale: (#2350)
- extend pipeline v1 and v3
 - add instances
 - add tests
 - add example

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-07-24 18:49:58 -07:00