Commit Graph

57 Commits

Author SHA1 Message Date
Anton Gorenko
1edd250115 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to produce more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly cast to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automatically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd
2025-09-27 18:03:48 +05:00
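Two of the fixes in the entry above describe narrowing pitfalls: a 64-bit seed/offset truncated by a 32-bit-only builtin, and a float implicitly converted to unsigned int before a raw store. The host-side sketch below only illustrates both pitfalls and one common way around each; it is not the code from the commit. `broadcast32` is a hypothetical stand-in for a 32-bit lane broadcast such as __builtin_amdgcn_readfirstlane, all function names are invented for illustration, and the bit pattern in the usage note assumes IEEE 754 single precision.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for a 32-bit-only wave broadcast.
// On a GPU this would return the first active lane's value; here it simply
// returns its argument so the example runs on any host compiler.
std::uint32_t broadcast32(std::uint32_t v) { return v; }

// Pitfall: pushing a 64-bit value through a 32-bit interface drops the
// upper 32 bits.
std::uint64_t broadcast64_truncating(std::uint64_t v)
{
    return broadcast32(static_cast<std::uint32_t>(v));
}

// One possible workaround (an assumption, not necessarily what the commit
// does): broadcast the low and high halves separately, then recombine.
std::uint64_t broadcast64_split(std::uint64_t v)
{
    const std::uint32_t lo = broadcast32(static_cast<std::uint32_t>(v));
    const std::uint32_t hi = broadcast32(static_cast<std::uint32_t>(v >> 32));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Pitfall: a value conversion from float to unsigned int changes the bits
// (1.5f becomes the integer 1).
std::uint32_t float_as_value(float f) { return static_cast<std::uint32_t>(f); }

// Bit-preserving alternative: copy the object representation unchanged.
std::uint32_t float_as_bits(float f)
{
    static_assert(sizeof(std::uint32_t) == sizeof(float), "float must be 4 bytes");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}
```

With these definitions, `broadcast64_truncating(0x100000002)` yields 2 while `broadcast64_split` returns the full value, and `float_as_bits(1.5f)` yields 0x3FC00000 while `float_as_value(1.5f)` yields 1.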
Khushbu Agarwal
b56e5d1d79 Fix for Add the API to load SGPR (#2913)
* Revert "Revert "[CK-Tile] Add the API to load SGPR  (#2878)" (#2904)"

This reverts commit f161b5b738.

* Fix: sgpr minor issue

* cyclic dependency resolved

* clang formatted

* removing unused variable

* clang formatted

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-25 10:32:42 -07:00
Sami Remes
dcd33a6ecc [CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903)
* Fix issue with constexpr checks in scaling/cshuffle

* Remove IsLoadableTile

* Move amd_wave_read_first_lane before first usage
2025-09-23 23:08:18 -07:00
asleepzzz
f161b5b738 Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)
This reverts commit 2cbbf5dcb3.
2025-09-23 14:33:51 -07:00
Thomas Ning
2cbbf5dcb3 [CK-Tile] Add the API to load SGPR (#2878)
* Have a workable version for SGPR

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* substitute with the new sgpr read api

* update the CHANGELOG

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* change to static for logic

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
2025-09-23 01:23:56 -07:00
linqunAMD
f22740df82 Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725)
* Enable xdl in gfx11 & gfx12

* update cmake file

* fix all instance build (cmake)

* fix batched_gemm_gemm(cmake)

* rebase cmake files

* fix cmake build error

* remove CK_ENABLE_DYNAMIC_WARP_SIZE

* update cmake build error2

* fix gfx11 build

CK_USE_XDL is enabled on gfx11 and gfx12

* fix gfx10 build

* fix gfx11 error

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-09-15 10:59:25 -07:00
Thomas Ning
1894a0dbc3 Fix the vector load & fix the gfx950 compv4 error (#2831) 2025-09-12 11:48:45 -07:00
Illia Silin
b4207c01c7 Revert "add vector load 16/32 for bf16/fp16 (#2779)" (#2818)
This reverts commit 7ecdba878f.
2025-09-10 13:35:15 -07:00
zjing14
7ecdba878f add vector load 16/32 for bf16/fp16 (#2779)
* add vector load 16/32 for bf16/fp16

* comment addressed

* clang format

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-09-09 23:15:19 -07:00
Yi DING
de61e55493 [CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715)
* FMHA avoid unnecessary vmcnt0

Squashed commit of the following:

commit 7bdf6a7eef
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:51 2025 +0000

    merge develop and solve conflicts

commit f21e916a8c
Merge: a7dd2a7d1 0db21053e
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:21 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue

commit a7dd2a7d13
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 02:17:43 2025 +0000

    update bwd

commit 380aa8f311
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Mon Aug 18 19:36:38 2025 +0000

    add restrict to applicable functions

commit b85daba2a3
Author: Ding, Yi <yi.ding@amd.com>
Date:   Mon Aug 18 02:07:03 2025 +0000

    bwd filter

commit 75c4b9372f
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Sat Aug 16 08:15:23 2025 +0000

    remove noinline attr as it causes a lot more s_waitcnt's

commit 598e3fec41
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 12:11:17 2025 +0000

    remove innerloop, move restrict parameters to mainloop and add noinline attribute.

commit 3340408537
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 07:06:51 2025 +0000

    Create inner lambda with restrict parameters, add restrict to some parameters

commit 3bc45ecbc7
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Aug 14 03:43:54 2025 +0000

    save for debug

commit de4db6c4c5
Merge: 108abf00e 68694cb78
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:15:22 2025 +0000

    Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 108abf00e0
Merge: 0810799e2 0f42a92fc
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:14:26 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 68694cb781
Merge: 0810799e2 20288caa2
Author: asleepzzz <hanwen.chang@amd.com>
Date:   Wed Aug 13 00:34:11 2025 +0800

    Merge branch 'develop' into wip-async-tr-fa

commit 0810799e25
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 14:25:50 2025 +0000

    refactor blockgemm change, isolate to v2;

commit fd1eb323af
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:26:13 2025 +0000

    clang format

commit 75f6f6bac4
Merge: bcc05eee6 8e1eb0c1e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:04:41 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit bcc05eee62
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 08:46:06 2025 +0000

    Fix the bug

commit 96d24497f5
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 04:02:41 2025 +0000

    fix conflict. disable all v-col instance for fmha fwd

commit 1716171be4
Merge: 1c9800790 4fde1646e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 03:52:34 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1c98007901
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:53:31 2025 +0000

    clang format

commit f43e903b1d
Merge: 3868ddd70 a7badc6ec
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:52:52 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 3868ddd708
Merge: 498d234ab 191c62967
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:59:40 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 498d234ab8
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:37:37 2025 +0000

    change the warp setting for hdim32 fmha fwd

commit b86f7786e2
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 14:21:09 2025 +0000

    tempsave, update the blocksync functions

commit 7b8052d7ca
Author: aska-0096 <haocwang@amd.com>
Date:   Sun Aug 10 06:00:51 2025 +0000

    fix bug in pki4

commit 76cbbb84a2
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Aug 9 03:25:12 2025 +0000

    fix bugs in gemm

commit 8c101ccb88
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 18:35:53 2025 +0000

    fix bug on non-gfx950

commit efb8549279
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 17:53:19 2025 +0000

    fix bug

commit 729e8785fb
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 15:42:15 2025 +0000

    fix bugs

commit 250dc13c75
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:31:01 2025 +0000

    fix clangformat with 18.1.3

commit 106edeecd9
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:07:40 2025 +0000

    remove non-necessary change

commit 78edd7303b
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:04:02 2025 +0000

    bug fix, clang format;

commit 3b9fb6af38
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 08:08:03 2025 +0000

    Remove unnecessary changes

commit 6bb57c2c57
Merge: 1ecee378d ab2602683
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 07:50:12 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1ecee378d5
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 06:19:31 2025 +0000

    remove unnecessary files; rename some files

commit b4640a9de6
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 05:46:18 2025 +0000

    merge fa_decode pipeline into fmha_fwd api

commit fe63a646a4
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 6 05:58:43 2025 +0000

    add __restrict__ to tr load

commit 414cad667b
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 5 07:23:51 2025 +0000

    Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

commit 0d12fc944f
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:27:42 2025 +0000

    Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

commit 4f31847de1
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:02:17 2025 +0000

    add vmcnt guard before load ktile

commit 746f4ccb99
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 06:49:01 2025 +0000

    Load Q through lds, implement xor;

commit 2d4e73d2b4
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 1 10:44:54 2025 +0000

    small refactor

commit a28b6e67fe
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 10:25:37 2025 +0000

    upgrade prefill pipeline; simple iglp; consistent data produce and consume order

commit 75cba48682
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 05:13:27 2025 +0000

    enable larger tile size; upgrade xor pattern

commit 69890afc98
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 12:25:33 2025 +0000

    remove all lds bankconflict with xor layouts

commit 8dacc35c4c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 03:51:06 2025 +0000

    enable prefill overload operator().

commit 13bcc913de
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 25 07:10:01 2025 +0000

    fix the lds alignment caused performance regression

commit af28123cec
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 23 09:05:57 2025 +0000

    remove unnecessary features

commit 14e0ab70c6
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 22 08:04:05 2025 +0000

    tempsave. asynccopy+trload sanity checked

commit 1b468bac0b
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Jul 21 05:55:55 2025 +0000

    tempsave, trload+asyncload done

commit afd96d8180
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 10:04:34 2025 +0000

    compile pass

commit 5616551115
Merge: ae39c84f5 095393276
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:17:27 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit ae39c84f55
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:16:39 2025 +0000

    tempsave

commit 94b6430489
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 10:06:09 2025 +0000

    temp save

commit 7e330553dc
Merge: 18669925c 804f77dce
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 07:24:32 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline

commit 804f77dce5
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 03:10:46 2025 +0000

    move test_copy into test

commit 21627d7ca7
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:41:31 2025 +0000

    remove unnecessary output

commit 287792c44a
Merge: a4221db30 21fd7e953
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:13 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit a4221db304
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:10 2025 +0000

    add input validation and bug fix

commit 21fd7e9538
Merge: d6df7bf85 6e76b8205
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Date:   Wed Jul 16 11:23:57 2025 -0700

    Merge branch 'develop' into test_copy_fix

commit d6df7bf851
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:55:50 2025 +0000

    fix vmcnt shift

commit 40e039e4e4
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:37:07 2025 +0000

    Improve s_waitcnt_imm calculation

commit c30f8b709b
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 05:39:50 2025 +0000

    fix the s_waitcnt_imm calculation

commit ec0a45b29f
Merge: e5cc4af80 6b09f0823
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:57:57 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit e5cc4af808
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:54:33 2025 +0000

    Add block_sync_lds_direct_load utility

commit eea58629cf
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 15 09:39:03 2025 +0000

    fix async copytest bug

commit 18669925cc
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 10 04:29:33 2025 +0000

    temp save, change all instance to 1wave

commit 18686cfe5b
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 8 08:37:20 2025 +0000

    tempsave, fmha_decode

commit 47565f21a5
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Jun 21 15:02:57 2025 +0000

    temp save, waiting for debug

commit e0a634ef97
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jun 19 05:11:52 2025 +0000

    save an example for __bf16 type

commit 4bd5fd4a3c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 07:27:24 2025 +0000

    fix bwd code

commit 69809d9513
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 06:37:16 2025 +0000

    Fix for fwd/bwd kernel build filter

commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 08:13:18 2025 +0000

    Add restrict to avoid unnecessary vmcnt

---------

Co-authored-by: aska-0096 <haocwang@amd.com>

* Add comments for C-style cast

* Better comments

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-08-25 20:55:12 +08:00
Po Yen Chen
0db21053e6 [CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669)
* Allow return VGPR get_warp_id() value

* Avoid using SALU in async_load_raw()
2025-08-22 10:17:05 +08:00
Po Yen Chen
4a7ecce096 [CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549)
* Support async load dwordx4

* Enlarge load size on gfx950
2025-08-22 10:13:47 +08:00
linqunAMD
9fcc1ee9fd Support Wave32 in CK_TILE - Part 1 (#2594)
* Support wave32/wave64 in CK_TILE - Part 1

* remove blocksize in kernel launch

* fix build error

* fix clang format

* fix clang format 2

* fix clang format 3

* fix fmha build error

* fix fmha build 2

* fix fmha build 3

* fix build error 4

* address review comment

* update change log

* replace KernelBlockSize with kBlockSize

* fix CI fail

* fix clang format

* address review comment and rebase code.

* fix universal test fail

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-18 10:08:31 -07:00
Tianyuan Wu
68134b60e4 [CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466)
* WMMA GEMM F16 Implementation

Signed-off-by: root <tianyuwu@amd.com>

* Self-review

Signed-off-by: root <tianyuwu@amd.com>

* ASIC check minor tweak

Signed-off-by: root <tianyuwu@amd.com>

* add missing include file

* Set GPU_TARGETS to gfx11/12 generic

Signed-off-by: root <tianyuwu@amd.com>

* INT8 GFX12

Signed-off-by: root <tianyuwu@amd.com>

* add int8x16 branch

* Fix CI script

Signed-off-by: root <tianyuwu@amd.com>

* Fix typo

Signed-off-by: root <tianyuwu@amd.com>

* Add CK_Tile WMMA example

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Fix CI

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* fix clang format

* Set M/N_Warp Back to Constant

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Use GemmConfigComputeV3 by default

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Remove CK_Tile wmma gemm examples from the CI list

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add atomic add fallback method for gfx11

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Omit copyright year

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Support non-square cases

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add get_device_ip()

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Add atomic add fallback method for gfx11"

This reverts commit 07a79e797d.

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"

This reverts commit ceee918007.

* Revise method name and typos

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Try fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Try fix CI"

This reverts commit 7a7241085e.

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo caused by merge

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Fix typo caused by merging

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

---------

Signed-off-by: root <tianyuwu@amd.com>
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-08-15 16:22:27 -07:00
Haocong WANG
05a6e92705 Re-enable optimization for gfx950 fmha fwd (#2671)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

* refactor blockgemm change, isolate to v2;

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-13 14:57:43 +08:00
asleepzzz
5b39de4bb6 Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)
This reverts commit b7322a521a.
2025-08-12 20:27:10 +08:00
Haocong WANG
b7322a521a Optimize fmha fwd decode & prefill for gfx950 (#2641)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-08-12 19:43:14 +08:00
Sami Remes
ffdee5e774 [CK_TILE] Enable printing more structures in CK-Tile (#2443)
* Add more printing to core cktile

* Revert other changes in static encoding pattern

* Refactor to using a free print() function

* Remove loops and print just the containers

* Print tuple with better formatting, fix sequence compilation

* Add some tests for print utility

* Add print utility header

* Print for static_encoding_pattern

* add buffer_view printing

* Align vector_traits

* Fix formatting

* Lower-case enum strings

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

* Remove empty comment lines

* Fix test with lower-case too

* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y

* Add test_print_common.hpp

* add print.hpp in core.hpp

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-08-07 15:45:27 +03:00
Adam Osewski
2622ff06cb Remove unused lds direct load instruction. (#2573)
This functionality is replaced by amd_async_buffer_load

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2025-08-06 15:16:12 +02:00
Thomas Ning
9d4b494f07 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-28 23:56:53 -07:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
Gheorghe-Teodor Bercea
cbfa62e4b6 Refactor async loads to work on all GPUs (#2545)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-26 22:04:59 -07:00
Haocong WANG
a5fdc663c8 fix async copytest bug (#2509)
* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* change bit width check

* refactor macros into constexpr functions

which still get inlined

* wrap s_waitcnt api

* parameterize test

* cleanup

* cleanup fp8 stub

* add fp8 test cases; todo which input parameters are valid?

* replace n for fp8 in test cases

* add large shapes; fp8 fails again

* change input init

* test sync/async

* time the test

* clang-format test

* use float instead of bfloat to cover a 4-byte type

* fix logic - arg sections should be 'or'd

* make block_sync_lds_direct_load interface similar to old ck

* fix a few comment typos

* name common shapes

* revert the example to original logic of not waiting lds

* clang-format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-23 00:14:02 -07:00
Yi DING
f1d8ad2818 [CK_TILE] Use read_tr in universal gemm (#2436)
* Use read_tr in universal gemm

* Enable all instances back

* Revert example37 changes

* Resolve comments

* resolve comments 2

* Fix assertion msg

* fix the gemm basic

* change index_t to bool for preshuffle variable

* Solve the comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
2025-07-16 23:56:22 -07:00
Illia Silin
e033a1b4bf fix compilation errors with clang20 (#2464) 2025-07-07 19:40:30 -07:00
Thomas Ning
f240ae3248 Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment some not gfx950 codes

* comment some not gfx950 codes

* fix an error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo error

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compile host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>
2025-07-07 10:08:49 -07:00
Thomas Ning
107e3623c7 Transpose builtin macro defense (#2374)
* add the macro defense

* add the static assert check
2025-06-20 11:24:54 -07:00
joyeamd
a2f01141aa transpose load api development (#2177)
* add transpose load; no real logic

* fix some compile errors

* fix some issues

* update transpose load logic

* add some fixes

* fix a distribution issue

* update some codes

* add some fix

* can pass; but no logic

* transpose load enable

* update tile transpose

* miss output tile distribution mapping

* hack for transpose 16x16

* update output tensor distribution

* delete unused variables

* fix transpose related codes

* update transpose load example

* exchange the iteration order

* fix 16x16 related dimension transpose

* fix a transpose index issue

* fix a transpose index issue

* fix clang format check

* update load tile transpose related codes

* fix compile errors and pass 16x16 tests

* fix a typo

* update logic

* check other data types

* add transpose load api

* update transpose load api

* fix clang format check

* change file name

* refactor codes

* update code name

* delete some unused codes

* delete the unused oob flag for transpose load

* update tensor view api for transpose load

* update for testing

* fix a typo error

* move transpose ops to example directory

* update transpose api

* update include file

* fix for pr review

* fix compile errors

* add transpose load; no real logic

* fix some compile errors

* fix some issues

* update transpose load logic

* add some fixes

* fix a distribution issue

* update some codes

* add some fix

* can pass; but no logic

* transpose load enable

* update tile transpose

* miss output tile distribution mapping

* hack for transpose 16x16

* update output tensor distribution

* delete unused variables

* fix transpose related codes

* update transpose load example

* exchange the iteration order

* fix 16x16 related dimension transpose

* fix a transpose index issue

* fix a transpose index issue

* fix clang format check

* update load tile transpose related codes

* fix compile errors and pass 16x16 tests

* fix a typo

* update logic

* check other data types

* add transpose load api

* update transpose load api

* fix clang format check

* change file name

* refactor codes

* update code name

* delete some unused codes

* delete the unused oob flag for transpose load

* update tensor view api for transpose load

* update for testing

* fix a typo error

* move transpose ops to example directory

* update transpose api

* update include file

* fix for pr review

* fix compile errors

* change directory name

* delete the duplicated directory

* update cmakelists file

* delete the unused codes

* update function names

* update transpose policy

* update code after remod.py

* update codes

* add some comment

* Polish the instr infrastructure

* build up the fixed instr

* redesign the transpose api, currently it has numerical error

* add the bf16 transpose

* fix some issues

* add some comments

* update document

* Finished the refactor of API and pass through the verification

* fix the merging issue

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-06-18 01:28:34 -07:00
Satyanvesh Dittakavi
4c57157d50 Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-06-17 11:54:30 -07:00
John Afaganis
6635d1bb88 Remove usage of 'warpSize' variable as it has been deprecated (#2295)
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround

* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo

---------

Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>
2025-06-10 07:34:54 -07:00
valarLip
8482977a37 extend buffer load to support load 32 bf16/fp16 at same time (#2291) 2025-06-06 17:21:19 +08:00
valarLip
0fdbf6bcd1 extend buffer load for fp16/bf16x16 (#2270)
* extend buffer load for fp16/bf16x16

* format
2025-06-02 10:29:54 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
carlushuang
4e9b76f88c [CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153)
* combine 2-3 as single stage

* support zeroing

* improve long tokens

* update specialization

* b16 ws

* 8bit topk optimize

* update 15 example
2025-05-06 17:32:07 +08:00
Juan Manuel Martinez Caamaño
f14e648e7c Replace inline assembly with builtins in FMHA (#2067)
* Replace inline assembly with builtins in FMHA

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-04-10 09:48:37 +02:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
Juan Manuel Martinez Caamaño
0f62dd9928 Fix typo: v_offset used in initialization of v_offset (#1951)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-06 17:37:29 -08:00
kylasa
66c5f5b0b6 Addressing (Post Merge) code review comments for PR 1845 (#1883)
* Addressing code review comments.

* Addressing code review comments.

* Reorganized code for better readability.

* add ck_tile gemms for new types in CI

* fix jenkins syntax

* fix script syntax

* Add the test cases back

* Address the review comments

* Address review comments

* clang format

* Solve the merging issues

* Addressed the comments

* clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-03-06 11:40:30 -08:00
Illia Silin
a88bf76ecc Replace buffer load/store intrinsics with builtins (#1876)
* replace buffer load/store intrinsics with builtins

* fix clang format

* replace buffer load/store intrinsics with built-ins in ck_tile

* fix clang format

* add switch between buffer intrinsics and built-ins

* change the builtins threshold to clang20

* fix clang format

* fix some compilation errors

* revert changes in ck_tile

* revert changes in ck_tile

* delete all root files and folders when CI completes

* try changing the username in CI

* fix groovy syntax

* add user and group id info to ci dockers

* change ownership of all files in CI to jenkins at the end

* update changelog
2025-03-05 14:33:28 -08:00
Bartłomiej Kocot
4d9973ec8e [CK TILE] GEMM with packed i4 (#1885)
* [CK TILE] GEMM with packed i4

* Fixes

* fixes

* fixes

* fixes
2025-02-20 09:59:49 +01:00
kylasa
ab5d027866 Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845)
* Support bf16/fp8/bf8 datatypes for ck_tile/gemm

* remove commented out code.

* Addressing code review comments and enabling universal_gemm for all the supported data types.

* Merge conflict resolution.

* Solve the memory pipeline compilation error. Merge with the new change of CShuffle

* finish the feature, pass the tests

* Fix the pipeline and add the benchmark script for other data types

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-02-07 15:07:06 -07:00
Mateusz Ozga
3c93d3c444 CK-Tile Grouped GEMM refactor and post PR fixes (#1756)
* Grouped gemm simple code refactor

* Offset invoker

* Invoke generic Run, and replace name of partitioner variable

* Tests fix type

* Removed namespaces

* Add template param to avoid implicit cast

* Remove generic function

* Constant value

* Set enum underlying type to int16_t

* Generalize partitioner function

* Remove whitespaces

* Rename function

* Using support

* Clang-format

* Clang-format

* Fn-partitioner description fn

* Typo

* Typo 2

* Better description

* Better description

* Refactor after review

* Use ctr instead of set fn

* Invoke ctr and typo

* Comments

* Remove unnecessary comment

* Review, remove modulo
2025-01-21 21:06:10 +01:00
Po Yen Chen
37cdbf4f0e [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed pipeline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stand out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow using different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warn developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly from block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistent kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated expressions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes from develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix compilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when writing lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debugging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove unnecessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-square warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix by using static_for to pass CI checks

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-12-20 14:41:01 +08:00
carlushuang
440e28b08f [CK_TILE] fused-moe first version (#1634)
* moe pipeline

* update code

* compile OK

* update

* update cpu reference

* update pipeline_gemm0

* compiler ok

* update pipeline

* rename to ex pipeline

* block-asm

* update

* update

* update first gemm ok

* compute correct

* update file structure

* update README

* update

* update

* update code

* update API

* return unsupported case

* add comment

* update readme

* update

* uncomment

* update

* fix build err

---------

Co-authored-by: valarLip <340077269@qq.com>
2024-11-26 11:14:56 +08:00
carlushuang
b098b71b05 topk_softmax (#1592)
* topk_softmax

* remove some file

* fix atomic linear_offset

* address various comment, and change sfc get_index api to static(tuple)
2024-10-26 23:52:49 +08:00
ltqin
0394f8a713 update layernorm (#1570)
* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 022365802b.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine naming

* Add reduce example

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-22 09:26:18 +08:00