Commit Graph

95 Commits

Author SHA1 Message Date
Anton Gorenko
1e77695fe8 [CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests

* Add WMMA support to fwd FMHA pipelines

* Tune tile sizes a bit for less spilling

fp16 256 is still quite slow

* Fix Q grad tile distribution for warp size = 32 and hdim >= 256

With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.

* Use code based on BlockDropout in BlockDropoutBwd

* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal

* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
  because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
  it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
    * any number of warps (power-of-2), not only 4;
    * kN1 = 16, not only 32;
    * any number of splits;

* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now

* Replace hard-coded kN1 in dispatch code with the requested tile size

* Add gfx12-specific tile sizes for split KV

* Pass GPU architecture to kernel generation scripts

This is still a temporary solution.

* Build and run FMHA CI tests for gfx12

* Fix issue after merging

* Fix bwd tile sizes

The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).

* Use hardware f32->f8 on gfx12, remove v_perm

__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
 32-bit dword) is used to store results (two f8 values).

* Update changelog

* Add WMMA support to pagedkv

* Fix scripts after rebasing

* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Fix names after cherry-picking

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Do not use filters related to qr_async_trload

They disable tiles/pipelines which are valid for gfx12.

* Use different dstr encoding when C is transposed

* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code

Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().

* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution

Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.

* Generalize usages of WarpGemmDispatcher for MFMA and WMMA

WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.

* Mark has_load_tr as maybe_unused

There are no transpose loading for RDNA.

* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code

* Detect BlockSize on host based on warp size of the current device

If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.

* Fix calculation of grid size for combine kernel with warp size = 32

* Add missing includes and header

* Support multiple archs in one binary for fwd

* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill

* Support multiple archs in one binary for bwd

* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
  they can be used as fallbacks (similarly to fwd);

* Extract common code from register_traits

* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"

To simplify merging , the proper fix is in develop already.

* Support new numerical d paddings in trait ordering checks

* Build fp32 tests only on gfx9

* Do not use hardcoded M0 = 64 for dot bwd kernel

* Use textwrap.indent from standard library

* Make fp8 pipelines on gfx12 consistent with gfx9

* Update tests for current pipelines

* Make ninja check more responsive in CI

ninja buffers output so this job looks hanging.

* Support fp8fp32 by limiting O vector size

The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).

* Remove unused cmake options

* Unify including  amd_buffer_addressing.hpp/_builtins.hpp

* Temporarily use amd_buffer_addressing.hpp on >=gfx10

amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
 * 1 scalar for exec masks instead of 2,
 * gfx12 uses different instruction names etc.

* Update asm in bf16 conversions to work with warp 32

* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd

* Add arch tags to kernels/host funcs, compile for each arch separately

* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename

* Add workaround for miscompilation of bwd with padded hdim

SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.

* Fix add_gtest_executable for absolute paths

Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.

* Run only hdim 128 smoke tests for fp8fp32

There are no instances for hdim 64 and 256.

* Format py with ruff to simplify merging develop

* Fix incorrect var name

* Codegen for gfx9,gfx950 when --targets is not specified

Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).

* Combine arch-related values into ArchTrait

This more centralized approach removes duplication of various formatting templates.

* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"

Some code is extracted into a function.
2025-10-29 13:31:08 -07:00
Haocong WANG
0d3860dfdb [CKTILE] FMHA fwd trload lse fix (#3046)
* enable storelse for fmha_fwd_trload kernel

* fix lse in trload

* fix the mask related bug
2025-10-23 09:33:33 +08:00
Anton Gorenko
1edd250115 [CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Add F32 MFMA warp gemms

* Support f32 in fwd FMHA

* Implement transpose_vectors for 4-byte types (float)

* Fix unexpected implicit f32->uint32 cast in buffer_store<4>

__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint).
mbuf_t types in other buffer_store<> are changed for consistency.

* Support F32 in bwd FMHA

hdim = 256 is disabled for now because it uses too much memory on gfx90a

* Support Headdim = 48 (divisible by 16) in fwd

* Add fp32-specific receipts (800 and 801)

* Tune fwd tiles

* Tune bwd tiles

* Use small tiles only for small seqlen_q

* Fix after rebasing

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Remove constraints and adjust filtering for fp32

Custom constraints are no longer needed because now the smallest tile
is selected automtically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.

* Add fp32 tests

* Make splitkv and appendkv compile for fp32 only

There are no instances yet, but API still must compile when only fp32 is
requested.

* Remove unimportant f32 instances

* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS

* Replace magic numbers with a constant, improve comments for dropout

* Update changelog

* Fix condition that dq_acc must be set to zero when mask is used

The change was introduced in #2799

* Replace warp_uniform with recently added amd_wave_read_first_lane

* Add hdim = 96 and 192 to fwd
2025-09-27 18:03:48 +05:00
Anton Gorenko
c6bfd97c2d [CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934)
* Fix validation of rotary embedding with time_kernel_

When rotary embedding is used, the appendkv kernel modifies the q tensor
(multiple times when time_kernel_ is set). We need to reset the q buffer
and rerun all kernels.

* Fix synchronization issue in splitkv combine pipeline

Different warps can read and then rewrite the same values of lse_acc_lds.
Sometimes warps progress at different speeds, one warp can rewrite
values that are still being read by another warp.

Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/*KV*"
2025-09-27 08:16:10 +05:00
Yi DING
32773fe5cb [CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918) 2025-09-26 16:42:59 +08:00
Khushbu Agarwal
b56e5d1d79 Fix for Add the API to load SGPR (#2913)
* Revert "Revert "[CK-Tile] Add the API to load SGPR  (#2878)" (#2904)"

This reverts commit f161b5b738.

* Fix: sgpr minor issue

* cyclic dependency resolved

* clang formatted

* removing unused variable

* clang formatted

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-25 10:32:42 -07:00
Yi DING
fe0a47a011 [CK_TILE] FMHA BWD Add D96 Instances (#2916) 2025-09-24 17:04:23 +08:00
asleepzzz
f161b5b738 Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)
This reverts commit 2cbbf5dcb3.
2025-09-23 14:33:51 -07:00
Haocong WANG
959df2a155 [FMHA FWD] gfx950 Accuracy enhancement & bug fix (#2900)
* disable cast_tile_pk_fp16_fp32 on gfx950

* fix wrong encoding when hdim is not exponentiation of 2

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-09-24 00:59:41 +08:00
Thomas Ning
2cbbf5dcb3 [CK-Tile] Add the API to load SGPR (#2878)
* Have a workable version for SGPR

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* substitute with the new sgpr read api

* update the CHANGELOG

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.

* change to static for logic

* have a workable version for atomic add

* Revert "have a workable version for atomic add"

This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
2025-09-23 01:23:56 -07:00
Haocong WANG
b6e8994386 [CK_TILE] FMHA FWD bug fix (#2888)
* tempsave debug

* fix the bug in fmha fwd_kernel

* Remove unnecessary changes

* Fix the buggy part

* remove fmha fwd known failure cases
2025-09-23 15:00:46 +08:00
Yi DING
6cf3fdd21c [CK_TILE] FMHA BWD Fix Decode Accuracy (#2881)
* [CK_TILE] FMHA BWD Fix Decode Accuracy

* use s_waitcnt utils
2025-09-19 21:45:02 +08:00
Anton Gorenko
2aec38f9ec [CK_TILE] FMHA Fix synchronization issues in BWD pipelines (#2876)
* Run ctest with --output-on-failure

* Fix synchronization issues in bwd pipelines

The bwd kernel reuses the same area of LDS for ds (SGrad), bias and
dbias (BiasGrad). This means that there must be block_sync_lds between
loading one tensor and storing another to the same area.

Heavy instructions like MFMA/WMMA and global loads are executed between
reuses of the same memory so in MOST cases loading is finished by all
warps before storing is started. However, sometimes warps progress at
different speeds.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:

bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure
2025-09-19 11:34:45 +05:00
Po Yen Chen
7fbc9d6c97 [CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833)
* Re-mapping thread block indices for causal=True kernels

* Use more intuitive remap_opt value

* Fallback to origin remapping if seqlen_q >= 64K

* Use GenericAttentionMask to reduce mask computation

* Avoid unnecessary boundary check for IsMasking=false case

* Fix wrong kernel entry specifier

* Add s_nop to prevent delay wave0-3

* Refine scheduling

* Remove unnecessary sched_group_barrier()

* Move sched_group_barrier() call to scheduler

* Replace inline asm s_setprio with intrinsics

* Rephrase comments

* Expend some o_acc rescaling insts to avoid SIMD idle

* Fix block idx special mapping logic

* Tune block index mapping for causal=False cases

* Tune block index mapping for causal=True cases

* Fix wrong vmcnt()

* Remove parameter name

* Use boolean option for turn on/off causal mask

* Update benchmark_fwd_v3.sh option usages

* Add option if compiler support it
2025-09-16 11:32:38 +08:00
Yi DING
5ff205ca79 Reapply "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757) (#2761)
This reverts commit 038ea82315.
2025-09-08 09:21:14 -07:00
Thomas Ning
42a43d1523 [FIX] fix on fmha_bwd (#2784)
* fix on fmha_bwd

* Add 'const' to the Default2DEpilogue call operator

* Fix more calls to Default2DEpilogue

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
2025-09-08 14:31:27 +08:00
Po Yen Chen
d876e87fe4 [CK_TILE] Add FAv3 fwd pipeline (#2731)
* Add FAv3 fwd pipeline

* Unpack v_pk_mul to hide v_mov

* Avoid compiler moving l compute across phase

* Sync sched_group_barrier() setting for masking cases
2025-09-01 09:16:45 +08:00
asleepzzz
038ea82315 Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit ead4447b20.
2025-08-28 22:50:42 +08:00
Yi DING
ead4447b20 [CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192

* Use buffer_load_lds for lse/d

* Dispatch & cleanup

* Avoid zeroing dq & fix

* fix
2025-08-28 18:54:18 +08:00
Linjun-AMD
bf7b458e6e use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711)
* add k_lds padding and iglp to improve dim256 fmha fwd

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update  block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp

* clang format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* use same naming style

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-28 11:39:39 +08:00
Yi DING
de61e55493 [CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715)
* FMHA avoid unnecessary vmcnt0

Squashed commit of the following:

commit 7bdf6a7eef
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:51 2025 +0000

    merge develop and solve conflicts

commit f21e916a8c
Merge: a7dd2a7d1 0db21053e
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 22 03:15:21 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue

commit a7dd2a7d13
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 02:17:43 2025 +0000

    update bwd

commit 380aa8f311
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Mon Aug 18 19:36:38 2025 +0000

    add restrict to applicable functions

commit b85daba2a3
Author: Ding, Yi <yi.ding@amd.com>
Date:   Mon Aug 18 02:07:03 2025 +0000

    bwd filter

commit 75c4b9372f
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Sat Aug 16 08:15:23 2025 +0000

    remove noinline attr as it causes a lot more s_waitcnt's

commit 598e3fec41
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 12:11:17 2025 +0000

    remove innerloop, move restrict parameters to mainloop and add noinline attribute.

commit 3340408537
Author: Kevin Choi <kevin.choi@amd.com>
Date:   Thu Aug 14 07:06:51 2025 +0000

    Create inner lambda with restrict parameters, add restrict to some parameters

commit 3bc45ecbc7
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Aug 14 03:43:54 2025 +0000

    save for debug

commit de4db6c4c5
Merge: 108abf00e 68694cb78
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:15:22 2025 +0000

    Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 108abf00e0
Merge: 0810799e2 0f42a92fc
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 13 02:14:26 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 68694cb781
Merge: 0810799e2 20288caa2
Author: asleepzzz <hanwen.chang@amd.com>
Date:   Wed Aug 13 00:34:11 2025 +0800

    Merge branch 'develop' into wip-async-tr-fa

commit 0810799e25
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 14:25:50 2025 +0000

    refactor blockgemm change, isolate to v2;

commit fd1eb323af
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:26:13 2025 +0000

    clang format

commit 75f6f6bac4
Merge: bcc05eee6 8e1eb0c1e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 09:04:41 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit bcc05eee62
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 08:46:06 2025 +0000

    Fix the bug

commit 96d24497f5
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 04:02:41 2025 +0000

    fix conflict. disable all v-col instance for fmha fwd

commit 1716171be4
Merge: 1c9800790 4fde1646e
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 03:52:34 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1c98007901
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:53:31 2025 +0000

    clang format

commit f43e903b1d
Merge: 3868ddd70 a7badc6ec
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 12 01:52:52 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 3868ddd708
Merge: 498d234ab 191c62967
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:59:40 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 498d234ab8
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 15:37:37 2025 +0000

    change the warp setting for hdim32 fmha fwd

commit b86f7786e2
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 11 14:21:09 2025 +0000

    tempsave, update the blocksync functions

commit 7b8052d7ca
Author: aska-0096 <haocwang@amd.com>
Date:   Sun Aug 10 06:00:51 2025 +0000

    fix bug in pki4

commit 76cbbb84a2
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Aug 9 03:25:12 2025 +0000

    fix bugs in gemm

commit 8c101ccb88
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 18:35:53 2025 +0000

    fix bug on non-gfx950

commit efb8549279
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 17:53:19 2025 +0000

    fix bug

commit 729e8785fb
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 15:42:15 2025 +0000

    fix bugs

commit 250dc13c75
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:31:01 2025 +0000

    fix clangformat with 18.1.3

commit 106edeecd9
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:07:40 2025 +0000

    remove non-necessary change

commit 78edd7303b
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 09:04:02 2025 +0000

    bug fix, clang format;

commit 3b9fb6af38
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 08:08:03 2025 +0000

    Remove unnecessary changes

commit 6bb57c2c57
Merge: 1ecee378d ab2602683
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 07:50:12 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit 1ecee378d5
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 06:19:31 2025 +0000

    remove unnecessary files; rename some files

commit b4640a9de6
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 8 05:46:18 2025 +0000

    merge fa_decode pipeline into fmha_fwd api

commit fe63a646a4
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Aug 6 05:58:43 2025 +0000

    add __restrict__ to tr load

commit 414cad667b
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Aug 5 07:23:51 2025 +0000

    Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

commit 0d12fc944f
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:27:42 2025 +0000

    Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

commit 4f31847de1
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 10:02:17 2025 +0000

    add vmcnt guard before load ktile

commit 746f4ccb99
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Aug 4 06:49:01 2025 +0000

    Load Q through lds, implement xor;

commit 2d4e73d2b4
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Aug 1 10:44:54 2025 +0000

    small refactor

commit a28b6e67fe
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 10:25:37 2025 +0000

    upgrade prefill pipeline; simple iglp; consistent data produce and consume order

commit 75cba48682
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 31 05:13:27 2025 +0000

    enable larger tile size; upgrade xor pattern

commit 69890afc98
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 12:25:33 2025 +0000

    remove all lds bankconflict with xor layouts

commit 8dacc35c4c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 30 03:51:06 2025 +0000

    enable prefill overload operator().

commit 13bcc913de
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 25 07:10:01 2025 +0000

    fix the lds alignment caused performance regression

commit af28123cec
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 23 09:05:57 2025 +0000

    remove unnecessary features

commit 14e0ab70c6
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 22 08:04:05 2025 +0000

    tempsave. asynccopy+trload sanity checked

commit 1b468bac0b
Author: aska-0096 <haocwang@amd.com>
Date:   Mon Jul 21 05:55:55 2025 +0000

    tempsave, trload+asyncload done

commit afd96d8180
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 10:04:34 2025 +0000

    compile pass

commit 5616551115
Merge: ae39c84f5 095393276
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:17:27 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa

commit ae39c84f55
Author: aska-0096 <haocwang@amd.com>
Date:   Fri Jul 18 05:16:39 2025 +0000

    tempsave

commit 94b6430489
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 10:06:09 2025 +0000

    temp save

commit 7e330553dc
Merge: 18669925c 804f77dce
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 07:24:32 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline

commit 804f77dce5
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 03:10:46 2025 +0000

    move test_copy into test

commit 21627d7ca7
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:41:31 2025 +0000

    remove unnecessary output

commit 287792c44a
Merge: a4221db30 21fd7e953
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:13 2025 +0000

    Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit a4221db304
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 17 02:26:10 2025 +0000

    add input validation and bug fix

commit 21fd7e9538
Merge: d6df7bf85 6e76b8205
Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Date:   Wed Jul 16 11:23:57 2025 -0700

    Merge branch 'develop' into test_copy_fix

commit d6df7bf851
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:55:50 2025 +0000

    fix vmcnt shift

commit 40e039e4e4
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 08:37:07 2025 +0000

    Improve s_waitcnt_imm calculation

commit c30f8b709b
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 05:39:50 2025 +0000

    fix the s_waitcnt_imm calculation

commit ec0a45b29f
Merge: e5cc4af80 6b09f0823
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:57:57 2025 +0000

    Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix

commit e5cc4af808
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jul 16 03:54:33 2025 +0000

    Add block_sync_lds_direct_load utility

commit eea58629cf
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 15 09:39:03 2025 +0000

    fix async copytest bug

commit 18669925cc
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jul 10 04:29:33 2025 +0000

    temp save, change all instance to 1wave

commit 18686cfe5b
Author: aska-0096 <haocwang@amd.com>
Date:   Tue Jul 8 08:37:20 2025 +0000

    tempsave, fmha_decode

commit 47565f21a5
Author: aska-0096 <haocwang@amd.com>
Date:   Sat Jun 21 15:02:57 2025 +0000

    temp save, waiting for debug

commit e0a634ef97
Author: aska-0096 <haocwang@amd.com>
Date:   Thu Jun 19 05:11:52 2025 +0000

    save an example for __bf16 type

commit 4bd5fd4a3c
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 07:27:24 2025 +0000

    fix bwd code

commit 69809d9513
Author: aska-0096 <haocwang@amd.com>
Date:   Wed Jun 18 06:37:16 2025 +0000

    Fix for fwd/bwd kernel build filter

commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba
Author: Ding, Yi <yi.ding@amd.com>
Date:   Tue Aug 19 08:13:18 2025 +0000

    Add restrict to avoid unnecessary vmcnt

---------

Co-authored-by: aska-0096 <haocwang@amd.com>

* Add comments for c-stype cast

* Better comments

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-08-25 20:55:12 +08:00
John Shumway
c71d7ddd74 Remove unsupported use of c++20 concept. (#2719)
Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
2025-08-24 21:29:23 -07:00
Po Yen Chen
4a7ecce096 [CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549)
* Support async load dwordx4

* Enlarge load size on gfx950
2025-08-22 10:13:47 +08:00
Yi DING
4cfa2c7158 [CK_TILE] FMHA BWD Fix Compilation with Bias (#2682)
* [CK_TILE] FMHA BWD Fix Compilation with Bias

* Fix appendkv kApplyRoPE
2025-08-22 10:01:10 +08:00
Tianyuan Wu
68134b60e4 [CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466)
* WMMA GEMM F16 Implementation

Signed-off-by: root <tianyuwu@amd.com>

* Self-review

Signed-off-by: root <tianyuwu@amd.com>

* ASIC check minor tweak

Signed-off-by: root <tianyuwu@amd.com>

* add missing include file

* Set GPU_TARGETS to gfx11/12 generic

Signed-off-by: root <tianyuwu@amd.com>

* INT8 GFX12

Signed-off-by: root <tianyuwu@amd.com>

* add int8x16 branch

* Fix CI script

Signed-off-by: root <tianyuwu@amd.com>

* Fix typo

Signed-off-by: root <tianyuwu@amd.com>

* Add CK_Tile WMMA example

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Fix CI

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* fix clang format

* Set M/N_Warp Back to Constant

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Use GemmConfigComputeV3 by default

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Remove CK_Tile wmma gemm examples from the CI list

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add atomic add fallback method for gfx11

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Omit copyright year

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Support non-square cases

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add get_device_ip()

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Add atomic add fallback method for gfx11"

This reverts commit 07a79e797d.

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"

This reverts commit ceee918007.

* Revise method name and typos

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Try fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Try fix CI"

This reverts commit 7a7241085e.

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo caused by merge

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Fix typo caused by merging

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

---------

Signed-off-by: root <tianyuwu@amd.com>
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-08-15 16:22:27 -07:00
Haocong WANG
05a6e92705 Re-enable optimization for gfx950 fmha fwd (#2671)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

* refactor blockgemm change, isolate to v2;

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-13 14:57:43 +08:00
asleepzzz
5b39de4bb6 Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)
This reverts commit b7322a521a.
2025-08-12 20:27:10 +08:00
Haocong WANG
b7322a521a Optimize fmha fwd decode & prefill for gfx950 (#2641)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-08-12 19:43:14 +08:00
Yi DING
8e1eb0c1ee [CK_TILE] FMHA BWD Decode Pipeline (#2643)
* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2
2025-08-12 17:02:52 +08:00
Yi DING
4fde1646e5 [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp
2025-08-12 11:11:55 +08:00
Yi DING
b0a97498b0 [CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550)
* Remove unnecessary pssk

* Add BlockFmhaBwdDQDKDVPipeline wrapper

* Resolve copilot comments & Remove kpad & fix

* Remove spad
2025-08-07 21:24:43 +08:00
Illia Silin
49723e94bb fix the clang-format (#2578) 2025-07-28 20:49:55 -07:00
Yi DING
1926cd0cb8 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-29 09:31:14 +08:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
shay-li77
8ae528a1b4 fix mha bwd dbias random mismatch (#2570)
* fix mha bwd dbias random mismatch

* formatting code
2025-07-28 14:39:31 +08:00
Po Yen Chen
722c22fb15 Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389)" (#2514)
This reverts commit b2dea90116.
2025-07-17 10:09:01 +08:00
Qianfeng
45904b8fd7 Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* i_nhead_ conversion type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
Po Yen Chen
b2dea90116 Eliminate warning caused by failed to meet occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>
2025-07-08 09:17:25 +08:00
ltqin
9f4c5d7372 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* comipling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of  kernerl with pagedkv

* change names of file that are generated

* chang host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
Yi DING
b8212864cf [CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114)
* 160+192

* Add splitkv d160

* cleanup

* fix

* Add change log

* Fix CHANGELOG

* Use static_cast

* Update ignored instance

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-06-24 01:33:31 +08:00
Max Podkorytov
0366fb2abc Update for xformers (#2372)
* update api

* update kernel api

* clang-format
2025-06-22 00:28:30 -07:00
Satyanvesh Dittakavi
4c57157d50 Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-06-17 11:54:30 -07:00
MHYangAMD
9fcf21a4ec Fix fmha fwd precision issue on MI3XX series (#2285)
* Fix fmha fwd precision issue on MI3XX series

For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero.

For examaple, fixing K,V to be all 1, and Q is random, which outputs are
expected to be all 1. But we found that it would have some incorrect
outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 =
0.0005 < 0.001) Thus, ck do not report this error.

* Add option to switch rtn/rtz for fmha fwd
2025-06-10 15:03:23 +08:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Po Yen Chen
791802b381 [CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198)
* Write soft-sign in inline asm

* Change tile idx computation

* Add macro to turn off soft-sign asm opt

* Use simple for loop to avoid register spill

* Only do block id transform for masking cases
2025-05-16 15:14:46 +08:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to slience the compilation error

* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
Qianfeng
4f54fa3058 Ck tile/complete k prefetch (#1941)
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce

* Remove last block_sync_lds() in the loop

* Tiny adjustment in qr_ks_vs_async pipeline for better performance

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline

* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline

* Use un-rolled gemm for Gemm-0

* Use k0_loops small tile load/store to replace the big tile load/store for K

* Remove the commented lines in qx_ks_vs_custom_policy.hpp

* Tune the prefetching of V in qr_ks_vs_async pipeline

* Move the codes for storing the first v_lds tile some later

* Let BlockDropout reuse LDS with V

* Switch to separate code blocks according to iteration index

* Interleave code blocks for better performance

* Move clear_tile(s_acc) for better interleaving

* Move code interleaving

* Use MakeQDramTileDistribution for q_dram_window

* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop

* Let V reuse the LDS of K

* Use array of tiles to represent Q in vgprs

* Use QLoadOnce == false for qr_ks_vs_async pipeline

* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline

* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()

* Move the definition of v_tiles out from the loop

* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()

* Remove using KLoadOnce in qx_ks_vs_custom_policy

* Remove un-used get_slice_tile() call

* Move the code line of clear_tile(s_acc)

* Tune the lines of codes to make them more tidy

* Re-arrange the codes before the main-loop

* Add comments

* Unify the alignment to be 8 for Q/K/V Lds decriptors

* Tuning to K pre-loading

* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false

* Adjust the pipeline codes

* Use NumPrefetchV to separate from NumVLdsBuffers

* Tune the location of a scheduler barrier code line

* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths

* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel

* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline

* Move the location for store_tile() of first v_tile

* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline

* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>

* Try to fix old bugs in qx_ks_vs_custom_policy

* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler

* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy

* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)

* Tiny Fix (#1888)

* Ck tile/paged attention workaround (#1894)

* Correction in GetRangeAlongX()

* Work-around to solve the failures in test_paged_attention_ck in xformers

* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline

* Remove one call of move_tile_window for q_dram_window

* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()

* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline

* Adjust the location of codes for storing the first V tile to LDS

* Tiny fix and add comments

* Change GetSmemKPackK size to improve performance

* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy

* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access

* Adjustment in intermediate iteration codes for tiny performance improvement

* Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation

* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue

* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()

* Remove useless AsyncopyV

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used

* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy

* Recover the qr_ks_vs_async pipeline

* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline

* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"

This reverts commit 39b82ca194.

* Tiny fix with regard to whole_k_prefetch pipeline compiling

* Update kPadSeqLenK setting in fmha_fwd_kernel

* Use q_element_func and k_element_func

* Use single q_tile rather than multiple sliced q_tiles

* Codes refine according to the comments

* Re-format one file

* Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true
2025-03-07 14:19:51 +08:00