composable_kernel/include/ck_tile/ops at baa73e1515cdc8c85a4a605abb39eba072ec7ae1 - composable_kernel - Public git mirror

ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 00:58:44 +00:00

rocking baa73e1515 [CK_TILE] Fix FMHA async pipeline LDS sync issue (#4742)

## Motivation

Fix an LDS synchronization issue in the FMHA forward async pipeline
(`block_fmha_pipeline_qr_ks_vs_async.hpp`).
Some attention test cases intermittently fail due to a race condition
where the V tile store to LDS overwrites K tile data that is still being
read by other threads during the tail `gemm_0` operation.

## Technical Details

In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the
same LDS memory through a rotation schedule (`LdsSeq`).
After the tail `gemm_0` (line 458), some fast threads may proceed to
store V to LDS (line 617) before slow threads finish reading K data from
the same LDS buffer.

The fix adds an `s_barrier` synchronization after the tail `gemm_0` when
K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e.,
`LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`):

`if constexpr(LdsSeq.at(number<k0_loops - 1>{}) ==
LdsSeq.at(number<k0_loops>{}))
    __builtin_amdgcn_s_barrier();`
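
The condition above can be sketched on the host side as a compile-time check. This is a minimal illustration only: `needs_tail_barrier`, `seq_shared`, and `seq_disjoint` are hypothetical names, and the schedule values are made up; the real `LdsSeq` is defined by the pipeline and is not reproduced here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the pipeline's LdsSeq: entry i is the LDS buffer
// index used by the i-th tile in the rotation. Entries [0, k0_loops) belong
// to K sub-tiles; entry k0_loops belongs to the first V sub-tile.
template <std::size_t N>
constexpr bool needs_tail_barrier(const std::array<int, N>& lds_seq,
                                  std::size_t k0_loops)
{
    // Equal indices mean the first V store targets the same buffer that the
    // tail gemm_0 reads K from, so a cross-thread barrier is required.
    return lds_seq[k0_loops - 1] == lds_seq[k0_loops];
}

// Illustrative schedules with k0_loops == 2 (assumed values, not from CK).
constexpr std::array<int, 4> seq_shared{0, 1, 1, 2};   // last K and first V share buffer 1
constexpr std::array<int, 4> seq_disjoint{0, 1, 2, 0}; // last K and first V use different buffers

static_assert(needs_tail_barrier(seq_shared, 2), "barrier required");
static_assert(!needs_tail_barrier(seq_disjoint, 2), "barrier can be skipped");
```

Because the kernel evaluates the comparison with `if constexpr`, a schedule like `seq_disjoint` would compile to no barrier at all, so the fix costs nothing when the two sub-tiles use different buffers.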

Why `s_barrier` alone is sufficient (no `s_waitcnt lgkmcnt(0)` needed):

- The `gemm_0` MFMA instruction internally waits for its LDS operands
  (`ds_read`) to complete before execution.
- Therefore, each thread's `ds_read` of K data is already complete by the
  time `gemm_0` finishes.
- Only cross-thread synchronization (`s_barrier`) is needed to ensure all
  threads have finished reading before any thread starts writing V.

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

2026-03-09 18:05:36 +00:00

add_rmsnorm2d_rdquant

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

batched_contraction

[CK Tile] batched contraction kernel generalizing (#3126 )

2025-12-02 13:30:27 +01:00

batched_transpose

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

Cleanup and refactoring related to tile loading (#4294 )

2026-03-02 12:20:55 +00:00

[CK_TILE] Extend support of mix precision microscaling BQuant (#4267 )

2026-02-24 09:55:50 -08:00

[CK_TILE][GEMM] Fix eightwarp error & Add eightwarp unit test (#4834 )

2026-03-04 04:10:28 +00:00

[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045 )

2026-03-03 13:54:08 -08:00

[CK_TILE] Fix FMHA async pipeline LDS sync issue (#4742 )

2026-03-09 18:05:36 +00:00

[CK] Fix windows build issues (#4819 )

2026-02-25 09:12:19 -07:00

[CK TILE] Unification of sparse MFMA/WMMA policy structs (#4837 )

2026-03-05 19:52:04 +00:00

[CK_TILE][GEMM] Fix eightwarp error & Add eightwarp unit test (#4834 )

2026-03-04 04:10:28 +00:00

grouped_convolution

[CK_TILE] Add CK Tile bwd weight profiler (#4797 )

2026-03-04 21:49:42 +00:00

image_to_column

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

Shuffle fix for gfx950 (#3491 )

2026-01-13 09:21:29 -08:00

[CK Tile] multi reduce improvements (#3607 )

2026-01-27 12:56:09 -08:00

Fix redundant cast in model sensitive rmsnorm (#3681 )

2026-01-30 10:52:19 +08:00

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

Add padding to cshuffle epilogue to avoid bank conflict (#4274 )

2026-02-10 22:52:00 -07:00

[CK_TILE][FMHA] Add sparse attention VSA (#3341 )

2026-01-31 00:59:47 +08:00

chore(copyright): update copyright header for include directory (#3293 )

2025-11-26 11:00:05 -07:00

Shuffle fix for gfx950 (#3491 )

2026-01-13 09:21:29 -08:00

| File | Last commit | Date |
| --- | --- | --- |
| add_rmsnorm2d_rdquant.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| batched_contraction.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| batched_transpose.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| common.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| elementwise.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| epilogue.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| flatmm.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| fmha.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| fused_moe.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| gemm_quant.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| gemm.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| grouped_convolution.hpp | [CK_TILE] Add CK Tile bwd weight profiler (#4797) | 2026-03-04 21:49:42 +00:00 |
| image_to_column.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| layernorm2d.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| moe_flatmm.hpp | chore(copyright): update copyright header for include directory (#3293) | 2025-11-26 11:00:05 -07:00 |
| norm_reduce.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| permute.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| pooling.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| reduce.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| rmsnorm2d.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| smoothquant.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| softmax.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| sparse_attn.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| topk_softmax.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |
| topk.hpp | Cleanup and refactoring related to tile loading (#4294) | 2026-03-02 12:20:55 +00:00 |