composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-10 16:28:38 +00:00

Files

rocking 79faaa4f5c [rocm-libraries] ROCm/rocm-libraries#4742 (commit d340a14)

[CK_TILE] Fix FMHA async pipeline LDS sync issue (#4742)

## Motivation

Fix FMHA forward async pipeline
(`block_fmha_pipeline_qr_ks_vs_async.hpp`) sync issue.
Some attention test cases intermittently fail due to a race condition
where the V tile store to LDS overwrites K tile data that is still being
read by other threads during the tail `gemm_0` operation.

## Technical Details

In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the
same LDS memory through a rotation schedule (`LdsSeq`).
After the tail `gemm_0` (line 458), some fast threads may proceed to
store V to LDS (line 617) before slow threads finish reading K data from
the same LDS buffer.

The fix adds an `s_barrier` synchronization after the tail `gemm_0` when
K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e.,
`LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`):

`if constexpr(LdsSeq.at(number<k0_loops - 1>{}) ==
LdsSeq.at(number<k0_loops>{}))
    __builtin_amdgcn_s_barrier();`

Why `s_barrier` alone is sufficient (no s_waitcnt lgkmcnt(0) needed):
The `gemm_0` MFMA instruction internally waits for its LDS operands
(ds_read) to complete before execution
Therefore, each thread's ds_read of K data is already complete by the
time gemm_0 finishes
Only cross-thread synchronization (`s_barrier`) is needed to ensure all
threads have finished reading before any thread starts writing V

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

2026-03-09 18:05:36 +00:00

[rocm-libraries] ROCm/rocm-libraries#5135 (commit 5ccc138)

2026-03-09 09:34:18 -07:00

ck_tile

[rocm-libraries] ROCm/rocm-libraries#4742 (commit d340a14)

2026-03-09 18:05:36 +00:00

rapidjson

Update pre-commit to fixed versions, run remod for ck_tile (#2895 )

2025-10-16 15:29:17 -07:00