composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-04 05:31:24 +00:00

Files

Yi DING fb64a4453c [rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d)

[CK_TILE] Fix FMHA BWD register pressure by wrapping
 num_total_loop with amd_wave_read_first_lane (#5915)

## Motivation

In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.

## Technical Details

Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`

This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-03-30 01:45:16 +00:00

block

[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)

2026-03-11 10:00:52 +00:00

kernel

[rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a)

2026-03-27 07:54:53 +00:00

pipeline

[rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d)

2026-03-30 01:45:16 +00:00