Add missing constraint in the FMHA qr async pipeline to enforce bk0=bk1 (#8424)

## Motivation
This change adds a guardrail on the values that bk0 and bk1 can take.
It prevents ill-defined tile sizes that fail silently and produce NaN
(or other errors) at runtime.
One such failure can be reproduced with the tile engine:
```
cd rocm-libraries/projects/composablekernel/tile_engine/ops/fmha
python fmha_benchmark.py configs/batch_prefill.json \
--problems "1,4,1,8000,8000,256" \
--filter "c.data_type=='bf16' and c.hdim_q==256 and c.pipeline=='qr_async' and c.mode=='group' and c.tile_n0==32 and c.tile_k0==64"
```
## Technical Details
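The qr_async pipeline stages data in the K dimension into LDS using a bk1-sized descriptor, while gemm0 (Q*K^T) consumes that data in bk0-sized tiles. The two sizes therefore have to match, so this change enforces bk0 == bk1 for this pipeline.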
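
As a rough illustration of the constraint (not the actual code from this change), the check can be thought of as a filter over candidate tile configurations. The `TileConfig` class, the `tile_k1` field name, and the `is_valid_tile` helper below are hypothetical and only mirror the `c.pipeline` / `c.tile_k0` style used in the benchmark filter above; the tile values are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical stand-in for a tile engine configuration entry."""
    pipeline: str
    tile_k0: int  # bk0: K-tile size consumed by gemm0 (Q*K^T)
    tile_k1: int  # bk1: K-tile size used to stage data into LDS (assumed field name)


def is_valid_tile(c: TileConfig) -> bool:
    """Reject qr_async configurations where bk0 and bk1 differ."""
    if c.pipeline == "qr_async":
        return c.tile_k0 == c.tile_k1
    # Other pipelines are not covered by this sketch.
    return True


# Illustrative candidates only; these are not taken from a real config file.
candidates = [
    TileConfig("qr_async", tile_k0=64, tile_k1=32),  # mismatched, filtered out
    TileConfig("qr_async", tile_k0=32, tile_k1=32),  # matched, kept
    TileConfig("other", tile_k0=64, tile_k1=32),     # not qr_async, untouched
]
kept = [c for c in candidates if is_valid_tile(c)]
```

With a filter of this kind, a mismatched qr_async configuration never reaches instance generation, which matches the "no instance generated" behaviour reported under Test Result.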
## Test Plan
Run the tile engine command shown in the Motivation section.
## Test Result
Before this change: invalid instances are generated and fail at runtime.
After this change: no instance is generated for this configuration.
## Submission Checklist
- [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>