[CK_Tile] Add wmma_bf16f32_16x16x32_bf16 warp-gemm test (#8035)
## Summary
Adds the warp-gemm unit test for `wmma_bf16f32_16x16x32_bf16`. **Stacked
on #8028** (the API change) and based on its branch, so #8028 shows the
isolated API diff and this PR shows just the test.
## Test
gfx125-guarded `WmmaBf16f32.ResidualPrecisionContrast`: computes
`Y_bf16 = X_bf16·W_bf16 + R_fp32` via `WarpGemm::mac_downconvert`, compares
the result against an fp32 reference (within bf16 tolerance), and asserts that
it is at least as accurate as the bf16-accumulate path. In other words, it
demonstrates the precision benefit of carrying the fp32 accumulator (`C`)
into the fused bf16 down-convert.
Passes on gfx1250.
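
For reviewers who want the intuition without the hardware, here is a minimal host-side sketch of the contrast the test checks. It is not the CK Tile test and does not use the CK Tile API: `round_to_bf16`, `dot_fp32_acc`, and `dot_bf16_acc` are hypothetical helpers that emulate bf16 rounding in plain C++, assuming round-to-nearest-even and finite inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: round an fp32 value to bf16 precision
// (round-to-nearest-even) and return it widened back to fp32.
// Assumption: inputs are finite; NaN/Inf handling is omitted.
static float round_to_bf16(float x)
{
    assert(std::isfinite(x));
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u); // rounding bias, ties to even
    bits &= 0xFFFF0000u;                   // keep the upper 16 bits (bf16)
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// fp32-accumulate path: the residual r and every product stay in fp32,
// and the sum is rounded to bf16 once at the end.
static float dot_fp32_acc(const float* x, const float* w, int k, float r)
{
    float acc = r;
    for(int i = 0; i < k; ++i)
        acc += x[i] * w[i];
    return round_to_bf16(acc);
}

// bf16-accumulate path: the running sum is rounded to bf16 after every step,
// so small contributions can be lost to rounding.
static float dot_bf16_acc(const float* x, const float* w, int k, float r)
{
    float acc = round_to_bf16(r);
    for(int i = 0; i < k; ++i)
        acc = round_to_bf16(acc + round_to_bf16(x[i] * w[i]));
    return acc;
}
```

With `k = 32`, every `x[i] = 1`, every `w[i] = 2^-9`, and `r = 1`, the exact result is 1.0625. The fp32-accumulate path reproduces it, while the bf16-accumulate path stays at 1.0 because each 2^-9 increment is below half a bf16 ulp at 1.0.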