Files
composable_kernel/include
Aviral Goel 4aecc8de5b [rocm-libraries] ROCm/rocm-libraries#7442 (commit b7d57ef)
[CK] CompV4: remove redundant barrier (+5.7% gfx942, +1% gfx950) (#7442)

## Summary

- Remove one redundant `block_sync_lds()` from the pong phase of the
CompV4 GEMM pipeline hot loop
- The pong phase had 2 barriers while ping had 1 — the second pong
barrier (after LDS writes, before global loads) was unnecessary because
the sync at the top of the next ping iteration already ensures LDS
coherence
- Removing this barrier allows global loads to overlap with LDS write
drain, restoring the latency hiding the ping-pong design was built to
provide
- Abstracting away Ping Pong phases into generic lambda avoids making
such mistake again.

## Benchmark

### gfx942 (MI300X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **80** |
| Neutral (±1%) | **4** |
| Regressed | **2** |
| Average gain | **+5.7%** |
| Best gain | +18.0% (4096x256x16384) |
| Worst regression | -2.9% (12288x3072x4096) |

### gfx950 (MI355X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **32** |
| Neutral (±1%) | **54** |
| Regressed | **0** |
| Best gain | +9.0% (4096x2048x28672) |

### Top gains by workload

| Shape (MxNxK) | Source | gfx942 BL | gfx942 Opt | gfx942 Gain | gfx950
BL | gfx950 Opt | gfx950 Gain |
|---|---|---|---|---|---|---|---|
| 4096x256x16384 | bloom_fc2 | 38.3 | 45.2 | **+18.0%** | 75.6 | 77.0 |
+1.9% |
| 4096x512x22016 | llama2_7b | 77.8 | 90.8 | **+16.7%** | 152.4 | 154.9
| +1.7% |
| 256x1536x7168 | deepseek | 14.4 | 16.7 | **+16.0%** | 27.2 | 28.0 |
+2.8% |
| 4096x1024x22016 | llama2_7b | 156.2 | 180.8 | **+15.7%** | 304.8 |
311.6 | +2.2% |
| 4096x1024x16384 | bloom_fc2 | 154.6 | 178.5 | **+15.4%** | 303.1 |
309.5 | +2.1% |
| 4096x4096x22016 | llama2_7b | 371.0 | 412.3 | **+11.1%** | 819.8 |
823.6 | +0.5% |
| 4096x2048x28672 | llama3_8b | 235.5 | 259.5 | **+10.2%** | 530.0 |
577.7 | **+9.0%** |
| 250880x256x4096 | bloom_logits | 289.0 | 335.9 | **+16.2%** | 595.5 |
599.1 | +0.6% |
| 8192x8192x8192 | square | 411.8 | 432.9 | **+5.1%** | 825.1 | 825.8 |
+0.1% |
| 7168x4096x8192 | llama70b | 362.9 | 374.7 | **+3.3%** | 775.8 | 782.5
| +0.9% |

## Hardware counter analysis (rocprof-compute, 8192x8192x8192, gfx942)

| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| s_barrier per ping+pong | 5 | 4 | **-1** |
| MFMA Utilization | 47.8% | 55.5% | **+7.7pp** |
| IPC | 0.17 | 0.21 | **+23.5%** |
| MFMA F16 % of peak | 30.6% | 33.5% | **+2.8pp** |
| VALU (instructions) | 41.67M | 41.67M | identical |
| MFMA (instructions) | 65.91M | 65.91M | identical |
| Spill/Stack Read | 8.27M | 8.27M | identical |

All instruction counts are identical — the optimization removed one
synchronization point, not any compute instructions.

## Correctness

- gfx942: GPU verification (`-v=2`) passed on 4 shapes (8192x8192x8192,
4096x4096x4096, 22016x4096x4096, 4096x512x28672)
- gfx950: GPU verification (`-v=2`) passed on all 86 shapes
2026-05-27 12:23:43 -04:00
..