mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
[CK] CompV4: remove redundant barrier (+5.7% gfx942, +1% gfx950) (#7442)

## Summary

- Remove one redundant `block_sync_lds()` from the pong phase of the CompV4 GEMM pipeline hot loop.
- The pong phase had 2 barriers while ping had 1. The second pong barrier (after LDS writes, before global loads) was unnecessary because the sync at the top of the next ping iteration already ensures LDS coherence.
- Removing this barrier allows global loads to overlap with LDS write drain, restoring the latency hiding the ping-pong design was built to provide.
- Abstracting the ping and pong phases into a generic lambda helps avoid repeating this mistake.

## Benchmark

### gfx942 (MI300X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **80** |
| Neutral (±1%) | **4** |
| Regressed | **2** |
| Average gain | **+5.7%** |
| Best gain | +18.0% (4096x256x16384) |
| Worst regression | -2.9% (12288x3072x4096) |

### gfx950 (MI355X), 86 fp16 GEMM shapes

| Metric | Value |
|---|---|
| Improved (>1%) | **32** |
| Neutral (±1%) | **54** |
| Regressed | **0** |
| Best gain | +9.0% (4096x2048x28672) |

### Top gains by workload

| Shape (MxNxK) | Source | gfx942 BL | gfx942 Opt | gfx942 Gain | gfx950 BL | gfx950 Opt | gfx950 Gain |
|---|---|---|---|---|---|---|---|
| 4096x256x16384 | bloom_fc2 | 38.3 | 45.2 | **+18.0%** | 75.6 | 77.0 | +1.9% |
| 4096x512x22016 | llama2_7b | 77.8 | 90.8 | **+16.7%** | 152.4 | 154.9 | +1.7% |
| 256x1536x7168 | deepseek | 14.4 | 16.7 | **+16.0%** | 27.2 | 28.0 | +2.8% |
| 4096x1024x22016 | llama2_7b | 156.2 | 180.8 | **+15.7%** | 304.8 | 311.6 | +2.2% |
| 4096x1024x16384 | bloom_fc2 | 154.6 | 178.5 | **+15.4%** | 303.1 | 309.5 | +2.1% |
| 4096x4096x22016 | llama2_7b | 371.0 | 412.3 | **+11.1%** | 819.8 | 823.6 | +0.5% |
| 4096x2048x28672 | llama3_8b | 235.5 | 259.5 | **+10.2%** | 530.0 | 577.7 | **+9.0%** |
| 250880x256x4096 | bloom_logits | 289.0 | 335.9 | **+16.2%** | 595.5 | 599.1 | +0.6% |
| 8192x8192x8192 | square | 411.8 | 432.9 | **+5.1%** | 825.1 | 825.8 | +0.1% |
| 7168x4096x8192 | llama70b | 362.9 | 374.7 | **+3.3%** | 775.8 | 782.5 | +0.9% |

## Hardware counter analysis (rocprof-compute, 8192x8192x8192, gfx942)

| Metric | Baseline | Optimized | Delta |
|---|---|---|---|
| s_barrier per ping+pong | 5 | 4 | **-1** |
| MFMA Utilization | 47.8% | 55.5% | **+7.7pp** |
| IPC | 0.17 | 0.21 | **+23.5%** |
| MFMA F16 % of peak | 30.6% | 33.5% | **+2.8pp** |
| VALU (instructions) | 41.67M | 41.67M | identical |
| MFMA (instructions) | 65.91M | 65.91M | identical |
| Spill/Stack Read | 8.27M | 8.27M | identical |

All instruction counts are identical: the optimization removed one synchronization point, not any compute instructions.

## Correctness

- gfx942: GPU verification (`-v=2`) passed on 4 shapes (8192x8192x8192, 4096x4096x4096, 22016x4096x4096, 4096x512x28672)
- gfx950: GPU verification (`-v=2`) passed on all 86 shapes
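
## Sketch: one shared phase body

The Summary's point about routing the ping and pong phases through one generic lambda can be illustrated with a host-only model. This is a minimal sketch, not the CK implementation: `LoopStats`, `block_sync_lds_stub`, and `run_hot_loop` are hypothetical names, the barrier is replaced by a counter, and the ordering inside the phase (sync, LDS read, LDS write, then global loads) is an assumption based on the description above. It models only the top-of-phase sync, so it does not reproduce the 5 and 4 `s_barrier` counts reported by the profiler.

```cpp
// Host-only model of a ping-pong hot loop. Nothing here is CK code:
// all names are hypothetical and the barrier is a counter, not a GPU sync.
#include <cassert>
#include <cstddef>
#include <vector>

struct LoopStats {
    std::size_t barriers = 0;     // calls to the barrier stand-in
    std::size_t phases = 0;       // ping and pong phases executed
    std::vector<int> read_order;  // LDS buffer index (0 or 1) read by each phase
};

// Stand-in for a block-wide LDS barrier; it only counts calls.
inline void block_sync_lds_stub(LoopStats& stats) { ++stats.barriers; }

// Runs `iterations` ping+pong pairs. Both phases go through the same lambda,
// so they cannot drift apart in the number of barriers they issue.
inline LoopStats run_hot_loop(std::size_t iterations) {
    LoopStats stats;

    auto phase = [&stats](int read_buf, int write_buf) {
        // Single sync at the top of the phase: the previous phase's writes
        // into `read_buf` are assumed visible after this point.
        block_sync_lds_stub(stats);

        stats.read_order.push_back(read_buf);  // model: LDS read + MFMA on read_buf
        (void)write_buf;                       // model: LDS write of the next tile

        // No second barrier here. The next phase's top-of-phase sync covers
        // these writes, so global loads could be issued immediately.
        ++stats.phases;
    };

    for (std::size_t i = 0; i < iterations; ++i) {
        phase(0, 1);  // ping: read buffer 0, fill buffer 1
        phase(1, 0);  // pong: read buffer 1, fill buffer 0
    }
    return stats;
}
```

Calling `run_hot_loop(3)` executes six phases and six barrier stand-ins, one per phase. Because the barrier lives in the shared lambda, adding a sync to only one of the two phases would require a deliberate change to a call site rather than an accidental edit to a duplicated body.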