mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
[CK_TILE] Redesign LDS store API with pre-computed window coordinates (+15% MI355X, +6% MI300X) (#7130) ## Summary - Redesign the LDS store API to separate window creation from memory transfer - Add `MakeDistributedLdsStoreWindow` factory, `LocalStore` (fast path), and `LocalStoreWithCoordRecompute` (slow path) to the pipeline base class - Convert CompV3 as the reference implementation - Document the slow/fast path distinction across core tensor headers ## Motivation `LocalPrefill` hides a performance cliff: when given a bare `tile_window_with_static_lengths`, it silently reconstructs `tile_window_with_static_distribution` on every call — paying significant VALU overhead (~96 for typical configurations) for XOR coordinate computation. The cost is invisible at the call site. The new API makes the cost explicit via three verbs: | Verb | Method | Cost | When to use | |------|--------|------|-------------| | **Create** | `MakeDistributedLdsStoreWindow(bare, dstr)` | VALU (once) | Before hot loop, when VGPR budget allows | | **Store (fast)** | `LocalStore(precomputed_window, tensor)` | 0 VALU for coords | Pre-computed window available | | **Store (on-the-fly)** | `LocalStoreWithCoordRecompute(bare, tensor)` | VALU per call | VGPR budget tight, or one-shot stores | Both `LocalStore` and `LocalStoreWithCoordRecompute` enforce correct window types via `static_assert`. `LocalPrefill` is retained for backward compatibility (69 call sites across 6 pipeline files). ## Performance ### 86 Shapes, CompV3_2 (128×128 tile), fp16, RCR layout **gfx942 (MI300X): 86/86 improved, 0 regressions. Average gain: +6.2%** **gfx950 (MI355X): 85/86 improved, 1 neutral, 0 regressions. Average gain: ~+15%** <img width="2777" height="1178" alt="pr7130_perf_chart" src="https://github.com/user-attachments/assets/b2f5c406-eb20-469d-8da6-dd608c28fbcc" /> | Shape (MxNxK) | Source | gfx942 | gfx950 | |---|---|---|---| | 22016x256x4096 | llama2_7b_fc1 | +5.3% | +11.4% | | 22016x512x4096 | llama2_7b_pfill | +5.9% | +10.9% | | 4096x512x22016 | llama2_7b_pfill | +7.6% | +28.5% | | 22016x1024x4096 | llama2_7b_pfill | +6.1% | +10.1% | | 4096x1024x22016 | llama2_7b_pfill | +7.4% | +17.2% | | 22016x4096x4096 | llama2_7b_pfill | +5.2% | +9.3% | | 4096x4096x22016 | llama2_7b_pfill | +6.0% | +9.3% | | 4096x4096x4096 | llama2_7b_pfill | +5.7% | +10.6% | | 28672x256x4096 | llama3_8b_fc1 | +5.4% | +12.2% | | 28672x512x4096 | llama3_8b_pfill | +4.9% | +6.4% | | 4096x512x28672 | llama3_8b_pfill | +7.4% | +1.5% | | 28672x2048x4096 | llama3_8b_pfill | +4.9% | +8.6% | | 4096x2048x28672 | llama3_8b_pfill | +6.4% | +8.4% | | 28672x8192x4096 | llama3_8b_pfill | +5.4% | +8.0% | | 7168x1024x8192 | llama70b_pfill | +6.6% | +10.8% | | 8192x1024x7168 | llama70b_pfill | +6.4% | +11.4% | | 7168x4096x8192 | llama70b_pfill | +6.2% | +9.6% | | 16384x256x4096 | bloom_fc1 | +6.4% | +20.3% | | 16384x512x4096 | bloom_fc1 | +5.8% | +8.5% | | 16384x1024x4096 | bloom_fc1 | +6.0% | +10.9% | | 16384x2048x4096 | bloom_fc1 | +5.3% | +10.1% | | 16384x3072x4096 | bloom_fc1 | +5.5% | +8.8% | | 16384x4096x4096 | bloom_fc1 | +5.7% | +8.8% | | 4096x256x16384 | bloom_fc2 | +7.8% | +33.6% | | 4096x512x16384 | bloom_fc2 | +7.5% | +31.6% | | 4096x1024x16384 | bloom_fc2 | +7.1% | +17.1% | | 4096x2048x16384 | bloom_fc2 | +6.9% | +11.0% | | 4096x3072x16384 | bloom_fc2 | +6.8% | +11.0% | | 4096x4096x16384 | bloom_fc2 | +6.7% | +10.3% | | 12288x256x4096 | bloom_inproj | +6.7% | +22.0% | | 12288x512x4096 | bloom_inproj | +6.2% | +9.8% | | 12288x1024x4096 | bloom_inproj | +5.9% | +12.4% | | 12288x2048x4096 | bloom_inproj | +5.8% | +10.1% | | 12288x3072x4096 | bloom_inproj | +5.4% | +10.1% | | 12288x4096x4096 | bloom_inproj | +5.7% | +9.1% | | 250880x256x4096 | bloom_logits | +2.6% | +0.5% | | 4096x256x4096 | bloom_outproj | +7.1% | +28.4% | | 4096x512x4096 | bloom_outproj | +6.8% | +27.4% | | 4096x1024x4096 | bloom_outproj | +6.5% | +21.3% | | 4096x2048x4096 | bloom_outproj | +5.9% | +13.1% | | 4096x3072x4096 | bloom_outproj | +5.9% | +12.0% | | 16x1536x7168 | deepseek | +7.7% | +34.7% | | 32x1536x7168 | deepseek | +7.7% | +34.9% | | 64x1536x7168 | deepseek | +7.6% | +31.3% | | 128x1536x7168 | deepseek | +7.6% | +25.8% | | 256x1536x7168 | deepseek | +7.7% | +27.9% | | 512x1536x7168 | deepseek | +7.6% | +29.1% | | 1024x1536x7168 | deepseek | +7.3% | +28.8% | | 2048x1536x7168 | deepseek | +6.9% | +20.5% | | 4096x1536x7168 | deepseek | +6.3% | +11.0% | | 8192x1536x7168 | deepseek | +6.2% | +11.3% | | 16384x1536x7168 | deepseek | +6.0% | +9.1% | | 20480x1536x7168 | deepseek | +4.8% | +9.3% | | 16x3072x1536 | deepseek | +6.3% | +25.1% | | 32x3072x1536 | deepseek | +6.4% | +25.3% | | 64x3072x1536 | deepseek | +6.4% | +24.8% | | 1024x1024x1024 | square | +5.5% | +18.7% | | 2048x2048x2048 | square | +6.0% | +19.2% | | 3584x3584x3584 | square | +5.3% | +11.2% | | 5120x5120x5120 | square | +6.1% | +10.0% | | 6144x6144x6144 | square | +5.5% | +9.8% | | 8192x8192x8192 | square | +6.0% | +8.2% | | 1024x4608x1024 | midsize | +4.6% | +4.6% | | 512x18432x512 | midsize | +1.9% | +10.1% | | 4096x18432x4096 | midsize | +5.8% | +8.8% | | 320x8192x320 | stablediff | +4.0% | +11.3% | | 640x2048x640 | stablediff | +4.5% | +14.0% | | 320x8192x1280 | stablediff | +5.6% | +20.1% | | 1x1280x8192 | skinny_m1 | +7.7% | +35.3% | | 1x8192x1024 | skinny_m1 | +6.0% | +20.3% | | 1x7168x8192 | skinny_m1 | +7.7% | +36.6% | | 1x8192x3584 | skinny_m1 | +7.3% | +27.9% | | 1x13312x6656 | skinny_m1 | +7.6% | +30.3% | | 1x13312x16384 | skinny_m1 | +7.8% | +4.2% | | 1x16384x6656 | skinny_m1 | +7.5% | +28.7% | | 1x16384x16384 | skinny_m1 | +7.7% | +2.3% | | 16x4096x4096 | skinny_m16 | +7.4% | +31.9% | | 16x22016x4096 | skinny_m16 | +7.5% | +26.5% | | 16x28672x4096 | skinny_m16 | +7.0% | +15.1% | | 16384x1280x8192 | skinny_m16 | +5.6% | +8.7% | | 16384x8192x1024 | skinny_m16 | +4.5% | +8.8% | | 2048x4096x2048 | mixed | +4.7% | +9.0% | | 4096x2048x8192 | mixed | +6.8% | +11.0% | | 8192x4096x4096 | mixed | +5.2% | +10.0% | | 1x4096x4096 | mixed | +7.4% | +32.4% | | 1024x1024x4096 | mixed | +7.1% | +27.4% | ### ISA Hot Loop Diff (LBB1_32, per K-iteration, gfx942) | Metric | Baseline | Optimized | Delta | |--------|----------|-----------|-------| | Total VALU | 621 | 500 | **-121** | | VGPR / SGPR | 512 / 96 | 512 / 96 | unchanged | ### Hardware Counters — Instruction Mix (gfx950, rocprofiler-compute) Profiled on MI350X, shape 4096×256×16384 (bloom_fc2). Instruction counts are deterministic hardware counters. | Metric | Baseline | Optimized | Δ | |--------|----------|-----------|---| | **VALU instructions/kernel** | 4,642,473 | 987,958 | **−78.7%** | | **INT32 VALU** | 2,592,786 | 541,129 | **−79.1%** | | Instructions / wavefront | 39,178 | 24,400 | −37.7% | | VGPRs (avg) | 98 | 90 | −8% | | **MFMA instructions** | 2,059,702 | 2,059,702 | **0%** | | **LDS instructions** | 1,564,891 | 1,564,891 | **0%** | | **VMEM instructions** | 520,996 | 520,996 | **0%** | MFMA as fraction of total instructions: **30.7% → 67.5%**. Eliminating ~3.65M redundant INT32 VALU instructions (XOR coordinate recomputation per K-iteration) leaves the scheduler more headroom for MFMA dispatch, directly explaining the benchmark gains.