Files
composable_kernel/example/ck_tile/52_cshuffle_lds/README.md
Max Podkorytov 7dcc606adc [rocm-libraries] ROCm/rocm-libraries#5383 (commit b660b8c)
[CK_TILE] Add CShuffleLds microbenchmark suite
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue
for bank conflict analysis.

## Motivation

CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS
→ registers for coalesced global writes). This suite isolates each
operation to:
- Identify which operation causes bank conflicts
- Measure pure LDS bandwidth per access pattern
- Validate access patterns across MFMA tile sizes and wave layouts

## Components

- **Microkernels** (`tile_load_store_microkernels.hpp`):
`StoreTile<Setup>`, `LoadTile<Setup>`
- **Setup Adapters** (`benchmark_cshuffle_lds.hpp`): Wire
CShuffleEpilogue to microkernels
- **Template** (`benchmark_template.cpp.in`): Generated benchmarks with
timing

## Build

```bash
cmake -G Ninja -B build -S . \
    -DGPU_TARGETS=gfx950 \
    -DBUILD_CK_EXAMPLES=ON \
    -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON

ninja -C build bench_lds_fp8_16x16x128_2x2_fp8
```

## New CMake Options

| Option | Default | Description |
|--------|---------|-------------|
| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` | OFF | LDS microbenchmarks |
| `BUILD_CK_TILE_FMHA_TESTS` | ON | FMHA tests |
| `BUILD_CK_TILE_ENGINE` | ON | Tile engine |
| `BUILD_CK_TILE_ENGINE_TESTS` | ON | Tile engine tests |
| `BUILD_CK_EXAMPLES` | ON | Examples |
| `BUILD_CK_TUTORIALS` | ON | Tutorials |
| `BUILD_CK_DEVICE_INSTANCES` | ON | Device instances |
| `BUILD_CK_PROFILER` | ON | Profiler |

Setting guards to OFF reduces cmake configure from ~150s to ~5s.
2026-04-15 03:44:07 +00:00

2.1 KiB
Raw Blame History

CShuffleLds LDS Microbenchmarks

Microbenchmark suite for measuring LDS (Local Data Share) bandwidth and bank conflicts in the CShuffleEpilogue cross-lane shuffle patterns.

What This Measures

The CShuffleEpilogue uses LDS to redistribute GEMM output tiles from MFMA register layout to thread-raked layout for efficient global memory writes. This benchmark isolates the LDS store/load operations to measure:

  1. Store bandwidth - Writing accumulator tiles to LDS (MFMA → LDS)
  2. Load bandwidth - Reading shuffled tiles from LDS (LDS → thread-raked)
  3. Bank conflicts - LDS bank conflicts during store/load (via rocprofv3)

Configurations

Benchmarks are generated for all combinations of:

  • FP32 MFMA tiles: 32x32x4, 32x32x8, 16x16x4, 16x16x8, 16x16x16
  • FP16 MFMA tiles: 32x32x8, 32x32x16, 16x16x16, 4x64x16, 64x4x16
  • FP8 MFMA tiles: 32x32x16, 16x16x32 (output FP16 or FP8)
  • Wave layouts: 4x1, 2x2, 1x4 (block size = MFMA tile × wave layout)

gfx950-only configurations:

  • FP16: 16x16x32
  • BF16: 16x16x64 (uses gfx950-only 16x16x32 base instruction)
  • FP8: 32x32x32, 32x32x64, 16x16x64, 16x16x128 (output FP16 or FP8)

Each configuration produces two measurements: Store and Load.

Building

cmake -G Ninja -B build -S . \
  -DGPU_TARGETS=gfx950 \
  -DBUILD_CK_EXAMPLES=ON \
  -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON

ninja -C build bench_lds_fp8_16x16x128_2x2_fp8  # Single benchmark

Running

# Run a single benchmark
./build/bin/bench_lds_fp8_16x16x128_2x2_fp8 --warmup 3 --iters 10

# Profile with rocprofv3 for bank conflicts
cat > counters.txt <<EOF
pmc: SQ_LDS_BANK_CONFLICT SQ_INSTS_LDS
EOF

rocprofv3 -i counters.txt -d output/ -- \
  ./build/bin/bench_lds_fp8_16x16x128_2x2_fp8

Implementation

  • Generic kernels: include/ck_tile/utility/tile_load_store_microkernels.hpp
  • Setup adapters: benchmark_cshuffle_lds.hpp
  • Template generation: benchmark_template.cpp.in

The benchmark uses CK's launch_kernel infrastructure for timing and make_kernel for functor-based kernel dispatch.