# CShuffleLds LDS Microbenchmarks

Microbenchmark suite for measuring LDS (Local Data Share) bandwidth and bank conflicts in the CShuffleEpilogue cross-lane shuffle patterns.

## What This Measures

The CShuffleEpilogue uses LDS to redistribute GEMM output tiles from MFMA register layout to thread-raked layout for efficient global memory writes. This benchmark isolates the LDS store/load operations to measure:

1. **Store bandwidth** - Writing accumulator tiles to LDS (MFMA → LDS)
2. **Load bandwidth** - Reading shuffled tiles from LDS (LDS → thread-raked)
3. **Bank conflicts** - LDS bank conflicts during store/load (via rocprofv3)

## Configurations

Benchmarks are generated for all combinations of:

- **FP32 MFMA tiles**: 32x32x4, 32x32x8, 16x16x4, 16x16x8, 16x16x16
- **FP16 MFMA tiles**: 32x32x8, 32x32x16, 16x16x16, 4x64x16, 64x4x16
- **FP8 MFMA tiles**: 32x32x16, 16x16x32 (output FP16 or FP8)
- **Wave layouts**: 4x1, 2x2, 1x4 (block size = MFMA tile × wave layout)

**gfx950-only configurations:**

- **FP16**: 16x16x32
- **BF16**: 16x16x64 (uses gfx950-only 16x16x32 base instruction)
- **FP8**: 32x32x32, 32x32x64, 16x16x64, 16x16x128 (output FP16 or FP8)

Each configuration produces two measurements: Store and Load.

## Building

```bash
cmake -G Ninja -B build -S . \
    -DGPU_TARGETS=gfx950 \
    -DBUILD_CK_EXAMPLES=ON \
    -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON

ninja -C build bench_lds_fp8_16x16x128_2x2_fp8  # Single benchmark
```

## Running

```bash
# Run a single benchmark
./build/bin/bench_lds_fp8_16x16x128_2x2_fp8 --warmup 3 --iters 10

# Profile with rocprofv3 for bank conflicts
cat > counters.txt <