LDS Bank Conflicts on CDNA

From 4.7 to 5.7 TB/s with coalesced loads and XOR-swizzled LDS

Amir Hossein Ghamarian

2026-04-21

The Problem

Moving a [65536 x 256] FP16 Tile Through LDS

We have a simple task: copy a matrix from global memory, transpose it via LDS, and write it back.

for (k_block = 0; k_block < K; k_block += kK) {
    auto reg_tile      = load_tile(gmem_window_in);   // HBM -> registers
    store_tile(lds_window_mk, reg_tile);              // registers -> LDS
    block_sync_lds();

    auto reg_transposed = load_tile(lds_window_km);   // LDS -> registers (TRANSPOSED)
    block_sync_lds();

    store_tile(gmem_window_out, reg_transposed);      // registers -> HBM
}

The question: on MI300 / MI355 (peak HBM3 ~5.2 / ~8 TB/s), can we saturate the bus with a kernel this simple?

The answer: only if every LDS access is conflict-free and every HBM access is coalesced. Miss either and you leave ~15-20% on the floor.

What we will measure for every variant: global load/store ISA, LDS read/write ISA, SQ_LDS_BANK_CONFLICT, SQ_INSTS_LDS, kernel time, HBM bandwidth.

Part A - Foundations

What Is Coalescing?

Coalescing is the GPU memory unit’s job: take 64 lane-level addresses emitted by one wave and collapse them into the minimum number of 64-byte HBM transactions that cover them all.

The 5-step mental model (from coalescing_model.py):

  1. Every lane of the wave emits its own byte address in SIMT.
  2. The memory unit walks each lane’s request and slices it into pieces that fit in one 64 B cache line.
  3. THE COALESCING STEP: issue exactly one 64 B HBM transaction per unique cache line.
  4. Scatter the returned bytes back to the lanes that wanted them.
  5. Accounting: efficiency = useful_bytes / fetched_bytes.

Key numbers on CDNA:

CACHE_LINE = 64 B    (HBM transaction granularity)
WAVE_SIZE  = 64 lanes
per-lane widths: b32 = 4 B, b64 = 8 B, b128 = 16 B
peak coalesced wave load: 1024 B = 16 cache lines

The rule of thumb:

  • 64 lanes x 16 B contiguous -> 16 transactions -> 100% efficiency.
  • 64 lanes x 4 B strided by 256 B -> 64 transactions -> ~6% efficiency.
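Both bullets can be checked with a few lines of Python (a condensed version of steps 2-5 of the model above, using the CACHE_LINE and WAVE_SIZE constants from the previous slide):

```python
CACHE_LINE, WAVE_SIZE = 64, 64

def wave_stats(stride, nbytes):
    # unique 64 B lines touched when 64 lanes each read `nbytes` at byte `stride`
    lines = {(lane * stride + b) // CACHE_LINE
             for lane in range(WAVE_SIZE) for b in range(nbytes)}
    useful  = WAVE_SIZE * nbytes
    fetched = len(lines) * CACHE_LINE
    return len(lines), useful / fetched

print(wave_stats(stride=16, nbytes=16))   # (16, 1.0)    - contiguous b128
print(wave_stats(stride=256, nbytes=4))   # (64, 0.0625) - 4 B strided by 256 B
```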

The Coalescer in 40 Lines

from collections import defaultdict

def execute_wave_load(lane_addresses, lane_nbytes, HBM):
    # Step 2: slice each lane request into per-line contributions
    line_requests = defaultdict(list)
    for lane, addr in enumerate(lane_addresses):
        remaining, cur = lane_nbytes, addr
        while remaining > 0:
            line_id = cur // CACHE_LINE
            offset  = cur %  CACHE_LINE
            chunk   = min(remaining, CACHE_LINE - offset)
            line_requests[line_id].append((lane, offset, chunk))
            cur, remaining = cur + chunk, remaining - chunk

    # Step 3: THE COALESCING STEP - one transaction per unique cache line
    fetched_lines = {
        line_id: bytes(HBM[line_id*CACHE_LINE : (line_id+1)*CACHE_LINE])
        for line_id in line_requests
    }

    # Step 5: accounting
    useful_bytes  = WAVE_SIZE * lane_nbytes
    fetched_bytes = len(fetched_lines) * CACHE_LINE
    return len(fetched_lines), useful_bytes / fetched_bytes  # (num_transactions, efficiency)

assets/coalescing_model.py has the full implementation with 7 scenarios you can step through in a debugger.

Coalescing - Interactive

Pick a scenario with the buttons. Rows = 64 lanes, columns = 64 B HBM cache lines. Green cell = useful byte (lane wanted it). Number of non-empty columns = HBM transactions. Efficiency = useful / fetched bytes.

Coalescing - Seven Scenarios (Summary)

Scenario 7 = the global-load shape used by every tutorial kernel (green). Scenario 6 = what a naive transpose-read of gmem would do (red) – why we go through LDS.

Coalescing: What Transfers To LDS

For the rest of the deck, only the coalesced b128 pattern matters on the HBM side. All six tutorial kernels use this same global-memory read:

const auto gmem_desc_in = make_naive_tensor_descriptor(
    make_tuple(number<kM>{}, number<kK>{}),
    make_tuple(K, number<1>{}),
    number<16 / sizeof(DataType)>{},   // GuaranteedLastDimensionVectorLength = 8 for fp16
    number<1>{});                      // GuaranteedLastDimensionVectorStride

The LDS side is where kernels diverge. Next: what a bank conflict actually is.

LDS Hardware Model

gfx942 (MI300):  32 banks x 4 B  =  128 B / cycle / wave
gfx950 (MI355):  64 banks x 4 B  =  256 B / cycle / wave

wave = 64 lanes, processed in 2 half-waves of 32 lanes
    (each half-wave must hit 32 distinct banks, or broadcast from a single slot)

bank(addr_bytes) = (addr_bytes / 4) % NBanks
slot(addr_bytes) =  addr_bytes / 4             # global 4 B word index (same slot => broadcast-eligible)

LDS instruction widths (ds_read / ds_write):
    b16   =  2 B       (fp16 scalar)
    b32   =  4 B  = 1 slot  per lane
    b64   =  8 B  = 2 slots per lane
    b128  = 16 B  = 4 slots per lane   -> one instruction = 4 sub-steps
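The address formulas above translate directly into a toy Python model (NBANKS = 32 assumes gfx942):

```python
NBANKS = 32                          # gfx942; use 64 for gfx950

def bank(addr_bytes):
    return (addr_bytes // 4) % NBANKS

def slot(addr_bytes):
    return addr_bytes // 4

# a ds_read_b128 at `base` issues 4 sub-steps: base, base+4, base+8, base+12
def b128_substep_banks(base):
    return [bank(base + 4 * s) for s in range(4)]

print(bank(64), slot(64))            # 16 16 - a 64 B row stride lands 16 banks away
print(b128_substep_banks(0))         # [0, 1, 2, 3]
```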

Bank-Conflict Rules - Interactive

Buttons walk through all three rules on a real 32-bank half-wave. R1: distinct banks or exact broadcast. R2: fp16 same-slot is all-or-nothing. R3: ds_read_b128 = 4 sub-steps, each checked against R1.

The Three Rules, Condensed

R1 - one half-wave, one sub-step

For every 32-lane half-wave and every 4 B sub-step:

  • distinct banks across the 32 lanes, OR
  • every same-bank lane on the same 4 B slot (broadcast)

Anything else = N-way conflict.

R2 - fp16 inside one slot

Two fp16 values share a 4 B slot. The slot broadcasts only if every lane that lands on that bank targets the same fp16 half.

One dissenter = full N-way conflict. No partial credit.
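A minimal sketch of the all-or-nothing check (a hypothetical helper, not library code): a set of byte addresses broadcasts only if every one of them names the same 2 B half of the same 4 B slot.

```python
def fp16_broadcast_ok(byte_addrs):
    # broadcast requires: same 4 B slot AND same 2 B half for every lane
    return len({(a // 4, (a % 4) // 2) for a in byte_addrs}) == 1

print(fp16_broadcast_ok([100, 100, 100]))  # True  - exact broadcast
print(fp16_broadcast_ok([100, 102]))       # False - other half of the slot: full conflict
```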

R3 - wide reads = 4 x R1

ds_read_b128 issues 4 sub-steps at base, base+4, base+8, base+12. Each sub-step is a full bank check.

Widening helps only when R1 already holds. If R1 fails on one sub-step it fails on all four.

Checklist: Does (write_dist, lds_desc, read_dist) Give 0 Conflicts?

Given a per-thread read pattern with stride S bytes along the lane axis and instruction width W bytes:

(Q1) substeps_per_instruction   = W / 4                        (1 for b32, 4 for b128)
(Q2) for each sub-step s in [0, W/4):
         lane_addr(lane) = base + lane * S + s * 4
         bank(lane)      = (lane_addr / 4) % NBanks
(Q3) in each half-wave (lanes 0..31 and 32..63):
         if all 32 banks distinct            -> sub-step is OK
         elif all lanes on same bank share 1 slot -> OK (broadcast)
         else                                -> N-way conflict
(Q4) repeat for the WRITE side with its own (S, W).

Worked example - tutorial 01 row-major transpose read:

kM=64, kK=32, FP16. Read pattern = 8 x ds_read_u16 at stride 64 B:
  offsets: 0, 64, 128, 192, 256, 320, 384, 448
  slots:   0, 16,  32,  48,  64,  80,  96, 112
  banks:   0, 16,   0,  16,   0,  16,   0,  16    <- 8 lanes share {0, 16}
  slots differ on bank 0: {0, 32, 64, 96}  -> NOT broadcast
  -> Rule R1 FAILS -> 8-way conflict per half-wave per instruction.

This is exactly what the assembly shows (next section).
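The Q1-Q4 checklist can be sketched as one Python function. This models R1 at b32 granularity only; the fp16 half rule (R2) and the hardware's per-cycle lane grouping for wide ops are not modeled. (The worked example above shows 8 lanes sharing {0, 16}, hence 8-way; the sketch below walks all 32 lanes of a half-wave down one column, which is 16-way.)

```python
NBANKS = 32

def worst_conflict(base, stride, width):
    """Largest N-way conflict over the sub-steps of one 32-lane half-wave."""
    worst = 1
    for s in range(max(1, width // 4)):                  # Q1: one check per 4 B sub-step
        addrs = [base + lane * stride + 4 * s for lane in range(32)]   # Q2
        slots_by_bank = {}
        for a in addrs:
            slots_by_bank.setdefault((a // 4) % NBANKS, set()).add(a // 4)
        for slots in slots_by_bank.values():             # Q3: distinct slots serialize
            worst = max(worst, len(slots))
    return worst

print(worst_conflict(0, 64, 4))   # 16 - column read at 64 B stride: R1 fails
print(worst_conflict(0, 68, 4))   # 1  - 68 B stride (17 slots, coprime with 32): R1 holds
```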

The Full Pipeline

Every tutorial in Part B attacks a specific stage of this pipeline.

Part B - Six Tutorials

01 Row-Major Baseline - Code

// LDS descriptor for writing [M, K] - plain row-major
static constexpr auto MakeLdsDescriptorMK()
{
    return make_naive_tensor_descriptor_packed(
        make_tuple(number<kM>{}, number<kK>{}));
}

// Transposed LDS descriptor for reading [K, M]
static constexpr auto MakeLdsDescriptorKM()
{
    return make_naive_tensor_descriptor(
        make_tuple(number<kK>{}, number<kM>{}),
        make_tuple(number<1>{}, number<kK>{}));   // stride kK=32 on the fast axis
}

Tile is kM=64, kK=32. Writes are row-major (fine). Reads are columns at stride 64 B - bank 0 and bank 16, over and over.

What the compiler generated:

ds_write_b128 v..., v...         # write:   OK
ds_read_u16  v8,  v6             # read:    bad
ds_read_u16  v9,  v6 offset:64
ds_read_u16 v10,  v6 offset:128
ds_read_u16 v11,  v6 offset:192
ds_read_u16 v12,  v6 offset:256
ds_read_u16 v13,  v6 offset:320
ds_read_u16 v14,  v6 offset:384
ds_read_u16 v15,  v6 offset:448

All 8 reads share the same base register v6. Offsets step by 64 B = 16 slots = half a trip around the 32 banks, so successive offsets alternate between banks {0, 16, 0, 16, ...}.
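The bank arithmetic for those eight offsets, in two lines:

```python
offsets = range(0, 512, 64)                 # the 8 ds_read_u16 offsets above
print([(o // 4) % 32 for o in offsets])     # [0, 16, 0, 16, 0, 16, 0, 16]
```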

01 Row-Major Baseline - Counters

global loads   :   8x buffer_load_dwordx4   (coalesced)
global stores  :   8x buffer_store_short    (narrow transpose)
LDS writes     :   4x ds_write_b128
LDS reads      :   8x ds_read_u16           (conflicting)

SQ_LDS_BANK_CONFLICT  :   3,670,016         (very high, 14 per dispatch)
SQ_INSTS_LDS          :     294,912
kernel time (M=65536) :      14.08 us
HBM bandwidth         :       4.77 TB/s    <-- well below peak

What the rules say:

  • R1 fails: 8 lanes on {bank 0, bank 16} with 8 distinct slots -> an 8-way conflict (7 extra cycles) per half-wave.
  • R2 fails: different rows -> different slots, no FP16 broadcast.
  • R3: b16 reads are a single sub-step, so width doesn’t hurt (but also doesn’t help).

Take-away: a naive row-major LDS layout is unusable for transpose reads. Everything that follows attacks this read pattern.

02 Column-Major LDS - Move the Pain

// LDS descriptor for writing [M, K] - column-major strides
static constexpr auto MakeLdsDescriptorMK()
{
    return make_naive_tensor_descriptor(
        make_tuple(number<kM>{}, number<kK>{}),
        make_tuple(number<1>{}, number<kM>{}));  // M is fast axis
}

// Reading [K, M] -- now this is contiguous in M
static constexpr auto MakeLdsDescriptorKM()
{
    return make_naive_tensor_descriptor(
        make_tuple(number<kK>{}, number<kM>{}),
        make_tuple(number<kM>{}, number<1>{}));
}

We just swapped which axis is contiguous in LDS.

  • Reads are now contiguous -> ds_read_b128 possible.
  • Writes are now the strided side -> 8x ds_write_u16, and the conflicts move to the write port.
  • Conflicts do not disappear (3.67M -> 1.57M, see the summary table); the pain mostly just moves.

Rule R1 check: write now has stride 64 B -> exactly the same {0, 16, ...} pattern R1 failed on for 01, except flipped to writes.

Take-away: you cannot solve a transpose via LDS layout alone if the layout is rectangular.

03 Row-Major + Padding - The Classical Fix

static constexpr index_t kPadding  = 2;        // 2 fp16 = 4 B padding
static constexpr index_t kKPadded  = kK + kPadding;  // 34 elements per row

static constexpr auto MakeLdsDescriptorMK()
{
    return make_naive_tensor_descriptor(
        make_tuple(number<kM>{}, number<kK>{}),       // logical [M, K]
        make_tuple(number<kKPadded>{}, number<1>{})); // padded stride!
}

static constexpr auto MakeLdsDescriptorKM()
{
    return make_naive_tensor_descriptor(
        make_tuple(number<kK>{}, number<kM>{}),
        make_tuple(number<1>{}, number<kKPadded>{}));
}

Why it works:

Unpadded row stride = 32 x 2 = 64 B  = 16 slots  (divides 32 -> collisions)
Padded   row stride = 34 x 2 = 68 B  = 17 slots  (coprime with 32 -> all banks touched)

Since 17 is coprime with 32, 32 lanes walking down a column hit 32 distinct banks. R1 satisfied.

Cost: 2 extra fp16 per row x 64 rows = 256 B wasted LDS. Benefit: conflicts drop ~4.6x (3.67M -> 0.79M), and bandwidth jumps from 4.77 to 5.62 TB/s. Caveat: padding alone does not hit 0; the remaining lanes still broadcast-miss on two banks per half-wave. Pair with XOR (05) or switch to 08 to eliminate them.
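The coprimality argument in one loop: a column walk touches NBanks / gcd(slot_stride, NBanks) distinct banks, so only the padded stride covers all 32.

```python
from math import gcd

NBANKS = 32
for elems_per_row in (32, 34):                    # kK vs kK + kPadding (fp16)
    slot_stride = elems_per_row * 2 // 4          # row stride in 4 B slots
    banks = {(lane * slot_stride) % NBANKS for lane in range(NBANKS)}
    assert len(banks) == NBANKS // gcd(slot_stride, NBANKS)
    print(elems_per_row, slot_stride, len(banks))
# 32 16  2   <- unpadded: only banks {0, 16}
# 34 17 32   <- padded: all 32 banks, R1 satisfied
```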

05 XOR + Padding - Belt and Braces

static constexpr index_t kKPack   = 8;
static constexpr index_t kKPadded = 33;  // padding on the pack axis

static constexpr auto MakeLdsDescriptorMK() {
    constexpr index_t MLdsLayer = 32 * 4 / kK / sizeof(T);  // = 2 for fp16, kK = 32
    auto lds_desc_0 = make_naive_tensor_descriptor(
        make_tuple(number<kK/kKPack*MLdsLayer>{},
                   number<kM/MLdsLayer>{},
                   number<kKPack>{}),
        make_tuple(number<kKPack>{},
                   number<kKPadded * MLdsLayer>{},  // <<< PADDED
                   number<1>{}), ...);
    auto lds_desc_permuted = transform_tensor_descriptor(
        lds_desc_0,
        make_tuple(make_xor_transform(...),         // <<< XOR on (M, A)
                   make_pass_through_transform(number<kKPack>{})), ...);
    // unmerge A -> (MLdsLayer, K0); merge back to [M, K]
    return lds_desc;
}

Why both? Tutorial 04 (XOR alone, same LDS descriptor as 08 but a mismatched read distribution) measured ~3072 residual conflicts because 3 of its 8 ds_read_u16 instructions used hardcoded offsets that bypass the XOR swizzle – see investigation/PRESENTATION_READY_SUMMARY.md:

ds_read_u16 v18, v29 offset:128   # bypasses XOR!
ds_read_u16 v20, v26 offset:128   # bypasses XOR!
ds_read_u16 v21, v22 offset:256   # bypasses XOR!
3 / 8 = 37.5%  ==  observed 38% residual conflicts

05’s fix: pad the stride so even the bypassing offsets land on fresh banks. Zero conflicts, at the cost of ~64 bytes of LDS per tile plus the XOR machinery.

08 XOR Done Right - The Descriptor

static constexpr auto MakeLdsDescriptorMK()
{
    constexpr index_t NBanks    = get_n_lds_banks();                   // 32 gfx942 / 64 gfx950
    constexpr index_t MLdsLayer = (NBanks * 4 / kK / sizeof(T));       // 2 for fp16/kK=32
    constexpr index_t RowMul    = (NBanks == 64) ? 2 : 1;              // <<< gfx950 doubles B

    // Step 1: reshape to [A = kK/kKPack * MLdsLayer, B = kM/MLdsLayer, KPack]
    auto lds_desc_0 = make_naive_tensor_descriptor(
        make_tuple(number<kK/kKPack*MLdsLayer>{}, number<kM/MLdsLayer>{}, number<kKPack>{}),
        make_tuple(number<kKPack>{}, number<kK*MLdsLayer>{}, number<1>{}),
        number<kKPack>{}, number<1>{});

    // Step 2: XOR-swizzle (B, A) with B multiplied by RowMul
    auto lds_desc_permuted = transform_tensor_descriptor(lds_desc_0,
        make_tuple(make_xor_transform(make_tuple(number<kM/MLdsLayer * RowMul>{},
                                                 number<kK/kKPack * MLdsLayer>{})),
                   make_pass_through_transform(number<kKPack>{})),
        make_tuple(sequence<1, 0>{}, sequence<2>{}),
        make_tuple(sequence<1, 0>{}, sequence<2>{}));

    // Step 3: unmerge A -> (MLdsLayer, K0 = kK/kKPack)
    // Step 4: merge (B, MLdsLayer) -> M ;  merge (K0, KPack) -> K
    return lds_desc;   // presents as [M, K] to the user
}

The whole pipeline is 4 make_*_transform calls. No padding. No wasted LDS.

08 - XOR Descriptor (Interactive, 4 steps)

Buttons: Before (64x32 logical) -> Step1 reshape to [B=32,A=8] KPack blocks -> Recolor by column identity -> Step2 XOR shuffles blocks along A using key b mod A -> Step3 unmerge A -> (L=2, K0=4) -> Step4 merge back to physical [32 x 64] LDS rows. Matches xor_t::calculate_lower_index in ck_tile exactly for kM=64, kK=32, kKPack=8, MLdsLayer=2.

08 - XOR Descriptor, Step By Step (Static)

Colour = element id in the original [M, K] matrix. Follow one colour across the 5 panels. Only step 2 permutes: two elements that shared a bank at step 1 end up in different physical columns at step 4, which is where R1 starts passing.
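The step-2 permutation can be sketched as a toy model at pack granularity, assuming kM=64, kK=32, kKPack=8, MLdsLayer=2, so A = 8 pack columns per LDS row. This sketches only the "b mod A" XOR key, not the full ck_tile descriptor pipeline:

```python
A = 8                                        # pack columns per LDS row (kK/kKPack * MLdsLayer)

def phys_col(b, a, swizzle=True):
    # physical pack column of logical (row b, pack column a)
    return a ^ (b % A) if swizzle else a

# one logical pack column, walked down 8 rows:
print(sorted({phys_col(b, 0, swizzle=False) for b in range(A)}))  # [0] - one 4-bank group
print(sorted({phys_col(b, 0, swizzle=True)  for b in range(A)}))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each pack column is 16 B = 4 consecutive banks, so spreading a column over all 8 pack positions covers all 32 banks, which is exactly where R1 starts passing.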

08 - Why It Actually Has 0 Conflicts

static constexpr auto MakeReadDistributionMK()
{
    constexpr index_t K1 = 16 / sizeof(DataType);  // 8
    constexpr index_t K0 = kK / K1;                // 4 (= num warps)
    constexpr index_t M0 = 1, M1 = 8, M2 = 8;
    return make_static_tile_distribution(
        tile_distribution_encoding<
            sequence<1>,
            tuple<sequence<M0, M1, M2>, sequence<K0, K1>>,
            tuple<sequence<2>,        sequence<1, 1>>,
            tuple<sequence<0>,        sequence<1, 2>>,
            sequence<1, 2>,           // Ys -> (M0, K1)
            sequence<0, 1>>{});
}

Per-thread Y shape = (M0=1, K1=8): each lane reads 8 consecutive K values at a fixed M.

ds_read_b128 v[4:7], v2         # one instruction, 16 B

Contrast with tutorial 01/04’s read distribution which reads 8 M values at a fixed K -> must be 8 x ds_read_u16.

Why 0 conflicts:

  • The 8 K values sit in one XOR pack (the kKPack = 8 axis is not XORed).
  • 16 B = 4 sub-steps of 4 B each (R3) that all land on the same XOR key -> 4 contiguous banks.
  • 64 lanes * 4 banks/lane -> all 256 bank-cycles distinct -> R1 satisfied for every sub-step.

buffer_store_* on the output side stays narrow (8x buffer_store_short) because the per-thread register shape is K-vectorized, not M-vectorized. That is what tutorial 10 fixes.

10 M-Vector Store - Flip the Output Vector

static constexpr index_t kMPack = 8;  // per-thread register vector on READ side

static constexpr auto MakeReadDistributionMK()
{
    constexpr index_t M_PerThread = kMPack;           // 8
    constexpr index_t M_Lane      = kM / M_PerThread; // 8
    // Y shape: (M_PerThread=8, K=1) per thread  -- OPPOSITE of 08
    ...
}

Same LDS descriptor as 08, but each lane now holds 8 consecutive M values at one K.

Trade-off:

               07/08                  10
global load    buffer_load_dwordx4    buffer_load_dwordx4
LDS write      ds_write_b128          ds_write_b128
LDS read       1x ds_read_b128        8x ds_read_u16        (<-- more LDS ops)
global store   8x buffer_store_short  buffer_store_dwordx4  (<-- wide!)
conflicts      0                      0

  • Global store becomes coalesced b128. Win on the HBM side.
  • LDS read count goes up. Because the XOR pack axis is K but we read along M now, 8 M values -> 8 different XOR keys -> 8 instructions. Rule R1 still holds so no new conflicts.
  • Net: slightly slower on our workload; wins only when the output is heavily reused.

Summary

All Six, Side by Side

kernel          gload  gstore          LDS write        LDS read        conflicts  LDS insts  time (us)  BW (TB/s)  rules
01 row-major    bx4    store_short x8  ds_write_b128    ds_read_u16 x8  3,670,016    294,912      14.08       4.77  R1 X, R2 X
02 col-major    bx4    store_short x8  ds_write_u16 x8  ds_read_b128    1,572,864    294,912      12.25       5.48  R1 X on write
03 padding      bx4    store_short x8  ds_write_b128    ds_read_u16 x8    786,432    327,680      11.93       5.62  pad helps write, read partial
05 XOR+pad      bx4    store_short x8  ds_write_b128    ds_read_u16 x8          0    294,912      12.42       5.40  R1 OK via XOR+pad
08 XOR clean    bx4    store_short x8  ds_write_b128    ds_read_b128            0     65,536      11.88       5.65  R1 R2 R3 OK
10 M-vec store  bx4    store_dwordx4   ds_write_b128    ds_read_u16 x8          0    294,912      12.69       5.29  R1 OK, wider gstore

Numbers are rocprofv3 --pmc SQ_LDS_BANK_CONFLICT SQ_INSTS_LDS SQ_INSTS_LDS_LOAD SQ_INSTS_LDS_STORE on MI300, FP16, M=65536, K=256. Source: assets/bank_conflict_numbers.csv. Two observations worth calling out:

  • 08 has 4.5x fewer LDS instructions than any other kernel (65k vs 295k) – one ds_read_b128 replaces 8 ds_read_u16.
  • 03 padding does not fully zero conflicts at this problem size (reduces from 3.67M to 0.79M); XOR+pad (05) and XOR done right (08) both land on exactly 0.

Rules Of Thumb

  1. Start from coalescing. If the global load/store is already narrow, nothing on the LDS side can rescue you.
  2. Pick padding (03) when you need a one-line fix and can spare a few bytes per row. It satisfies R1 by making the stride coprime with NBanks.
  3. Pick XOR (08) when you want zero wasted LDS, your LDS tile is rectangular, and your per-thread read shape can be made to align with the kKPack axis (R3 pays off only then).
  4. Remember RowMul on gfx950: the XOR B-axis doubles. Hard-coding NBanks=32 gives residual conflicts on MI355.
  5. If the global store is slow (M-vectorized output) use a tutorial-10-style rotated read distribution, accepting extra LDS instructions to get buffer_store_dwordx4. Measure; the LDS-vs-HBM trade is workload-dependent.
  6. Watch out for compiler offsets: tutorial 04 showed ds_read_u16 offset:N can bypass XOR (the +N is added after the swizzle). Either align reads with the pack axis (08) or pad on top of XOR (05).

Code Pointers

Background reading in investigation/:

  • BANK_CONFLICT_CALCULATION_FINAL.md - the worked Phase-0 conflict count.
  • CONFLICT_MODEL_FINAL.md - same-slot all-or-nothing rule (R2).
  • PRESENTATION_READY_SUMMARY.md - the 3/8-offsets-bypass-XOR finding (why 04 is not enough).
  • WHY_XOR_STILL_HAS_CONFLICTS.md - the 04-vs-08 tile distribution contrast.

Assets:

  • assets/coalescing_model.py
  • assets/xor_full_steps_simple.html

Questions?