Files
jakpiase bd8e03f3a6 [rocm-libraries] ROCm/rocm-libraries#7466 (commit cc2861f)
[CK Tile] Enable hardware OOB buffer load offset trick by
 default (#7466)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Enables `CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK`
inside `config.hpp`.

  ### Background

When loading from global memory with out-of-bound (OOB) check, CK Tile
must suppress invalid lanes. The previous default used a software
branch:

  ```cpp
  // Old path (oob_conditional_check, no trick)
  if(!src_thread_element_valid) { return zeros; }
      return amd_buffer_load_impl(...);
```

  This generates divergent control flow, the compiler emits exec-mask save/restore and per-lane comparison SALU instructions one set per buffer load that touches a padded dimension.

  ### Change

  With the trick enabled, invalid lanes are suppressed entirely in hardware:
```cpp
  // New path (trick enabled)
uint32_t shift = src_thread_element_valid ? 0 : 0x80000000;
return amd_buffer_load_impl(resource, shift + offset, 0);
```
  The `0x80000000` offset overflows the buffer descriptor's declared size, causing the hardware to silently return zero for that lane - no branch, no exec mask manipulation. This matches the behavior of old CK XDL kernels, which use an unconditional load followed by a `v_cndmask` select.

  ### Expected impact

  Eliminates ALU overhead from OOB validity branches which reduces the kernel execution time, especially for memory-bound cases.
2026-05-21 10:08:12 +00:00
..