Files
composable_kernel/test/ck_tile/multicast_load
John Afaganis 96c39b331e [rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)
[ck] Enforce ASCII-only C/C++ sources for hipRTC
 compatibility (#7829)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.

This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.

## File extensions covered

Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:

```
*.h  *.hpp  *.cpp  *.h.in  *.hpp.in  *.cpp.in  *.inc  *.cl
```

(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.

## Why no enforcement today

CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.

## Commit layout (bisect-friendly)

1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.

2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.

3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.

The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
  Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```

Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
  | xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-04 15:00:17 +00:00
..

Multicast Load Tests (gfx1250)

Tests for the CLUSTER_LOAD and CLUSTER_LOAD_ASYNC_TO_LDS instructions on gfx1250.

Overview

These instructions broadcast global memory data to multiple workgroups within a cluster, reducing redundant memory traffic when multiple workgroups need the same data.

  • CLUSTER_LOAD_B* — synchronous load; data lands in VGPRs. Tracked by LOADcnt; wait with s_wait_loadcnt.
  • CLUSTER_LOAD_ASYNC_TO_LDS_B* — asynchronous load; data lands directly in LDS. Tracked by ASYNCcnt; wait with s_wait_asynccnt.

How CLUSTER_LOAD_B* works

Each lane supplies its own global source address (VADDR). The hardware detects which lanes across the participating WGPs are loading the same cache line. Instead of each WGP issuing an independent memory fetch, the hardware coalesces those requests: the data is fetched once from global memory and the result is broadcast (multicast) to all requesting WGPs simultaneously. The loaded value is written into each lane's destination VGPR.

The instruction is synchronous from the wave's perspective — the issuing wave stalls until the data is available in VGPRs. No explicit barrier is needed between the load and subsequent use within the same wave.

How CLUSTER_LOAD_ASYNC_TO_LDS_B* works

The async variant operates similarly at the cluster level — participating WGPs coordinate to fetch data once and broadcast — but differs in two important ways:

  1. LDS destination: Data is written directly to LDS (Local Data Share) using the per-lane LDS address supplied in VDST, bypassing VGPRs entirely. This eliminates the VGPR → LDS copy that would otherwise be required.

  2. Asynchronous completion: The issuing wave does not stall. The instruction returns immediately and increments ASYNCcnt. The wave must later issue s_wait_asynccnt(0) to ensure the LDS write has committed before reading from LDS.

The typical usage pattern is:

// Wave 0: issue async load to LDS (returns immediately)
cluster_multicast_load_async_to_lds(src + lane_id, lds_ptr, mask);

// All waves in the workgroup synchronize:
s_wait_asynccnt(0);       // Wave 0 waits for LDS write to complete
s_barrier_signal(-1);     // All waves signal they have reached the barrier
s_barrier_wait(-1);       // All waves wait for every other wave to arrive

// Now safe for all waves to read from LDS
dst[lane_id] = lds_buf[lane_id];

Clusters

A cluster is a group of up to 16 Workgroup Processors (WGPs) that can share data via multicast. When multiple workgroups within a cluster request the same address, the hardware fetches the data once and broadcasts it to all requesters.

Broadcasting

Broadcasting is controlled by the M0 register:

  • Bits M0[15:0] form a bitmask indicating which WGPs should receive the data
  • All waves requesting the same data must set identical M0 values
  • If M0[15:0] == 0, the load behaves as a normal non-multicast load
  • M0[16] is an early-timeout bit: when set, the instruction completes without waiting for all masked WGPs to participate, preventing deadlock when fewer WGPs are launched than the mask implies

Variants

Instruction Data Size Destination Wait instruction
CLUSTER_LOAD_B32 32-bit VGPR s_wait_loadcnt
CLUSTER_LOAD_B64 64-bit VGPR s_wait_loadcnt
CLUSTER_LOAD_B128 128-bit VGPR s_wait_loadcnt
CLUSTER_LOAD_ASYNC_TO_LDS_B32 32-bit LDS s_wait_asynccnt
CLUSTER_LOAD_ASYNC_TO_LDS_B64 64-bit LDS s_wait_asynccnt
CLUSTER_LOAD_ASYNC_TO_LDS_B128 128-bit LDS s_wait_asynccnt

INST_OFFSET

For CLUSTER_LOAD_ASYNC_TO_LDS_B*, the compile-time INST_OFFSET immediate is applied to both the global source address (VADDR) and the LDS destination address (VDST), per ISA section 4.9.9.1:

LDS[VGPR[VDST][lane] + INST_OFFSET] = GLOBAL_MEMORY[VGPR[VADDR][lane] + INST_OFFSET]

To offset only the LDS write position, adjust VDST directly and keep inst_offset=0.

Tests

test_cluster_load_multicast — synchronous VGPR destination

Group Description
SingleWGP B32/B64/B128 correctness with a single WGP, mask=0x1
M0Mask mask=0x0 (non-multicast path) and mask=0x1 (single-WGP multicast)
MultiWGP 26 WGP cluster broadcasts for B32, B64, B128
PartialBroadcast Non-contiguous mask (0x5): only WGPs 0 and 2 issue cluster load, others use a plain load
ConcurrentGroups Two independent broadcast groups within the same 4-WGP cluster
EarlyTimeout M0[16] early-timeout bit prevents deadlock when fewer WGPs are launched than the mask claims

test_cluster_load_async_to_lds — asynchronous LDS destination

Group Description
1 AsyncLDS B32/B64/B128 single-WGP baseline; mask=0x0 zero-mask degradation
2 LDSVisibility Non-requesting waves read LDS correctly after block_sync_lds_direct_load
3 LDSAddressLayout Per-lane strided VDST addressing
4 MultiWGPBroadcast Async LDS delivery at cluster scale: 1D (2-WGP, 4-WGP) and 2D dim3(2,2,1) cluster dims
5 ASYNCcntOrdering CLUSTER_LOAD_ASYNC_TO_LDS and GLOBAL_LOAD_ASYNC_TO_LDS share one ASYNCcnt
6 PartialBroadcast Non-contiguous mask (0x5) with mixed instruction types
8 MultiWGPLDSVisibility Canonical GEMM tile-load pattern: one wave loads, all waves read
10 ConcurrentGroupsLDS LDS routing isolation between two independent broadcast groups
11 BufferViewAsyncGet buffer_view::cluster_async_get() interface; INST_OFFSET ISA behaviour