mscclpp/test/mp_unit/CMakeLists.txt at ekow-dev - mscclpp - Public git mirror

microsoft/mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-13 09:46:00 +00:00

Binyang Li 4f3638b60d Use PTX red for D2D semaphore signal (#768)

## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency
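
The difference between the two signalling schemes can be sketched on the host with `std::atomic`. This is only an analogy under stated assumptions, not the mscclpp device code: `ToySemaphore`, `signalStore`, `signalAdd`, and `afterSignals` are hypothetical names, and an ordinary in-process atomic stands in for the peer's device memory. On NVIDIA hardware the real change uses the PTX instruction named above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a device-to-device semaphore (not the mscclpp class).
struct ToySemaphore {
  std::atomic<std::uint64_t> outbound{0};       // local count of signals sent (old scheme only)
  std::atomic<std::uint64_t> remoteInbound{0};  // counter the peer would poll on
};

// Old scheme: local atomic fetch-add, then publish the new value with a release store.
inline void signalStore(ToySemaphore& s) {
  std::uint64_t next = s.outbound.fetch_add(1, std::memory_order_relaxed) + 1;
  s.remoteInbound.store(next, std::memory_order_release);
}

// New scheme: a single release add on the peer's counter; the result is discarded,
// which is what lets a compiler or ISA treat it as fire-and-forget.
inline void signalAdd(ToySemaphore& s) {
  (void)s.remoteInbound.fetch_add(1, std::memory_order_release);
}

// Value a waiter would observe after n signals from a single signaller.
inline std::uint64_t afterSignals(bool useAdd, int n) {
  ToySemaphore s;
  for (int i = 0; i < n; ++i) {
    if (useAdd) {
      signalAdd(s);
    } else {
      signalStore(s);
    }
  }
  return s.remoteInbound.load(std::memory_order_acquire);
}

// Both schemes leave the same count on the peer side.
inline void selfCheck() {
  assert(afterSignals(false, 1000) == afterSignals(true, 1000));
}
```

With one signaller per semaphore, both variants leave the same value for the waiter; the add-based variant simply needs no local counter and no returned value.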

### Performance (H100, 2 ranks, signal+wait round-trip)

```
SemaphorePerfTest.SignalPingPong:
  Store-based (old): 2.595 us/iter
  Red-based   (new): 2.345 us/iter
  Speedup:           1.11x
```
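
As a quick check of the reported figure, the speedup is the ratio of the two latencies listed above:

$$
\text{speedup} = \frac{2.595\ \mu\text{s}}{2.345\ \mu\text{s}} \approx 1.107 \approx 1.11\times,
\qquad 2.595 - 2.345 = 0.250\ \mu\text{s saved per round trip}
$$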

## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter "SemaphorePerfTest"` (1.11x speedup)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-31 15:34:43 -07:00

15 lines

331 B

CMake

 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT license.
 target_sources(mp_unit_tests PRIVATE
     mp_unit_tests.cc
     bootstrap_tests.cc
     ib_tests.cu
     communicator_tests.cu
     port_channel_tests.cu
     memory_channel_tests.cu
     semaphore_perf_tests.cu
     switch_channel_tests.cu
     executor_tests.cc
 )