Mirror of https://github.com/microsoft/mscclpp.git (synced 2026-05-12 01:10:22 +00:00)
Implement signal/wait synchronization for the switch (multicast) group using multimem.red hardware reduction on the NVSwitch, matching the approach used by NCCL.

- Signal: multimem.red.release.sys.global.add.u32 on the multicast flag address atomically increments the flag on all peers via a single multicast reduction operation.
- Wait: polls the local device flag pointer until all devices have signaled (flag reaches numDevices * signalCount).

New types:

- SwitchGroupSemaphoreDeviceHandle: device-side handle with signal(), relaxedSignal(), wait(), and relaxedWait() methods.
- SwitchGroupSemaphore: host-side class that manages the flag channel and expected inbound counter.

Also adds a GroupSignalWait test that verifies all-rank GPU-side barrier synchronization using the new semaphore.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
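The counting scheme described above can be illustrated with a small host-side C++ model. This is only a sketch under stated assumptions, not the code in this commit: `GroupFlagModel`, `signal()`, and `arrived()` are hypothetical names, each device's flag is modeled as a `std::atomic<uint32_t>`, and the single `multimem.red` multicast add is replaced by a plain loop of `fetch_add` calls over every peer's flag. The `>=` comparison is also an assumption, chosen so that a fast peer that has already signaled for the next round does not cause a slower waiter to miss the threshold.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the group flag; not the mscclpp API.
struct GroupFlagModel {
  // One flag per modeled device (stands in for each GPU's local flag).
  std::vector<std::atomic<uint32_t>> flags;

  explicit GroupFlagModel(int numDevices) : flags(numDevices) {
    for (auto& f : flags) f.store(0, std::memory_order_relaxed);
  }

  int numDevices() const { return static_cast<int>(flags.size()); }

  // Models one signal: the flag on every peer is incremented by 1.
  // On the GPU this would be a single multicast reduction; here it is a loop.
  void signal() {
    for (auto& f : flags) f.fetch_add(1, std::memory_order_release);
  }

  // The predicate a wait would poll: has the local flag reached
  // numDevices * signalCount, i.e. has every device signaled signalCount times?
  bool arrived(int rank, uint32_t signalCount) const {
    assert(rank >= 0 && rank < numDevices());
    const uint32_t expected =
        static_cast<uint32_t>(numDevices()) * signalCount;
    return flags[rank].load(std::memory_order_acquire) >= expected;
  }
};
```

In this model, a device-side wait corresponds to spinning on `arrived(rank, signalCount)` after calling `signal()`. With four modeled devices, the predicate for the first round becomes true on every rank only after the fourth `signal()` call.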