Mirror of https://github.com/microsoft/mscclpp.git
(synced 2026-05-11 08:50:21 +00:00)
1. Fix pinned buffer race condition (alltoallv_single.py):
- A single shared pinned CPU buffer was reused for 4 sequential
non_blocking H2D copies. Because a non_blocking copy returns before
the DMA engine has read the buffer, the GPU could read a field only
after the CPU had overwritten the buffer with the next one. This
corrupted sendCounts/recvCounts and caused the kernel to write to
wrong addresses. Fixed by using 5 dedicated pinned buffers, one per
field (send_counts, send_displs, recv_counts, recv_displs,
remote_recv_displs).
2. Remove C++ periodic reset (alltoallv_fullmesh.cu):
- A hardcoded static counter destroyed and recreated the
MemoryChannels and semaphores every 1000 kernel calls, even while
inter-GPU signaling was still in progress. This caused semaphore
epoch mismatches and illegal memory accesses.
3. Fix semaphore wait (alltoallv_kernel.hpp):
- Make wait() unconditional after signal(). Skipping wait() when
recvCounts==0 desynced the semaphore epoch counter: in subsequent
calls, wait() returned immediately, before the peer had finished
writing.
4. Add memory fence (alltoallv_kernel.hpp):
- Add __threadfence_system() after wait(), outside the primary-block
guard, so that ALL thread blocks execute it before kernel exit. This
ensures that NVLink remote writes from put() are globally visible to
subsequent kernels on the receiving GPU.
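
Illustration for item 1. This is a pure-Python model of the staging-buffer race, not the code in alltoallv_single.py. DeferredCopyQueue, upload_shared and upload_dedicated are hypothetical names, and the sample values are made up. The model assumes the worst case, in which every queued copy reads its source only after all CPU writes have happened.

```python
# Pure-Python model of the staging-buffer race; no GPU or PyTorch involved.
# DeferredCopyQueue is a hypothetical stand-in for a CUDA stream: it records
# WHICH buffer to read at enqueue time, but reads the contents only on flush().


class DeferredCopyQueue:
    def __init__(self):
        self._pending = []  # (field name, source buffer) pairs, not snapshots

    def enqueue_copy(self, name, src):
        # Like a non-blocking H2D copy: returns before the data has been read.
        self._pending.append((name, src))

    def flush(self):
        # The modelled DMA engine reads each source buffer only now.
        device = {name: list(src) for name, src in self._pending}
        self._pending.clear()
        return device


# Made-up sample values for the five fields named in item 1.
FIELDS = {
    "send_counts": [1, 2],
    "send_displs": [0, 1],
    "recv_counts": [3, 4],
    "recv_displs": [0, 3],
    "remote_recv_displs": [5, 6],
}


def upload_shared(fields):
    """Buggy pattern: one staging buffer reused for every field."""
    queue = DeferredCopyQueue()
    staging = [0, 0]
    for name, values in fields.items():
        staging[:] = values  # CPU overwrites the shared buffer in place
        queue.enqueue_copy(name, staging)
    return queue.flush()


def upload_dedicated(fields):
    """Fixed pattern: one staging buffer per field."""
    queue = DeferredCopyQueue()
    staging = {name: [0, 0] for name in fields}
    for name, values in fields.items():
        staging[name][:] = values
        queue.enqueue_copy(name, staging[name])
    return queue.flush()


shared_result = upload_shared(FIELDS)
dedicated_result = upload_dedicated(FIELDS)
```

In this model every field uploaded through the shared buffer ends up holding the contents of the last field written, while the dedicated buffers deliver each field intact.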
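
Illustration for item 3. This is a simplified Python model of an epoch-counting semaphore, assuming that signal() increments a counter on the peer and that wait() increments a local expected epoch and is satisfied once the inbound counter has reached it. EpochSemaphore, try_wait and second_call_wait_returns_early are hypothetical names, not the MSCCL++ device API, and try_wait reports whether a real wait() would return instead of spinning.

```python
class EpochSemaphore:
    """Simplified two-counter model of a signal/wait semaphore (assumption)."""

    def __init__(self):
        self.inbound = 0    # incremented by the peer's signal()
        self.expected = 0   # incremented by every local wait()
        self.peer = None

    def signal(self):
        self.peer.inbound += 1

    def try_wait(self):
        # A real wait() would spin; this only reports whether it would return.
        self.expected += 1
        return self.inbound >= self.expected


def second_call_wait_returns_early(skip_wait_when_empty):
    """Return True if rank A's wait() in call 2 passes before B has signalled."""
    a, b = EpochSemaphore(), EpochSemaphore()
    a.peer, b.peer = b, a

    # Call 1: rank A receives nothing from B (recvCounts == 0), but both
    # ranks still signal each other.
    a.signal()
    b.signal()
    if not skip_wait_when_empty:
        a.try_wait()  # consumes B's call-1 signal and advances A's epoch

    # Call 2: A signals and reaches wait() before B has signalled again.
    a.signal()
    return a.try_wait()


buggy = second_call_wait_returns_early(skip_wait_when_empty=True)
fixed = second_call_wait_returns_early(skip_wait_when_empty=False)
```

With the skipped wait(), the unconsumed signal from call 1 satisfies the wait() in call 2, so A proceeds before B has written. With the unconditional wait(), A's expected epoch stays in step and the call-2 wait() would block until B signals.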