mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-07-17 09:17:25 +00:00

Qinghua Zhou 935cc70534 fix: resolve illegal memory access and kernel correctness issues in alltoallv

1. Fix pinned buffer race condition (alltoallv_single.py):
   - The shared pinned CPU buffer was reused for 4 sequential non_blocking
     H2D copies. The GPU DMA could read the buffer after the CPU had
     overwritten it with the next field, corrupting sendCounts/recvCounts
     and causing the kernel to write to wrong addresses. Fixed by using 5
     dedicated pinned buffers, one per field (send_counts, send_displs,
     recv_counts, recv_displs, remote_recv_displs).

2. Remove C++ periodic reset (alltoallv_fullmesh.cu):
   - A hardcoded static counter reset destroyed MemoryChannels and
     semaphores every 1000 kernel calls while inter-GPU signaling was
     still in progress, causing semaphore epoch mismatch and illegal
     memory access.

3. Fix semaphore wait (alltoallv_kernel.hpp):
   - Make wait() unconditional after signal(). Skipping wait() when
     recvCounts==0 desynced the semaphore epoch counter, so subsequent
     wait() calls returned immediately, before the peer had finished
     writing.

4. Add memory fence (alltoallv_kernel.hpp):
   - Add __threadfence_system() after wait() outside the primary-block
     guard so ALL thread blocks execute it before kernel exit. Ensures
     NVLink remote writes from put() are globally visible to subsequent
     kernels on the receiving GPU.
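
The buffer reuse race in item 1 can be illustrated with a pure-Python toy model. This is only a sketch: ToyStream, upload_shared, upload_dedicated and the sample values are hypothetical names and data invented for illustration, not code from alltoallv_single.py. The model assumes that an asynchronous copy reads its source only when the stream executes it, which is what makes reusing a single staging buffer unsafe.

```python
# Toy model (hypothetical, not the real code): a "stream" that defers copies,
# the way a non_blocking H2D copy may read its pinned source only after the
# call that issued it has already returned.
class ToyStream:
    def __init__(self):
        self._pending = []  # (source list, destination list) pairs not yet run

    def copy_async(self, src, dst):
        # Record the copy; the source is NOT read at this point.
        self._pending.append((src, dst))

    def synchronize(self):
        # Run all queued copies, reading each source as it looks right now.
        for src, dst in self._pending:
            dst[:] = src
        self._pending.clear()


# Made-up sample values for the five fields named in the commit message.
FIELDS = {
    "send_counts": [1, 2],
    "send_displs": [0, 1],
    "recv_counts": [3, 4],
    "recv_displs": [0, 3],
    "remote_recv_displs": [5, 6],
}


def upload_shared(fields):
    """Buggy pattern: one staging buffer reused for every field."""
    stream = ToyStream()
    staging = [0, 0]
    device = {name: [0, 0] for name in fields}
    for name, values in fields.items():
        staging[:] = values  # overwrites data an earlier copy has not read yet
        stream.copy_async(staging, device[name])
    stream.synchronize()
    return device


def upload_dedicated(fields):
    """Fixed pattern: one staging buffer per field."""
    stream = ToyStream()
    staging = {name: list(values) for name, values in fields.items()}
    device = {name: [0, 0] for name in fields}
    for name in fields:
        stream.copy_async(staging[name], device[name])
    stream.synchronize()
    return device
```

In this model, upload_shared leaves every device field holding the last field's values, while upload_dedicated reproduces FIELDS exactly. Synchronizing after each copy would also avoid the corruption, at the cost of serializing the uploads.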
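
The epoch desync in item 3 can be sketched the same way. The model below assumes a counter-based semaphore in which the peer's signal() increments an inbound counter and each local wait() increments an expected epoch, returning once inbound has reached it. ToySemaphore and run are hypothetical names, and this is a simplification for illustration, not the MSCCL++ implementation.

```python
class ToySemaphore:
    """Hypothetical counter-based semaphore used only for this illustration."""

    def __init__(self):
        self.inbound = 0   # incremented by the peer's signal()
        self.expected = 0  # incremented by every local wait()

    def peer_signal(self):
        self.inbound += 1

    def wait_would_return(self):
        # Advance the expected epoch, then report whether a real wait()
        # would return immediately instead of blocking.
        self.expected += 1
        return self.inbound >= self.expected


def run(skip_wait_when_empty):
    """Two kernel calls; return True if the second wait() returns too early."""
    sem = ToySemaphore()

    # Call 1: nothing to receive (recvCounts == 0), but the peer still signals.
    sem.peer_signal()
    if not skip_wait_when_empty:
        sem.wait_would_return()  # consumes the signal, keeping epochs aligned

    # Call 2: data is expected, and the peer has NOT signaled for this call yet.
    return sem.wait_would_return()
```

With skip_wait_when_empty=True, the unconsumed signal from the first call satisfies the second call's wait(), so it returns before the peer has written anything. With the unconditional wait(), the second call blocks until the peer's next signal().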

2026-04-20 17:18:05 +00:00

csrc

Merge multinode branch

2026-03-25 02:51:24 +00:00

mscclpp

fix: resolve illegal memory access and kernel correctness issues in alltoallv

2026-04-20 17:18:05 +00:00

mscclpp_benchmark

Torch integration (#692)

2026-01-21 20:32:24 -08:00

test

Merge latest multinode branch