mscclpp/python
Qinghua Zhou 935cc70534 fix: resolve illegal memory access and kernel correctness issues in alltoallv
1. Fix pinned buffer race condition (alltoallv_single.py):
   - The shared pinned CPU buffer was reused for 4 sequential non_blocking
     H2D copies. The GPU DMA could still be reading the buffer when the CPU
     overwrote it with the next field, corrupting sendCounts/recvCounts and
     causing the kernel to write to wrong addresses. Fixed by using 5
     dedicated pinned buffers, one per field (send_counts, send_displs,
     recv_counts, recv_displs, remote_recv_displs).

2. Remove C++ periodic reset (alltoallv_fullmesh.cu):
   - A hardcoded static counter reset destroyed MemoryChannels and
     semaphores every 1000 kernel calls while inter-GPU signaling was
     still in progress, causing semaphore epoch mismatch and illegal
     memory access.

3. Fix semaphore wait (alltoallv_kernel.hpp):
   - Make wait() unconditional after signal(). Skipping wait() when
     recvCounts==0 desynced the semaphore epoch counter, so subsequent
     wait() calls returned immediately, before the peer finished writing.

4. Add memory fence (alltoallv_kernel.hpp):
   - Add __threadfence_system() after wait() outside the primary-block
     guard so ALL thread blocks execute it before kernel exit. Ensures
     NVLink remote writes from put() are globally visible to subsequent
     kernels on the receiving GPU.
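
The race described in item 1 can be illustrated without a GPU by modeling a non-blocking H2D copy as a deferred read of the staging buffer. This is a minimal pure-Python sketch, not the code in alltoallv_single.py: FakeStream, copy_async, synchronize, and upload are hypothetical stand-ins (not PyTorch or MSCCL++ APIs), and the field values are made up.

```python
FIELDS = ["send_counts", "send_displs", "recv_counts",
          "recv_displs", "remote_recv_displs"]


class FakeStream:
    """Hypothetical stand-in for a CUDA stream: copies run at synchronize()."""

    def __init__(self):
        self.pending = []

    def copy_async(self, dst, src):
        # Like a non_blocking H2D copy: remember the source, read it later.
        self.pending.append((dst, src))

    def synchronize(self):
        for dst, src in self.pending:
            dst[:] = src  # the "DMA" reads whatever the source holds now
        self.pending.clear()


def upload(host_values, shared_staging):
    """Stage every field and copy it to the fake device."""
    stream = FakeStream()
    device = {name: [0] * len(vals) for name, vals in host_values.items()}
    shared = [0] * len(next(iter(host_values.values())))
    for name, vals in host_values.items():
        # Buggy path: one staging buffer for all fields.
        # Fixed path: a dedicated staging buffer per field.
        staging = shared if shared_staging else [0] * len(vals)
        staging[:] = vals  # CPU fills the staging buffer
        stream.copy_async(device[name], staging)
    stream.synchronize()
    return device


host = {name: [i * 10 + r for r in range(4)] for i, name in enumerate(FIELDS)}
buggy = upload(host, shared_staging=True)
fixed = upload(host, shared_staging=False)
```

In this model, every field in buggy ends up holding the values of the last field written to the shared buffer, while fixed matches host. On real hardware the outcome depends on timing, so the corruption is intermittent rather than guaranteed.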
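
The epoch desync described in item 3 can be sketched with a simplified counter model. This assumes a semaphore in which the peer's signal() increments an inbound counter and each local wait() increments an expected epoch before comparing the two. EpochSemaphore and premature_returns are hypothetical names for illustration, not the MSCCL++ data structures.

```python
class EpochSemaphore:
    """Simplified model of an epoch-counting semaphore (assumed behavior)."""

    def __init__(self):
        self.inbound = 0   # incremented by the remote peer's signal()
        self.expected = 0  # incremented by each local wait()

    def signal_from_peer(self):
        self.inbound += 1

    def wait_would_pass(self):
        # A real wait() spins; this only reports whether it would return
        # right now, i.e. before the peer has signaled for this call.
        self.expected += 1
        return self.inbound >= self.expected


def premature_returns(recv_counts, skip_wait_when_empty):
    """For each kernel call, report whether wait() returned too early."""
    sem = EpochSemaphore()
    result = []
    for count in recv_counts:
        if skip_wait_when_empty and count == 0:
            result.append(False)  # wait() skipped: expected epoch not advanced
        else:
            result.append(sem.wait_would_pass())
        sem.signal_from_peer()    # the peer always signals after its put()
    return result


old_behavior = premature_returns([0, 5, 3], skip_wait_when_empty=True)
new_behavior = premature_returns([0, 5, 3], skip_wait_when_empty=False)
```

With the skip enabled, the first call leaves one unconsumed signal behind, so every later wait() in this model passes before the peer has signaled for that call. With an unconditional wait(), the two counters advance in lockstep and no call returns early.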
2026-04-20 17:18:05 +00:00