1. Fix pinned buffer race condition (alltoallv_single.py):
- The shared pinned CPU buffer was reused for 4 sequential non_blocking
H2D copies. The GPU DMA could read the buffer after the CPU had
overwritten it with the next field, corrupting sendCounts/recvCounts
and causing the kernel to write to wrong addresses. Fixed by using 5
dedicated pinned buffers, one per field (send_counts, send_displs,
recv_counts, recv_displs, remote_recv_displs).
2. Remove C++ periodic reset (alltoallv_fullmesh.cu):
- A hardcoded static counter reset destroyed MemoryChannels and
semaphores every 1000 kernel calls while inter-GPU signaling was
still in progress, causing semaphore epoch mismatch and illegal
memory access.
3. Fix semaphore wait (alltoallv_kernel.hpp):
- Make wait() unconditional after signal(). Skipping wait() when
recvCounts==0 desynced the semaphore epoch counter, so the wait() in
subsequent calls returned immediately, before the peer finished writing.
4. Add memory fence (alltoallv_kernel.hpp):
- Add __threadfence_system() after wait() outside the primary-block
guard so ALL thread blocks execute it before kernel exit. Ensures
NVLink remote writes from put() are globally visible to subsequent
kernels on the receiving GPU.
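
The epoch desync from item 3 can be illustrated with a simplified host-side model. This is only a sketch: `EpochSemaphore`, `countPrematureReturns`, and the single-threaded ordering (the local rank reaches wait() before the peer signals) are assumptions for illustration, not the MSCCL++ device API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, single-threaded model of an epoch-counting semaphore.
// All names are hypothetical; this is not the MSCCL++ device API.
struct EpochSemaphore {
  uint64_t inbound = 0;   // incremented by each signal() from the peer
  uint64_t expected = 0;  // incremented locally by each wait()

  void signalFromPeer() { ++inbound; }

  // Returns true if wait() would return immediately without spinning.
  bool waitWouldPass() {
    ++expected;
    return inbound >= expected;
  }
};

// Counts the calls in which wait() returns before the peer has signalled
// for that same call. In every call the local rank reaches wait() first
// and the peer's (unconditional) signal arrives afterwards.
inline int countPrematureReturns(const std::vector<int>& recvCounts,
                                 bool skipWaitOnZero) {
  EpochSemaphore sem;
  int premature = 0;
  for (int count : recvCounts) {
    bool skip = skipWaitOnZero && count == 0;
    if (!skip && sem.waitWouldPass()) ++premature;
    sem.signalFromPeer();
  }
  return premature;
}

// Usage sketch: one skipped wait() leaves `expected` one epoch behind,
// so every later wait() passes on the previous call's signal.
inline void demoEpochDesync() {
  assert(countPrematureReturns({0, 8, 8}, /*skipWaitOnZero=*/true) == 2);
  assert(countPrematureReturns({0, 8, 8}, /*skipWaitOnZero=*/false) == 0);
}
```

In the model, the unconditional variant never passes early because each wait() consumes exactly one epoch, matching the one signal() the peer sends per call.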
This change makes MSCCL++ automatically select CUDA architectures based
on the build environment. If an NVIDIA GPU is detected, the build
targets the native GPU architecture for optimal performance; otherwise,
it falls back to building for multiple architectures for portability.
When building for the native architecture, FP8 support is automatically
enabled for “a-series” GPUs (e.g., sm_100a), allowing the appropriate
optimized code paths to be picked up.
* Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast
memory handling.
* Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API
handles this automatically).
* Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a
shared pointer with a custom deleter for unmapping, which prevents misuse
of raw pointers and reduces the state stored in the `GpuIpcMem`
instance.
* For `RuntimeIpc` type handles, `cudaIpcOpenMemHandle` is now called in
`GpuIpcMem::map()` instead of the `GpuIpcMem` constructor, for
consistency with other handle types.
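
The shared-pointer-with-deleter pattern mentioned for `map()` can be sketched as follows. This is a minimal illustration, not the actual `GpuIpcMem` code: `FakeDriver` and `mapShared` are hypothetical stand-ins that use host allocations so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical stand-in for the platform map/unmap calls. It only counts
// live mappings and allocates host memory.
struct FakeDriver {
  int liveMappings = 0;

  void* map(std::size_t bytes) {
    ++liveMappings;
    return ::operator new(bytes);
  }

  void unmap(void* ptr) {
    --liveMappings;
    ::operator delete(ptr);
  }
};

// Returns a shared pointer whose deleter performs the unmap, so the caller
// never handles a raw pointer it must remember to release. The driver is
// captured by reference and must outlive every returned pointer.
inline std::shared_ptr<void> mapShared(FakeDriver& driver, std::size_t bytes) {
  void* ptr = driver.map(bytes);
  return std::shared_ptr<void>(ptr, [&driver](void* p) { driver.unmap(p); });
}

// Usage sketch: copies share one mapping, which is released exactly once.
inline void demoMapShared() {
  FakeDriver driver;
  {
    std::shared_ptr<void> a = mapShared(driver, 64);
    std::shared_ptr<void> b = a;
    assert(driver.liveMappings == 1);
  }
  assert(driver.liveMappings == 0);
}
```

Because the deleter carries everything needed for cleanup, the owning object in this sketch does not have to store the mapped address itself.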
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Add `GpuIpcMemHandle`, a generic GPU memory handle that covers all
existing methods for GPU memory mapping. This PR fixes cases where the
importing environment failed to fall back to a feasible type of memory
handle. It also consolidates the code for creating and destroying the
various memory handles into a single RAII wrapper.
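
One way to picture the fallback is as picking the first handle type that both sides support. The sketch below is an assumption-laden illustration, not the PR's implementation: the `HandleType` flags, the preference order, and `pickHandleType` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <optional>

// Hypothetical handle-type flags, for illustration only.
enum HandleType : uint8_t { kRuntimeIpc = 1, kPosixFd = 2, kFabric = 4 };

// Picks the first type (in an assumed preference order) that the exporter
// provided AND the importing environment can open. Returns std::nullopt
// when no type is feasible.
inline std::optional<HandleType> pickHandleType(uint8_t exported,
                                                uint8_t importable) {
  for (HandleType t : {kFabric, kPosixFd, kRuntimeIpc}) {
    if ((exported & t) && (importable & t)) return t;
  }
  return std::nullopt;
}

// Usage sketch: the exporter offers Fabric and RuntimeIpc, but the importer
// cannot open Fabric handles, so selection falls back to RuntimeIpc.
inline void demoPickHandleType() {
  auto picked = pickHandleType(kFabric | kRuntimeIpc, kPosixFd | kRuntimeIpc);
  assert(picked.has_value() && *picked == kRuntimeIpc);
}
```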
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
* Updated Dockerfiles and the build script to support CUDA 13.0
* Added a Python3 venv, which is required since Python 3.12
* Updated the default MLNX-OFED version to the LTS version
* Added docker push instruction for multi-arch manifest
- Remove cuda11 support from the nccl-test pipeline, since the nccl build
fails for cuda11.
- Update the CI pipeline to cuda12.9. Will consider dropping cuda11
support and adding cuda13 support in the near future.
Tune nThreadsPerBlock for message sizes in the 32KB to 256KB range for the FP8 and Half data types on MI300.
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Introduce a handle cache for the AMD platform.
This avoids hitting the handle limit when too many IPC handles are opened.
NVIDIA does not need this feature, since it counts handle references
internally and reuses the same handle if it is already open.
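
A handle cache of this kind can be sketched as a map from the opaque handle bytes to a weak reference on the opened mapping. The message above does not describe the actual data structure, so `HandleCache`, the 64-byte `IpcHandle`, and the use of `std::weak_ptr` are assumptions for illustration only.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <memory>

// Hypothetical 64-byte opaque IPC handle, compared bytewise.
using IpcHandle = std::array<unsigned char, 64>;

class HandleCache {
 public:
  // Returns the live mapping for `handle` if one exists; otherwise
  // "opens" the handle once and remembers it.
  std::shared_ptr<int> open(const IpcHandle& handle) {
    auto it = cache_.find(handle);
    if (it != cache_.end()) {
      if (auto alive = it->second.lock()) return alive;
    }
    ++opens_;  // stands in for the real "open IPC handle" driver call
    auto mapping = std::make_shared<int>(opens_);
    cache_[handle] = mapping;
    return mapping;
  }

  // Number of times the underlying open was actually performed.
  int opens() const { return opens_; }

 private:
  std::map<IpcHandle, std::weak_ptr<int>> cache_;
  int opens_ = 0;
};

// Usage sketch: opening the same handle twice performs one real open.
inline void demoHandleCache() {
  HandleCache cache;
  IpcHandle h{};  // zero-filled placeholder handle
  auto a = cache.open(h);
  auto b = cache.open(h);
  assert(a == b);
  assert(cache.opens() == 1);
}
```

Holding weak references keeps the cache from extending a mapping's lifetime: once every user drops its pointer, the next open() performs a fresh open.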
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>