* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should generally be
used for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits from
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed the `mscclpp::memcpyCuda*<T>()` functions to
`mscclpp::gpuMemcpy*<T>()` for naming consistency
* Fixed a few issues in NVLS memory allocation
* Fixed minor compiler warnings
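As a rough illustration of the `GpuBuffer` pattern (not mscclpp code): an ndarray subclass can take ownership of storage from a custom allocator in `__new__`. The sketch below uses `numpy` and a plain `bytearray` as stand-ins for `cupy` and `mscclpp::gpuMemAlloc`; the class name `HostBuffer` is made up for this example.

```python
import numpy as np

class HostBuffer(np.ndarray):
    """Illustrative stand-in for mscclpp.utils.GpuBuffer: an ndarray
    subclass whose backing storage comes from a custom allocator."""

    def __new__(cls, shape, dtype=np.float32):
        # The real GpuBuffer would obtain device memory via
        # mscclpp::gpuMemAlloc; this sketch allocates host bytes instead.
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        storage = bytearray(nbytes)
        # Wrap the custom storage in an ndarray view of the subclass type.
        return np.ndarray.__new__(cls, shape, dtype=dtype, buffer=storage)

buf = HostBuffer((4,), dtype=np.float32)
buf[:] = 1.0  # behaves like any ndarray
```

Because the class inherits from `ndarray`, the buffer can be passed directly to any API that accepts arrays, while the allocation policy stays under the library's control.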
Running the allreduce8 kernel across 64 vGPUs (in CPX mode) revealed a
synchronization bug. This PR addresses it by ensuring that signals are
issued only after all threads in the block have issued their data
writes, guaranteeing correct ordering between data writes and signal
writes.
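The ordering fix can be sketched with a host-side analogy (Python threads standing in for a GPU thread block; this is not mscclpp's actual kernel code): every thread issues its data write, all threads cross a block-wide barrier, and only then does one thread raise the signal, so a peer that observes the signal is guaranteed to observe every write.

```python
import threading

NUM_THREADS = 8
data = [0] * NUM_THREADS
signal = threading.Event()
# The barrier plays the role of a block-wide sync: no thread may proceed
# to the signaling step until every thread has issued its data write.
barrier = threading.Barrier(NUM_THREADS)

def worker(tid):
    data[tid] = tid + 1   # the "data write"
    barrier.wait()        # wait until all writes in the block are issued
    if tid == 0:
        signal.set()      # signal only after every write has landed

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()

signal.wait()             # the "reader" waits on the signal...
observed = sum(data)      # ...and is guaranteed to see all writes
for t in threads:
    t.join()
```

Without the barrier, thread 0 could signal before its peers finished writing, and the reader could observe stale data; that race is exactly what the fix rules out.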
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
- Support more data types for multicast operations
- Add a new op `MULTI_LOAD_REDUCE_STORE` to support NVLS
- Modify `allocSharedPhysicalCuda` to return `std::shared_ptr<T>`
instead of `std::shared_ptr<PhysicalCudaMemory>`
- Add Python support for `allocSharedPhysicalCuda`
Tests passed for `allreduce_nvls.json`
Added an allreduce kernel for message sizes smaller than 32 bytes,
i.e., when the number of elements is smaller than the number of ranks.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>