* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
* Page-locking `Host2DeviceSemaphore::outboundSemaphore_` caused
unexpected performance issues so reverting it back. We may revisit this
later.
* Removed reference to connections from context as now connections refer
to context.
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* Now a context should be always managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
Fix the bug, make sure a thread will be wake up if semaphore be
released.
This pull request includes a modification to the `DeviceSemaphore`
struct in the `concurrency_device.hpp` file, specifically in the
`acquire` method. The change refines the logic for acquiring a semaphore
by adjusting the condition used to handle contention scenarios.
* Let CMake read version numbers from the `VERSION` file.
* Upgrade dlpack and drop `CMAKE_POLICY_VERSION_MINIMUM`.
* Do not install dlpack.
* Add license files in the wheel and exclude `*.cpp` files.
Previous `gpuCalloc*()` creates a new stream for each allocation, which
messes the timeline up in profiler traces. Now `GpuStreamPool` allows
reusing the temporal streams.
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
Cherry-picked a part of features from #167: now `Communicator::setup()`
is unneeded. `Communicator::sendMemory()` conducts the task inline, and
`Communicator::recvMemory()` and `Communicator::connect()` conducts the
task asynchronously without explicit setup.
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
`allreduce7` and `allreduceAllpairs` kernels were updating the LL
protocol flag on the host side. So, it was not properly captured in
graph mode. This PR fixes the issue by updating the flag in the kernels.
* Pass the op type as a template parameter
* Use the all-pairs algorithm for ~10KB
* Don't write channel handles on the shared memory for small sizes
* A reduction bug fix & cleanup
Mitigate this issue: #496, for now `ibv_reg_dmabuf_mr` is not supported
by Azure vm. Add this flag to force to use cudaMalloc for memory
allocation and disable nvls feature
1. use `fence+relaxed` to replace `release` for fifo. `fence+relax` is
more efficient on A100
2. Update the deviceSyncer. Previous one cannot handle threadBlock
number change correctly. Use three counters to solve this issue. Reset
previous counter before sync on current counter.
3. Introduce relaxedWait which can be used with relaxedSignal for case
doesn't need guarantee the memory visibility
Remove __assert_fail for release build. This will reduce the number of
PTX instructions inside the loop. Also Trying to resolve this issue
reported in #497. Reduce the number of PTX instructions from 8 to 6.
8 ranks signal/wait will reduce from 3.2us->2.8us on NDv5
Also NDEBUG flag is confused here, sometime it will not be set. Use
customized flag for debug build.
Here is current PTX:
```
ld.u64 %rd12, [%rd2+-24];
mov.u64 %rd13, %rd12;
mov.u64 %rd11, %rd13;
ld.acquire.sys.b64 %rd10,[%rd11];
setp.lt.u64 %p1, %rd10, %rd3;
@%p1 bra $L__BB2_1;
```
If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" :
"=l"(flag) : "l"(flag_addr));` will reduce to 4 instructions. We can get
2.1 us for 8 ranks signal/wait
```
ld.u64 %rd9, [%rd1+-24];
ld.global.acquire.sys.b64 %rd8, [%rd9];
setp.lt.u64 %p1, %rd8, %rd2;
@%p1 bra $L__BB2_1;
```
For mscclpp, to use nvls we require the buffer is allocated by
mscclpp::GpuBuffer. Due to cupy doesn't support bfloat16 yet, we export
the raw buffer to dlpack format.
User can use this feature to create buffer with type supported by
pytorch
```python
buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)
```
Add CI test for fallback allgather, allreduce, broadcast, and
reducescatter to NCCL operations
Test following parameters:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce,
broadcast, reducescatter" or "all"