776 Commits

Author SHA1 Message Date
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Binyang Li
9b71d524b3 Fix pytest failure (#567)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-11 16:49:28 -07:00
Binyang Li
65c10fa8ec Support any GPUs per node for NCCL_API (#566)
Support any GPUs per node for NCCL API
2025-07-11 13:42:39 -07:00
Binyang Li
1d85ac6f66 Fix multi-nodes CI pipeline (#564)
Fix multi-nodes CI pipeline
2025-07-08 09:51:44 -07:00
Changho Hwang
22e8db4885 Support connection between local endpoints (#561) 2025-07-02 13:02:44 -07:00
Changho Hwang
3de6d5b63a Fix #557 (#560)
* Page-locking `Host2DeviceSemaphore::outboundSemaphore_` caused
unexpected performance issues so reverting it back. We may revisit this
later.
* Removed reference to connections from context as now connections refer
to context.
2025-06-30 11:33:19 -07:00
Changho Hwang
b4dde38db8 FIFO improvements (#557)
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* Now a context should be always managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
2025-06-24 09:50:28 -07:00
Changho Hwang
2796cfa5ba New FIFO test (#558)
Comprehensive FIFO testing
2025-06-23 15:42:44 -07:00
Wenxuan Tan
2151790463 Fix some typos in docs (#555) 2025-06-19 19:39:37 +00:00
Binyang Li
81699a5bdd DeviceSemaphore fix (#553)
Fix a bug: make sure a waiting thread is woken up when the semaphore is
released.

This pull request includes a modification to the `DeviceSemaphore`
struct in the `concurrency_device.hpp` file, specifically in the
`acquire` method. The change refines the logic for acquiring a semaphore
by adjusting the condition used to handle contention scenarios.
2025-06-19 12:30:01 -07:00
Changho Hwang
a36dcd56bf Do not use tail replica by default (#544)
Added `MSCCLPP_FIFO_USE_TAIL_REPLICA` environment variable to control
whether to use a tail replica for the FIFO buffer. Default is false.
2025-06-12 14:07:10 -07:00
Changho Hwang
17d8e7c9e9 Fix build processes (#545)
* Let CMake read version numbers from the `VERSION` file.
* Upgrade dlpack and drop `CMAKE_POLICY_VERSION_MINIMUM`.
* Do not install dlpack.
* Add license files in the wheel and exclude `*.cpp` files.
2025-06-06 13:37:40 -07:00
Changho Hwang
f694f2e46b Fix #509 (#546)
Fix a destruction order issue
2025-06-05 19:36:02 -07:00
Changho Hwang
125d6f5809 Multi-stream CUDA IPC (#326)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
2025-06-04 10:31:04 -07:00
Changho Hwang
253a1ba1a9 Use a stream pool for gpuCalloc*() (#509)
Previously, `gpuCalloc*()` created a new stream for each allocation,
which cluttered the timeline in profiler traces. Now `GpuStreamPool`
allows reusing the temporary streams.
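
The pooling idea can be sketched as follows. This is only an illustration of reusing temporary objects instead of creating one per call, not the `GpuStreamPool` implementation; the `StreamPool` class, its `borrow()` method, and the integer "stream handles" are hypothetical stand-ins.

```python
import itertools
from contextlib import contextmanager


class StreamPool:
    """Hypothetical pool that hands out reusable stream objects."""

    def __init__(self, create_stream):
        self._create_stream = create_stream  # factory for new streams
        self._idle = []                      # streams waiting to be reused
        self.created = 0                     # number of streams ever created

    @contextmanager
    def borrow(self):
        # Reuse an idle stream if one exists, otherwise create a new one.
        if self._idle:
            stream = self._idle.pop()
        else:
            stream = self._create_stream()
            self.created += 1
        try:
            yield stream
        finally:
            # Return the stream to the pool instead of destroying it.
            self._idle.append(stream)


# Stand-in factory: integers play the role of stream handles here.
_ids = itertools.count()
pool = StreamPool(lambda: next(_ids))

used = []
for _ in range(4):  # four sequential "allocations"
    with pool.borrow() as stream:
        used.append(stream)

print(used, pool.created)  # [0, 0, 0, 0] 1
```

With sequential allocations, all four calls share a single stream, so a profiler timeline would show one temporary stream rather than four.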
2025-06-04 10:07:20 -07:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
c184485808 DLPack fixes (#537)
* Fix typos in type name
* Make it work without current context set
2025-05-27 21:40:50 +00:00
Changho Hwang
7278b51e61 Rename ChannelTrigger fields and check field values in debug builds (#529) 2025-05-27 14:36:22 -07:00
Changho Hwang
2b9b18d562 Address NVCC warning #20012-D (#528) 2025-05-21 10:37:50 -07:00
Binyang Li
d1869011c2 Add device semaphore API (#523)
Add the deviceSemaphore structure and implement a new NVLS-based
algorithm to show how to use these APIs. Current performance of the
NVLS non-zero-copy version:
```
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           512      half     sum      -1     6.10    0.17    0.29      0     5.65    0.18    0.32      0
        2048          1024      half     sum      -1     5.94    0.35    0.60      0     5.85    0.35    0.61      0
        4096          2048      half     sum      -1     6.11    0.67    1.17      0     5.97    0.69    1.20      0
        8192          4096      half     sum      -1     6.22    1.32    2.31      0     6.17    1.33    2.33      0
       16384          8192      half     sum      -1     6.68    2.45    4.29      0     6.52    2.51    4.39      0
       32768         16384      half     sum      -1     8.02    4.09    7.15      0     7.66    4.28    7.49      0
       65536         32768      half     sum      -1     8.09    8.10   14.18      0     7.91    8.29   14.51      0
      131072         65536      half     sum      -1     9.58   13.68   23.93      0     9.61   13.64   23.86      0
      262144        131072      half     sum      -1    12.60   20.81   36.42      0    12.28   21.35   37.37      0
      524288        262144      half     sum      -1    14.51   36.12   63.22      0    14.09   37.21   65.12      0
     1048576        524288      half     sum      -1    19.45   53.92   94.36      0    19.29   54.35   95.12      0
     2097152       1048576      half     sum      -1    31.00   67.66  118.40      0    30.80   68.08  119.14      0
     4194304       2097152      half     sum      -1    44.71   93.80  164.16      0    44.66   93.91  164.34      0
     8388608       4194304      half     sum      -1    62.96  133.24  233.17      0    62.49  134.24  234.91      0
    16777216       8388608      half     sum      -1    105.1  159.68  279.45      0    104.4  160.74  281.29      0
    33554432      16777216      half     sum      -1    169.9  197.55  345.71      0    169.8  197.64  345.87      0
    67108864      33554432      half     sum      -1    298.1  225.12  393.96      0    298.1  225.09  393.91      0
   134217728      67108864      half     sum      -1    552.9  242.77  424.84      0    553.7  242.39  424.18      0
   268435456     134217728      half     sum      -1   1055.8  254.24  444.91      0   1056.9  253.98  444.47      0
   536870912     268435456      half     sum      -1   2040.1  263.15  460.52      0   2045.1  262.52  459.40      0
  1073741824     536870912      half     sum      -1   3996.9  268.65  470.13      0   4007.7  267.92  468.86      0
```
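
For reading the table above: the layout matches nccl-tests style output, where `algbw` is the message size divided by the time and, for allreduce, `busbw` is `algbw * 2 * (n - 1) / n` for `n` ranks. The rank count is not stated in the commit message; `n = 8` below is an assumption inferred from the busbw/algbw ratio of about 1.75. A small sketch that recomputes the last row:

```python
def alg_bandwidth_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_bus_bandwidth_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for allreduce, using the nccl-tests correction factor."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# Last row of the table: 1 GiB message, 3996.9 us out-of-place.
# n_ranks = 8 is an assumption, not stated in the commit message.
algbw = alg_bandwidth_gbps(1073741824, 3996.9)
busbw = allreduce_bus_bandwidth_gbps(algbw, 8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The results agree with the reported 268.65 and 470.13 GB/s to within the rounding of the time column.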
2025-05-20 09:32:38 -07:00
Caio Rocha
29c3af2ac6 Properly setting up the device in Ethernet Connection (#527)
When we create the thread to receive messages in the Ethernet
Connection, it resets the Device ID, causing faults in the Ethernet
Connection unit tests.


![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d)

This PR aims to properly set up the device when the thread is created.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-05-19 10:05:45 -07:00
Binyang Li
a18e91cee4 Set Up a CI Pipeline for H100 (#526)
Set Up a CI Pipeline for H100
2025-05-15 14:50:23 -07:00
Changho Hwang
908659318b Update citations (#524)
Co-authored-by: Aashaka Shah <aashaka96@gmail.com>
2025-05-13 17:52:04 -07:00
Changho Hwang
2c63059014 Add a CMake option MSCCLPP_GPU_ARCHS (#525)
`MSCCLPP_GPU_ARCHS` allows specifying GPU architectures with delimiters
(comma, space, or semicolon).
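
The accepted delimiters can be illustrated with a small splitting routine. This is a sketch of the rule in Python, not the CMake logic used by the build; the function name `split_gpu_archs` and the sample architecture values are hypothetical.

```python
import re
from typing import List


def split_gpu_archs(value: str) -> List[str]:
    """Split an architecture list on commas, semicolons, or whitespace."""
    return [tok for tok in re.split(r"[,;\s]+", value.strip()) if tok]


print(split_gpu_archs("80,90"))        # ['80', '90']
print(split_gpu_archs("80 90"))        # ['80', '90']
print(split_gpu_archs("80;90"))        # ['80', '90']
print(split_gpu_archs("80, 90; 100"))  # ['80', '90', '100']
```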
2025-05-13 20:51:23 +00:00
Changho Hwang
de664ad200 Fix #514 (#521)
* When the same `tag` is used for receiving data from the same remote
rank, #514 changed `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, we restore it. Now when `get()` is called
on a future, the async function first calls `wait()` on the most recent
previously returned future. Recursively, this calls `wait()` on all
previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
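
The ordering rule in the first bullet can be sketched with deferred tasks that each hold a link to the previously returned one. This is a single-threaded Python illustration of the chaining idea, not the mscclpp code; `OrderedDeferred` and `Receiver` are hypothetical names, and a real implementation would also need thread safety.

```python
from typing import Callable, List, Optional


class OrderedDeferred:
    """Deferred task that runs only after every earlier task has run."""

    def __init__(self, task: Callable[[], object],
                 prev: Optional["OrderedDeferred"]):
        self._task = task
        self._prev = prev
        self._done = False
        self._value = None

    def wait(self) -> None:
        if self._done:
            return
        if self._prev is not None:
            self._prev.wait()  # recursively completes all earlier tasks
            self._prev = None  # drop the link once it is satisfied
        self._value = self._task()
        self._done = True

    def get(self):
        self.wait()
        return self._value


class Receiver:
    """Returns one deferred receive per call, chained in call order."""

    def __init__(self, incoming: List[str]):
        self._incoming = list(incoming)  # messages in arrival order
        self._latest: Optional[OrderedDeferred] = None

    def recv(self) -> OrderedDeferred:
        fut = OrderedDeferred(lambda: self._incoming.pop(0), self._latest)
        self._latest = fut
        return fut


receiver = Receiver(["first", "second", "third"])
f1, f2, f3 = receiver.recv(), receiver.recv(), receiver.recv()

# get() is called out of order, yet each future receives the message
# that matches the order of the recv() calls.
results = (f3.get(), f1.get(), f2.get())
print(results)  # ('third', 'first', 'second')
```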
2025-05-13 13:43:35 -07:00
Changho Hwang
5205618c4a Fix device assert (#522)
* Fixed a bug that external `assert()`s may not be compiled with mscclpp
headers
* Use a macro assert instead of a function
2025-05-12 13:38:11 -07:00
Binyang Li
a464b9f21e Adding maxSpinCount to port channel flush (#518)
Fix #482

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-05-08 21:24:48 -07:00
Changho Hwang
d636093336 Asynchronous setup (#514)
Cherry-picked part of the features from #167: `Communicator::setup()`
is no longer needed. `Communicator::sendMemory()` performs its task
inline, and `Communicator::recvMemory()` and `Communicator::connect()`
perform their tasks asynchronously without explicit setup.
2025-05-08 22:01:51 +00:00
Qinghua Zhou
8bc369ceb4 Fix the issue of echo message for nccl fallback in CI test (#520)
Update the echo message for nccl fallback in CI test
2025-05-08 10:37:54 -07:00
Qinghua Zhou
b4f0af8f9f Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc (#513)
Fix #496
For buffer allocated by cuMemMalloc, use ibv_reg_dmabuf_mr to register a
dma-buf based memory region.
2025-05-07 17:26:14 -07:00
Caio Rocha
51eca89d20 Enhance Collective Check at MSCCLang (#511) 2025-04-29 13:29:28 -07:00
Binyang Li
affca7d9bc Add NVLS based fallback algo (#507)
Add two NVLS-based fallback algorithms. `allreduce9` is for NVLS with
zero copy. `allreduce10` is for NVLS when the data must be copied to a
scratch buffer, reduced, and then copied back to the result buffer.

Perf number for allreduce9
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.45    0.19    0.33      0     5.35    0.19    0.33      0
        2048           512     float     sum      -1     5.57    0.37    0.64      0     5.53    0.37    0.65      0
        4096          1024     float     sum      -1     5.80    0.71    1.24      0     5.78    0.71    1.24      0
        8192          2048     float     sum      -1     5.94    1.38    2.42      0     5.85    1.40    2.45      0
       16384          4096     float     sum      -1     6.40    2.56    4.48      0     6.27    2.61    4.57      0
       32768          8192     float     sum      -1     7.45    4.40    7.70      0     7.39    4.43    7.76      0
       65536         16384     float     sum      -1     8.03    8.17   14.29      0     8.32    7.88   13.79      0
      131072         32768     float     sum      -1     7.28   18.00   31.49      0     7.07   18.53   32.43      0
      262144         65536     float     sum      -1     7.72   33.95   59.41      0     7.59   34.56   60.48      0
      524288        131072     float     sum      -1     8.70   60.29  105.51      0     8.37   62.61  109.57      0
     1048576        262144     float     sum      -1    10.56   99.26  173.70      0    10.32  101.64  177.87      0
     2097152        524288     float     sum      -1    14.45  145.14  253.99      0    14.02  149.58  261.76      0
     4194304       1048576     float     sum      -1    22.83  183.73  321.52      0    23.03  182.14  318.75      0
     8388608       2097152     float     sum      -1    38.63  217.14  380.00      0    38.57  217.52  380.65      0
    16777216       4194304     float     sum      -1    70.03  239.58  419.27      0    69.96  239.80  419.66      0
    33554432       8388608     float     sum      -1    131.5  255.17  446.55      0    131.3  255.59  447.28      0
    67108864      16777216     float     sum      -1    255.8  262.37  459.15      0    255.4  262.75  459.82      0
   134217728      33554432     float     sum      -1    500.9  267.94  468.90      0    500.0  268.42  469.74      0
   268435456      67108864     float     sum      -1    989.0  271.41  474.97      0    988.9  271.45  475.05      0
   536870912     134217728     float     sum      -1   1967.4  272.88  477.54      0   1966.0  273.08  477.88      0
  1073741824     268435456     float     sum      -1   3908.5  274.72  480.77      0   3904.6  274.99  481.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 218.734 
```

Perf number for allreduce10
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.60    0.18    0.32      0     5.52    0.19    0.32      0
        2048           512     float     sum      -1     5.79    0.35    0.62      0     5.64    0.36    0.64      0
        4096          1024     float     sum      -1     5.92    0.69    1.21      0     5.82    0.70    1.23      0
        8192          2048     float     sum      -1     6.03    1.36    2.38      0     5.95    1.38    2.41      0
       16384          4096     float     sum      -1     6.58    2.49    4.35      0     6.39    2.56    4.49      0
       32768          8192     float     sum      -1     7.54    4.34    7.60      0     7.41    4.42    7.74      0
       65536         16384     float     sum      -1     7.95    8.24   14.42      0     8.10    8.09   14.16      0
      131072         32768     float     sum      -1     9.56   13.72   24.00      0     9.47   13.84   24.23      0
      262144         65536     float     sum      -1    11.49   22.81   39.92      0    11.41   22.97   40.20      0
      524288        131072     float     sum      -1    14.19   36.94   64.64      0    13.88   37.76   66.09      0
     1048576        262144     float     sum      -1    19.10   54.89   96.06      0    18.98   55.24   96.67      0
     2097152        524288     float     sum      -1    31.12   67.38  117.91      0    31.34   66.92  117.10      0
     4194304       1048576     float     sum      -1    44.88   93.46  163.56      0    44.76   93.70  163.97      0
     8388608       2097152     float     sum      -1    63.23  132.68  232.18      0    62.53  134.14  234.75      0
    16777216       4194304     float     sum      -1    106.8  157.03  274.80      0    105.9  158.46  277.30      0
    33554432       8388608     float     sum      -1    172.2  194.91  341.09      0    172.0  195.05  341.35      0
    67108864      16777216     float     sum      -1    299.8  223.83  391.70      0    300.8  223.12  390.46      0
   134217728      33554432     float     sum      -1    553.1  242.66  424.66      0    553.8  242.38  424.16      0
   268435456      67108864     float     sum      -1   1056.1  254.18  444.82      0   1057.4  253.86  444.26      0
   536870912     134217728     float     sum      -1   2064.0  260.11  455.20      0   2063.8  260.14  455.25      0
  1073741824     268435456     float     sum      -1   4074.4  263.53  461.18      0   4065.8  264.09  462.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 169.799 
```

---------

Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-04-27 14:09:31 -07:00
Changho Hwang
b310783603 Fix #508 (#515)
* Wrong offsets in `unpackPackets()`
* Added Python binding of `BaseMemoryChannel`
2025-04-25 09:52:05 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Nusrat Islam
9df2bdb2bf apps/nccl: fix a bug in allreduce kernels for graph mode (#502)
`allreduce7` and `allreduceAllpairs` kernels were updating the LL
protocol flag on the host side. So, it was not properly captured in
graph mode. This PR fixes the issue by updating the flag in the kernels.
2025-04-24 16:43:47 -07:00
Changho Hwang
cbdcf9064c Use implicit ctors for default device ctors (#512)
By using implicit constructors, the compiler doesn't need to dynamically
initialize the instances.
2025-04-24 12:38:19 -07:00
Caio Rocha
7a25e51b07 Automatic creation of Scratch Buffer at MSCCLang (#510) 2025-04-23 16:37:14 -07:00
Changho Hwang
474ef0b696 Optimized allreduce fallback for ~10KB sizes (#506)
* Pass the op type as a template parameter
* Use the all-pairs algorithm for ~10KB
* Don't write channel handles on the shared memory for small sizes
* A reduction bug fix & cleanup
2025-04-23 10:38:15 -07:00
Binyang Li
7da11b35d5 Add flag to disable nvls (#500)
Mitigates #496: for now, `ibv_reg_dmabuf_mr` is not supported on Azure
VMs. Add this flag to force the use of cudaMalloc for memory allocation
and to disable the NVLS feature.
2025-04-22 17:09:19 -07:00
Binyang Li
06f31994dc Fix performance issue introduced in PR: 499 (#505)
1. Use `fence+relaxed` instead of `release` for the FIFO.
`fence+relaxed` is more efficient on A100.
2. Update the deviceSyncer. The previous one could not handle changes in
the number of thread blocks correctly. Use three counters to solve this
issue: reset the previous counter before syncing on the current counter.
3. Introduce relaxedWait, which can be used with relaxedSignal for cases
that do not need a memory visibility guarantee.
2025-04-22 14:03:37 -07:00
Binyang Li
e412804eab Improve signal/wait performance and fix barrier issue (#499)
Remove `__assert_fail` for release builds. This reduces the number of
PTX instructions inside the loop. Also tries to resolve the issue
reported in #497. Reduces the number of PTX instructions from 8 to 6.
8-rank signal/wait latency drops from 3.2 us to 2.8 us on NDv5.
Also, the NDEBUG flag is unreliable here because it is sometimes not
set, so use a customized flag for debug builds.

Here is current PTX:
```
      ld.u64  %rd12, [%rd2+-24];
      mov.u64         %rd13, %rd12;
      mov.u64         %rd11, %rd13;
      ld.acquire.sys.b64 %rd10,[%rd11];
      setp.lt.u64     %p1, %rd10, %rd3;
      @%p1 bra        $L__BB2_1;
```

If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" :
"=l"(flag) : "l"(flag_addr));`, this reduces to 4 instructions, and we
get 2.1 us for 8-rank signal/wait:
```
        ld.u64  %rd9, [%rd1+-24];
        ld.global.acquire.sys.b64 %rd8, [%rd9];
        setp.lt.u64     %p1, %rd8, %rd2;
        @%p1 bra        $L__BB2_1;
```
2025-04-16 14:22:10 -07:00
Qinghua Zhou
f1115210bf Fix the virtual address mapping issue of cuMemMap in fallback code (#501)
Fix the virtual address mapping issue of cuMemMap by using the size of
device memory allocation instead of user buffer size
2025-04-16 14:16:52 -07:00
Binyang Li
adc9ee5684 Export mscclpp GpuBuffer to dlpack format (#492)
To use NVLS, mscclpp requires that the buffer be allocated by
mscclpp::GpuBuffer. Because cupy doesn't support bfloat16 yet, we export
the raw buffer in dlpack format.
Users can use this feature to create buffers with types supported by
pytorch:
```python
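import torch  # provides torch.bfloat16 and torch.utils.dlpack

# RawGpuBuffer is provided by the mscclpp Python bindings.
buffer = RawGpuBuffer(1024 * 2)  # size in bytes; 2 bytes per bfloat16 element
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)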
```
2025-04-03 12:59:32 -07:00
Changho Hwang
5a7a59ff14 Fix CMake installation in Dockerfile for arm64 (#491) 2025-03-31 17:38:47 +00:00
Changho Hwang
3aeb1cb9c6 Add a devcontainer configuration (#490) 2025-03-31 10:34:11 -07:00
Changho Hwang
def68ced64 Add CUDA 12.8 images (#488) 2025-03-29 00:31:26 +00:00
Binyang Li
a3d8d6807b Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support (#489)
Remove the requirement for
`CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support.

Fix #487
2025-03-28 16:46:54 -07:00
Qinghua Zhou
0f21ed44b8 Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations (#485)
Add CI test for fallback allgather, allreduce, broadcast, and
reducescatter to NCCL operations
Test the following parameters:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE 
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce,
broadcast, reducescatter" or "all"
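
One way to interpret the operation list is sketched below. This is an assumption about how a comma-separated value with an `all` shorthand could be parsed, not the parsing code in mscclpp; `parse_fallback_ops` and `KNOWN_OPS` are hypothetical names.

```python
from typing import Set

# Operation names taken from the parameter description above.
KNOWN_OPS = ("allgather", "allreduce", "broadcast", "reducescatter")


def parse_fallback_ops(value: str) -> Set[str]:
    """Turn a comma-separated operation list into a set of names."""
    names = [n.strip().lower() for n in value.split(",") if n.strip()]
    if "all" in names:
        return set(KNOWN_OPS)  # "all" selects every known operation
    unknown = set(names) - set(KNOWN_OPS)
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    return set(names)


print(sorted(parse_fallback_ops("allgather, allreduce")))
print(sorted(parse_fallback_ops("all")))
```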
2025-03-27 21:13:07 +00:00
Caio Rocha
ac5cc647e0 Reduce Operation Support to the Executor (#484)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-25 13:58:12 -07:00
Binyang Li
b4062462fd Fix reduceMin failure issue (#486)
Remove the reduceOp check, as this is already done in the `getReduceOp`
method
2025-03-25 10:15:24 -07:00