Commit Graph

43 Commits

Author SHA1 Message Date
Changho Hwang
199468bc47 Revise NVLS interface (#458)
* Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel`
* Minor interface improvements
2025-07-12 00:33:03 +00:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Binyang Li
65c10fa8ec Support any GPUs per node for NCCL_API (#566)
Support any GPUs per node for NCCL API
2025-07-11 13:42:39 -07:00
Binyang Li
d1869011c2 Add device semaphore API (#523)
Add the deviceSemaphore structure and implement a new NVLS-based algorithm
to show how to use these APIs. Current perf for the NVLS non-zero-copy version is:
```
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           512      half     sum      -1     6.10    0.17    0.29      0     5.65    0.18    0.32      0
        2048          1024      half     sum      -1     5.94    0.35    0.60      0     5.85    0.35    0.61      0
        4096          2048      half     sum      -1     6.11    0.67    1.17      0     5.97    0.69    1.20      0
        8192          4096      half     sum      -1     6.22    1.32    2.31      0     6.17    1.33    2.33      0
       16384          8192      half     sum      -1     6.68    2.45    4.29      0     6.52    2.51    4.39      0
       32768         16384      half     sum      -1     8.02    4.09    7.15      0     7.66    4.28    7.49      0
       65536         32768      half     sum      -1     8.09    8.10   14.18      0     7.91    8.29   14.51      0
      131072         65536      half     sum      -1     9.58   13.68   23.93      0     9.61   13.64   23.86      0
      262144        131072      half     sum      -1    12.60   20.81   36.42      0    12.28   21.35   37.37      0
      524288        262144      half     sum      -1    14.51   36.12   63.22      0    14.09   37.21   65.12      0
     1048576        524288      half     sum      -1    19.45   53.92   94.36      0    19.29   54.35   95.12      0
     2097152       1048576      half     sum      -1    31.00   67.66  118.40      0    30.80   68.08  119.14      0
     4194304       2097152      half     sum      -1    44.71   93.80  164.16      0    44.66   93.91  164.34      0
     8388608       4194304      half     sum      -1    62.96  133.24  233.17      0    62.49  134.24  234.91      0
    16777216       8388608      half     sum      -1    105.1  159.68  279.45      0    104.4  160.74  281.29      0
    33554432      16777216      half     sum      -1    169.9  197.55  345.71      0    169.8  197.64  345.87      0
    67108864      33554432      half     sum      -1    298.1  225.12  393.96      0    298.1  225.09  393.91      0
   134217728      67108864      half     sum      -1    552.9  242.77  424.84      0    553.7  242.39  424.18      0
   268435456     134217728      half     sum      -1   1055.8  254.24  444.91      0   1056.9  253.98  444.47      0
   536870912     268435456      half     sum      -1   2040.1  263.15  460.52      0   2045.1  262.52  459.40      0
  1073741824     536870912      half     sum      -1   3996.9  268.65  470.13      0   4007.7  267.92  468.86      0
```
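
For readers of the table above, here is a minimal sketch of how the `algbw` and `busbw` columns relate, following the convention documented for nccl-tests: `algbw` is bytes divided by time, and for allreduce `busbw` is `algbw` scaled by `2 * (n - 1) / n`. The function names are hypothetical, and the rank count of 8 used in the usage note is an assumption inferred from the constant 1.75 ratio between the two columns; the commit message does not state it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration; not part of MSCCL++ or nccl-tests.

// Algorithm bandwidth in GB/s (1 GB = 1e9 bytes) for a message of `bytes`
// bytes that took `timeUs` microseconds.
double algBandwidthGBs(std::size_t bytes, double timeUs) {
  assert(timeUs > 0.0);
  return static_cast<double>(bytes) / (timeUs * 1e-6) / 1e9;
}

// Bus bandwidth for allreduce: algbw scaled by 2 * (n - 1) / n,
// where n is the number of ranks (assumed, not given in the commit).
double allreduceBusBandwidthGBs(double algbw, int nRanks) {
  assert(nRanks > 1);
  return algbw * 2.0 * (nRanks - 1) / nRanks;
}
```

With the last out-of-place row (1073741824 bytes in 3996.9 us) and an assumed 8 ranks, these functions give values close to the reported 268.65 and 470.13 GB/s; small differences come from the rounded time column.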
2025-05-20 09:32:38 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* When the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, this change restores it. Now when `get()` is called
on a future, the async function first calls `wait()` on the latest
previously returned future. Recursively, this calls
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
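
The ordering rule in the first bullet can be illustrated with a minimal single-threaded sketch. `OrderedReceiver` and its `recv()` method are hypothetical stand-ins, not the MSCCL++ implementation; the sketch only assumes that each returned future is deferred and holds a copy of the previously returned future.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a receiver; not the MSCCL++ implementation.
// Single-threaded illustration only.
class OrderedReceiver {
 public:
  explicit OrderedReceiver(std::vector<int> incoming)
      : state_(std::make_shared<State>(State{std::move(incoming), 0})) {}

  // Returns a deferred future. Its result is the next message in call order,
  // regardless of the order in which get() is later invoked.
  std::shared_future<int> recv() {
    std::shared_future<int> prev = last_;  // latest previously returned future
    std::shared_ptr<State> state = state_;
    last_ = std::async(std::launch::deferred, [prev, state]() {
              // Resolve the previous future first; this recurses through
              // every earlier future that is not yet ready.
              if (prev.valid()) prev.wait();
              assert(state->next < state->messages.size());
              return state->messages[state->next++];
            }).share();
    return last_;
  }

 private:
  struct State {
    std::vector<int> messages;  // data arriving from the remote rank, in order
    std::size_t next;           // index of the next message to hand out
  };
  std::shared_ptr<State> state_;
  std::shared_future<int> last_;
};
```

Calling `recv()` three times and then calling `get()` on the third future first still hands the third message to the third future, because its deferred function waits on the second future, which in turn waits on the first.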
2025-05-13 13:43:35 -07:00
Binyang Li
affca7d9bc Add NVLS based fallback algo (#507)
Add two NVLS-based fallback algorithms. allreduce9 is for NVLS with zero copy.
allreduce10 is for NVLS that needs to copy to a scratch buffer, perform the
reduce operation, then copy the result back to the result buffer.

Perf number for allreduce9
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.45    0.19    0.33      0     5.35    0.19    0.33      0
        2048           512     float     sum      -1     5.57    0.37    0.64      0     5.53    0.37    0.65      0
        4096          1024     float     sum      -1     5.80    0.71    1.24      0     5.78    0.71    1.24      0
        8192          2048     float     sum      -1     5.94    1.38    2.42      0     5.85    1.40    2.45      0
       16384          4096     float     sum      -1     6.40    2.56    4.48      0     6.27    2.61    4.57      0
       32768          8192     float     sum      -1     7.45    4.40    7.70      0     7.39    4.43    7.76      0
       65536         16384     float     sum      -1     8.03    8.17   14.29      0     8.32    7.88   13.79      0
      131072         32768     float     sum      -1     7.28   18.00   31.49      0     7.07   18.53   32.43      0
      262144         65536     float     sum      -1     7.72   33.95   59.41      0     7.59   34.56   60.48      0
      524288        131072     float     sum      -1     8.70   60.29  105.51      0     8.37   62.61  109.57      0
     1048576        262144     float     sum      -1    10.56   99.26  173.70      0    10.32  101.64  177.87      0
     2097152        524288     float     sum      -1    14.45  145.14  253.99      0    14.02  149.58  261.76      0
     4194304       1048576     float     sum      -1    22.83  183.73  321.52      0    23.03  182.14  318.75      0
     8388608       2097152     float     sum      -1    38.63  217.14  380.00      0    38.57  217.52  380.65      0
    16777216       4194304     float     sum      -1    70.03  239.58  419.27      0    69.96  239.80  419.66      0
    33554432       8388608     float     sum      -1    131.5  255.17  446.55      0    131.3  255.59  447.28      0
    67108864      16777216     float     sum      -1    255.8  262.37  459.15      0    255.4  262.75  459.82      0
   134217728      33554432     float     sum      -1    500.9  267.94  468.90      0    500.0  268.42  469.74      0
   268435456      67108864     float     sum      -1    989.0  271.41  474.97      0    988.9  271.45  475.05      0
   536870912     134217728     float     sum      -1   1967.4  272.88  477.54      0   1966.0  273.08  477.88      0
  1073741824     268435456     float     sum      -1   3908.5  274.72  480.77      0   3904.6  274.99  481.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 218.734 
```

Perf number for allreduce10
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.60    0.18    0.32      0     5.52    0.19    0.32      0
        2048           512     float     sum      -1     5.79    0.35    0.62      0     5.64    0.36    0.64      0
        4096          1024     float     sum      -1     5.92    0.69    1.21      0     5.82    0.70    1.23      0
        8192          2048     float     sum      -1     6.03    1.36    2.38      0     5.95    1.38    2.41      0
       16384          4096     float     sum      -1     6.58    2.49    4.35      0     6.39    2.56    4.49      0
       32768          8192     float     sum      -1     7.54    4.34    7.60      0     7.41    4.42    7.74      0
       65536         16384     float     sum      -1     7.95    8.24   14.42      0     8.10    8.09   14.16      0
      131072         32768     float     sum      -1     9.56   13.72   24.00      0     9.47   13.84   24.23      0
      262144         65536     float     sum      -1    11.49   22.81   39.92      0    11.41   22.97   40.20      0
      524288        131072     float     sum      -1    14.19   36.94   64.64      0    13.88   37.76   66.09      0
     1048576        262144     float     sum      -1    19.10   54.89   96.06      0    18.98   55.24   96.67      0
     2097152        524288     float     sum      -1    31.12   67.38  117.91      0    31.34   66.92  117.10      0
     4194304       1048576     float     sum      -1    44.88   93.46  163.56      0    44.76   93.70  163.97      0
     8388608       2097152     float     sum      -1    63.23  132.68  232.18      0    62.53  134.14  234.75      0
    16777216       4194304     float     sum      -1    106.8  157.03  274.80      0    105.9  158.46  277.30      0
    33554432       8388608     float     sum      -1    172.2  194.91  341.09      0    172.0  195.05  341.35      0
    67108864      16777216     float     sum      -1    299.8  223.83  391.70      0    300.8  223.12  390.46      0
   134217728      33554432     float     sum      -1    553.1  242.66  424.66      0    553.8  242.38  424.16      0
   268435456      67108864     float     sum      -1   1056.1  254.18  444.82      0   1057.4  253.86  444.26      0
   536870912     134217728     float     sum      -1   2064.0  260.11  455.20      0   2063.8  260.14  455.25      0
  1073741824     268435456     float     sum      -1   4074.4  263.53  461.18      0   4065.8  264.09  462.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 169.799 
```

---------

Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-04-27 14:09:31 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Nusrat Islam
9df2bdb2bf apps/nccl: fix a bug in allreduce kernels for graph mode (#502)
`allreduce7` and `allreduceAllpairs` kernels were updating the LL
protocol flag on the host side. So, it was not properly captured in
graph mode. This PR fixes the issue by updating the flag in the kernels.
2025-04-24 16:43:47 -07:00
Changho Hwang
474ef0b696 Optimized allreduce fallback for ~10KB sizes (#506)
* Pass the op type as a template parameter
* Use the all-pairs algorithm for ~10KB
* Don't write channel handles on the shared memory for small sizes
* A reduction bug fix & cleanup
2025-04-23 10:38:15 -07:00
Binyang Li
b4062462fd Fix reduceMin failure issue (#486)
Remove the reduceOp check, as this is already done in the `getReduceOp` method
2025-03-25 10:15:24 -07:00
Qinghua Zhou
a7c364beb8 nccl/rccl integration (#469)
Use dlopen to load NCCL/RCCL APIs from a shared library so that
Allgather, Allreduce, Broadcast, and ReduceScatter can fall back to NCCL/RCCL operations.

Add three related environment variables:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"
By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all of these operations fall back to the NCCL/RCCL APIs.
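
As a rough illustration of the selection rule described above, the following sketch decides whether an operation is forced to fall back. `forcesFallback` is a hypothetical helper written for this example, not MSCCL++ code, and it assumes the value is a plain comma-separated list without spaces, with an empty string standing for an unset variable.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper, not MSCCL++ code: decides whether `op` falls back to
// NCCL/RCCL given the value of MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION.
// An empty value (variable not specified) or "all" selects every operation.
bool forcesFallback(const std::string& setting, const std::string& op) {
  assert(!op.empty());
  if (setting.empty() || setting == "all") return true;
  std::istringstream items(setting);
  std::string item;
  // Compare `op` against each comma-separated entry.
  while (std::getline(items, item, ',')) {
    if (item == op) return true;
  }
  return false;
}
```

For example, with the value `allreduce,allgather`, a call for `allgather` returns true and a call for `broadcast` returns false.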
---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-20 11:31:37 -07:00
Binyang Li
89f7573adf Fix correctness issue when mscclppDisableChannelCache is set to true (#483)
If `mscclppDisableChannelCache` is set to true, we need to keep every
channel's information so that the channel info on the GPU side is not released.
2025-03-19 14:55:37 -07:00
Binyang Li
f124dc1df9 Add min operation for allreduce (#481)
Add min operation for allreduce
2025-03-16 20:47:36 -07:00
Binyang Li
0b840baa05 Update allgather fallback algo (#476)
Enhancements to the all-gather operation: a temporary solution to the
memory overhead when integrating MSCCL++ with PyTorch.
This solution does not register the input/output buffers with MSCCL++, so the
temporary output buffer for allgather can be reused by torch automatically.

* Introduced a new `allgather8` kernel function in
`apps/nccl/src/allgather.hpp` to handle larger data sizes more
efficiently. This includes double buffering to hide synchronization
overhead and support for both in-place and out-of-place operations.
* Modified the `allgather` function to decide between `allgather6` and
`allgather8` based on data size and platform, improving performance for
large data sizes.

Configuration and environment improvements:

* Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to
control whether the channel cache is disabled, enhancing
configurability. This variable is now part of the `Env` class and is
logged during environment initialization.
* Removed the redundant global variable `mscclppDisableChannelCache`
from `src/debug.cc` and updated its usage to refer to the new
environment variable.
2025-03-14 11:18:03 -07:00
Qinghua Zhou
591276f9d0 Disable channel cache (#463)
Add workaround of disabling channel cache.
Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE
(Default value: False)
In this PR, some other features (e.g., ncclCommSplit) come from branch
binyangli/nccl-api

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-19 19:26:12 +00:00
Caio Rocha
55789bc551 Support ReduceScatter in the NCCL interface (#460)
Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net>
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-02-11 13:28:19 -08:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
6d26b92665 Fix azure pipeline (#437) 2025-01-04 19:41:10 -08:00
Pedram Alizadeh
97eaca2bd2 [NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399) 2025-01-03 20:38:57 +00:00
Qinghua Zhou
ba0d0d68b8 Enhance the nccl error message handling (#434)
Add WARN or INFO before returning the nccl error message.
Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.
2025-01-03 00:50:36 +00:00
Changho Hwang
e2230aab26 Tackle build warnings (#422)
* Comply with
[CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html)
* Tackle other warnings during build
2024-12-19 16:51:50 -08:00
SreevatsaAnantharamu
0c7ed2c674 Add ncclBcast / ncclBroadcast support (#419)
A simple broadcast using scratch buffer and option to use an executor.
2024-12-19 01:16:30 +00:00
David Sidler
d8d0dfbffa Fix synchronization in allreduce8 kernel (#407)
Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a
synchronization bug. The PR addresses it by ensuring that signals are
only issued after all threads in the block have issued their writes to
guarantee correct ordering between data writes and signal writes.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 17:10:22 -08:00
Caio Rocha
774602d49c Supporting Executor multi node in NCCL API (#412)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-12-18 15:50:58 -08:00
Binyang Li
fcb2e46cb1 NVLS support for NCCL API (#410)
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 09:55:35 +00:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root -x
LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w
10 -n 20
2024-12-03 22:39:20 +00:00
Caio Rocha
d9c297ba14 AllGather Executor Support in NCCL Interface (#393)
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-27 17:05:51 -08:00
Caio Rocha
93628d2066 Fixing Message Boundary AllReduce Fallback Code (#391) 2024-11-23 12:15:56 -08:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support more datatypes for multicast operation
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Binyang Li
4136153a76 [Doc] mscclpp docs (#348)
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy GitHub Pages.
Doc link: https://microsoft.github.io/mscclpp

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-10-18 06:08:31 +00:00
Changho Hwang
f8c0bcca2b Perf optimization & support clipping (#364)
Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com>
2024-10-16 14:35:08 -07:00
Changho Hwang
e9294357c5 Fix NCCL API bugs (#363) 2024-10-16 14:16:34 -07:00
Binyang Li
b30bb260e3 Tune threads per block for mscclpp executor (#345) 2024-09-18 17:21:47 -07:00
Changho Hwang
1e82dd444f Make ibverbs optional at compile time (#340)
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-08-21 12:47:05 -07:00
Changho Hwang
8c6fb429e9 bfloat16 support (#336)
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
caiomcbr
67eb9b04cc NCCL API Executor Integration (#331)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-25 15:05:02 -07:00
caiomcbr
7493e2f075 Double buffering for NCCL APIs (#324)
Using two scratch buffers in each peer to exchange data.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-15 22:18:53 +00:00
Changho Hwang
c4ca2fbc8c Resolve clang++ warnings (#325) 2024-07-11 07:48:35 +00:00
caiomcbr
f4c3c8f916 AllReduce Kernel for Small Messages (#322)
Adding allreduce kernel code for message sizes smaller than 32 bytes,
when the number of elements is smaller than the number of ranks.

---------

Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-05 21:08:43 +00:00
caiomcbr
b1b9d0626c Support NCCL APIs (#319)
Start supporting NCCL APIs with a few limitations.

---------

Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-06-27 23:54:06 +00:00