Commit Graph

745 Commits

Binyang Li
affca7d9bc Add NVLS based fallback algo (#507)
Add two NVLS-based fallback algorithms. `allreduce9` is for NVLS with zero copy. `allreduce10` is for NVLS cases that need to copy data to a scratch buffer, perform the reduce operation there, and then copy the result back to the result buffer.

Perf numbers for `allreduce9`:
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.45    0.19    0.33      0     5.35    0.19    0.33      0
        2048           512     float     sum      -1     5.57    0.37    0.64      0     5.53    0.37    0.65      0
        4096          1024     float     sum      -1     5.80    0.71    1.24      0     5.78    0.71    1.24      0
        8192          2048     float     sum      -1     5.94    1.38    2.42      0     5.85    1.40    2.45      0
       16384          4096     float     sum      -1     6.40    2.56    4.48      0     6.27    2.61    4.57      0
       32768          8192     float     sum      -1     7.45    4.40    7.70      0     7.39    4.43    7.76      0
       65536         16384     float     sum      -1     8.03    8.17   14.29      0     8.32    7.88   13.79      0
      131072         32768     float     sum      -1     7.28   18.00   31.49      0     7.07   18.53   32.43      0
      262144         65536     float     sum      -1     7.72   33.95   59.41      0     7.59   34.56   60.48      0
      524288        131072     float     sum      -1     8.70   60.29  105.51      0     8.37   62.61  109.57      0
     1048576        262144     float     sum      -1    10.56   99.26  173.70      0    10.32  101.64  177.87      0
     2097152        524288     float     sum      -1    14.45  145.14  253.99      0    14.02  149.58  261.76      0
     4194304       1048576     float     sum      -1    22.83  183.73  321.52      0    23.03  182.14  318.75      0
     8388608       2097152     float     sum      -1    38.63  217.14  380.00      0    38.57  217.52  380.65      0
    16777216       4194304     float     sum      -1    70.03  239.58  419.27      0    69.96  239.80  419.66      0
    33554432       8388608     float     sum      -1    131.5  255.17  446.55      0    131.3  255.59  447.28      0
    67108864      16777216     float     sum      -1    255.8  262.37  459.15      0    255.4  262.75  459.82      0
   134217728      33554432     float     sum      -1    500.9  267.94  468.90      0    500.0  268.42  469.74      0
   268435456      67108864     float     sum      -1    989.0  271.41  474.97      0    988.9  271.45  475.05      0
   536870912     134217728     float     sum      -1   1967.4  272.88  477.54      0   1966.0  273.08  477.88      0
  1073741824     268435456     float     sum      -1   3908.5  274.72  480.77      0   3904.6  274.99  481.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 218.734 
```

Perf numbers for `allreduce10`:
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.60    0.18    0.32      0     5.52    0.19    0.32      0
        2048           512     float     sum      -1     5.79    0.35    0.62      0     5.64    0.36    0.64      0
        4096          1024     float     sum      -1     5.92    0.69    1.21      0     5.82    0.70    1.23      0
        8192          2048     float     sum      -1     6.03    1.36    2.38      0     5.95    1.38    2.41      0
       16384          4096     float     sum      -1     6.58    2.49    4.35      0     6.39    2.56    4.49      0
       32768          8192     float     sum      -1     7.54    4.34    7.60      0     7.41    4.42    7.74      0
       65536         16384     float     sum      -1     7.95    8.24   14.42      0     8.10    8.09   14.16      0
      131072         32768     float     sum      -1     9.56   13.72   24.00      0     9.47   13.84   24.23      0
      262144         65536     float     sum      -1    11.49   22.81   39.92      0    11.41   22.97   40.20      0
      524288        131072     float     sum      -1    14.19   36.94   64.64      0    13.88   37.76   66.09      0
     1048576        262144     float     sum      -1    19.10   54.89   96.06      0    18.98   55.24   96.67      0
     2097152        524288     float     sum      -1    31.12   67.38  117.91      0    31.34   66.92  117.10      0
     4194304       1048576     float     sum      -1    44.88   93.46  163.56      0    44.76   93.70  163.97      0
     8388608       2097152     float     sum      -1    63.23  132.68  232.18      0    62.53  134.14  234.75      0
    16777216       4194304     float     sum      -1    106.8  157.03  274.80      0    105.9  158.46  277.30      0
    33554432       8388608     float     sum      -1    172.2  194.91  341.09      0    172.0  195.05  341.35      0
    67108864      16777216     float     sum      -1    299.8  223.83  391.70      0    300.8  223.12  390.46      0
   134217728      33554432     float     sum      -1    553.1  242.66  424.66      0    553.8  242.38  424.16      0
   268435456      67108864     float     sum      -1   1056.1  254.18  444.82      0   1057.4  253.86  444.26      0
   536870912     134217728     float     sum      -1   2064.0  260.11  455.20      0   2063.8  260.14  455.25      0
  1073741824     268435456     float     sum      -1   4074.4  263.53  461.18      0   4065.8  264.09  462.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 169.799 
```

---------

Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-04-27 14:09:31 -07:00
Changho Hwang
b310783603 Fix #508 (#515)
* Wrong offsets in `unpackPackets()`
* Added Python binding of `BaseMemoryChannel`
2025-04-25 09:52:05 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` class into a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency.
* Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method, which lacked use cases.
2025-04-25 00:02:56 +00:00
Nusrat Islam
9df2bdb2bf apps/nccl: fix a bug in allreduce kernels for graph mode (#502)
`allreduce7` and `allreduceAllpairs` kernels were updating the LL
protocol flag on the host side. So, it was not properly captured in
graph mode. This PR fixes the issue by updating the flag in the kernels.
2025-04-24 16:43:47 -07:00
Changho Hwang
cbdcf9064c Use implicit ctors for default device ctors (#512)
By using implicit constructors, the compiler doesn't need to dynamically
initialize the instances.
2025-04-24 12:38:19 -07:00
Caio Rocha
7a25e51b07 Automatic creation of Scratch Buffer at MSCCLLang (#510) 2025-04-23 16:37:14 -07:00
Changho Hwang
474ef0b696 Optimized allreduce fallback for ~10KB sizes (#506)
* Pass the op type as a template parameter
* Use the all-pairs algorithm for ~10KB
* Don't write channel handles on the shared memory for small sizes
* A reduction bug fix & cleanup
2025-04-23 10:38:15 -07:00
Binyang Li
7da11b35d5 Add flag to disable nvls (#500)
Mitigates issue #496: for now, `ibv_reg_dmabuf_mr` is not supported on Azure VMs. This flag forces the use of cudaMalloc for memory allocation and disables the NVLS feature.
2025-04-22 17:09:19 -07:00
Binyang Li
06f31994dc Fix performance issue introduced in PR #499 (#505)
1. Use `fence+relaxed` to replace `release` for the FIFO; `fence+relaxed` is more efficient on A100.
2. Update the deviceSyncer. The previous one could not handle changes in the thread-block count correctly. Use three counters to solve this issue: reset the previous counter before syncing on the current counter.
3. Introduce `relaxedWait`, which can be used with `relaxedSignal` for cases that don't need to guarantee memory visibility.
2025-04-22 14:03:37 -07:00
Binyang Li
e412804eab Improve signal/wait performance and fix barrier issue (#499)
Remove `__assert_fail` for release builds. This reduces the number of PTX instructions inside the loop and also tries to resolve the issue reported in #497. Reduces the number of PTX instructions from 8 to 6; 8-rank signal/wait improves from 3.2us to 2.8us on NDv5.
Also, the NDEBUG flag is confusing here (sometimes it is not set), so use a customized flag for debug builds.

Here is the current PTX:
```
      ld.u64  %rd12, [%rd2+-24];
      mov.u64         %rd13, %rd12;
      mov.u64         %rd11, %rd13;
      ld.acquire.sys.b64 %rd10,[%rd11];
      setp.lt.u64     %p1, %rd10, %rd3;
      @%p1 bra        $L__BB2_1;
```

If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" :
"=l"(flag) : "l"(flag_addr));`, the loop reduces to 4 instructions, and we get
2.1 us for 8-rank signal/wait:
```
        ld.u64  %rd9, [%rd1+-24];
        ld.global.acquire.sys.b64 %rd8, [%rd9];
        setp.lt.u64     %p1, %rd8, %rd2;
        @%p1 bra        $L__BB2_1;
```
2025-04-16 14:22:10 -07:00
Qinghua Zhou
f1115210bf Fix the virtual address mapping issue of cuMemMap in fallback code (#501)
Fix the virtual address mapping issue of cuMemMap by using the size of
device memory allocation instead of user buffer size
2025-04-16 14:16:52 -07:00
Binyang Li
adc9ee5684 Export mscclpp GpuBuffer to dlpack format (#492)
In mscclpp, using NVLS requires the buffer to be allocated by `mscclpp::GpuBuffer`. Since cupy doesn't support bfloat16 yet, we export the raw buffer in dlpack format.
Users can use this feature to create a buffer with a type supported by pytorch:
```python
buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)
```
2025-04-03 12:59:32 -07:00
Changho Hwang
5a7a59ff14 Fix CMake installation in Dockerfile for arm64 (#491) 2025-03-31 17:38:47 +00:00
Changho Hwang
3aeb1cb9c6 Add a devcontainer configuration (#490) 2025-03-31 10:34:11 -07:00
Changho Hwang
def68ced64 Add CUDA 12.8 images (#488) 2025-03-29 00:31:26 +00:00
Binyang Li
a3d8d6807b Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support (#489)
Remove the requirement for
`CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support.

Fix #487
2025-03-28 16:46:54 -07:00
Qinghua Zhou
0f21ed44b8 Add CI test for fallback allgather, allreduce, broadcast and reducescatter to NCCL operations (#485)
Add CI test for falling back allgather, allreduce, broadcast, and
reducescatter to NCCL operations.
Tests the following parameters:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce,
broadcast, reducescatter" or "all"
2025-03-27 21:13:07 +00:00
Caio Rocha
ac5cc647e0 Reduce Operation Support to the Executor (#484)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-25 13:58:12 -07:00
Binyang Li
b4062462fd Fix reduceMin failure issue (#486)
Remove the reduceOp check, as this is already done in the `getReduceOp` method.
2025-03-25 10:15:24 -07:00
Qinghua Zhou
a7c364beb8 nccl/rccl integration (#469)
Use dlopen to load nccl/rccl APIs from a shared library to
enable Allgather, Allreduce, Broadcast, and ReduceScatter to fall back to nccl/rccl operations.

Add three related environment variables:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"
By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all these operations will fall back to the nccl/rccl APIs.
---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-20 11:31:37 -07:00
Binyang Li
89f7573adf Fix correctness issue when mscclppDisableChannelCache set to true (#483)
If `mscclppDisableChannelCache` is set to true, we need to keep every
channel's information alive to avoid the channel info on the GPU side being released.
2025-03-19 14:55:37 -07:00
Caio Rocha
b6a179faff NCCL API CI Test for ReduceScatter (#465)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-17 13:58:32 -07:00
Binyang Li
f124dc1df9 Add min operation for allreduce (#481)
Add min operation for allreduce
2025-03-16 20:47:36 -07:00
Binyang Li
0b840baa05 Update allgather fallback algo (#476)
Enhancements to the all-gather operation; a temporary solution to fix the
memory overhead when integrating msccl++ with pytorch.
This solution does not register the input/output buffers with msccl++, so the
temp output buffer for allgather can be reused by torch automatically.

* Introduced a new `allgather8` kernel function in
`apps/nccl/src/allgather.hpp` to handle larger data sizes more
efficiently. This includes double buffering to hide synchronization
overhead and support for both in-place and out-of-place operations.
* Modified the `allgather` function to decide between `allgather6` and
`allgather8` based on data size and platform, improving performance for
large data sizes.

Configuration and environment improvements:

* Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to
control whether the channel cache is disabled, enhancing
configurability. This variable is now part of the `Env` class and is
logged during environment initialization.
* Removed the redundant global variable `mscclppDisableChannelCache`
from `src/debug.cc` and updated its usage to refer to the new
environment variable.
2025-03-14 11:18:03 -07:00
Changho Hwang
e4012ded48 Mark mscclpp-test as deprecated in the doc (#478) 2025-03-11 22:44:38 +00:00
Binyang Li
79b5eefa6c Fix memory OOM issue (#479)
This pull request includes changes to improve memory management in
GPU-related functions by ensuring proper release of memory handles.
Fix #470
2025-03-10 11:15:17 -07:00
Caio Rocha
bac3c90f6a Improving Get Operation at MSCCLLang (#475)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-05 09:41:38 -08:00
Yang Wang
1ff217d5f3 Fix minor typos and errors in documentation (#474) 2025-02-28 17:46:24 -08:00
Caio Rocha
6074eeeac9 Adjust NPKit IB Event (#472) 2025-02-28 10:16:47 -08:00
Caio Rocha
986c45b71a NPKit Support to Read Put Packet Operation (#471) 2025-02-27 12:02:16 -08:00
Caio Rocha
b3992a8b29 Adding Read Put Packet operation at Executor (#441) 2025-02-26 09:29:43 -08:00
Caio Rocha
0222bb324d Adjusting AllGather Collective in MSCCLLang (#466)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-25 08:35:26 -08:00
Qinghua Zhou
591276f9d0 Disable channel cache (#463)
Add workaround of disabling channel cache.
Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE
(Default value: False)
In this PR, some other features (e.g., ncclCommSplit) come from branch
binyangli/nccl-api

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-19 19:26:12 +00:00
Caio Rocha
8a564977e5 Updating MSCCLLang Examples (#462)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-19 09:48:31 -08:00
Caio Rocha
55789bc551 Support ReduceScatter in the NCCL interface (#460)
Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net>
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-02-11 13:28:19 -08:00
Binyang Li
a6e00cc449 remove unnecessary sync (#461)
The `nop` instruction is only for synchronization within the same
threadblock; cross-threadblock synchronization is handled by the `barrier`
instruction. So insert `nop` only if the dependency is within the same
threadblock.
2025-02-10 15:31:49 +08:00
Caio Rocha
e7cff899ce Adjusting BFS to seek circular dependencies in the msccl-tools DAG (#459)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-07 11:24:27 -08:00
Binyang Li
7f3b088744 Add multi-nodes example & update doc (#455)
Documentation update:

* `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path.

New example script:

* `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication.

IR module improvements:

* `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order.

Debugging enhancements:

* `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective.
* `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.
2025-01-31 17:52:15 -08:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Binyang Li
af0bb86e07 Merge mscclpp-lang to mscclpp project (#442)
First step to merge msccl-tools into the mscclpp repo. This step moves all msccl-related code, passes the current tests, and does some necessary refactoring.

* Add `mscclpp.language` module
* Add `_InstructionOptimizer` and `DagOptimizer` classes to optimize the DAG
* Add `DagLower` to lower the DAG to an intermediate representation
* Add documents for mscclpp.language
* Remove msccl-related code
2025-01-22 09:47:37 -08:00
Changho Hwang
4ee15b7ad0 Fix PR #449 (#453) 2025-01-15 11:59:12 -08:00
Changho Hwang
d12247b54a Lazily create streams for CudaIpcConnection (#449) 2025-01-15 11:50:02 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Binyang Li
8ac50dc85d Resolve cuMemMap error (#451)
* Updated `RegisteredMemory::Impl::Impl(const std::vector<char>&
serialization)` to use both the minimum and recommended granularities for
memory address reservation and mapping. This resolves the cuMemMap error.
2025-01-10 14:22:14 -08:00
Changho Hwang
2b54af7e27 Auto-update version numbers in CMakeLists.txt (#450) 2025-01-09 17:54:10 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Caio Rocha
80abce59ef Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-01-08 09:02:36 -08:00
Changho Hwang
1989d4be9c Fix CMake build messages (#443) 2025-01-08 02:44:01 +00:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
6d26b92665 Fix azure pipeline (#437) 2025-01-04 19:41:10 -08:00