Commit Graph

426 Commits

Author SHA1 Message Date
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
c184485808 DLPack fixes (#537)
* Fix typos in type name
* Make it work without current context set
2025-05-27 21:40:50 +00:00
Changho Hwang
7278b51e61 Rename ChannelTrigger fields and check field values in debug builds (#529) 2025-05-27 14:36:22 -07:00
Caio Rocha
29c3af2ac6 Properly setting up the device in Ethernet Connection (#527)
When we create the thread to receive messages in the Ethernet
Connection, it resets the Device ID, causing faults in the Ethernet
Connection unit tests.


![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d)

This PR aims to properly set up the device when the thread is created.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-05-19 10:05:45 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
d636093336 Asynchronous setup (#514)
Cherry-picked a part of features from #167: now `Communicator::setup()`
is unneeded. `Communicator::sendMemory()` conducts the task inline, and
`Communicator::recvMemory()` and `Communicator::connect()` conducts the
task asynchronously without explicit setup.
2025-05-08 22:01:51 +00:00
Qinghua Zhou
b4f0af8f9f Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc (#513)
Fix #496
For buffer allocated by cuMemMalloc, use ibv_reg_dmabuf_mr to register a
dma-buf based memory region.
2025-05-07 17:26:14 -07:00
Binyang Li
affca7d9bc Add NVLS based fallback algo (#507)
Add two nvls based fallback algo. allreduce9 is for nvls with zero copy.
allreduce10 is for nvls need to copy to scratch buffer, do reduce
operation then copy result back to result buffer.

Perf number for allreduce9
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.45    0.19    0.33      0     5.35    0.19    0.33      0
        2048           512     float     sum      -1     5.57    0.37    0.64      0     5.53    0.37    0.65      0
        4096          1024     float     sum      -1     5.80    0.71    1.24      0     5.78    0.71    1.24      0
        8192          2048     float     sum      -1     5.94    1.38    2.42      0     5.85    1.40    2.45      0
       16384          4096     float     sum      -1     6.40    2.56    4.48      0     6.27    2.61    4.57      0
       32768          8192     float     sum      -1     7.45    4.40    7.70      0     7.39    4.43    7.76      0
       65536         16384     float     sum      -1     8.03    8.17   14.29      0     8.32    7.88   13.79      0
      131072         32768     float     sum      -1     7.28   18.00   31.49      0     7.07   18.53   32.43      0
      262144         65536     float     sum      -1     7.72   33.95   59.41      0     7.59   34.56   60.48      0
      524288        131072     float     sum      -1     8.70   60.29  105.51      0     8.37   62.61  109.57      0
     1048576        262144     float     sum      -1    10.56   99.26  173.70      0    10.32  101.64  177.87      0
     2097152        524288     float     sum      -1    14.45  145.14  253.99      0    14.02  149.58  261.76      0
     4194304       1048576     float     sum      -1    22.83  183.73  321.52      0    23.03  182.14  318.75      0
     8388608       2097152     float     sum      -1    38.63  217.14  380.00      0    38.57  217.52  380.65      0
    16777216       4194304     float     sum      -1    70.03  239.58  419.27      0    69.96  239.80  419.66      0
    33554432       8388608     float     sum      -1    131.5  255.17  446.55      0    131.3  255.59  447.28      0
    67108864      16777216     float     sum      -1    255.8  262.37  459.15      0    255.4  262.75  459.82      0
   134217728      33554432     float     sum      -1    500.9  267.94  468.90      0    500.0  268.42  469.74      0
   268435456      67108864     float     sum      -1    989.0  271.41  474.97      0    988.9  271.45  475.05      0
   536870912     134217728     float     sum      -1   1967.4  272.88  477.54      0   1966.0  273.08  477.88      0
  1073741824     268435456     float     sum      -1   3908.5  274.72  480.77      0   3904.6  274.99  481.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 218.734 
```

Perf number for allreduce10
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.60    0.18    0.32      0     5.52    0.19    0.32      0
        2048           512     float     sum      -1     5.79    0.35    0.62      0     5.64    0.36    0.64      0
        4096          1024     float     sum      -1     5.92    0.69    1.21      0     5.82    0.70    1.23      0
        8192          2048     float     sum      -1     6.03    1.36    2.38      0     5.95    1.38    2.41      0
       16384          4096     float     sum      -1     6.58    2.49    4.35      0     6.39    2.56    4.49      0
       32768          8192     float     sum      -1     7.54    4.34    7.60      0     7.41    4.42    7.74      0
       65536         16384     float     sum      -1     7.95    8.24   14.42      0     8.10    8.09   14.16      0
      131072         32768     float     sum      -1     9.56   13.72   24.00      0     9.47   13.84   24.23      0
      262144         65536     float     sum      -1    11.49   22.81   39.92      0    11.41   22.97   40.20      0
      524288        131072     float     sum      -1    14.19   36.94   64.64      0    13.88   37.76   66.09      0
     1048576        262144     float     sum      -1    19.10   54.89   96.06      0    18.98   55.24   96.67      0
     2097152        524288     float     sum      -1    31.12   67.38  117.91      0    31.34   66.92  117.10      0
     4194304       1048576     float     sum      -1    44.88   93.46  163.56      0    44.76   93.70  163.97      0
     8388608       2097152     float     sum      -1    63.23  132.68  232.18      0    62.53  134.14  234.75      0
    16777216       4194304     float     sum      -1    106.8  157.03  274.80      0    105.9  158.46  277.30      0
    33554432       8388608     float     sum      -1    172.2  194.91  341.09      0    172.0  195.05  341.35      0
    67108864      16777216     float     sum      -1    299.8  223.83  391.70      0    300.8  223.12  390.46      0
   134217728      33554432     float     sum      -1    553.1  242.66  424.66      0    553.8  242.38  424.16      0
   268435456      67108864     float     sum      -1   1056.1  254.18  444.82      0   1057.4  253.86  444.26      0
   536870912     134217728     float     sum      -1   2064.0  260.11  455.20      0   2063.8  260.14  455.25      0
  1073741824     268435456     float     sum      -1   4074.4  263.53  461.18      0   4065.8  264.09  462.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 169.799 
```

---------

Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-04-27 14:09:31 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Binyang Li
7da11b35d5 Add flag to disable nvls (#500)
Mitigate this issue: #496, for now `ibv_reg_dmabuf_mr` is not supported
by Azure vm. Add this flag to force to use cudaMalloc for memory
allocation and disable nvls feature
2025-04-22 17:09:19 -07:00
Qinghua Zhou
f1115210bf Fix the virtual address mapping issue of cuMemMap in fallback code (#501)
Fix the virtual address mapping issue of cuMemMap by using the size of
device memory allocation instead of user buffer size
2025-04-16 14:16:52 -07:00
Binyang Li
adc9ee5684 Export mscclpp GpuBuffer to dlpack format (#492)
For mscclpp, to use nvls we require the buffer is allocated by
mscclpp::GpuBuffer. Due to cupy doesn't support bfloat16 yet, we export
the raw buffer to dlpack format.
User can use this feature to create buffer with type supported by
pytorch
```python
buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)
```
2025-04-03 12:59:32 -07:00
Binyang Li
a3d8d6807b Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support (#489)
Remove the requirement for
`CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support.

Fix #487
2025-03-28 16:46:54 -07:00
Caio Rocha
ac5cc647e0 Reduce Operation Support to the Executor (#484)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-25 13:58:12 -07:00
Qinghua Zhou
a7c364beb8 nccl/rccl integration (#469)
Use dlopen to load nccl/rccl Apis from shared library to
enable Allgather, Allreduce, Broadcast, ReduceScatter fallback to nccl/rccl operations.

Add three related environment variables
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"
By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all these operations will be fallback to nccl/rccl apis.
---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-20 11:31:37 -07:00
Binyang Li
0b840baa05 Update allgather fallback algo (#476)
Enhancements to all-gather operation, a temporary solution to fix the
memory overhead when integrating msccl++ with pytorch.
This solution will not register input/output buffer to msccl++, so the
temp output buffer for allgather could be reused by torch automatically.

* Introduced a new `allgather8` kernel function in
`apps/nccl/src/allgather.hpp` to handle larger data sizes more
efficiently. This includes double buffering to hide synchronization
overhead and support for both in-place and out-of-place operations.
* Modified the `allgather` function to decide between `allgather6` and
`allgather8` based on data size and platform, improving performance for
large data sizes.

Configuration and environment improvements:

* Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to
control whether the channel cache is disabled, enhancing
configurability. This variable is now part of the `Env` class and is
logged during environment initialization.
* Removed the redundant global variable `mscclppDisableChannelCache`
from `src/debug.cc` and updated its usage to refer to the new
environment variable.
2025-03-14 11:18:03 -07:00
Binyang Li
79b5eefa6c Fix memory OOM issue (#479)
This pull request includes changes to improve memory management in
GPU-related functions by ensuring proper release of memory handles.
Fix #470
2025-03-10 11:15:17 -07:00
Caio Rocha
bac3c90f6a Improving Get Operation at MSCCLLang (#475)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-05 09:41:38 -08:00
Caio Rocha
6074eeeac9 Adjust NPKit IB Event (#472) 2025-02-28 10:16:47 -08:00
Caio Rocha
b3992a8b29 Adding Read Put Packet operation at Executor (#441) 2025-02-26 09:29:43 -08:00
Caio Rocha
0222bb324d Adjusting AllGather Collective in MSCCLLang (#466)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-25 08:35:26 -08:00
Qinghua Zhou
591276f9d0 Disable channel cache (#463)
Add workaround of disabling channel cache.
Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE
(Default value: False)
In this PR, some other features (e.g., ncclCommSplit) come from branch
binyangli/nccl-api

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-19 19:26:12 +00:00
Binyang Li
7f3b088744 Add multi-nodes example & update doc (#455)
Documentation update:

*
[`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114):
Updated the link to the examples folder to reflect the correct path.

New example script:

*
[`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76):
Added a new example script demonstrating the allgather all-pairs
algorithm across multiple nodes using packet communication.

IR module improvements:

*
[`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472):
Refined the sorting criteria for GPU instance channels and thread block
channels to include the channel type, ensuring a more accurate order.
Debugging enhancements:

*
[`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441):
Added a debug log to indicate the start of communication collective
execution with details about the execution plan and collective.
*
[`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89):
Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for
logging executor-related information.
2025-01-31 17:52:15 -08:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Changho Hwang
4ee15b7ad0 Fix PR #449 (#453) 2025-01-15 11:59:12 -08:00
Changho Hwang
d12247b54a Lazily create streams for CudaIpcConnection (#449) 2025-01-15 11:50:02 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Binyang Li
8ac50dc85d Resolve cuMemMap error (#451)
* Updated `RegisteredMemory::Impl::Impl(const std::vector<char>&
serialization)` to use both minimum and recommended granularities for
memory address reservation and mapping. This will resolve the cuMemMap
error
2025-01-10 14:22:14 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Caio Rocha
80abce59ef Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-01-08 09:02:36 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
6fedb7c0e8 Fix nccl-test failure issue (#421) 2024-12-19 12:07:00 -08:00
Binyang Li
fcb2e46cb1 NVLS support for NCCL API (#410)
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 09:55:35 +00:00
Binyang Li
863a599360 Disable CuMemMap check for ROCm (#411)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-17 08:36:25 +00:00
Binyang Li
ee75caf365 Reduce memory usage for scratch buffer (#403)
In the executor, we allocate the scratch buffer based on `sendMemRange`.
However, for certain execution plans, this allocation may be unsuitable,
as the plan does not support messages of this size.

To avoid allocating to much data and cause OOM error, set scratch buffer
size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan),
scratchBufferSize(sendMemRange))`
2024-12-13 13:00:04 -08:00
Caio Rocha
01fd813f1b Exception Max Number Operation per Tb (#405) 2024-12-11 16:06:15 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Ziyue Yang
f6305a3c1d Add connection events for NPKit (#386) 2024-12-05 00:06:37 +08:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root -x
LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w
10 -n 20
2024-12-03 22:39:20 +00:00
Binyang Li
593478e1b7 Add cross threadblock barrier (#383) 2024-11-26 05:13:30 +00:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support mote datatype for multicast operation
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which return std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Ziyue Yang
3e51e9b359 Fix missing packet parameter for executor (#385) 2024-11-19 08:36:37 +08:00
Binyang Li
1baea89fa0 Fix light load bug (#379)
Fix lightLoadExecutionPlan issue.
An execution context many have multi device execution plans. These plans
share the channel connections which are constructed before.
A deviceExecutionPlanKey is introduced to identify these plans. We can
get the current device execution plan key via:
`contexts.currentDevicePlan`
2024-11-13 07:58:43 +00:00
Caio Rocha
d5d608abdc Fixing Bug Const Offset in Execution Plan (#380)
The offset was not differentiating between the buffer types, causing the
offset to be incorrect when the buffer type was not `SCRATCH`.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-11-11 20:02:02 -08:00
Changho Hwang
85fdde7a73 Lazily create the context stream (#381)
Create the context stream only when needed.
2024-11-11 10:39:32 +08:00
Caio Rocha
c6e06cfad7 Executor AllGather In-Place Support (#365) 2024-10-21 05:45:56 -07:00
Changho Hwang
0c150e5166 Fix copyright messages (#367) 2024-10-17 21:25:46 -07:00
Caio Rocha
08a0cec2eb Fixing RegisterMemory Allocation for ProxyChannels (#353)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-09-24 23:01:41 -07:00
Ziyue Yang
5c4e105814 Fix NPKit exit event offset (#356) 2024-09-19 13:35:44 +08:00