Add an IB multi-node tutorial (#702)

Changho Hwang
2025-12-11 15:15:58 -08:00
committed by GitHub
parent 51a86630ff
commit da60eb7f46
2 changed files with 123 additions and 29 deletions


@@ -35,7 +35,7 @@ Note that this example is **NOT** a performance benchmark. The performance numbe
## Code Overview
The example code implements a bidirectional data transfer using a `PortChannel` between two GPUs on the same machine. The code is similar to the [Memory Channel](./03-memory-channel.md) tutorial, with the main difference being that the construction of a `PortChannel` is done by a `ProxyService` instance. We need to "add" the pre-built `Semaphore` and `RegisteredMemory` objects to the `ProxyService`, which return `SemaphoreId` and `MemoryId`s, respectively:
The example code implements a bidirectional data transfer using a `PortChannel` between two GPUs. The code is similar to the [Memory Channel](./03-memory-channel.md) tutorial, with the main difference being that the construction of a `PortChannel` is done by a `ProxyService` instance. We need to "add" the pre-built `Semaphore` and `RegisteredMemory` objects to the `ProxyService`, which return `SemaphoreId` and `MemoryId`s, respectively:
```cpp
mscclpp::ProxyService proxyService;
@@ -102,6 +102,74 @@ The device handle methods of `PortChannel` are thread-safe except when the numbe
Advanced users may want to customize the behavior of `ProxyService` to support custom request types or transport mechanisms, which can be done by subclassing `BaseProxyService`. See an example in [class AllGatherProxyService](https://github.com/microsoft/mscclpp/blob/main/test/mscclpp-test/allgather_test.cu#L503), and a bare-bones skeleton below.
```
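For orientation, a minimal skeleton of such a subclass might look like the sketch below. It assumes only that `BaseProxyService` exposes the virtual `startProxy()`/`stopProxy()` interface that the built-in `ProxyService` implements; the class name, include path, and comments are illustrative, and the linked `AllGatherProxyService` is the complete, working reference.
```cpp
#include <mscclpp/port_channel.hpp>  // include path may differ across mscclpp versions

// Hypothetical subclass sketch; the real work (a proxy thread that pops triggers
// from a device FIFO and turns them into transport operations) is only outlined.
class MyProxyService : public mscclpp::BaseProxyService {
 public:
  void startProxy() override {
    // Start the proxy thread that serves the custom request types here.
  }
  void stopProxy() override {
    // Signal the proxy thread to stop and join it here.
  }
};
```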
## Cross-node Execution
This section explains how to run the example code with two GPUs on different nodes using the InfiniBand (or RoCE) transport.
### Running the Example across Nodes
```{note}
Before running the code across nodes, make sure that your environment meets the [prerequisites of GPUDirect RDMA](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#common-prerequisites) and the RDMA networking is properly configured.
```
Run the program on the two nodes with the following command-line arguments:
```
./bidir_port_channel [<ip_port> <rank> <gpu_id> <transport>]
```
For example, assume we use `192.168.0.1:50000` as the bootstrap IP address and port, and both nodes use GPU 0 with the InfiniBand device index 0 (`IB0`).
On Node 0 (Rank 0):
```bash
$ ./bidir_port_channel 192.168.0.1:50000 0 0 IB0
```
On Node 1 (Rank 1):
```bash
$ ./bidir_port_channel 192.168.0.1:50000 1 0 IB0
```
You should see output indicating successful data transfer.
```{tip}
The example code also supports running two instances on the same node. For example:
Terminal 1: `./bidir_port_channel 127.0.0.1:50000 0 0 IB0`
Terminal 2: `./bidir_port_channel 127.0.0.1:50000 1 1 IB1`
```
```{tip}
If your bootstrap IP address is not on the default network interface of your node, you can specify the network interface by passing `interface_name:ip:port` as the first argument (such as `eth1:192.168.0.1:50000`).
```
### What's Happening in Terms of InfiniBand?
When we use InfiniBand transport, each `Endpoint` holds a unique InfiniBand context, and each `Connection` holds a unique InfiniBand queue pair (QP). Therefore, multiple `Semaphore`s and `PortChannel`s will share the same QP if they are created from the same `Connection`. If you want multiple QPs between two endpoints, you need to create multiple parallel `Connection`s, and then create `Semaphore`s and `PortChannel`s from different `Connection`s.
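As a rough illustration (not part of the example code), the sketch below opens two parallel connections to the same peer so that channels built from them land on distinct QPs. It reuses the `comm.connect(epConfig, remoteRank)` call shown later in this tutorial and assumes both ranks issue the calls in the same order so the connections pair up; depending on the API version, a distinguishing tag argument may also be needed.
```cpp
// Hedged sketch: two parallel Connections to the same remote rank, each backed by its own QP.
auto connA = comm.connect(epConfig, remoteRank).get();
auto connB = comm.connect(epConfig, remoteRank).get();
// Semaphores/PortChannels created from connA share one QP; those created from connB share another.
```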
The `PortChannel` methods behave as follows in terms of InfiniBand operations (a device-side usage sketch follows this list):
- `put()`: Posts an RDMA Write operation to the QP to transfer data.
- `signal()`: Asynchronously triggers a PCIe flush on the remote side (e.g., by an RDMA atomic operation) to ensure all previous RDMA Writes are visible to the remote GPU.
- `wait()`: Polls the completion queue (CQ) of the QP until the corresponding signal is received.
- `poll()`: Non-blocking version of `wait()`, checks the CQ for the signal.
- `flush()`: Ensures the CQ is drained and all previous operations are completed.
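To make the sequence concrete, here is a minimal device-side sketch rather than the tutorial's actual kernel: a single thread performs one put/signal exchange with the peer (which runs the same kernel). The kernel name is illustrative, and `chan` is assumed to be a `PortChannel` device handle copied to the GPU beforehand.
```cpp
// Hedged sketch of the put/signal/wait/flush sequence on one GPU thread.
__global__ void pingPongKernel(mscclpp::DeviceHandle<mscclpp::PortChannel> chan, uint64_t size) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    chan.put(0 /*dstOffset*/, 0 /*srcOffset*/, size);  // enqueue an RDMA Write via the proxy
    chan.signal();                                      // then post the remote-visibility signal
    chan.wait();                                        // block until the peer's signal arrives
    chan.flush();                                       // drain this rank's outstanding requests
  }
}
```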
For simplicity, the example code does not pass InfiniBand-specific parameters in the endpoint configuration; they can be set as in the following example:
```cpp
mscclpp::EndpointConfig epConfig;
epConfig.transport = mscclpp::Transport::IB0;
epConfig.device = {mscclpp::DeviceType::GPU, 0}; // GPU 0
// InfiniBand-specific parameters
epConfig.ib.maxCqSize = 8192;
epConfig.ib.maxCqPollNum = 4;
// Create an endpoint and establish a connection
auto conn = comm.connect(epConfig, remoteRank).get();
```
See all available InfiniBand-specific parameters in {cpp:struct}`mscclpp::EndpointConfig::Ib`.
## Summary and Next Steps
In this tutorial, we learned how to use `PortChannel` for bidirectional data transfer between two GPUs using a `ProxyService`.