Add an IB multi-node tutorial (#702)

Changho Hwang
2025-12-11 15:15:58 -08:00
committed by GitHub
parent 51a86630ff
commit da60eb7f46
2 changed files with 123 additions and 29 deletions


@@ -35,7 +35,7 @@ Note that this example is **NOT** a performance benchmark. The performance numbe
## Code Overview
The example code implements a bidirectional data transfer using a `PortChannel` between two GPUs on the same machine. The code is similar to the [Memory Channel](./03-memory-channel.md) tutorial, with the main difference being that the construction of a `PortChannel` is done by a `ProxyService` instance. We need to "add" the pre-built `Semaphore` and `RegisteredMemory` objects to the `ProxyService`, which return `SemaphoreId` and `MemoryId`s, respectively:
The example code implements a bidirectional data transfer using a `PortChannel` between two GPUs. The code is similar to the [Memory Channel](./03-memory-channel.md) tutorial, with the main difference being that the construction of a `PortChannel` is done by a `ProxyService` instance. We need to "add" the pre-built `Semaphore` and `RegisteredMemory` objects to the `ProxyService`, which return `SemaphoreId` and `MemoryId`s, respectively:
```cpp
mscclpp::ProxyService proxyService;
@@ -102,6 +102,74 @@ The device handle methods of `PortChannel` are thread-safe except when the numbe
Advanced users may want to customize the behavior of `ProxyService` to support custom request types or transport mechanisms, which can be done by subclassing `BaseProxyService`. See an example in [class AllGatherProxyService](https://github.com/microsoft/mscclpp/blob/main/test/mscclpp-test/allgather_test.cu#L503), and a bare-bones skeleton below.
```
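For orientation, a minimal skeleton of such a subclass might look like the sketch below. It assumes only that `BaseProxyService` exposes the virtual `startProxy()`/`stopProxy()` interface that the built-in `ProxyService` implements; the class name, include path, and comments are illustrative, and the linked `AllGatherProxyService` is the complete, working reference.
```cpp
#include <mscclpp/port_channel.hpp>  // include path may differ across mscclpp versions

// Hypothetical subclass sketch; the real work (a proxy thread that pops triggers
// from a device FIFO and turns them into transport operations) is only outlined.
class MyProxyService : public mscclpp::BaseProxyService {
 public:
  void startProxy() override {
    // Start the proxy thread that serves the custom request types here.
  }
  void stopProxy() override {
    // Signal the proxy thread to stop and join it here.
  }
};
```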
## Cross-node Execution
This section explains how to run the example code with two GPUs on different nodes using the InfiniBand (or RoCE) transport.
### Running the Example across Nodes
```{note}
Before running the code across nodes, make sure that your environment meets the [prerequisites of GPUDirect RDMA](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#common-prerequisites) and the RDMA networking is properly configured.
```
Run the program on the two nodes with the following command-line arguments:
```
./bidir_port_channel [<ip_port> <rank> <gpu_id> <transport>]
```
For example, assume we use `192.168.0.1:50000` as the bootstrap IP address and port, and both nodes use GPU 0 with the InfiniBand device index 0 (`IB0`).
On Node 0 (Rank 0):
```bash
$ ./bidir_port_channel 192.168.0.1:50000 0 0 IB0
```
On Node 1 (Rank 1):
```bash
$ ./bidir_port_channel 192.168.0.1:50000 1 0 IB0
```
You should see output indicating successful data transfer.
```{tip}
The example code also supports running two instances on the same node. For example:
Terminal 1: `./bidir_port_channel 127.0.0.1:50000 0 0 IB0`
Terminal 2: `./bidir_port_channel 127.0.0.1:50000 1 1 IB1`
```
```{tip}
If your bootstrap IP address is not on the default network interface of your node, you can specify the network interface by passing `interface_name:ip:port` as the first argument (such as `eth1:192.168.0.1:50000`).
```
### What's Happening in Terms of InfiniBand?
When we use InfiniBand transport, each `Endpoint` holds a unique InfiniBand context, and each `Connection` holds a unique InfiniBand queue pair (QP). Therefore, multiple `Semaphore`s and `PortChannel`s will share the same QP if they are created from the same `Connection`. If you want multiple QPs between two endpoints, you need to create multiple parallel `Connection`s, and then create `Semaphore`s and `PortChannel`s from different `Connection`s.
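As a rough illustration (not part of the example code), the sketch below opens two parallel connections to the same peer so that channels built from them land on distinct QPs. It reuses the `comm.connect(epConfig, remoteRank)` call shown later in this tutorial and assumes both ranks issue the calls in the same order so the connections pair up; depending on the API version, a distinguishing tag argument may also be needed.
```cpp
// Hedged sketch: two parallel Connections to the same remote rank, each backed by its own QP.
auto connA = comm.connect(epConfig, remoteRank).get();
auto connB = comm.connect(epConfig, remoteRank).get();
// Semaphores/PortChannels created from connA share one QP; those created from connB share another.
```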
The `PortChannel` methods behave as follows in terms of InfiniBand operations (a device-side usage sketch follows this list):
- `put()`: Posts an RDMA Write operation to the QP to transfer data.
- `signal()`: Asynchronously triggers a PCIe flush on the remote side (e.g., by an RDMA atomic operation) to ensure all previous RDMA Writes are visible to the remote GPU.
- `wait()`: Polls the completion queue (CQ) of the QP until the corresponding signal is received.
- `poll()`: Non-blocking version of `wait()`, checks the CQ for the signal.
- `flush()`: Ensures the CQ is drained and all previous operations are completed.
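To make the sequence concrete, here is a minimal device-side sketch rather than the tutorial's actual kernel: a single thread performs one put/signal exchange with the peer (which runs the same kernel). The kernel name is illustrative, and `chan` is assumed to be a `PortChannel` device handle copied to the GPU beforehand.
```cpp
// Hedged sketch of the put/signal/wait/flush sequence on one GPU thread.
__global__ void pingPongKernel(mscclpp::DeviceHandle<mscclpp::PortChannel> chan, uint64_t size) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    chan.put(0 /*dstOffset*/, 0 /*srcOffset*/, size);  // enqueue an RDMA Write via the proxy
    chan.signal();                                      // then post the remote-visibility signal
    chan.wait();                                        // block until the peer's signal arrives
    chan.flush();                                       // drain this rank's outstanding requests
  }
}
```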
For simplicity, the example code does not pass InfiniBand-specific parameters in the endpoint configuration; they can be set as in the following example:
```cpp
mscclpp::EndpointConfig epConfig;
epConfig.transport = mscclpp::Transport::IB0;
epConfig.device = {mscclpp::DeviceType::GPU, 0}; // GPU 0
// InfiniBand-specific parameters
epConfig.ib.maxCqSize = 8192;
epConfig.ib.maxCqPollNum = 4;
// Create an endpoint and establish a connection
auto conn = comm.connect(epConfig, remoteRank).get();
```
See all available InfiniBand-specific parameters in {cpp:struct}`mscclpp::EndpointConfig::Ib`.
## Summary and Next Steps
In this tutorial, we learned how to use `PortChannel` for bidirectional data transfer between two GPUs using a `ProxyService`.