Qinghua Zhou a6af3a4454 ext/ep: fix multi-rank intranode dispatch+combine
Three issues blocked end-to-end intranode validation across multiple
ranks. This commit fixes them and adds a 2/4/8-rank functional test.

1. Combine receiver: OOB __shared__ read

   In the combine receiver warp, the wait loop evaluated
   `channel_tail_idx[recv_lane_id] <= expected_head` before the
   `expected_head >= 0` guard. `channel_tail_idx` is a shared array
   of size `kNumRanks`, but the loop runs on all 32 lanes of a warp,
   so lanes with `recv_lane_id >= kNumRanks` indexed out of bounds.
   compute-sanitizer reported "Invalid __shared__ read of size 4
   bytes" at combine<bf16,2,768>+0xdd0, surfaced asynchronously as
   cudaErrorIllegalAddress at the kernel launch site. Swap the
   operands so the rank-bounds check short-circuits the shared read.
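The operand swap can be illustrated with a small Python simulation of the wait-loop condition (names and the assumption that inactive lanes carry `expected_head = -1` are taken from the description above; this is a sketch, not the kernel code). The shared array has only `kNumRanks` slots, but all 32 lanes evaluate the condition, so the read must come after the guard:

```python
K_NUM_RANKS = 2   # size of the __shared__ channel_tail_idx array
WARP_SIZE = 32    # the wait loop runs on every lane of the warp

channel_tail_idx = [0] * K_NUM_RANKS

def still_waiting_buggy(expected_head, recv_lane_id):
    # Original operand order: the shared read happens before the
    # rank-bounds guard, so lanes with recv_lane_id >= kNumRanks index
    # out of bounds (IndexError here; "Invalid __shared__ read of size
    # 4 bytes" on the GPU).
    return channel_tail_idx[recv_lane_id] <= expected_head and expected_head >= 0

def still_waiting_fixed(expected_head, recv_lane_id):
    # Fixed order: `expected_head >= 0` short-circuits for inactive
    # lanes (expected_head < 0), so they never touch the shared array.
    return expected_head >= 0 and channel_tail_idx[recv_lane_id] <= expected_head

# Active lane still waits until the tail advances past the head:
assert still_waiting_fixed(0, 0) is True
# Inactive lane (recv_lane_id >= kNumRanks) exits without reading:
assert still_waiting_fixed(-1, WARP_SIZE - 1) is False
```

With `and`'s short-circuit semantics mirroring C++'s `&&`, the fixed ordering makes the bounds guard dominate, which is exactly why swapping the operands is sufficient.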

2. Python bindings: UniqueId ABI

   `mscclpp::UniqueId` is a `std::array<uint8_t, N>` which pybind11
   auto-converts to a Python `list`, silently overriding any
   `py::class_<UniqueId>` wrapper. Expose `create_unique_id` /
   `connect` as lambdas that produce/consume `py::bytes` and memcpy
   into a local `UniqueId`. Also coerce `bytes` -> `bytearray` at the
   Python call site for `sync()`, whose bound signature expects
   `pybind11::bytearray`.
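The call-site coercion can be sketched as a small Python helper (the helper name is illustrative, not part of the mscclpp API; the point is that pybind11 will not implicitly convert a Python `bytes` to a parameter typed `pybind11::bytearray`):

```python
def coerce_unique_id(root_unique_id):
    """Prepare a root unique id for a binding whose parameter type is
    pybind11::bytearray. A Python `bytes` object is copied into a
    mutable `bytearray`; anything already mutable passes through.
    (Hypothetical helper for illustration.)"""
    if isinstance(root_unique_id, bytes):
        return bytearray(root_unique_id)
    return root_unique_id

uid = b"\x01" * 16            # stand-in for a serialized UniqueId
prepared = coerce_unique_id(uid)
assert isinstance(prepared, bytearray)
```

The copy is cheap relative to connection setup, and keeping the coercion at the Python call site avoids changing the C++ signature.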

3. Python frontend: communicator required for NVL-only sync

   `Buffer::sync()` uses `communicator->connect(ipc_config, ...)` on
   the pure-NVLink path, so the communicator must be initialized
   even when `num_rdma_ranks == 1` and `low_latency_mode == False`.
   Always broadcast the unique id and call `runtime.connect()`
   before `sync()`.
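The ordering fix can be sketched with stand-in objects (the `Runtime` class and method names below are illustrative, not the real mscclpp API; the point is that every rank connects unconditionally before any `sync()`):

```python
import os

class Runtime:
    """Minimal stand-in for a per-rank runtime (illustrative only)."""
    def __init__(self):
        self.connected = False
        self.unique_id = None

    def create_unique_id(self):
        # Stand-in for create_unique_id(); a real id is opaque bytes.
        self.unique_id = os.urandom(128)
        return self.unique_id

    def connect(self, unique_id, rank, num_ranks):
        self.unique_id = unique_id
        self.connected = True

def init_all_ranks(runtimes):
    # Rank 0 creates the id; it is broadcast to every rank, and every
    # rank connects regardless of num_rdma_ranks or low_latency_mode,
    # because Buffer.sync() needs communicator->connect() even on the
    # pure-NVLink path.
    uid = runtimes[0].create_unique_id()
    for rank, rt in enumerate(runtimes):
        rt.connect(uid, rank, len(runtimes))

ranks = [Runtime() for _ in range(4)]
init_all_ranks(ranks)
assert all(rt.connected for rt in ranks)
```

In the real multi-process frontend the "broadcast" would go over `torch.distributed` rather than a shared list, but the invariant is the same: connect first, sync after.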

Validation on a single H100x8 node via torchrun:
- 2 ranks: dispatch 195 tokens, combine diff=0
- 4 ranks: dispatch 371 tokens, combine diff=0
- 8 ranks: dispatch 456 tokens, combine diff=0

Test harness added at test/python/ext/ep/test_intranode_multirank.py.
2026-04-21 02:03:55 +00:00