internode_ncclep.cuh: add EP_NCCLEP_DRAIN_NOOP compile gate (default 0,
inert). When set to 1, the NVL receiver keeps all control flow but skips
the data copies, which measures the upper bound on dispatch time saved by
eliminating the cross-GPU receiver drain. Probe result (4-node
{38,41,59,75}): dispatch inc2 1124us -> DRAIN_NOOP 1048us (-6.8%),
agg_bw 836->896 GB/s. This confirms real headroom for the cross-GPU
peer-map direct-write rework (ceiling ~-16.6% cumulative vs baseline).
test_internode_multirank.py: gate the dispatch range-assert and the
combine assert behind the MSCCLPP_EP_SKIP_VERIFY env var so dispatch
timing can be measured when recv_x is intentionally incomplete (perf
probing).
- src/core/atomicadd_kernel.cu: restore the legacy 3-arg
  cuCtxCreate(&proxyAtomicCtx_, 0, cuDevice) in the
  '#else' branch of the CUDA_VERSION >= 12050 guard. A prior
  edit had corrupted it to 'cuCtxCreate(&proxyAtomicCtx_vice)',
  which broke the CUDA 11.8 build (CodeQL CUDA cuda11.8 and
  MSCCLPPLang cuda11.8 jobs).
- Apply clang-format to src/ext/ep/* (no logic changes,
  fixes the cpplint CI job).
- Apply black to test/python/ext/ep/test_internode_multirank.py
  and test_intranode_multirank.py (no logic changes, fixes
  the pylint CI job).
The NVLS HT B2 path introduced in 3ab2e43b activated whenever
isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE
that is true (H100 has intra-node NVLink multicast), but there is no
cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back
to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR whose handle exchange routes
through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain
socket that worker ranks cannot reach. Symptom on every commit since
3ab2e43b:
    RuntimeError: connect() failed for unix socket to
    /tmp/mscclpp_bootstrap_<pid>.sock
In addition, MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored.
src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper.
Resolution:
  1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off,
     1/on/true/yes/force => on, otherwise auto).
  2. Auto-detect: requires both
     - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1
     - device compute capability >= sm_100 (Blackwell+).
Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled
(HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back
to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled.
Diagnostic prints now show fabric_ipc=.
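The resolution order above can be sketched as follows. This is an illustrative Python model, not the C++ in buffer.cc: the two device properties are passed in as plain arguments (standing in for the CUDA attribute queries), and case-insensitive matching of the env value is an assumption.

```python
import os

# Accepted spellings, taken from the resolution rules above.
_OFF = {"0", "off", "false", "no"}
_ON = {"1", "on", "true", "yes", "force"}


def resolve_fabric_ipc_supported(handle_type_fabric_supported, cc_major, env=None):
    """Model of the env-override-then-auto-detect decision.

    handle_type_fabric_supported: stand-in for the
        CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED query result.
    cc_major: device compute capability major version (10 for sm_100).
    env: mapping used instead of os.environ (hypothetical, for testing).
    """
    env = os.environ if env is None else env
    # Assumption: the value is trimmed and compared case-insensitively.
    raw = env.get("MSCCLPP_EP_FABRIC_IPC", "").strip().lower()
    if raw in _OFF:
        return False
    if raw in _ON:
        return True
    # Anything else (unset or unrecognised) means auto-detect:
    # both conditions must hold.
    return bool(handle_type_fabric_supported) and cc_major >= 10
```

In this model an sm_90 device resolves to off in auto mode regardless of the attribute value, and an explicit MSCCLPP_EP_FABRIC_IPC=0 always wins.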
test/python/ext/ep/test_internode_multirank.py: replace hardcoded
NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper
that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE /
OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count().
Makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node)
without code changes.
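A minimal sketch of such a helper, assuming the three variables are consulted in the order listed and that empty values are treated as unset. The `device_count` callable is a hypothetical injection point so the sketch runs without torch; in the test it would be torch.cuda.device_count.

```python
import os

# Consulted in this order (assumed to be the priority order).
_LOCAL_SIZE_VARS = (
    "MSCCLPP_EP_LOCAL_WORLD_SIZE",
    "LOCAL_WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_SIZE",
)


def detect_local_world_size(device_count, env=None):
    """Return the number of ranks per node.

    device_count: zero-argument callable used as the fallback,
        e.g. torch.cuda.device_count.
    env: mapping used instead of os.environ (hypothetical, for testing).
    """
    env = os.environ if env is None else env
    for name in _LOCAL_SIZE_VARS:
        value = env.get(name)
        if value:  # skip unset and empty values
            return int(value)
    return device_count()
```

Usage would look like `detect_local_world_size(torch.cuda.device_count)` at the top of the test.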
src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the
underlying symbol was renamed); preserve legacy 3-arg cuCtxCreate for
older toolkits.
Verified on 2x H100 NDv5 at HEAD:
  LL intranode (8 GPUs)            PASS
  LL internode (16 GPUs, 2 nodes)  PASS
  HT intranode (8 GPUs)            PASS
  HT internode (16 GPUs, 2 nodes)  PASS
Diagnostic on H100:
  [mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0)
  [mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)
At 16 nodes (64 ranks) with topk=8, expected combine values reach
rank*8 = 504, while intermediate partial sums (rank*7 etc.) cross the
bf16 ulp=2 boundary at 256. With the test pattern x = rank*ones and
weights = 1, this produces deterministic +/-1 round-off on certain
ranks (odd local_rank on nodes >= 9), tripping the previous 1e-2
absolute tolerance even though the kernel is correct.
Use tol = max(1e-2, max_exp / 64) which matches the bf16 mantissa
precision and scales with the magnitude of the expected combined
output. The previous tight bound is preserved for small-scale runs
where max_exp < 0.64.
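To illustrate the arithmetic, the sketch below pairs the tolerance formula with a software model of bf16 rounding. `to_bf16` is a hypothetical helper (round-to-nearest-even on the top 16 bits of a float32, finite inputs only); it only shows the spacing of representable values above 256, not the kernel's actual accumulation order.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, including the tie-breaking bit taken from
    # the lowest bit that survives truncation.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # keep sign, exponent and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def combine_tolerance(max_exp):
    """Absolute tolerance from the commit: max(1e-2, max_exp / 64)."""
    return max(1e-2, max_exp / 64)


# Odd integers in [256, 512) are not representable in bf16 and move by 1.
for value in (257.0, 259.0, 441.0):
    print(value, "->", to_bf16(value))

# Small-scale runs keep the old bound; the 64-rank case gets 504 / 64.
print(combine_tolerance(0.5), combine_tolerance(504.0))
```

With max_exp = 504 the tolerance is 7.875, comfortably above the +/-1 round-off described above, while any run with max_exp < 0.64 still uses 1e-2.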
- Internode HT test: accept MSCCLPP_EP_HT_{TOKENS,HIDDEN,TOPK,EXPERTS}
  env vars to override the functional-check problem size (was hardcoded
  to num_tokens=128, hidden=1024, num_topk=min(4,num_ranks),
  num_experts=num_ranks*4).
- Both intranode + internode HT tests: replace dist.all_to_all_single
  bookkeeping (per-(src,dst) recv-count matrix used for the
  six-metric NVL/RDMA BW breakdown) with dist.all_gather_into_tensor
  + transpose. Functionally identical (gathered[:, rank] gives the
  same recv-from-src column) but works on socket-NCCL with
  NCCL_IB_DISABLE=1, which is required on rigs where NCCL IB cannot
  coexist with mscclpp RDMA. Each rank now receives num_ranks^2 int64
  values instead of num_ranks, which is negligible (64 ints at 8 ranks).
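The equivalence of the two bookkeeping schemes can be checked without a process group. The sketch below simulates both collectives on a plain nested list, where `send_counts[src][dst]` is the number of tokens src sends to dst; the function names are illustrative, not the test's.

```python
def recv_counts_via_all_to_all(send_counts, rank):
    """What all_to_all_single delivers to `rank`: one count from each src."""
    return [send_counts[src][rank] for src in range(len(send_counts))]


def recv_counts_via_all_gather(send_counts, rank):
    """Gather every rank's full send row, then take this rank's column."""
    # After the gather, every rank holds the whole matrix: row = src.
    gathered = [list(row) for row in send_counts]
    # Transposing turns column `rank` into row `rank`.
    transposed = [list(col) for col in zip(*gathered)]
    return transposed[rank]


send_counts = [
    [0, 3, 1, 4],
    [2, 0, 5, 1],
    [7, 2, 0, 6],
    [1, 8, 3, 0],
]
for rank in range(4):
    assert recv_counts_via_all_to_all(send_counts, rank) == \
        recv_counts_via_all_gather(send_counts, rank)
```

The gather moves the whole matrix to every rank, which is where the num_ranks^2 cost comes from.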
Allow tuning the internode HT test cfg from the environment without
editing the source. Supported variables (all optional):
  MSCCLPP_EP_NSM        (default 152)   num channels / SMs
  MSCCLPP_EP_NVL_SEND   (default 8)
  MSCCLPP_EP_NVL_RECV   (default 256)
  MSCCLPP_EP_RDMA_SEND  (default 16)
  MSCCLPP_EP_RDMA_RECV  (default 128)
The defaults match what we use for 16-node GB200 bench runs (e.g.
NVL_RECV=512 to satisfy the HT combine assert at 16 nodes).
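A minimal sketch of how such overrides might be read, assuming each variable is parsed as a plain integer. `read_ht_cfg` and its `env` parameter are hypothetical names used for illustration; only the variable names and defaults come from the list above.

```python
import os

# Variable name -> default, as listed above.
_HT_CFG_DEFAULTS = {
    "MSCCLPP_EP_NSM": 152,
    "MSCCLPP_EP_NVL_SEND": 8,
    "MSCCLPP_EP_NVL_RECV": 256,
    "MSCCLPP_EP_RDMA_SEND": 16,
    "MSCCLPP_EP_RDMA_RECV": 128,
}


def read_ht_cfg(env=None):
    """Return the HT test cfg, with env overrides applied on top of defaults."""
    env = os.environ if env is None else env
    # int() accepts both the integer default and a string override.
    return {name: int(env.get(name, default))
            for name, default in _HT_CFG_DEFAULTS.items()}


# Example: raise only the NVL receive buffer size.
cfg = read_ht_cfg({"MSCCLPP_EP_NVL_RECV": "512"})
print(cfg["MSCCLPP_EP_NVL_RECV"], cfg["MSCCLPP_EP_NSM"])
```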
Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py`
(black) over the EP extension files added by this PR. No functional
changes; pure reformatting to satisfy the cpplint and pylint CI jobs.
Aligns with NCCL-EP's ep_bench convention (BW computed from average time
across ranks). Previously we reported only the max time and computed BW
per-rank, which made our numbers more pessimistic than NCCL-EP's.
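The effect of the accounting change can be seen in a small worked example. The function below is a sketch under assumptions: per-rank times are already gathered into a list, every rank moves the same number of bytes, and the field names are invented for illustration.

```python
def bench_summary(times_us, bytes_per_rank):
    """Summarise one bench phase from per-rank times in microseconds."""
    avg_us = sum(times_us) / len(times_us)
    max_us = max(times_us)
    # bytes per microsecond equals 1e-3 GB/s, hence the division by 1e3.
    return {
        "avg_us": avg_us,
        "max_us": max_us,
        "bw_avg_gbps": bytes_per_rank / avg_us / 1e3,  # average-time convention
        "bw_max_gbps": bytes_per_rank / max_us / 1e3,  # slowest-rank view
    }


# Two ranks, one twice as slow as the other, 1 MB each.
summary = bench_summary([100.0, 200.0], 1_000_000)
print(summary)
```

Because the average time is never larger than the maximum, the average-based bandwidth is never lower than the slowest-rank figure, which is why the earlier numbers looked more pessimistic.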
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.
Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
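The teardown pattern can be sketched as below. The `dist` argument is expected to be the torch.distributed module; it is passed in here only so the sketch can be exercised with a stand-in object. The `is_initialized()` guard is an addition for safety in the sketch and is not stated in the commit.

```python
def run_with_clean_teardown(dist, body):
    """Run `body`, then barrier and destroy the process group.

    dist: torch.distributed (or an object with the same three methods).
    body: zero-argument callable containing the test logic.
    """
    try:
        return body()
    finally:
        # Keep every rank alive until all ranks are done, so no rank
        # polls the store after rank 0 has exited.
        if dist.is_initialized():
            dist.barrier()
            dist.destroy_process_group()
```

In the test this would wrap the main body, for example `run_with_clean_teardown(torch.distributed, main)`. Note that a barrier in a finally block can itself hang if one rank fails before the others reach it.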
Same change as the intra-node bench (commit 4ed6f229), applied to the
cross-node test:
- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
  match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8).
- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
  matching NCCL-EP's `RDMA_send` per-rank byte count.
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).
Applies to intranode HT, internode HT, and the (currently unreachable on
intra-node 8-GPU) LL test. Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports
per-iter latency (max across ranks, CUDA-event timed) and aggregate
effective bandwidth (sum across ranks, dispatch+combine payload bytes).
Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN.
Bench reuses the Buffer allocated for the correctness phase and
self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.
Two issues prevented internode HT combine from completing on 2x8 H100:
1. Wrong prefix matrices passed to internode_combine. Combine runs in the
   reverse direction of dispatch, so it must consume the receiver-side
   matrices returned by dispatch (recv_rdma_channel_prefix_matrix,
   recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the
   sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
   This matches DeepEP's deep_ep/buffer.py::internode_combine handle
   unpacking. Without the fix the NVL forwarder's 'NVL check' timed out
   because token_start_idx/token_end_idx were computed against the wrong
   per-channel layout.
2. Cross-rank race between dispatch and combine. Even with the correct
   matrices, launching combine immediately after dispatch deadlocked the
   forwarder NVL check (tail stuck one short of expected_head) because
   peers still had in-flight dispatch proxy traffic while fast ranks had
   already started combine. A torch.cuda.synchronize() + dist.barrier()
   between the two calls makes the test pass deterministically on 16
   ranks (combine diff == 0, max|expected| up to 60.0).
The barrier in the test is a workaround; the real fix belongs in
Buffer::internode_dispatch / Buffer::internode_combine so the
dispatch->combine handoff fully fences outstanding proxy work across
ranks. Marked with an XXX comment in the test.
The `internode` kernels index device-side port channel handles as
`port_channel_handles[channel_id * num_ranks + peer_rank]`, where
`peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was
building that table by iterating `std::unordered_map<int, MemoryId>`
(and similarly for connections/semaphores), which yields hash order
rather than ascending rank order. Once the cross-node fan-out grew
beyond a single peer, a local rank's trigger for peer `r` landed on
the semaphore/memory pair of a different peer, so RDMA puts and
atomic tail updates went to the wrong destination and the forwarder
spun on a tail counter that never advanced.
Changes:
- Build `sema_ids` and `port_channel_handles` by iterating
  `for (int r = 0; r < num_ranks; ++r)` and looking up the
  connection / memory id for rank `r`, skipping ranks excluded by
  low-latency mode (inserting a placeholder handle so the stride
  stays `num_ranks`).
- Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with
  `kRdmaTag = 1` so they do not collide with NVL-phase tag-0
  traffic between the same pair of ranks.
- Drop an unused `r` local in the NVL setup loop.
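The indexing invariant behind the first change can be modelled in a few lines. This is a Python illustration, not the C++ in `Buffer::sync`: `memory_id_by_rank` stands in for the `std::unordered_map<int, MemoryId>`, and the tuple entries stand in for real channel handles.

```python
def build_handle_table(num_channels, num_ranks, memory_id_by_rank, placeholder=None):
    """Build a flat table indexed as channel_id * num_ranks + peer_rank."""
    table = []
    for channel_id in range(num_channels):
        # Walk ranks in ascending order instead of iterating the map,
        # so the slot position always equals the peer's global rank.
        for r in range(num_ranks):
            if r in memory_id_by_rank:
                table.append((channel_id, memory_id_by_rank[r]))
            else:
                # Excluded rank: keep the slot so the stride stays num_ranks.
                table.append(placeholder)
    return table


# The map's own iteration order is deliberately scrambled; rank 1 is excluded.
memory_ids = {3: "mem3", 0: "mem0", 2: "mem2"}
table = build_handle_table(num_channels=2, num_ranks=4, memory_id_by_rank=memory_ids)
assert table[1 * 4 + 3] == (1, "mem3")
assert table[0 * 4 + 1] is None
```

Python dicts iterate in insertion order rather than hash order, but the point carries over: the table layout must be driven by the rank loop, never by the container's iteration order.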
With this fix and a matched `libmscclpp.so` on both nodes, the
2-node x 8-GPU internode HT dispatch path completes successfully
(`[dispatch] OK`). Combine is still under investigation.
Also adds `test/python/ext/ep/test_internode_multirank.py`, a
torchrun-based 2-node functional test that exercises
`get_dispatch_layout` -> `internode_dispatch` -> `internode_combine`
and validates per-source-rank token values end-to-end.