Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.
The bench in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iteration so the timed loop reflects
kernel cost, not torch.empty + caching-allocator overhead.
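The allocate-once pattern can be sketched in plain Python. `low_latency_dispatch` below is a stand-in for `Buffer.low_latency_dispatch` (the real binding takes CUDA tensors and has more parameters); only the out= reuse semantics are the point:

```python
# Stand-in for Buffer.low_latency_dispatch: when out= buffers are supplied,
# write into them instead of allocating fresh ones. Real signature differs.
def low_latency_dispatch(x, *, out_packed_recv_x=None, out_src_info=None,
                         out_layout_range=None, out_count=None):
    recv_x = out_packed_recv_x if out_packed_recv_x is not None else [0.0] * len(x)
    for i, v in enumerate(x):
        recv_x[i] = v
    return recv_x

# Hot loop: the recv-side buffer is allocated once and reused every iteration.
packed_recv_x = [0.0] * 4
for _ in range(3):
    out = low_latency_dispatch([1.0, 2.0, 3.0, 4.0],
                               out_packed_recv_x=packed_recv_x)
    assert out is packed_recv_x  # no per-iteration allocation
```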
Set TORCH_NCCL_ENABLE_MONITORING=0 before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun, but under mpirun, rank 0 (the TCPStore server) can exit
before non-zero ranks finish teardown, and the background heartbeat
thread polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
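The ordering constraint is the whole fix: the variable must be in the environment before torch.distributed is first imported. A minimal sketch:

```python
import os

# Must happen before `import torch.distributed`: ProcessGroupNCCL reads
# TORCH_NCCL_ENABLE_MONITORING when it is set up, so setting it later
# has no effect.
os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

# import torch.distributed as dist  # only after the env var is in place
```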
Aligns with NCCL-EP's ep_bench convention (BW computed from average time
across ranks). Previously we reported only the max time and computed BW
per rank, which made our numbers look more pessimistic than NCCL-EP's.
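The two accounting schemes differ only in which time goes in the denominator; a sketch with illustrative numbers:

```python
# Illustrative per-rank measurements; payload size is made up.
bytes_per_rank = 58e6                  # bytes moved by each rank
times_s = [1.0e-3, 1.2e-3, 1.4e-3]     # per-rank measured time

# Old scheme: max time across ranks -> pessimistic BW.
bw_old = bytes_per_rank / max(times_s) / 1e9     # GB/s

# NCCL-EP ep_bench convention: BW from the average time across ranks.
avg_t = sum(times_s) / len(times_s)
bw_new = bytes_per_rank / avg_t / 1e9            # GB/s

assert bw_new >= bw_old  # average time <= max time, so reported BW rises
```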
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.
Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
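The teardown ordering can be sketched with a stub in place of torch.distributed (FakeDist below is a stand-in that only records call order; the real code calls dist.barrier() and dist.destroy_process_group()):

```python
# Stub standing in for torch.distributed, recording the call sequence.
calls = []

class FakeDist:
    def barrier(self):
        calls.append("barrier")
    def destroy_process_group(self):
        calls.append("destroy")

dist = FakeDist()

def run_bench():
    calls.append("bench")

try:
    run_bench()
finally:
    # All ranks rendezvous before any rank tears down its store connection,
    # so non-zero ranks never poll a TCPStore whose server has exited.
    dist.barrier()
    dist.destroy_process_group()
```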
The LL combine benchmark was cloning the ~58 MB dispatch recv buffer
('recv_x.clone()') on every timed iteration, adding ~20 us of D2D
memcpy per sample and masking kernel-level changes. It also called
torch.empty() for the output inside the loop. Both now live outside
the timed region; the kernel is invoked against a persistent bench_out
and the recv_x produced by the most recent dispatch.
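The hoisting can be sketched in plain Python; `combine_kernel` is a stand-in for the real LL combine launch, which writes into an existing buffer rather than allocating one:

```python
import time

def combine_kernel(src, out):
    # Stand-in for the LL combine launch: writes into a preexisting buffer.
    for i, v in enumerate(src):
        out[i] = v

recv_x = [float(i) for i in range(1024)]   # produced by the last dispatch
bench_out = [0.0] * len(recv_x)            # persistent output, allocated once

# Timed region now contains only the kernel call: no recv_x.clone(),
# no per-iteration torch.empty().
n_iters = 100
t0 = time.perf_counter()
for _ in range(n_iters):
    combine_kernel(recv_x, bench_out)
elapsed = time.perf_counter() - t0
```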
- Report both per-rank and aggregate BW to align with NCCL-EP's ep_bench
(which reports per-rank GB/s).
- Accept MSCCLPP_EP_LL_TOKENS/HIDDEN/TOPK/EXPERTS_PER_RANK env overrides
so we can match external benchmark problem sizes (NCCL-EP LL defaults
are num_tokens=128, hidden=7168, top_k=8).
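A sketch of the override plumbing; the env var names and the first three defaults are from the list above, while the EXPERTS_PER_RANK default here is purely illustrative:

```python
import os

def _env_int(name, default):
    # Env override wins; otherwise fall back to the built-in default.
    return int(os.environ.get(name, default))

num_tokens       = _env_int("MSCCLPP_EP_LL_TOKENS", 128)
hidden           = _env_int("MSCCLPP_EP_LL_HIDDEN", 7168)
top_k            = _env_int("MSCCLPP_EP_LL_TOPK", 8)
experts_per_rank = _env_int("MSCCLPP_EP_LL_EXPERTS_PER_RANK", 32)  # default assumed
```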
Each local expert sends one copy per dispatched token back to its owner,
so the bytes actually on the wire during combine match dispatch. The
previous num_tokens×hidden accounting under-counted the traffic by a
factor of ~num_topk, making combine BW look artificially low next to dispatch.
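The byte accounting, using the NCCL-EP LL default problem size quoted earlier and BF16 elements:

```python
# Each dispatched token is returned once per expert that received it,
# i.e. ~top_k copies per token, so combine wire bytes match dispatch.
num_tokens, hidden, top_k = 128, 7168, 8
bytes_per_elem = 2  # BF16

old_bytes = num_tokens * hidden * bytes_per_elem          # under-counts
new_bytes = num_tokens * top_k * hidden * bytes_per_elem  # matches wire traffic

assert new_bytes == old_bytes * top_k
```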
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).
Applies to intranode HT, internode HT, and the LL test (currently
unreachable on intra-node 8-GPU setups). Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
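The split-phase reporting can be sketched as follows; per-rank times and the payload size are illustrative, not measured:

```python
# Per-rank times for each phase, as if gathered across 4 ranks (illustrative).
dispatch_t = [100e-6, 110e-6, 105e-6, 120e-6]  # seconds
combine_t  = [130e-6, 125e-6, 140e-6, 135e-6]
bytes_per_rank = 58e6

# Per-phase latency: max across ranks (the slowest rank gates the phase).
dispatch_lat = max(dispatch_t)
combine_lat  = max(combine_t)

# Aggregate effective bandwidth: sum of per-rank BW across ranks.
dispatch_bw = sum(bytes_per_rank / t for t in dispatch_t) / 1e9  # GB/s
combine_bw  = sum(bytes_per_rank / t for t in combine_t) / 1e9
```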
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
DeepEP's original filter was safe because its LL path used NVSHMEM; this
port drives LL via PortChannel, so the kernel indexes
port_channel_handles[local_expert*num_ranks + dst_rank] for every
dst_rank. All peers now get a real memory/connection/semaphore/port
channel entry.
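The flattened indexing scheme the kernel relies on can be sketched in Python; the table contents here are fake tuples standing in for real PortChannel handles:

```python
# One handle per (local_expert, dst_rank) pair, flattened row-major as
# [local_expert * num_ranks + dst_rank]. After the sync fix, every slot
# is populated -- including peers with the same GPU id on another node.
num_ranks, num_local_experts = 4, 2
port_channel_handles = [
    (e, r) for e in range(num_local_experts) for r in range(num_ranks)
]

def handle_for(local_expert, dst_rank):
    return port_channel_handles[local_expert * num_ranks + dst_rank]

assert handle_for(1, 3) == (1, 3)  # every (expert, rank) slot resolves
```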
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
functional round-trip, BF16 only). Works cross-node in DeepEP's
1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
intra-node 8-GPU LL currently hangs because every peer transfer routes
through the CPU proxy over IB loopback between distinct HCAs on the same
host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
mis-aligns the 32-bit rdma_recv_count slots when used for same-node
peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
same-node, PortChannel for cross-node) or 64-bit counters.