Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.
The bench in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iteration so the timed loop reflects
kernel cost, not torch.empty + caching-allocator overhead.
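The allocate-once pattern can be sketched in plain Python. `low_latency_dispatch` below is a stand-in for `Buffer.low_latency_dispatch` (the real binding takes CUDA tensors and has more parameters); only the out= reuse semantics are the point:

```python
# Stand-in for Buffer.low_latency_dispatch: when out= buffers are supplied,
# write into them instead of allocating fresh ones. Real signature differs.
def low_latency_dispatch(x, *, out_packed_recv_x=None, out_src_info=None,
                         out_layout_range=None, out_count=None):
    recv_x = out_packed_recv_x if out_packed_recv_x is not None else [0.0] * len(x)
    for i, v in enumerate(x):
        recv_x[i] = v
    return recv_x

# Hot loop: the recv-side buffer is allocated once and reused every iteration.
packed_recv_x = [0.0] * 4
for _ in range(3):
    out = low_latency_dispatch([1.0, 2.0, 3.0, 4.0],
                               out_packed_recv_x=packed_recv_x)
    assert out is packed_recv_x  # no per-iteration allocation
```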
Set TORCH_NCCL_ENABLE_MONITORING=0 before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun, but under mpirun, rank 0 (the TCPStore server) can exit
before non-zero ranks finish teardown, and the background heartbeat
thread polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
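The ordering constraint is the whole fix: the variable must be in the environment before torch.distributed is first imported. A minimal sketch:

```python
import os

# Must happen before `import torch.distributed`: ProcessGroupNCCL reads
# TORCH_NCCL_ENABLE_MONITORING when it is set up, so setting it later
# has no effect.
os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

# import torch.distributed as dist  # only after the env var is in place
```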
Aligns with NCCL-EP's ep_bench convention (BW computed from average time
across ranks). Previously we reported only the max time and computed BW
per rank, which made our numbers look more pessimistic than NCCL-EP's.
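The two accounting schemes differ only in which time goes in the denominator; a sketch with illustrative numbers:

```python
# Illustrative per-rank measurements; payload size is made up.
bytes_per_rank = 58e6                  # bytes moved by each rank
times_s = [1.0e-3, 1.2e-3, 1.4e-3]     # per-rank measured time

# Old scheme: max time across ranks -> pessimistic BW.
bw_old = bytes_per_rank / max(times_s) / 1e9     # GB/s

# NCCL-EP ep_bench convention: BW from the average time across ranks.
avg_t = sum(times_s) / len(times_s)
bw_new = bytes_per_rank / avg_t / 1e9            # GB/s

assert bw_new >= bw_old  # average time <= max time, so reported BW rises
```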
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.
Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
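The teardown ordering can be sketched with a stub in place of torch.distributed (FakeDist below is a stand-in that only records call order; the real code calls dist.barrier() and dist.destroy_process_group()):

```python
# Stub standing in for torch.distributed, recording the call sequence.
calls = []

class FakeDist:
    def barrier(self):
        calls.append("barrier")
    def destroy_process_group(self):
        calls.append("destroy")

dist = FakeDist()

def run_bench():
    calls.append("bench")

try:
    run_bench()
finally:
    # All ranks rendezvous before any rank tears down its store connection,
    # so non-zero ranks never poll a TCPStore whose server has exited.
    dist.barrier()
    dist.destroy_process_group()
```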
The LL combine benchmark was cloning the ~58 MB dispatch recv buffer
('recv_x.clone()') on every timed iteration, adding ~20 us of D2D
memcpy per sample and masking kernel-level changes. It also called
torch.empty() for the output inside the loop. Both now live outside
the timed region; the kernel is invoked against a persistent bench_out
and the recv_x produced by the most recent dispatch.
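The hoisting can be sketched in plain Python; `combine_kernel` is a stand-in for the real LL combine launch, which writes into an existing buffer rather than allocating one:

```python
import time

def combine_kernel(src, out):
    # Stand-in for the LL combine launch: writes into a preexisting buffer.
    for i, v in enumerate(src):
        out[i] = v

recv_x = [float(i) for i in range(1024)]   # produced by the last dispatch
bench_out = [0.0] * len(recv_x)            # persistent output, allocated once

# Timed region now contains only the kernel call: no recv_x.clone(),
# no per-iteration torch.empty().
n_iters = 100
t0 = time.perf_counter()
for _ in range(n_iters):
    combine_kernel(recv_x, bench_out)
elapsed = time.perf_counter() - t0
```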
- Report both per-rank and aggregate BW to align with NCCL-EP's ep_bench
(which reports per-rank GB/s).
- Accept MSCCLPP_EP_LL_TOKENS/HIDDEN/TOPK/EXPERTS_PER_RANK env overrides
so we can match external benchmark problem sizes (NCCL-EP LL defaults
are num_tokens=128, hidden=7168, top_k=8).
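A sketch of the override plumbing; the env var names and the first three defaults are from the list above, while the EXPERTS_PER_RANK default here is purely illustrative:

```python
import os

def _env_int(name, default):
    # Env override wins; otherwise fall back to the built-in default.
    return int(os.environ.get(name, default))

num_tokens       = _env_int("MSCCLPP_EP_LL_TOKENS", 128)
hidden           = _env_int("MSCCLPP_EP_LL_HIDDEN", 7168)
top_k            = _env_int("MSCCLPP_EP_LL_TOPK", 8)
experts_per_rank = _env_int("MSCCLPP_EP_LL_EXPERTS_PER_RANK", 32)  # default assumed
```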
Each local expert sends one copy per dispatched token back to its owner,
so the bytes actually on the wire during combine match dispatch. The
previous num_tokens×hidden accounting under-counted the traffic by a
factor of ~num_topk, making combine BW look artificially low next to dispatch.
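The byte accounting, using the NCCL-EP LL default problem size quoted earlier and BF16 elements:

```python
# Each dispatched token is returned once per expert that received it,
# i.e. ~top_k copies per token, so combine wire bytes match dispatch.
num_tokens, hidden, top_k = 128, 7168, 8
bytes_per_elem = 2  # BF16

old_bytes = num_tokens * hidden * bytes_per_elem          # under-counts
new_bytes = num_tokens * top_k * hidden * bytes_per_elem  # matches wire traffic

assert new_bytes == old_bytes * top_k
```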
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).
Applies to intranode HT, internode HT, and the LL test (currently
unreachable on intra-node 8-GPU setups). Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
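The split-phase reporting can be sketched as follows; per-rank times and the payload size are illustrative, not measured:

```python
# Per-rank times for each phase, as if gathered across 4 ranks (illustrative).
dispatch_t = [100e-6, 110e-6, 105e-6, 120e-6]  # seconds
combine_t  = [130e-6, 125e-6, 140e-6, 135e-6]
bytes_per_rank = 58e6

# Per-phase latency: max across ranks (the slowest rank gates the phase).
dispatch_lat = max(dispatch_t)
combine_lat  = max(combine_t)

# Aggregate effective bandwidth: sum of per-rank BW across ranks.
dispatch_bw = sum(bytes_per_rank / t for t in dispatch_t) / 1e9  # GB/s
combine_bw  = sum(bytes_per_rank / t for t in combine_t) / 1e9
```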
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
DeepEP's original filter was safe because its LL path used NVSHMEM; this
port drives LL via PortChannel, so the kernel indexes
port_channel_handles[local_expert*num_ranks + dst_rank] for every
dst_rank. All peers now get a real memory/connection/semaphore/port
channel entry.
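The flattened indexing scheme the kernel relies on can be sketched in Python; the table contents here are fake tuples standing in for real PortChannel handles:

```python
# One handle per (local_expert, dst_rank) pair, flattened row-major as
# [local_expert * num_ranks + dst_rank]. After the sync fix, every slot
# is populated -- including peers with the same GPU id on another node.
num_ranks, num_local_experts = 4, 2
port_channel_handles = [
    (e, r) for e in range(num_local_experts) for r in range(num_ranks)
]

def handle_for(local_expert, dst_rank):
    return port_channel_handles[local_expert * num_ranks + dst_rank]

assert handle_for(1, 3) == (1, 3)  # every (expert, rank) slot resolves
```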
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
functional round-trip, BF16 only). Works cross-node in DeepEP's
1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
intra-node 8-GPU LL currently hangs because every peer transfer routes
through the CPU proxy over IB loopback between distinct HCAs on the same
host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
mis-aligns the 32-bit rdma_recv_count slots when used for same-node
peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
same-node, PortChannel for cross-node) or 64-bit counters.