- Report both per-rank and aggregate BW to align with NCCL-EP's ep_bench
(which reports per-rank GB/s).
- Accept MSCCLPP_EP_LL_TOKENS/HIDDEN/TOPK/EXPERTS_PER_RANK env overrides
so we can match external benchmark problem sizes (NCCL-EP LL defaults
are num_tokens=128, hidden=7168, top_k=8).
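A sketch of the override logic (function names are illustrative; the env
var names and the NCCL-EP LL defaults come from the change above, while
the experts-per-rank default here is a placeholder, not a documented
value):

```python
import os

def _env_int(name, default):
    """Return int(os.environ[name]) if the variable is set, else the default."""
    val = os.environ.get(name)
    return int(val) if val is not None else default

def ll_problem_size():
    """Problem size for the LL benchmark, with env overrides applied."""
    return {
        "num_tokens": _env_int("MSCCLPP_EP_LL_TOKENS", 128),  # NCCL-EP LL default
        "hidden": _env_int("MSCCLPP_EP_LL_HIDDEN", 7168),     # NCCL-EP LL default
        "top_k": _env_int("MSCCLPP_EP_LL_TOPK", 8),           # NCCL-EP LL default
        # No default is stated for experts-per-rank; 8 is a placeholder.
        "experts_per_rank": _env_int("MSCCLPP_EP_LL_EXPERTS_PER_RANK", 8),
    }
```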
- Each local expert sends one copy per dispatched token back to its owner,
  so the bytes actually on the wire during combine match dispatch. The
  previous num_tokens×hidden formula under-counted by ~num_topk×, making
  combine BW look artificially low next to dispatch.
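The accounting difference can be expressed directly (a sketch; the
2-byte element size assumes BF16, matching the BF16-only LL test):

```python
def combine_bytes(num_tokens, hidden, top_k, elem_size=2):
    """Bytes on the wire during combine: each token was dispatched to
    top_k experts, and each local expert sends one copy per dispatched
    token back to its owner."""
    return num_tokens * top_k * hidden * elem_size

def old_combine_bytes(num_tokens, hidden, elem_size=2):
    """The previous formula, which ignored the top_k fan-out."""
    return num_tokens * hidden * elem_size
```

With the NCCL-EP LL defaults (num_tokens=128, hidden=7168, top_k=8) the
old formula reports 8× fewer bytes, hence the artificially low combine BW.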
- Split the optional benchmark, which previously measured only full
  round-trip latency, to time dispatch alone (N iters) and combine alone
  (N iters reusing one dispatch output), reporting per-phase latency (max
  across ranks) and aggregate effective bandwidth (sum across ranks).
  Applies to intranode HT, internode HT, and the LL test (currently
  unreachable on intra-node 8-GPU). Internode HT keeps the sync+barrier
  guard between dispatch and combine but excludes it from either phase's
  timing.
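A minimal sketch of the split-phase measurement and reporting (function
names are illustrative; in the real benchmark each rank times its own
loop and the per-rank results are gathered before reduction):

```python
import time

def time_phase(fn, iters):
    """Average seconds per call over `iters` calls of fn (one phase,
    dispatch-only or combine-only)."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

def report_phase(per_rank_latency_s, per_rank_bytes):
    """Per-phase latency is the max across ranks; effective bandwidth is
    summed across ranks (divide by 1e9 for aggregate GB/s)."""
    latency = max(per_rank_latency_s)
    aggregate_bw = sum(b / t for b, t in zip(per_rank_bytes, per_rank_latency_s))
    return latency, aggregate_bw
```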
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
DeepEP's original filter was safe because its LL path used NVSHMEM; this
port drives LL via PortChannel so the kernel indexes
port_channel_handles[local_expert*num_ranks + dst_rank] for every
dst_rank. All peers now get a real memory/connection/semaphore/port
channel entry.
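A sketch of the flat indexing the kernel relies on, showing why every
dst_rank slot (including same-GPU-id peers) must hold a populated entry
(the table contents here are placeholders):

```python
def channel_index(local_expert, dst_rank, num_ranks):
    """Flat index into port_channel_handles, as used by the LL kernel."""
    return local_expert * num_ranks + dst_rank

def build_handles(num_local_experts, num_ranks):
    """Placeholder handle table with one entry per (expert, peer) pair;
    after the sync change, no dst_rank slot is left unfilled."""
    return [(expert, rank)
            for expert in range(num_local_experts)
            for rank in range(num_ranks)]
```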
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
functional round-trip, BF16 only). Works cross-node in DeepEP's
1-GPU-per-node topology.
- Known limitation, documented in src/ext/ep/README.md and the test
  docstring: intra-node 8-GPU LL currently hangs because every peer
  transfer routes through the CPU proxy over IB loopback between distinct
  HCAs on the same host; separately, CudaIpcConnection::atomicAdd is a
  64-bit op that mis-aligns the 32-bit rdma_recv_count slots when used
  for same-node peers. A proper fix needs either a mixed-transport LL
  variant (MemoryChannel for same-node peers, PortChannel for cross-node)
  or 64-bit counters.
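The alignment half of the limitation can be illustrated with byte
arithmetic (a sketch; the 4-byte slot size reflects the 32-bit
rdma_recv_count counters):

```python
def check_64bit_atomic_on_32bit_slot(base, slot, slot_size=4):
    """For a 64-bit atomic aimed at 32-bit slot `slot`: is the target
    address 8-byte aligned, and do the 8 touched bytes spill into the
    next slot?"""
    addr = base + slot * slot_size
    touched = set(range(addr, addr + 8))            # bytes a 64-bit op covers
    next_slot = set(range(base + (slot + 1) * slot_size,
                          base + (slot + 2) * slot_size))
    return addr % 8 == 0, bool(touched & next_slot)
```

Even the even-indexed slots get clobbered (the 8-byte write always
covers the neighboring 4-byte slot), and odd-indexed slots are not even
8-byte aligned.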