mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-07-01 11:57:44 +00:00

Author	SHA1	Message	Date
Binyang Li	bd0f15b4ef	WIP	2026-06-22 21:50:39 +00:00
Binyang Li	5e0c1de254	WIP	2026-06-22 17:54:46 +00:00
Binyang Li	4c7e95a582	WIP	2026-06-21 06:28:22 +00:00
Binyang Li	f4fbd093db	lint	2026-06-16 15:58:10 +00:00
qinghuazhou	2ebf81aa35	ep(ncclep): increment-3 de-risk - DRAIN_NOOP probe + SKIP_VERIFY test gate internode_ncclep.cuh: add EP_NCCLEP_DRAIN_NOOP compile gate (default 0, inert) - when 1 the NVL receiver keeps all control flow but skips the data copies, to measure the dispatch-time upper bound of eliminating the cross-GPU receiver drain. Probe result (4-node {38,41,59,75}): dispatch inc2 1124us -> DRAIN_NOOP 1048us (-6.8%), agg_bw 836->896 GB/s => confirms real headroom for the cross-GPU peer-map direct-write rework (ceiling ~-16.6% cumulative vs baseline). test_internode_multirank.py: gate the dispatch range-assert and combine assert behind MSCCLPP_EP_SKIP_VERIFY env so dispatch timing can be measured when recv_x is intentionally incomplete (perf probing).	2026-06-08 16:25:53 +00:00
copilot-swe-agent[bot]	04ea24da8d	Fix python lint formatting in internode multirank test Agent-Logs-Url: https://github.com/microsoft/mscclpp/sessions/f5220581-e26c-49d8-98fa-e1b8ab011898 Co-authored-by: seagater <7475084+seagater@users.noreply.github.com>	2026-05-20 18:04:00 +00:00
Qinghua Zhou	757c5ec831	Merge qinghuazhou/expert_parallel_gb200	2026-05-20 01:56:34 +00:00
Qinghua Zhou	cb93dd585b	tests/ep: Unify the name of EP benchmark variables	2026-05-20 01:48:29 +00:00
Qinghua Zhou	20bd1ec55b	ext/ep: fix CUDA 11.8 build + apply clang-format/black - src/core/atomicadd_kernel.cu: restore the legacy 3-arg cuCtxCreate(&proxyAtomicCtx_, 0, cuDevice) in the '#else' branch of the CUDA_VERSION >= 12050 guard. A prior edit had corrupted it to 'cuCtxCreate(&proxyAtomicCtx_vice)', which broke the CUDA 11.8 build (CodeQL CUDA cuda11.8 and MSCCLPPLang cuda11.8 jobs). - Apply clang-format to src/ext/ep/* (no logic changes, fixes the cpplint CI job). - Apply black to test/python/ext/ep/test_internode_multirank.py and test_intranode_multirank.py (no logic changes, fixes the pylint CI job).	2026-05-18 21:44:20 +00:00
Qinghua Zhou	5911998181	ext/ep: gate NVLS HT B2 on cross-host fabric IPC support (H100 fix) The NVLS HT B2 path introduced in `3ab2e43b` activated whenever isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE that is true (H100 has intra-node NVLink multicast), but there is no cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR whose handle exchange routes through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain socket worker ranks cannot reach. Symptom on every commit since `3ab2e43b`: RuntimeError: connect() failed for unix socket to /tmp/mscclpp_bootstrap_<pid>.sock MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored. src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper. Resolution: 1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off, 1/on/true/yes/force => on, otherwise auto). 2. Auto-detect: requires both - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1 - device compute capability >= sm_100 (Blackwell+). Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled (HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled. Diagnostic prints now show fabric_ipc=. test/python/ext/ep/test_internode_multirank.py: replace hardcoded NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count(). Makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node) without code changes. src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the underlying symbol was renamed); preserve legacy 3-arg cuCtxCreate for older toolkits. Verified on 2x H100 NDv5 at HEAD: LL intranode (8 GPUs) PASS LL internode (16 GPUs, 2 nodes) PASS HT intranode (8 GPUs) PASS HT internode (16 GPUs, 2 nodes) PASS Diagnostic on H100: [mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0) [mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)	2026-05-14 21:29:10 +00:00
qinghuazhou	13babbfff2	test/ext/ep: HT — scale combine tolerance with bf16 ulp At 16 nodes (64 ranks) with topk=8, expected combine values reach rank*8 = 504, while intermediate partial sums (rank*7 etc.) cross the bf16 ulp=2 boundary at 256. With the test pattern x = rank*ones and weights = 1, this produces deterministic +/-1 round-off on certain ranks (odd local_rank on nodes >= 9), tripping the previous 1e-2 absolute tolerance even though the kernel is correct. Use tol = max(1e-2, max_exp / 64) which matches the bf16 mantissa precision and scales with the magnitude of the expected combined output. The previous tight bound is preserved for small-scale runs where max_exp < 0.64.	2026-05-12 05:37:42 +00:00
qinghuazhou	3f459a995d	test/ext/ep: HT tests — env-driven cfg + allgather bookkeeping - Internode HT test: accept MSCCLPP_EP_HT_{TOKENS,HIDDEN,TOPK,EXPERTS} env vars to override the functional-check problem size (was hardcoded to num_tokens=128, hidden=1024, num_topk=min(4,num_ranks), num_experts=num_ranks*4). - Both intranode + internode HT tests: replace dist.all_to_all_single bookkeeping (per-(src,dst) recv-count matrix used for the six-metric NVL/RDMA BW breakdown) with dist.all_gather_into_tensor + transpose. Functionally identical (gathered[:, rank] gives the same recv-from-src column) but works on socket-NCCL with NCCL_IB_DISABLE=1, which is required on rigs where NCCL IB cannot coexist with mscclpp RDMA. Sends num_ranks^2 int64 instead of num_ranks per rank — negligible (64 ints at 8 ranks).	2026-05-12 02:53:40 +00:00
Qinghua Zhou	7b06a60786	test/ext/ep: make HT test Config env-driven Allow tuning the internode HT test cfg from the environment without editing the source. Supported variables (all optional): MSCCLPP_EP_NSM (default 152) num channels / SMs MSCCLPP_EP_NVL_SEND (default 8) MSCCLPP_EP_NVL_RECV (default 256) MSCCLPP_EP_RDMA_SEND (default 16) MSCCLPP_EP_RDMA_RECV (default 128) The defaults match what we use for 16-node GB200 bench runs (e.g. NVL_RECV=512 to satisfy the HT combine assert at 16 nodes).	2026-05-11 20:59:12 +00:00
Qinghua Zhou	5d16ac958e	EP GB200 (4 GPUs/node) support - configs.cuh: NUM_MAX_NVL_PEERS 8 -> 4 - internode.cu: introduce NvlPackT (uint64_t for 8 peers, uint32_t for 4) to handle packed-bool loads of is_token_in_rank; relax SourceMeta static_assert; replace 4 uint64_t-coupled sites - buffer.hpp/buffer.cc: relax NUM_MAX_NVL_PEERS assert (4 || 8); read MSCCLPP_EP_LOCAL_WORLD_SIZE env to override rdma_rank/nvl_rank partitioning when local world size != NUM_MAX_NVL_PEERS - CMakeLists.txt (ext/ep): rpath / install fix - pyproject.toml: MSCCLPP_BUILD_EXT_EP=ON - src/core/atomicadd_kernel.cu, kernels/buffer.cuh, kernels/utils.cuh: related EP fixes - test_internode_multirank.py: NUM_MAX_NVL_PEERS=4, rank %% 4	2026-05-08 01:42:21 +00:00
Qinghua Zhou	e87c66a85d	ext/ep: apply clang-format and black to fix CI lint failures Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py` (black) over the EP extension files added by this PR. No functional changes; pure reformatting to satisfy the cpplint and pylint CI jobs.	2026-05-06 04:12:20 +00:00
Qinghua Zhou	5178155be8	ext/ep: add MIT license headers to EP sources and tests	2026-05-06 02:42:49 +00:00
Qinghua Zhou	e752dbaf97	tests/ep: add NCCL-EP six-metric BW breakdown (send/recv x total/nvl/rdma) For HT intra/internode benches, compute per-rank avg total_send/rdma_send and total_recv/rdma_recv token counts (matching NCCL-EP ep_bench accounting) and print send-side and recv-side BW split into total / nvl / rdma columns. Combine reverses send<->recv. Byte-count line mirrors NCCL-EP's '(per rank avg)' summary so numbers are directly comparable.	2026-04-29 20:44:10 +00:00
Qinghua Zhou	9213587ffe	ep tests: report dispatch/combine min, avg, max time and use avg for BW Aligns with NCCL-EP's ep_bench convention (BW computed from average time across ranks). Previously we reported only the max time and computed BW per-rank, which made our numbers more pessimistic than NCCL-EP's.	2026-04-29 16:50:33 +00:00
Qinghua Zhou	afbdcd6a3d	ep tests: clean shutdown to silence TCPStore/HeartbeatMonitor noise Add dist.barrier() + dist.destroy_process_group() in a finally block so non-zero ranks don't poll the TCPStore after rank 0 (the store server) exits, which produced noisy 'recvValue failed / Connection was likely closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor. Also pass device_id to init_process_group in the internode test to silence 'Guessing device ID based on global rank' warnings.	2026-04-29 05:16:22 +00:00
Qinghua Zhou	48540bc11e	tests/ep: align internode HT bench with NCCL-EP accounting Same change as the intra-node bench (commit `4ed6f229`), applied to the cross-node test: - Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8). - Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden, matching NCCL-EP's `RDMA_send` per-rank byte count.	2026-04-27 17:39:17 +00:00
Qinghua Zhou	9840853c69	tests/ep: HT benches also print per_rank_bw Same alignment with NCCL-EP ep_bench as the LL test: report both per-rank (agg/num_ranks) and aggregate throughput.	2026-04-23 22:58:23 +00:00
Qinghua Zhou	906fa3c48f	tests/ep: size HT buffers for bench hidden so bench phase fits	2026-04-23 17:13:09 +00:00
Qinghua Zhou	c51a8a5305	ext/ep tests: time dispatch and combine separately in MSCCLPP_EP_BENCH Previously the optional benchmark measured full round-trip latency. Split it to time dispatch alone (N iters) and combine alone (N iters reusing one dispatch output), reporting per-phase latency (max across ranks) and aggregate effective bandwidth (sum across ranks). Applies to intranode HT, internode HT, and the (currently unreachable on intra-node 8-GPU) LL test. Internode HT keeps the sync+barrier guard between dispatch and combine but excludes it from either phase's timing.	2026-04-22 23:11:04 +00:00
Qinghua Zhou	2391ce1de7	ext/ep tests: add optional HT benchmark pass Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports per-iter latency (max across ranks, CUDA-event timed) and aggregate effective bandwidth (sum across ranks, dispatch+combine payload bytes). Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN. Bench reuses the Buffer allocated for the correctness phase and self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.	2026-04-22 19:03:09 +00:00
Qinghua Zhou	c351b871a1	ep: fix internode combine in multirank test Two issues prevented internode HT combine from completing on 2x8 H100: 1. Wrong prefix matrices passed to internode_combine. Combine runs in the reverse direction of dispatch, so it must consume the receiver-side matrices returned by dispatch (recv_rdma_channel_prefix_matrix, recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix. This matches DeepEP's deep_ep/buffer.py::internode_combine handle unpacking. Without the fix the NVL forwarder's 'NVL check' timed out because token_start_idx/token_end_idx were computed against the wrong per-channel layout. 2. Cross-rank race between dispatch and combine. Even with the correct matrices, launching combine immediately after dispatch deadlocked the forwarder NVL check (tail stuck one short of expected_head) because peers still had in-flight dispatch proxy traffic while fast ranks had already started combine. A torch.cuda.synchronize() + dist.barrier() between the two calls makes the test pass deterministically on 16 ranks (combine diff == 0, max|expected| up to 60.0). The barrier in the test is a workaround; the real fix belongs in Buffer::internode_dispatch / Buffer::internode_combine so the dispatch->combine handoff fully fences outstanding proxy work across ranks. Marked with an XXX comment in the test.	2026-04-22 02:21:29 +00:00
Qinghua Zhou	393d6e2673	ep: fix port-channel rank ordering for internode HT dispatch The `internode` kernels index device-side port channel handles as `port_channel_handles[channel_id * num_ranks + peer_rank]`, where `peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was building that table by iterating `std::unordered_map<int, MemoryId>` (and similarly for connections/semaphores), which yields hash order rather than ascending rank order. Once the cross-node fan-out grew beyond a single peer, a local rank's trigger for peer `r` landed on the semaphore/memory pair of a different peer, so RDMA puts and atomic tail updates went to the wrong destination and the forwarder spun on a tail counter that never advanced. Changes: - Build `sema_ids` and `port_channel_handles` by iterating `for (int r = 0; r < num_ranks; ++r)` and looking up the connection / memory id for rank `r`, skipping ranks excluded by low-latency mode (inserting a placeholder handle so the stride stays `num_ranks`). - Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with `kRdmaTag = 1` so they do not collide with NVL-phase tag-0 traffic between the same pair of ranks. - Drop an unused `r` local in the NVL setup loop. With this fix and a matched `libmscclpp.so` on both nodes, the 2-node x 8-GPU internode HT dispatch path completes successfully (`[dispatch] OK`). Combine is still under investigation. Also adds `test/python/ext/ep/test_internode_multirank.py`, a torchrun-based 2-node functional test that exercises `get_dispatch_layout` -> `internode_dispatch` -> `internode_combine` and validates per-source-rank token values end-to-end.	2026-04-21 19:12:25 +00:00
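
Note on commit 393d6e2673: the message describes filling the handle table in ascending rank order and inserting a placeholder for skipped ranks so that the stride stays `num_ranks`. The Python sketch below only illustrates that indexing rule. `build_handle_table`, `lookup`, `PLACEHOLDER`, and the string handles are hypothetical names for illustration; they are not part of mscclpp, and the real code is C++ in `Buffer::sync`.

```python
# Hypothetical illustration of the rank-ordered table from commit 393d6e2673.
# None of these names exist in mscclpp; they only mirror the indexing rule
# port_channel_handles[channel_id * num_ranks + peer_rank].

PLACEHOLDER = None  # stands in for the handle of a rank that is skipped


def build_handle_table(handles_by_rank, num_channels, num_ranks):
    """Flatten per-rank handles into a table with stride num_ranks.

    handles_by_rank maps a global rank to a list of per-channel handles.
    Ranks missing from the map get PLACEHOLDER so the stride is preserved.
    """
    table = []
    for channel_id in range(num_channels):
        for r in range(num_ranks):  # ascending rank order, never map order
            per_channel = handles_by_rank.get(r)
            if per_channel is None:
                table.append(PLACEHOLDER)
            else:
                table.append(per_channel[channel_id])
    return table


def lookup(table, channel_id, peer_rank, num_ranks):
    """Index the flat table the same way the commit message describes."""
    return table[channel_id * num_ranks + peer_rank]


# Insertion order deliberately differs from rank order; rank 1 is skipped.
handles = {3: ["c0-r3", "c1-r3"], 0: ["c0-r0", "c1-r0"], 2: ["c0-r2", "c1-r2"]}
table = build_handle_table(handles, num_channels=2, num_ranks=4)
```

Because the loop runs over `range(num_ranks)` rather than over the map itself, the slot for peer `r` is independent of the order in which entries were inserted or hashed.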

26 Commits
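
Note on commit 13babbfff2: the message gives the new combine tolerance as `tol = max(1e-2, max_exp / 64)` and cites a bf16 ulp of 2 starting at 256. The Python sketch below reproduces that arithmetic. The function names `combine_tolerance` and `bf16_ulp` are hypothetical, and `bf16_ulp` assumes bfloat16 carries 8 bits of significand precision (7 stored bits plus the implicit leading bit); the actual test may compute the tolerance differently.

```python
import math


def combine_tolerance(max_exp):
    """Tolerance formula quoted in the commit message."""
    return max(1e-2, max_exp / 64)


def bf16_ulp(x):
    """Spacing between adjacent bf16 values near x (normal range only).

    Assumes 8 bits of significand precision, so values in [2**k, 2**(k+1))
    are spaced 2**(k - 7) apart.
    """
    _, e = math.frexp(abs(x))  # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 8)  # 2**((e - 1) - 7)


# Below 256 the spacing is 1; from 256 up to 512 it is 2, which is why
# odd partial sums in that range cannot be represented exactly.
small_run = combine_tolerance(0.5)  # stays at the old 1e-2 bound
large_run = combine_tolerance(504)  # scales with the expected magnitude
```

With an expected value of 504 the tolerance becomes 7.875, comfortably above the +/-1 round-off the message reports, while any `max_exp` below 0.64 keeps the original 1e-2 bound.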