mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-24 14:54:51 +00:00

Author	SHA1	Message	Date
qinghuazhou	33e59c2908	test/ext/ep: intranode HT — parameterize num_sms and NVL chunk sizes Adds MSCCLPP_EP_NUM_SMS / MSCCLPP_EP_NVL_SEND / MSCCLPP_EP_NVL_RECV env overrides for ep.Config(num_sms, num_max_nvl_chunked_send_tokens, num_max_nvl_chunked_recv_tokens). Defaults unchanged (20, 8, 256). Sweep on 4-rank intranode HT (tokens=4096, hidden=7168, experts=256): sms=20, NVL_SEND=8, NVL_RECV=256 -> d_recv=50.76, c_recv=65.66 GB/s sms=64, NVL_SEND=16, NVL_RECV=512 -> d_recv=57.75, c_recv=150.46 GB/s d_recv (actual NVL throughput per rank) plateaus at ~57 GB/s for topk>=2; combine recv scales near-linearly with num_sms.	2026-05-13 16:25:01 +00:00
qinghuazhou	f9f0d0fcb7	test/ext/ep: intranode HT bench — cached-mode iter loop Run one uncached dispatch to capture (rank_prefix_matrix, channel_prefix_matrix, num_recv_tokens), then time iters in cached mode. This replaces notify_dispatch + host busy-wait on mapped pinned counters with the cheap cached_notify_dispatch (one barrier + memcpy + memset), matching NCCL-EP ep_bench convention. Cached mode forces num_experts=0 (buffer.cc:807), so topk_idx must be None in iters; combine still works because recv_topk_weights is optional. Per-iter dispatch latency drops ~21% (4247→3373µs). Confirms host-side notify_dispatch overhead is only ~20% of total dispatch time; the remaining 14.4× send-total asymmetry vs combine is intrinsic (3× recv/ send byte fan-out × 3.8× dispatch-kernel-vs-combine-kernel work).	2026-05-12 19:02:31 +00:00
qinghuazhou	3f459a995d	test/ext/ep: HT tests — env-driven cfg + allgather bookkeeping - Internode HT test: accept MSCCLPP_EP_HT_{TOKENS,HIDDEN,TOPK,EXPERTS} env vars to override the functional-check problem size (was hardcoded to num_tokens=128, hidden=1024, num_topk=min(4,num_ranks), num_experts=num_ranks*4). - Both intranode + internode HT tests: replace dist.all_to_all_single bookkeeping (per-(src,dst) recv-count matrix used for the six-metric NVL/RDMA BW breakdown) with dist.all_gather_into_tensor + transpose. Functionally identical (gathered[:, rank] gives the same recv-from-src column) but works on socket-NCCL with NCCL_IB_DISABLE=1, which is required on rigs where NCCL IB cannot coexist with mscclpp RDMA. Sends num_ranks^2 int64 instead of num_ranks per rank — negligible (64 ints at 8 ranks).	2026-05-12 02:53:40 +00:00
Qinghua Zhou	e87c66a85d	ext/ep: apply clang-format and black to fix CI lint failures Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py` (black) over the EP extension files added by this PR. No functional changes; pure reformatting to satisfy the cpplint and pylint CI jobs.	2026-05-06 04:12:20 +00:00
Qinghua Zhou	5178155be8	ext/ep: add MIT license headers to EP sources and tests	2026-05-06 02:42:49 +00:00
Qinghua Zhou	2529774868	tests/ep: intranode send-side counts unique (token, dst_node) to match NCCL-EP Previously total_send_tokens was Sigma over dst_rank of num_tokens_per_rank which over-counts intra-node fan-out. NCCL-EP's ep_bench collapses multiple destinations on the same node into one count; on a single-node run that means total_send_tokens = number of tokens with at least one valid expert. Switching to is_token_in_rank.any(dim=1).sum() makes the send-side BW comparable to NCCL-EP's send: total_bw / nvl_bw line.	2026-04-29 23:31:47 +00:00
Qinghua Zhou	e752dbaf97	tests/ep: add NCCL-EP six-metric BW breakdown (send/recv x total/nvl/rdma) For HT intra/internode benches, compute per-rank avg total_send/rdma_send and total_recv/rdma_recv token counts (matching NCCL-EP ep_bench accounting) and print send-side and recv-side BW split into total / nvl / rdma columns. Combine reverses send<->recv. Byte-count line mirrors NCCL-EP's '(per rank avg)' summary so numbers are directly comparable.	2026-04-29 20:44:10 +00:00
Qinghua Zhou	9213587ffe	ep tests: report dispatch/combine min, avg, max time and use avg for BW Aligns with NCCL-EP's ep_bench convention (BW computed from average time across ranks). Previously we reported only the max time and computed BW per-rank, which made our numbers more pessimistic than NCCL-EP's.	2026-04-29 16:50:33 +00:00
Qinghua Zhou	afbdcd6a3d	ep tests: clean shutdown to silence TCPStore/HeartbeatMonitor noise Add dist.barrier() + dist.destroy_process_group() in a finally block so non-zero ranks don't poll the TCPStore after rank 0 (the store server) exits, which produced noisy 'recvValue failed / Connection was likely closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor. Also pass device_id to init_process_group in the internode test to silence 'Guessing device ID based on global rank' warnings.	2026-04-29 05:16:22 +00:00
Qinghua Zhou	4ed6f229f2	tests/ep: align intranode HT bench with NCCL-EP accounting - Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8). The functional check above continues to use the smaller (num_ranks*4 experts, topk=4) configuration. - Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden, matching NCCL-EP's `RDMA_send` per-rank byte count. The previous formula counted DeepEP's expanded recv layout (one row per (token,src_rank) pair), inflating reported GB/s ~5x and making cross-stack comparisons misleading.	2026-04-27 17:14:42 +00:00
Qinghua Zhou	9840853c69	tests/ep: HT benches also print per_rank_bw Same alignment with NCCL-EP ep_bench as the LL test: report both per-rank (agg/num_ranks) and aggregate throughput.	2026-04-23 22:58:23 +00:00
Qinghua Zhou	906fa3c48f	tests/ep: size HT buffers for bench hidden so bench phase fits	2026-04-23 17:13:09 +00:00
Qinghua Zhou	c51a8a5305	ext/ep tests: time dispatch and combine separately in MSCCLPP_EP_BENCH Previously the optional benchmark measured full round-trip latency. Split it to time dispatch alone (N iters) and combine alone (N iters reusing one dispatch output), reporting per-phase latency (max across ranks) and aggregate effective bandwidth (sum across ranks). Applies to intranode HT, internode HT, and the (currently unreachable on intra-node 8-GPU) LL test. Internode HT keeps the sync+barrier guard between dispatch and combine but excludes it from either phase's timing.	2026-04-22 23:11:04 +00:00
Qinghua Zhou	2391ce1de7	ext/ep tests: add optional HT benchmark pass Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports per-iter latency (max across ranks, CUDA-event timed) and aggregate effective bandwidth (sum across ranks, dispatch+combine payload bytes). Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN. Bench reuses the Buffer allocated for the correctness phase and self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.	2026-04-22 19:03:09 +00:00
Qinghua Zhou	a6af3a4454	ext/ep: fix multi-rank intranode dispatch+combine Three issues blocked end-to-end intranode validation across multiple ranks. This commit fixes them and adds a 2/4/8-rank functional test. 1. Combine receiver: OOB __shared__ read In the combine receiver warp, the wait loop evaluated `channel_tail_idx[recv_lane_id] <= expected_head` before the `expected_head >= 0` guard. `channel_tail_idx` is a shared array of size `kNumRanks`, but the loop runs on all 32 lanes of a warp, so lanes with `recv_lane_id >= kNumRanks` indexed out of bounds. compute-sanitizer reported "Invalid __shared__ read of size 4 bytes" at combine<bf16,2,768>+0xdd0, surfaced asynchronously as cudaErrorIllegalAddress at the kernel launch site. Swap the operands so the rank-bounds check short-circuits the shared read. 2. Python bindings: UniqueId ABI `mscclpp::UniqueId` is a `std::array<uint8_t, N>` which pybind11 auto-converts to a Python `list`, silently overriding any `py::class_<UniqueId>` wrapper. Expose `create_unique_id` / `connect` as lambdas that produce/consume `py::bytes` and memcpy into a local `UniqueId`. Also coerce `bytes`->`bytearray` at the Python call site for `sync()` whose signature expects `pybind11::bytearray`. 3. Python frontend: communicator required for NVL-only sync `Buffer::sync()` uses `communicator->connect(ipc_config, ...)` on the pure-NVLink path, so the communicator must be initialized even when `num_rdma_ranks == 1` and `low_latency_mode == False`. Always broadcast the unique id and call `runtime.connect()` before `sync()`. Validation on a single H100x8 node via torchrun: - 2 ranks: dispatch 195 tokens, combine diff=0 - 4 ranks: dispatch 371 tokens, combine diff=0 - 8 ranks: dispatch 456 tokens, combine diff=0 Test harness added at test/python/ext/ep/test_intranode_multirank.py.	2026-04-21 02:03:55 +00:00
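
Commit 33e59c2908 describes three environment overrides that feed ep.Config(num_sms, num_max_nvl_chunked_send_tokens, num_max_nvl_chunked_recv_tokens) with defaults (20, 8, 256). The parsing could look roughly like the sketch below. The helper name `read_ep_overrides` and the treatment of empty strings are assumptions for illustration; the actual test code is not shown in this log.

```python
import os

# Variable names and defaults as quoted in commit 33e59c2908.
# Order matches the ep.Config argument order given in that message.
DEFAULTS = {
    "MSCCLPP_EP_NUM_SMS": 20,
    "MSCCLPP_EP_NVL_SEND": 8,
    "MSCCLPP_EP_NVL_RECV": 256,
}


def read_ep_overrides(env=None):
    """Hypothetical helper: return (num_sms, nvl_send, nvl_recv) as ints.

    Unset or empty variables fall back to the defaults above
    (the empty-string fallback is an assumption, not from the commit).
    """
    env = os.environ if env is None else env
    values = []
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        values.append(default if raw is None or raw == "" else int(raw))
    return tuple(values)
```

Under these assumptions the tuple could be unpacked positionally, e.g. `ep.Config(*read_ep_overrides())`, so a sweep only needs to change the environment.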

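Commit 2529774868 changes the intranode send-side count from a sum over destination ranks to the number of tokens with at least one destination. The difference can be illustrated without torch, using a plain tokens-by-ranks boolean matrix in place of the `is_token_in_rank` tensor. The function name `send_token_counts` is hypothetical, and the sketch covers only the single-node case described in that commit.

```python
def send_token_counts(is_token_in_rank):
    """Return (old_count, new_count) for a tokens x ranks boolean matrix.

    old_count: sum over destination ranks of tokens routed to that rank,
               which counts a token once per destination rank.
    new_count: tokens with at least one destination, the pure-Python
               analogue of is_token_in_rank.any(dim=1).sum().
    """
    old_count = sum(sum(1 for flag in row if flag) for row in is_token_in_rank)
    new_count = sum(1 for row in is_token_in_rank if any(row))
    return old_count, new_count


# Three tokens, four ranks on one node (illustrative data only).
routing = [
    [True, True, False, False],   # token 0 goes to ranks 0 and 1
    [False, False, False, False], # token 1 has no valid expert
    [True, False, True, True],    # token 2 goes to ranks 0, 2 and 3
]
old_count, new_count = send_token_counts(routing)
print(old_count, new_count)  # 5 2
```

The old count grows with intra-node fan-out, while the new count stays at the number of routed tokens, which is the quantity the commit says NCCL-EP reports on a single node.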
15 Commits
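
Commit 3f459a995d replaces the dist.all_to_all_single bookkeeping with dist.all_gather_into_tensor plus a column read, stating that gathered[:, rank] yields the same recv-from-src counts. The sketch below simulates both collectives with nested lists to show why the two agree. It does not call torch.distributed; the function names and the sample matrix are assumptions for illustration.

```python
def simulate_all_to_all(send_rows):
    """Simulate all_to_all_single on per-rank count rows.

    send_rows[src][dst] is what rank src sends to rank dst.
    Returns recv[dst][src], the row each rank ends up holding.
    """
    n = len(send_rows)
    return [[send_rows[src][dst] for src in range(n)] for dst in range(n)]


def simulate_all_gather(send_rows):
    """Simulate all_gather: every rank receives a copy of every row."""
    return [[list(row) for row in send_rows] for _ in send_rows]


def recv_counts_via_gather(send_rows, rank):
    """Read column `rank` of the gathered matrix (gathered[:, rank])."""
    gathered = simulate_all_gather(send_rows)[rank]
    return [gathered[src][rank] for src in range(len(send_rows))]


# Illustrative 3-rank send-count matrix.
send_rows = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
for rank in range(len(send_rows)):
    assert simulate_all_to_all(send_rows)[rank] == recv_counts_via_gather(send_rows, rank)

# Per-rank traffic: num_ranks values with all-to-all, num_ranks**2 with all-gather.
print(len(send_rows), len(send_rows) ** 2)  # 3 9
```

At 8 ranks the gathered matrix holds 64 int64 values per rank, matching the overhead the commit calls negligible.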