mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-21 05:19:24 +00:00

Qinghua Zhou 5911998181 ext/ep: gate NVLS HT B2 on cross-host fabric IPC support (H100 fix)

The NVLS HT B2 path introduced in 3ab2e43b activated whenever
isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE
that is true (H100 has intra-node NVLink multicast), but there is no
cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back
to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR whose handle exchange routes
through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain
socket worker ranks cannot reach. Symptom on every commit since 3ab2e43b:

    RuntimeError: connect() failed for unix socket to
    /tmp/mscclpp_bootstrap_<pid>.sock

MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored.

src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper.
Resolution:
  1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off,
     1/on/true/yes/force => on, otherwise auto).
  2. Auto-detect: requires both
       - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1
       - device compute capability >= sm_100 (Blackwell+).

Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled
(HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back
to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled.
Diagnostic prints now include a fabric_ipc= field.
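
The resolution order and gating described above can be sketched as follows. This is a minimal illustration, not the code in src/ext/ep/buffer.cc: the names `Override`, `DeviceCaps`, `parseFabricIpcOverride`, `resolveFabricIpcSupported`, and `nvlsHtEnabled` are hypothetical, the device attribute and compute capability are passed in as plain values instead of being queried from the driver, and the `low_latency` flag shown in the diagnostics is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical tri-state for the MSCCLPP_EP_FABRIC_IPC override.
enum class Override { Off, On, Auto };

// Map the raw environment value (nullptr when unset) to a tri-state.
// Exact, case-sensitive matching is an assumption of this sketch.
inline Override parseFabricIpcOverride(const char* raw) {
  if (raw == nullptr) return Override::Auto;
  const std::string v(raw);
  if (v == "0" || v == "off" || v == "false" || v == "no") return Override::Off;
  if (v == "1" || v == "on" || v == "true" || v == "yes" || v == "force") return Override::On;
  return Override::Auto;
}

// Values a real implementation would query from the CUDA driver.
struct DeviceCaps {
  bool fabricHandleSupported;  // CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1
  int smMajor;                 // compute capability major (10 for sm_100)
};

// Step 1: an explicit override wins. Step 2: auto-detect needs both conditions.
inline bool resolveFabricIpcSupported(const char* raw, const DeviceCaps& caps) {
  switch (parseFabricIpcOverride(raw)) {
    case Override::Off:
      return false;
    case Override::On:
      return true;
    case Override::Auto:
      break;
  }
  return caps.fabricHandleSupported && caps.smMajor >= 10;
}

// The old condition (NVLS supported and more than one RDMA rank), now also
// gated on fabric IPC support.
inline bool nvlsHtEnabled(bool nvlsSupported, int numRdmaRanks, bool fabricIpcSupported) {
  return nvlsSupported && numRdmaRanks > 1 && fabricIpcSupported;
}
```

A caller would pass `std::getenv("MSCCLPP_EP_FABRIC_IPC")` as `raw`. With an sm_90 device, no override, `nvlsSupported = true`, and `numRdmaRanks = 2`, the sketch resolves fabric IPC to off and leaves NVLS HT disabled.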

test/python/ext/ep/test_internode_multirank.py: replace hardcoded
NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper
that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE /
OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count().
Makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node)
without code changes.

src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the
underlying symbol was renamed); preserve legacy 3-arg cuCtxCreate for
older toolkits.

Verified on 2x H100 NDv5 at HEAD:
  LL intranode  (8 GPUs)            PASS
  LL internode  (16 GPUs, 2 nodes)  PASS
  HT intranode  (8 GPUs)            PASS
  HT internode  (16 GPUs, 2 nodes)  PASS

Diagnostic on H100:
  [mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0)
  [mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)

2026-05-14 21:29:10 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| deploy | Support python wheel build (#787) | 2026-04-16 21:24:45 -07:00 |
| execution-files | New DSL implementation (#579) | 2025-08-09 00:36:20 -07:00 |
| executor-tests | Adding CI Test to DSL Executor (#782) | 2026-04-13 13:55:45 -07:00 |
| mp_unit | atomic add | 2026-04-20 18:32:36 +00:00 |
| mscclpp-test | Address installation issue in some env (#750) | 2026-02-20 16:11:16 -08:00 |
| python/ext/ep | ext/ep: gate NVLS HT B2 on cross-host fabric IPC support (H100 fix) | 2026-05-14 21:29:10 +00:00 |
| torch | Refactor algo selection logic and introduce symmetric_memory env (#741) | 2026-02-12 19:06:18 -08:00 |
| unit | IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753) | 2026-04-09 09:24:30 +00:00 |
| allgather_test_cpp.cu | connect() APIs changed to return an instance instead of a shared_ptr (#680) | 2025-11-15 11:40:40 -08:00 |
| allgather_test_host_offloading.cu | connect() APIs changed to return an instance instead of a shared_ptr (#680) | 2025-11-15 11:40:40 -08:00 |
| CMakeLists.txt | Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744) | 2026-03-24 23:34:38 -04:00 |
| executor_test.cc | Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744) | 2026-03-24 23:34:38 -04:00 |
| framework.cc | IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753) | 2026-04-09 09:24:30 +00:00 |
| framework.hpp | IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753) | 2026-04-09 09:24:30 +00:00 |
| nvls_test.cu | Fix lint.sh (#652) | 2025-10-20 17:23:01 -07:00 |
| README.md | Add unit testing framework readme (#766) | 2026-04-01 05:30:35 +00:00 |
| run_mpi_test.sh.in | Rudimentary CTest support for test executables | 2023-05-16 16:16:00 -07:00 |

README.md

MSCCL++ C++ Test Framework

A lightweight, GTest-like test framework with MPI support for testing MSCCL++ C++ APIs. Defined in framework.hpp / framework.cc.

Adding a New Test (Step-by-Step)

Single-process test (unit/)

Create the test file test/unit/my_feature_tests.cc (or .cu for CUDA):

```
#include "../framework.hpp"
#include <mscclpp/my_feature.hpp>

TEST(MyFeatureTest, BasicUsage) {
  EXPECT_EQ(myFunction(), 42);
}
```

Register it in CMake — add the filename to test/unit/CMakeLists.txt:

```
target_sources(unit_tests PRIVATE
    ...
    my_feature_tests.cc   # <-- add here
)
```

Build and run:

```
cmake --build build -j
./build/test/unit_tests --filter=MyFeatureTest
```

Multi-process test (mp_unit/)

Create the test file test/mp_unit/my_feature_tests.cc (or .cu):
```
#include "mp_unit_tests.hpp"

TEST(MyFeatureTest, MultiRank) {
  int rank = gEnv->rank;
  EXPECT_GE(rank, 0);
}
```
Use fixtures from mp_unit_tests.hpp (e.g., CommunicatorTest) if you need pre-established connections.

Register it in CMake — add the filename to test/mp_unit/CMakeLists.txt:

```
target_sources(mp_unit_tests PRIVATE
    ...
    my_feature_tests.cc   # <-- add here
)
```

Build and run:

```
cmake --build build -j
mpirun -np 2 ./build/test/mp_unit_tests --filter=MyFeatureTest
```

Notes

- No separate test registration step is needed: TEST() auto-registers via static initialization.
- The test_framework static library is built from framework.cc in the top-level test/CMakeLists.txt and linked into both unit_tests and mp_unit_tests. You do not need to modify it.
- Use the .cu extension for files that contain CUDA kernel code; use .cc for host-only tests.
- Each test binary needs a main() that calls RUN_ALL_TESTS(). See unit/unit_tests_main.cc (single-process) and mp_unit/mp_unit_tests.cc (multi-process with Environment setup).
- Additional run options: --filter=-Pattern (exclude), --exclude-perf-tests (skip PERF_TESTs).
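
To illustrate what "auto-registers via static initialization" means, here is a minimal, self-contained sketch of the general technique. It is not the code in framework.cc: `mini::registry`, `mini::Registrar`, `mini::runMatching`, `MINI_TEST`, and `callCount` are hypothetical names, and treating the filter as a plain substring match (with a leading `-` inverting it) is an assumption about how `--filter` behaves.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace mini {

struct TestEntry {
  std::string name;  // "Suite.Name"
  std::function<void()> body;
};

// A function-local static is constructed on first use, so it is safe to call
// from other static initializers regardless of initialization order.
inline std::vector<TestEntry>& registry() {
  static std::vector<TestEntry> tests;
  return tests;
}

// Constructing a Registrar at namespace scope adds a test before main() runs.
struct Registrar {
  Registrar(std::string name, std::function<void()> body) {
    registry().push_back({std::move(name), std::move(body)});
  }
};

// Run tests whose name contains the pattern; a leading '-' excludes matches.
// Returns the number of tests that were run.
inline int runMatching(const std::string& filter) {
  const bool exclude = !filter.empty() && filter[0] == '-';
  const std::string pattern = exclude ? filter.substr(1) : filter;
  int ran = 0;
  for (const auto& t : registry()) {
    const bool hit = t.name.find(pattern) != std::string::npos;
    if (hit != exclude) {
      t.body();
      ++ran;
    }
  }
  return ran;
}

}  // namespace mini

// Declares the body, registers it through a static object, then opens the definition.
#define MINI_TEST(Suite, Name)                                         \
  static void Suite##_##Name##_body();                                 \
  static mini::Registrar Suite##_##Name##_registrar(#Suite "." #Name,  \
                                                    &Suite##_##Name##_body); \
  static void Suite##_##Name##_body()

// Counter used only to observe which bodies ran.
inline int& callCount() {
  static int n = 0;
  return n;
}

MINI_TEST(MyFeatureTest, BasicUsage) { ++callCount(); }
MINI_TEST(OtherTest, Basic) { ++callCount(); }
```

Because registration happens in the constructors of static objects, adding a source file to the build is enough for its tests to be found. No central list needs editing.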

Macros

| Macro | Behavior |
| --- | --- |
| `TEST(Suite, Name)` | Register a test. If `Suite` is a defined class, it's used as a fixture. |
| `PERF_TEST(Suite, Name)` | Same as `TEST` but marked as perf (skippable via `--exclude-perf-tests`). |
| `EXPECT_*` | Non-fatal assertions: `EXPECT_TRUE`, `EXPECT_FALSE`, `EXPECT_EQ`, `EXPECT_NE`, `EXPECT_LT`, `EXPECT_LE`, `EXPECT_GT`, `EXPECT_GE` |
| `ASSERT_*` | Fatal assertions (abort test on failure): same variants as `EXPECT_*`, plus `ASSERT_NO_THROW` |
| `FAIL()` | Fail immediately. Supports streaming: `FAIL() << "reason";` |
| `SKIP_TEST()` | Skip the current test. Supports streaming: `SKIP_TEST() << "reason";` |
| `CUDA_CHECK(call)` | Check a CUDA API return code, throw on error. |
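
The difference between non-fatal and fatal assertions can be shown with a small stand-alone sketch. The macros `MINI_EXPECT_EQ` and `MINI_ASSERT_EQ` and the helpers `mini::failures`, `stepsReached`, and `sampleTestBody` are hypothetical. Leaving the test with a plain `return` is one common way to implement a fatal assertion and may not be how framework.hpp does it.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace mini {
// Failure messages recorded by both macro families.
inline std::vector<std::string>& failures() {
  static std::vector<std::string> f;
  return f;
}
}  // namespace mini

// Non-fatal: record the failure and keep executing the test body.
#define MINI_EXPECT_EQ(a, b)                                         \
  do {                                                               \
    if (!((a) == (b))) {                                             \
      mini::failures().push_back("EXPECT_EQ(" #a ", " #b ")");       \
    }                                                                \
  } while (0)

// Fatal: record the failure and leave the enclosing void function.
#define MINI_ASSERT_EQ(a, b)                                         \
  do {                                                               \
    if (!((a) == (b))) {                                             \
      mini::failures().push_back("ASSERT_EQ(" #a ", " #b ")");       \
      return;                                                        \
    }                                                                \
  } while (0)

// Counts how many checkpoints the sample body reached.
inline int& stepsReached() {
  static int n = 0;
  return n;
}

inline void sampleTestBody() {
  MINI_EXPECT_EQ(1 + 1, 3);  // fails, execution continues
  ++stepsReached();
  MINI_ASSERT_EQ(2 * 2, 5);  // fails, the body returns here
  ++stepsReached();          // not reached
}
```

In practice this suggests using `ASSERT_*` for preconditions that later lines depend on (for example a non-null pointer) and `EXPECT_*` for independent checks, so that one run reports as many failures as possible.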

Fixtures

Define a class inheriting from mscclpp::test::TestCase with SetUp() / TearDown(), then use the class name as the suite name:

```
class MyFixture : public mscclpp::test::TestCase {
 public:
  void SetUp() override { /* per-test setup */ }
  void TearDown() override { /* per-test cleanup */ }
 protected:
  int sharedState_ = 0;
};

TEST(MyFixture, SomeTest) {
  sharedState_ = 42;
  EXPECT_EQ(sharedState_, 42);
}
```

See mp_unit/mp_unit_tests.hpp (BootstrapTest, CommunicatorTest, etc.) for real fixture examples.
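
The fixture lifecycle can be pictured with the following stand-alone sketch. It assumes GTest-like behavior: a fresh fixture object per test, `SetUp()` before the body and `TearDown()` after it. The names `mini::TestCase`, `mini::trace`, `mini::runOne`, and `MyFixture_SomeTest` are hypothetical stand-ins, and the real `mscclpp::test::TestCase` may differ in detail.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

namespace mini {

// Hypothetical stand-in for mscclpp::test::TestCase.
class TestCase {
 public:
  virtual ~TestCase() = default;
  virtual void SetUp() {}
  virtual void TearDown() {}
  virtual void TestBody() = 0;
};

// Records the order of calls so the lifecycle can be inspected.
inline std::vector<std::string>& trace() {
  static std::vector<std::string> t;
  return t;
}

// Assumed runner behavior: new object, SetUp, body, TearDown, destroy.
template <typename T>
void runOne() {
  auto test = std::make_unique<T>();
  test->SetUp();
  test->TestBody();
  test->TearDown();
}

}  // namespace mini

class MyFixture : public mini::TestCase {
 public:
  void SetUp() override { mini::trace().push_back("SetUp"); }
  void TearDown() override { mini::trace().push_back("TearDown"); }

 protected:
  int sharedState_ = 0;
};

// Roughly what TEST(MyFixture, SomeTest) could expand to in this sketch:
// a subclass of the fixture whose body can touch protected members.
class MyFixture_SomeTest : public MyFixture {
 public:
  void TestBody() override {
    mini::trace().push_back("Body saw " + std::to_string(sharedState_));
    sharedState_ = 42;
  }
};
```

Running `mini::runOne<MyFixture_SomeTest>()` twice records `Body saw 0` both times, because under this assumption state written by one test does not leak into the next.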

Global Environments

```
class MyEnv : public mscclpp::test::Environment {
 public:
  void SetUp() override { /* global init */ }
  void TearDown() override { /* global cleanup */ }
};

// In main(), before RUN_ALL_TESTS():
mscclpp::test::TestRegistry::instance().addEnvironment(new MyEnv());
```

See mp_unit/mp_unit_tests.cc for the MultiProcessTestEnv example.

Utilities

- mscclpp::test::utils::isMainRank(): true on MPI rank 0
- mscclpp::test::utils::getMPIRank() / getMPISize()
- mscclpp::test::utils::Timer: high-resolution timer with start(), stop(), elapsedMilliseconds()
- mscclpp::test::currentTestName(): returns "Suite.Name" for the running test
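
For readers who want a picture of the timer interface, a class with the same three method names can be sketched on top of `std::chrono::steady_clock`. `SketchTimer` is a hypothetical stand-in, not the implementation of `mscclpp::test::utils::Timer`. Returning a `double` and reporting a running total while the timer has not been stopped are assumptions.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in with the same method names as the framework's Timer.
class SketchTimer {
  using Clock = std::chrono::steady_clock;  // monotonic, unaffected by wall-clock changes

 public:
  void start() {
    begin_ = Clock::now();
    running_ = true;
  }

  void stop() {
    end_ = Clock::now();
    running_ = false;
  }

  // Elapsed time between start() and stop(), or up to now if still running.
  double elapsedMilliseconds() const {
    const Clock::time_point last = running_ ? Clock::now() : end_;
    return std::chrono::duration<double, std::milli>(last - begin_).count();
  }

 private:
  Clock::time_point begin_{};
  Clock::time_point end_{};
  bool running_ = false;
};
```

Typical use inside a PERF_TEST would be to call `start()` before the measured region and `stop()` after it, then report `elapsedMilliseconds()` divided by the iteration count.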