The NVLS HT B2 path introduced in 3ab2e43b activated whenever isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE that condition is true (H100 has intra-node NVLink multicast), but there is no cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, whose handle exchange routes through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain socket that worker ranks cannot reach.

Symptom on every commit since 3ab2e43b:

    RuntimeError: connect() failed for unix socket to /tmp/mscclpp_bootstrap_<pid>.sock

MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored.

src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper. Resolution:

  1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off, 1/on/true/yes/force => on, otherwise auto).
  2. Auto-detect: requires both
     - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1
     - device compute capability >= sm_100 (Blackwell+).

Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled (HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled. Diagnostic prints now show fabric_ipc=.

test/python/ext/ep/test_internode_multirank.py: replace hardcoded NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count(). This makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node) without code changes.

src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the underlying symbol was renamed); preserve the legacy 3-arg cuCtxCreate for older toolkits.

Verified on 2x H100 NDv5 at HEAD:

    LL intranode (8 GPUs)            PASS
    LL internode (16 GPUs, 2 nodes)  PASS
    HT intranode (8 GPUs)            PASS
    HT internode (16 GPUs, 2 nodes)  PASS

Diagnostic on H100:

    [mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0)
    [mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)

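The resolution order described for resolve_fabric_ipc_supported() can be sketched in plain C++ as follows. This is not the code from src/ext/ep/buffer.cc: the names FabricIpcOverride, parseFabricIpcOverride and resolveFabricIpcSupported are hypothetical, the two device properties are passed in as plain arguments instead of being queried from the CUDA driver, and lower-casing the env value is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical tri-state for the MSCCLPP_EP_FABRIC_IPC override.
enum class FabricIpcOverride { Off, On, Auto };

// Map the raw env value to a tri-state. Lower-casing the input is an
// assumption of this sketch; the commit only lists lower-case spellings.
FabricIpcOverride parseFabricIpcOverride(const char* raw) {
  if (raw == nullptr) return FabricIpcOverride::Auto;
  std::string v(raw);
  std::transform(v.begin(), v.end(), v.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (v == "0" || v == "off" || v == "false" || v == "no") {
    return FabricIpcOverride::Off;
  }
  if (v == "1" || v == "on" || v == "true" || v == "yes" || v == "force") {
    return FabricIpcOverride::On;
  }
  return FabricIpcOverride::Auto;
}

// Resolve the final decision: an explicit override wins, otherwise both
// auto-detect conditions must hold. `fabricHandleSupported` stands in for
// CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1, and `smMajor` for
// the compute-capability major version (sm_100 corresponds to 10).
bool resolveFabricIpcSupported(const char* envValue, bool fabricHandleSupported,
                               int smMajor) {
  switch (parseFabricIpcOverride(envValue)) {
    case FabricIpcOverride::Off:
      return false;
    case FabricIpcOverride::On:
      return true;
    case FabricIpcOverride::Auto:
      break;
  }
  return fabricHandleSupported && smMajor >= 10;
}
```

A caller would typically pass std::getenv("MSCCLPP_EP_FABRIC_IPC") as the first argument. With no override, an sm_90 device resolves to off regardless of the fabric-handle attribute, which matches the H100 fallback described above.
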
MSCCL++ C++ Test Framework
A lightweight, GTest-like test framework with MPI support for testing MSCCL++ C++ APIs. Defined in framework.hpp / framework.cc.
Adding a New Test (Step-by-Step)
Single-process test (unit/)
- Create the test file test/unit/my_feature_tests.cc (or .cu for CUDA):

      #include "../framework.hpp"
      #include <mscclpp/my_feature.hpp>

      TEST(MyFeatureTest, BasicUsage) {
        EXPECT_EQ(myFunction(), 42);
      }

- Register it in CMake by adding the filename to test/unit/CMakeLists.txt:

      target_sources(unit_tests PRIVATE
        ...
        my_feature_tests.cc  # <-- add here
      )

- Build and run:

      cmake --build build -j
      ./build/test/unit_tests --filter=MyFeatureTest

Multi-process test (mp_unit/)
- Create the test file test/mp_unit/my_feature_tests.cc (or .cu):

      #include "mp_unit_tests.hpp"

      TEST(MyFeatureTest, MultiRank) {
        int rank = gEnv->rank;
        EXPECT_GE(rank, 0);
      }

  Use fixtures from mp_unit_tests.hpp (e.g., CommunicatorTest) if you need pre-established connections.

- Register it in CMake by adding the filename to test/mp_unit/CMakeLists.txt:

      target_sources(mp_unit_tests PRIVATE
        ...
        my_feature_tests.cc  # <-- add here
      )

- Build and run:

      cmake --build build -j
      mpirun -np 2 ./build/test/mp_unit_tests --filter=MyFeatureTest

Notes
- No separate test registration step is needed: TEST() auto-registers via static initialization.
- The test_framework static library is built from framework.cc in the top-level test/CMakeLists.txt and linked into both unit_tests and mp_unit_tests. You do not need to modify it.
- Use the .cu extension for files that contain CUDA kernel code; use .cc for host-only tests.
- Each test binary needs a main() that calls RUN_ALL_TESTS(). See unit/unit_tests_main.cc (single-process) and mp_unit/mp_unit_tests.cc (multi-process with Environment setup).
- Additional run options: --filter=-Pattern (exclude), --exclude-perf-tests (skip PERF_TESTs).
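
To make the "auto-registers via static initialization" note concrete, here is a minimal, self-contained sketch of the general technique. It is not the code in framework.hpp: the sketch namespace, TestEntry, Registrar, SKETCH_TEST and runAll are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

struct TestEntry {
  std::string name;            // "Suite.Name"
  std::function<void()> body;  // the test body
};

// A function-local static sidesteps static-initialization-order issues:
// the vector is constructed on first use, before any registrar touches it.
std::vector<TestEntry>& registry() {
  static std::vector<TestEntry> tests;
  return tests;
}

// The constructor runs during static initialization, i.e. before main().
struct Registrar {
  Registrar(std::string name, std::function<void()> body) {
    assert(body && "test body must be callable");
    registry().push_back({std::move(name), std::move(body)});
  }
};

}  // namespace sketch

// Hypothetical macro: declares the body, defines a static Registrar that
// records it, then opens the body's definition.
#define SKETCH_TEST(Suite, Name)                                  \
  static void Suite##_##Name##_body();                            \
  static sketch::Registrar Suite##_##Name##_registrar(            \
      #Suite "." #Name, &Suite##_##Name##_body);                  \
  static void Suite##_##Name##_body()

static int g_counter = 0;

SKETCH_TEST(MathTest, Adds) { g_counter += 1; }
SKETCH_TEST(MathTest, AddsAgain) { g_counter += 2; }

// Run every registered test and return how many ran.
int runAll() {
  int ran = 0;
  for (auto& t : sketch::registry()) {
    t.body();
    ++ran;
  }
  return ran;
}
```

Because each SKETCH_TEST defines a static object with a constructor, simply compiling the file into the binary is enough for the test to appear in the registry; that is the property that makes a separate registration step unnecessary.
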
Macros
| Macro | Behavior |
|---|---|
| TEST(Suite, Name) | Register a test. If Suite is a defined class, it's used as a fixture. |
| PERF_TEST(Suite, Name) | Same as TEST but marked as perf (skippable via --exclude-perf-tests). |
| EXPECT_* | Non-fatal assertions: EXPECT_TRUE, EXPECT_FALSE, EXPECT_EQ, EXPECT_NE, EXPECT_LT, EXPECT_LE, EXPECT_GT, EXPECT_GE |
| ASSERT_* | Fatal assertions (abort test on failure): same variants as EXPECT_*, plus ASSERT_NO_THROW |
| FAIL() | Fail immediately. Supports streaming: FAIL() << "reason"; |
| SKIP_TEST() | Skip the current test. Supports streaming: SKIP_TEST() << "reason"; |
| CUDA_CHECK(call) | Check a CUDA API return code, throw on error. |
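
The difference between the non-fatal and fatal assertion families can be illustrated with a small stand-alone sketch. The SKETCH_* macros and the g_failures / g_reached counters below are hypothetical; in particular, leaving the test body with a plain return is an assumption borrowed from GTest-style frameworks, not something this README states about framework.hpp.

```cpp
#include <iostream>

// Hypothetical bookkeeping; the real framework's internals are not shown here.
static int g_failures = 0;  // number of failed checks
static int g_reached = 0;   // statements reached after a check

// Non-fatal: record the failure and keep executing the test body.
#define SKETCH_EXPECT_EQ(a, b)                                \
  do {                                                        \
    if (!((a) == (b))) {                                      \
      ++g_failures;                                           \
      std::cerr << "EXPECT_EQ failed: " #a " == " #b "\n";    \
    }                                                         \
  } while (0)

// Fatal: record the failure and leave the test body immediately.
// The bare return means this only works inside a void function.
#define SKETCH_ASSERT_EQ(a, b)                                \
  do {                                                        \
    if (!((a) == (b))) {                                      \
      ++g_failures;                                           \
      std::cerr << "ASSERT_EQ failed: " #a " == " #b "\n";    \
      return;                                                 \
    }                                                         \
  } while (0)

void nonFatalBody() {
  SKETCH_EXPECT_EQ(1 + 1, 3);  // fails, execution continues
  ++g_reached;                 // reached
}

void fatalBody() {
  SKETCH_ASSERT_EQ(1 + 1, 3);  // fails, body returns here
  ++g_reached;                 // not reached
}
```

In practice this suggests using ASSERT_* for preconditions whose failure would make the rest of the test meaningless (for example a null pointer that is dereferenced next) and EXPECT_* for independent checks, so that one run reports as many failures as possible.
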
Fixtures
Define a class inheriting from mscclpp::test::TestCase with SetUp() / TearDown(), then use the class name as the suite name:

    class MyFixture : public mscclpp::test::TestCase {
     public:
      void SetUp() override { /* per-test setup */ }
      void TearDown() override { /* per-test cleanup */ }

     protected:
      int sharedState_ = 0;
    };

    TEST(MyFixture, SomeTest) {
      sharedState_ = 42;
      EXPECT_EQ(sharedState_, 42);
    }

See mp_unit/mp_unit_tests.hpp (BootstrapTest, CommunicatorTest, etc.) for real fixture examples.
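
One way a fixture-based TEST() could work under the hood is sketched below, assuming the common GTest-style design: the test body becomes a member function of a class derived from the fixture, and a driver calls SetUp(), the body, then TearDown() on a fresh object. The TestBody() method, the runOne() driver and the MyFixture_SomeTest class name are hypothetical and not taken from framework.hpp.

```cpp
#include <string>
#include <vector>

namespace sketch {

// Records the order of calls so the lifecycle is observable.
static std::vector<std::string> g_log;

class TestCase {
 public:
  virtual ~TestCase() = default;
  virtual void SetUp() {}
  virtual void TearDown() {}
  virtual void TestBody() = 0;
};

// Assumed driver: a fresh fixture object per test, SetUp -> body -> TearDown.
template <typename T>
void runOne() {
  T test;
  test.SetUp();
  test.TestBody();
  test.TearDown();
}

class MyFixture : public TestCase {
 public:
  void SetUp() override {
    sharedState_ = 7;
    g_log.push_back("SetUp");
  }
  void TearDown() override { g_log.push_back("TearDown"); }

 protected:
  int sharedState_ = 0;
};

// What TEST(MyFixture, SomeTest) { ... } could expand to: a subclass whose
// TestBody() holds the braces' content. Deriving from the fixture is what
// puts protected members such as sharedState_ in scope inside the body.
class MyFixture_SomeTest : public MyFixture {
 public:
  void TestBody() override {
    g_log.push_back("Body:" + std::to_string(sharedState_));
  }
};

}  // namespace sketch
```

Calling sketch::runOne<sketch::MyFixture_SomeTest>() appends "SetUp", "Body:7" and "TearDown" to the log in that order, showing that state initialized in SetUp() is visible to the body.
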
Global Environments
Register an Environment subclass for one-time global setup/teardown (e.g., MPI bootstrap):

    class MyEnv : public mscclpp::test::Environment {
     public:
      void SetUp() override { /* global init */ }
      void TearDown() override { /* global cleanup */ }
    };

    // In main(), before RUN_ALL_TESTS():
    mscclpp::test::TestRegistry::instance().addEnvironment(new MyEnv());

See mp_unit/mp_unit_tests.cc for the MultiProcessTestEnv example.
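
The one-time nature of global environments can be sketched as follows. This is a simplified model, not the real TestRegistry: the Registry class and its runAll() method are hypothetical, and both the ownership transfer in addEnvironment() and the reverse-order teardown are assumptions modeled on GTest-style frameworks.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

namespace sketch {

class Environment {
 public:
  virtual ~Environment() = default;
  virtual void SetUp() {}
  virtual void TearDown() {}
};

class Registry {
 public:
  // Takes ownership of the raw pointer (assumption, mirroring the
  // addEnvironment(new MyEnv()) call style).
  void addEnvironment(Environment* env) {
    assert(env != nullptr);
    envs_.emplace_back(env);
  }

  // Global SetUp once, run all tests, then global TearDown in reverse order.
  void runAll(const std::vector<std::function<void()>>& tests) {
    for (auto& e : envs_) e->SetUp();
    for (auto& t : tests) t();
    for (auto it = envs_.rbegin(); it != envs_.rend(); ++it) (*it)->TearDown();
  }

 private:
  std::vector<std::unique_ptr<Environment>> envs_;
};

// Records events so the ordering is observable.
static std::vector<std::string> g_events;

class LoggingEnv : public Environment {
 public:
  void SetUp() override { g_events.push_back("env-setup"); }
  void TearDown() override { g_events.push_back("env-teardown"); }
};

}  // namespace sketch
```

Running two tests through such a registry yields a single "env-setup" before the first test and a single "env-teardown" after the last, in contrast to fixture SetUp() / TearDown(), which run around every test.
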
Utilities
- mscclpp::test::utils::isMainRank(): true on MPI rank 0
- mscclpp::test::utils::getMPIRank() / getMPISize()
- mscclpp::test::utils::Timer: high-resolution timer with start(), stop(), elapsedMilliseconds()
- mscclpp::test::currentTestName(): returns "Suite.Name" for the running test
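
For readers unfamiliar with this kind of helper, a timer with the same three method names can be built on std::chrono::steady_clock, as in the sketch below. It only illustrates the start() / stop() / elapsedMilliseconds() shape; the actual mscclpp::test::utils::Timer may differ, and measureSleepMs() is a hypothetical usage helper.

```cpp
#include <chrono>
#include <thread>

namespace sketch {

class Timer {
  using Clock = std::chrono::steady_clock;  // monotonic, unaffected by wall-clock changes

 public:
  void start() {
    begin_ = Clock::now();
    running_ = true;
  }

  void stop() {
    end_ = Clock::now();
    running_ = false;
  }

  // Time between start() and stop(); while still running, time since start().
  double elapsedMilliseconds() const {
    const Clock::time_point end = running_ ? Clock::now() : end_;
    return std::chrono::duration<double, std::milli>(end - begin_).count();
  }

 private:
  Clock::time_point begin_{};
  Clock::time_point end_{};
  bool running_ = false;
};

// Usage example: time a sleep of roughly `ms` milliseconds.
double measureSleepMs(int ms) {
  Timer timer;
  timer.start();
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
  timer.stop();
  return timer.elapsedMilliseconds();
}

}  // namespace sketch
```

A steady clock is the usual choice for benchmarks such as PERF_TESTs because its readings never go backwards, so the computed interval cannot be negative.
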