Address review comments:
1. Ensure TearDown() is always called if SetUp() succeeds, even when
TestBody() throws. This prevents resource leaks and maintains MPI
synchronization between tests.
2. Replace assert() in fifo_perf_tests.cu with proper return false
on validation failure, ensuring consistent test failure reporting.
Fixes:
- test/framework.cc: Track SetUp success and call TearDown in a finally-style block
- test/unit/fifo_perf_tests.cu: Replace assert with explicit check
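The TearDown guarantee from item 1 can be sketched roughly as follows. This is
a minimal illustration, not the code in test/framework.cc: `TestCase`,
`runTest`, and `ThrowingTest` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical minimal fixture interface; the real framework's types may differ.
struct TestCase {
  virtual ~TestCase() = default;
  virtual void SetUp() {}
  virtual void TestBody() = 0;
  virtual void TearDown() {}
};

// Runs one test. Returns true on success and appends failure text to `errors`.
bool runTest(TestCase& test, std::vector<std::string>& errors) {
  bool passed = true;
  bool setUpSucceeded = false;
  try {
    test.SetUp();
    setUpSucceeded = true;  // only reached when SetUp() did not throw
    test.TestBody();
  } catch (const std::exception& e) {
    passed = false;
    errors.push_back(e.what());
  } catch (...) {
    passed = false;
    errors.push_back("unknown exception");
  }
  // Finally-style cleanup: runs whenever SetUp() succeeded, regardless of
  // whether TestBody() threw.
  if (setUpSucceeded) {
    try {
      test.TearDown();
    } catch (const std::exception& e) {
      passed = false;
      errors.push_back(std::string("TearDown: ") + e.what());
    } catch (...) {
      passed = false;
      errors.push_back("TearDown: unknown exception");
    }
  }
  return passed;
}

// Example fixture whose body throws; TearDown() records that it ran.
struct ThrowingTest : TestCase {
  bool tornDown = false;
  void TestBody() override { throw std::runtime_error("boom"); }
  void TearDown() override { tornDown = true; }
};
```

Running `ThrowingTest` through `runTest` reports a failure while still
executing `TearDown()`, which is what keeps later tests in sync.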
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt)
- Remove add_subdirectory(perf) from test/CMakeLists.txt
- Performance tests now integrated into unit_tests as fifo_perf_tests.cu
- Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace
- Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE
- Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ
- Add missing #include <unistd.h> to executor_tests.cc
All perf test functionality is now in unit_tests and can be filtered
with --exclude-perf-tests flag. The standalone test/perf/ directory
is no longer needed.
Verified builds:
- unit_tests: ✅
- mp_unit_tests: ✅
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add unit_tests_main.cc with main() function for unit_tests executable
- Create fifo_perf_tests.cu as PERF_TEST for unit_tests
- Add fifo_perf_tests.cu to unit_tests sources
- Fix errors_tests.cc to use ASSERT_TRUE for ErrorCode comparisons
- Fix core_tests.cc to use ASSERT_TRUE for TransportFlags comparisons
- Add Azure pipeline step for Debug build with coverage
- Add step to run mp_unit_tests --exclude-perf-tests with coverage
The perf tests are now part of unit_tests and can be filtered out
for coverage reporting. CI now includes Debug build with coverage
collection for non-performance tests.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add nlohmann::ordered_json metrics field to TestResult struct
- Add nlohmann/json.hpp include to test/framework.hpp
- Link test_framework with nlohmann_json::nlohmann_json
- Replace PerfTestResult with TestResult in test/perf/framework.cc
- Move perf utility functions to utils namespace for consistency
- Remove duplicate PerfTestResult struct definition
This consolidates the two similar structs into one, reducing code
duplication while maintaining all necessary fields for both unit
tests (passed/failure_message) and performance tests (metrics).
Verified build succeeds with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
"cd /workspace/build && make -j4 fifo_test"
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Make MPI REQUIRED for test builds (clearer error messages)
- Add project include directories to test_framework library
- Fix core_tests.cc to use custom framework correctly
- Fix mp_unit_tests.hpp to use mscclpp::test namespace
- Add FAIL() macro with streaming support for test messages
- Building tests now works in Docker environment with GPU bypass
Tests can now be built using:
docker run --rm -v $(pwd):/workspace -w /workspace \
ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
"mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
-DMSCCLPP_USE_CUDA=ON .. && make -j"
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h
- Enhanced test/framework.hpp with GTest-compatible APIs:
- Added Environment base class for global test setup/teardown
- Added TestInfo and UnitTest classes for test metadata access
- Added GTEST_SKIP macro support via SkipHelper class
- Added namespace alias 'testing' for compatibility
- Added InitGoogleTest and AddGlobalTestEnvironment helper functions
- Updated test/framework.cc with implementations for new classes
- All mp_unit test files now use framework.hpp through mp_unit_tests.hpp
- Formatting applied via lint.sh
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Previously, `gpuCalloc*()` created a new stream for each allocation,
which cluttered the timeline in profiler traces. `GpuStreamPool` now
allows reusing these temporary streams.
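As a rough illustration of the reuse idea only (not the actual `GpuStreamPool`
design), a single-threaded pool can hand out an idle handle before creating a
new one. `FakeStream` and `StreamPool` are hypothetical stand-ins; no GPU API
is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for a GPU stream handle (hypothetical; no real stream is created).
struct FakeStream {
  int id;
};

// Single-threaded sketch of a pool that reuses idle streams.
class StreamPool {
 public:
  // Returns an idle stream if one exists, otherwise creates a new one.
  // A stream counts as idle when only the pool still holds a reference.
  std::shared_ptr<FakeStream> acquire() {
    for (auto& stream : streams_) {
      if (stream.use_count() == 1) return stream;
    }
    const int nextId = static_cast<int>(streams_.size());
    streams_.push_back(std::make_shared<FakeStream>(FakeStream{nextId}));
    return streams_.back();
  }

  // Number of streams created so far.
  std::size_t size() const { return streams_.size(); }

 private:
  std::vector<std::shared_ptr<FakeStream>> streams_;
};
```

Once callers drop their handles, later `acquire()` calls return existing
streams, so the number of streams stays bounded by peak concurrent use.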
* When the same `tag` is used for receiving data from the same remote
rank, #514 changed `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, this change restores it. Now, when `get()`
is called on a future, the async function first calls `wait()` on the
latest previously returned future. Recursively, this calls `wait()` on
all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
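The future-chaining behavior described above can be sketched with standard
C++ deferred futures. This is an illustration under assumptions, not the
`Communicator` implementation: `OrderedReceiver` and its integer "messages"
are hypothetical.

```cpp
#include <cassert>
#include <future>

// Hypothetical stand-in for a receiver; not the MSCCL++ Communicator.
// The receiver must outlive the futures it returns, since they capture `this`.
class OrderedReceiver {
 public:
  // Returns a deferred future. Whatever order get() is called in, values are
  // consumed in the order in which recv() was called.
  std::shared_future<int> recv() {
    std::shared_future<int> prev = last_;  // latest previously returned future
    last_ = std::async(std::launch::deferred, [this, prev]() {
              // Waiting on the previous future runs its deferred task first,
              // which in turn waits on its own predecessor, and so on.
              if (prev.valid()) prev.wait();
              return nextValue_++;  // stand-in for "receive the next message"
            }).share();
    return last_;
  }

 private:
  std::shared_future<int> last_;
  int nextValue_ = 0;
};
```

Calling `get()` on the third future first resolves the first and second, so
each future yields the value matching its call position.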
Cherry-picked some of the features from #167: `Communicator::setup()`
is no longer needed. `Communicator::sendMemory()` performs the task
inline, while `Communicator::recvMemory()` and `Communicator::connect()`
perform the task asynchronously without explicit setup.
Remove `__assert_fail` for release builds. This reduces the number of
PTX instructions inside the loop from 8 to 6, and is also an attempt to
resolve the issue reported in #497. 8-rank signal/wait latency drops
from 3.2 us to 2.8 us on NDv5.
The `NDEBUG` flag is also unreliable here, since it is sometimes not
set, so use a customized flag for debug builds.
Here is the current PTX:
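A host-side sketch of gating assertions on a dedicated flag rather than
`NDEBUG` is shown below. `EXAMPLE_DEBUG_BUILD` and `EXAMPLE_ASSERT` are
hypothetical names for illustration; the actual flag and macro in this change
may differ.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// EXAMPLE_DEBUG_BUILD is a hypothetical flag that the build system would
// define explicitly for debug builds, instead of relying on NDEBUG.
#if defined(EXAMPLE_DEBUG_BUILD)
#define EXAMPLE_ASSERT(cond)                                      \
  do {                                                            \
    if (!(cond)) {                                                \
      std::fprintf(stderr, "Assertion failed: %s (%s:%d)\n",      \
                   #cond, __FILE__, __LINE__);                    \
      std::abort();                                               \
    }                                                             \
  } while (0)
#else
// Without the flag, the check expands to nothing and `cond` is not evaluated.
#define EXAMPLE_ASSERT(cond) ((void)0)
#endif

// Counts evaluations so the effect of the flag is observable.
inline bool countingCheck(int& evaluations) {
  ++evaluations;
  return true;
}

// Returns how often the check ran: 0 unless EXAMPLE_DEBUG_BUILD is defined.
inline int runCheckedStep() {
  int evaluations = 0;
  EXAMPLE_ASSERT(countingCheck(evaluations));
  return evaluations;
}
```

Because the check disappears entirely when the flag is absent, no assertion
failure path is emitted in that configuration.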
```
ld.u64 %rd12, [%rd2+-24];
mov.u64 %rd13, %rd12;
mov.u64 %rd11, %rd13;
ld.acquire.sys.b64 %rd10,[%rd11];
setp.lt.u64 %p1, %rd10, %rd3;
@%p1 bra $L__BB2_1;
```
If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" :
"=l"(flag) : "l"(flag_addr));`, this reduces to 4 instructions, and we
can get 2.1 us for 8-rank signal/wait:
```
ld.u64 %rd9, [%rd1+-24];
ld.global.acquire.sys.b64 %rd8, [%rd9];
setp.lt.u64 %p1, %rd8, %rd2;
@%p1 bra $L__BB2_1;
```
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions to
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection-establishing functionality from
`Communicator`, which is now only responsible for tying together the
bootstrapper with a context.
The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because, with the way `Context` is
implemented, a `Connection` is now fully initialized on construction.
Some smaller breaking API changes are that `RegisteredMemory` no longer
has a `rank()` function (as there may be no concept of rank), and
similarly `Connection` has no `remoteRank()` and `tag()` functions. The
latter are replaced by the `remoteRankOf` and `tagOf` functions in
`Communicator`.
A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.
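The implicit conversion can be sketched with a non-explicit constructor. The
types below are simplified stand-ins, and the `ibMaxCqSize` field, its default
value, and `cqSizeOf` are assumptions made for the example rather than the
actual `EndpointConfig` layout.

```cpp
#include <cassert>

// Simplified stand-in for the transport enum (illustrative subset).
enum class Transport { CudaIpc, IB0 };

// Simplified stand-in for EndpointConfig; the field and default are hypothetical.
struct EndpointConfig {
  Transport transport;
  int ibMaxCqSize;

  // Non-explicit constructor: enables implicit Transport -> EndpointConfig.
  EndpointConfig(Transport t, int maxCqSize = 1024)
      : transport(t), ibMaxCqSize(maxCqSize) {}
};

// Hypothetical API taking a config; callers may still pass a bare Transport.
inline int cqSizeOf(const EndpointConfig& config) { return config.ibMaxCqSize; }

inline Transport transportOf(const EndpointConfig& config) {
  return config.transport;
}
```

Passing `Transport::IB0` directly picks up the defaults, while callers that
need custom IB parameters construct an `EndpointConfig` explicitly.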
Miscellaneous changes:
- Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
- Enables ctest to be run from the build root directory.
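The PIMPL cleanup mentioned above follows the general shape sketched here.
`Counter` is a hypothetical class used only to show a private `Impl` struct
and a private `pimpl_` pointer; it is not part of the core API.

```cpp
#include <cassert>
#include <memory>

// Hypothetical class illustrating a fully private PIMPL.
class Counter {
 public:
  Counter();
  ~Counter();  // defined below, where Impl is a complete type
  void increment();
  int value() const;

 private:
  struct Impl;                   // private: cannot be named outside Counter
  std::unique_ptr<Impl> pimpl_;  // private: implementation state stays hidden
};

// In a real project these definitions would live in the .cc file.
struct Counter::Impl {
  int count = 0;
};

Counter::Counter() : pimpl_(std::make_unique<Impl>()) {}
Counter::~Counter() = default;
void Counter::increment() { ++pimpl_->count; }
int Counter::value() const { return pimpl_->count; }
```

With both members private, external code can only reach the implementation
through the public methods.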