This PR adds example code for testing the switch channel. It validates
the switch channel in single-node and multi-node environments. A
description of the algorithms and an explanation of the code still need
to be added under doc.
Example outputs:
rank0:
./bidir_switch_channel 10.0.5.233:45571 0 0
Rank 0 (GPU 0): Preparing for tests ...
Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
Rank 0: Succeed!
rank1:
./bidir_switch_channel 10.0.5.233:45571 1 0
Rank 1 (GPU 0): Preparing for tests ...
Rank 1: Succeed!
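The BW column in the output above is consistent with bytes divided by the per-iteration time, with 1 GB taken as 1e9 bytes. A minimal sketch of that arithmetic follows; the helper name `bandwidthGBps` is hypothetical and is not taken from the example source.

```cpp
#include <cassert>

// Hypothetical helper (not from the example code): converts a payload size
// and a per-iteration latency in milliseconds into GB/s, with 1 GB = 1e9 bytes.
double bandwidthGBps(double bytes, double elapsedMsPerIter) {
  assert(elapsedMsPerIter > 0.0);  // a zero or negative time is meaningless
  const double seconds = elapsedMsPerIter * 1e-3;
  return bytes / seconds / 1e9;
}
```

For the first reported line, `bandwidthGBps(4096, 0.0062328)` gives roughly 0.657 GB/s, which agrees with the printed value.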
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
This addresses an installation issue seen in some environments.
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.
This ensures the buffer is not deallocated when ownership is
transferred from C++ to Python.
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.
1. Algorithm Selection Refactoring
Use a separate class for algorithm selection. This makes room for more
complex selection logic based on message size, architecture, whether
CUDA graph is enabled, and the memory allocation method.
2. Symmetric Memory Support
Introduce the symmetricMemory parameter in algorithm context key
generation. Remove the disableChannelCache environment variable, as it
is ambiguous.
3. Add new arguments for build_default_algorithms
Add the flag_buffer and flag_buffer_size arguments to
build_default_algorithms. A unified flag buffer can then be shared
across algorithms, which avoids the application hanging when the
algorithm switches between message sizes.
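As a rough illustration of what a dedicated selection class can look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the names `SelectionInput`, `AlgoKind`, and `AlgoSelector`, the size threshold, and the policy itself are hypothetical and do not reflect the actual MSCCL++ selection logic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical inputs to the selection decision.
struct SelectionInput {
  std::size_t messageBytes;
  bool symmetricMemory;
};

// Hypothetical algorithm families, named only for this sketch.
enum class AlgoKind { LowLatency, HighBandwidth, HighBandwidthSymmetric };

// Keeps the selection policy in one place so that further criteria
// (architecture, CUDA graph, ...) could be added without touching callers.
class AlgoSelector {
 public:
  explicit AlgoSelector(std::size_t smallMessageLimit) : smallMessageLimit_(smallMessageLimit) {
    assert(smallMessageLimit_ > 0);
  }

  AlgoKind select(const SelectionInput& in) const {
    // Assumed policy: small messages favor latency, large ones bandwidth.
    if (in.messageBytes <= smallMessageLimit_) return AlgoKind::LowLatency;
    if (in.symmetricMemory) return AlgoKind::HighBandwidthSymmetric;
    return AlgoKind::HighBandwidth;
  }

 private:
  std::size_t smallMessageLimit_;
};
```

Isolating the policy this way means a change in thresholds or criteria stays local to the selector.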
---------
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
This adds support for fusing the ReadPutPacket operation in the DSL,
which reduces the overhead of reading the same packet data from the
scratch buffer multiple times. Fusion occurs when two rppkt operations
are executed consecutively with the same src_buffer:
rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1])
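The fusion rule above can be sketched as a single pass over an operation list. This is an illustrative model only: the struct `ReadPutPacketOp`, the function `fuseConsecutive`, and the use of strings for buffers are assumptions, not the DSL's actual representation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of an rppkt operation: one source, one or more destinations.
struct ReadPutPacketOp {
  std::string src;
  std::vector<std::string> dsts;
};

// Merges each operation into the previous one when both read the same source,
// so rppkt(src, dst0) followed by rppkt(src, dst1) becomes rppkt(src, [dst0, dst1]).
std::vector<ReadPutPacketOp> fuseConsecutive(const std::vector<ReadPutPacketOp>& ops) {
  std::vector<ReadPutPacketOp> fused;
  for (const auto& op : ops) {
    assert(!op.dsts.empty());  // every operation must write somewhere
    if (!fused.empty() && fused.back().src == op.src) {
      fused.back().dsts.insert(fused.back().dsts.end(), op.dsts.begin(), op.dsts.end());
    } else {
      fused.push_back(op);
    }
  }
  return fused;
}
```

Only adjacent operations are merged, so an intervening operation with a different source keeps the surrounding ones separate.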
Co-authored-by: Binyang Li <binyli@microsoft.com>
- Remove complex wildcard pattern matching (*, ?, negative patterns)
- Use simple substring matching with find()
- Simpler implementation, easier to understand and maintain
- Still supports --gtest_filter for basic test name filtering
Note: For advanced filtering like wildcards, users can use multiple
test runs with different substring filters.
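A minimal sketch of substring-based filtering with find() is shown below. The function names `parseFilterArg` and `matchesFilter` are hypothetical; only the `--gtest_filter` argument and the use of find() come from the description above.

```cpp
#include <cassert>
#include <string>

// Returns the text after "--gtest_filter=", or an empty string if the
// argument is something else. (Hypothetical helper for this sketch.)
std::string parseFilterArg(const std::string& arg) {
  const std::string prefix = "--gtest_filter=";
  if (arg.compare(0, prefix.size(), prefix) == 0) return arg.substr(prefix.size());
  return "";
}

// A test runs when the filter is empty or appears anywhere in its full name.
// Characters such as '*' and '?' are treated literally, not as wildcards.
bool matchesFilter(const std::string& fullTestName, const std::string& filter) {
  assert(!fullTestName.empty());
  if (filter.empty()) return true;
  return fullTestName.find(filter) != std::string::npos;
}
```

With this scheme, `matchesFilter("FifoTest.Basic", "Fifo")` is true, while a pattern such as `Fifo*` matches only names that contain a literal asterisk.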
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt)
- Remove add_subdirectory(perf) from test/CMakeLists.txt
- Performance tests now integrated into unit_tests as fifo_perf_tests.cu
- Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace
- Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE
- Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ
- Add missing #include <unistd.h> to executor_tests.cc
All perf test functionality is now in unit_tests and can be filtered
with --exclude-perf-tests flag. The standalone test/perf/ directory
is no longer needed.
Verified builds:
- unit_tests: ✅
- mp_unit_tests: ✅
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Remove another set of accidentally committed build artifacts in the
build2/ directory. The .gitignore pattern build_*/ should prevent these
in the future.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add unit_tests_main.cc with main() function for unit_tests executable
- Create fifo_perf_tests.cu as PERF_TEST for unit_tests
- Add fifo_perf_tests.cu to unit_tests sources
- Fix errors_tests.cc to use ASSERT_TRUE for ErrorCode comparisons
- Fix core_tests.cc to use ASSERT_TRUE for TransportFlags comparisons
- Add Azure pipeline step for Debug build with coverage
- Add step to run mp_unit_tests --exclude-perf-tests with coverage
The perf tests are now part of unit_tests and can be filtered out
for coverage reporting. CI now includes Debug build with coverage
collection for non-performance tests.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add isPerfTest field to TestInfoInternal struct
- Add --exclude-perf-tests command line argument
- Add PERF_TEST and PERF_TEST_F macros for marking performance tests
- Update runAllTests to filter performance tests when requested
- Remove genhtml dependency and HTML report generation
- Keep only coverage.info file generation with lcov
Performance tests can now be excluded with:
./build/bin/unit_tests --exclude-perf-tests
./build/bin/mp_unit_tests --exclude-perf-tests
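The filtering step can be sketched as follows. `TestInfoInternal`, `isPerfTest`, and `--exclude-perf-tests` are named in the change above; the other fields and the functions `hasExcludePerfFlag` and `selectTests` are hypothetical and stand in for whatever runAllTests does internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <iterator>
#include <string>
#include <vector>

struct TestInfoInternal {
  std::string suiteName;  // hypothetical field
  std::string testName;   // hypothetical field
  bool isPerfTest;        // set for tests declared with PERF_TEST / PERF_TEST_F
};

// Scans the command line for the --exclude-perf-tests flag.
bool hasExcludePerfFlag(int argc, const char* const* argv) {
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "--exclude-perf-tests") == 0) return true;
  }
  return false;
}

// Returns the tests to run, dropping performance tests when requested.
std::vector<TestInfoInternal> selectTests(const std::vector<TestInfoInternal>& all,
                                          bool excludePerfTests) {
  std::vector<TestInfoInternal> selected;
  std::copy_if(all.begin(), all.end(), std::back_inserter(selected),
               [excludePerfTests](const TestInfoInternal& t) {
                 return !(excludePerfTests && t.isPerfTest);
               });
  assert(selected.size() <= all.size());
  return selected;
}
```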
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add nlohmann::ordered_json metrics field to TestResult struct
- Add nlohmann/json.hpp include to test/framework.hpp
- Link test_framework with nlohmann_json::nlohmann_json
- Replace PerfTestResult with TestResult in test/perf/framework.cc
- Move perf utility functions to utils namespace for consistency
- Remove duplicate PerfTestResult struct definition
This consolidates the two similar structs into one, reducing code
duplication while maintaining all necessary fields for both unit
tests (passed/failure_message) and performance tests (metrics).
Verified build succeeds with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
"cd /workspace/build && make -j4 fifo_test"
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove build_test/ directory containing CMake cache and build files
- Update .gitignore to exclude build_*/ pattern to prevent future accidents
These CMake artifacts (CMakeCache.txt, CMakeFiles/, generated headers)
were accidentally committed and should never be in version control.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Move helper classes inside namespace for proper access
- Remove duplicate class definitions outside namespace
- Tests can now build in Docker with CUDA toolkit installed
- Remaining issues: ErrorCode and TransportFlags need operator<< for EXPECT_EQ
Successfully building with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
"mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
-DMSCCLPP_USE_CUDA=ON .. && make -j4"
Note: Some unit tests (errors_tests.cc, core_tests.cc) need operator<<
defined for ErrorCode and TransportFlags to compile with custom framework.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Make MPI REQUIRED for test builds (clearer error messages)
- Add project include directories to test_framework library
- Fix core_tests.cc to use custom framework correctly
- Fix mp_unit_tests.hpp to use mscclpp::test namespace
- Add FAIL() macro with streaming support for test messages
- Building tests now works in Docker environment with GPU bypass
Tests can now be built using:
docker run --rm -v $(pwd):/workspace -w /workspace \
ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
"mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
-DMSCCLPP_USE_CUDA=ON .. && make -j"
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
The recent removal of GTest and the introduction of a custom test
framework require an MPI dependency, which is not needed for CodeQL
analysis.
Disable test building in CodeQL workflows to fix the build failures.
CodeQL only needs to analyze the core library code, not the tests.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Support --gtest_filter command line argument for test filtering,
compatible with Azure pipeline configurations.
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Move test framework from test/perf/ to test/ for shared use
- Add GTest-compatible macros (TEST, TEST_F, EXPECT_*, ASSERT_*, etc.)
- Remove GTest dependency from CMakeLists.txt
- Add test_framework library for unit and mp_unit tests
- Add code coverage support with lcov (MSCCLPP_ENABLE_COVERAGE option)
- Update perf tests to use shared framework
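For readers unfamiliar with how a GTest-style TEST macro works without GTest, here is a minimal sketch of static self-registration. It is not the framework's code: the `sketch` namespace, `SKETCH_TEST`, `Registrar`, and `registry()` are hypothetical names chosen to avoid any confusion with the real macros.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

struct TestCase {
  std::string suite;
  std::string name;
  std::function<void()> body;
};

// Function-local static avoids static initialization order problems.
inline std::vector<TestCase>& registry() {
  static std::vector<TestCase> tests;
  return tests;
}

// Constructing a Registrar at namespace scope adds the test before main() runs.
struct Registrar {
  Registrar(const char* suite, const char* name, std::function<void()> body) {
    assert(body != nullptr);  // a registered test must have a body
    registry().push_back({suite, name, std::move(body)});
  }
};

}  // namespace sketch

// Declares the body, registers it, then opens the body definition.
#define SKETCH_TEST(suite, name)                                     \
  static void suite##_##name##_body();                               \
  static sketch::Registrar suite##_##name##_registrar(               \
      #suite, #name, suite##_##name##_body);                         \
  static void suite##_##name##_body()

static int g_runs = 0;

SKETCH_TEST(Demo, Increments) { ++g_runs; }
```

A runner then only has to iterate over `sketch::registry()` and call each `body`, which is also where name filters or perf-test exclusion could be applied.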
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h
- Enhanced test/framework.hpp with GTest-compatible APIs:
  - Added Environment base class for global test setup/teardown
  - Added TestInfo and UnitTest classes for test metadata access
  - Added GTEST_SKIP macro support via SkipHelper class
  - Added namespace alias 'testing' for compatibility
  - Added InitGoogleTest and AddGlobalTestEnvironment helper functions
- Updated test/framework.cc with implementations for new classes
- All mp_unit test files now use framework.hpp through mp_unit_tests.hpp
- Formatting applied via lint.sh
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
* Added configurable InfiniBand (IB) signaling mode.
The `EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless the
environment variable `MSCCLPP_IBV_MODE` specifies otherwise. `Host`
corresponds to the previous implementation, which uses RDMA atomics for
signaling, while `HostNoAtomic` uses write-with-immediate instead.
* Regarding updates in Python bindings and API.
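The way `Default` resolves for the IB signaling mode can be sketched as below. The enum values mirror those listed above; the function `resolveMode` and the accepted environment string `"host-no-atomic"` are assumptions made for illustration, since the values that `MSCCLPP_IBV_MODE` accepts are not stated here.

```cpp
#include <cassert>
#include <string>

enum class Mode { Default, Host, HostNoAtomic };

// Hypothetical resolution: an explicit configuration wins; Default falls back
// to the environment value, and to Host when nothing else is specified.
// The string "host-no-atomic" is an assumed spelling, not a documented one.
Mode resolveMode(Mode configured, const char* envValue) {
  Mode resolved = Mode::Host;
  if (configured != Mode::Default) {
    resolved = configured;
  } else if (envValue != nullptr && std::string(envValue) == "host-no-atomic") {
    resolved = Mode::HostNoAtomic;
  }
  assert(resolved != Mode::Default);  // callers always get a concrete mode
  return resolved;
}
```

A caller could pass `std::getenv("MSCCLPP_IBV_MODE")` (from `<cstdlib>`) as the second argument; it returns a null pointer when the variable is unset, which this sketch treats as `Host`.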