Commit Graph

164 Commits

Author SHA1 Message Date
Changho Hwang
c4afbe12d9 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-23 10:25:22 -08:00
Binyang Li
3962574bcb Address installation issue in some env (#750)
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
Addressed installation issue in some env
2026-02-20 16:11:16 -08:00
Changho Hwang
b64536f28e Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-18 20:35:34 -08:00
Changho Hwang
2b4adcc4ad fix lint 2026-02-18 20:33:57 -08:00
Changho Hwang
b693d1b3fc lint issue 2026-02-18 20:31:25 -08:00
Changho Hwang
e40c72bd2b license text update 2026-02-18 20:12:32 -08:00
Changho Hwang
b6ce0f2ede simplify 2026-02-18 19:16:21 -08:00
Changho Hwang
30b9891180 simplifying 2026-02-18 18:35:33 -08:00
Binyang Li
bd68319e3e Refactor algo selection logic and introduce symmetric_memory env (#741)
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.


1. Algorithm Selection Refactoring
Use separate class for algo selection. Could introduce more complex
logic for algo selection based on message size, arch, if cuda graph is
enabled and memory allocation method

2. Symmetric Memory Support
Introduced symmetricMemory parameter in algorithm context key
generation. Remove disableChannelCache env as is ambiguous

3. Add new args for build_default_algorithms 
Add flag_buffer, and flag_buffer_size args to build default algorithm.
Then we could use unified flag buffer for different algorithms, avoid
application hanging when switch algo for different message size.

---------

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-02-12 19:06:18 -08:00
copilot-swe-agent[bot]
7003fec763 Simplify filter matching to use substring matching
- Remove complex wildcard pattern matching (*, ?, negative patterns)
- Use simple substring matching with find()
- Simpler implementation, easier to understand and maintain
- Still supports --gtest_filter for basic test name filtering

Note: For advanced filtering like wildcards, users can use multiple
test runs with different substring filters.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 08:36:24 +00:00
copilot-swe-agent[bot]
e26f8abbd4 Address PR review comments
1. Add missing includes to fifo_perf_tests.cu
   - Add #include <cassert>
   - Add #include <unordered_map>

2. Fix license header capitalization (4 files)
   - "license." → "License." in test/framework.{hpp,cc}
   - "license." → "License." in test/unit/{unit_tests_main.cc,fifo_perf_tests.cu}

3. Fix double MPI_Init issue
   - Check MPI_Initialized() before calling MPI_Init
   - Prevents double initialization when mp_unit_tests already inits MPI

4. Fix coverage flags for CUDA compilation
   - Use generator expressions to apply --coverage only to C++ language
   - Prevents breaking CUDA compilation with host-only flags

5. Fix environment memory leak
   - Delete environment objects after TearDown()
   - Clear environments_ vector

6. Implement proper GTEST_SKIP handling
   - Create SkipException class
   - Handle skipped tests separately from failures
   - Report skipped test count

7. Implement GTest-style filter pattern matching
   - Support wildcards (* and ?)
   - Support negative patterns (-Pattern)
   - Support colon-separated patterns (Foo:Bar)
   - Compatible with existing CI usage like --gtest_filter=-*Ib*

Verified builds successfully with Docker.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 08:32:28 +00:00
copilot-swe-agent[bot]
50f6a24b69 Remove test/perf/ directory completely
- Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt)
- Remove add_subdirectory(perf) from test/CMakeLists.txt
- Performance tests now integrated into unit_tests as fifo_perf_tests.cu
- Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace
- Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE
- Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ
- Add missing #include <unistd.h> to executor_tests.cc

All perf test functionality is now in unit_tests and can be filtered
with --exclude-perf-tests flag. The standalone test/perf/ directory
is no longer needed.

Verified builds:
- unit_tests: 
- mp_unit_tests: 

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 03:16:00 +00:00
copilot-swe-agent[bot]
b59196b8a5 Integrate perf tests into unit_tests and add CI coverage step
- Add unit_tests_main.cc with main() function for unit_tests executable
- Create fifo_perf_tests.cu as PERF_TEST for unit_tests
- Add fifo_perf_tests.cu to unit_tests sources
- Fix errors_tests.cc to use ASSERT_TRUE for ErrorCode comparisons
- Fix core_tests.cc to use ASSERT_TRUE for TransportFlags comparisons
- Add Azure pipeline step for Debug build with coverage
- Add step to run mp_unit_tests --exclude-perf-tests with coverage

The perf tests are now part of unit_tests and can be filtered out
for coverage reporting. CI now includes Debug build with coverage
collection for non-performance tests.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:35:46 +00:00
copilot-swe-agent[bot]
7e4365f014 Add performance test filtering and remove HTML coverage
- Add isPerfTest field to TestInfoInternal struct
- Add --exclude-perf-tests command line argument
- Add PERF_TEST and PERF_TEST_F macros for marking performance tests
- Update runAllTests to filter performance tests when requested
- Remove genhtml dependency and HTML report generation
- Keep only coverage.info file generation with lcov

Performance tests can now be excluded with:
  ./build/bin/unit_tests --exclude-perf-tests
  ./build/bin/mp_unit_tests --exclude-perf-tests

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:29:50 +00:00
copilot-swe-agent[bot]
b1f458eca3 Convert test framework identifiers from snake_case to camelCase
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:25:28 +00:00
copilot-swe-agent[bot]
305d15717e Remove PerfTestResult and reuse TestResult directly
- Add nlohmann::ordered_json metrics field to TestResult struct
- Add nlohmann/json.hpp include to test/framework.hpp
- Link test_framework with nlohmann_json::nlohmann_json
- Replace PerfTestResult with TestResult in test/perf/framework.cc
- Move perf utility functions to utils namespace for consistency
- Remove duplicate PerfTestResult struct definition

This consolidates the two similar structs into one, reducing code
duplication while maintaining all necessary fields for both unit
tests (passed/failure_message) and performance tests (metrics).

Verified build succeeds with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "cd /workspace/build && make -j4 fifo_test"

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:21:10 +00:00
copilot-swe-agent[bot]
4823583587 Move FailHelper and SkipHelper into mscclpp::test namespace
- Move helper classes inside namespace for proper access
- Remove duplicate class definitions outside namespace
- Tests can now build in Docker with CUDA toolkit installed
- Remaining issues: ErrorCode and TransportFlags need operator<< for EXPECT_EQ

Successfully building with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
   -DMSCCLPP_USE_CUDA=ON .. && make -j4"

Note: Some unit tests (errors_tests.cc, core_tests.cc) need operator<<
defined for ErrorCode and TransportFlags to compile with custom framework.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 01:17:08 +00:00
copilot-swe-agent[bot]
0eae34c53d Fix test framework for building with Docker
- Make MPI REQUIRED for test builds (clearer error messages)
- Add project include directories to test_framework library
- Fix core_tests.cc to use custom framework correctly
- Fix mp_unit_tests.hpp to use mscclpp::test namespace
- Add FAIL() macro with streaming support for test messages
- Building tests now works in Docker environment with GPU bypass

Tests can now be built using:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
   -DMSCCLPP_USE_CUDA=ON .. && make -j"

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 01:13:29 +00:00
copilot-swe-agent[bot]
a10aff559f Address code review feedback
- Move PerfTestResult struct definition outside vector declaration
- Move getCurrentTimestamp to anonymous namespace
- Add documentation for GTEST_SKIP macro explaining RAII pattern

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:28:00 +00:00
copilot-swe-agent[bot]
3d8a2e7349 Add --gtest_filter support to framework
Support --gtest_filter command line argument for test filtering,
compatible with Azure pipeline configurations.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:25:43 +00:00
copilot-swe-agent[bot]
eafa6fbfaf Add custom test framework and code coverage support
- Move test framework from test/perf/ to test/ for shared use
- Add GTest-compatible macros (TEST, TEST_F, EXPECT_*, ASSERT_*, etc.)
- Remove GTest dependency from CMakeLists.txt
- Add test_framework library for unit and mp_unit tests
- Add code coverage support with lcov (MSCCLPP_ENABLE_COVERAGE option)
- Update perf tests to use shared framework

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:24:03 +00:00
copilot-swe-agent[bot]
1e32e17c1e Address code review comments
- Remove duplicate static getMPIRank() and getMPISize() functions
- Add full namespace qualification to GTEST_SKIP macro

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:22:04 +00:00
copilot-swe-agent[bot]
e227fdc1ef Convert mp_unit tests from gtest to framework.hpp
- Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h
- Enhanced test/framework.hpp with GTest-compatible APIs:
  - Added Environment base class for global test setup/teardown
  - Added TestInfo and UnitTest classes for test metadata access
  - Added GTEST_SKIP macro support via SkipHelper class
  - Added namespace alias 'testing' for compatibility
  - Added InitGoogleTest and AddGlobalTestEnvironment helper functions
- Updated test/framework.cc with implementations for new classes
- All mp_unit test files now use framework.hpp through mp_unit_tests.hpp
- Formatting applied via lint.sh

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:21:04 +00:00
copilot-swe-agent[bot]
c881bc5e16 Replace gtest/gtest.h with framework.hpp in all unit tests
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:17:18 +00:00
Changho Hwang
42be3660e0 Add a new IB stack impl that doesn't use RDMA atomics (#728)
* Added configurable InfiniBand (IB) signaling mode.
`EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless specified
different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the
previous implementation using RDMA atomics for signaling, while
`HostNoAtomic` uses write-with-immediate instead.
* Regarding updates in Python bindings and API.
2026-02-10 01:07:53 +00:00
Binyang Li
c12822a7af create CI pipeline for rocm (#718)
Create CI pipeline for AMD GPU.
2026-02-09 16:55:16 -08:00
Qinghua Zhou
620378b4fb Fix cpplint error in main branch (#740)
Fix the legacy cpplint error in main branch.

---------

Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-05 09:25:12 -08:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize current native algorithm implementation and DSL algorithm
implementation.
Provide unified API for DSL algo and native algo and provide interface
to tune the algo
Provide interface for pytorch integration with native API and DSL

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Changho Hwang
2cf14ff723 Minor fixes (#715) 2026-01-05 11:09:48 +08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce handle cache for AMD platform.
Avoid reaching handle limitation if we open too much IPC handles

For nvidia, we don't need this feature since nvidia will count the
handle reference internally and reuse the same handle if already be
opened

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` field in the IB endpoint config (and
`deviceIndex` field for future usages)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Caio Rocha
060c35fec6 No IB Env CI Test (#687) 2025-11-19 11:13:03 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
ffafcaf6d6 IB stack enhancements & bug fixes (#673)
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
2025-11-07 14:26:17 -08:00
Changho Hwang
9994f53cea Fixes for no-IB systems (#667)
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; no need for DMABUF supported systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
2025-10-29 10:03:02 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
a2f1279c60 Test peer accessibility after deployment (#661)
Test GPUs' peer accessibility before integration testing to distinguish
VM issues.
2025-10-24 11:09:36 -07:00
Binyang Li
610db6f023 Fix test script (#655)
Fix: #654. Address correctness_test.py crash issue

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-21 19:57:17 +00:00
Changho Hwang
2f7d74b281 Fix lint.sh (#652)
Exit 1 upon any errors from clang-format or black
2025-10-20 17:23:01 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
2025-10-10 11:26:33 -07:00
Binyang Li
5ac427610d Address teardown issue (#638)
Ignore cuda/cu errors during teardown. Some pointer may be invalid at this point
2025-09-25 12:12:40 -07:00
Binyang Li
bf8d424ae3 use unix socket to share fd (#634)
Use unix socket to share fd to other processes. Used for nvls handle sharing
Update nccl interface to support worldSize=1

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-25 11:40:54 -07:00
Abhinav Jangda
5d062b7038 Fix Illegal Memory Access in nvls_test for CUDA12.9 (#631)
Running NVLS test, `test/nvls_test.cu` in CUDA 12.9 leads to illegal
memory access at
571fee16fb/test/nvls_test.cu (L151)
.
This PR addresses this error by moving cudaMemset after memory mapping.
2025-09-10 09:46:18 -07:00
Binyang Li
ba4c4aaeb8 Integrate MSCCL++ with torch workload (#626)
Integrate MSCCL++ with torch
Introduce `NCCL audit shim library`, use can use following commands to
launch torch library. Also avoid break build pipeline in the CPU machine
```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```
2025-09-09 13:28:32 -07:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00
Binyang Li
2b40fe37b3 add torch test (#612)
Simple torch test
2025-08-15 10:27:21 -07:00
Binyang Li
03c0ff2a91 Fix for multi-nodes test (#614)
Fix multi-node test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-14 20:44:43 -07:00
Binyang Li
bb76d27553 all2all implementation (#609)
Implement single node all2all via MSCCL++ C++API
perf kernel 3:
```
       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```

kernel 2
```

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
2025-08-14 11:30:40 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline 
- Clean code and enhance document
C++ side: 
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00