Commit Graph

191 Commits

Author SHA1 Message Date
Changho Hwang
b04fa2daa7 lint 2026-04-04 06:22:04 +00:00
Changho Hwang
f62633ad41 mlx5dv bug fixes & enhanced unit tests perf reporting 2026-04-04 06:18:44 +00:00
Changho Hwang
53099a7cf9 Merge branch 'main' into chhwang/fix-ib-no-atomic 2026-04-01 22:45:58 -07:00
Binyang Li
be9126ca1b Fix run-remote.sh to support multi-command scripts (#770)
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, causing bash to
never see subsequent commands — only the first command would execute.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).


🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-01 16:25:19 -07:00
Changho Hwang
ff4d825652 Merge branch 'main' into chhwang/fix-ib-no-atomic 2026-04-01 14:01:55 -07:00
Changho Hwang
67f9933ba1 fix data direct 2026-04-01 10:20:43 +00:00
Changho Hwang
d2f7056cf4 Add unit testing framework readme (#766) 2026-04-01 05:30:35 +00:00
Binyang Li
4f3638b60d Use PTX red for D2D semaphore signal (#768)
## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency

### Performance (H100, 2 ranks, signal+wait round-trip)

```
SemaphorePerfTest.SignalPingPong:
  Store-based (old): 2.595 us/iter
  Red-based   (new): 2.345 us/iter
  Speedup:           1.11x
```

## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 15:34:43 -07:00
Copilot
93f6eeaa6b Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744)
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- `--exclude-perf-tests` flag and substring-based negative filtering
- `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to
Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-03-24 23:34:38 -04:00
Changho Hwang
02005322a7 Merge branch 'copilot/remove-gtest-use-custom-framework' into chhwang/fix-ib-no-atomic 2026-03-18 14:04:20 -07:00
copilot-swe-agent[bot]
bff76d5b85 Fix TearDown() handling and replace assert() in perf tests
Address review comments:
1. Ensure TearDown() is always called if SetUp() succeeds, even when
   TestBody() throws. This prevents resource leaks and maintains MPI
   synchronization between tests.
2. Replace assert() in fifo_perf_tests.cu with proper return false
   on validation failure, ensuring consistent test failure reporting.

Fixes:
- test/framework.cc: Track SetUp success and call TearDown in finally-style
- test/unit/fifo_perf_tests.cu: Replace assert with explicit check

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-03-18 19:44:11 +00:00
Changho Hwang
275622159c update 2026-03-18 02:32:21 +00:00
Changho Hwang
2c4bab8359 fix 2026-03-16 18:37:57 +00:00
Changho Hwang
e2a9692674 fix merge 2026-03-11 21:04:45 +00:00
Changho Hwang
a38bd9dee2 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-03-11 14:02:56 -07:00
Changho Hwang
2a705f52e1 fix merge 2026-03-11 20:38:54 +00:00
Changho Hwang
e2a5be467d debugging 2026-03-11 02:40:50 +00:00
Changho Hwang
757c0ecc6a debugging 2026-03-11 01:00:12 +00:00
Changho Hwang
cf505d777a debugging 2026-03-10 22:18:41 +00:00
Changho Hwang
7a87c2c856 debugging 2026-03-10 20:51:22 +00:00
Changho Hwang
6647338fb4 debugging 2026-03-10 17:50:04 +00:00
Changho Hwang
a9cf93863f fix 2026-03-09 23:49:54 +00:00
Binyang Li
bf946ea51e Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759)
## Summary

This PR addresses a multicast resource leak, fixes `cuMemMap` offset
handling for multicast handles, renames NVLS allreduce algorithm classes
for clarity, and adds a new unit test for `SwitchChannel`.

### Bug Fixes

#### 1. Fix multicast allocation handle leak in `createMulticast()`
(`gpu_ipc_mem.cc`)

`GpuIpcMemHandle::createMulticast()` called
`cuMulticastCreate(&allocHandle, ...)` but never released the local
`allocHandle` after exporting it to shareable handles (POSIX FD /
Fabric). This caused a reference count leak — the multicast object was
never freed even after all mappings and imported handles were released.

Per the [CUDA Driver API docs for
`cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):
> *"The memory allocation will be freed when all outstanding mappings to
the memory are unmapped and when all outstanding references to the
handle (including its shareable counterparts) are also released."*

The fix adds `cuMemRelease(allocHandle)` after export, matching the
existing pattern used for regular allocations in
`GpuIpcMemHandle::create()`.

**Impact:** Without this fix, repeated creation/destruction of NVLS
connections causes OOM after ~120 iterations when allocating 1GB
multicast buffers on H100.

#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)

`cuMemMap` requires `offset=0` for multicast handles. Previously, the
code attempted to map at a non-zero offset within the multicast object,
leading to errors when binding multiple buffers to the same
`NvlsConnection`. The fix maps the entire range `[0, mcOffset +
bufferSize)` and returns the pointer offset by `mcOffset`. This only
consumes extra virtual address space; no additional physical memory is
used.

### Refactoring

#### 3. Rename NVLS allreduce algorithm classes

Renamed for clarity:
- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`

Updated all references in builder, selector, docs, and examples.

#### 4. Move `nvlsConnections` setup to `initialize()`

Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this
member) to individual algorithm class members, initialized in their
`initialize()` methods.

### Tests

#### 5. Add `TwoChannelsSameConnection` test

New unit test that creates two `SwitchChannel` instances from the same
`NvlsConnection`, performs reduce operations on both, and verifies
correctness. This exercises the multi-bind path that triggered the
`cuMemMap` offset fix.

### Files Changed

- `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap
offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` —
nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp` — renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc` — updated
references
- `src/ext/nccl/algorithm_selector.cc` — updated algorithm names
- `test/mp_unit/switch_channel_tests.cu` — new test
- `docs/guide/mscclpp-torch-integration.md` — updated names
- `examples/torch-integration/customized_comm_with_default_algo.py` —
updated names
2026-03-09 10:22:45 -07:00
Changho Hwang
d6a6fa2ffa simplified 2026-03-08 05:31:48 +00:00
Changho Hwang
ea1dd65126 fix 2026-03-08 04:05:58 +00:00
Changho Hwang
c699b8a784 az pipeline refactoring 2026-03-07 02:23:30 +00:00
Changho Hwang
060982d253 updates 2026-02-26 12:40:58 -08:00
Changho Hwang
c4afbe12d9 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-23 10:25:22 -08:00
Binyang Li
3962574bcb Address installation issue in some env (#750)
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
Addressed installation issue in some env
2026-02-20 16:11:16 -08:00
Changho Hwang
b64536f28e Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-18 20:35:34 -08:00
Changho Hwang
2b4adcc4ad fix lint 2026-02-18 20:33:57 -08:00
Changho Hwang
b693d1b3fc lint issue 2026-02-18 20:31:25 -08:00
Changho Hwang
e40c72bd2b license text update 2026-02-18 20:12:32 -08:00
Changho Hwang
b6ce0f2ede simplify 2026-02-18 19:16:21 -08:00
Changho Hwang
30b9891180 simplifying 2026-02-18 18:35:33 -08:00
Binyang Li
bd68319e3e Refactor algo selection logic and introduce symmetric_memory env (#741)
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.


1. Algorithm Selection Refactoring
Use separate class for algo selection. Could introduce more complex
logic for algo selection based on message size, arch, if cuda graph is
enabled and memory allocation method

2. Symmetric Memory Support
Introduced symmetricMemory parameter in algorithm context key
generation. Remove disableChannelCache env as is ambiguous

3. Add new args for build_default_algorithms 
Add flag_buffer, and flag_buffer_size args to build default algorithm.
Then we could use unified flag buffer for different algorithms, avoid
application hanging when switch algo for different message size.

---------

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-02-12 19:06:18 -08:00
copilot-swe-agent[bot]
7003fec763 Simplify filter matching to use substring matching
- Remove complex wildcard pattern matching (*, ?, negative patterns)
- Use simple substring matching with find()
- Simpler implementation, easier to understand and maintain
- Still supports --gtest_filter for basic test name filtering

Note: For advanced filtering like wildcards, users can use multiple
test runs with different substring filters.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 08:36:24 +00:00
copilot-swe-agent[bot]
e26f8abbd4 Address PR review comments
1. Add missing includes to fifo_perf_tests.cu
   - Add #include <cassert>
   - Add #include <unordered_map>

2. Fix license header capitalization (4 files)
   - "license." → "License." in test/framework.{hpp,cc}
   - "license." → "License." in test/unit/{unit_tests_main.cc,fifo_perf_tests.cu}

3. Fix double MPI_Init issue
   - Check MPI_Initialized() before calling MPI_Init
   - Prevents double initialization when mp_unit_tests already inits MPI

4. Fix coverage flags for CUDA compilation
   - Use generator expressions to apply --coverage only to C++ language
   - Prevents breaking CUDA compilation with host-only flags

5. Fix environment memory leak
   - Delete environment objects after TearDown()
   - Clear environments_ vector

6. Implement proper GTEST_SKIP handling
   - Create SkipException class
   - Handle skipped tests separately from failures
   - Report skipped test count

7. Implement GTest-style filter pattern matching
   - Support wildcards (* and ?)
   - Support negative patterns (-Pattern)
   - Support colon-separated patterns (Foo:Bar)
   - Compatible with existing CI usage like --gtest_filter=-*Ib*

Verified builds successfully with Docker.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 08:32:28 +00:00
copilot-swe-agent[bot]
50f6a24b69 Remove test/perf/ directory completely
- Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt)
- Remove add_subdirectory(perf) from test/CMakeLists.txt
- Performance tests now integrated into unit_tests as fifo_perf_tests.cu
- Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace
- Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE
- Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ
- Add missing #include <unistd.h> to executor_tests.cc

All perf test functionality is now in unit_tests and can be filtered
with --exclude-perf-tests flag. The standalone test/perf/ directory
is no longer needed.

Verified builds:
- unit_tests: 
- mp_unit_tests: 

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 03:16:00 +00:00
copilot-swe-agent[bot]
b59196b8a5 Integrate perf tests into unit_tests and add CI coverage step
- Add unit_tests_main.cc with main() function for unit_tests executable
- Create fifo_perf_tests.cu as PERF_TEST for unit_tests
- Add fifo_perf_tests.cu to unit_tests sources
- Fix errors_tests.cc to use ASSERT_TRUE for ErrorCode comparisons
- Fix core_tests.cc to use ASSERT_TRUE for TransportFlags comparisons
- Add Azure pipeline step for Debug build with coverage
- Add step to run mp_unit_tests --exclude-perf-tests with coverage

The perf tests are now part of unit_tests and can be filtered out
for coverage reporting. CI now includes Debug build with coverage
collection for non-performance tests.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:35:46 +00:00
copilot-swe-agent[bot]
7e4365f014 Add performance test filtering and remove HTML coverage
- Add isPerfTest field to TestInfoInternal struct
- Add --exclude-perf-tests command line argument
- Add PERF_TEST and PERF_TEST_F macros for marking performance tests
- Update runAllTests to filter performance tests when requested
- Remove genhtml dependency and HTML report generation
- Keep only coverage.info file generation with lcov

Performance tests can now be excluded with:
  ./build/bin/unit_tests --exclude-perf-tests
  ./build/bin/mp_unit_tests --exclude-perf-tests

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:29:50 +00:00
copilot-swe-agent[bot]
b1f458eca3 Convert test framework identifiers from snake_case to camelCase
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:25:28 +00:00
copilot-swe-agent[bot]
305d15717e Remove PerfTestResult and reuse TestResult directly
- Add nlohmann::ordered_json metrics field to TestResult struct
- Add nlohmann/json.hpp include to test/framework.hpp
- Link test_framework with nlohmann_json::nlohmann_json
- Replace PerfTestResult with TestResult in test/perf/framework.cc
- Move perf utility functions to utils namespace for consistency
- Remove duplicate PerfTestResult struct definition

This consolidates the two similar structs into one, reducing code
duplication while maintaining all necessary fields for both unit
tests (passed/failure_message) and performance tests (metrics).

Verified build succeeds with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "cd /workspace/build && make -j4 fifo_test"

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 02:21:10 +00:00
copilot-swe-agent[bot]
4823583587 Move FailHelper and SkipHelper into mscclpp::test namespace
- Move helper classes inside namespace for proper access
- Remove duplicate class definitions outside namespace
- Tests can now build in Docker with CUDA toolkit installed
- Remaining issues: ErrorCode and TransportFlags need operator<< for EXPECT_EQ

Successfully building with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
   -DMSCCLPP_USE_CUDA=ON .. && make -j4"

Note: Some unit tests (errors_tests.cc, core_tests.cc) need operator<<
defined for ErrorCode and TransportFlags to compile with custom framework.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 01:17:08 +00:00
copilot-swe-agent[bot]
0eae34c53d Fix test framework for building with Docker
- Make MPI REQUIRED for test builds (clearer error messages)
- Add project include directories to test_framework library
- Fix core_tests.cc to use custom framework correctly
- Fix mp_unit_tests.hpp to use mscclpp::test namespace
- Add FAIL() macro with streaming support for test messages
- Building tests now works in Docker environment with GPU bypass

Tests can now be built using:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
   -DMSCCLPP_USE_CUDA=ON .. && make -j"

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 01:13:29 +00:00
copilot-swe-agent[bot]
a10aff559f Address code review feedback
- Move PerfTestResult struct definition outside vector declaration
- Move getCurrentTimestamp to anonymous namespace
- Add documentation for GTEST_SKIP macro explaining RAII pattern

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:28:00 +00:00
copilot-swe-agent[bot]
3d8a2e7349 Add --gtest_filter support to framework
Support --gtest_filter command line argument for test filtering,
compatible with Azure pipeline configurations.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:25:43 +00:00
copilot-swe-agent[bot]
eafa6fbfaf Add custom test framework and code coverage support
- Move test framework from test/perf/ to test/ for shared use
- Add GTest-compatible macros (TEST, TEST_F, EXPECT_*, ASSERT_*, etc.)
- Remove GTest dependency from CMakeLists.txt
- Add test_framework library for unit and mp_unit tests
- Add code coverage support with lcov (MSCCLPP_ENABLE_COVERAGE option)
- Update perf tests to use shared framework

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:24:03 +00:00
copilot-swe-agent[bot]
1e32e17c1e Address code review comments
- Remove duplicate static getMPIRank() and getMPISize() functions
- Add full namespace qualification to GTEST_SKIP macro

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:22:04 +00:00
copilot-swe-agent[bot]
e227fdc1ef Convert mp_unit tests from gtest to framework.hpp
- Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h
- Enhanced test/framework.hpp with GTest-compatible APIs:
  - Added Environment base class for global test setup/teardown
  - Added TestInfo and UnitTest classes for test metadata access
  - Added GTEST_SKIP macro support via SkipHelper class
  - Added namespace alias 'testing' for compatibility
  - Added InitGoogleTest and AddGlobalTestEnvironment helper functions
- Updated test/framework.cc with implementations for new classes
- All mp_unit test files now use framework.hpp through mp_unit_tests.hpp
- Formatting applied via lint.sh

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-02-11 00:21:04 +00:00