mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-13 09:46:00 +00:00

Author	SHA1	Message	Date
Changho Hwang	b04fa2daa7	lint	2026-04-04 06:22:04 +00:00
Changho Hwang	f62633ad41	mlx5dv bug fixes & enhanced unit tests perf reporting	2026-04-04 06:18:44 +00:00
Changho Hwang	53099a7cf9	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-04-01 22:45:58 -07:00
Binyang Li	be9126ca1b	Fix run-remote.sh to support multi-command scripts (#770 ) ## Summary - Fix `run-remote.sh` to correctly execute multi-command scripts (e.g., multiple `mpirun` calls) - The old approach piped decoded script through `base64 -d \| bash`, which feeds the script via bash's stdin. When `mpirun` (or its child processes) runs, it can consume the remaining stdin, causing bash to never see subsequent commands — only the first command would execute. - The fix decodes the script to a temp file and runs `bash -euxo pipefail "$TMP"` instead, so bash reads commands from the file and `mpirun` consuming stdin has no effect. - Applied to both the docker path (pssh + docker exec) and the non-docker path (pssh only). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-01 16:25:19 -07:00
Changho Hwang	ff4d825652	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-04-01 14:01:55 -07:00
Changho Hwang	67f9933ba1	fix data direct	2026-04-01 10:20:43 +00:00
Changho Hwang	d2f7056cf4	Add unit testing framework readme (#766 )	2026-04-01 05:30:35 +00:00
Binyang Li	4f3638b60d	Use PTX red for D2D semaphore signal (#768 ) ## Summary - Replace the two-step `signal()` implementation (`incOutbound()` + `atomicStore()`) with a single fire-and-forget PTX `red.release.sys.global.add.u64` instruction - This eliminates one local atomic fetch-add and replaces a remote store with a remote atomic add that has no return value — more efficient on both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to fire-and-forget `flat_atomic_add_x2`) - Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait ping-pong latency ### Performance (H100, 2 ranks, signal+wait round-trip) ``` SemaphorePerfTest.SignalPingPong: Store-based (old): 2.595 us/iter Red-based (new): 2.345 us/iter Speedup: 1.11x ``` ## Test plan - [x] Builds successfully (`make mp_unit_tests`) - [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter "SemaphorePerfTest"` — 1.11x speedup 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 15:34:43 -07:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744 ) - Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.`) that covers only what the tests actually use — a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_`/`ASSERT_*` assertions, environments, and setup/teardown. - `--exclude-perf-tests` flag and substring-based negative filtering - `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov - Merges standalone `test/perf/` into main test targets - Refactors Azure pipelines to reduce redundancies & make more readable --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
Changho Hwang	02005322a7	Merge branch 'copilot/remove-gtest-use-custom-framework' into chhwang/fix-ib-no-atomic	2026-03-18 14:04:20 -07:00
copilot-swe-agent[bot]	bff76d5b85	Fix TearDown() handling and replace assert() in perf tests Address review comments: 1. Ensure TearDown() is always called if SetUp() succeeds, even when TestBody() throws. This prevents resource leaks and maintains MPI synchronization between tests. 2. Replace assert() in fifo_perf_tests.cu with proper return false on validation failure, ensuring consistent test failure reporting. Fixes: - test/framework.cc: Track SetUp success and call TearDown in finally-style - test/unit/fifo_perf_tests.cu: Replace assert with explicit check Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-03-18 19:44:11 +00:00
Changho Hwang	275622159c	update	2026-03-18 02:32:21 +00:00
Changho Hwang	2c4bab8359	fix	2026-03-16 18:37:57 +00:00
Changho Hwang	e2a9692674	fix merge	2026-03-11 21:04:45 +00:00
Changho Hwang	a38bd9dee2	Merge branch 'main' into copilot/remove-gtest-use-custom-framework	2026-03-11 14:02:56 -07:00
Changho Hwang	2a705f52e1	fix merge	2026-03-11 20:38:54 +00:00
Changho Hwang	e2a5be467d	debugging	2026-03-11 02:40:50 +00:00
Changho Hwang	757c0ecc6a	debugging	2026-03-11 01:00:12 +00:00
Changho Hwang	cf505d777a	debugging	2026-03-10 22:18:41 +00:00
Changho Hwang	7a87c2c856	debugging	2026-03-10 20:51:22 +00:00
Changho Hwang	6647338fb4	debugging	2026-03-10 17:50:04 +00:00
Changho Hwang	a9cf93863f	fix	2026-03-09 23:49:54 +00:00
Binyang Li	bf946ea51e	Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759 ) ## Summary This PR addresses a multicast resource leak, fixes `cuMemMap` offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for `SwitchChannel`. ### Bug Fixes #### 1. Fix multicast allocation handle leak in `createMulticast()` (`gpu_ipc_mem.cc`) `GpuIpcMemHandle::createMulticast()` called `cuMulticastCreate(&allocHandle, ...)` but never released the local `allocHandle` after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak — the multicast object was never freed even after all mappings and imported handles were released. Per the [CUDA Driver API docs for `cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html): > "The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released." The fix adds `cuMemRelease(allocHandle)` after export, matching the existing pattern used for regular allocations in `GpuIpcMemHandle::create()`. Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100. #### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`) `cuMemMap` requires `offset=0` for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same `NvlsConnection`. The fix maps the entire range `[0, mcOffset + bufferSize)` and returns the pointer offset by `mcOffset`. This only consumes extra virtual address space; no additional physical memory is used. ### Refactoring #### 3. Rename NVLS allreduce algorithm classes Renamed for clarity: - `AllreduceNvls` → `AllreduceNvlsZeroCopy` - `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline` - `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline` Updated all references in builder, selector, docs, and examples. #### 4. Move `nvlsConnections` setup to `initialize()` Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this member) to individual algorithm class members, initialized in their `initialize()` methods. ### Tests #### 5. Add `TwoChannelsSameConnection` test New unit test that creates two `SwitchChannel` instances from the same `NvlsConnection`, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the `cuMemMap` offset fix. ### Files Changed - `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap offset fix - `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` — nvlsConnections fix - `src/ext/collectives/include/allreduce/*.hpp` — renamed headers - `src/ext/collectives/algorithm_collection_builder.cc` — updated references - `src/ext/nccl/algorithm_selector.cc` — updated algorithm names - `test/mp_unit/switch_channel_tests.cu` — new test - `docs/guide/mscclpp-torch-integration.md` — updated names - `examples/torch-integration/customized_comm_with_default_algo.py` — updated names	2026-03-09 10:22:45 -07:00
Changho Hwang	d6a6fa2ffa	simplified	2026-03-08 05:31:48 +00:00
Changho Hwang	ea1dd65126	fix	2026-03-08 04:05:58 +00:00
Changho Hwang	c699b8a784	az pipeline refactoring	2026-03-07 02:23:30 +00:00
Changho Hwang	060982d253	updates	2026-02-26 12:40:58 -08:00
Changho Hwang	c4afbe12d9	Merge branch 'main' into copilot/remove-gtest-use-custom-framework	2026-02-23 10:25:22 -08:00
Binyang Li	3962574bcb	Address installation issue in some env (#750 ) This pull request updates the way the `nlohmann/json` library is fetched and upgrades it to a newer version in both the main build and test configuration files. Addressed installation issue in some env	2026-02-20 16:11:16 -08:00
Changho Hwang	b64536f28e	Merge branch 'main' into copilot/remove-gtest-use-custom-framework	2026-02-18 20:35:34 -08:00
Changho Hwang	2b4adcc4ad	fix lint	2026-02-18 20:33:57 -08:00
Changho Hwang	b693d1b3fc	lint issue	2026-02-18 20:31:25 -08:00
Changho Hwang	e40c72bd2b	license text update	2026-02-18 20:12:32 -08:00
Changho Hwang	b6ce0f2ede	simplify	2026-02-18 19:16:21 -08:00
Changho Hwang	30b9891180	simplifying	2026-02-18 18:35:33 -08:00
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741 ) This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables. 1. Algorithm Selection Refactoring Use separate class for algo selection. Could introduce more complex logic for algo selection based on message size, arch, if cuda graph is enabled and memory allocation method 2. Symmetric Memory Support Introduced symmetricMemory parameter in algorithm context key generation. Remove disableChannelCache env as is ambiguous 3. Add new args for build_default_algorithms Add flag_buffer, and flag_buffer_size args to build default algorithm. Then we could use unified flag buffer for different algorithms, avoid application hanging when switch algo for different message size. --------- Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
copilot-swe-agent[bot]	7003fec763	Simplify filter matching to use substring matching - Remove complex wildcard pattern matching (*, ?, negative patterns) - Use simple substring matching with find() - Simpler implementation, easier to understand and maintain - Still supports --gtest_filter for basic test name filtering Note: For advanced filtering like wildcards, users can use multiple test runs with different substring filters. Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 08:36:24 +00:00
copilot-swe-agent[bot]	e26f8abbd4	Address PR review comments 1. Add missing includes to fifo_perf_tests.cu - Add #include <cassert> - Add #include <unordered_map> 2. Fix license header capitalization (4 files) - "license." → "License." in test/framework.{hpp,cc} - "license." → "License." in test/unit/{unit_tests_main.cc,fifo_perf_tests.cu} 3. Fix double MPI_Init issue - Check MPI_Initialized() before calling MPI_Init - Prevents double initialization when mp_unit_tests already inits MPI 4. Fix coverage flags for CUDA compilation - Use generator expressions to apply --coverage only to C++ language - Prevents breaking CUDA compilation with host-only flags 5. Fix environment memory leak - Delete environment objects after TearDown() - Clear environments_ vector 6. Implement proper GTEST_SKIP handling - Create SkipException class - Handle skipped tests separately from failures - Report skipped test count 7. Implement GTest-style filter pattern matching - Support wildcards (* and ?) - Support negative patterns (-Pattern) - Support colon-separated patterns (Foo:Bar) - Compatible with existing CI usage like --gtest_filter=-Ib Verified builds successfully with Docker. Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 08:32:28 +00:00
copilot-swe-agent[bot]	50f6a24b69	Remove test/perf/ directory completely - Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt) - Remove add_subdirectory(perf) from test/CMakeLists.txt - Performance tests now integrated into unit_tests as fifo_perf_tests.cu - Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace - Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE - Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ - Add missing #include <unistd.h> to executor_tests.cc All perf test functionality is now in unit_tests and can be filtered with --exclude-perf-tests flag. The standalone test/perf/ directory is no longer needed. Verified builds: - unit_tests: ✅ - mp_unit_tests: ✅ Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 03:16:00 +00:00
copilot-swe-agent[bot]	b59196b8a5	Integrate perf tests into unit_tests and add CI coverage step - Add unit_tests_main.cc with main() function for unit_tests executable - Create fifo_perf_tests.cu as PERF_TEST for unit_tests - Add fifo_perf_tests.cu to unit_tests sources - Fix errors_tests.cc to use ASSERT_TRUE for ErrorCode comparisons - Fix core_tests.cc to use ASSERT_TRUE for TransportFlags comparisons - Add Azure pipeline step for Debug build with coverage - Add step to run mp_unit_tests --exclude-perf-tests with coverage The perf tests are now part of unit_tests and can be filtered out for coverage reporting. CI now includes Debug build with coverage collection for non-performance tests. Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 02:35:46 +00:00
copilot-swe-agent[bot]	7e4365f014	Add performance test filtering and remove HTML coverage - Add isPerfTest field to TestInfoInternal struct - Add --exclude-perf-tests command line argument - Add PERF_TEST and PERF_TEST_F macros for marking performance tests - Update runAllTests to filter performance tests when requested - Remove genhtml dependency and HTML report generation - Keep only coverage.info file generation with lcov Performance tests can now be excluded with: ./build/bin/unit_tests --exclude-perf-tests ./build/bin/mp_unit_tests --exclude-perf-tests Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 02:29:50 +00:00
copilot-swe-agent[bot]	b1f458eca3	Convert test framework identifiers from snake_case to camelCase Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 02:25:28 +00:00
copilot-swe-agent[bot]	305d15717e	Remove PerfTestResult and reuse TestResult directly - Add nlohmann::ordered_json metrics field to TestResult struct - Add nlohmann/json.hpp include to test/framework.hpp - Link test_framework with nlohmann_json::nlohmann_json - Replace PerfTestResult with TestResult in test/perf/framework.cc - Move perf utility functions to utils namespace for consistency - Remove duplicate PerfTestResult struct definition This consolidates the two similar structs into one, reducing code duplication while maintaining all necessary fields for both unit tests (passed/failure_message) and performance tests (metrics). Verified build succeeds with Docker: docker run --rm -v $(pwd):/workspace -w /workspace \ ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \ "cd /workspace/build && make -j4 fifo_test" Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 02:21:10 +00:00
copilot-swe-agent[bot]	4823583587	Move FailHelper and SkipHelper into mscclpp::test namespace - Move helper classes inside namespace for proper access - Remove duplicate class definitions outside namespace - Tests can now build in Docker with CUDA toolkit installed - Remaining issues: ErrorCode and TransportFlags need operator<< for EXPECT_EQ Successfully building with Docker: docker run --rm -v $(pwd):/workspace -w /workspace \ ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \ "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \ -DMSCCLPP_USE_CUDA=ON .. && make -j4" Note: Some unit tests (errors_tests.cc, core_tests.cc) need operator<< defined for ErrorCode and TransportFlags to compile with custom framework. Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 01:17:08 +00:00
copilot-swe-agent[bot]	0eae34c53d	Fix test framework for building with Docker - Make MPI REQUIRED for test builds (clearer error messages) - Add project include directories to test_framework library - Fix core_tests.cc to use custom framework correctly - Fix mp_unit_tests.hpp to use mscclpp::test namespace - Add FAIL() macro with streaming support for test messages - Building tests now works in Docker environment with GPU bypass Tests can now be built using: docker run --rm -v $(pwd):/workspace -w /workspace \ ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \ "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \ -DMSCCLPP_USE_CUDA=ON .. && make -j" Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 01:13:29 +00:00
copilot-swe-agent[bot]	a10aff559f	Address code review feedback - Move PerfTestResult struct definition outside vector declaration - Move getCurrentTimestamp to anonymous namespace - Add documentation for GTEST_SKIP macro explaining RAII pattern Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 00:28:00 +00:00
copilot-swe-agent[bot]	3d8a2e7349	Add --gtest_filter support to framework Support --gtest_filter command line argument for test filtering, compatible with Azure pipeline configurations. Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 00:25:43 +00:00
copilot-swe-agent[bot]	eafa6fbfaf	Add custom test framework and code coverage support - Move test framework from test/perf/ to test/ for shared use - Add GTest-compatible macros (TEST, TEST_F, EXPECT_, ASSERT_, etc.) - Remove GTest dependency from CMakeLists.txt - Add test_framework library for unit and mp_unit tests - Add code coverage support with lcov (MSCCLPP_ENABLE_COVERAGE option) - Update perf tests to use shared framework Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 00:24:03 +00:00
copilot-swe-agent[bot]	1e32e17c1e	Address code review comments - Remove duplicate static getMPIRank() and getMPISize() functions - Add full namespace qualification to GTEST_SKIP macro Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 00:22:04 +00:00
copilot-swe-agent[bot]	e227fdc1ef	Convert mp_unit tests from gtest to framework.hpp - Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h - Enhanced test/framework.hpp with GTest-compatible APIs: - Added Environment base class for global test setup/teardown - Added TestInfo and UnitTest classes for test metadata access - Added GTEST_SKIP macro support via SkipHelper class - Added namespace alias 'testing' for compatibility - Added InitGoogleTest and AddGlobalTestEnvironment helper functions - Updated test/framework.cc with implementations for new classes - All mp_unit test files now use framework.hpp through mp_unit_tests.hpp - Formatting applied via lint.sh Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-02-11 00:21:04 +00:00

1 2 3 4

191 Commits