mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 17:26:04 +00:00

Author	SHA1	Message	Date
Ubuntu	6d8fb00a91	add extra signal/wait and avoid local flush	2026-04-09 15:58:07 +00:00
Ubuntu	3f2ade22cb	add barrier	2026-04-07 01:40:15 +00:00
Ubuntu	812f6cfded	fix hang on 4 ranks and make send/recv test more like nccl-test	2026-04-07 01:33:48 +00:00
Ubuntu	1a065dd6ad	add help scripts	2026-04-06 20:06:21 +00:00
Ubuntu	2c3f125d4c	add changes from ib and connection	2026-04-06 03:29:54 +00:00
Ubuntu	e487f831e6	debug	2026-04-06 03:01:30 +00:00
Ubuntu	ad56728c6d	fix	2026-04-06 02:32:58 +00:00
Ubuntu	8cecfee270	debug	2026-04-06 02:24:23 +00:00
Ubuntu	07d97f6f17	Unique QP per channel and env-controlled GID index - Change executor to create one connection (unique QP) per channel entry instead of sharing connections per peer. This is required for HostNoAtomic IB mode where each connection can only forward signals to one semaphore via setSignalForwardingDst. - Add MSCCLPP_IB_GID_INDEX environment variable to override the default GID index (3) used for IB transport. Set to the desired GID index value, or leave unset/-1 to use the default.	2026-04-06 02:18:56 +00:00
Ubuntu	251873ca8e	update	2026-04-06 02:14:52 +00:00
Ubuntu	1e6d4939a8	update	2026-04-06 02:11:36 +00:00
Ubuntu	289f89ddfe	update	2026-04-06 02:07:05 +00:00
Ubuntu	a4118eae73	update the number of instances	2026-04-06 02:06:37 +00:00
Ubuntu	b1cc649470	re-format output	2026-04-06 02:05:53 +00:00
Ubuntu	a191f16b76	add scripts	2026-04-06 02:04:49 +00:00
Ubuntu	d07a1ba28c	show scale in output	2026-04-06 02:02:10 +00:00
Ubuntu	27fbddb707	update the executor so we have message size range	2026-04-06 02:00:04 +00:00
Ubuntu	49979e58ab	tune #instances and remove extra barriers	2026-04-06 01:55:43 +00:00
Ubuntu	194a79f772	add sendrecv correctness check	2026-04-06 01:46:55 +00:00
Ubuntu	a4bb8fb4bf	add debugging code	2026-04-06 01:37:18 +00:00
Changho Hwang	b04fa2daa7	lint	2026-04-04 06:22:04 +00:00
Changho Hwang	f62633ad41	mlx5dv bug fixes & enhanced unit tests perf reporting	2026-04-04 06:18:44 +00:00
Changho Hwang	53099a7cf9	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-04-01 22:45:58 -07:00
Binyang Li	be9126ca1b	Fix run-remote.sh to support multi-command scripts (#770) ## Summary - Fix `run-remote.sh` to correctly execute multi-command scripts (e.g., multiple `mpirun` calls) - The old approach piped the decoded script through `base64 -d | bash`, which feeds the script via bash's stdin. When `mpirun` (or its child processes) runs, it can consume the remaining stdin, causing bash to never see subsequent commands, so only the first command would execute. - The fix decodes the script to a temp file and runs `bash -euxo pipefail "$TMP"` instead, so bash reads commands from the file and `mpirun` consuming stdin has no effect. - Applied to both the docker path (pssh + docker exec) and the non-docker path (pssh only). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-01 16:25:19 -07:00
Changho Hwang	553fd3b2d8	lint	2026-04-01 21:20:55 +00:00
Changho Hwang	94d0508ec2	prerequisites update	2026-04-01 21:18:47 +00:00
Changho Hwang	ff4d825652	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-04-01 14:01:55 -07:00
Changho Hwang	848b89b59c	64-bit token reconstruction	2026-04-01 21:00:54 +00:00
Changho Hwang	4cf53328ad	updates	2026-04-01 19:36:52 +00:00
Changho Hwang	f8e94d9971	disable mlx5dv_reg_dmabuf_mr	2026-04-01 19:00:03 +00:00
Changho Hwang	144046b818	revert	2026-04-01 18:22:16 +00:00
Changho Hwang	d1124fba29	revert	2026-04-01 18:20:29 +00:00
Changho Hwang	67f9933ba1	fix data direct	2026-04-01 10:20:43 +00:00
Changho Hwang	d2f7056cf4	Add unit testing framework readme (#766)	2026-04-01 05:30:35 +00:00
Binyang Li	4f3638b60d	Use PTX red for D2D semaphore signal (#768) ## Summary - Replace the two-step `signal()` implementation (`incOutbound()` + `atomicStore()`) with a single fire-and-forget PTX `red.release.sys.global.add.u64` instruction - This eliminates one local atomic fetch-add and replaces a remote store with a remote atomic add that has no return value, which is more efficient on both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to fire-and-forget `flat_atomic_add_x2`) - Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait ping-pong latency ### Performance (H100, 2 ranks, signal+wait round-trip) ``` SemaphorePerfTest.SignalPingPong: Store-based (old): 2.595 us/iter Red-based (new): 2.345 us/iter Speedup: 1.11x ``` ## Test plan - [x] Builds successfully (`make mp_unit_tests`) - [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter "SemaphorePerfTest"`: 1.11x speedup 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 15:34:43 -07:00
Ekow Wellington	fd76507e9a	Install default plans under MSCCLPP_CACHE_DIR/default (#769) ### Summary Update the installer to place bundled default execution plans under `<MSCCLPP_CACHE_DIR>/default`, which is where the runtime already looks for bundled plans. ### Background The C++ runtime treats `MSCCLPP_CACHE_DIR` as the cache root and loads bundled default plans from `<cache root>/default`. When `MSCCLPP_CACHE_DIR` was set, the installer instead wrote bundled plans directly into the cache root, causing the runtime to miss them. This surfaced while running benchmarking tests with a non-default `MSCCLPP_CACHE_DIR`, where the bundled plans were not being discovered. ### Change This PR updates the installer to always install bundled default plans into `<MSCCLPP_CACHE_DIR>/default`, preserving the existing runtime contract. ### Scope - Installer-only change - No runtime behavior changes ### Validation Manual inspection of the updated install path. Successful build --------- Co-authored-by: Ekow Wellington <t-ekoww@microsoft.com>	2026-03-31 14:27:33 -05:00
Changho Hwang	80f554ebaf	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-03-26 18:02:43 -04:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744) - Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.`) that covers only what the tests actually use: a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_`/`ASSERT_*` assertions, environments, and setup/teardown. - `--exclude-perf-tests` flag and substring-based negative filtering - `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov - Merges standalone `test/perf/` into main test targets - Refactors Azure pipelines to reduce redundancies & make more readable --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
Binyang Li	5d18835417	Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764) ## Summary Fix a use-after-free where the CUDA allocation handle (`CUmemGenericAllocationHandle`) was released prematurely while the exported fabric handle still referenced it. ## Problem Unlike POSIX FD handles (where the kernel keeps the allocation alive via the open file descriptor), fabric handles do not hold their own reference to the underlying allocation. The original code called `cuMemRelease(allocHandle)` immediately after exporting the fabric handle, freeing the allocation. When a remote process later tries to `cuMemImportFromShareableHandle` using that fabric handle, it references a freed allocation, which is a use-after-free. This affected both code paths: 1. `GpuIpcMemHandle::create()`: The local `allocHandle` obtained via `cuMemRetainAllocationHandle` was released right after fabric export, leaving the fabric handle dangling. 2. `GpuIpcMemHandle::createMulticast()`: The `allocHandle` from `cuMulticastCreate` was unconditionally released, even when it was the only thing keeping the multicast object alive for the fabric handle. ## Fix - Added `allocHandle` field to the `fabric` struct in `GpuIpcMemHandle` to store the allocation handle and keep it alive for the lifetime of the `GpuIpcMemHandle`. - `create()`: Retain an additional reference via `cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when a fabric handle is successfully exported. - `createMulticast()`: Store the `allocHandle` directly in `fabric.allocHandle` instead of unconditionally releasing it. Only release if fabric export was not used. - `deleter()`: Release `fabric.allocHandle` via `cuMemRelease` when the handle type includes `Fabric`, ensuring proper cleanup. - `GpuIpcMem` constructor (importer side): Clear `fabric.allocHandle` after importing, since the importer gets its own handle via `cuMemImportFromShareableHandle` and should not release the exporter's allocation handle. ## Files Changed - `src/core/include/gpu_ipc_mem.hpp`: Added `CUmemGenericAllocationHandle allocHandle` to fabric struct. - `src/core/gpu_ipc_mem.cc`: Retain/release allocation handle properly across create, createMulticast, deleter, and importer paths.	2026-03-19 11:52:09 -07:00
Changho Hwang	02005322a7	Merge branch 'copilot/remove-gtest-use-custom-framework' into chhwang/fix-ib-no-atomic	2026-03-18 14:04:20 -07:00
Changho Hwang	79a014976d	updates	2026-03-18 20:30:18 +00:00
Changho Hwang	6082648f80	fix for npkit	2026-03-18 20:06:37 +00:00
copilot-swe-agent[bot]	bff76d5b85	Fix TearDown() handling and replace assert() in perf tests Address review comments: 1. Ensure TearDown() is always called if SetUp() succeeds, even when TestBody() throws. This prevents resource leaks and maintains MPI synchronization between tests. 2. Replace assert() in fifo_perf_tests.cu with proper return false on validation failure, ensuring consistent test failure reporting. Fixes: - test/framework.cc: Track SetUp success and call TearDown in finally-style - test/unit/fifo_perf_tests.cu: Replace assert with explicit check Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-03-18 19:44:11 +00:00
Changho Hwang	275622159c	update	2026-03-18 02:32:21 +00:00
Changho Hwang	2297a3deda	updates	2026-03-18 00:58:08 +00:00
Changho Hwang	5a65cc7aba	debugging	2026-03-17 20:00:34 +00:00
Changho Hwang	d66d7e4743	debugging	2026-03-17 01:41:40 +00:00
Changho Hwang	a937ce4a8d	debugging	2026-03-16 20:35:46 +00:00
Changho Hwang	2c4bab8359	fix	2026-03-16 18:37:57 +00:00
Changho Hwang	e2a9692674	fix merge	2026-03-11 21:04:45 +00:00
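
Commit bff76d5b85 in the log above says that TearDown() must always run once SetUp() has succeeded, even when TestBody() throws. The following is a minimal C++ sketch of that finally-style control flow. It is not the code in test/framework.cc: `Fixture` and `runTest` are hypothetical names used only for illustration, and the real framework may report failures differently.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a test fixture (not the mscclpp framework type).
struct Fixture {
  std::function<void()> setUp;
  std::function<void()> body;
  std::function<void()> tearDown;
};

// Runs one test and returns true if it passed.
// tearDown() is called if and only if setUp() completed without throwing.
bool runTest(const Fixture& f) {
  assert(f.setUp && f.body && f.tearDown);  // all callbacks must be set

  bool setUpOk = false;
  bool passed = true;
  try {
    f.setUp();
    setUpOk = true;  // reached only when setUp() did not throw
    f.body();
  } catch (...) {
    passed = false;  // a failure in setUp() or body() fails the test
  }

  if (setUpOk) {
    // Emulates a "finally" block: cleanup runs even after body() threw.
    try {
      f.tearDown();
    } catch (...) {
      passed = false;  // a throwing tearDown() also fails the test
    }
  }
  return passed;
}
```

With this shape, a test whose body throws still has its resources released before the next test starts, which matches the commit's stated goal of preventing resource leaks and keeping tests synchronized. A test whose SetUp() throws skips TearDown() because there is nothing to clean up.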


1046 Commits