Commit Graph

1046 Commits

Author SHA1 Message Date
Ubuntu
6d8fb00a91 add extra signal/wait and avoid local flush 2026-04-09 15:58:07 +00:00
Ubuntu
3f2ade22cb add barrier 2026-04-07 01:40:15 +00:00
Ubuntu
812f6cfded fix hang on 4 ranks and make send/recv test more like nccl-test 2026-04-07 01:33:48 +00:00
Ubuntu
1a065dd6ad add help scripts 2026-04-06 20:06:21 +00:00
Ubuntu
2c3f125d4c add changes from ib and connection 2026-04-06 03:29:54 +00:00
Ubuntu
e487f831e6 debug 2026-04-06 03:01:30 +00:00
Ubuntu
ad56728c6d fix 2026-04-06 02:32:58 +00:00
Ubuntu
8cecfee270 debug 2026-04-06 02:24:23 +00:00
Ubuntu
07d97f6f17 Unique QP per channel and env-controlled GID index
- Change executor to create one connection (unique QP) per channel entry
  instead of sharing connections per peer. This is required for HostNoAtomic
  IB mode where each connection can only forward signals to one semaphore
  via setSignalForwardingDst.

- Add MSCCLPP_IB_GID_INDEX environment variable to override the default
  GID index (3) used for IB transport. Set to the desired GID index value,
  or leave unset/-1 to use the default.
2026-04-06 02:18:56 +00:00
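
The override rule described in commit 07d97f6f17 can be sketched as follows. This is a minimal illustration, not the project's code: the function name `resolve_gid_index` is hypothetical, and treating an empty string like an unset variable and rejecting other negative values are assumptions the commit message does not state.

```python
import os

DEFAULT_GID_INDEX = 3  # default GID index stated in the commit message


def resolve_gid_index(env=None):
    """Return the IB GID index, honoring MSCCLPP_IB_GID_INDEX when it is set.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    raw = env.get("MSCCLPP_IB_GID_INDEX")
    if raw is None or raw.strip() == "":
        # Unset (or empty, by assumption): use the default.
        return DEFAULT_GID_INDEX
    value = int(raw)  # non-numeric input raises ValueError (assumed behavior)
    if value == -1:
        # -1 is documented as "use the default".
        return DEFAULT_GID_INDEX
    if value < 0:
        # Assumption: other negative values are rejected.
        raise ValueError(f"invalid MSCCLPP_IB_GID_INDEX: {raw!r}")
    return value
```

For example, `resolve_gid_index({"MSCCLPP_IB_GID_INDEX": "0"})` selects GID index 0, while an empty mapping falls back to 3.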
Ubuntu
251873ca8e update 2026-04-06 02:14:52 +00:00
Ubuntu
1e6d4939a8 update 2026-04-06 02:11:36 +00:00
Ubuntu
289f89ddfe update 2026-04-06 02:07:05 +00:00
Ubuntu
a4118eae73 update the number of instances 2026-04-06 02:06:37 +00:00
Ubuntu
b1cc649470 re-format output 2026-04-06 02:05:53 +00:00
Ubuntu
a191f16b76 add scripts 2026-04-06 02:04:49 +00:00
Ubuntu
d07a1ba28c show scale in output 2026-04-06 02:02:10 +00:00
Ubuntu
27fbddb707 update the executor so we have message size range 2026-04-06 02:00:04 +00:00
Ubuntu
49979e58ab tune #instances and remove extra barriers 2026-04-06 01:55:43 +00:00
Ubuntu
194a79f772 add sendrecv correctness check 2026-04-06 01:46:55 +00:00
Ubuntu
a4bb8fb4bf add debugging code 2026-04-06 01:37:18 +00:00
Changho Hwang
b04fa2daa7 lint 2026-04-04 06:22:04 +00:00
Changho Hwang
f62633ad41 mlx5dv bug fixes & enhanced unit tests perf reporting 2026-04-04 06:18:44 +00:00
Changho Hwang
53099a7cf9 Merge branch 'main' into chhwang/fix-ib-no-atomic 2026-04-01 22:45:58 -07:00
Binyang Li
be9126ca1b Fix run-remote.sh to support multi-command scripts (#770)
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, causing bash to
never see subsequent commands — only the first command would execute.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).


🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-01 16:25:19 -07:00
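
The stdin problem described in commit be9126ca1b (#770) can be reproduced without `mpirun`. The sketch below is an illustration only: it assumes `bash` is on `PATH`, uses `cat > /dev/null` as a stand-in for a child process that drains stdin, and the helper names are hypothetical rather than taken from `run-remote.sh`.

```python
import os
import subprocess
import tempfile

# "cat > /dev/null" stands in for mpirun: a child that reads whatever is left on stdin.
SCRIPT = "echo first\ncat > /dev/null\necho second\n"


def run_via_stdin(script):
    """Old approach: bash reads its commands from stdin, which its children share."""
    result = subprocess.run(["bash"], input=script, capture_output=True, text=True)
    return result.stdout


def run_via_temp_file(script):
    """Fixed approach: write the script to a temp file and pass the path to bash."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
        tmp.write(script)
        path = tmp.name
    try:
        result = subprocess.run(
            ["bash", "-euxo", "pipefail", path],
            stdin=subprocess.DEVNULL,  # children may read stdin; the script is unaffected
            capture_output=True,
            text=True,
        )
        return result.stdout
    finally:
        os.unlink(path)
```

With the stdin variant, `cat` swallows the `echo second` line before bash can read it, so only `first` is printed. With the temp-file variant both commands run, because bash reads the script from the file and not from the stream its children share.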
Changho Hwang
553fd3b2d8 lint 2026-04-01 21:20:55 +00:00
Changho Hwang
94d0508ec2 prerequisites update 2026-04-01 21:18:47 +00:00
Changho Hwang
ff4d825652 Merge branch 'main' into chhwang/fix-ib-no-atomic 2026-04-01 14:01:55 -07:00
Changho Hwang
848b89b59c 64-bit token reconstruction 2026-04-01 21:00:54 +00:00
Changho Hwang
4cf53328ad updates 2026-04-01 19:36:52 +00:00
Changho Hwang
f8e94d9971 disable mlx5dv_reg_dmabuf_mr 2026-04-01 19:00:03 +00:00
Changho Hwang
144046b818 revert 2026-04-01 18:22:16 +00:00
Changho Hwang
d1124fba29 revert 2026-04-01 18:20:29 +00:00
Changho Hwang
67f9933ba1 fix data direct 2026-04-01 10:20:43 +00:00
Changho Hwang
d2f7056cf4 Add unit testing framework readme (#766) 2026-04-01 05:30:35 +00:00
Binyang Li
4f3638b60d Use PTX red for D2D semaphore signal (#768)
## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency

### Performance (H100, 2 ranks, signal+wait round-trip)

```
SemaphorePerfTest.SignalPingPong:
  Store-based (old): 2.595 us/iter
  Red-based   (new): 2.345 us/iter
  Speedup:           1.11x
```

## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 15:34:43 -07:00
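
The 1.11x figure reported in commit 4f3638b60d (#768) follows directly from the two measured latencies. A short check, using only the numbers quoted in the commit message (the function name is hypothetical):

```python
def summarize(old_us, new_us):
    """Return (speedup, microseconds saved per iteration, percent latency reduction)."""
    speedup = old_us / new_us
    saved_us = old_us - new_us
    reduction_pct = 100.0 * saved_us / old_us
    return speedup, saved_us, reduction_pct


# Latencies from the commit message: store-based (old) vs red-based (new).
speedup, saved_us, reduction_pct = summarize(2.595, 2.345)
print(f"speedup: {speedup:.2f}x, saved: {saved_us:.3f} us/iter, reduction: {reduction_pct:.1f}%")
```

That is about 0.25 us saved per signal+wait round trip, or roughly a 9.6% reduction in latency.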
Ekow Wellington
fd76507e9a Install default plans under MSCCLPP_CACHE_DIR/default (#769)
### Summary
Update the installer to place bundled default execution plans under
`<MSCCLPP_CACHE_DIR>/default`, which is where the runtime already looks
for bundled plans.

### Background
The C++ runtime treats `MSCCLPP_CACHE_DIR` as the cache *root* and loads
bundled default plans from `<cache root>/default`.
When `MSCCLPP_CACHE_DIR` was set, the installer instead wrote bundled
plans directly into the cache root, causing the runtime to miss them.

This surfaced while running benchmarking tests with a non-default
`MSCCLPP_CACHE_DIR`, where the bundled plans were not being discovered.

### Change
This PR updates the installer to always install bundled default plans
into `<MSCCLPP_CACHE_DIR>/default`, preserving the existing runtime
contract.

### Scope
- Installer-only change
- No runtime behavior changes

### Validation
Manual inspection of the updated install path.
Successful build

---------

Co-authored-by: Ekow Wellington <t-ekoww@microsoft.com>
2026-03-31 14:27:33 -05:00
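
The path mismatch fixed in commit fd76507e9a (#769) can be sketched as follows. Both function names are hypothetical, and the example cache root is an arbitrary illustrative path; only the `<cache root>/default` layout comes from the commit message.

```python
from pathlib import Path


def runtime_default_plan_dir(cache_root):
    """Where the runtime looks for bundled plans: <cache root>/default."""
    return Path(cache_root) / "default"


def old_installer_plan_dir(cache_root):
    """Where the installer used to write when MSCCLPP_CACHE_DIR was set: the root itself."""
    return Path(cache_root)


cache_root = "/tmp/mscclpp-cache"  # illustrative value for MSCCLPP_CACHE_DIR
# The two locations differ, so the runtime could not find the installed plans.
mismatch = old_installer_plan_dir(cache_root) != runtime_default_plan_dir(cache_root)
```

After the fix, the installer's target is the same directory that `runtime_default_plan_dir` returns in this sketch.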
Changho Hwang
80f554ebaf Merge branch 'main' into chhwang/fix-ib-no-atomic 2026-03-26 18:02:43 -04:00
Copilot
93f6eeaa6b Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744)
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- Adds an `--exclude-perf-tests` flag and substring-based negative
filtering
- Adds an `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI
uploads to Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-03-24 23:34:38 -04:00
Binyang Li
5d18835417 Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764)
## Summary

Fix a use-after-free where the CUDA allocation handle
(`CUmemGenericAllocationHandle`) was released prematurely while the
exported fabric handle still referenced it.

## Problem

Unlike POSIX FD handles (where the kernel keeps the allocation alive via
the open file descriptor), fabric handles do not hold their own
reference to the underlying allocation. The original code called
`cuMemRelease(allocHandle)` immediately after exporting the fabric
handle, freeing the allocation. When a remote process later tries to
`cuMemImportFromShareableHandle` using that fabric handle, it references
a freed allocation — a **use-after-free**.

This affected both code paths:

1. **`GpuIpcMemHandle::create()`**: The local `allocHandle` obtained via
`cuMemRetainAllocationHandle` was released right after fabric export,
leaving the fabric handle dangling.
2. **`GpuIpcMemHandle::createMulticast()`**: The `allocHandle` from
`cuMulticastCreate` was unconditionally released, even when it was the
only thing keeping the multicast object alive for the fabric handle.

## Fix

- **Added `allocHandle` field** to the `fabric` struct in
`GpuIpcMemHandle` to store the allocation handle and keep it alive for
the lifetime of the `GpuIpcMemHandle`.
- **`create()`**: Retain an additional reference via
`cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when
a fabric handle is successfully exported.
- **`createMulticast()`**: Store the `allocHandle` directly in
`fabric.allocHandle` instead of unconditionally releasing it. Only
release if fabric export was not used.
- **`deleter()`**: Release `fabric.allocHandle` via `cuMemRelease` when
the handle type includes `Fabric`, ensuring proper cleanup.
- **`GpuIpcMem` constructor (importer side)**: Clear
`fabric.allocHandle` after importing, since the importer gets its own
handle via `cuMemImportFromShareableHandle` and should not release the
exporter's allocation handle.

## Files Changed

- `src/core/include/gpu_ipc_mem.hpp` — Added
`CUmemGenericAllocationHandle allocHandle` to fabric struct.
- `src/core/gpu_ipc_mem.cc` — Retain/release allocation handle properly
across create, createMulticast, deleter, and importer paths.
2026-03-19 11:52:09 -07:00
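
The lifetime bug described in commit 5d18835417 (#764) can be illustrated with a toy reference-counting model. None of the classes below are CUDA APIs; they are hypothetical stand-ins that model the case where the released handle was the only reference keeping the allocation alive.

```python
class Allocation:
    """Toy refcounted allocation (stand-in for a CUDA allocation handle)."""

    def __init__(self):
        self.refcount = 1  # the creator holds the first reference

    def retain(self):
        self.refcount += 1

    def release(self):
        self.refcount -= 1

    @property
    def alive(self):
        return self.refcount > 0


class FabricHandle:
    """Names an allocation without holding a reference of its own."""

    def __init__(self, alloc):
        self.alloc = alloc

    def import_remote(self):
        if not self.alloc.alive:
            raise RuntimeError("use-after-free: allocation already released")
        self.alloc.retain()  # the importer gets its own reference
        return self.alloc


def export_buggy(alloc):
    """Old behavior: release the allocation handle right after exporting."""
    handle = FabricHandle(alloc)
    alloc.release()
    return handle


class IpcMemHandle:
    """Fixed behavior: keep the allocation handle until the IPC handle is destroyed."""

    def __init__(self, alloc):
        self.fabric = FabricHandle(alloc)
        self.alloc_handle = alloc  # held for the lifetime of this object

    def close(self):
        self.alloc_handle.release()  # mirrors the release in the deleter


def import_fails(handle):
    """Return True if importing through `handle` hits the use-after-free."""
    try:
        handle.import_remote()
    except RuntimeError:
        return True
    return False
```

In this model, `import_fails(export_buggy(Allocation()))` is true, while importing through `IpcMemHandle(...).fabric` succeeds as long as the owning handle has not been closed.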
Changho Hwang
02005322a7 Merge branch 'copilot/remove-gtest-use-custom-framework' into chhwang/fix-ib-no-atomic 2026-03-18 14:04:20 -07:00
Changho Hwang
79a014976d updates 2026-03-18 20:30:18 +00:00
Changho Hwang
6082648f80 fix for npkit 2026-03-18 20:06:37 +00:00
copilot-swe-agent[bot]
bff76d5b85 Fix TearDown() handling and replace assert() in perf tests
Address review comments:
1. Ensure TearDown() is always called if SetUp() succeeds, even when
   TestBody() throws. This prevents resource leaks and maintains MPI
   synchronization between tests.
2. Replace assert() in fifo_perf_tests.cu with proper return false
   on validation failure, ensuring consistent test failure reporting.

Fixes:
- test/framework.cc: Track SetUp success and call TearDown in finally-style
- test/unit/fifo_perf_tests.cu: Replace assert with explicit check

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-03-18 19:44:11 +00:00
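
The TearDown() rule from commit bff76d5b85 (run teardown whenever setup succeeded, even if the test body throws) can be sketched with a try/finally. This is a minimal illustration in Python, not the code in `test/framework.cc`; `run_test` and `demo` are hypothetical names.

```python
def run_test(set_up, test_body, tear_down):
    """Run one test and return True if it passed.

    tear_down is called if and only if set_up completed without raising.
    """
    try:
        set_up()
    except Exception:
        return False  # setup failed: nothing was acquired, so nothing to tear down

    passed = True
    try:
        test_body()
    except Exception:
        passed = False  # report the failure, but still fall through to teardown
    finally:
        tear_down()
    return passed


def demo():
    """Record the call order for a test whose body throws."""
    calls = []

    def body():
        calls.append("body")
        raise RuntimeError("simulated test failure")

    passed = run_test(
        lambda: calls.append("setup"),
        body,
        lambda: calls.append("teardown"),
    )
    return passed, calls
```

Here `demo()` reports a failed test, yet `teardown` still appears in the call list after `body`. That ordering is what prevents resource leaks and keeps MPI ranks synchronized between tests.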
Changho Hwang
275622159c update 2026-03-18 02:32:21 +00:00
Changho Hwang
2297a3deda updates 2026-03-18 00:58:08 +00:00
Changho Hwang
5a65cc7aba debugging 2026-03-17 20:00:34 +00:00
Changho Hwang
d66d7e4743 debugging 2026-03-17 01:41:40 +00:00
Changho Hwang
a937ce4a8d debugging 2026-03-16 20:35:46 +00:00
Changho Hwang
2c4bab8359 fix 2026-03-16 18:37:57 +00:00
Changho Hwang
e2a9692674 fix merge 2026-03-11 21:04:45 +00:00