Commit Graph

972 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
7724e49f31 Fix lint and ROCm error alias
Agent-Logs-Url: https://github.com/microsoft/mscclpp/sessions/0f0e525d-a69c-4ff7-8913-983243b5cbf7

Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-05-13 03:26:53 +00:00
Binyang Li
0c09239b06 Merge branch 'main' into binyli/mnnvl 2026-05-12 20:17:02 -07:00
Binyang Li
224b3deb84 Clean up completed communicator receives
Erase completed receive bookkeeping from the communicator once the deferred receive future finishes, while preserving ordered receive chaining for repeated rank/tag operations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 01:22:51 +00:00
Caio Rocha
0c9b9abfd5 Adding Support for 4-Node AllReduce at Small Message Sizes (#794)
Results on 4 H200 nodes:

| Size | NCCL  | MSCCL++ 57TB | MSCCL++ 29TB |
|------|-------|--------------|--------------|
| 8K   | 45.75 | 17.74        | 18.18        |
| 16K  | 47.08 | 18.9         | 18.42        |
| 32K  | 47.29 | 19.48        | 19.12        |
| 64K  | 50.34 | 20.51        | 19.29        |
| 128K | 59.65 | 21.37        | 20.25        |
| 256K | 87.46 | 23.87        | 23.51        |
| 512K | 106.55| 29.15        | 29.51        |
| 1M   | 115   | 40.64        | 41.83        |
| 2M   | 135.89| 63.73        | 70.45        |
| 4M   | 177.59| 121.76       | 128.79       |
| 8M   | 251.17| 228.5        | 251.36       |

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-05-12 13:45:55 -07:00
Binyang Li
825fc124a5 address hang issue 2026-05-09 03:16:33 +00:00
Binyang Li
e208cc326b WIP 2026-05-08 04:30:05 +00:00
Binyang Li
5516bdbb6b fix 2026-05-08 04:22:50 +00:00
Binyang Li
654bcfa6ba update 2026-05-08 03:54:32 +00:00
Binyang Li
9ff7e1c2c3 update 2026-05-08 03:43:34 +00:00
Binyang Li
113d859d13 fix 2026-05-08 03:29:05 +00:00
Binyang Li
d1b04a3b26 NVLS zero-copy allreduce: support FP16 accumulator for FP8 inputs
multimem.ld_reduce on FP8 inputs accumulates in FP32 by default. The
ISA also exposes an .acc::f16 variant that keeps the reduction in
FP16, which is faster but lower precision. Plumb AccumT through:

- include/mscclpp/switch_channel_device.hpp:
  Extend SwitchChannelDeviceHandle::multimemLoadReduce with an optional
  AccumT template parameter. When VectorType is one of the FP8 vector
  types (f8_e4m3x{4,8,16} / f8_e5m2x{4,8,16}) and AccumT is __half,
  emit the .acc::f16 form of the instruction; otherwise unchanged.

- src/ext/collectives/include/allreduce/common.hpp:
  Make handleMultiLoadReduceStore template on AccumT and forward it to
  multimemLoadReduce<vectorType, AccumT>(...).

- src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:
  Template allreduceNvls and NvlsAdapter on AccumT and forward to
  handleMultiLoadReduceStore<T, AccumT>; the existing dispatch<>
  machinery already plumbs AccumT through from the algorithm context.
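The dispatch rule being plumbed through can be summarized in a small illustrative sketch (Python, not the actual C++ templates): given a vector type and an accumulator type, pick which `multimem.ld_reduce` form would be emitted. Type-name strings mirror the commit; the mapping itself is an assumption for illustration.

```python
# Illustrative sketch of the selection rule described above (the real
# dispatch is a C++ template specialization, not a runtime lookup).
FP8_VECTOR_TYPES = {
    "f8_e4m3x4", "f8_e4m3x8", "f8_e4m3x16",
    "f8_e5m2x4", "f8_e5m2x8", "f8_e5m2x16",
}

def ld_reduce_variant(vector_type: str, accum_type: str = "float") -> str:
    """Return which multimem.ld_reduce form the dispatch would pick."""
    if vector_type in FP8_VECTOR_TYPES and accum_type == "__half":
        # FP8 inputs + __half accumulator: faster, lower-precision reduction
        return "multimem.ld_reduce(.acc::f16)"
    # Default: FP8 accumulates in FP32; all other types unchanged
    return "multimem.ld_reduce"
```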

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-07 00:38:31 +00:00
Binyang Li
7d80a33360 Default torch example SYMMETRIC_MEMORY env to 1
The non-symmetric rsag_zero_copy path uses an incrementing tag in its
context key, so cross-rank memory registration handshakes happen on
every call rather than being cached. On a single host with 8 GPUs and
sizes >= 512 KB this becomes the only candidate (since nvls_zero_copy
is filtered out without symmetric memory) and degrades into an apparent
hang. Defaulting SYMMETRIC_MEMORY=1 lets a plain `mpirun ...`
invocation work out of the box; users can still override with
`SYMMETRIC_MEMORY=0` to exercise the non-symmetric path.
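The defaulting behavior amounts to "1 unless explicitly set". A minimal sketch of that convention (the env var name is from the commit; the helper function is illustrative, not the example's actual code):

```python
import os

def symmetric_memory_enabled(env=os.environ) -> bool:
    # Default SYMMETRIC_MEMORY to 1; users can still pass
    # SYMMETRIC_MEMORY=0 to exercise the non-symmetric path.
    return env.get("SYMMETRIC_MEMORY", "1") != "0"
```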

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 23:43:37 +00:00
Binyang Li
e8caab7c8e Strip preflight validation blocks from NVLS pipeline allreduce kernels
allreduce_nvls_block_pipeline.cu and allreduce_nvls_warp_pipeline.cu
were carrying ~45 lines of per-call invariant-checking added during the
MNNVL work. Restore main's simple defaulting pattern (just `if
(==0) set defaults`); incorrect inputs will manifest as CUDA errors via
the existing error-handling path. Also drop the unreachable
`6 * ipcDomainNranks > NUM_SEMAPHORES` throw in the block_pipeline
initialize (max ipcDomainNranks=72, NUM_SEMAPHORES=512), the now-unused
`<mscclpp/errors.hpp>` include, and trim the verbose comments around
`nBaseChannels_` sizing in both files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 23:04:41 +00:00
Binyang Li
639b80de7b Tie AllreduceAllpairPacket maxBlockNum_ to MAX_IPC_DOMAIN_NRANKS - 1
The hard-coded 72 was off by one from what the comment claims is the
minimum (MAX_IPC_DOMAIN_NRANKS - 1 = 71). Express the value via the
constant so the relationship is self-documenting and any future change
to MAX_IPC_DOMAIN_NRANKS propagates automatically.
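The off-by-one relationship reads directly off the constant; a tiny sketch (Python, with names echoing the C++ identifiers from the commit):

```python
# Illustrative: deriving the block budget from the constant keeps the
# "minimum = MAX_IPC_DOMAIN_NRANKS - 1" relationship self-documenting.
MAX_IPC_DOMAIN_NRANKS = 72  # NVL72 bound from the surrounding commits

max_block_num = MAX_IPC_DOMAIN_NRANKS - 1  # 71, not a hard-coded 72
```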

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 22:31:15 +00:00
Binyang Li
095cfff11d Revert RSAG nBlocks default to 64
The 128-block default fires only when the caller passes nBlocks=0
(i.e. no tuning). Tuning explicitly drives nBlocks via the adapter, so
the historical default of 64 is fine. Keep nChannelsPerConnection_=128
so the tuner can still request up to 128 blocks for MNNVL configs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 22:23:18 +00:00
Binyang Li
bde23ce38e Revert verbose RSAG zero-copy comment; rename NRanksPerNode template param
- Restore the original two-line note about the templated peer-loop
  unrolling instead of the multi-paragraph rationale block.
- Rename the kernel template parameter from NRanksPerNode to NRanks.
  The IPC domain can span multiple physical hosts under MNNVL, so the
  'PerNode' suffix is misleading; NRanks matches the runtime
  ipcDomainNranks parameter that drives template dispatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 22:16:08 +00:00
Binyang Li
f0c6ac081f Fold validateIpcDomainSpansWorld into getIpcDomainNranks
getIpcDomainNranks now performs the range / world-size / rank checks
itself and throws on violation, so the separate
validateIpcDomainSpansWorld helper is unnecessary. Update the 3 NVLS
callsites (block_pipeline, warp_pipeline, nvls_zero_copy) to call
getIpcDomainNranks directly. The non-NVLS callers also pick up the
strict validation, which is fine because they are only invoked in
single-host or multi-host MNNVL scenarios where worldSize ==
ipcDomainNranks (the NCCL adapter's multi-node path returns nullptr,
falling back to NCCL/RCCL).
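The folded-in checks can be sketched as follows (Python, illustrative only; the real helper is C++ and throws Error(InvalidUsage)). Parameter names and the exact error messages are assumptions:

```python
MAX_IPC_DOMAIN_NRANKS = 72

def get_ipc_domain_nranks(world_size: int, rank: int,
                          ipc_domain_nranks: int) -> int:
    # Range check: domain size must be in [2, MAX_IPC_DOMAIN_NRANKS].
    if not (2 <= ipc_domain_nranks <= MAX_IPC_DOMAIN_NRANKS):
        raise ValueError("ipcDomainNranks out of range")
    # NVLS invariant: the IPC domain must span the whole communicator.
    if world_size != ipc_domain_nranks:
        raise ValueError("IPC domain must span the whole communicator")
    # Sanity: this rank must fall inside the domain.
    if rank >= ipc_domain_nranks:
        raise ValueError("rank outside IPC domain")
    return ipc_domain_nranks
```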

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 21:49:48 +00:00
Binyang Li
307a471888 Shorten verbose comments and use THROW in validateIpcDomainSpansWorld
- Collapse the duplicated 3-line warp-strided-load comment in 5 kernels
  (allgather_fullmesh, allreduce_fullmesh, allreduce_packet,
  allreduce_nvls_zero_copy, allreduce_nvls_warp_pipeline) into a single
  one-line 'Peer count may exceed WARP_SIZE on MNNVL.' note.
- Drop the algName parameter from validateIpcDomainSpansWorld; switch
  its 3 throws to use the THROW logger macro (LogSubsys::ALGO), which
  already captures file/line/function. Update the 3 callsites
  (nvls_block_pipeline, nvls_warp_pipeline, nvls_zero_copy) and trim the
  Doxygen comment accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 21:37:09 +00:00
Binyang Li
4a0d5b29d5 Simplify torch-integration tuning example
- Drop the multi_host_mnnvl-specific rsag fallback in _default_ar_config;
  fall through to default_allreduce_packet when NVLS is unavailable.
- Add SYMMETRIC_MEMORY env var so the tuning sweep can include the
  zero-copy NVLS / RSAG candidates without editing the source.
- Make _algo() raise on miss (direct dict lookup) and drop the
  defensive 'if a:' guards in _ar_candidates / _ag_candidates /
  _default_ar_config; merge existence checks into the platform
  conditions (self._nvls, self.symmetric_memory).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 21:14:36 +00:00
Binyang Li
905b23d9a8 Drop non-MNNVL multi_node regime from torch-integration example
The example is now MNNVL-only: a run is either single-host (everything
fits in one node) or multi-host MNNVL (one cross-host NVLink domain).
Plain multi-node-without-MNNVL had its own algorithm branch that this
example will never exercise, so remove the multi_node flag and the
intermediate mnnvl_domain variable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 19:00:22 +00:00
Binyang Li
9aeeaf0f12 Simplify torch-integration tuning example for MPI-only multi-node testing
Use mpi4py for bootstrap and local-rank discovery; drop the torchrun /
gloo / manual MSCCLPP_MASTER_ADDR paths and the netifaces dependency.
Add MNNVL/multi-node algorithm selection (rsag, rsag_zero_copy,
nvls_zero_copy) and route barrier / timing-sync allreduces through the
configured symmetric_memory flag so they work across hosts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 18:51:29 +00:00
Mahdieh Ghazi
822fbb2351 Adding necessary macros for enabling mrc support (#797)
This PR adds the necessary macros and instructions for enabling mrc support without atomics.
2026-05-05 17:17:41 -04:00
Binyang Li
6296803d87 Make NVLS non-zero-copy allreduce algorithms MNNVL-ready
Both default_allreduce_nvls_warp_pipeline and default_allreduce_nvls_block_pipeline
were only partially MNNVL-aware: their kernels had been updated to use
ipcDomainNranks (with shared-memory channel arrays sized for the global
NVLink-domain bound), but the host-side context init still hard-coded
ctx->ipcDomainNranks = bootstrap->getNranksPerNode(). On a fully populated MNNVL
fabric (e.g. NVL72 where world == ipcDomainNranks but the per-physical-host
nranksPerNode is much smaller), this mismatched the multicast group span and
produced wrong/missing data plus out-of-bounds scratch indexing.

Changes:
- Rename MAX_NRANKS_PER_NODE -> MAX_IPC_DOMAIN_NRANKS to match the rest of the
  IPC-domain naming (getIpcDomainNranks, ipcDomainNranks,
  MSCCLPP_IPC_DOMAIN_NRANKS env var). Pure rename, no semantic change.
- Add validateIpcDomainSpansWorld(comm, algName) helper in collective_utils
  that wraps getIpcDomainNranks() and asserts the IPC-domain == whole-comm
  invariant required by NVLS algorithms (worldSize == ipcDomainNranks,
  rank < ipcDomainNranks, ipcDomainNranks in [2, MAX_IPC_DOMAIN_NRANKS]),
  throwing Error(InvalidUsage) on violation and returning the validated value.
- nvls_zero_copy / nvls_block_pipeline / nvls_warp_pipeline initialize() each
  now call the helper instead of repeating the same ~20-line check inline.
- initAllreduceContext() in both pipelines now uses getIpcDomainNranks(comm)
  instead of bootstrap->getNranksPerNode().
- Per-peer base channel allocation (nBaseChannels_) is sized in initialize() as
  max(64, 4*ipc) for block pipeline and max(64, 8*ipc) for warp pipeline so
  the kernel's per-block channel addressing remains in-bounds at NVL72 scale.
- Block pipeline initialize() also asserts 6*ipcDomainNranks <= NUM_SEMAPHORES.
- allreduceKernelFunc() in both pipelines now validates launch shape and the
  user-supplied scratch buffer size before launching, returning
  CommInvalidArgument with a clear WARN on mismatch:
  - Block: nBlocks must equal 5*ipcDomainNranks (structurally required by the
    kernel's three-phase block partition), nThreads == 1024, inputSize aligned
    to (ipc * 16) bytes, scratchSizePerBlock >= unitSize.
  - Warp: nBlocks >= NUM_NVLS_CONNECTION and a multiple of it (kernel does
    nBlocks / NUM_NVLS_CONNECTION partitioning of the multicast handles),
    2*nBlocks <= nBaseChannels_, nThreads == 1024 (32 warps hard-coded in the
    bar.sync member counts), inputSize divisible by ipcDomainNranks,
    scratchSizePerBlock >= copyPerIter.
- Default nBlocks for warp pipeline is rounded up to a multiple of
  NUM_NVLS_CONNECTION so the structural constraint holds for any
  ipcDomainNranks.
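The block-pipeline launch-shape checks above can be sketched host-side (Python, illustrative; names and the returned error string approximate the CUDA host code):

```python
def validate_block_pipeline_launch(n_blocks, n_threads, input_size,
                                   scratch_per_block, unit_size,
                                   ipc_nranks):
    # nBlocks must equal 5 * ipcDomainNranks (three-phase block partition).
    if n_blocks != 5 * ipc_nranks:
        return "CommInvalidArgument"
    if n_threads != 1024:
        return "CommInvalidArgument"
    # inputSize must align to (ipcDomainNranks * 16) bytes.
    if input_size % (ipc_nranks * 16):
        return "CommInvalidArgument"
    if scratch_per_block < unit_size:
        return "CommInvalidArgument"
    return "ok"
```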

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 04:41:14 +00:00
Binyang Li
987f80025a Merge remote-tracking branch 'origin/main' into binyli/mnnvl 2026-05-04 23:53:25 +00:00
Binyang Li
9ec26fa4d1 Reset GPU tokens before reuse (#795)
Fixes a token-reuse bug in `TokenPool` that's independent of MNNVL.

## Bug

`TokenPool` hands out 8-byte device-memory slots used as
device-semaphore counters. The deleter only cleared the bitmap — the
underlying GPU memory was left as-is. When a token was freed and later
re-allocated, the new semaphore inherited the previous counter value
instead of starting at 0, breaking subsequent `signal()/wait()` math.

## Fix

* Add a synchronous `gpuMemset` host helper (mirrors `gpuMemcpy` /
`gpuMemcpyAsync`).
* Zero the slot inside the `TokenPool` deleter so recycled tokens hand
out a clean counter. The very-first allocation is already zeroed by
`gpuCallocPhysical` (`src/core/gpu_utils.cc:227-228`), so first-time
tokens are also clean — the deleter only has to handle the recycle case.

## Notes

* Public wrapper is named `mscclpp::gpuMemset` (not `mscclpp::memset`)
for symmetry with `gpuMemcpy` and to avoid shadowing `std::memset` in
TUs that pull the namespace in.
* Zeroing happens on release rather than acquire so the cost is paid in
the typically less perf-sensitive teardown path.
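The recycle-case fix can be simulated host-side. A minimal sketch (Python, with host memory standing in for device memory; the real code zeroes the slot via the new `gpuMemset` inside the deleter):

```python
class TokenPool:
    SLOT = 8  # 8-byte device-semaphore counters

    def __init__(self, n_slots: int):
        # gpuCallocPhysical zeroes the buffer, so first-time tokens are clean.
        self.mem = bytearray(n_slots * self.SLOT)
        self.free = list(range(n_slots))

    def get_token(self) -> int:
        return self.free.pop()

    def release(self, slot: int) -> None:
        # The deleter's job: scrub the slot on release so a recycled token
        # always starts its signal()/wait() counter from zero.
        off = slot * self.SLOT
        self.mem[off:off + self.SLOT] = bytes(self.SLOT)
        self.free.append(slot)
```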

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-04 15:11:47 -07:00
Binyang Li
9a36884369 Rename gpuMemset wrapper and zero TokenPool slots in deleter
Two follow-ups to commit 7bc5e040:
  * Rename mscclpp::memset to mscclpp::gpuMemset for symmetry with
    gpuMemcpy / gpuMemcpyAsync, and avoid shadowing std::memset for
    callers that pull the namespace in. Also add the missing doc
    comment.
  * Move the per-slot zeroing from getToken() into the deleter so the
    cost is paid on release rather than acquire. This is safe because
    gpuCallocPhysical already zeros the underlying buffer at TokenPool
    construction, so first-time tokens are clean and recycled tokens
    are scrubbed on release.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 04:28:36 +00:00
Binyang Li
7bc5e0406b Reset GPU tokens before reuse
Clear recycled TokenPool entries before handing them out so device-to-device semaphores start from a clean counter value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 03:19:31 +00:00
Binyang Li
1c29817566 Revert AllreduceRsAgZeroCopy non-symmetric ctx key tag back to ++tag
Commit 533f3299 dropped the static tag counter from
generateAllreduceContextKey, causing every non-symmetric call to
return the same key (zero) and reuse a stale context. Restore the
pre-MNNVL behavior of returning a unique key per non-symmetric call
so the context cache rebuilds when buffers change.
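The restored behavior boils down to a monotonic tag per non-symmetric call. A sketch (Python; the real counter is a static int in the C++ key generator):

```python
import itertools

_tag = itertools.count(1)  # stands in for the static ++tag counter

def generate_allreduce_context_key(symmetric: bool) -> int:
    if symmetric:
        return 0          # symmetric buffers: one cached context is safe
    return next(_tag)     # non-symmetric: unique key forces a fresh context
```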

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 23:40:11 +00:00
Binyang Li
2efda4d819 Restore compile-time templated NRanksPerNode for rsag_zero_copy
Recovers the per-thread int4 register array + #pragma unroll for the
{4, 8} rank cases. All NPeers remote reads are issued in parallel so
their latency overlaps instead of being serialized by the runtime
fused load+reduce loop. The runtime-domain (NVL72) fallback is
removed; the algo now returns cudaErrorInvalidValue for unsupported
ipcDomainNranks, and rsag_zero_copy is dropped from the MNNVL
candidate list in the tuning example.
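The resulting dispatch policy can be sketched as (Python, illustrative; in CUDA the supported cases select a compile-time `NRanksPerNode` specialization):

```python
# rsag_zero_copy supports only the compile-time specialized rank counts;
# anything else is rejected instead of falling back to a runtime-loop kernel.
SUPPORTED_NRANKS = (4, 8)

def dispatch_rsag_zero_copy(ipc_domain_nranks: int) -> str:
    if ipc_domain_nranks not in SUPPORTED_NRANKS:
        return "cudaErrorInvalidValue"
    # Selects kernel<NRanksPerNode=4> or <NRanksPerNode=8>, whose unrolled
    # peer loop issues all remote int4 reads in parallel.
    return f"kernel<NRanksPerNode={ipc_domain_nranks}>"
```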

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 23:09:22 +00:00
Binyang Li
2a2fca8a58 Rename collective ctx/kernel param nRanksPerNode to ipcDomainNranks
The AlgorithmCtx field and the kernel/host parameters that hold the
collective's IPC peer-group size were named nRanksPerNode, which is
misleading on Multi-Node NVLink (where the value spans multiple hosts)
and on AMD (where the relevant fabric is XGMI, not NVLink). Rename to
ipcDomainNranks throughout the collective algorithms to match the
neutral naming introduced for the env helper.

Scope intentionally limited to src/ext/collectives/. The following are
left untouched on purpose:
  - Bootstrap::getNranksPerNode() — physical-host detection, semantics
    unchanged.
  - Algorithm::Constraint::nRanksPerNode (public API in
    include/mscclpp/algorithm.hpp) and the DSL plan config in
    algorithm_collection_builder.cc — these describe a plan's required
    physical topology.
  - NCCL adapter (src/ext/nccl/) — preserves NCCL ABI compatibility.
  - MAX_NRANKS_PER_NODE — sizing constant for shared-memory arrays.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 19:06:07 +00:00
Binyang Li
45a651b2c8 Decouple IPC-domain hint from bootstrap nRanksPerNode
Replace MSCCLPP_MNNVL_NRANKS_PER_NODE (which overrode TcpBootstrap and
silently changed getNranksPerNode() for every consumer) with a single
algorithm-level helper getIpcDomainNranks(comm) backed by a new
MSCCLPP_IPC_DOMAIN_NRANKS env. The neutral IPC name covers both NVLink/
MNNVL on NV and XGMI on AMD. Bootstrap is unchanged and continues to
report physical-host detection.

Collapse the two getCollectiveDomainNranksPerNode overloads into one
canonical helper and route all six allreduce algos (packet,
allpair_packet, nvls_packet, nvls_zero_copy, rsag, rsag_zero_copy)
through it. Update the standalone tuning example to use the new env
name; drop the undeclared MSCCLPP_ENABLE_MNNVL gate; fix
multi_host_mnnvl detection now that nranks_per_node is no longer
overridden by the bootstrap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:27:17 +00:00
Binyang Li
2c52937b26 Fix FP8 ROCm build/test issues and dtype naming (#792)
## Summary
- Fix ROCm FP8 build failure by using the actual FP8 `DataType` enum
constants in allreduce packet tuning.
- Fix FP8 E4M3FNUZ test encoding so small negative values do not produce
the FNUZ NaN byte (`0x80`).
- Align FP8 `DataType` enum constants and Python bindings with
torch-style names (`FLOAT8_E4M3FN`, `FLOAT8_E4M3FNUZ`, `FLOAT8_E5M2FNUZ`
/ `float8_e4m3fn`, `float8_e4m3fnuz`, `float8_e5m2fnuz`).

## Validation
- `./tools/lint.sh`
- `make -j` from `build/`
- `mpirun --allow-run-as-root -np 8 python3 -m pytest
python/test/test_fp8_accum.py -q` (`36 passed, 9 skipped`)
- `DTYPE=float8_e4m3fnuz ACCUM_DTYPE=float32 torchrun --nnodes=1
--nproc_per_node=8
examples/torch-integration/customized_comm_with_tuning.py`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 15:02:22 -07:00
Binyang Li
533f329971 Tune no-sym MNNVL with RSAG zero-copy
Disable NVLS zero-copy when symmetric memory is not enabled, and allow the RSAG zero-copy path to participate in MNNVL tuning for non-symmetric memory. Cache RSAG zero-copy contexts by the concrete buffer pointers so CUDA graph capture does not create a new registration for every execute call, and cap requested blocks at the channel count.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 16:23:23 +00:00
Binyang Li
3bc00cb7f0 Enable NVLS zero-copy without symmetric memory flag
Allow default_allreduce_nvls_zero_copy to run when the public symmetric_memory flag is false; the algorithm already binds the concrete input and output allocations in its context. Include that fast path in MNNVL tuning and bound allpair/NVLS packet candidates to small sizes so large-message no-symmetric tuning avoids slow or unsafe packet variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 08:24:49 +00:00
Binyang Li
865c2bc795 Optimize MNNVL allreduce without symmetric memory
Run the tuning example with symmetric memory disabled, make allreduce tuning use the same symmetric-memory mode as execution, and narrow the MNNVL small-message candidate set to avoid slower packet/NVLS choices. Increase packet and RSAG channel parallelism so non-symmetric CUDA-IPC paths can use 112-block packet and 128-block RSAG configs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 07:55:52 +00:00
Binyang Li
dded5e0e39 Improve MNNVL allreduce tuning performance
Add the allpair packet algorithm to the MNNVL small-message candidate set and enable zero-copy NVLS/RSAG candidates for larger symmetric-memory allreduce benchmarks. Run the standalone tuning example with symmetric memory so RawGpuBuffer-backed tensors can use the zero-copy paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 06:41:17 +00:00
Binyang Li
893a08e69c Enable MNNVL allreduce tuning
Add an MNNVL rank-domain override so MSCCL++ collectives can treat multi-host NVLink fabrics as a single CUDA IPC/NVLS peer group. Update packet, RSAG, and NVLS allreduce paths to use the collective domain size and teach the torch integration tuning example to select MNNVL-capable allreduce algorithms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 05:38:59 +00:00
Binyang Li
dd8b301a65 Scale native allreduce/allgather algos for MNNVL/MNNVLS
Bump MAX_NRANKS_PER_NODE from 8 to 72 to cover Multi-Node NVLink (MNNVL)
domains up to GB200 NVL72, and bump NUM_SEMAPHORES from 64 to 512 to
accommodate semaphore indexing that grows as O(nRanksPerNode).

Convert allreduce_rsag_zero_copy from a compile-time-templated kernel
({4,8} ranks) to a runtime nRanksPerNode kernel; fuse load+reduce inside
the peer loop to avoid O(NPeers) register pressure that would otherwise
spill at NVL72 scale.

Bump AllreduceAllpairPacket::maxBlockNum_ from 28 to 72 so the adapter
can launch >= nPeers blocks at MNNVL scale.

Fix a shared-memory channel-cache bug across five kernels:
nvls_zero_copy, nvls_warp_pipeline, packet, allreduce_fullmesh, and
allgather_fullmesh. The original 'if (lid < nPeers) channels[lid] = ...'
load only populated the first WARP_SIZE entries, but threads from
multiple warps later read channels[threadIdx.x] up to nPeers-1. Replace
with a per-warp strided loop so every warp loads all entries before
__syncwarp(); the same-value cross-warp writes are benign.
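The bug and its fix can be reproduced in a host-side simulation (Python, illustrative; lanes stand in for CUDA threads):

```python
WARP_SIZE = 32

def buggy_fill(n_peers):
    channels = [None] * n_peers
    # 'if (lid < nPeers) channels[lid] = ...': only one warp's lanes write,
    # so entries >= WARP_SIZE are never populated.
    for lid in range(WARP_SIZE):
        if lid < n_peers:
            channels[lid] = lid
    return channels

def strided_fill(n_peers):
    channels = [None] * n_peers
    # Per-warp strided loop: each lane walks lane, lane+32, ... so every
    # warp loads all entries before __syncwarp(); duplicate cross-warp
    # writes of the same value are benign.
    for lane in range(WARP_SIZE):
        for idx in range(lane, n_peers, WARP_SIZE):
            channels[idx] = idx
    return channels
```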

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-27 20:36:58 +00:00
Changho Hwang
c97be492d5 GDRCopy status message to string (#793) 2026-04-27 10:32:20 -07:00
Copilot
e874bf1666 fix: isCuMemMapAllocated crashes on non-NVLS systems even with MSCCLPP_FORCE_DISABLE_NVLS=true (#790)
- [x] Fix `isCuMemMapAllocated()` to just return `true/false` without
throwing when NVLS is not supported
- [x] Fix `isNvlsSupported()` caching bug where `result`/`isChecked`
were never updated
- [x] Restore `[[maybe_unused]]` on `result` and `isChecked` statics —
needed in HIP/ROCm env where `CUDA_NVLS_API_AVAILABLE` is not defined
and the variables would otherwise be unused
- [x] Run linter (`./tools/lint.sh`)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-04-22 10:12:40 -07:00
Binyang Li
eeea00b298 Support python wheel build (#787)
## Support Python wheel build

This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.

### Changes

**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10

**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case

**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10

**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10

### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 21:24:45 -07:00
Binyang Li
572028ea3d Fix nccl-test CI building for all GPU architectures (#786)
## Problem

`nccl-test.yml` was the only CI template calling `deploy.yml` without
passing `gpuArch`. Since the CI build machine has no GPU, CMake fell
back to building for **all** supported architectures (`80;90;100;120`),
unnecessarily slowing down CI builds.

## Fix

- Add `gpuArch` parameter to `nccl-test.yml` and forward it to
`deploy.yml`
- Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from
`nccl-api-test.yml`

All other templates were already passing `gpuArch` correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-15 12:55:40 -07:00
Binyang Li
ecd33722d4 Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)
## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) for CUDA init failure, and `setup.sh`
conditionally adds compat libs only when needed. This fixes
`cudaErrorSystemNotReady` (error 803) when the host driver is newer than
the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with
tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 21:51:29 -07:00
Caio Rocha
b6d0ca13ca Adding CI Test to DSL Executor (#782) 2026-04-13 13:55:45 -07:00
Caio Rocha
b59e6d7f00 Updating NpKit (#785) 2026-04-13 13:36:42 -07:00
Binyang Li
5380a4ac6e Add MSCCLPP_IB_GID_INDEX env (#780)
Use MSCCLPP_IB_GID_INDEX to control the IB GID index

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 09:59:42 -07:00
Caio Rocha
feda338595 Adjusting Torch Integration Example (#779)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-04-10 13:57:14 -07:00
Changho Hwang
d63f9403c0 IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753)
Major enhancements to the IB signal forwarding mechanisms
(`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5
Direct Verbs, and refactoring the signal forwarding path for IB
HostNoAtomic mode. The changes fix memory consistency issues and reduce
signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
2026-04-09 09:24:30 +00:00
Caio Rocha
a7273047e9 Fix TBG on DSL Get Operation (#778) 2026-04-08 17:02:07 -07:00
Caio Rocha
3e5c41c98a Adding Channel Type in ReduceSend Operation on DSL (#777)
The reduce send operation in DSL essentially combines the reduce and put
operations. The put operation carries the information about the channel
type, whereas previously we were using the channel type from the reduce
operation.
2026-04-08 16:59:08 -07:00