Commit Graph

53 Commits

Author SHA1 Message Date
Caio Rocha
17247cd695 DSL Quick Start (#689)
Fix #675

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-11-21 14:45:49 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Binyang Li
5acac93dbc Integrate MSCCL++ DSL to torch workload (#620)
Provides two integration ways for MSCCL++ DSL.
1. Integrate with customized communication group
2. Integrate with NCCL API

Introduce new Python APIs to make it work:
```python
mscclpp.compile # compile dsl to json based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery
mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size....
```
Fix #556

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-29 15:39:00 -07:00
Changho Hwang
9994f53cea Fixes for no-IB systems (#667)
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; no need for DMABUF supported systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
2025-10-29 10:03:02 -07:00
Caio Rocha
d7b99e9c9d Improving DSL documentation (#650) 2025-10-23 17:50:33 -07:00
Changho Hwang
a48421872e Fix docs (#656)
* Fix Python doc generation
* Remove `ChannelTrigger` and fix `ProxyTrigger`
* Fixed package versions for consistency
2025-10-23 00:34:53 +00:00
Changho Hwang
b8f61cb761 Update the port channel tutorial doc (#653) 2025-10-21 11:52:15 -07:00
Binyang Li
ddca185add Address corner case when generating version file (#641)
Address corner case for version file generation

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
2025-10-07 14:32:33 -07:00
Binyang Li
3d94383696 Add MSCCLPP_GIT_COMMIT micro (#640)
- Add MSCCLPP_GIT_COMMIT micro
- Update docs
2025-10-06 15:57:28 -07:00
Qinghua Zhou
16a96ea77b Support detailed version tracking that captures git repository information (#639)
#### Version Format

The package version includes the git commit hash directly in the version
string for development builds:
- **Release version**: `0.7.0`
- **Development version**: `0.7.0.dev36+g6e2360d69` (includes short
commit hash)
- **Development with uncommitted changes**:
`0.7.0.dev36+g6e2360d69.dirty`

#### Checking Version Information

After installation, you can check the version information in several
ways:

**From Python:**
```python
import mscclpp

# Access individual attributes
print(f"Version: {mscclpp.__version__}")           # Full version with commit
Version: 0.7.0.dev36+g6e2360d69

# Get as dictionary
mscclpp.version()
{'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'}
```

#### Version Information Details

The version tracking captures:
- **Package Version** (`mscclpp.__version__`): Full version string
including git commit (e.g., `0.7.0.dev36+g6e2360d69`)

This information is embedded during the package build process and
remains accessible even after distribution, making it easier to debug
issues and ensure reproducibility.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-09-30 09:00:33 -07:00
Binyang Li
70b8297c56 Revise NCCL API implementation (#617)
- Make nccl interface extensible. Customer can register their own algo
to NCCL API. User can provide customized algo selection function.
- Fallback to NCCL/RCCL if no algo is selected based on algo selection
function
- MSCCLPP interfaces now works for any scale
2025-09-26 10:08:12 -07:00
Binyang Li
ba4c4aaeb8 Integrate MSCCL++ with torch workload (#626)
Integrate MSCCL++ with torch
Introduce `NCCL audit shim library`, use can use following commands to
launch torch library. Also avoid break build pipeline in the CPU machine
```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```
2025-09-09 13:28:32 -07:00
Caio Rocha
c3473b1794 Thread Block Group DSL (#621)
Supporting the creation of a group of thread block to perform some
operation.
2025-09-03 14:58:40 -07:00
Changho Hwang
2eadbaf86f python doc auto generation (#605)
Add Python API references
2025-08-11 10:34:29 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline 
- Clean code and enhance document
C++ side: 
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Changho Hwang
9650e5c37e Update documentation (#576)
Documentation overhaul
2025-08-07 15:37:37 -07:00
Binyang Li
5e991cf5c8 update readme & bump version (#550)
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-07-12 01:00:18 -07:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Wenxuan Tan
2151790463 Fix some typos in docs (#555) 2025-06-19 19:39:37 +00:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Binyang Li
a3d8d6807b Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support (#489)
Remove the requirement for
`CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support.

Fix #487
2025-03-28 16:46:54 -07:00
Qinghua Zhou
0f21ed44b8 Add CI test for fallback allgather, allreduce, broadcastand reducescatter to NCCL operations (#485)
Add CI test for fallback allgather, allreduce, broadcast, and
reducescatter to NCCL operations
Test following parameters:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE 
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce,
broadcast, reducescatter" or "all"
2025-03-27 21:13:07 +00:00
Changho Hwang
e4012ded48 Mark mscclpp-test as deprecated in the doc (#478) 2025-03-11 22:44:38 +00:00
Yang Wang
1ff217d5f3 Fix minor typos and errors in documentation (#474) 2025-02-28 17:46:24 -08:00
Binyang Li
7f3b088744 Add multi-nodes example & update doc (#455)
Documentation update:

*
[`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114):
Updated the link to the examples folder to reflect the correct path.

New example script:

*
[`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76):
Added a new example script demonstrating the allgather all-pairs
algorithm across multiple nodes using packet communication.

IR module improvements:

*
[`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472):
Refined the sorting criteria for GPU instance channels and thread block
channels to include the channel type, ensuring a more accurate order.
Debugging enhancements:

*
[`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441):
Added a debug log to indicate the start of communication collective
execution with details about the execution plan and collective.
*
[`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89):
Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for
logging executor-related information.
2025-01-31 17:52:15 -08:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Binyang Li
af0bb86e07 Merge mscclpp-lang to mscclpp project (#442)
First step to merge msccl-tools into mscclpp repo. In this step will
move all msccl related code, pass the current tests and do some
necessary refactor.

Add `mscclpp.language` module
Add `_InstructionOptimizer` and `DagOptimizer` class to optimize the dag
Add `DagLower` to lower dag to intermediate representation 
Add documents for mscclpp.language
Remove msccl related code
2025-01-22 09:47:37 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
3d6bfed2cf Update version number (#433)
Co-authored-by: github-actions <github-actions@github.com>
2025-01-02 16:45:08 -08:00
Binyang Li
863a599360 Disable CuMemMap check for ROCm (#411)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-17 08:36:25 +00:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support mote datatype for multicast operation
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which return std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Binyang Li
4136153a76 [Doc] mscclpp docs (#348)
Generate docs for mescclpp.
Setup github action to auto-deploy github-page
doc link here: https://microsoft.github.io/mscclpp

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-10-18 06:08:31 +00:00
Changho Hwang
40cb196553 v0.5.2 (#328) 2024-07-16 00:35:18 +00:00
caiomcbr
b1b9d0626c Support NCCL APIs (#319)
Start supporting NCCL APIs with a few limitations.

---------

Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-06-27 23:54:06 +00:00
Angelica Moreira
34f4d9d006 Update quickstart.md (#314)
Updating the docker image name tag and the python benchmark path.
2024-06-19 22:26:13 +00:00
Changho Hwang
cddffbc8b6 v0.5.1 (#308) 2024-05-26 14:31:29 -07:00
Changho Hwang
9c2a96060a v0.5.0 (#298) 2024-05-04 16:51:48 -07:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) 2024-03-27 11:53:09 -07:00
Changho Hwang
cdaf3aea3d New packet format & optimizations (#256)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Changho Hwang
f1605b73d6 v0.4.2 (#236) 2023-12-18 11:42:58 +08:00
Binyang Li
f1b2c9df12 Fix performance downgrade issue & update doc (#229)
For push function, we only need to make sure the instruction `st.global`
will be executed after the while loop. Since there is a Write-After-Read
hazard for `trigger.fst` (Check `this->triggers[curFifoHead % size].fst
!= 0` first then write value to `triggers[curFifoHead % size]`), we can
expect the compiler and hardware can handle this situation correctly.
Remove the `release.sys` there.

BTW, `st.global.release.sys.v2.u64` will cause perf regression issue.
Previous we use `st.global.release.cta.v2.u64`, but seems not necessary.
2023-12-04 10:20:10 -08:00
Changho Hwang
351b95b926 Update documents (#225)
Adding AMD supports on the docs
2023-11-24 17:00:18 +08:00
Changho Hwang
15f6dcca49 Update documentation (#217)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 12:58:04 -08:00
Changho Hwang
f68820436c Explicit build dependency on nvidia_peermem (#201) 2023-10-23 04:29:30 +00:00
Changho Hwang
3df18d20a3 Update install guidelines (#159) 2023-08-30 10:40:40 -07:00
Changho Hwang
4114d65c60 Documents & minor updates (#119)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
6ec585f3d8 Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00