Commit Graph

190 Commits

Author SHA1 Message Date
Qinghua Zhou
16a96ea77b Support detailed version tracking that captures git repository information (#639)
#### Version Format

The package version includes the git commit hash directly in the version
string for development builds:
- **Release version**: `0.7.0`
- **Development version**: `0.7.0.dev36+g6e2360d69` (includes short
commit hash)
- **Development with uncommitted changes**:
`0.7.0.dev36+g6e2360d69.dirty`

#### Checking Version Information

After installation, you can check the version information in several
ways:

**From Python:**
```python
import mscclpp

# Access individual attributes
print(f"Version: {mscclpp.__version__}")           # Full version with commit
Version: 0.7.0.dev36+g6e2360d69

# Get as dictionary
mscclpp.version()
{'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'}
```

#### Version Information Details

The version tracking captures:
- **Package Version** (`mscclpp.__version__`): Full version string
including git commit (e.g., `0.7.0.dev36+g6e2360d69`)

This information is embedded during the package build process and
remains accessible even after distribution, making it easier to debug
issues and ensure reproducibility.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-09-30 09:00:33 -07:00
Caio Rocha
c3473b1794 Thread Block Group DSL (#621)
Supporting the creation of a group of thread block to perform some
operation.
2025-09-03 14:58:40 -07:00
Caio Rocha
4d9bb9f015 Adding Channel Id Field DSL Port Channel Operations (#615) 2025-08-15 16:10:52 -07:00
Caio Rocha
9261b1d278 AlltoAll Test Support (#606)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-08-15 16:00:41 -07:00
Binyang Li
699cc45eed Fix ut (#613)
Fix pytest
2025-08-14 17:15:28 -07:00
Changho Hwang
2eadbaf86f python doc auto generation (#605)
Add Python API references
2025-08-11 10:34:29 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline 
- Clean code and enhance document
C++ side: 
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Changho Hwang
542a10f69e Merge ChannelTrigger with ProxyTrigger (#601) 2025-08-08 19:07:50 +00:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointer for IB structure. Registered memory will own
ibMr, ibCtx will not held the reference
- Use smart pointer for IbQp and IbMr
- Update memoryChannel API, keep localRegisteredMemory
- Close fd when registedMemory released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Binyang Li
658411ccc4 update pytest and python API to fix ut failure (#598)
update pytest and python API to fix ut failure
2025-08-05 15:17:33 -07:00
Changho Hwang
c580e4c503 Support CudaIpc connection within a single process (#593)
* Allow CudaIpc connection between GPUs in a single process
* Added an example of connection in a single process
* Minor interface updates

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-08-02 12:59:36 +08:00
Binyang Li
604c345921 Fix #458 (#568)
Fix mscclpp_benchmark allreduce test
2025-07-13 18:02:56 -07:00
Changho Hwang
199468bc47 Revise NVLS interface (#458)
* Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel`
* Minor interface improvements
2025-07-12 00:33:03 +00:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Changho Hwang
b4dde38db8 FIFO improvements (#557)
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* Now a context should be always managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
2025-06-24 09:50:28 -07:00
Changho Hwang
17d8e7c9e9 Fix build processes (#545)
* Let CMake read version numbers from the `VERSION` file.
* Upgrade dlpack and drop `CMAKE_POLICY_VERSION_MINIMUM`.
* Do not install dlpack.
* Add license files in the wheel and exclude `*.cpp` files.
2025-06-06 13:37:40 -07:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
c184485808 DLPack fixes (#537)
* Fix typos in type name
* Make it work without current context set
2025-05-27 21:40:50 +00:00
Changho Hwang
7278b51e61 Rename ChannelTrigger fields and check field values in debug builds (#529) 2025-05-27 14:36:22 -07:00
Caio Rocha
29c3af2ac6 Properly setting up the device in Ethernet Connection (#527)
When we create the thread to receive messages in the Ethernet
Connection, it resets the Device ID, causing faults in the Ethernet
Connection unit tests.


![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d)

This PR aims to properly set up the device when the thread is created.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-05-19 10:05:45 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
5205618c4a Fix device assert (#522)
* Fixed a bug that external `assert()`s may not be compiled with mscclpp
headers
* Use a macro assert instead of a function
2025-05-12 13:38:11 -07:00
Changho Hwang
d636093336 Asynchronous setup (#514)
Cherry-picked a part of features from #167: now `Communicator::setup()`
is unneeded. `Communicator::sendMemory()` conducts the task inline, and
`Communicator::recvMemory()` and `Communicator::connect()` conducts the
task asynchronously without explicit setup.
2025-05-08 22:01:51 +00:00
Caio Rocha
51eca89d20 Enhance Collective Check at MSCCLang (#511) 2025-04-29 13:29:28 -07:00
Changho Hwang
b310783603 Fix #508 (#515)
* Wrong offsets in `unpackPackets()`
* Added Python binding of `BaseMemoryChannel`
2025-04-25 09:52:05 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Caio Rocha
7a25e51b07 Automatic creation of Scratch Buffer at MSCCLLang (#510) 2025-04-23 16:37:14 -07:00
Binyang Li
06f31994dc Fix performance issue introduced in PR: 499 (#505)
1. use `fence+relaxed` to replace `release` for fifo. `fence+relax` is
more efficient on A100
2. Update the deviceSyncer. Previous one cannot handle threadBlock
number change correctly. Use three counters to solve this issue. Reset
previous counter before sync on current counter.
3. Introduce relaxedWait which can be used with relaxedSignal for case
doesn't need guarantee the memory visibility
2025-04-22 14:03:37 -07:00
Binyang Li
adc9ee5684 Export mscclpp GpuBuffer to dlpack format (#492)
For mscclpp, to use nvls we require the buffer is allocated by
mscclpp::GpuBuffer. Due to cupy doesn't support bfloat16 yet, we export
the raw buffer to dlpack format.
User can use this feature to create buffer with type supported by
pytorch
```python
buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)
```
2025-04-03 12:59:32 -07:00
Caio Rocha
ac5cc647e0 Reduce Operation Support to the Executor (#484)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-25 13:58:12 -07:00
Caio Rocha
bac3c90f6a Improving Get Operation at MSCCLLang (#475)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-05 09:41:38 -08:00
Caio Rocha
b3992a8b29 Adding Read Put Packet operation at Executor (#441) 2025-02-26 09:29:43 -08:00
Caio Rocha
0222bb324d Adjusting AllGather Collective in MSCCLLang (#466)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-25 08:35:26 -08:00
Caio Rocha
8a564977e5 Updating MSCCLLang Examples (#462)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-19 09:48:31 -08:00
Caio Rocha
55789bc551 Support ReduceScatter in the NCCL interface (#460)
Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net>
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-02-11 13:28:19 -08:00
Binyang Li
a6e00cc449 remove unnecessary sync (#461)
`nop` instruction is only for synchronization within the same
threadblock. Cross threadblock synchronization is handled by `barrier`
instruction. So insert `nop` only if the dependency is within the same
threadblock.
2025-02-10 15:31:49 +08:00
Caio Rocha
e7cff899ce Adjusting BFS to seek circular dependencies in the msccl-tools DAG (#459)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-07 11:24:27 -08:00
Binyang Li
7f3b088744 Add multi-nodes example & update doc (#455)
Documentation update:

*
[`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114):
Updated the link to the examples folder to reflect the correct path.

New example script:

*
[`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76):
Added a new example script demonstrating the allgather all-pairs
algorithm across multiple nodes using packet communication.

IR module improvements:

*
[`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472):
Refined the sorting criteria for GPU instance channels and thread block
channels to include the channel type, ensuring a more accurate order.
Debugging enhancements:

*
[`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441):
Added a debug log to indicate the start of communication collective
execution with details about the execution plan and collective.
*
[`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89):
Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for
logging executor-related information.
2025-01-31 17:52:15 -08:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Binyang Li
af0bb86e07 Merge mscclpp-lang to mscclpp project (#442)
First step to merge msccl-tools into mscclpp repo. In this step will
move all msccl related code, pass the current tests and do some
necessary refactor.

Add `mscclpp.language` module
Add `_InstructionOptimizer` and `DagOptimizer` class to optimize the dag
Add `DagLower` to lower dag to intermediate representation 
Add documents for mscclpp.language
Remove msccl related code
2025-01-22 09:47:37 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root -x
LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w
10 -n 20
2024-12-03 22:39:20 +00:00
Caio Rocha
ff18bb8d0b Providing reduce-scatter test support (#390) 2024-11-28 09:19:30 -08:00
Binyang Li
1b8d020650 Fix mscclpp_benchmark (#392)
Enable 1GB message size for NVLS transport in mscclpp_benchmark
2024-11-25 19:59:51 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support mote datatype for multicast operation
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which return std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Caio Rocha
b3dc74c020 Small Adjust in Test Data AllGather at Executor Test (#384) 2024-11-16 15:21:00 +08:00