Commit Graph

731 Commits

Author SHA1 Message Date
Changho Hwang
def68ced64 Add CUDA 12.8 images (#488) 2025-03-29 00:31:26 +00:00
Binyang Li
a3d8d6807b Remove the requirement for CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED for NVLS support (#489)
Remove the requirement for
`CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support.

Fix #487
2025-03-28 16:46:54 -07:00
Qinghua Zhou
0f21ed44b8 Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations (#485)
Add CI test for fallback allgather, allreduce, broadcast, and
reducescatter to NCCL operations
Tests the following parameters:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE 
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce,
broadcast, reducescatter" or "all"
2025-03-27 21:13:07 +00:00
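The `MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION` value above is either `"all"` or a comma-separated list of collectives (spaces tolerated, as in the CI flags). A minimal sketch of how such a value might be interpreted; the helper name and logic are hypothetical, not the actual C++ implementation:

```python
import os

FALLBACK_OPS = {"allgather", "allreduce", "broadcast", "reducescatter"}

def forced_fallback_ops(env=os.environ):
    """Return the set of collectives forced onto the NCCL fallback path.

    Hypothetical helper: "all" selects every supported collective;
    otherwise the value is a comma-separated list (spaces tolerated)."""
    raw = env.get("MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION", "")
    if raw.strip().lower() == "all":
        return set(FALLBACK_OPS)
    ops = {tok.strip().lower() for tok in raw.split(",") if tok.strip()}
    return ops & FALLBACK_OPS

# The CI passes a spaced, comma-separated list, e.g.:
print(sorted(forced_fallback_ops(
    {"MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION": "allgather, allreduce"})))
```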
Caio Rocha
ac5cc647e0 Reduce Operation Support to the Executor (#484)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-25 13:58:12 -07:00
Binyang Li
b4062462fd Fix reduceMin failure issue (#486)
Remove the reduceOp check, as this is already done in the `getReduceOp` method
2025-03-25 10:15:24 -07:00
Qinghua Zhou
a7c364beb8 nccl/rccl integration (#469)
Use dlopen to load nccl/rccl APIs from a shared library to
enable Allgather, Allreduce, Broadcast, and ReduceScatter fallback to nccl/rccl operations.

Add three related environment variables
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"
By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all of these operations will fall back to nccl/rccl APIs.
---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-20 11:31:37 -07:00
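The dlopen-based loading described above can be sketched from Python with `ctypes` (which calls dlopen under the hood). This is an illustration only: since `libnccl.so` may not be present, the demo falls back to the standard math library and resolves `cos` the way the real code would resolve a symbol like `ncclAllReduce`:

```python
import ctypes
import ctypes.util
import os

def load_collective_lib(path=None):
    """Load a NCCL/RCCL-style shared library at runtime, as the fallback
    path does with MSCCLPP_NCCL_LIB_PATH. Hypothetical helper; it falls
    back to libm here purely so the demo runs without NCCL installed."""
    path = path or os.environ.get("MSCCLPP_NCCL_LIB_PATH")
    if path is None:
        path = ctypes.util.find_library("m")  # stand-in library for the demo
    return ctypes.CDLL(path)  # dlopen

lib = load_collective_lib()
# Resolving a symbol by name is the dlsym step; with a real libnccl.so
# this would be e.g. lib.ncclAllReduce instead of cos.
cos = lib.cos
cos.restype = ctypes.c_double
cos.argtypes = [ctypes.c_double]
print(cos(0.0))  # 1.0
```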
Binyang Li
89f7573adf Fix correctness issue when mscclppDisableChannelCache set to true (#483)
If `mscclppDisableChannelCache` is set to true, we need to keep every
channel's information to avoid the channel info on the GPU side being released.
2025-03-19 14:55:37 -07:00
Caio Rocha
b6a179faff NCCL API CI Test for ReduceScatter (#465)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-17 13:58:32 -07:00
Binyang Li
f124dc1df9 Add min operation for allreduce (#481)
Add min operation for allreduce
2025-03-16 20:47:36 -07:00
Binyang Li
0b840baa05 Update allgather fallback algo (#476)
Enhancements to the all-gather operation: a temporary solution to fix the
memory overhead when integrating msccl++ with PyTorch.
This solution does not register the input/output buffers with msccl++, so the
temporary output buffer for allgather can be reused by torch automatically.

* Introduced a new `allgather8` kernel function in
`apps/nccl/src/allgather.hpp` to handle larger data sizes more
efficiently. This includes double buffering to hide synchronization
overhead and support for both in-place and out-of-place operations.
* Modified the `allgather` function to decide between `allgather6` and
`allgather8` based on data size and platform, improving performance for
large data sizes.

Configuration and environment improvements:

* Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to
control whether the channel cache is disabled, enhancing
configurability. This variable is now part of the `Env` class and is
logged during environment initialization.
* Removed the redundant global variable `mscclppDisableChannelCache`
from `src/debug.cc` and updated its usage to refer to the new
environment variable.
2025-03-14 11:18:03 -07:00
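The size-based dispatch between `allgather6` and `allgather8` described above can be modeled as a simple selection rule. The threshold value and function signature here are hypothetical, not the actual values used in `apps/nccl`:

```python
def pick_allgather_kernel(nbytes, threshold=1 << 20):
    """Sketch of the dispatch: small messages use `allgather6`, large
    ones use the double-buffered `allgather8`. The 1 MiB threshold is an
    assumption for illustration; the real cutoff is platform-dependent."""
    return "allgather8" if nbytes >= threshold else "allgather6"

print(pick_allgather_kernel(32 << 20))  # large transfer -> allgather8
print(pick_allgather_kernel(4096))      # small transfer -> allgather6
```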
Changho Hwang
e4012ded48 Mark mscclpp-test as deprecated in the doc (#478) 2025-03-11 22:44:38 +00:00
Binyang Li
79b5eefa6c Fix memory OOM issue (#479)
This pull request includes changes to improve memory management in
GPU-related functions by ensuring proper release of memory handles.
Fix #470
2025-03-10 11:15:17 -07:00
Caio Rocha
bac3c90f6a Improving Get Operation at MSCCLLang (#475)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-05 09:41:38 -08:00
Yang Wang
1ff217d5f3 Fix minor typos and errors in documentation (#474) 2025-02-28 17:46:24 -08:00
Caio Rocha
6074eeeac9 Adjust NPKit IB Event (#472) 2025-02-28 10:16:47 -08:00
Caio Rocha
986c45b71a NPKit Support to Read Put Packet Operation (#471) 2025-02-27 12:02:16 -08:00
Caio Rocha
b3992a8b29 Adding Read Put Packet operation at Executor (#441) 2025-02-26 09:29:43 -08:00
Caio Rocha
0222bb324d Adjusting AllGather Collective in MSCCLLang (#466)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-25 08:35:26 -08:00
Qinghua Zhou
591276f9d0 Disable channel cache (#463)
Add a workaround for disabling the channel cache.
Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE
(Default value: False)
In this PR, some other features (e.g., ncclCommSplit) come from branch
binyangli/nccl-api

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-19 19:26:12 +00:00
Caio Rocha
8a564977e5 Updating MSCCLLang Examples (#462)
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
2025-02-19 09:48:31 -08:00
Caio Rocha
55789bc551 Support ReduceScatter in the NCCL interface (#460)
Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net>
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-02-11 13:28:19 -08:00
Binyang Li
a6e00cc449 remove unnecessary sync (#461)
The `nop` instruction is only for synchronization within the same
threadblock; cross-threadblock synchronization is handled by the `barrier`
instruction. So insert `nop` only if the dependency is within the same
threadblock.
2025-02-10 15:31:49 +08:00
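The rule above reduces to a single comparison. A toy model of the decision (hypothetical helper; the real logic lives in the DSL's instruction lowering):

```python
def sync_instruction(producer_tb, consumer_tb):
    """Model of the rule: a dependency inside one threadblock needs only a
    lightweight `nop` (intra-block sync); a dependency that crosses
    threadblocks needs a `barrier`."""
    return "nop" if producer_tb == consumer_tb else "barrier"

print(sync_instruction(0, 0))  # same threadblock  -> nop
print(sync_instruction(0, 1))  # cross threadblock -> barrier
```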
Caio Rocha
e7cff899ce Adjusting BFS to seek circular dependencies in the msccl-tools DAG (#459)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-07 11:24:27 -08:00
Binyang Li
7f3b088744 Add multi-nodes example & update doc (#455)
Documentation update:

* `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path.

New example script:

* `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication.

IR module improvements:

* `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order.

Debugging enhancements:

* `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective.
* `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.
2025-01-31 17:52:15 -08:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Binyang Li
af0bb86e07 Merge mscclpp-lang to mscclpp project (#442)
First step to merge msccl-tools into the mscclpp repo. In this step we
move all msccl-related code, pass the current tests, and do some
necessary refactoring.

* Add the `mscclpp.language` module
* Add `_InstructionOptimizer` and `DagOptimizer` classes to optimize the DAG
* Add `DagLower` to lower the DAG to an intermediate representation
* Add documents for mscclpp.language
* Remove msccl-related code
2025-01-22 09:47:37 -08:00
Changho Hwang
4ee15b7ad0 Fix PR #449 (#453) 2025-01-15 11:59:12 -08:00
Changho Hwang
d12247b54a Lazily create streams for CudaIpcConnection (#449) 2025-01-15 11:50:02 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
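A minimal sketch of an environment manager like the `Env` class described above: each `MSCCLPP_*` variable is read once, with a default. The variable set and defaults here are illustrative, not the library's full list:

```python
import os

class Env:
    """Sketch of a runtime-environment manager. Reads each variable once
    at construction; attribute names and defaults are assumptions."""
    def __init__(self, environ=os.environ):
        # Note the renamed variable: MSCCLPP_NPKIT_DUMP_DIR (was NPKIT_DUMP_DIR).
        self.npkit_dump_dir = environ.get("MSCCLPP_NPKIT_DUMP_DIR", "")
        self.debug = environ.get("MSCCLPP_DEBUG", "WARN")

env = Env({"MSCCLPP_NPKIT_DUMP_DIR": "/tmp/npkit"})
print(env.npkit_dump_dir)  # /tmp/npkit
print(env.debug)           # WARN (default)
```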
Binyang Li
8ac50dc85d Resolve cuMemMap error (#451)
* Updated `RegisteredMemory::Impl::Impl(const std::vector<char>&
serialization)` to use both minimum and recommended granularities for
memory address reservation and mapping. This resolves the cuMemMap
error.
2025-01-10 14:22:14 -08:00
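The granularity fix above amounts to rounding the reservation size up to a multiple of the larger of the two granularities that `cuMemGetAllocationGranularity` can report (minimum and recommended). The arithmetic, separated from the CUDA driver calls (helper name and example values are illustrative):

```python
def reserve_size(nbytes, min_gran, recommended_gran):
    """Round a reservation up to a multiple of the larger of the two
    granularities. Pure arithmetic; the actual code obtains the values
    via the CUDA driver API and then calls cuMemAddressReserve/cuMemMap."""
    gran = max(min_gran, recommended_gran)
    return (nbytes + gran - 1) // gran * gran

# e.g. 1 MiB requested, 64 KiB minimum, 2 MiB recommended -> 2 MiB reserved
print(reserve_size(1 << 20, 64 << 10, 2 << 20))
```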
Changho Hwang
2b54af7e27 Auto-update version numbers in CMakeLists.txt (#450) 2025-01-09 17:54:10 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Caio Rocha
80abce59ef Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-01-08 09:02:36 -08:00
Changho Hwang
1989d4be9c Fix CMake build messages (#443) 2025-01-08 02:44:01 +00:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
6d26b92665 Fix azure pipeline (#437) 2025-01-04 19:41:10 -08:00
Pedram Alizadeh
97eaca2bd2 [NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399) 2025-01-03 20:38:57 +00:00
Qinghua Zhou
ba0d0d68b8 Enhance the nccl error message handling (#434)
Add WARN or INFO before returning the nccl error message.
Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.
2025-01-03 00:50:36 +00:00
Binyang Li
3d6bfed2cf Update version number (#433)
Co-authored-by: github-actions <github-actions@github.com>
2025-01-02 16:45:08 -08:00
Changho Hwang
dff21905e3 Fix typos in the pipeline (#420)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-01-02 09:49:50 -08:00
Binyang Li
3e7801b1a8 Fix CI trigger issue (#428) 2024-12-20 16:27:52 -08:00
Binyang Li
f18a440feb trigger ci for release branches (#426) 2024-12-21 00:05:13 +00:00
Changho Hwang
e2230aab26 Tackle build warnings (#422)
* Comply with
[CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html)
* Tackle other warnings during build
2024-12-19 16:51:50 -08:00
Binyang Li
6fedb7c0e8 Fix nccl-test failure issue (#421) 2024-12-19 12:07:00 -08:00
Binyang Li
776f24e787 update README (#414) 2024-12-19 05:54:27 +00:00
SreevatsaAnantharamu
0c7ed2c674 Add ncclBcast / ncclBroadcast support (#419)
A simple broadcast using a scratch buffer, with an option to use an executor.
2024-12-19 01:16:30 +00:00
David Sidler
d8d0dfbffa Fix synchronization in allreduce8 kernel (#407)
Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a
synchronization bug. The PR addresses it by ensuring that signals are
only issued after all threads in the block have issued their writes to
guarantee correct ordering between data writes and signal writes.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 17:10:22 -08:00
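The ordering guarantee described above (all writes complete, then one signal) can be modeled with CPU threads standing in for a GPU threadblock. The barrier plays the role of `__syncthreads()`; names and structure are illustrative, not the kernel's actual code:

```python
import threading

def worker(tid, data, barrier, signals):
    """Model of the fix: every thread completes its write, all threads
    meet at a barrier, and only then is the signal issued, so a peer
    observing the signal is guaranteed to see all of the writes."""
    data[tid] = tid * tid          # the data write
    barrier.wait()                 # __syncthreads() analogue
    if tid == 0:                   # one thread issues the signal
        signals.append("ready")

n = 8
data = [None] * n
signals = []
barrier = threading.Barrier(n)
threads = [threading.Thread(target=worker, args=(i, data, barrier, signals))
           for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(signals, all(v is not None for v in data))  # ['ready'] True
```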
Caio Rocha
774602d49c Supporting Executor multi node in NCCL API (#412)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-12-18 15:50:58 -08:00
Binyang Li
fcb2e46cb1 NVLS support for NCCL API (#410)
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 09:55:35 +00:00
Binyang Li
863a599360 Disable CuMemMap check for ROCm (#411)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-17 08:36:25 +00:00