Commit Graph

20 Commits

Author SHA1 Message Date
Binyang Li
0b840baa05 Update allgather fallback algo (#476)
Enhance the all-gather operation as a temporary fix for the memory
overhead seen when integrating MSCCL++ with PyTorch.
This solution does not register the input/output buffers with MSCCL++, so
PyTorch can automatically reuse the temporary allgather output buffer.

* Introduced a new `allgather8` kernel function in
`apps/nccl/src/allgather.hpp` to handle larger data sizes more
efficiently. This includes double buffering to hide synchronization
overhead and support for both in-place and out-of-place operations.
* Modified the `allgather` function to decide between `allgather6` and
`allgather8` based on data size and platform, improving performance for
large data sizes.
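
The size-based choice between the two kernels can be pictured with a minimal sketch. The names `AllgatherKernel`, `chooseAllgatherKernel`, and the 1 MiB cutoff `kLargeMessageBytes` are hypothetical and used only for illustration; the commit message gives neither the real threshold nor how the platform affects the decision, so the platform check is left out here.

```cpp
#include <cstddef>

// Hypothetical labels for the two kernels named in the commit message.
enum class AllgatherKernel { Allgather6, Allgather8 };

// Assumed cutoff for illustration only; the actual value is not stated in the commit.
constexpr std::size_t kLargeMessageBytes = std::size_t{1} << 20;

// Pick the kernel from the message size alone: large messages go to allgather8,
// which the commit describes as more efficient for larger data sizes.
AllgatherKernel chooseAllgatherKernel(std::size_t bytes) {
  return bytes >= kLargeMessageBytes ? AllgatherKernel::Allgather8
                                     : AllgatherKernel::Allgather6;
}
```

Under these assumptions, `chooseAllgatherKernel(1024)` selects `Allgather6`, and any size at or above the cutoff selects `Allgather8`.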

Configuration and environment improvements:

* Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to
control whether the channel cache is disabled, enhancing
configurability. This variable is now part of the `Env` class and is
logged during environment initialization.
* Removed the redundant global variable `mscclppDisableChannelCache`
from `src/debug.cc` and updated its usage to refer to the new
environment variable.
2025-03-14 11:18:03 -07:00
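
As a rough illustration of how a boolean switch such as `MSCCLPP_DISABLE_CHANNEL_CACHE` might be read from the environment, here is a minimal sketch. It is not the project's `Env` class: the helpers `parseBoolFlag` and `channelCacheDisabled` are hypothetical, and the accepted spellings (`1` or `true`, case-insensitive) are an assumption. Only the variable name and its default of false come from the commit messages.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of mscclpp): interpret a string as a boolean flag.
// Assumption: "1" and "true" (any letter case) mean enabled; any other value means
// disabled. A null pointer (variable not set) yields the supplied default.
bool parseBoolFlag(const char* raw, bool defaultValue) {
  if (raw == nullptr) return defaultValue;
  std::string value(raw);
  std::transform(value.begin(), value.end(), value.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return value == "1" || value == "true";
}

// Read MSCCLPP_DISABLE_CHANNEL_CACHE from the process environment.
// The default is false, matching the default stated in the commit history.
bool channelCacheDisabled() {
  return parseBoolFlag(std::getenv("MSCCLPP_DISABLE_CHANNEL_CACHE"), false);
}
```

With this sketch, launching a process with `MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE` in its environment makes `channelCacheDisabled()` return true, while leaving the variable unset keeps the cache enabled.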
Qinghua Zhou
591276f9d0 Disable channel cache (#463)
Add a workaround that disables the channel cache.
Related runtime parameter: `-x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE`
(default value: `False`)
In this PR, some other features (e.g., ncclCommSplit) come from branch
binyangli/nccl-api

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-02-19 19:26:12 +00:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
0c150e5166 Fix copyright messages (#367) 2024-10-17 21:25:46 -07:00
Changho Hwang
544ff0c21d ROCm support (#213)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
60b3dd5a61 Bug fixes & resolve warnings (#107)
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Changho Hwang
9cee6c4a74 Cleanup old files and functions (#86) 2023-06-01 17:34:57 +08:00
Olli Saarikivi
9f6c48cbf9 Format all files 2023-05-11 00:23:14 +00:00
Changho Hwang
d2c2ae72a7 Some cleanup 2023-04-11 08:45:22 +00:00
Changho Hwang
fe1d7fee9e Bug Fix: null-termination in logging 2023-03-31 05:25:07 +00:00
Saeed Maleki
32c4498fb8 typo fixes 2023-03-28 00:55:41 +00:00
Saeed Maleki
75036c0f12 typo fixes 2023-03-28 00:50:59 +00:00
Saeed Maleki
43c52367fb merged with main and simplified the callback requirements 2023-03-27 23:41:27 +00:00
Saeed Maleki
19bf369dc1 link format correction 2023-03-27 20:40:15 +00:00
Changho Hwang
8fc8f5b4fe Lint 2023-03-27 14:09:26 +00:00
Changho Hwang
8e4146aba9 Add mscclppSetLogHandler 2023-03-27 13:33:07 +00:00
Changho Hwang
ae01fa4958 Remove mscclpp_net.h and net.h 2023-03-14 08:32:19 +00:00
Saeed Maleki
0902ce89c6 compiles 2023-02-06 05:32:24 +00:00
v-xiaoxshi
200f5637bb more bootstrap files 2023-02-04 05:07:48 +00:00
Changho Hwang
82fe0b667d Add a makefile and logging functions 2023-02-03 12:29:27 +00:00