mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
Enhancements to all-gather operation, a temporary solution to fix the memory overhead when integrating msccl++ with pytorch. This solution will not register input/output buffer to msccl++, so the temp output buffer for allgather could be reused by torch automatically. * Introduced a new `allgather8` kernel function in `apps/nccl/src/allgather.hpp` to handle larger data sizes more efficiently. This includes double buffering to hide synchronization overhead and support for both in-place and out-of-place operations. * Modified the `allgather` function to decide between `allgather6` and `allgather8` based on data size and platform, improving performance for large data sizes. Configuration and environment improvements: * Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to control whether the channel cache is disabled, enhancing configurability. This variable is now part of the `Env` class and is logged during environment initialization. * Removed the redundant global variable `mscclppDisableChannelCache` from `src/debug.cc` and updated its usage to refer to the new environment variable.