mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 17:26:04 +00:00

Author	SHA1	Message	Date
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741 ) This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables. 1. Algorithm Selection Refactoring Use separate class for algo selection. Could introduce more complex logic for algo selection based on message size, arch, if cuda graph is enabled and memory allocation method 2. Symmetric Memory Support Introduced symmetricMemory parameter in algorithm context key generation. Remove disableChannelCache env as is ambiguous 3. Add new args for build_default_algorithms Add flag_buffer, and flag_buffer_size args to build default algorithm. Then we could use unified flag buffer for different algorithms, avoid application hanging when switch algo for different message size. --------- Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo Provide interface for pytorch integration with native API and DSL --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	610db6f023	Fix test script (#655 ) Fix: #654. Address correctness_test.py crash issue Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-21 19:57:17 +00:00
Binyang Li	5ac427610d	Address teardown issue (#638 ) Ignore cuda/cu errors during teardown. Some pointer may be invalid at this point	2025-09-25 12:12:40 -07:00
Binyang Li	ba4c4aaeb8	Integrate MSCCL++ with torch workload (#626 ) Integrate MSCCL++ with torch Introduce `NCCL audit shim library`, use can use following commands to launch torch library. Also avoid break build pipeline in the CPU machine ```bash export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH torchrun --nnodes=1 --nproc_per_node=8 your_script.py ```	2025-09-09 13:28:32 -07:00
Binyang Li	2b40fe37b3	add torch test (#612 ) Simple torch test	2025-08-15 10:27:21 -07:00

6 Commits