mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-12 01:10:22 +00:00
Add dist.barrier() + dist.destroy_process_group() in a finally block so non-zero ranks don't poll the TCPStore after rank 0 (the store server) exits. Without this teardown, ProcessGroupNCCL's HeartbeatMonitor printed noisy 'recvValue failed / Connection was likely closed' stack traces. Also pass device_id to init_process_group in the internode test to silence 'Guessing device ID based on global rank' warnings.
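The teardown order described above can be sketched roughly as follows. This is a minimal illustration, not the actual test code from the repository: the helper name `run_with_teardown` and its `dist` parameter are hypothetical, and `dist` is assumed to be the `torch.distributed` module, passed in so the sketch can be read and exercised without GPUs.

```python
def run_with_teardown(dist, body, **init_kwargs):
    """Run body() between process-group init and a guaranteed teardown.

    `dist` is assumed to be torch.distributed (hypothetical parameter,
    injected here only to keep the sketch self-contained).
    """
    dist.init_process_group(**init_kwargs)
    try:
        return body()
    finally:
        # Hold every rank here until all ranks arrive, so rank 0 (which
        # hosts the TCPStore server) does not exit while other ranks
        # are still polling the store.
        dist.barrier()
        # Explicitly tear down the process group on every rank.
        dist.destroy_process_group()


# Possible usage, assuming a launcher that sets LOCAL_RANK (e.g. torchrun):
#
#   import os, torch, torch.distributed as dist
#   local_rank = int(os.environ["LOCAL_RANK"])
#   run_with_teardown(
#       dist,
#       lambda: None,  # placeholder for the test body
#       backend="nccl",
#       device_id=torch.device("cuda", local_rank),
#   )
```

Passing `device_id` binds the process group to a specific GPU up front, which is what removes the need for PyTorch to guess the device from the global rank. One caveat with this pattern: if a single rank fails inside `body()` while the others are blocked in a collective, the `barrier()` in the `finally` block can itself wait until a timeout fires.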
15 KiB