Mirror of https://github.com/microsoft/mscclpp.git (synced 2026-05-21 13:29:45 +00:00)
ep tests: clean shutdown to silence TCPStore/HeartbeatMonitor noise
Add dist.barrier() + dist.destroy_process_group() in a finally block so non-zero ranks don't poll the TCPStore after rank 0 (the store server) exits, which produced noisy 'recvValue failed / Connection was likely closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor. Also pass device_id to init_process_group in the internode test to silence 'Guessing device ID based on global rank' warnings.
@@ -327,3 +327,14 @@ if __name__ == "__main__":
         import traceback
         traceback.print_exc()
         sys.exit(1)
+    finally:
+        # Ordered shutdown: barrier so every rank reaches teardown before the
+        # TCPStore server (rank 0) exits, then destroy the PG. Avoids noisy
+        # "recvValue failed / Connection was likely closed" stack traces from
+        # ProcessGroupNCCL's HeartbeatMonitor.
+        if dist.is_initialized():
+            try:
+                dist.barrier()
+            except Exception:
+                pass
+            dist.destroy_process_group()
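
The hunk above shows only the shutdown half of the change; the `device_id` half described in the commit message is not visible here. The following is a minimal sketch of how the two pieces might fit together, not the code from the actual test files. The helper names `local_device_index`, `ordered_shutdown`, and `main` are hypothetical, and the sketch assumes the processes are launched by `torchrun`, which sets `LOCAL_RANK`.

```python
import os


def local_device_index(env=None):
    """Return this process's GPU index from LOCAL_RANK (assumed set by torchrun)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def ordered_shutdown(dist):
    """Barrier, then destroy the process group.

    `dist` is expected to be the torch.distributed module, or any stand-in
    exposing is_initialized(), barrier(), and destroy_process_group().
    Returns True if a process group was torn down, False otherwise.
    """
    if not dist.is_initialized():
        return False
    try:
        # Keep rank 0 (the TCPStore server) alive until every rank gets here.
        dist.barrier()
    except Exception:
        # A peer may already be gone (e.g. after a crash); still tear down.
        pass
    dist.destroy_process_group()
    return True


def main():
    # Imported lazily so the helpers above stay importable without PyTorch.
    import torch
    import torch.distributed as dist

    device = torch.device("cuda", local_device_index())
    torch.cuda.set_device(device)
    try:
        # Passing device_id tells the process group which GPU this rank uses,
        # so it does not have to guess from the global rank.
        dist.init_process_group(backend="nccl", device_id=device)
        # ... test body would run here (placeholder) ...
    finally:
        ordered_shutdown(dist)
```

In this sketch `main()` would be called from the script's entry point under a launcher such as `torchrun`. Swallowing a barrier failure mirrors the diff above: if a rank has already died, the remaining ranks still destroy their process group instead of raising a second error during teardown.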