ep tests: clean shutdown to silence TCPStore/HeartbeatMonitor noise

Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.

Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
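
For illustration, the device_id change could look like the minimal sketch below. It assumes the test is launched with torchrun (which sets LOCAL_RANK along with the rank and world-size variables) and uses the NCCL backend. The helper names local_rank_from_env and init_with_device_id are hypothetical and do not come from the test files.

```python
import os


def local_rank_from_env(env=None):
    """Return this process's local GPU index from LOCAL_RANK, defaulting to 0."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def init_with_device_id(local_rank):
    """Bind the process group to one GPU up front so it is not guessed from the rank."""
    # Imported lazily so local_rank_from_env stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)
    # Assumption: env:// rendezvous (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    # are already set by the launcher) and a PyTorch release that accepts device_id.
    dist.init_process_group(backend="nccl", device_id=device)
    return device
```

A test entry point would then call init_with_device_id(local_rank_from_env()) before issuing any collective.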
Qinghua Zhou
2026-04-29 05:16:22 +00:00
parent 0227626335
commit afbdcd6a3d
3 changed files with 37 additions and 5 deletions

@@ -327,3 +327,14 @@ if __name__ == "__main__":
         import traceback
         traceback.print_exc()
         sys.exit(1)
+    finally:
+        # Ordered shutdown: barrier so every rank reaches teardown before the
+        # TCPStore server (rank 0) exits, then destroy the PG. Avoids noisy
+        # "recvValue failed / Connection was likely closed" stack traces from
+        # ProcessGroupNCCL's HeartbeatMonitor.
+        if dist.is_initialized():
+            try:
+                dist.barrier()
+            except Exception:
+                pass
+            dist.destroy_process_group()
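
Since the commit touches three files, the same teardown could be shared through a small context manager. The sketch below is one possible way to do that, not the approach taken in this commit. The name ordered_shutdown is hypothetical, and the distributed module is passed in as an argument so the helper can be exercised without a real process group.

```python
import contextlib


@contextlib.contextmanager
def ordered_shutdown(dist):
    """Run the body, then tear the process group down in a fixed order.

    `dist` is expected to expose is_initialized(), barrier() and
    destroy_process_group(), as torch.distributed does.
    """
    try:
        yield
    finally:
        if dist.is_initialized():
            try:
                # Wait until every rank reaches teardown, so the rank hosting
                # the TCPStore does not exit while others still poll it.
                dist.barrier()
            except Exception:
                # A failed barrier (for example, a peer already gone) must
                # not prevent the local group from being destroyed.
                pass
            dist.destroy_process_group()
```

In a test this might be used as `with ordered_shutdown(torch.distributed): main()`, where main is a placeholder for the test body. An exception raised inside the body, including the SystemExit from sys.exit(1), still propagates after the teardown has run.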