mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
tests/ep: disable NCCL HeartbeatMonitor to silence mpirun shutdown noise
Set TORCH_NCCL_ENABLE_MONITORING=0 before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun, but under mpirun rank 0 (the TCPStore server) can exit
before non-zero ranks finish teardown, and the background heartbeat
thread polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
@@ -35,6 +35,12 @@ import os
 import random
 import sys
 
+# Disable ProcessGroupNCCL's HeartbeatMonitor before importing torch.distributed.
+# It runs in a background thread polling the TCPStore; under mpirun, rank 0
+# (the store server) can exit before non-zero ranks finish teardown, producing
+# noisy 'recvValue failed / Connection was likely closed' stack traces.
+os.environ.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")
+
 import torch
 import torch.distributed as dist
 
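The two pieces the commit message relies on, the environment default and the barrier-then-destroy teardown, can be sketched in isolation. This is a minimal illustration, not the test file's actual code: the helper names disable_heartbeat_monitor and run_with_teardown are hypothetical, and the is_initialized() guard is an assumption added here. The dist argument is expected to behave like torch.distributed, which is deliberately not imported so the sketch runs without PyTorch installed.

```python
import os


def disable_heartbeat_monitor(env=os.environ):
    """Hypothetical helper: default TORCH_NCCL_ENABLE_MONITORING to "0".

    setdefault only writes the key when it is absent, so a value the user
    exported before launch (for example "1") is left untouched.
    Call this before importing torch.distributed, as the commit does.
    """
    env.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")
    return env["TORCH_NCCL_ENABLE_MONITORING"]


def run_with_teardown(dist, body):
    """Hypothetical wrapper: run body(), then synchronize and tear down.

    `dist` is assumed to expose is_initialized(), barrier() and
    destroy_process_group(), as torch.distributed does. The barrier keeps
    any rank from destroying its process group while peers are still
    inside body().
    """
    try:
        return body()
    finally:
        # Assumption: skip teardown if the process group was never created.
        if dist.is_initialized():
            dist.barrier()
            dist.destroy_process_group()
```

Because the helper uses setdefault rather than plain assignment, the monitor can still be re-enabled for a debugging run by exporting TORCH_NCCL_ENABLE_MONITORING=1 in the launch environment.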