tests/ep: disable NCCL HeartbeatMonitor to silence mpirun shutdown noise

Default TORCH_NCCL_ENABLE_MONITORING to 0 (via os.environ.setdefault, so
an explicit user setting still wins) before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun. Under mpirun, however, rank 0 (the TCPStore server) can
exit before non-zero ranks finish teardown; the background heartbeat
thread then polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
Qinghua Zhou
2026-04-29 20:44:37 +00:00
parent e752dbaf97
commit 6ad82e8bbe

@@ -35,6 +35,12 @@ import os
import random
import sys
# Disable ProcessGroupNCCL's HeartbeatMonitor before importing torch.distributed.
# It runs in a background thread polling the TCPStore; under mpirun, rank 0
# (the store server) can exit before non-zero ranks finish teardown, producing
# noisy 'recvValue failed / Connection was likely closed' stack traces.
os.environ.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")
import torch
import torch.distributed as dist
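
The hunk above uses `setdefault` rather than a plain assignment, so a value the caller has already exported is left alone. A minimal sketch of that behavior follows. It assumes a hypothetical helper, `apply_monitor_default`, and uses plain dicts in place of `os.environ` so it can run without touching the real environment:

```python
ENV_VAR = "TORCH_NCCL_ENABLE_MONITORING"


def apply_monitor_default(env):
    """Hypothetical helper: write "0" only when ENV_VAR is not already set.

    `env` is any mutable mapping; in the test script it would be os.environ.
    Returns the value that ends up stored under ENV_VAR.
    """
    return env.setdefault(ENV_VAR, "0")


# Case 1: the variable is unset, so the default "0" is written.
unset_env = {}
apply_monitor_default(unset_env)

# Case 2: the caller already exported "1", so that value is preserved.
preset_env = {ENV_VAR: "1"}
apply_monitor_default(preset_env)

print(unset_env[ENV_VAR], preset_env[ENV_VAR])
```

With this behavior, someone debugging a hang could presumably still re-enable the monitor by exporting the variable before launching the run, without editing the test file.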