mscclpp/test/python
Qinghua Zhou 6ad82e8bbe tests/ep: disable NCCL HeartbeatMonitor to silence mpirun shutdown noise
Set TORCH_NCCL_ENABLE_MONITORING=0 before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun. Under mpirun, however, rank 0 (the TCPStore server) can
exit before the non-zero ranks finish teardown; the background heartbeat
thread then polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
2026-04-29 20:44:37 +00:00