Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)

## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) for CUDA init failure, and `setup.sh`
conditionally adds compat libs only when needed. This fixes
`cudaErrorSystemNotReady` (error 803) when the host driver is newer than
the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with
tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
Binyang Li
2026-04-13 21:51:29 -07:00
committed by GitHub
parent b6d0ca13ca
commit ecd33722d4
12 changed files with 200 additions and 88 deletions

View File

@@ -14,11 +14,6 @@ baseImageTable=(
declare -A extraLdPathTable
extraLdPathTable=(
["cuda11.8"]="/usr/local/cuda-11.8/compat"
["cuda12.4"]="/usr/local/cuda-12.4/compat"
["cuda12.8"]="/usr/local/cuda-12.8/compat"
["cuda12.9"]="/usr/local/cuda-12.9/compat"
["cuda13.0"]="/usr/local/cuda-13.0/compat"
["rocm6.2"]="/opt/rocm/lib"
)