Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)

## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU configuration.
- **Remove hardcoded VMSS hostnames** from deploy files.
- **Fix CUDA compat library issue**: Remove stale compat paths from the Docker image for all CUDA versions (11.8 through 13.0). Instead, `peer_access_test` now returns a distinct exit code (2) on CUDA init failure, and `setup.sh` adds compat libs only when needed. This fixes `cudaErrorSystemDriverMismatch` (error 803), which occurs when the host driver is newer than the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
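For reviewers, the conditional compat handling described in the summary can be sketched roughly as follows. This is not the code from `setup.sh`: the function name `resolve_ld_path`, its arguments, and the example compat directory are hypothetical, and the sketch assumes the compat directory is added only after the probe reports exit code 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from setup.sh: pick the LD_LIBRARY_PATH
# based on the exit status of a CUDA probe. Exit code 2 meaning
# "CUDA init failure" comes from the commit message; everything else
# here is an illustrative assumption.
resolve_ld_path() {
    local probe_rc="$1"    # exit status of the probe (e.g. peer_access_test)
    local compat_dir="$2"  # candidate compat directory (assumed location)
    local current="$3"     # existing LD_LIBRARY_PATH, may be empty

    if [ "$probe_rc" -eq 2 ] && [ -d "$compat_dir" ]; then
        # CUDA failed to initialize and compat libs exist: prepend them.
        printf '%s\n' "${compat_dir}${current:+:$current}"
    else
        # Probe succeeded, failed for another reason, or no compat libs
        # are present: leave the path untouched.
        printf '%s\n' "$current"
    fi
}

# Possible usage (commented out because it needs the real probe binary):
#   ./peer_access_test; rc=$?
#   export LD_LIBRARY_PATH="$(resolve_ld_path "$rc" /usr/local/cuda/compat "${LD_LIBRARY_PATH:-}")"
```

Keeping the decision in a pure function that only prints the resulting path makes the behavior easy to check without a GPU. The default path stays free of compat libs, which is what avoids the mismatch on hosts with newer drivers.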
2026-05-22 22:08:28 +00:00 · 2026-04-13 21:51:29 -07:00
parent b6d0ca13ca
commit ecd33722d4
12 changed files with 200 additions and 88 deletions
--- a/docker/build.sh
+++ b/docker/build.sh
@@ -14,11 +14,6 @@ baseImageTable=(

 declare -A extraLdPathTable
 extraLdPathTable=(
-    ["cuda11.8"]="/usr/local/cuda-11.8/compat"
-    ["cuda12.4"]="/usr/local/cuda-12.4/compat"
-    ["cuda12.8"]="/usr/local/cuda-12.8/compat"
-    ["cuda12.9"]="/usr/local/cuda-12.9/compat"
-    ["cuda13.0"]="/usr/local/cuda-13.0/compat"
    ["rocm6.2"]="/opt/rocm/lib"
 )