mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-22 22:08:28 +00:00
Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)
## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU configuration.
- **Remove hardcoded VMSS hostnames** from deploy files.
- **Fix CUDA compat library issue**: Remove stale compat paths from the Docker image for CUDA 12+. Instead, `peer_access_test` now returns a distinct exit code (2) for CUDA init failure, and `setup.sh` conditionally adds compat libs only when needed. This fixes `cudaErrorSystemNotReady` (error 803) when the host driver is newer than the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
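The conditional compat handling described in the summary could look roughly like the sketch below. It is not the actual `setup.sh` code: the function name `maybe_add_compat`, its arguments, and the printed status strings are hypothetical. The only detail taken from the commit message is that exit code 2 signals a CUDA init failure. The sketch assumes the probe runs first without compat libs, and that the compat directory is prepended to `LD_LIBRARY_PATH` only when the probe reports that failure.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: add CUDA compat libs only when a probe reports
# a CUDA init failure (exit code 2, per the commit message).
#
# Usage: maybe_add_compat <compat_dir> <probe command...>
maybe_add_compat() {
  local compat_dir="$1"
  shift
  local rc=0

  # Run the probe without compat libs first and capture its exit code.
  "$@" || rc=$?

  if [ "$rc" -eq 2 ] && [ -d "$compat_dir" ]; then
    # CUDA init failed and a compat directory exists: prepend it.
    export LD_LIBRARY_PATH="${compat_dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    echo "compat-added"
  elif [ "$rc" -eq 0 ]; then
    # The host driver works as-is, so leave the library path alone.
    echo "compat-not-needed"
  else
    # Any other failure is unrelated to compat libs and is reported as-is.
    echo "probe-failed:$rc"
  fi
}
```

In a real script the function would be called directly, for example `maybe_add_compat /usr/local/cuda-12.4/compat ./peer_access_test`, so that the exported `LD_LIBRARY_PATH` persists in the calling shell. Both the compat path and the probe's location are placeholders here.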
@@ -14,11 +14,6 @@ baseImageTable=(
declare -A extraLdPathTable
extraLdPathTable=(
["cuda11.8"]="/usr/local/cuda-11.8/compat"
["cuda12.4"]="/usr/local/cuda-12.4/compat"
["cuda12.8"]="/usr/local/cuda-12.8/compat"
["cuda12.9"]="/usr/local/cuda-12.9/compat"
["cuda13.0"]="/usr/local/cuda-13.0/compat"
["rocm6.2"]="/opt/rocm/lib"
)