## Summary
- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from the
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) on CUDA initialization failure, and `setup.sh`
adds the compat libs only when they are needed. This fixes
`cudaErrorSystemDriverMismatch` (error 803) when the host driver is newer
than the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with a
tar + scp + untar sequence that copies one archive per host, avoiding
per-file SSH overhead.
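The conditional compat-lib logic can be sketched as follows. This is a Python illustration of the decision, not the actual `setup.sh`: it assumes the probe (standing in for `peer_access_test`) runs as a subprocess, that exit code 2 means CUDA init failure as described above, and that the compat libraries live in `/usr/local/cuda/compat`, a hypothetical default. `choose_env` and `with_compat_libs` are illustrative names.

```python
import os
import subprocess
from typing import Dict, Mapping, Sequence

# Exit code the probe uses for CUDA init failure (per the description above).
CUDA_INIT_FAILURE = 2

# Assumption: where the CUDA forward-compat libraries sit inside the image.
DEFAULT_COMPAT_DIR = "/usr/local/cuda/compat"


def with_compat_libs(env: Mapping[str, str], compat_dir: str) -> Dict[str, str]:
    """Return a copy of env with compat_dir prepended to LD_LIBRARY_PATH."""
    new_env = dict(env)
    existing = new_env.get("LD_LIBRARY_PATH", "")
    new_env["LD_LIBRARY_PATH"] = f"{compat_dir}:{existing}" if existing else compat_dir
    return new_env


def choose_env(probe: Sequence[str], env: Mapping[str, str],
               compat_dir: str = DEFAULT_COMPAT_DIR) -> Dict[str, str]:
    """Try the host driver first; add compat libs only on CUDA init failure."""
    rc = subprocess.run(list(probe), env=dict(env)).returncode
    if rc == 0:
        # Host driver works as-is, so keep the compat libs off the search path.
        return dict(env)
    if rc != CUDA_INIT_FAILURE:
        raise RuntimeError(f"probe failed with exit code {rc} (not a CUDA init failure)")
    patched = with_compat_libs(env, compat_dir)
    retry_rc = subprocess.run(list(probe), env=patched).returncode
    if retry_rc != 0:
        raise RuntimeError(f"probe still failing with compat libs (exit code {retry_rc})")
    return patched
```

Trying the plain environment first means older compat libs never shadow a newer host driver; they only reach `LD_LIBRARY_PATH` when the plain run cannot initialize CUDA.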
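A minimal sketch of the tar + scp + untar deploy, assuming passwordless SSH to each host, `/tmp` as the staging directory on both ends, and hypothetical helper names (`pack_cmd`, `host_cmds`, `deploy`); the real deploy script may be organized differently. The tree is archived once locally, so each host receives a single file instead of many per-file transfers.

```python
import os
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def pack_cmd(src_dir: str, archive: str) -> List[str]:
    """tar the whole tree once so it crosses the network as a single file."""
    src_dir = os.path.abspath(src_dir)  # also strips any trailing slash
    return ["tar", "-czf", archive, "-C", os.path.dirname(src_dir),
            os.path.basename(src_dir)]


def host_cmds(host: str, archive: str, remote_parent: str) -> List[List[str]]:
    """One scp of the archive, then one ssh that unpacks it and cleans up."""
    remote_archive = f"/tmp/{os.path.basename(archive)}"
    unpack = " && ".join([
        f"mkdir -p {shlex.quote(remote_parent)}",
        f"tar -xzf {shlex.quote(remote_archive)} -C {shlex.quote(remote_parent)}",
        f"rm -f {shlex.quote(remote_archive)}",
    ])
    return [
        ["scp", "-q", archive, f"{host}:{remote_archive}"],
        ["ssh", host, unpack],
    ]


def deploy(src_dir: str, hosts: Iterable[str], remote_parent: str,
           archive: str = "/tmp/deploy.tar.gz", max_workers: int = 8) -> None:
    """Pack locally once, then push to all hosts in parallel."""
    subprocess.run(pack_cmd(src_dir, archive), check=True)

    def push(host: str) -> None:
        for cmd in host_cmds(host, archive, remote_parent):
            subprocess.run(cmd, check=True)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so any CalledProcessError is re-raised here.
        list(pool.map(push, hosts))
```

Archiving with `-C` keeps paths inside the tarball relative, so the tree unpacks under `remote_parent` on every host.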
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request updates the deployment pipeline so that custom CMake
arguments can be passed to `pip install` on remote VMs.
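One way to forward CMake arguments to a remote `pip install`, assuming the package's build reads a `CMAKE_ARGS` environment variable (a convention some CMake-based Python builds follow; whether this project's build does is an assumption). `remote_pip_install_cmd` is a hypothetical helper, and `python -m pip install -v .` stands in for whatever install command the pipeline actually runs.

```python
import shlex
from typing import List, Optional


def remote_pip_install_cmd(host: str, package_dir: str,
                           cmake_args: Optional[List[str]] = None) -> List[str]:
    """Build an ssh argv that runs pip install in package_dir on host.

    CMake arguments are forwarded through a CMAKE_ARGS environment variable
    (an assumption about how the build consumes them).
    """
    parts = [f"cd {shlex.quote(package_dir)} &&"]
    if cmake_args:
        # CMAKE_ARGS is one space-separated string, so an argument that itself
        # contains whitespace cannot be represented safely in this sketch.
        if any(any(ch.isspace() for ch in arg) for arg in cmake_args):
            raise ValueError("CMake arguments must not contain whitespace")
        # Quote the whole value once so the remote shell sees one assignment.
        parts.append(f"CMAKE_ARGS={shlex.quote(' '.join(cmake_args))}")
    parts.append("python -m pip install -v .")
    return ["ssh", host, " ".join(parts)]


# Example: Release build targeting H100 (compute capability 9.0).
cmd = remote_pip_install_cmd(
    "vm-0", "/opt/src/pkg",
    ["-DCMAKE_BUILD_TYPE=Release", "-DCMAKE_CUDA_ARCHITECTURES=90"],
)
print(cmd[2])
# prints: cd /opt/src/pkg && CMAKE_ARGS='-DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=90' python -m pip install -v .
```

Passing the remote command as a single quoted string keeps the local shell out of the picture; only the remote shell parses it.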
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>