[CK][CI] Retry builds on node failure with automatic
rerouting (#8332)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
When a Jenkins node enters a bad state (missing GPU driver, dead Docker
daemon, full disk), every PR scheduled onto it fails the same way until
a human manually takes it offline. Some failures are also transient and
would pass on a simple retry. Today the pipeline does neither — every
failure goes straight to red on the same node.
## Technical Details
Two new retry behaviors based on failure type:
- **Different node** for persistent node faults (driver missing, daemon
down, disk full, container won't start)
- **Retry in place** for transient glitches (registry pull, DNS), then a
different node if retries are exhausted
Real build/compile failures and aborted builds are never retried.
**New:** `src/org/ck/NodeFault.groovy`, `TransientFault.groovy` — typed
exceptions in the shared library `src/` for stable classloader identity
under dynamic library loading.
**`vars/ck.groovy`:** adds `preflight()` (host health checks before
build), `pullImage()` (classifying pull failures at the call site,
replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop,
up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub
failure status is only set once all retries are exhausted.
**`Jenkinsfile`:** all active `Build CK and run Tests` stages converted
to `agent none` + `ck.runOnHealthyNode(…)`.
## Test Plan
Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with
`USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging,
reroute on node fault, attempt counter in logs, no retry on aborts, and
single failure status report after budget exhausted.
## Test Result
Retry logic working as expected. Three bugs found and fixed during
testing: false `NodeFault` from host-level sccache probe (sccache is
in-container), `null` node name in catch logging, and `sh` calls outside
`node()` context in status reporting.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.