mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-28 18:56:59 +00:00
[CK][CI] Retry builds on node failure with automatic rerouting (#8332)

## Motivation

When a Jenkins node enters a bad state (missing GPU driver, dead Docker daemon, full disk), every PR scheduled onto it fails the same way until a human manually takes it offline. Some failures are also transient and would pass on a simple retry. Today the pipeline does neither: every failure goes straight to red on the same node.

## Technical Details

Two new retry behaviors based on failure type:

- **Different node** for persistent node faults (driver missing, daemon down, disk full, container won't start)
- **Retry in place** for transient glitches (registry pull, DNS), then a different node if retries are exhausted

Real build/compile failures and aborted builds are never retried.

**New:** `src/org/ck/NodeFault.groovy`, `TransientFault.groovy` — typed exceptions in the shared library `src/` for stable classloader identity under dynamic library loading.

**`vars/ck.groovy`:** adds `preflight()` (host health checks before build), `pullImage()` (classifying pull failures at the call site, replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop, up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub failure status is only set once all retries are exhausted.

**`Jenkinsfile`:** all active `Build CK and run Tests` stages converted to `agent none` + `ck.runOnHealthyNode(…)`.

## Test Plan

Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging, reroute on node fault, attempt counter in logs, no retry on aborts, and single failure status report after budget exhausted.

## Test Result

Retry logic working as expected.
Three bugs found and fixed during testing: false `NodeFault` from host-level sccache probe (sccache is in-container), `null` node name in catch logging, and `sh` calls outside `node()` context in status reporting.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
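
## Illustrative sketch of the retry policy

The two retry behaviors under Technical Details can be sketched roughly as below. This is a standalone Python illustration, not the Groovy code in `vars/ck.groovy`, which this message does not show. The function names mirror `runOnHealthyNode()` and `runInPlace()`; the 3-node limit comes from the description above, while the same-node attempt count of 2, the `report_failure` callback, and the node names are assumptions made for the example.

```python
class NodeFault(Exception):
    """Persistent host problem: the build should move to a different node."""


class TransientFault(Exception):
    """Short-lived glitch: worth retrying on the same node first."""


def run_in_place(step, node, attempts=2):
    # Retry only transient faults on the same node. The attempt count is an
    # assumption; the commit message does not state the real value.
    last = None
    for _ in range(attempts):
        try:
            return step(node)
        except TransientFault as err:
            last = err
    # Retries exhausted: escalate so the outer loop picks another node.
    raise NodeFault(f"transient retries exhausted on {node}") from last


def run_on_healthy_node(step, nodes, max_nodes=3, report_failure=print):
    # Outer reroute loop: try up to max_nodes nodes, one after another.
    if not nodes:
        raise ValueError("no candidate nodes")
    last = None
    for node in nodes[:max_nodes]:
        try:
            return run_in_place(step, node)
        except NodeFault as err:
            last = err
        # Any other exception (a real build failure, an abort) is not
        # caught here, so it propagates immediately without a retry.
    # Report failure once, only after the whole node budget is spent.
    report_failure(f"giving up after {min(len(nodes), max_nodes)} node(s): {last}")
    raise last


# Usage: node-a has a persistent fault, node-b hiccups once and then succeeds.
calls = []


def flaky_build(node):
    calls.append(node)
    if node == "node-a":
        raise NodeFault("GPU driver missing")
    if node == "node-b" and calls.count("node-b") == 1:
        raise TransientFault("registry pull timed out")
    return f"built on {node}"


result = run_on_healthy_node(flaky_build, ["node-a", "node-b", "node-c"])
print(result)  # built on node-b
```

Keeping the two fault types as distinct exception classes lets the inner loop catch only `TransientFault` and the outer loop catch only `NodeFault`, so anything else falls through untouched.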