composable_kernel/groovy
Brock Hargreaves 3719bf05c2 [rocm-libraries] ROCm/rocm-libraries#8644 (commit 8b2545e)
[CK][CI] Expand other stages to use healthy-node retry logic. (#8644)

## Motivation

CI stages bound to a single node via the declarative `agent{ label }`
cannot recover when that node is unhealthy — a `NodeFault` (e.g. an
exhausted `gitNetRetry` after a persistent `Could not resolve host:
github.com`, or a missing GPU) just fails the whole build. The existing
`runOnHealthyNode` wrapper already reroutes such faults to a different
node, but only the "Build CK and run Tests" stages used it. This PR
brings the remaining node-bound stages under that wrapper so a bad node
reroutes instead of failing the build.

## Technical Details

- `runOnHealthyNode`/`preflight` (`ck.groovy`): `preflight` now takes a
`requireGpu` flag that gates the GPU-only checks
(`driverUp`/`devicesUp`); `daemonUp`/`diskOk` always
run. `runOnHealthyNode` derives it from the node label
(`!label.contains('nogpu')`), so no call-site argument is needed and
nogpu stages skip the GPU checks automatically.
- `Jenkinsfile`: migrated 19 `agent{ label }` stages to `agent none` +
`ck.runOnHealthyNode(...)`:
  - 12 GPU test stages: Pytorch, AITER ×2, FA ×2, 3 grouped-conv (gfx90a),
    TILE_ENGINE_BASIC, TILE_ENGINE_GEMM ×3.
  - 4 FMHA stages: the per-stage `environment{}` block was dissolved and
    `build_and_run_fmha(arch)` is now evaluated on-node inside the closure
    (required since `agent none` evaluates `environment{}` off-node).
  - 3 nogpu stages: Determine CI Execution, Build Docker, Static checks
    (reroute via the label-derived `requireGpu=false`).
- The commented-out `gfx908`/`gfx1010` stages were intentionally left
untouched.
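
The `requireGpu` gating described above can be sketched in plain Groovy. This is a minimal model under stated assumptions, not the code in `ck.groovy`: the `probe` map, the `requireGpuFor` helper, the list-of-failed-checks return value, and the labels `'nogpu'` and `'gfx90a'` used as inputs are all hypothetical. Only the check names and the `!label.contains('nogpu')` derivation come from the description.

```groovy
// Plain-Groovy model of the gating; it does not call Jenkins or ck.groovy.
// `probe` is a hypothetical map of check name -> boolean that stands in
// for the real on-node checks.

// Derivation quoted in the description: a label containing 'nogpu'
// means the GPU-only checks are skipped.
boolean requireGpuFor(String label) {
    return !label.contains('nogpu')
}

// Returns the names of the checks that failed; an empty list means
// the node passed preflight. (The real return type is not shown in the PR.)
List<String> preflight(Map<String, Boolean> probe, boolean requireGpu) {
    // daemonUp and diskOk always run.
    List<String> checks = ['daemonUp', 'diskOk']
    // driverUp and devicesUp are GPU-only and gated by requireGpu.
    if (requireGpu) {
        checks += ['driverUp', 'devicesUp']
    }
    // A check missing from the probe map is treated as failed.
    return checks.findAll { !probe.get(it, false) }
}

// A CPU-only node: daemon and disk are fine, no GPU driver or devices.
def cpuNode = [daemonUp: true, diskOk: true, driverUp: false, devicesUp: false]

// With a nogpu label the GPU checks are never consulted, so the node passes.
assert preflight(cpuNode, requireGpuFor('nogpu')) == []
// With a GPU label the same node is reported as unhealthy.
assert preflight(cpuNode, requireGpuFor('gfx90a')) == ['driverUp', 'devicesUp']
```

Deriving the flag from the label keeps call sites unchanged, which is why the nogpu stages need no extra argument.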

## Test Plan

- Validated the Jenkinsfile against the pipeline linter
(`pipeline-model-converter/validate`).
- Manual pipeline run to confirm no regressions.
- Key checks: nogpu stages pass `preflight` (no false
`gpu-devices-missing`); migrated stages still allocate, build, and test
as before; on a `NodeFault` the `Node attempt N/3 … on <node>` reroute
lines appear.
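
The reroute behaviour behind the last key check can be modelled in plain Groovy, outside Jenkins. This is a rough sketch only: `NodeFault` here is a local stand-in class, and `runWithReroute`, the candidate node list, and the node names are hypothetical. How the real `runOnHealthyNode` selects the next node is not shown in this description; only the limit of 3 attempts and the reroute-on-`NodeFault` behaviour are taken from it.

```groovy
// Local stand-in for the fault type; not the class from the shared library.
class NodeFault extends RuntimeException {
    NodeFault(String message) { super(message) }
}

// Tries `body` on each candidate node in turn, up to maxAttempts nodes.
// A NodeFault moves on to the next node; any other exception propagates
// unchanged, so genuine build or test failures are not retried.
def runWithReroute(List<String> nodes, int maxAttempts, Closure body) {
    NodeFault last = null
    for (String node : nodes.take(maxAttempts)) {
        try {
            return body(node)
        } catch (NodeFault fault) {
            last = fault   // unhealthy node: remember the fault, try the next
        }
    }
    int tried = Math.min(nodes.size(), maxAttempts)
    throw new NodeFault("no healthy node after ${tried} attempts: ${last?.message}")
}

// First node reports a fault, second node is healthy.
def visited = []
def result = runWithReroute(['node-a', 'node-b', 'node-c'], 3) { String node ->
    visited << node
    if (node == 'node-a') {
        throw new NodeFault('gpu-devices-missing')
    }
    return 'built on ' + node
}

assert visited == ['node-a', 'node-b']
assert result == 'built on node-b'
```

In this model a stage only fails on node health once every attempt has raised a `NodeFault`.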

## Test Result

- Linter: **"Jenkinsfile successfully validated."**
- Groovy shared-library pipeline: not locally executable, so the manual
build above is required to fully validate. Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-29 16:03:23 +00:00