mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-07-03 05:37:34 +00:00
[CK][CI] Expand other stages to use healthy-node retry logic. (#8644)

## Motivation

CI stages bound to a single node via the declarative `agent{ label }` cannot recover when that node is unhealthy: a `NodeFault` (e.g. an exhausted `gitNetRetry` after a persistent `Could not resolve host: github.com`, or a missing GPU) just fails the whole build. The existing `runOnHealthyNode` wrapper already reroutes such faults to a different node, but only the "Build CK and run Tests" stages used it. This PR brings the remaining node-bound stages under that wrapper so a bad node reroutes instead of failing the build.

## Technical Details

- `runOnHealthyNode`/`preflight` (`ck.groovy`): `preflight` now takes a `requireGpu` flag that gates the GPU-only checks (`driverUp`/`devicesUp`); `daemonUp`/`diskOk` always run. `runOnHealthyNode` derives it from the node label (`!label.contains('nogpu')`), so no call-site argument is needed and nogpu stages skip the GPU checks automatically.
- `Jenkinsfile`: migrated 19 `agent{ label }` stages to `agent none` + `ck.runOnHealthyNode(...)`:
  - 12 GPU test stages: Pytorch, AITER ×2, FA ×2, 3 grouped-conv (gfx90a), TILE_ENGINE_BASIC, TILE_ENGINE_GEMM ×3.
  - 4 FMHA stages: the per-stage `environment{}` block was dissolved and `build_and_run_fmha(arch)` is now evaluated on-node inside the closure (required since `agent none` evaluates `environment{}` off-node).
  - 3 nogpu stages: Determine CI Execution, Build Docker, Static checks (reroute via the label-derived `requireGpu=false`).
- The commented-out `gfx908`/`gfx1010` stages were intentionally left untouched.

## Test Plan

- Validated the Jenkinsfile against the pipeline linter (`pipeline-model-converter/validate`).
- Manual pipeline run to confirm no regressions.
- Key checks: nogpu stages pass `preflight` (no false `gpu-devices-missing`); migrated stages still allocate, build, and test as before; on a `NodeFault` the `Node attempt N/3 … on <node>` reroute lines appear.

## Test Result

- Linter: **"Jenkinsfile successfully validated."**
- Groovy shared-library pipeline: not locally executable, so the manual build above is required to fully validate. Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
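
## Illustrative sketch

For readers who want to see the label-derived `requireGpu` gating from Technical Details in miniature, here is a plain-Groovy sketch. It is not the code in `ck.groovy`. The function names mirror the description above, but the bodies, the `NodeFault` class shape, the `requireGpuFor` helper, the `candidates` list of nodes, and the `maxAttempts` parameter are all assumptions made for illustration. In a real pipeline the node would be allocated by Jenkins rather than picked from a list.

```groovy
// Sketch only: hypothetical stand-ins for the wrapper described in this PR.

// Assumed fault type; the real NodeFault may carry different fields.
class NodeFault extends RuntimeException {
    NodeFault(String reason) { super(reason) }
}

// Hypothetical helper: GPU checks are required unless the label contains 'nogpu'.
boolean requireGpuFor(String label) {
    return !label.contains('nogpu')
}

// `checks` is a hypothetical map of check name -> result for one node.
// daemonUp/diskOk always run; driverUp/devicesUp run only when requireGpu is true.
void preflight(Map checks, boolean requireGpu) {
    def required = ['daemonUp', 'diskOk']
    if (requireGpu) {
        required = required + ['driverUp', 'devicesUp']
    }
    def failed = required.findAll { !checks[it] }
    if (failed) {
        throw new NodeFault('preflight failed: ' + failed.join(', '))
    }
}

// Try candidate nodes in order, rerouting to the next one on a NodeFault.
def runOnHealthyNode(String label, List<Map> candidates, Closure body, int maxAttempts = 3) {
    boolean requireGpu = requireGpuFor(label)
    NodeFault last = null
    int attempts = Math.min(maxAttempts, candidates.size())
    for (int i = 0; i < attempts; i++) {
        def node = candidates[i]
        try {
            preflight(node.checks, requireGpu)
            return body(node.name)
        } catch (NodeFault fault) {
            last = fault  // unhealthy node: move on to the next candidate
        }
    }
    throw (last ?: new NodeFault('no candidate nodes'))
}
```

With a first candidate that lacks the GPU checks and a second that passes them all, a GPU label would be rerouted to the second node, while a label containing `nogpu` would run on the first, since only `daemonUp` and `diskOk` are consulted there.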