7 Commits

Author SHA1 Message Date
Bartłomiej Kocot
7c2b979de2 [rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d)
[CK][CK Tile] Drop profiler for experimental builder codegen (#8573)

## Motivation

Switch to the dispatcher profiler for CK Tile conv.

## Technical Details

- Switch to the dispatcher profiler for CK Tile conv.
- Drop the profiler for experimental builder codegen.
- Minor fixes for bwd data printing.
- Minor fixes for 3D conv in dispatcher codegen.

## Test Plan

test_grouped_conv*tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 09:38:44 +00:00
Brock Hargreaves
081fe18c1c [rocm-libraries] ROCm/rocm-libraries#8558 (commit ccfa08b)
[CK][CI] Retry git network ops to survive transient DNS blips (#8558)

## Motivation

CI builds intermittently fail on transient git DNS blips (e.g.
`Could not resolve host: github.com`). These surface as an untyped
`exit code 1`, which the existing node/transient-fault retry doesn't
catch, so a momentary glitch fails the whole build.

## Technical Details

Added `gitNetRetry(label, body)` (3 attempts, 15s backoff) and wrapped
every github.com-touching git step: ref-repo clone/update, `checkout
scm`, and the hipTensor clone. All are idempotent on retry. Docker pulls
are left to the existing `pullImage()` path.
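
The retry wrapper described above can be sketched as follows. This is a minimal illustration in Python, not the Groovy helper itself: the name `git_net_retry`, the injectable `sleep` parameter, and the choice to retry on any exception are assumptions; only the attempt count (3) and backoff (15s) come from the description.

```python
import time


def git_net_retry(label, body, attempts=3, backoff_s=15, sleep=time.sleep):
    """Run body(); on failure, retry with a fixed backoff.

    Hypothetical Python analogue of gitNetRetry(label, body). The body must
    be idempotent, since it may run more than once.
    """
    for attempt in range(1, attempts + 1):
        try:
            return body()
        except Exception as err:  # assumption: every failure is retryable
            if attempt == attempts:
                raise  # budget exhausted: surface the last error
            print(f"[{label}] attempt {attempt}/{attempts} failed: {err}; "
                  f"retrying in {backoff_s}s")
            sleep(backoff_s)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting out the real backoff.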

## Test Plan

- Mapped the failing build's `git remote update` DNS error to a
now-wrapped call.
- Confirmed no existing code retries git host-resolution failures.

## Test Result

Groovy shared-library — not locally executable; needs a pipeline run to
fully validate. Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 21:18:27 +00:00
Brock Hargreaves
8864dcc3a4 [rocm-libraries] ROCm/rocm-libraries#8560 (commit f8362a1)
[CK][CI] Post failure GitHub status on stage build errors (#8560)

## Motivation

Failed CI stages (e.g. Static checks) were left stuck on a `pending`
GitHub status instead of reporting `failure`, so PRs showed an overall
failure with no indication of which check actually failed.

## Technical Details

`buildAndTest` posted `pending`/`success` statuses but its catch only
rethrew, deferring failure reporting to `runOnHealthyNode` — which
deferred right back. Neither posted `failure`. This adds a `failure`
status post for real build errors in `buildAndTest`, while letting
node-reroute signals (`NodeFault`/`TransientFault`) and aborts
(`FlowInterruptedException`) propagate untouched so retries still work.
Since every stage routes through `buildAndTest`, this fixes both the
directly-called `Static checks` stage and the `runOnHealthyNode`-wrapped
per-arch build stages.
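
The catch behavior described above can be sketched in Python. This is an illustrative model, not the Groovy code: the exception classes are local stand-ins for `NodeFault`, `TransientFault`, and `FlowInterruptedException`, and `post_status` is a hypothetical callback in place of the real GitHub status call.

```python
class NodeFault(Exception):
    """Stand-in for the shared library's NodeFault."""


class TransientFault(Exception):
    """Stand-in for the shared library's TransientFault."""


class FlowInterrupted(Exception):
    """Stand-in for Jenkins' FlowInterruptedException (an abort)."""


def build_and_test(context, body, post_status):
    """Post pending, run the stage, then post success or failure."""
    post_status(context, "pending")
    try:
        body()
    except (NodeFault, TransientFault, FlowInterrupted):
        # Reroute signals and aborts propagate untouched so the outer
        # retry logic can still act on them; no status is posted here.
        raise
    except Exception:
        # A real build error: report it before rethrowing.
        post_status(context, "failure")
        raise
    post_status(context, "success")
```

The ordering of the `except` clauses matters: the reroute and abort types are matched first, so only the remaining errors reach the `failure` post.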

## Test Plan

Trigger a stage failure (e.g. introduce a clang-format violation) and
confirm the corresponding GitHub status context transitions `pending` →
`failure` rather than remaining `pending`.

## Test Result

Pending CI run on a branch with a deliberate failure to confirm the
status transition.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-18 21:16:24 +00:00
Brock Hargreaves
bad7870830 [rocm-libraries] ROCm/rocm-libraries#8508 (commit 5cc3bef)
[CK][CI] Make gfx1250 build compile-only

## Motivation

gfx1250 has no CI hardware, so its build piggybacks on gfx90a nodes
where gfx1250 binaries can be compiled but not run. The build currently
fails because post-build runtime tests fire on the gfx90a node. This PR
makes the gfx1250 build compile + install only.

## Technical Details

The post-build test block in `buildAndTest` (`ck.groovy`) keys off the
physical node arch (`gfx90a`), so runtime tests run for gfx1250. Gated
that block off for gfx1250. Body-only change with no signature changes,
so it's backward compatible with the develop-pinned shared library and
doesn't affect other archs.
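
The gating can be pictured with a small Python sketch. The function and the `COMPILE_ONLY_ARCHS` set are hypothetical names for illustration; the real change lives in the Groovy `buildAndTest` body. The point is that the decision keys off the target arch rather than the physical node arch.

```python
# Hypothetical set of targets that are built but never run in CI.
COMPILE_ONLY_ARCHS = {"gfx1250"}


def planned_steps(target_arch, node_arch):
    """Return the stage steps for a build of target_arch on a node_arch host."""
    steps = ["build", "install"]
    # Gate on the target, not the node: a gfx90a node can compile gfx1250
    # binaries but cannot execute them.
    if target_arch not in COMPILE_ONLY_ARCHS:
        steps.append(f"runtime tests on {node_arch}")
    return steps
```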

## Test Plan

Trigger the gfx1250 build with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`
and confirm it compiles/installs with no runtime test steps; confirm
gfx90a builds are unchanged.

## Test Result

Check CI.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-18 18:33:59 +00:00
Illia Silin
e2deaaba64 [rocm-libraries] ROCm/rocm-libraries#8591 (commit 5210ae6)
[CK] fix daily hipTensor tests.

## Motivation

Changed the way hipTensor is cloned so that the clone doesn't erase the
CK installation and uses the correct installation path. Also added the
"install" target to every full build-and-test run, so that CK can be
used for testing third-party libs that depend on it.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 14:58:10 +00:00
Brock Hargreaves
1b649a8d4b [rocm-libraries] ROCm/rocm-libraries#8332 (commit 48c389c)
[CK][CI] Retry builds on node failure with automatic rerouting (#8332)

## Motivation

When a Jenkins node enters a bad state (missing GPU driver, dead Docker
daemon, full disk), every PR scheduled onto it fails the same way until
a human manually takes it offline. Some failures are also transient and
would pass on a simple retry. Today the pipeline does neither — every
failure goes straight to red on the same node.

## Technical Details

Two new retry behaviors based on failure type:
- **Different node** for persistent node faults (driver missing, daemon
down, disk full, container won't start)
- **Retry in place** for transient glitches (registry pull, DNS), then a
different node if retries are exhausted

Real build/compile failures and aborted builds are never retried.

**New:** `src/org/ck/NodeFault.groovy`, `TransientFault.groovy` — typed
exceptions in the shared library `src/` for stable classloader identity
under dynamic library loading.

**`vars/ck.groovy`:** adds `preflight()` (host health checks before
build), `pullImage()` (classifying pull failures at the call site,
replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop,
up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub
failure status is only set once all retries are exhausted.

**`Jenkinsfile`:** all active `Build CK and run Tests` stages converted
to `agent none` + `ck.runOnHealthyNode(…)`.
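
The two retry layers can be sketched in Python as a rough model of the control flow. The exception classes are local stand-ins for the Groovy ones, node selection is reduced to a plain list, and the in-place retry count of 2 is an assumption (the description only fixes the outer limit of 3 nodes).

```python
class NodeFault(Exception):
    """Persistent node problem: move to a different node."""


class TransientFault(Exception):
    """Momentary glitch: retry on the same node first."""


def run_in_place(body, node, retries=2):
    """Retry transient faults on the same node, then escalate to NodeFault."""
    for attempt in range(retries + 1):
        try:
            return body(node)
        except TransientFault as err:
            if attempt == retries:
                raise NodeFault(f"transient retries exhausted on {node}") from err


def run_on_healthy_node(nodes, body, max_nodes=3):
    """Outer reroute loop: try up to max_nodes nodes on NodeFault."""
    last = None
    for node in nodes[:max_nodes]:
        try:
            return run_in_place(body, node)
        except NodeFault as err:
            last = err  # this node is unhealthy; try the next one
    raise RuntimeError("retry budget exhausted") from last
```

Any other exception, such as a compile error, is caught by neither loop and propagates on the first attempt, which matches the rule that real build failures are never retried.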

## Test Plan

Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with
`USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging,
reroute on node fault, attempt counter in logs, no retry on aborts, and
single failure status report after budget exhausted.

## Test Result

Retry logic working as expected. Three bugs found and fixed during
testing: false `NodeFault` from host-level sccache probe (sccache is
in-container), `null` node name in catch logging, and `sh` calls outside
`node()` context in status reporting.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-15 17:40:10 +00:00
Brock Hargreaves
96a7e44832 [rocm-libraries] ROCm/rocm-libraries#8378 (commit d68585d)
[CK] Pre-emptively add groovy/ folder and skip TheRock CI filter (#8378)

## Motivation

The CK Groovy library is growing and will be reorganized into a
self-describing `groovy/` folder rather than living under `src/` and
`vars/`. This PR creates that folder pre-emptively and adds it to the
TheRock CI skip-list so that future Groovy additions do not
unnecessarily trigger TheRock builds.

## Technical Details

- Added `projects/composablekernel/groovy/` with a `.gitkeep` to
establish the directory in the repo.
- Added `"projects/composablekernel/groovy/*"` to
`SKIPPABLE_PATH_PATTERNS` in `.github/scripts/therock_configure_ci.py`
alongside the existing `vars/*` entry, ensuring changes confined to
Groovy pipeline code are recognized as non-therock-relevant and skip the
TheRock CI pipeline.
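
How such a skip-list behaves can be sketched in Python. This is an illustration only: the list below holds just the pattern quoted above (the real `SKIPPABLE_PATH_PATTERNS` has other entries), and both the helper name `can_skip_therock` and the use of `fnmatch` for matching are assumptions about how the script evaluates patterns.

```python
from fnmatch import fnmatch

# Only the entry added by this change; the real list is longer.
SKIPPABLE_PATH_PATTERNS = [
    "projects/composablekernel/groovy/*",
]


def is_skippable(path):
    """True if path matches any skippable glob pattern."""
    return any(fnmatch(path, pattern) for pattern in SKIPPABLE_PATH_PATTERNS)


def can_skip_therock(changed_paths):
    """Skip only when there are changes and every one of them is skippable."""
    return bool(changed_paths) and all(is_skippable(p) for p in changed_paths)
```

With `fnmatch`, `*` also matches `/`, so files in nested subfolders of `groovy/` match the same pattern.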

## Test Plan

No code logic was changed. Verified that the `therock_configure_ci.py`
pattern list is consistent with the existing `vars/*` skip entry and
that the new pattern follows the same glob convention.

## Test Result

N/A — directory scaffolding and CI filter only; no functional code
affected.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-12 20:11:53 +00:00