Commit Graph

3 Commits

Author SHA1 Message Date
Yi DING
b2ea5fd315 [rocm-libraries] ROCm/rocm-libraries#6701 (commit f9a8d1c)
[CK] Fix CI Failures for PR From Forks

## Motivation

Fork PRs fail CI when `RUN_AITER_TESTS` or `RUN_FA_TESTS` is enabled.
The docker scripts run `git clone -b "$CK_*_BRANCH"
https://github.com/ROCm/rocm-libraries.git`, but a fork's branch doesn't
exist upstream:

```
fatal: Remote branch <fork-branch> not found in upstream origin
```

Example: [PR #6529 build
#4](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6529/4/pipeline).

## Technical Details

**`Jenkinsfile`** — for PRs, use the upstream-visible PR ref instead of
the head branch name:

```groovy
CURRENT_BRANCH_NAME = env.CHANGE_ID
    ? "refs/pull/${env.CHANGE_ID}/head"
    : (env.CHANGE_BRANCH ? env.CHANGE_BRANCH : env.BRANCH_NAME)
```
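To reproduce the selection outside Jenkins, the same precedence can be sketched in shell. The function name `resolve_ck_ref` is hypothetical and not part of the change. The three variables are the Jenkins environment variables used in the expression above.

```sh
#!/bin/sh
# Hypothetical helper mirroring the Jenkinsfile expression above.
# Precedence: PR ref (when CHANGE_ID is set), then CHANGE_BRANCH,
# then BRANCH_NAME.
resolve_ck_ref() {
    if [ -n "${CHANGE_ID:-}" ]; then
        printf 'refs/pull/%s/head\n' "$CHANGE_ID"
    elif [ -n "${CHANGE_BRANCH:-}" ]; then
        printf '%s\n' "$CHANGE_BRANCH"
    else
        printf '%s\n' "${BRANCH_NAME:-}"
    fi
}
```

For a PR build, `CHANGE_ID=6701 resolve_ck_ref` prints `refs/pull/6701/head`, a ref that exists upstream whether or not the PR comes from a fork.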

**`Dockerfile.aiter` / `Dockerfile.fa`** — `git clone -b <ref>` only
accepts branches (`refs/heads/*`) and tags (`refs/tags/*`), so it can't
resolve `refs/pull/N/head`. Switch to `git fetch`, which accepts any
refspec (and still works for plain branch names):

```sh
mkdir rocm-libraries && cd rocm-libraries
git init -q
git remote add origin https://github.com/ROCm/rocm-libraries.git
git fetch --depth 1 --filter=blob:none origin "$CK_*_BRANCH"
git sparse-checkout init --cone
git sparse-checkout set projects/composablekernel
git checkout FETCH_HEAD
```
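The difference between the two commands can be checked offline. The sketch below is illustrative only: a throwaway local repository stands in for upstream, and the PR number `1` and all paths are made up.

```sh
#!/bin/sh
# Offline demonstration; nothing here touches the real upstream.
work=$(mktemp -d)
src="$work/upstream"

# Create an "upstream" with one commit and a PR-style ref pointing at it.
git init -q "$src"
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$src" update-ref refs/pull/1/head HEAD

# `git clone -b` only looks under refs/heads/ and refs/tags/, so it
# cannot find refs/pull/1/head and the clone fails.
if git clone -q -b refs/pull/1/head "file://$src" "$work/clone" 2>/dev/null; then
    clone_result=ok
else
    clone_result=failed
fi

# `git fetch` takes the full ref name, so the same ref resolves.
git init -q "$work/fetch"
git -C "$work/fetch" remote add origin "file://$src"
git -C "$work/fetch" fetch -q --depth 1 origin refs/pull/1/head
git -C "$work/fetch" checkout -q FETCH_HEAD
fetch_result=$(git -C "$work/fetch" rev-parse HEAD)
upstream_head=$(git -C "$src" rev-parse refs/pull/1/head)

echo "clone -b: $clone_result"
echo "fetched:  $fetch_result"
echo "upstream: $upstream_head"
```

The fetched commit should match the commit behind the upstream `refs/pull/1/head`, while the `clone -b` attempt reports a failure.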

`git checkout FETCH_HEAD` leaves the repository in detached HEAD state,
which breaks the existing `git branch -m "$CK_*_BRANCH"` because there is
no current branch to rename (and a ref such as `refs/pull/N/head` makes a
poor local branch name anyway). Decouple the local branch name from the
upstream ref:

- Replace `git init` + `git branch -m` with `git init -b
"$LOCAL_BRANCH"` (requires git ≥ 2.28, satisfied by base images)
- `LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"` in the rocm-libraries
path; `LOCAL_BRANCH="$CK_*_BRANCH"` in the fallback
- Downstream `git clone -b ... ../ck` uses `$LOCAL_BRANCH`
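A minimal sketch of the decoupling, assuming the extracted CK tree is re-initialized as its own repository before the downstream clone. The SHA value and directory names are placeholders, and how the Dockerfiles obtain `ROCM_LIBRARIES_SHA` is not shown here.

```sh
#!/bin/sh
# Illustrative only: placeholder SHA and temporary directories.
work=$(mktemp -d)
ROCM_LIBRARIES_SHA=0123abc
LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"

# Stand-in for the extracted projects/composablekernel tree.
mkdir -p "$work/import"
echo "demo" > "$work/import/README.md"

# `git init -b` (git >= 2.28) names the initial branch up front, so no
# `git branch -m` is needed afterwards.
git init -q -b "$LOCAL_BRANCH" "$work/import"
git -C "$work/import" add -A
git -C "$work/import" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "import"

# The downstream clone refers to the local name, never to the upstream ref.
git clone -q -b "$LOCAL_BRANCH" "$work/import" "$work/ck"
cloned_branch=$(git -C "$work/ck" rev-parse --abbrev-ref HEAD)
echo "$cloned_branch"
```

Because `LOCAL_BRANCH` is derived from a commit SHA rather than from the upstream ref, it stays a plain branch name even when the upstream ref is `refs/pull/N/head`.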

## Test Plan

Manually trigger a build on this PR with `RUN_AITER_TESTS=true` and
`RUN_FA_TESTS=true`; both docker images should build end-to-end.

## Test Result
[jenkins / rocm-libraries-folder/Composable Kernel / PR-6701 /
#3](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6701/3/pipeline/)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-24 08:23:24 +00:00
Yi DING
eaaed3e35e [rocm-libraries] ROCm/rocm-libraries#6563 (commit 6559ac9)
[CK] Add render group to AITER and FA dockers

## Motivation

The AITER and FA test dockers (`Dockerfile.aiter`, `Dockerfile.fa`)
inherit from the `rocm/pytorch` base image. Recent updates to that base
image dropped the `render` group from `/etc/group`, so every parallel
test stage now fails on the test agents with:

```
docker: Error response from daemon: Unable to find group render:
no matching entries in group file.
```

Docker resolves `--group-add render` against the **container's**
`/etc/group`, not the host's, so even though the test agents have `render`
in their `/etc/group` (GID 109), the lookup inside the container fails.

This pattern affects every recent develop build
([#673](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/673),
[#674](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/674),
[#686](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/686),
[#688](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/688),
[#699](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/699),
[#708](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/708)
— 6 days in a row), where AITER tests fail in seconds and the cascading
failure aborts all downstream Build/FMHA/TILE_ENGINE stages.

## Technical Details

Add `groupadd -f render` to both `Dockerfile.aiter` and `Dockerfile.fa`,
mirroring what the main `Dockerfile` already does (`Dockerfile:96`) and
what `Dockerfile.pytorch` does (`Dockerfile.pytorch:4`). The `-f` flag
makes it idempotent — silently succeeds if the group already exists.

This guarantees the `render` group is always present in the container,
regardless of whether the base image happens to ship it.
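The effect can be illustrated without building an image. The sketch below uses a hypothetical helper, `group_in_file`, and a temporary file in `group(5)` format instead of the real `/etc/group`. It is not part of the change.

```sh
#!/bin/sh
# Hypothetical helper: report whether a group name has an entry in a
# group(5)-format file. Defaults to /etc/group.
group_in_file() {
    name=$1
    file=${2:-/etc/group}
    # Each entry starts with "<name>:"; anchoring the match avoids
    # hits on names that merely contain the string.
    grep -q "^${name}:" "$file"
}

# Demonstration against a temporary file, standing in for a base image
# that ships no render group.
tmp=$(mktemp)
printf 'root:x:0:\nvideo:x:44:\n' > "$tmp"
if group_in_file render "$tmp"; then before=present; else before=missing; fi

# Appending an entry stands in for what `groupadd -f render` does in the
# image. The GID is illustrative; groupadd picks one itself unless -g is given.
printf 'render:x:109:\n' >> "$tmp"
if group_in_file render "$tmp"; then after=present; else after=missing; fi

echo "before: $before, after: $after"
rm -f "$tmp"
```

Inside a built image, `getent group render` is a comparable check that the group can be resolved.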

## Test Plan
Triggering AITER CI job:

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-21 05:36:37 +00:00
Yi DING
4d0bbe5d17 [rocm-libraries] ROCm/rocm-libraries#5329 (commit 9c43062)
[CK] Add flash_attn tests

## Motivation

Add CI support for running
[flash-attention](https://github.com/ROCm/flash-attention) tests against
CK, similar to existing AITER and PyTorch downstream test pipelines.

## Technical Details

### New: `Dockerfile.fa`
A new Dockerfile that builds a flash-attention test image on top of a
ROCm PyTorch base image. It:
- Sparse-checkouts CK from `rocm-libraries` (or clones directly from
`ROCm/composable_kernel`)
- Clones and builds `flash-attention` with CK as the backend
- Supports configurable `FA_BRANCH`, `CK_FA_BRANCH`, and `GPU_ARCHS`
build args

### Updated: `Jenkinsfile`

**buildDocker refactor:**
- Extracted `buildAndPushDockerImage()` helper that handles both "check
if exists, skip" and "force build, push" logic, eliminating the
duplicated try/catch blocks
- Split monolithic `buildDocker()` into `buildDockerBase()`,
`buildDockerPytorch()`, `buildDockerAiter()`, and new `buildDockerFa()`
- Each downstream docker build now runs unconditionally within its
respective guard (`RUN_PYTORCH_TESTS`, `RUN_AITER_TESTS`,
`RUN_FA_TESTS`)
- Image digests are stored in env vars (`CK_BASE_IMAGE`,
`CK_PYTORCH_IMAGE`, `CK_AITER_IMAGE`, `CK_FA_IMAGE`) for use in
downstream stages
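
The "check if exists, skip" versus "force build, push" decision can be sketched in shell. This is not the Groovy helper from the `Jenkinsfile`: `build_and_push` and its three hooks are hypothetical names, and the `force` argument plays the role that `BUILD_DOCKER` appears to play in the pipeline.

```sh
#!/bin/sh
# Placeholder hooks; the defaults call the docker CLI and may be
# redefined by the caller (for example, with stubs in a dry run).
image_exists() { docker manifest inspect "$1" >/dev/null 2>&1; }
build_image()  { docker build -t "$1" -f "$2" .; }
push_image()   { docker push "$1"; }

# build_and_push IMAGE DOCKERFILE [FORCE]
# FORCE=true always rebuilds and pushes; otherwise an image that already
# exists in the registry is reused and nothing is built.
build_and_push() {
    image=$1
    dockerfile=$2
    force=${3:-false}
    if [ "$force" != "true" ] && image_exists "$image"; then
        echo "skip: $image already exists"
        return 0
    fi
    # Push only when the build succeeded.
    build_image "$image" "$dockerfile" && push_image "$image"
}
```

Keeping the decision in one function means each per-target build only has to supply an image name and a Dockerfile.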

**run_downstream_tests refactor:**
- Merged `run_aiter_tests()` and `run_pytorch_tests()` into a single
generic `run_downstream_tests(conf)` that accepts `image`,
`timeoutHours`, and `execute_cmds`
- Test commands for each downstream target are declared as top-level
lists (`RUN_PYTORCH_TESTS_CMDS`, `RUN_AITER_TESTS_CMDS`,
`RUN_FA_TESTS_CMDS`)

**Pipeline stages:**
- Merged "Run Pytorch Tests" and "Run AITER Tests" into a single "Run
Downstream Tests" parallel stage
- Added two new FA test stages: "Run FA Tests on gfx942" and "Run FA
Tests on gfx950"
- Added new pipeline parameters: `RUN_FA_TESTS`, `fa_base_docker`,
`fa_branch`, `ck_fa_branch`
- `ck_pytorch_branch` and `ck_aiter_branch` now default to the current
branch instead of hardcoded `develop`
- CRON schedule at 13:00 now also triggers `RUN_FA_TESTS=true`

## Test Plan

- [x] Trigger pipeline manually with `RUN_FA_TESTS=true` on gfx942 and
gfx950 nodes
- [x] Verify existing AITER and PyTorch test stages are unaffected
- [x] Verify `buildAndPushDockerImage` correctly skips rebuild when
image already exists (with `BUILD_DOCKER=false`)

## Test Result


## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-10 01:23:54 +00:00