Adding New Notification Detection
## Motivation
Restricts one of the notification failure patterns so that it matches
only the specific missing-driver log message. This reduces noise from
false-positive matches. Also adds a new failure pattern to notify us of
GitHub access issues.
## Technical Details
- Set the failure pattern to match the exact failure observed in the
logs.
- Switched to a plain substring search so that special characters are
handled literally.
- Added a new failure pattern for GitHub access errors.
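
The difference between a regex match and a plain substring search can be sketched as follows. This is a minimal Python illustration, not the pipeline's actual code; the pattern and log line are hypothetical and chosen only because they contain regex metacharacters.

```python
import re


def matches_literal(pattern: str, line: str) -> bool:
    """Plain substring search: every character in the pattern is literal."""
    return pattern in line


def matches_regex(pattern: str, line: str) -> bool:
    """Regex search: characters such as ( ) [ ] . carry special meaning."""
    return re.search(pattern, line) is not None


# Hypothetical failure pattern and log line (assumptions for illustration).
pattern = "missing driver (gpu0) [fatal]"
line = "ERROR: missing driver (gpu0) [fatal]"

# As a regex, "(gpu0)" is a group and "[fatal]" is a character class,
# so the pattern does not match its own literal text.
regex_hit = matches_regex(pattern, line)

# As a substring, the pattern matches exactly what was written.
literal_hit = matches_literal(pattern, line)

# Escaping the pattern makes the regex behave like the literal search.
escaped_hit = matches_regex(re.escape(pattern), line)
```

A regex can also match too much: the pattern `lib.so` matches `libXso` because `.` accepts any character, whereas a substring search does not.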
## Test Plan
- Force a failure using the known failure patterns.
## Test Result
The forced failures were triggered and caught by the notification
system.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Improved CI infrastructure failure detection
## Motivation
This PR re-enables CI infrastructure failure detection and notification,
which had been disabled due to performance issues caused by loading
large build logs (~80k lines) into memory for pattern scanning. The goal
is to reliably detect known infrastructure failures (GPU errors, Docker
authentication issues, disk space errors, etc.) and send actionable
Teams notifications without hanging on large logs.
## Technical Details
- Replaced full build log loading and Groovy-based pattern scanning with
a streaming `wget | grep -E` pipe. `grep` scans natively, so the full log
is never loaded into Groovy, which resolves the hang on large logs.
- Combined all failure patterns into a single `grep -E` call to avoid
multiple log fetches.
- The node name is now tracked with the observed failure.
- Added a new failure pattern for devices running out of space.
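
The streaming approach can be sketched in Python as a rough analogue of the `wget | grep -E` pipe; it is not the pipeline's implementation. The pattern list, the `Running on <node>` marker format, and the names `scan_log` and `Hit` are assumptions made for illustration.

```python
import io
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Illustrative patterns only; the real list lives in the CI scripts.
FAILURE_PATTERNS = [
    r"No space left on device",
    r"unauthorized: authentication required",
]

# One combined alternation, so the log is scanned in a single pass.
COMBINED = re.compile("|".join(f"(?:{p})" for p in FAILURE_PATTERNS))

# Assumed format of the line that announces the build node.
NODE_LINE = re.compile(r"Running on (\S+)")


class Hit(NamedTuple):
    line_number: int
    node: Optional[str]
    text: str


def scan_log(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield failures line by line without holding the whole log in memory."""
    node = None
    for number, line in enumerate(lines, start=1):
        node_match = NODE_LINE.search(line)
        if node_match:
            # Remember the most recent node so a later failure can report it.
            node = node_match.group(1)
        if COMBINED.search(line):
            yield Hit(number, node, line.rstrip("\n"))


# Usage: any line iterator works, such as an open file or a streamed response.
sample = io.StringIO(
    "Running on node-a in /workspace\n"
    "step ok\n"
    "write error: No space left on device\n"
)
hits = list(scan_log(sample))
```

Because `scan_log` is a generator over an iterator of lines, memory use stays bounded by the longest line rather than by the size of the log.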
## Test Plan
- Forced failures in the "Determine CI Execution" stage with all 9
failure patterns echoed to the build log.
- Simulated large log sizes (~80k lines of dummy output) to validate
pattern detection and node name extraction at realistic log scales,
including patterns placed both before and after large blocks of dummy
output.
## Test Result
All 9 failure patterns detected correctly. Teams notifications sent with
accurate log context, node name, and job links. No hangs observed on
80k-line simulated logs.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* Removing hard-coded trace filename
* Including stage name in notification
* Simplifying capture setup and tagging file names with arch
* Removed test property from notification message
* Fixing regex to get arch name
* Fixing error in notification and modified regex