Use node.key to derive leaf labels instead of node.label, which loses
the file extension after PrimeVue Tree processing. Remove unused
getFilenameDetails import.
Fixes #10409
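A minimal sketch of the idea, assuming node.key holds a slash-separated path whose last segment is the full filename (the commit does not state the key format). TreeNodeLike and leafLabelFromKey are hypothetical names, not the project's actual types:

```typescript
// Hypothetical shape: only the field used here is modeled.
interface TreeNodeLike {
  // Assumed to be a slash-separated path, e.g. "models/checkpoints/foo.safetensors".
  key: string
}

// Derive the leaf label from the key so the file extension is preserved.
function leafLabelFromKey(node: TreeNodeLike): string {
  const segments = node.key.split('/')
  // Fall back to the whole key if the last segment is somehow missing.
  return segments[segments.length - 1] ?? node.key
}
```

For a key of "models/checkpoints/foo.safetensors" this yields "foo.safetensors", extension included.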
- Add reproducedBy field to ResearchResult and done() tool
- Agent reports how bug was proven: e2e_test, video, both, or none
- Badge shows '1 via E2E test' instead of generic '1 reproduced'
- Deploy script reads reproducedBy from research-log.json
- Test code (reproduce.spec.ts) now deployed to report page
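The badge mapping above might look roughly like this. Only the 'via E2E test' wording and the four reproducedBy values come from the commit; the other label strings and the name reproducedBadge are assumptions for illustration:

```typescript
type ReproducedBy = 'e2e_test' | 'video' | 'both' | 'none'

// Build badge text from the count and how the bug was proven.
function reproducedBadge(count: number, by: ReproducedBy): string {
  switch (by) {
    case 'e2e_test':
      return `${count} via E2E test` // wording taken from the commit message
    case 'video':
      return `${count} via video` // assumed wording
    case 'both':
      return `${count} via E2E test + video` // assumed wording
    case 'none':
      return `${count} reproduced` // generic fallback
  }
}
```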
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
- Add rule: assertions must be specific to the bug (not just > 0)
- If no bug-specific assertion possible, verdict must be NOT_REPRODUCIBLE
- Copy reproduce.spec.ts to deployed report for transparency
- Addresses #10307 false REPRODUCED (test only asserted node count > 0)
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
Root cause: Cloudflare Pages serves stale deployments when multiple
deploys race to the same branch. The ANALYZING placeholder deployed
seconds before the final report would sometimes 'win' the race.
Fix: Only deploy once — the final report with the real badge. No more
intermediate PREPARING or ANALYZING placeholders.
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
- Add explicit instruction to call done() immediately after test passes
- Inject warning message in runTest response when test passes
- Ban page.waitForTimeout() in system prompt (use retrying assertions)
- Instruct agent to write ONE focused test, not multiple
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
- Add readFixture and readTest tools to qa-agent for fixture API discovery
- Enrich system prompt with comprehensive ComfyPage fixture API reference
- Switch CI to ANTHROPIC_API_KEY_QA secret
- Fix all TS errors in qa-agent.ts, qa-record.ts, qa-reproduce.ts
- Better error handling for API credit exhaustion
- Rewrite SKILL.md to reflect three-phase pipeline
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
Claude sometimes keeps iterating after a test passes, exhausting
the time budget without calling done(). Now, when runTest() returns
TEST PASSED, the test code is saved. If the agent loop then ends without
done(), the verdict is auto-set to REPRODUCED using the passing test.
Fixes #8532 (17 calls, test passed twice, but INCONCLUSIVE verdict).
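A sketch of the fallback, under the assumption that the loop tracks the done() verdict and the last passing test in one state object. LoopState and resolveVerdict are hypothetical names, and defaulting to INCONCLUSIVE when nothing passed is an assumption:

```typescript
type Verdict = 'REPRODUCED' | 'NOT_REPRODUCIBLE' | 'INCONCLUSIVE'

interface LoopState {
  doneVerdict: Verdict | null // set when the agent calls done()
  lastPassingTest: string | null // saved when runTest() reports TEST PASSED
}

// Decide the final verdict once the agent loop has ended.
function resolveVerdict(state: LoopState): { verdict: Verdict; testCode: string | null } {
  // An explicit done() always wins.
  if (state.doneVerdict !== null) {
    return { verdict: state.doneVerdict, testCode: state.lastPassingTest }
  }
  // No done(), but a test passed: treat the passing test as proof.
  if (state.lastPassingTest !== null) {
    return { verdict: 'REPRODUCED', testCode: state.lastPassingTest }
  }
  // Assumed default when nothing was proven.
  return { verdict: 'INCONCLUSIVE', testCode: null }
}
```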
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added || true to all grep/sed pipe chains that could exit non-zero:
- grep on pr-context.txt (line 149)
- sed/grep on pr-context description (lines 158-159)
- grep -oiP on RISK_FIRST (line 331)
- wrangler deploy | grep URL (line 355)
All tested under set -euo pipefail with empty inputs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sed on video-reviews/*.md returns exit code 2 when no files match,
killing the script under set -euo pipefail. Added || true to all
potentially empty glob pipelines.
Affects 13/20 QA runs that had successful research but failed deploy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test uses comfyPageFixture, not bare page. Also match
firstNode await calls for node interaction pauses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Regex inserts await page.waitForTimeout(800) before every
comfyPage/topbar/page/canvas/expect await call in the Phase 2
test code. Adds ~5-8s to a 10-step test (negligible vs 10min research).
Default playback changed to 0.5x (was 0.25x) since pauses provide
natural breathing room. A 15s video at 0.5x = 30s viewing.
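The injection could be sketched as below. The regex is an approximation of the one described, not the actual pattern; it assumes each await call starts its own line and that a page identifier is in scope in the test. injectPauses and viewingSeconds are hypothetical names:

```typescript
// Insert a pause line before each `await comfyPage|topbar|page|canvas|expect` call.
function injectPauses(testCode: string, pauseMs = 800): string {
  const awaitCall = /^([ \t]*)(await\s+(?:comfyPage|topbar|page|canvas|expect)\b)/gm
  return testCode.replace(
    awaitCall,
    (_match: string, indent: string, call: string) =>
      `${indent}await page.waitForTimeout(${pauseMs})\n${indent}${call}`
  )
}

// Wall-clock viewing time for a clip played at a given rate.
function viewingSeconds(durationSeconds: number, playbackRate: number): number {
  return durationSeconds / playbackRate
}
```

With 10 matched calls the added wait is 10 x 0.8 s = 8 s, the upper end of the estimate above, and viewingSeconds(15, 0.5) gives the 30 s quoted for a 15 s video.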
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Report player defaults to 0.25x speed (was 0.5x) — 5s test videos
play in 20s, much more watchable
- Phase 2 injects cursor overlay via addInitScript into the test code
before running — white SVG arrow follows mousemove events
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playwright config only records video when PLAYWRIGHT_LOCAL is set.
In CI, this env var was missing so Phase 2 produced no video.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After context.close(), renameLatestWebm would overwrite the Phase 2
test execution video with the idle research browser recording.
Now skips the rename if qa-session.webm already exists from Phase 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old video showed an idle screen (research browser doing nothing).
Now Phase 2 runs the test with --video=on from browser_tests/tests/,
finds the recorded .webm, and copies it to qa-session.webm where
the deploy script expects it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Copy test to browser_tests/tests/ where Playwright config expects it
- System prompt teaches Claude the project's test fixtures:
comfyPageFixture, comfyPage.menu.topbar, comfyPage.workflow, etc.
- Increased time budget to 10 min for write→run→fix iterations
- Increased max turns to 50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also search qa-artifacts/before/*/research/ for the research log
since artifacts are downloaded with that nested structure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1: Claude reads issue + a11y tree, writes a Playwright .spec.ts
test that asserts the bug exists. Runs the test, reads errors, iterates
until the test passes (proving the bug) or determines NOT_REPRODUCIBLE.
Phase 2: Run the passing test with --video=on for clean recording.
This replaces interactive browser driving with deterministic test code.
Claude Sonnet 4.6 excels at writing Playwright tests — much more
reliable than real-time browser interaction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
page.accessibility.snapshot() was removed in Playwright 1.49+.
Use page.locator('body').ariaSnapshot() which returns a text
representation of the accessibility tree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Copy research-log.json to deploy dir (accessible at /research-log.json)
- Read verdict from research log first (a11y-verified ground truth)
- Fall back to video review verdict only if no research log exists
- Research log is uploaded as part of QA artifacts
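The precedence above, sketched in TypeScript for illustration (the deploy step itself is a shell script). ResearchLogLike and pickVerdict are hypothetical names, and the real research-log.json has more fields than modeled here:

```typescript
type ReportVerdict = 'REPRODUCED' | 'NOT_REPRODUCIBLE' | 'INCONCLUSIVE'

// Only the field needed for the decision is modeled.
interface ResearchLogLike {
  verdict: ReportVerdict
}

// The research log is ground truth; the video review is consulted only when no log exists.
function pickVerdict(
  researchLog: ResearchLogLike | null,
  videoReviewVerdict: ReportVerdict | null
): ReportVerdict | null {
  if (researchLog !== null) return researchLog.verdict
  return videoReviewVerdict
}
```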
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 (qa-agent.ts): Claude investigates via a11y API only.
- No video, no Gemini vision — only page.accessibility.snapshot()
- Every action logged with a11y before/after state
- done() requires evidence citing inspect() results
- Outputs reproduction plan for Phase 2
Phase 2 (qa-reproduce.ts): Deterministic replay of research plan.
- Executes each step with a11y assertions
- Gemini describes visual changes (narration for humans)
- Clean focused video with subtitles
Phase 3: Report job reads research-log.json for verdict (ground truth),
narration-log.json for descriptions, video for visuals.
Gemini formats logs into report — never determines verdict.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent: MUST use inspect() after every action, verdict based on DOM
state not opinions. "NEVER claim REPRODUCED unless inspect() confirms."
Reviewer: Two-phase prompt — Phase 1 describes what it SEES (blind,
no context). Phase 2 compares observations against issue/PR context.
Anti-hallucination rules: "describe ONLY what you observe, NEVER infer."
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ANTHROPIC_API_KEY is optional: Agent SDK uses Claude Code OAuth
session when running locally (detects CLAUDE_CODE_SSE_PORT).
In CI, ANTHROPIC_API_KEY from secrets is used.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Gemini-only agentic loop had a ~47% success rate, too low to be
useful as a fallback. ANTHROPIC_API_KEY is now required for issue
reproduction; the run fails with a clear error if it is missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addInitScript runs before page load — Vue's app mount destroys the
cursor div when it takes over the DOM. Using addScriptTag after login
ensures the cursor persists in the stable DOM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Locator.click/hover bypasses our page.mouse monkey-patch. Now
clickByText, hoverMenuItem, clickSubmenuItem get the element
bounding box and update cursor overlay manually.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The grep \{"verdict":\s*"[^"]+ captures up to but not including the
closing quote. The second grep for "[A-Z_]+"$ then fails because
there's no closing quote. Fixed: match "verdict":\s*"[A-Z_]+ then
extract [A-Z_]+$ (no quotes needed).
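The same two-step match can be reproduced with JavaScript regexes to show why the old pair failed. This is an illustration only: the real script uses grep, and both function names are hypothetical.

```typescript
// Old behaviour: step 1 stops before the closing quote, so step 2 can never match.
function extractVerdictBroken(text: string): string | null {
  const head = text.match(/\{"verdict":\s*"[^"]+/)
  if (!head) return null
  const value = head[0].match(/"[A-Z_]+"$/) // requires a closing quote that is not there
  return value ? value[0] : null
}

// Fixed behaviour: match up to the value, then take the trailing run of capitals.
function extractVerdictFixed(text: string): string | null {
  const head = text.match(/"verdict":\s*"[A-Z_]+/)
  if (!head) return null
  const value = head[0].match(/[A-Z_]+$/) // no quotes needed
  return value ? value[0] : null
}
```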
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Video review prompt now requests a ## Verdict JSON block:
{"verdict": "REPRODUCED|NOT_REPRODUCIBLE|INCONCLUSIVE", "risk": "low|medium|high"}
- Deploy script reads JSON verdict first, falls back to grep
- Avoids regex-matching false positives whenever the JSON block is present
- Theme: light mode is default, dark via prefers-color-scheme:dark
- Cards use solid backgrounds, grain overlay only in dark mode
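Reading the ## Verdict JSON block described above might look like this in TypeScript; the deploy script is shell, so this only mirrors the logic. parseVerdictBlock is a hypothetical name, and returning null is the assumed signal for the caller to fall back to grep:

```typescript
interface VerdictBlock {
  verdict: string
  risk: string
}

// Find the first {...} object after a "## Verdict" heading and validate its fields.
function parseVerdictBlock(reviewMarkdown: string): VerdictBlock | null {
  const section = reviewMarkdown.split(/^## Verdict\s*$/m)[1]
  if (section === undefined) return null
  const json = section.match(/\{[^{}]*\}/)
  if (!json) return null
  try {
    const parsed: unknown = JSON.parse(json[0])
    if (typeof parsed === 'object' && parsed !== null) {
      const { verdict, risk } = parsed as { verdict?: unknown; risk?: unknown }
      if (typeof verdict === 'string' && typeof risk === 'string') {
        return { verdict, risk }
      }
    }
  } catch {
    // Malformed JSON: let the caller fall back to the grep path.
  }
  return null
}
```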
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add prefers-color-scheme:light media query with light palette.
Replace hardcoded dark oklch values with CSS variables.
Light mode: white surfaces, dark text, subtle borders, no grain overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- loadDefaultWorkflow now calls app.resetToDefaultWorkflow() via JS API
instead of navigating File → Load Default menu (menu item name varies)
- pressKey reverted to instant press() — the 400ms hold via down/up
prevented Escape from propagating to parent dialog (#10397 BEFORE video
showed wrong behavior because hold intercepted the event)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of manually calling moveCursorOverlay in each action,
patch page.mouse.move/click/dblclick/down/up globally. Now EVERY
mouse operation shows the cursor — text clicks, menu hovers, etc.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"could not be confirmed" contains "confirmed" which matched the
positive reproduc|confirm check. Now caught by the negative check first.
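The ordering fix can be sketched as follows. The positive pattern reproduc|confirm is from this commit; the negative pattern list and the mapping of a negative match to NOT_REPRODUCIBLE are assumptions. classifyReview is a hypothetical name:

```typescript
type ReviewVerdict = 'REPRODUCED' | 'NOT_REPRODUCIBLE' | 'INCONCLUSIVE'

// Check negative phrasings before positive ones, because "could not be
// confirmed" also contains the positive keyword "confirmed".
function classifyReview(text: string): ReviewVerdict {
  const lower = text.toLowerCase()
  const negative = /could not be confirmed|not reproduc/ // assumed pattern list
  const positive = /reproduc|confirm/
  if (negative.test(lower)) return 'NOT_REPRODUCIBLE'
  if (positive.test(lower)) return 'REPRODUCED'
  return 'INCONCLUSIVE'
}
```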
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Agent SDK returned "model not found" for claude-sonnet-4-6-20250514.
Correct ID is claude-sonnet-4-6.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playwright's CDP input in headless Chrome doesn't reliably trigger DOM
mousemove events. Now executeAction calls __moveCursor(x,y) directly after
every mouse.move/click/drag. Cursor is an SVG arrow (white + outline).
Click state shown via scale animation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pressKey now uses keyboard.down/up with 400ms hold instead of
instant press(). Shows subtitle "⌨ Escape" and the keyboard HUD
catches the held state for video frame capture.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Agent system prompt now instructs Claude to demonstrate BOTH working
(control) and broken (test) states when bug is triggered by a setting
- Added docs/qa/backlog.md with future improvements: Type B/C comparisons,
TTS, pre-seeding, cost optimization, environment-dependent issues
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Shows "QA @ abc1234" linking to the pipeline code commit
- Shows start time → deploy time in header
- Helps trace which version of QA scripts generated each report
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Architecture:
- Claude Sonnet 4.6 plans and reasons (via Claude Agent SDK)
- Gemini 2.5 Flash watches video buffer and describes what it sees
- 4 tools: observe(), inspect(), perform(), done()
observe(seconds, focus): builds video clip from screenshot buffer,
sends to Gemini with Claude's focused question.
inspect(selector): searches a11y tree for specific element state.
perform(action, params): executes Playwright action.
done(verdict, summary): signals completion.
Falls back to Gemini-only loop if ANTHROPIC_API_KEY not set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Agent reasoning shown as subtitle bar at bottom of video during recording
- After recording, generates TTS audio via OpenAI API (tts-1, nova voice)
- Merges audio clips at correct timestamps into the video with ffmpeg
- Requires OPENAI_API_KEY env var; gracefully skips if not set
- No-sandbox + disable-dev-shm-usage for headless Chrome compatibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Download QA guide artifact in report job
- Extract prerequisites, test focus, and steps from guide JSON
- Display below the purpose description: focus → prerequisites → steps
- Separated by a subtle divider with smaller font
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Report site shows "PR #N aims to..." or "Issue #N reports..." block
above the video cards, extracted from pr-context.txt
- Multi-pass video links fall back to pass1 when qa-{os}.mp4 is 404
- More negative verdict patterns: "does not demonstrate", "never tested"
- Risk uses first word of Overall Risk (avoids "high confidence" match)
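The first-word rule for risk might be sketched like this. It assumes the input is the text following "Overall Risk:" and that valid levels are low, medium, and high; parseRisk is a hypothetical name:

```typescript
type Risk = 'low' | 'medium' | 'high'

// Use only the first word so later phrases such as "high confidence"
// cannot be mistaken for the risk level.
function parseRisk(overallRiskText: string): Risk | null {
  const firstWord = overallRiskText
    .trim()
    .split(/\s+/)[0]
    ?.toLowerCase()
    .replace(/[^a-z]/g, '') // strip punctuation or markdown emphasis
  if (firstWord === 'low' || firstWord === 'medium' || firstWord === 'high') {
    return firstWord
  }
  return null
}
```

A whole-line search for "high" would misread "Low (high confidence in the fix)" as high risk; taking the first word returns low.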
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>