Commit Graph

7603 Commits

Author SHA1 Message Date
snomiao
04bf4cbe95 feat: badge shows reproduction method (E2E test / video / both)
- Add reproducedBy field to ResearchResult and done() tool
- Agent reports how bug was proven: e2e_test, video, both, or none
- Badge shows '1 via E2E test' instead of generic '1 reproduced'
- Deploy script reads reproducedBy from research-log.json
- Test code (reproduce.spec.ts) now deployed to report page

Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 13:45:21 +00:00
snomiao
705d10bc8d fix: prevent trivial assertions from false REPRODUCED + deploy test code
- Add rule: assertions must be specific to the bug (not just > 0)
- If no bug-specific assertion possible, verdict must be NOT_REPRODUCIBLE
- Copy reproduce.spec.ts to deployed report for transparency
- Addresses #10307 false REPRODUCED (test only asserted node count > 0)

Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 13:35:38 +00:00
snomiao
765b5dbf1d fix: remove oversized files before CF Pages deploy (25MB limit)
Large mp4 videos (37-76MB) caused wrangler to silently fail the deploy,
leaving stale ANALYZING badges. Now strips files >25MB before deploying.

Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 11:12:57 +00:00
snomiao
8ac9a9843b fix: remove PREPARING/ANALYZING intermediate badge deploys
Root cause: Cloudflare Pages serves stale deployments when multiple
deploys race to the same branch. The ANALYZING placeholder deployed
seconds before the final report would sometimes 'win' the race.

Fix: Only deploy once — the final report with the real badge. No more
intermediate PREPARING or ANALYZING placeholders.

Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 10:37:19 +00:00
snomiao
3351f79c27 fix: log wrangler output and always use fallback deploy URL
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 10:01:02 +00:00
snomiao
64c4bedc0d fix: stop iterating after test passes, ban waitForTimeout in QA tests
- Add explicit instruction to call done() immediately after test passes
- Inject warning message in runTest response when test passes
- Ban page.waitForTimeout() in system prompt (use retrying assertions)
- Instruct agent to write ONE focused test, not multiple

Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 08:19:53 +00:00
snomiao
854f1c7da0 feat: readFixture/readTest tools, ANTHROPIC_API_KEY_QA, fix TS errors
- Add readFixture and readTest tools to qa-agent for fixture API discovery
- Enrich system prompt with comprehensive ComfyPage fixture API reference
- Switch CI to ANTHROPIC_API_KEY_QA secret
- Fix all TS errors in qa-agent.ts, qa-record.ts, qa-reproduce.ts
- Better error handling for API credit exhaustion
- Rewrite SKILL.md to reflect three-phase pipeline

Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
2026-04-01 06:44:34 +00:00
snomiao
ed96aaafc6 fix: auto-complete when test passes but agent doesn't call done()
Claude sometimes keeps iterating after a test passes, exhausting
the time budget without calling done(). Now: when runTest() returns
TEST PASSED, the test code is saved. If the agent loop ends without
done(), auto-sets verdict=REPRODUCED with the passing test.

Fixes #8532 (17 calls, test passed twice, but INCONCLUSIVE verdict).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 13:25:07 +00:00
snomiao
4ba526b859 fix: remaining pipefail crashes in deploy script
Added || true to all grep/sed pipe chains that could exit non-zero:
- grep on pr-context.txt (line 149)
- sed/grep on pr-context description (line 158-159)
- grep -oiP on RISK_FIRST (line 331)
- wrangler deploy | grep URL (line 355)

All tested under set -euo pipefail with empty inputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 12:23:32 +00:00
snomiao
a3dd897823 fix: pipefail crash in deploy script with empty video-reviews
sed on video-reviews/*.md returns exit code 2 when no files match,
killing the script under set -euo pipefail. Added || true to all
potentially empty glob pipelines.

Affects 13/20 QA runs that had successful research but failed deploy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:08 +00:00
snomiao
979124109a fix: use comfyPage.page.waitForTimeout for delay injection
The test uses comfyPageFixture, not bare page. Also match
firstNode await calls for node interaction pauses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:08 +00:00
snomiao
b0f8e69dc2 feat: inject 800ms pauses between test actions for readable videos
Regex inserts await page.waitForTimeout(800) before every
comfyPage/topbar/page/canvas/expect await call in the Phase 2
test code. Adds ~5-8s to a 10-step test (negligible vs 10min research).

Default playback changed to 0.5x (was 0.25x) since pauses provide
natural breathing room. A 15s video at 0.5x = 30s viewing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:08 +00:00
snomiao
377cbf775f feat: default playback 0.25x + cursor overlay in E2E test videos
- Report player defaults to 0.25x speed (was 0.5x) — 5s test videos
  play in 20s, much more watchable
- Phase 2 injects cursor overlay via addInitScript into the test code
  before running — white SVG arrow follows mousemove events

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:08 +00:00
snomiao
4e261fe5c6 fix: set PLAYWRIGHT_LOCAL=1 for Phase 2 to enable video recording
Playwright config only records video when PLAYWRIGHT_LOCAL is set.
In CI, this env var was missing so Phase 2 produced no video.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:08 +00:00
snomiao
629dfc9d37 fix: don't overwrite Phase 2 test video with idle research video
After context.close(), renameLatestWebm would overwrite the Phase 2
test execution video with the idle research browser recording.
Now skips the rename if qa-session.webm already exists from Phase 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
8752276945 fix: Phase 2 records test execution video, copies to qa-session.webm
The old video showed an idle screen (research browser doing nothing).
Now Phase 2 runs the test with --video=on from browser_tests/tests/,
finds the recorded .webm, and copies it to qa-session.webm where
the deploy script expects it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
4d6fe47e01 fix: runTest uses project Playwright config + fixtures
- Copy test to browser_tests/tests/ where Playwright config expects it
- System prompt teaches Claude the project's test fixtures:
  comfyPageFixture, comfyPage.menu.topbar, comfyPage.workflow, etc.
- Increased time budget to 10 min for write→run→fix iterations
- Increased max turns to 50

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
4de56e2696 fix: broaden research-log.json search paths in deploy script
Also search qa-artifacts/before/*/research/ for the research log
since artifacts are downloaded with that nested structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
e037244d49 feat: Claude writes E2E tests to reproduce bugs instead of driving browser
Phase 1: Claude reads issue + a11y tree, writes a Playwright .spec.ts
test that asserts the bug exists. Runs the test, reads errors, iterates
until the test passes (proving the bug) or determines NOT_REPRODUCIBLE.

Phase 2: Run the passing test with --video=on for clean recording.

This replaces interactive browser driving with deterministic test code.
Claude Sonnet 4.6 excels at writing Playwright tests — much more
reliable than real-time browser interaction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
e8ce53c92b fix: use ariaSnapshot() instead of removed page.accessibility API
page.accessibility.snapshot() was removed in Playwright 1.49+.
Use page.locator('body').ariaSnapshot() which returns a text
representation of the accessibility tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
30f4976621 feat: deploy research-log.json + use it as primary verdict source
- Copy research-log.json to deploy dir (accessible at /research-log.json)
- Read verdict from research log first (a11y-verified ground truth)
- Fall back to video review verdict only if no research log exists
- Research log is uploaded as part of QA artifacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
b156ecd493 feat: three-phase QA pipeline — Research → Reproduce → Report
Phase 1 (qa-agent.ts): Claude investigates via a11y API only.
  - No video, no Gemini vision — only page.accessibility.snapshot()
  - Every action logged with a11y before/after state
  - done() requires evidence citing inspect() results
  - Outputs reproduction plan for Phase 2

Phase 2 (qa-reproduce.ts): Deterministic replay of research plan.
  - Executes each step with a11y assertions
  - Gemini describes visual changes (narration for humans)
  - Clean focused video with subtitles

Phase 3: Report job reads research-log.json for verdict (ground truth),
  narration-log.json for descriptions, video for visuals.
  Gemini formats logs into report — never determines verdict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
5541c9aaea fix: prevent AI lies — assertion-based verdicts + blind reviewer
Agent: MUST use inspect() after every action, verdict based on DOM
state not opinions. "NEVER claim REPRODUCED unless inspect() confirms."

Reviewer: Two-phase prompt — Phase 1 describes what it SEES (blind,
no context). Phase 2 compares observations against issue/PR context.
Anti-hallucination rules: "describe ONLY what you observe, NEVER infer."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
ec29d5a828 feat: Agent SDK auto-detects Claude Code session — no API key needed locally
ANTHROPIC_API_KEY is optional: Agent SDK uses Claude Code OAuth
session when running locally (detects CLAUDE_CODE_SSE_PORT).
In CI, ANTHROPIC_API_KEY from secrets is used.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
a96f836a00 refactor: require ANTHROPIC_API_KEY, remove Gemini-only fallback
The Gemini-only agentic loop had ~47% success rate — too low to be
useful as a fallback. Now ANTHROPIC_API_KEY is required for issue
reproduction. Fails clearly if missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
80db4a1ac6 fix: inject cursor overlay via addScriptTag after login, not addInitScript
addInitScript runs before page load — Vue's app mount destroys the
cursor div when it takes over the DOM. Using addScriptTag after login
ensures the cursor persists in the stable DOM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
9db6463b96 fix: cursor overlay on locator clicks (clickByText, menu items)
Locator.click/hover bypasses our page.mouse monkey-patch. Now
clickByText, hoverMenuItem, clickSubmenuItem get the element
bounding box and update cursor overlay manually.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
6761c391b7 fix: verdict JSON grep pattern — capture value without closing quote
The grep \{"verdict":\s*"[^"]+ captures up to but not including the
closing quote. The second grep for "[A-Z_]+"$ then fails because
there's no closing quote. Fixed: match "verdict":\s*"[A-Z_]+ then
extract [A-Z_]+$ (no quotes needed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
bfc0931260 fix: grep pipefail crash + add QA troubleshooting doc
- Add || true to all grep pipelines in deploy script (grep returns 1
  on no match, pipefail kills script)
- Add docs/qa/TROUBLESHOOTING.md covering all failures encountered:
  __name errors, zod/v4 imports, model IDs, badge mismatches, cursor,
  loadDefaultWorkflow, pressKey timing, agent behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
be92dee046 feat: structured JSON verdict from AI reviewer, light-first theme
- Video review prompt now requests a ## Verdict JSON block:
  {"verdict": "REPRODUCED|NOT_REPRODUCIBLE|INCONCLUSIVE", "risk": "low|medium|high"}
- Deploy script reads JSON verdict first, falls back to grep
- Eliminates all regex-matching false positives permanently
- Theme: light mode is default, dark via prefers-color-scheme:dark
- Cards use solid backgrounds, grain overlay only in dark mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
6e60706656 feat: report site follows system light/dark theme
Add prefers-color-scheme:light media query with light palette.
Replace hardcoded dark oklch values with CSS variables.
Light mode: white surfaces, dark text, subtle borders, no grain overlay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
40b5b0ca4a fix: loadDefaultWorkflow uses API instead of menu, pressKey uses instant press
- loadDefaultWorkflow now calls app.resetToDefaultWorkflow() via JS API
  instead of navigating File → Load Default menu (menu item name varies)
- pressKey reverted to instant press() — the 400ms hold via down/up
  prevented Escape from propagating to parent dialog (#10397 BEFORE video
  showed wrong behavior because hold intercepted the event)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
0a9ca0c0ba fix: monkey-patch page.mouse for universal cursor overlay
Instead of manually calling moveCursorOverlay in each action,
patch page.mouse.move/click/dblclick/down/up globally. Now EVERY
mouse operation shows the cursor — text clicks, menu hovers, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
159fdee471 fix: use gemini-3-flash-preview in hybrid agent (not 2.5 preview)
Gemini 2.5 preview models return 404. Always use gemini-3+ models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
eac491c6b1 fix: add 'could not be confirmed' to negative verdict patterns
"could not be confirmed" contains "confirmed" which matched the
positive reproduc|confirm check. Now caught by the negative check first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
4bfb059696 fix: correct Claude model ID — claude-sonnet-4-6 (not dated suffix)
The Agent SDK returned "model not found" for claude-sonnet-4-6-20250514.
Correct ID is claude-sonnet-4-6.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
99343cdef8 fix: cursor overlay now controlled via __moveCursor, not DOM events
Headless Chrome's Playwright CDP doesn't trigger DOM mousemove events
reliably. Now executeAction calls __moveCursor(x,y) directly after
every mouse.move/click/drag. Cursor is an SVG arrow (white + outline).
Click state shown via scale animation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
17776a850f feat: badge label includes QA date — #10397 QA0327
Shows when the QA was run so stale results are obvious at a glance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
00dc89dad7 fix: make key presses visible in video — hold + subtitle
pressKey now uses keyboard.down/up with 400ms hold instead of
instant press(). Shows subtitle "⌨ Escape" and the keyboard HUD
catches the held state for video frame capture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
0984ec2706 fix: use zod instead of zod/v4 — project zod doesn't export /v4 subpath
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
742216ad18 fix: add claude-agent-sdk to workspace catalog
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:07 +00:00
snomiao
6db13752a0 feat: control/test comparison strategy + QA backlog doc
- Agent system prompt now instructs Claude to demonstrate BOTH working
  (control) and broken (test) states when bug is triggered by a setting
- Added docs/qa/backlog.md with future improvements: Type B/C comparisons,
  TTS, pre-seeding, cost optimization, environment-dependent issues

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
9809d52ac6 feat: all badges use vertical box style
Drop horizontal badges. Universal box badge shows:
  ┌──────────────────┐
  │    #7414 QA       │
  │ ✓ 1 reproduced   │
  │ ⚙ Fix: APPROVED  │  ← only for PRs
  └──────────────────┘

Issues show repro/not-repro/inconclusive rows.
PRs add a fix quality row (APPROVED/MINOR/MAJOR).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
1a25756290 feat: show QA pipeline commit hash + timing on report site
- Shows "QA @ abc1234" linking to the pipeline code commit
- Shows start time → deploy time in header
- Helps trace which version of QA scripts generated each report

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
83204b9a67 feat: hybrid QA agent — Claude Sonnet 4.6 brain + Gemini vision
Architecture:
- Claude Sonnet 4.6 plans and reasons (via Claude Agent SDK)
- Gemini 2.5 Flash watches video buffer and describes what it sees
- 4 tools: observe(), inspect(), perform(), done()

observe(seconds, focus): builds video clip from screenshot buffer,
  sends to Gemini with Claude's focused question.
inspect(selector): searches a11y tree for specific element state.
perform(action, params): executes Playwright action.
done(verdict, summary): signals completion.

Falls back to Gemini-only loop if ANTHROPIC_API_KEY not set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
d78388c893 feat: pass OPENAI_API_KEY to recording step for TTS narration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
eeaedf9854 feat: subtitle overlay + OpenAI TTS narration on reproduce videos
- Agent reasoning shown as subtitle bar at bottom of video during recording
- After recording, generates TTS audio via OpenAI API (tts-1, nova voice)
- Merges audio clips at correct timestamps into the video with ffmpeg
- Requires OPENAI_API_KEY env var; gracefully skips if not set
- No-sandbox + disable-dev-shm-usage for headless Chrome compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
3f1a580d35 feat: show test requirements from QA guide on report site
- Download QA guide artifact in report job
- Extract prerequisites, test focus, and steps from guide JSON
- Display below the purpose description: focus → prerequisites → steps
- Separated by a subtle divider with smaller font

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
6bd6e08195 feat: purpose description on report + multi-pass video link fix
- Report site shows "PR #N aims to..." or "Issue #N reports..." block
  above the video cards, extracted from pr-context.txt
- Multi-pass video links fall back to pass1 when qa-{os}.mp4 is 404
- More negative verdict patterns: "does not demonstrate", "never tested"
- Risk uses first word of Overall Risk (avoids "high confidence" match)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00
snomiao
8994f74b2f fix: verdict detection — more negative patterns, risk uses first word
- Add "does not demonstrate", "steps were not performed", "never tested"
  to NOT_REPRO patterns (fixes #9101 false positive)
- Risk detection uses first word of Overall Risk section instead of
  grepping entire text (fixes "high confidence" matching HIGH)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 07:54:06 +00:00