The QA backend occasionally retains a stale playwright-test-* user that is
not returned by /api/users, causing POST /api/users to collide with
Duplicate username. Recover by re-querying, then falling back to a unique
suffixed username if the stale user still cannot be resolved.
- Prompt agent to reason about hidden preconditions before writing tests
(e.g. z-index bugs need crowded canvas, not empty default workflow)
- Fetch issue comments with reproduction hints (repro/step/workaround)
- Better error analysis: different strategy on retry, not same code
- Both CI workflow and pnpm qa CLI fetch comments
When Opus API call fails (credit balance, rate limit), keep Sonnet's
result instead of overwriting with INCONCLUSIVE API error. Only use
Opus result if it's actually better than Sonnet's attempt.
Sonnet tries first. If INCONCLUSIVE, automatically retries with
claude-opus-4-6 (30 turns). Disable with QA_OPUS_ESCALATION=0.
Also: model param added to ResearchOptions for flexibility.
- B1: Fix concurrency group to use ref_name (parallel sno-qa-* branches)
- D1: Auto-trigger QA on 'Potential Bug' and 'verified bug' labels
- A4: Prompt agent to read existing tests first before writing
- Turn budget enforcement from previous commit
- Bump @anthropic-ai/claude-agent-sdk to `latest` (0.2.109) — stay on
newest SDK by default
- advisorModel=claude-opus-4-6 — Sonnet handles the loop, Opus is
consulted server-side on hard judgments without paying Opus rates
every turn
- permissions.deny: Bash/Agent/Write/Edit/NotebookEdit — prior run went
off-rails into sqlite3 against the user's ComfyUI DB; read-only
exploration (Read/Grep/Glob) stays allowed
- loadWorkflow accepts URL or JSON (agents kept passing {url:...} into
the old {json:...} schema)
- Strip npm/pnpm/node warnings from Playwright output so real assertion
errors surface instead of being truncated
- Force done(NOT_REPRODUCIBLE) after 5 consecutive test failures +
auto-complete when the loop ends with failures but no done() call
- Ignore ephemeral browser_tests/tests/qa-reproduce.spec.ts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add reproducedBy field to ResearchResult and done() tool
- Agent reports how bug was proven: e2e_test, video, both, or none
- Badge shows '1 via E2E test' instead of generic '1 reproduced'
- Deploy script reads reproducedBy from research-log.json
- Test code (reproduce.spec.ts) now deployed to report page
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
- Add rule: assertions must be specific to the bug (not just > 0)
- If no bug-specific assertion possible, verdict must be NOT_REPRODUCIBLE
- Copy reproduce.spec.ts to deployed report for transparency
- Addresses #10307 false REPRODUCED (test only asserted node count > 0)
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
Root cause: Cloudflare Pages serves stale deployments when multiple
deploys race to the same branch. The ANALYZING placeholder deployed
seconds before the final report would sometimes 'win' the race.
Fix: Only deploy once — the final report with the real badge. No more
intermediate PREPARING or ANALYZING placeholders.
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
- Add explicit instruction to call done() immediately after test passes
- Inject warning message in runTest response when test passes
- Ban page.waitForTimeout() in system prompt (use retrying assertions)
- Instruct agent to write ONE focused test, not multiple
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
- Add readFixture and readTest tools to qa-agent for fixture API discovery
- Enrich system prompt with comprehensive ComfyPage fixture API reference
- Switch CI to ANTHROPIC_API_KEY_QA secret
- Fix all TS errors in qa-agent.ts, qa-record.ts, qa-reproduce.ts
- Better error handling for API credit exhaustion
- Rewrite SKILL.md to reflect three-phase pipeline
Amp-Thread-ID: https://ampcode.com/threads/T-019d4786-eb5f-7115-a10e-5b086c921800
Co-authored-by: Amp <amp@ampcode.com>
Claude sometimes keeps iterating after a test passes, exhausting
the time budget without calling done(). Now: when runTest() returns
TEST PASSED, the test code is saved. If the agent loop ends without
done(), auto-sets verdict=REPRODUCED with the passing test.
Fixes#8532 (17 calls, test passed twice, but INCONCLUSIVE verdict).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added || true to all grep/sed pipe chains that could exit non-zero:
- grep on pr-context.txt (line 149)
- sed/grep on pr-context description (line 158-159)
- grep -oiP on RISK_FIRST (line 331)
- wrangler deploy | grep URL (line 355)
All tested under set -euo pipefail with empty inputs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sed on video-reviews/*.md returns exit code 2 when no files match,
killing the script under set -euo pipefail. Added || true to all
potentially empty glob pipelines.
Affects 13/20 QA runs that had successful research but failed deploy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test uses comfyPageFixture, not bare page. Also match
firstNode await calls for node interaction pauses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Regex inserts await page.waitForTimeout(800) before every
comfyPage/topbar/page/canvas/expect await call in the Phase 2
test code. Adds ~5-8s to a 10-step test (negligible vs 10min research).
Default playback changed to 0.5x (was 0.25x) since pauses provide
natural breathing room. A 15s video at 0.5x = 30s viewing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Report player defaults to 0.25x speed (was 0.5x) — 5s test videos
play in 20s, much more watchable
- Phase 2 injects cursor overlay via addInitScript into the test code
before running — white SVG arrow follows mousemove events
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playwright config only records video when PLAYWRIGHT_LOCAL is set.
In CI, this env var was missing so Phase 2 produced no video.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After context.close(), renameLatestWebm would overwrite the Phase 2
test execution video with the idle research browser recording.
Now skips the rename if qa-session.webm already exists from Phase 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old video showed an idle screen (research browser doing nothing).
Now Phase 2 runs the test with --video=on from browser_tests/tests/,
finds the recorded .webm, and copies it to qa-session.webm where
the deploy script expects it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Copy test to browser_tests/tests/ where Playwright config expects it
- System prompt teaches Claude the project's test fixtures:
comfyPageFixture, comfyPage.menu.topbar, comfyPage.workflow, etc.
- Increased time budget to 10 min for write→run→fix iterations
- Increased max turns to 50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also search qa-artifacts/before/*/research/ for the research log
since artifacts are downloaded with that nested structure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1: Claude reads issue + a11y tree, writes a Playwright .spec.ts
test that asserts the bug exists. Runs the test, reads errors, iterates
until the test passes (proving the bug) or determines NOT_REPRODUCIBLE.
Phase 2: Run the passing test with --video=on for clean recording.
This replaces interactive browser driving with deterministic test code.
Claude Sonnet 4.6 excels at writing Playwright tests — much more
reliable than real-time browser interaction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
page.accessibility.snapshot() was removed in Playwright 1.49+.
Use page.locator('body').ariaSnapshot() which returns a text
representation of the accessibility tree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Copy research-log.json to deploy dir (accessible at /research-log.json)
- Read verdict from research log first (a11y-verified ground truth)
- Fall back to video review verdict only if no research log exists
- Research log is uploaded as part of QA artifacts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 (qa-agent.ts): Claude investigates via a11y API only.
- No video, no Gemini vision — only page.accessibility.snapshot()
- Every action logged with a11y before/after state
- done() requires evidence citing inspect() results
- Outputs reproduction plan for Phase 2
Phase 2 (qa-reproduce.ts): Deterministic replay of research plan.
- Executes each step with a11y assertions
- Gemini describes visual changes (narration for humans)
- Clean focused video with subtitles
Phase 3: Report job reads research-log.json for verdict (ground truth),
narration-log.json for descriptions, video for visuals.
Gemini formats logs into report — never determines verdict.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent: MUST use inspect() after every action, verdict based on DOM
state not opinions. "NEVER claim REPRODUCED unless inspect() confirms."
Reviewer: Two-phase prompt — Phase 1 describes what it SEES (blind,
no context). Phase 2 compares observations against issue/PR context.
Anti-hallucination rules: "describe ONLY what you observe, NEVER infer."
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ANTHROPIC_API_KEY is optional: Agent SDK uses Claude Code OAuth
session when running locally (detects CLAUDE_CODE_SSE_PORT).
In CI, ANTHROPIC_API_KEY from secrets is used.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Gemini-only agentic loop had ~47% success rate — too low to be
useful as a fallback. Now ANTHROPIC_API_KEY is required for issue
reproduction. Fails clearly if missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addInitScript runs before page load — Vue's app mount destroys the
cursor div when it takes over the DOM. Using addScriptTag after login
ensures the cursor persists in the stable DOM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Locator.click/hover bypasses our page.mouse monkey-patch. Now
clickByText, hoverMenuItem, clickSubmenuItem get the element
bounding box and update cursor overlay manually.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The grep \{"verdict":\s*"[^"]+ captures up to but not including the
closing quote. The second grep for "[A-Z_]+"$ then fails because
there's no closing quote. Fixed: match "verdict":\s*"[A-Z_]+ then
extract [A-Z_]+$ (no quotes needed).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Video review prompt now requests a ## Verdict JSON block:
{"verdict": "REPRODUCED|NOT_REPRODUCIBLE|INCONCLUSIVE", "risk": "low|medium|high"}
- Deploy script reads JSON verdict first, falls back to grep
- Eliminates all regex-matching false positives permanently
- Theme: light mode is default, dark via prefers-color-scheme:dark
- Cards use solid backgrounds, grain overlay only in dark mode
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add prefers-color-scheme:light media query with light palette.
Replace hardcoded dark oklch values with CSS variables.
Light mode: white surfaces, dark text, subtle borders, no grain overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- loadDefaultWorkflow now calls app.resetToDefaultWorkflow() via JS API
instead of navigating File → Load Default menu (menu item name varies)
- pressKey reverted to instant press() — the 400ms hold via down/up
prevented Escape from propagating to parent dialog (#10397 BEFORE video
showed wrong behavior because hold intercepted the event)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of manually calling moveCursorOverlay in each action,
patch page.mouse.move/click/dblclick/down/up globally. Now EVERY
mouse operation shows the cursor — text clicks, menu hovers, etc.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"could not be confirmed" contains "confirmed" which matched the
positive reproduc|confirm check. Now caught by the negative check first.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Agent SDK returned "model not found" for claude-sonnet-4-6-20250514.
Correct ID is claude-sonnet-4-6.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Headless Chrome's Playwright CDP doesn't trigger DOM mousemove events
reliably. Now executeAction calls __moveCursor(x,y) directly after
every mouse.move/click/drag. Cursor is an SVG arrow (white + outline).
Click state shown via scale animation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pressKey now uses keyboard.down/up with 400ms hold instead of
instant press(). Shows subtitle "⌨ Escape" and the keyboard HUD
catches the held state for video frame capture.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Agent system prompt now instructs Claude to demonstrate BOTH working
(control) and broken (test) states when bug is triggered by a setting
- Added docs/qa/backlog.md with future improvements: Type B/C comparisons,
TTS, pre-seeding, cost optimization, environment-dependent issues
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Shows "QA @ abc1234" linking to the pipeline code commit
- Shows start time → deploy time in header
- Helps trace which version of QA scripts generated each report
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Architecture:
- Claude Sonnet 4.6 plans and reasons (via Claude Agent SDK)
- Gemini 2.5 Flash watches video buffer and describes what it sees
- 4 tools: observe(), inspect(), perform(), done()
observe(seconds, focus): builds video clip from screenshot buffer,
sends to Gemini with Claude's focused question.
inspect(selector): searches a11y tree for specific element state.
perform(action, params): executes Playwright action.
done(verdict, summary): signals completion.
Falls back to Gemini-only loop if ANTHROPIC_API_KEY not set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Agent reasoning shown as subtitle bar at bottom of video during recording
- After recording, generates TTS audio via OpenAI API (tts-1, nova voice)
- Merges audio clips at correct timestamps into the video with ffmpeg
- Requires OPENAI_API_KEY env var; gracefully skips if not set
- No-sandbox + disable-dev-shm-usage for headless Chrome compatibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Download QA guide artifact in report job
- Extract prerequisites, test focus, and steps from guide JSON
- Display below the purpose description: focus → prerequisites → steps
- Separated by a subtle divider with smaller font
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Report site shows "PR #N aims to..." or "Issue #N reports..." block
above the video cards, extracted from pr-context.txt
- Multi-pass video links fall back to pass1 when qa-{os}.mp4 is 404
- More negative verdict patterns: "does not demonstrate", "never tested"
- Risk uses first word of Overall Risk (avoids "high confidence" match)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "does not demonstrate", "steps were not performed", "never tested"
to NOT_REPRO patterns (fixes#9101 false positive)
- Risk detection uses first word of Overall Risk section instead of
grepping entire text (fixes "high confidence" matching HIGH)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tsx compiles arrow functions with __name helpers that don't exist in
browser context. Using addScriptTag with plain JS string avoids this.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set<string>() in page.evaluate causes __name ReferenceError in browser.
Use untyped Set() since browser JS doesn't support TS generics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"fails to reproduce" contains "reproduce" — must check negatives first
within each report. Across reports, REPRODUCED still wins (multi-pass).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Injects a persistent overlay in bottom-right corner that displays
currently held keys (e.g. "⌨ Space", "⌨ CTRL+C"). Makes keyboard
interactions visible in the recording for both human and AI reviewers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple report files exist, badge shows "2/3 REPRODUCED" instead
of just "REPRODUCED". Single-pass issues still show plain verdict.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple passes exist and one confirms while another is
inconclusive, the badge should show REPRODUCED. Previously
INCONCLUSIVE was checked first, hiding successful reproductions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The "Clone" context menu item doesn't exist in Nodes 2.0 mode.
Using Ctrl+C/Ctrl+V works in both legacy and Nodes 2.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #10307: preflight clones KSampler node, hint says drag to overlap
- #7414: preflight clicks numeric widget, hint says drag to change value
- #7806: preflight takes baseline screenshot, hint gives exact coords
for holdKeyAndDrag with spacebar
- Hints now reference "Preflight already did X, NOW do Y" pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent was wasting turns re-doing loadDefaultWorkflow and setSetting
that preflight already executed. Now the system prompt includes
"Already Done" section listing preflight actions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Auto-execute prerequisite actions (enable Nodes 2.0, load default
workflow) BEFORE the agentic loop starts. Agent model ignores prompt
hints but preflight guarantees nodes are on canvas.
- Add "fails to reproduce" to NOT REPRODUCIBLE badge patterns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add issues:[labeled] trigger and qa-issue label support
- Resolve github.event.issue.number for issue-triggered runs
- Include issue labels in context (feeds keyword matcher for hints)
- Remove qa-issue label after run completes (same as qa-changes/qa-full)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scan issue context for keywords (clone, copy-paste, spacebar, resize,
sidebar, scroll, middle-click, node shape, Nodes 2.0, etc.) and inject
specific MUST-follow action steps into the agentic system prompt.
Addresses 9 INCONCLUSIVE issues where agent had actions available
but didn't know when to use them.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Annotations now use cyan dashed border + monospace "QA:" prefix
instead of red solid labels that look like UI error messages
- Video review prompts explicitly tell reviewer to ignore QA annotations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change preload=metadata to preload=auto for full video download
- Add _headers file with Accept-Ranges for Cloudflare Pages
- Add custom seekbar (range input + buffer indicator) that works
even without server HTTP range request support
- Seekbar shows buffered progress and allows dragging to any point
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous prompt said "test the specific behavior" which was too vague,
leading to generic UI walkthroughs instead of targeted tests.
New prompt: explicitly instructs to read the diff, trigger the exact
scenario the PR fixes, and avoid generic menu screenshots.
Also added reload action to before/after prompt for state persistence tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix quality badge now reads "## Overall Risk" section only
- Prevents false MAJOR ISSUES from severity labels or negated phrases
- "Low" risk → APPROVED, "High" → MAJOR ISSUES, "Medium" → MINOR ISSUES
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-frontend must run first to install node/pnpm, then rebuild
with PR code. Also re-install sno-skills deps after switching back
so QA scripts' dependencies are available.
Also gitignore .claude/scheduled_tasks.lock.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When triggered via sno-qa-* push, the workflow checks out the PR code
to build its frontend, but this replaces qa-record.ts which only
exists on sno-skills. Fix: build PR frontend, then checkout back to
sno-skills so QA scripts remain available.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SC2231: quote glob expansions in for loop
- SC2002: use sed directly instead of cat | sed
- SC2086: quote variable in echo
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update DefaultThumbnail test to match size-full class change
- Fix shellcheck warnings in qa-batch.sh (SC2001, SC2207)
- Fix shellcheck warnings in qa-deploy-pages.sh (SC2034, SC2235, SC2231, SC2002)
- Add qa-report-template.html to oxfmt ignore (minified, not formattable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- "confirmed" didn't match "confirms"/"reproducible" — use "reproduc|confirm" stem
- "partial" matched unrelated text — require "partially reproduced" specifically
- collectVideoCandidates now finds qa-session-*.mp4 for multi-pass reviews
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Check INCONCLUSIVE before reproduced/confirmed in badge detection
- Exclude markdown headings from reproduced grep match
- Add --pass-label to qa-video-review.ts for unique multi-pass filenames
- Pass pass label from workflow YAML when reviewing numbered sessions
- Collect all pass-specific reports in deploy script HTML
- Add addNode/cloneNode convenience actions to qa-record agent
- Improve strategy hints for visual/rendering bug reproduction
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove push trigger (was for dev testing only)
- Restore concurrency group (was commented out for dev)
- Move misplaced import in qa-analyze-pr.ts to top of file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Strengthen prompt: MUST use openMenu → hoverMenuItem → clickMenuItem
in that order. Previous runs skipped openMenu causing silent failures.
- Add CI Job link to the QA report site header for quick navigation
to the GitHub Actions run that generated the report.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions has a 21000 char limit per expression. The combined
badge setup step exceeded this after adding the dual badge generator.
Split into its own step.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Badge now reads: QA Bot | REPRODUCED | Fix: APPROVED
Not all issues are bugs — could be feature requests too.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reproducing a bug is a successful outcome for the QA bot.
Blue (#2196f3) = bot succeeded. Red = bot found problems with the fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PRs now show one badge with three segments:
QA Bot | Bug: REPRODUCED | Fix: APPROVED
Instead of two separate badges. Uses gen-badge-dual.sh which
renders label + bug status + fix status in one SVG.
Issues still use single two-segment badge:
QA Bot | FINISHED: REPRODUCED
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Inject fake cursor (red dot with click animation) via addInitScript
since headless Chrome doesn't render the system cursor in video
- Add hover-before-click delay to clickByText and canvas clicks
so viewers can see where the cursor moves before clicking
- Add 'annotate' action: shows a floating label at (x,y) for N ms
so AI can draw viewer attention to important UI state changes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: runAgenticLoop never read the QA guide — agent saw
"No issue context provided" for issues. Now reads qaGuideFile,
parses structured fields, and injects into system prompt.
Also: fetch issue body via gh issue view in workflow, increase
budget to 120s/30 turns, add focus reminders, smarter stuck
detection (50px grid normalization + action-type frequency),
reject invalid click targets, add loadDefaultWorkflow and
openSettings convenience actions, strategy hints in prompt.
Fix pre-existing typecheck error in eslint.config.ts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PRs now get two separate badges:
- Bug: REPRODUCED / NOT REPRODUCIBLE / PARTIAL (before branch)
- Fix: APPROVED / MAJOR ISSUES / MINOR ISSUES (after branch)
Issues keep a single badge: FINISHED: REPRODUCED / etc.
Both badge-bug.svg and badge-fix.svg served from the deploy site.
PR comment shows all three: ![badge] ![bug] ![fix]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FINISHED is not standalone — always shows result:
- FINISHED: REPRODUCED / NOT REPRODUCIBLE / PARTIAL (issues)
- FINISHED: APPROVED / MAJOR ISSUES / MINOR ISSUES (PRs)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Badge shows QA pipeline status, deployed at each stage:
- PREPARING (blue) — setting up artifacts
- ANALYZING (orange) — running video review
- Final status with color:
- Issues: REPRODUCED (red) / NOT REPRODUCIBLE (gray) / PARTIAL (yellow)
- PRs: APPROVED (green) / MAJOR ISSUES (red) / MINOR ISSUES (yellow)
Badge served as /badge.svg from the same Cloudflare Pages site.
Included in PR comment as .
Also restore @ts-expect-error for import-x plugin type incompatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace single-shot step generation in reproduce mode with an agentic
loop where Gemini sees the screen after each action and decides what
to do next. For multi-bug issues, decompose into sub-issues and run
separate recording passes.
- Extract executeAction() from executeSteps() for reuse
- Add reload and done action types
- Add captureScreenshotForGemini() (JPEG q50, ~50KB)
- Add runAgenticLoop() with sliding window history (3 screenshots)
- Add decomposeIssue() for multi-pass recording (1-3 sub-issues)
- Update workflow to handle numbered session videos (qa-session-1, etc.)
- Add custom video controls below each video with frame stepping
- Frame back/forward buttons (1 frame at 30fps, 10 frames skip)
- Speed selector: 0.1x, 0.25x, 0.5x (default), 1x, 1.5x, 2x
- Keyboard shortcuts: arrow keys for frame step, space for play/pause
- SMPTE-style timecode display (m:ss.ms)
- Default 0.5x speed since AI operates UI faster than humans
- Videos no longer autoplay (pause on load for inspection)
- Zero external dependencies (pure HTML5 video API)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reproduce video must be max 5 minutes (short, focused demo)
- Phase 4 reuses the environment from Phase 3 (no re-setup)
- Use video-start/video-stop commands (not --save-video flag)
- Start recording right before steps, stop immediately after
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New Claude agent-driven issue reproduction skill that:
- Phase 1-2: Research issue and set up environment (custom nodes, workflows, settings)
- Phase 3: Record research video while exploring interactively via playwright-cli
- Phase 4: Record clean reproduce video with only the minimal repro steps
- Phase 5: Generate structured reproduction report
Key difference from the old approach: Claude agent explores and adapts
instead of blindly executing a Gemini-generated static plan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemini was right-clicking empty canvas instead of nodes because it
didn't know where the default workflow nodes are positioned. Now the
prompt includes approximate coordinates for all 7 default nodes and
clarifies the difference between node context menu vs canvas menu.
Also fixes TS2352 in page.evaluate by using double-cast through unknown.
- Add --target-url CLI option to qa-video-review.ts
- Include target URL in generated markdown reports
- Add clickable issue/PR link in deployed HTML report header
- Workflow passes the target URL automatically
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused @ts-expect-error directives in eslint.config.ts
- Simplify LazyImage prop types from ClassValue to string
- Fix DialogInstance to avoid infinitely deep type instantiation
- Use cn() in DefaultThumbnail for class merging
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fillDialog now tries: PrimeVue dialog → node search box → focused input → keyboard fallback
- clickSubmenuItem now tries: PrimeVue tiered menu → litegraph context menu → role menuitem
- Fixes double-click-to-add-node flow and right-click context menu clicks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap each step in try/catch so failed steps don't abort the recording
- Add 5s timeout to clickByText to prevent 30s hangs on disabled elements
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gh pr view can't distinguish PRs from issues — it succeeds for both.
Use the REST API endpoint repos/{owner}/{repo}/pulls/{number} which
returns 404 for issues.
- Enforce requestTimeoutMs via Gemini SDK requestOptions
- Add 100MB video size check before base64 encoding
- Sanitize screenshot filenames to prevent path traversal
- Sort video files by mtime for reliable rename
- Validate --mode arg against allowed values
- Add Content-Length pre-check in downloadMedia
- Add GitHub domain allowlist for media downloads (SSRF mitigation)
- Add contents:write permission and git config for report job
- Update Node.js requirement in SKILL.md from 18+ to 22+
Instead of crashing the entire recording session when Gemini generates
an invalid key name (e.g. "mouseWheelDown"), catch the error and
continue with remaining steps.
Add a "Behavior Changes" table (Behavior, Before, After, Verdict)
alongside the existing timeline comparison. This gives reviewers a
quick high-level view of all behavioral differences before diving
into the frame-by-frame timeline.
Instruct Gemini to output the Before vs After section as a markdown
table with Time, Type, Severity, Before, After columns for easier
comparison. Update HTML template table styles with fixed layout and
column widths optimized for the 5-column comparison format.
- Log raw Gemini response for debugging when parsing fails
- Handle possible wrapper keys in response
- Make qa-before/qa-after run even if analyze-pr fails (only gate
on resolve-matrix success)
When running on push events for sno-qa-* branches without an open PR,
extract the PR number from the branch name so analyze-pr can fetch
the full PR thread for analysis.
Add Gemini Pro-powered PR analysis that generates targeted QA guides
from the full PR thread (description, comments, screenshots, diff).
The analyze-pr job runs on lightweight ubuntu before recordings start,
producing qa-guide-before.json and qa-guide-after.json that are
downloaded by recording jobs to produce more focused test steps.
Graceful fallback: if analysis fails, recordings proceed without guides.
download-artifact@v7 merges all files flat regardless of
merge-multiple setting. Use separate path dirs (before/after)
and copy all files into the report directory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
download-artifact@v7 defaults merge-multiple to true, which puts all
files flat in qa-artifacts/ instead of per-artifact subdirectories.
The merge step expects qa-artifacts/qa-before-{os}-{run}/ subdirs,
so the report directory never gets created and video review finds
no files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running before/after sequentially in a single job with
fragile git stash/checkout gymnastics, split into two independent
parallel jobs on separate runners:
resolve-matrix → qa-before (main) ─┐
→ qa-after (PR) ─┴→ report
- qa-before: uses git worktree for clean main branch build
- qa-after: normal PR build via setup-frontend
- report: downloads both artifact sets, merges, runs Gemini review
Benefits:
- Clean workspace isolation (no git checkout origin/main -- .)
- ~2x faster (parallel execution)
- Each job gets its own ComfyUI server (no shared state)
- Eliminates entire class of workspace contamination bugs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Pre-seed step creates qa-ci via API, so the "New user" form
shows "already exists" error. Fix by selecting the existing user
from the dropdown first, falling back to a unique username.
The localStorage userId bypass doesn't work because the server
validates user IDs and rejects the simple 'qa-ci' string. Instead,
detect the login page by its input fields and create a user via the
"New user" text input, which is how real users would log in.
Firefox headless doesn't support WebGL, causing "getCanvas: canvas is
null" errors. Switch to Chromium which has full headless WebGL support.
Also fix login flow to wait for async router guard to settle and
create user via text input as fallback.
Add coordinate fallback when .comfy-menu-button-wrapper selector isn't
found, and capture a debug screenshot after login to diagnose what the
page looks like when the editor UI fails to render.
The openComfyMenu was clicking at hardcoded coordinates (20, 67) which
missed the menu button. Now uses .comfy-menu-button-wrapper selector
matching the browser tests. Also fixes menu item hover/click selectors
to use .p-menubar-item-label and .p-tieredmenu-item classes, and adds
a wait for the editor UI to fully load before executing test steps.
The QA recordings were stuck on the user selection screen because CI
has no existing users. Fix by pre-seeding localStorage with userId,
userName, and TutorialCompleted before navigation, plus creating a
qa-ci user via API as a fallback.
The main branch build step was running pnpm install with main's lockfile,
which removed @google/generative-ai from node_modules. Move the reinstall
to after restoring PR files so the QA recording script can find its deps.
git checkout - uses @{-1} which requires a previous branch switch.
Since we use 'git checkout origin/main -- .' (file checkout, not branch
switch), there is no @{-1} ref. Use HEAD to restore from current branch.
Also restore proper concurrency group.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass the comprehensive test plan from .claude/skills/comfy-qa/SKILL.md
to Gemini when generating test steps. This gives Gemini knowledge of all
12 QA categories (canvas, menus, sidebar, settings, etc.) so it picks
the most relevant tests for each PR.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nx build runs typecheck as a prerequisite (via @nx/vite/plugin config).
Use vite build directly for the main branch comparison build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Main branch may have transient TS errors when built with the PR
branch's lockfile. Since we only need the dist for visual comparison,
run nx build directly instead of pnpm build (which includes typecheck).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the unreliable codex exec approach with a Playwright script
(qa-record.ts) that uses Gemini to generate targeted test steps from
the PR diff, then executes them deterministically via Playwright's API.
Key changes:
- New scripts/qa-record.ts: Gemini generates JSON test actions, Playwright
executes them with reliable helper functions (menu nav, dialog fill, etc.)
- Remove codex CLI and playwright-cli dependencies
- Remove 150+ lines of prompt templates from pr-qa.yaml
- Firefox headless with video recording (same approach proven locally)
- Fallback steps if Gemini fails
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Tighten BEFORE prompt to 15s snapshot (show old state only)
- Add qa-generate-test.ts: Gemini-powered Playwright test generator
- New workflow step: generate .spec.ts and push to {branch}-add-qa-test
- Tests assert UIUX behavior (tab names, dirty state, visibility)
- Build both main (dist-before/) and PR (dist/) frontends in focused mode
- Run QA twice: BEFORE on main branch frontend, AFTER on PR branch
- Send both videos to Gemini in one request for comparative analysis
- Side-by-side dashboard layout with Before (main) / After (PR) panels
- Comparative prompt evaluates whether before confirms old behavior
and after proves the fix works
- Falls back to single-video mode when no before video available
moov atom was at end of file (8.6MB offset) — browser had to download
the entire video before seeking. Keyframes were only every 10 seconds.
Add -movflags +faststart (moov before mdat) and -g 60 (keyframe every
2.4s at 25fps) to ffmpeg conversion.
- Remove autoplay/loop so video timeline is seekable
- Skip card generation for platforms without recordings
- Add --pr-context flag to qa-video-review.ts so Gemini evaluates
against PR purpose instead of just describing what happened
- Workflow now builds pr-context.txt from PR title/body/diff
The Codex agent was spending 35s browsing the "Getting Started" template
gallery instead of testing the PR's changes. Pre-seeding this setting
via the ComfyUI API ensures the agent lands directly in the graph editor.
The Codex agent was spending time on login flow, template browsing,
and general smoke testing instead of testing the PR's actual changes.
Changes:
- Add 30-second time budget for video recording
- Move video-start AFTER login and editor verification
- Explicitly prohibit template browsing and sidebar exploration
- Reduce test steps to 3-6 targeted actions
- Restructure prompt with clear Instructions/Rules sections
Replace crude sed-based markdown conversion with client-side
rendering via marked.js CDN. Adds proper table, list, and
code styling for the report section.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the OpenAI GPT-based frame extraction approach (ffmpeg + screenshots)
with Gemini 2.5 Flash's native video understanding. This eliminates false
positives from frame-based analysis (e.g. "black screen = critical bug" during
page transitions) and produces dramatically better QA reviews.
Changes:
- Remove ffmpeg frame extraction, ffprobe duration detection, and all related
logic (~365 lines removed)
- Add @google/generative-ai SDK for native video/mp4 upload to Gemini
- Update CLI: remove --max-frames, --min-interval-seconds, --keep-frames flags
- Update env: OPENAI_API_KEY → GEMINI_API_KEY
- Update workflow: swap API key secret and model in pr-qa.yaml
- Update report: replace "Frames analyzed" with "Video size"
- Add note in prompt that brief black frames during transitions are normal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
playwright-cli doesn't support 'evaluate' command. Instead, instruct
Codex to quickly fill the username input and click Next on user-select
page BEFORE starting video recording, so the video only shows actual
QA testing.
storageState config doesn't work with playwright-cli. Instead, use
evaluate to set Comfy.userId/userName after opening the page, then
navigate back. This skips user-select before video-start so the
recording only shows actual QA testing.
Write a Playwright storageState JSON with Comfy.userId/userName pre-set
so the app loads directly to the graph editor. Saves ~40s per QA run
that was wasted on navigating the user-select page.
The convert step was using find which picked up a 0-byte file from
playwright's videos/ directory instead of the valid qa-session.webm.
Now prefers qa-session.webm explicitly and skips empty files.
Codex was using pnpm dlx instead of the global playwright-cli.
Pre-install chromium in setup step and make prompt explicit about
using the global command directly without pnpm/npx.
Replace claude --print with codex exec for cheaper QA runs.
Uses codex-mini-latest model ($1.50/$6 vs Sonnet $3/$15).
Uses existing OPENAI_API_KEY secret (no new secrets needed).
- Replace saveVideo config (didn't produce video) with explicit
playwright-cli video-start/video-stop commands in QA prompt
- Remove apt-get install ffmpeg step (pre-installed on GH runners)
- Switch video review model from gpt-4o to gpt-4.1-mini
- Enable saveVideo in playwright-cli config for real video recording
- Replace screenshot stitching with webm→mp4 conversion
- Move video review step before deploy so reports are included
- Add GPT video review reports inline on the Cloudflare Pages site
- Each video card now has expandable "GPT Video Review" section
- Set .playwright/cli.config.json with outputDir pointing to screenshots/
- This way bare 'playwright-cli screenshot' auto-saves to the right place
- Create screenshot directory before Claude runs (don't rely on Claude)
- Collect step now searches working directory for stray PNGs
- Simplified prompt: no --filename needed, just 'playwright-cli screenshot'
Screenshots were saved to artifact root but stitch looked in frames/.
Now: prompt tells Claude to save to screenshots/ dir with numbered names,
collect step consolidates PNGs there, stitch step globs from screenshots/.
Removed video-start/video-stop (Claude doesn't use them).
- Add playwright-cli config with outputDir and saveVideo
- Use video-start/video-stop instead of relying on screenshot frames
- Add fallback artifact collection from .playwright-cli/ default dir
- Simplify prompts to focus on video recording workflow
The escaped \$QA_ARTIFACTS in the heredoc produced literal text
'$QA_ARTIFACTS' in the prompt. Claude's Bash tool didn't reliably
expand this env var, so no screenshots or reports were saved.
Remove the escapes so the heredoc expands the variable to the actual
path (e.g. /home/runner/work/_temp/qa-artifacts).
Backtick-wrapped playwright-cli examples in the unquoted heredoc were
being interpreted as bash command substitution, producing empty prompts.
Replace backtick syntax with plain "Run:" prefixed commands.
- Remove all Xvfb/ffmpeg screen recording infrastructure from qa job
(captured blank display since playwright-cli runs headless)
- Add screenshot instructions to QA prompts: Claude saves sequential
frames to $QA_ARTIFACTS/frames/ after every interaction
- Stitch screenshots into video via ffmpeg in report job (2fps)
- Merge video-review job into report job (4 jobs → 3 jobs)
- Unified PR comment with video links + video review in <details> collapse
- Clean up stale QA_VIDEO_REVIEW_COMMENT markers from prior runs
Move extra_server_params input to env var to prevent shell injection
from untrusted input. Replace wait-for-it pip dependency with a
cross-platform curl polling loop.
Add Claude Code skills and a label-triggered QA workflow:
- .claude/skills/comfy-qa/SKILL.md: 12-category QA test plan using
playwright-cli for browser automation
- .github/workflows/pr-qa.yaml: CI workflow triggered by qa-changes
(focused, Linux) or qa-full (3-OS matrix) labels. Records screen via
ffmpeg, runs Claude CLI with playwright-cli, deploys video gallery to
Cloudflare Pages, posts PR comment with GIF thumbnails, and runs
OpenAI vision-based video review
- scripts/qa-video-review.ts: frame extraction + GPT-4o analysis
- scripts/qa-video-review.test.ts: unit tests for video review
- knip.config.ts: resolve knip errors for ingest-types package
sed on video-reviews/*.md returns exit code 2 when no files match,
killing the script under set -euo pipefail. Added || true to all
potentially empty glob pipelines.
Affects 13/20 QA runs that had successful research but failed deploy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test uses comfyPageFixture, not bare page. Also match
firstNode await calls for node interaction pauses.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Regex inserts await page.waitForTimeout(800) before every
comfyPage/topbar/page/canvas/expect await call in the Phase 2
test code. Adds ~5-8s to a 10-step test (negligible vs 10min research).
Default playback changed to 0.5x (was 0.25x) since pauses provide
natural breathing room. A 15s video at 0.5x = 30s viewing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Report player defaults to 0.25x speed (was 0.5x) — 5s test videos
play in 20s, much more watchable
- Phase 2 injects cursor overlay via addInitScript into the test code
before running — white SVG arrow follows mousemove events
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Playwright config only records video when PLAYWRIGHT_LOCAL is set.
In CI, this env var was missing so Phase 2 produced no video.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After context.close(), renameLatestWebm would overwrite the Phase 2
test execution video with the idle research browser recording.
Now skips the rename if qa-session.webm already exists from Phase 2.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old video showed an idle screen (research browser doing nothing).
Now Phase 2 runs the test with --video=on from browser_tests/tests/,
finds the recorded .webm, and copies it to qa-session.webm where
the deploy script expects it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Copy test to browser_tests/tests/ where Playwright config expects it
- System prompt teaches Claude the project's test fixtures:
comfyPageFixture, comfyPage.menu.topbar, comfyPage.workflow, etc.
- Increased time budget to 10 min for write→run→fix iterations
- Increased max turns to 50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also search qa-artifacts/before/*/research/ for the research log
since artifacts are downloaded with that nested structure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1: Claude reads issue + a11y tree, writes a Playwright .spec.ts
test that asserts the bug exists. Runs the test, reads errors, iterates
until the test passes (proving the bug) or determines NOT_REPRODUCIBLE.
Phase 2: Run the passing test with --video=on for clean recording.
This replaces interactive browser driving with deterministic test code.
Claude Sonnet 4.6 excels at writing Playwright tests — much more
reliable than real-time browser interaction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
page.accessibility.snapshot() was removed in Playwright 1.49+.
Use page.locator('body').ariaSnapshot() which returns a text
representation of the accessibility tree.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Copy research-log.json to deploy dir (accessible at /research-log.json)
- Read verdict from research log first (a11y-verified ground truth)
- Fall back to video review verdict only if no research log exists
- Research log is uploaded as part of QA artifacts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 (qa-agent.ts): Claude investigates via a11y API only.
- No video, no Gemini vision — only page.accessibility.snapshot()
- Every action logged with a11y before/after state
- done() requires evidence citing inspect() results
- Outputs reproduction plan for Phase 2
Phase 2 (qa-reproduce.ts): Deterministic replay of research plan.
- Executes each step with a11y assertions
- Gemini describes visual changes (narration for humans)
- Clean focused video with subtitles
Phase 3: Report job reads research-log.json for verdict (ground truth),
narration-log.json for descriptions, video for visuals.
Gemini formats logs into report — never determines verdict.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent: MUST use inspect() after every action, verdict based on DOM
state not opinions. "NEVER claim REPRODUCED unless inspect() confirms."
Reviewer: Two-phase prompt — Phase 1 describes what it SEES (blind,
no context). Phase 2 compares observations against issue/PR context.
Anti-hallucination rules: "describe ONLY what you observe, NEVER infer."
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ANTHROPIC_API_KEY is optional: Agent SDK uses Claude Code OAuth
session when running locally (detects CLAUDE_CODE_SSE_PORT).
In CI, ANTHROPIC_API_KEY from secrets is used.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Gemini-only agentic loop had ~47% success rate — too low to be
useful as a fallback. Now ANTHROPIC_API_KEY is required for issue
reproduction. Fails clearly if missing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addInitScript runs before page load — Vue's app mount destroys the
cursor div when it takes over the DOM. Using addScriptTag after login
ensures the cursor persists in the stable DOM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Locator.click/hover bypasses our page.mouse monkey-patch. Now
clickByText, hoverMenuItem, clickSubmenuItem get the element
bounding box and update cursor overlay manually.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The grep \{"verdict":\s*"[^"]+ captures up to but not including the
closing quote. The second grep for "[A-Z_]+"$ then fails because
there's no closing quote. Fixed: match "verdict":\s*"[A-Z_]+ then
extract [A-Z_]+$ (no quotes needed).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Video review prompt now requests a ## Verdict JSON block:
{"verdict": "REPRODUCED|NOT_REPRODUCIBLE|INCONCLUSIVE", "risk": "low|medium|high"}
- Deploy script reads JSON verdict first, falls back to grep
- Eliminates all regex-matching false positives permanently
- Theme: light mode is default, dark via prefers-color-scheme:dark
- Cards use solid backgrounds, grain overlay only in dark mode
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add prefers-color-scheme:light media query with light palette.
Replace hardcoded dark oklch values with CSS variables.
Light mode: white surfaces, dark text, subtle borders, no grain overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- loadDefaultWorkflow now calls app.resetToDefaultWorkflow() via JS API
instead of navigating File → Load Default menu (menu item name varies)
- pressKey reverted to instant press() — the 400ms hold via down/up
prevented Escape from propagating to parent dialog (#10397 BEFORE video
showed wrong behavior because hold intercepted the event)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of manually calling moveCursorOverlay in each action,
patch page.mouse.move/click/dblclick/down/up globally. Now EVERY
mouse operation shows the cursor — text clicks, menu hovers, etc.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"could not be confirmed" contains "confirmed" which matched the
positive reproduc|confirm check. Now caught by the negative check first.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Agent SDK returned "model not found" for claude-sonnet-4-6-20250514.
Correct ID is claude-sonnet-4-6.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Headless Chrome's Playwright CDP doesn't trigger DOM mousemove events
reliably. Now executeAction calls __moveCursor(x,y) directly after
every mouse.move/click/drag. Cursor is an SVG arrow (white + outline).
Click state shown via scale animation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pressKey now uses keyboard.down/up with 400ms hold instead of
instant press(). Shows subtitle "⌨ Escape" and the keyboard HUD
catches the held state for video frame capture.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Agent system prompt now instructs Claude to demonstrate BOTH working
(control) and broken (test) states when bug is triggered by a setting
- Added docs/qa/backlog.md with future improvements: Type B/C comparisons,
TTS, pre-seeding, cost optimization, environment-dependent issues
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Shows "QA @ abc1234" linking to the pipeline code commit
- Shows start time → deploy time in header
- Helps trace which version of QA scripts generated each report
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Architecture:
- Claude Sonnet 4.6 plans and reasons (via Claude Agent SDK)
- Gemini 2.5 Flash watches video buffer and describes what it sees
- 4 tools: observe(), inspect(), perform(), done()
observe(seconds, focus): builds video clip from screenshot buffer,
sends to Gemini with Claude's focused question.
inspect(selector): searches a11y tree for specific element state.
perform(action, params): executes Playwright action.
done(verdict, summary): signals completion.
Falls back to Gemini-only loop if ANTHROPIC_API_KEY not set.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Agent reasoning shown as subtitle bar at bottom of video during recording
- After recording, generates TTS audio via OpenAI API (tts-1, nova voice)
- Merges audio clips at correct timestamps into the video with ffmpeg
- Requires OPENAI_API_KEY env var; gracefully skips if not set
- No-sandbox + disable-dev-shm-usage for headless Chrome compatibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Download QA guide artifact in report job
- Extract prerequisites, test focus, and steps from guide JSON
- Display below the purpose description: focus → prerequisites → steps
- Separated by a subtle divider with smaller font
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Report site shows "PR #N aims to..." or "Issue #N reports..." block
above the video cards, extracted from pr-context.txt
- Multi-pass video links fall back to pass1 when qa-{os}.mp4 is 404
- More negative verdict patterns: "does not demonstrate", "never tested"
- Risk uses first word of Overall Risk (avoids "high confidence" match)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "does not demonstrate", "steps were not performed", "never tested"
to NOT_REPRO patterns (fixes#9101 false positive)
- Risk detection uses first word of Overall Risk section instead of
grepping entire text (fixes "high confidence" matching HIGH)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tsx compiles arrow functions with __name helpers that don't exist in
browser context. Using addScriptTag with plain JS string avoids this.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set<string>() in page.evaluate causes __name ReferenceError in browser.
Use untyped Set() since browser JS doesn't support TS generics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"fails to reproduce" contains "reproduce" — must check negatives first
within each report. Across reports, REPRODUCED still wins (multi-pass).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Injects a persistent overlay in bottom-right corner that displays
currently held keys (e.g. "⌨ Space", "⌨ CTRL+C"). Makes keyboard
interactions visible in the recording for both human and AI reviewers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple report files exist, badge shows "2/3 REPRODUCED" instead
of just "REPRODUCED". Single-pass issues still show plain verdict.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple passes exist and one confirms while another is
inconclusive, the badge should show REPRODUCED. Previously
INCONCLUSIVE was checked first, hiding successful reproductions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The "Clone" context menu item doesn't exist in Nodes 2.0 mode.
Using Ctrl+C/Ctrl+V works in both legacy and Nodes 2.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #10307: preflight clones KSampler node, hint says drag to overlap
- #7414: preflight clicks numeric widget, hint says drag to change value
- #7806: preflight takes baseline screenshot, hint gives exact coords
for holdKeyAndDrag with spacebar
- Hints now reference "Preflight already did X, NOW do Y" pattern
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent was wasting turns re-doing loadDefaultWorkflow and setSetting
that preflight already executed. Now the system prompt includes
"Already Done" section listing preflight actions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Auto-execute prerequisite actions (enable Nodes 2.0, load default
workflow) BEFORE the agentic loop starts. Agent model ignores prompt
hints but preflight guarantees nodes are on canvas.
- Add "fails to reproduce" to NOT REPRODUCIBLE badge patterns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add issues:[labeled] trigger and qa-issue label support
- Resolve github.event.issue.number for issue-triggered runs
- Include issue labels in context (feeds keyword matcher for hints)
- Remove qa-issue label after run completes (same as qa-changes/qa-full)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scan issue context for keywords (clone, copy-paste, spacebar, resize,
sidebar, scroll, middle-click, node shape, Nodes 2.0, etc.) and inject
specific MUST-follow action steps into the agentic system prompt.
Addresses 9 INCONCLUSIVE issues where agent had actions available
but didn't know when to use them.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Annotations now use cyan dashed border + monospace "QA:" prefix
instead of red solid labels that look like UI error messages
- Video review prompts explicitly tell reviewer to ignore QA annotations
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change preload=metadata to preload=auto for full video download
- Add _headers file with Accept-Ranges for Cloudflare Pages
- Add custom seekbar (range input + buffer indicator) that works
even without server HTTP range request support
- Seekbar shows buffered progress and allows dragging to any point
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous prompt said "test the specific behavior" which was too vague,
leading to generic UI walkthroughs instead of targeted tests.
New prompt: explicitly instructs to read the diff, trigger the exact
scenario the PR fixes, and avoid generic menu screenshots.
Also added reload action to before/after prompt for state persistence tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix quality badge now reads "## Overall Risk" section only
- Prevents false MAJOR ISSUES from severity labels or negated phrases
- "Low" risk → APPROVED, "High" → MAJOR ISSUES, "Medium" → MINOR ISSUES
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-frontend must run first to install node/pnpm, then rebuild
with PR code. Also re-install sno-skills deps after switching back
so QA scripts' dependencies are available.
Also gitignore .claude/scheduled_tasks.lock.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When triggered via sno-qa-* push, the workflow checks out the PR code
to build its frontend, but this replaces qa-record.ts which only
exists on sno-skills. Fix: build PR frontend, then checkout back to
sno-skills so QA scripts remain available.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SC2231: quote glob expansions in for loop
- SC2002: use sed directly instead of cat | sed
- SC2086: quote variable in echo
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update DefaultThumbnail test to match size-full class change
- Fix shellcheck warnings in qa-batch.sh (SC2001, SC2207)
- Fix shellcheck warnings in qa-deploy-pages.sh (SC2034, SC2235, SC2231, SC2002)
- Add qa-report-template.html to oxfmt ignore (minified, not formattable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- "confirmed" didn't match "confirms"/"reproducible" — use "reproduc|confirm" stem
- "partial" matched unrelated text — require "partially reproduced" specifically
- collectVideoCandidates now finds qa-session-*.mp4 for multi-pass reviews
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Check INCONCLUSIVE before reproduced/confirmed in badge detection
- Exclude markdown headings from reproduced grep match
- Add --pass-label to qa-video-review.ts for unique multi-pass filenames
- Pass pass label from workflow YAML when reviewing numbered sessions
- Collect all pass-specific reports in deploy script HTML
- Add addNode/cloneNode convenience actions to qa-record agent
- Improve strategy hints for visual/rendering bug reproduction
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove push trigger (was for dev testing only)
- Restore concurrency group (was commented out for dev)
- Move misplaced import in qa-analyze-pr.ts to top of file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Strengthen prompt: MUST use openMenu → hoverMenuItem → clickMenuItem
in that order. Previous runs skipped openMenu causing silent failures.
- Add CI Job link to the QA report site header for quick navigation
to the GitHub Actions run that generated the report.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions has a 21000 char limit per expression. The combined
badge setup step exceeded this after adding the dual badge generator.
Split into its own step.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Badge now reads: QA Bot | REPRODUCED | Fix: APPROVED
Not all issues are bugs — could be feature requests too.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reproducing a bug is a successful outcome for the QA bot.
Blue (#2196f3) = bot succeeded. Red = bot found problems with the fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PRs now show one badge with three segments:
QA Bot | Bug: REPRODUCED | Fix: APPROVED
Instead of two separate badges. Uses gen-badge-dual.sh which
renders label + bug status + fix status in one SVG.
Issues still use single two-segment badge:
QA Bot | FINISHED: REPRODUCED
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Inject fake cursor (red dot with click animation) via addInitScript
since headless Chrome doesn't render the system cursor in video
- Add hover-before-click delay to clickByText and canvas clicks
so viewers can see where the cursor moves before clicking
- Add 'annotate' action: shows a floating label at (x,y) for N ms
so AI can draw viewer attention to important UI state changes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: runAgenticLoop never read the QA guide — agent saw
"No issue context provided" for issues. Now reads qaGuideFile,
parses structured fields, and injects into system prompt.
Also: fetch issue body via gh issue view in workflow, increase
budget to 120s/30 turns, add focus reminders, smarter stuck
detection (50px grid normalization + action-type frequency),
reject invalid click targets, add loadDefaultWorkflow and
openSettings convenience actions, strategy hints in prompt.
Fix pre-existing typecheck error in eslint.config.ts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PRs now get two separate badges:
- Bug: REPRODUCED / NOT REPRODUCIBLE / PARTIAL (before branch)
- Fix: APPROVED / MAJOR ISSUES / MINOR ISSUES (after branch)
Issues keep a single badge: FINISHED: REPRODUCED / etc.
Both badge-bug.svg and badge-fix.svg served from the deploy site.
PR comment shows all three: ![badge] ![bug] ![fix]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FINISHED is not standalone — always shows result:
- FINISHED: REPRODUCED / NOT REPRODUCIBLE / PARTIAL (issues)
- FINISHED: APPROVED / MAJOR ISSUES / MINOR ISSUES (PRs)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Badge shows QA pipeline status, deployed at each stage:
- PREPARING (blue) — setting up artifacts
- ANALYZING (orange) — running video review
- Final status with color:
- Issues: REPRODUCED (red) / NOT REPRODUCIBLE (gray) / PARTIAL (yellow)
- PRs: APPROVED (green) / MAJOR ISSUES (red) / MINOR ISSUES (yellow)
Badge served as /badge.svg from the same Cloudflare Pages site.
Included in PR comment as .
Also restore @ts-expect-error for import-x plugin type incompatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace single-shot step generation in reproduce mode with an agentic
loop where Gemini sees the screen after each action and decides what
to do next. For multi-bug issues, decompose into sub-issues and run
separate recording passes.
- Extract executeAction() from executeSteps() for reuse
- Add reload and done action types
- Add captureScreenshotForGemini() (JPEG q50, ~50KB)
- Add runAgenticLoop() with sliding window history (3 screenshots)
- Add decomposeIssue() for multi-pass recording (1-3 sub-issues)
- Update workflow to handle numbered session videos (qa-session-1, etc.)
- Add custom video controls below each video with frame stepping
- Frame back/forward buttons (1 frame at 30fps, 10 frames skip)
- Speed selector: 0.1x, 0.25x, 0.5x (default), 1x, 1.5x, 2x
- Keyboard shortcuts: arrow keys for frame step, space for play/pause
- SMPTE-style timecode display (m:ss.ms)
- Default 0.5x speed since AI operates UI faster than humans
- Videos no longer autoplay (pause on load for inspection)
- Zero external dependencies (pure HTML5 video API)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reproduce video must be max 5 minutes (short, focused demo)
- Phase 4 reuses the environment from Phase 3 (no re-setup)
- Use video-start/video-stop commands (not --save-video flag)
- Start recording right before steps, stop immediately after
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New Claude agent-driven issue reproduction skill that:
- Phase 1-2: Research issue and set up environment (custom nodes, workflows, settings)
- Phase 3: Record research video while exploring interactively via playwright-cli
- Phase 4: Record clean reproduce video with only the minimal repro steps
- Phase 5: Generate structured reproduction report
Key difference from the old approach: Claude agent explores and adapts
instead of blindly executing a Gemini-generated static plan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gemini was right-clicking empty canvas instead of nodes because it
didn't know where the default workflow nodes are positioned. Now the
prompt includes approximate coordinates for all 7 default nodes and
clarifies the difference between node context menu vs canvas menu.
Also fixes TS2352 in page.evaluate by using double-cast through unknown.
- Add --target-url CLI option to qa-video-review.ts
- Include target URL in generated markdown reports
- Add clickable issue/PR link in deployed HTML report header
- Workflow passes the target URL automatically
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused @ts-expect-error directives in eslint.config.ts
- Simplify LazyImage prop types from ClassValue to string
- Fix DialogInstance to avoid infinitely deep type instantiation
- Use cn() in DefaultThumbnail for class merging
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fillDialog now tries: PrimeVue dialog → node search box → focused input → keyboard fallback
- clickSubmenuItem now tries: PrimeVue tiered menu → litegraph context menu → role menuitem
- Fixes double-click-to-add-node flow and right-click context menu clicks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wrap each step in try/catch so failed steps don't abort the recording
- Add 5s timeout to clickByText to prevent 30s hangs on disabled elements
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gh pr view can't distinguish PRs from issues — it succeeds for both.
Use the REST API endpoint repos/{owner}/{repo}/pulls/{number} which
returns 404 for issues.
- Enforce requestTimeoutMs via Gemini SDK requestOptions
- Add 100MB video size check before base64 encoding
- Sanitize screenshot filenames to prevent path traversal
- Sort video files by mtime for reliable rename
- Validate --mode arg against allowed values
- Add Content-Length pre-check in downloadMedia
- Add GitHub domain allowlist for media downloads (SSRF mitigation)
- Add contents:write permission and git config for report job
- Update Node.js requirement in SKILL.md from 18+ to 22+
Instead of crashing the entire recording session when Gemini generates
an invalid key name (e.g. "mouseWheelDown"), catch the error and
continue with remaining steps.
Add a "Behavior Changes" table (Behavior, Before, After, Verdict)
alongside the existing timeline comparison. This gives reviewers a
quick high-level view of all behavioral differences before diving
into the frame-by-frame timeline.
Instruct Gemini to output the Before vs After section as a markdown
table with Time, Type, Severity, Before, After columns for easier
comparison. Update HTML template table styles with fixed layout and
column widths optimized for the 5-column comparison format.
- Log raw Gemini response for debugging when parsing fails
- Handle possible wrapper keys in response
- Make qa-before/qa-after run even if analyze-pr fails (only gate
on resolve-matrix success)
When running on push events for sno-qa-* branches without an open PR,
extract the PR number from the branch name so analyze-pr can fetch
the full PR thread for analysis.
Add Gemini Pro-powered PR analysis that generates targeted QA guides
from the full PR thread (description, comments, screenshots, diff).
The analyze-pr job runs on lightweight ubuntu before recordings start,
producing qa-guide-before.json and qa-guide-after.json that are
downloaded by recording jobs to produce more focused test steps.
Graceful fallback: if analysis fails, recordings proceed without guides.
download-artifact@v7 merges all files flat regardless of
merge-multiple setting. Use separate path dirs (before/after)
and copy all files into the report directory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
download-artifact@v7 defaults merge-multiple to true, which puts all
files flat in qa-artifacts/ instead of per-artifact subdirectories.
The merge step expects qa-artifacts/qa-before-{os}-{run}/ subdirs,
so the report directory never gets created and video review finds
no files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running before/after sequentially in a single job with
fragile git stash/checkout gymnastics, split into two independent
parallel jobs on separate runners:
resolve-matrix → qa-before (main) ─┐
→ qa-after (PR) ─┴→ report
- qa-before: uses git worktree for clean main branch build
- qa-after: normal PR build via setup-frontend
- report: downloads both artifact sets, merges, runs Gemini review
Benefits:
- Clean workspace isolation (no git checkout origin/main -- .)
- ~2x faster (parallel execution)
- Each job gets its own ComfyUI server (no shared state)
- Eliminates entire class of workspace contamination bugs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Pre-seed step creates qa-ci via API, so the "New user" form
shows "already exists" error. Fix by selecting the existing user
from the dropdown first, falling back to a unique username.
The localStorage userId bypass doesn't work because the server
validates user IDs and rejects the simple 'qa-ci' string. Instead,
detect the login page by its input fields and create a user via the
"New user" text input, which is how real users would log in.
Firefox headless doesn't support WebGL, causing "getCanvas: canvas is
null" errors. Switch to Chromium which has full headless WebGL support.
Also fix login flow to wait for async router guard to settle and
create user via text input as fallback.
Add coordinate fallback when .comfy-menu-button-wrapper selector isn't
found, and capture a debug screenshot after login to diagnose what the
page looks like when the editor UI fails to render.
The openComfyMenu was clicking at hardcoded coordinates (20, 67) which
missed the menu button. Now uses .comfy-menu-button-wrapper selector
matching the browser tests. Also fixes menu item hover/click selectors
to use .p-menubar-item-label and .p-tieredmenu-item classes, and adds
a wait for the editor UI to fully load before executing test steps.
The QA recordings were stuck on the user selection screen because CI
has no existing users. Fix by pre-seeding localStorage with userId,
userName, and TutorialCompleted before navigation, plus creating a
qa-ci user via API as a fallback.
The main branch build step was running pnpm install with main's lockfile,
which removed @google/generative-ai from node_modules. Move the reinstall
to after restoring PR files so the QA recording script can find its deps.
git checkout - uses @{-1} which requires a previous branch switch.
Since we use 'git checkout origin/main -- .' (file checkout, not branch
switch), there is no @{-1} ref. Use HEAD to restore from current branch.
Also restore proper concurrency group.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass the comprehensive test plan from .claude/skills/comfy-qa/SKILL.md
to Gemini when generating test steps. This gives Gemini knowledge of all
12 QA categories (canvas, menus, sidebar, settings, etc.) so it picks
the most relevant tests for each PR.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nx build runs typecheck as a prerequisite (via @nx/vite/plugin config).
Use vite build directly for the main branch comparison build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Main branch may have transient TS errors when built with the PR
branch's lockfile. Since we only need the dist for visual comparison,
run nx build directly instead of pnpm build (which includes typecheck).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the unreliable codex exec approach with a Playwright script
(qa-record.ts) that uses Gemini to generate targeted test steps from
the PR diff, then executes them deterministically via Playwright's API.
Key changes:
- New scripts/qa-record.ts: Gemini generates JSON test actions, Playwright
executes them with reliable helper functions (menu nav, dialog fill, etc.)
- Remove codex CLI and playwright-cli dependencies
- Remove 150+ lines of prompt templates from pr-qa.yaml
- Firefox headless with video recording (same approach proven locally)
- Fallback steps if Gemini fails
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Tighten BEFORE prompt to 15s snapshot (show old state only)
- Add qa-generate-test.ts: Gemini-powered Playwright test generator
- New workflow step: generate .spec.ts and push to {branch}-add-qa-test
- Tests assert UIUX behavior (tab names, dirty state, visibility)
- Build both main (dist-before/) and PR (dist/) frontends in focused mode
- Run QA twice: BEFORE on main branch frontend, AFTER on PR branch
- Send both videos to Gemini in one request for comparative analysis
- Side-by-side dashboard layout with Before (main) / After (PR) panels
- Comparative prompt evaluates whether before confirms old behavior
and after proves the fix works
- Falls back to single-video mode when no before video available
moov atom was at end of file (8.6MB offset) — browser had to download
the entire video before seeking. Keyframes were only every 10 seconds.
Add -movflags +faststart (moov before mdat) and -g 60 (keyframe every
2.4s at 25fps) to ffmpeg conversion.
- Remove autoplay/loop so video timeline is seekable
- Skip card generation for platforms without recordings
- Add --pr-context flag to qa-video-review.ts so Gemini evaluates
against PR purpose instead of just describing what happened
- Workflow now builds pr-context.txt from PR title/body/diff
The Codex agent was spending 35s browsing the "Getting Started" template
gallery instead of testing the PR's changes. Pre-seeding this setting
via the ComfyUI API ensures the agent lands directly in the graph editor.
The Codex agent was spending time on login flow, template browsing,
and general smoke testing instead of testing the PR's actual changes.
Changes:
- Add 30-second time budget for video recording
- Move video-start AFTER login and editor verification
- Explicitly prohibit template browsing and sidebar exploration
- Reduce test steps to 3-6 targeted actions
- Restructure prompt with clear Instructions/Rules sections
Replace crude sed-based markdown conversion with client-side
rendering via marked.js CDN. Adds proper table, list, and
code styling for the report section.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the OpenAI GPT-based frame extraction approach (ffmpeg + screenshots)
with Gemini 2.5 Flash's native video understanding. This eliminates false
positives from frame-based analysis (e.g. "black screen = critical bug" during
page transitions) and produces dramatically better QA reviews.
Changes:
- Remove ffmpeg frame extraction, ffprobe duration detection, and all related
logic (~365 lines removed)
- Add @google/generative-ai SDK for native video/mp4 upload to Gemini
- Update CLI: remove --max-frames, --min-interval-seconds, --keep-frames flags
- Update env: OPENAI_API_KEY → GEMINI_API_KEY
- Update workflow: swap API key secret and model in pr-qa.yaml
- Update report: replace "Frames analyzed" with "Video size"
- Add note in prompt that brief black frames during transitions are normal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
playwright-cli doesn't support 'evaluate' command. Instead, instruct
Codex to quickly fill the username input and click Next on user-select
page BEFORE starting video recording, so the video only shows actual
QA testing.
storageState config doesn't work with playwright-cli. Instead, use
evaluate to set Comfy.userId/userName after opening the page, then
navigate back. This skips user-select before video-start so the
recording only shows actual QA testing.
Write a Playwright storageState JSON with Comfy.userId/userName pre-set
so the app loads directly to the graph editor. Saves ~40s per QA run
that was wasted on navigating the user-select page.
The convert step was using find which picked up a 0-byte file from
playwright's videos/ directory instead of the valid qa-session.webm.
Now prefers qa-session.webm explicitly and skips empty files.
Codex was using pnpm dlx instead of the global playwright-cli.
Pre-install chromium in setup step and make prompt explicit about
using the global command directly without pnpm/npx.
Replace claude --print with codex exec for cheaper QA runs.
Uses codex-mini-latest model ($1.50/$6 vs Sonnet $3/$15).
Uses existing OPENAI_API_KEY secret (no new secrets needed).
- Replace saveVideo config (didn't produce video) with explicit
playwright-cli video-start/video-stop commands in QA prompt
- Remove apt-get install ffmpeg step (pre-installed on GH runners)
- Switch video review model from gpt-4o to gpt-4.1-mini
- Enable saveVideo in playwright-cli config for real video recording
- Replace screenshot stitching with webm→mp4 conversion
- Move video review step before deploy so reports are included
- Add GPT video review reports inline on the Cloudflare Pages site
- Each video card now has expandable "GPT Video Review" section
- Set .playwright/cli.config.json with outputDir pointing to screenshots/
- This way bare 'playwright-cli screenshot' auto-saves to the right place
- Create screenshot directory before Claude runs (don't rely on Claude)
- Collect step now searches working directory for stray PNGs
- Simplified prompt: no --filename needed, just 'playwright-cli screenshot'
Screenshots were saved to artifact root but stitch looked in frames/.
Now: prompt tells Claude to save to screenshots/ dir with numbered names,
collect step consolidates PNGs there, stitch step globs from screenshots/.
Removed video-start/video-stop (Claude doesn't use them).
- Add playwright-cli config with outputDir and saveVideo
- Use video-start/video-stop instead of relying on screenshot frames
- Add fallback artifact collection from .playwright-cli/ default dir
- Simplify prompts to focus on video recording workflow
The escaped \$QA_ARTIFACTS in the heredoc produced literal text
'$QA_ARTIFACTS' in the prompt. Claude's Bash tool didn't reliably
expand this env var, so no screenshots or reports were saved.
Remove the escapes so the heredoc expands the variable to the actual
path (e.g. /home/runner/work/_temp/qa-artifacts).
Backtick-wrapped playwright-cli examples in the unquoted heredoc were
being interpreted as bash command substitution, producing empty prompts.
Replace backtick syntax with plain "Run:" prefixed commands.
- Remove all Xvfb/ffmpeg screen recording infrastructure from qa job
(captured blank display since playwright-cli runs headless)
- Add screenshot instructions to QA prompts: Claude saves sequential
frames to $QA_ARTIFACTS/frames/ after every interaction
- Stitch screenshots into video via ffmpeg in report job (2fps)
- Merge video-review job into report job (4 jobs → 3 jobs)
- Unified PR comment with video links + video review in <details> collapse
- Clean up stale QA_VIDEO_REVIEW_COMMENT markers from prior runs
Move extra_server_params input to env var to prevent shell injection
from untrusted input. Replace wait-for-it pip dependency with a
cross-platform curl polling loop.
Add Claude Code skills and a label-triggered QA workflow:
- .claude/skills/comfy-qa/SKILL.md: 12-category QA test plan using
playwright-cli for browser automation
- .github/workflows/pr-qa.yaml: CI workflow triggered by qa-changes
(focused, Linux) or qa-full (3-OS matrix) labels. Records screen via
ffmpeg, runs Claude CLI with playwright-cli, deploys video gallery to
Cloudflare Pages, posts PR comment with GIF thumbnails, and runs
OpenAI vision-based video review
- scripts/qa-video-review.ts: frame extraction + GPT-4o analysis
- scripts/qa-video-review.test.ts: unit tests for video review
- knip.config.ts: resolve knip errors for ingest-types package
2026-03-31 14:32:42 +09:00
42 changed files with 7013 additions and 284 deletions
When the issue attaches a workflow.json, the agent is instructed to `downloadAttachment` then `loadWorkflow` before inspecting — many bugs only manifest in the user's specific graph state. The emitted test must itself re-fetch and `loadGraphData()` because `runTest` spawns a fresh browser.
When `verdict=REPRODUCED`, Claude also provides a `videoScript` — a separate test file using demowright's `createVideoScript()` for professional narrated demo video with title cards, TTS segments, and outro.
### Verdict Logic
- **REPRODUCED** — Test passes (asserting the bug exists) → bug is proven
- **NOT_REPRODUCIBLE** — Claude exhausted attempts, test cannot pass
- **INCONCLUSIVE** — Agent timed out or encountered infrastructure issues
Auto-completion: if a test passed with a bug-specific assertion but `done()` was never called, the pipeline auto-completes with REPRODUCED. Trivial assertions (`toBeDefined()`, `toBeGreaterThan(0)`) and discovery-style tests are excluded from auto-save to avoid false positives.
## Video Recording (demowright)
Phase 2 uses the video script to record with:
@@ -115,19 +139,126 @@ Features:
- E2E test code + video script code
- Verdict banner for NOT_REPRODUCIBLE/INCONCLUSIVE with failure reason
- Copy badge button (markdown)
## Prerequisites
-`GEMINI_API_KEY` — video review, TTS
-`ANTHROPIC_API_KEY` — Claude Agent SDK (research phase)
- ComfyUI server running (auto-detected, or auto-started)
## Manual QA (Fallback)
When the automated pipeline isn't suitable (e.g., visual-only bugs, complex multi-step interactions), use **playwright-cli** for manual browser interaction:
```bash
npm install -g @playwright/cli@latest
playwright-cli open http://127.0.0.1:8188
playwright-cli snapshot
playwright-cli click e1
playwright-cli fill e2 "test text"
playwright-cli press Escape
playwright-cli screenshot --filename=f.png
```
Snapshots return element references (`e1`, `e2`, …). Always run `snapshot` after navigation to refresh refs.
## Manual QA Test Plan
When performing manual QA systematically, cover each area below.
@@ -73,6 +101,8 @@ export async function runResearchPhase(
letfinalVideoScript=''
letturnCount=0
letlastPassedTurn=-1
letconsecutiveTestFailures=0
constMAX_FAILED_RUNS=5
conststartTime=Date.now()
constresearchLog: ResearchResult['log']=[]
@@ -216,6 +246,124 @@ export async function runResearchPhase(
}
)
// ── Tool: downloadAttachment ──
constdownloadAttachmentTool=tool(
'downloadAttachment',
'Download a URL (e.g. GitHub user-attachments, gist raw, pastebin raw) and return its text content. Use this to fetch workflow.json attached to the issue before calling loadWorkflow().',
{
url: z
.string()
.describe(
'Absolute URL to fetch. Typically a GitHub user-attachments link or a gist/pastebin raw URL extracted from the issue body.'
)
},
async(args:{url: string})=>{
letresultText: string
try{
constres=awaitfetch(args.url,{redirect:'follow'})
if(!res.ok){
resultText=`HTTP ${res.status}${res.statusText} for ${args.url}`
'Load a ComfyUI workflow into the canvas via window.app.loadGraphData(). Accepts either a URL (http/https — will be fetched) OR a complete workflow JSON string. Returns the resulting node count. For a single step from an issue attachment URL, pass the URL directly — no separate downloadAttachment needed.',
{
jsonOrUrl: z
.string()
.describe(
'Either a URL starting with http(s):// (will be fetched), or a complete workflow JSON string. Must ultimately parse to an object with a "nodes" array.'
@@ -299,6 +447,7 @@ export async function runResearchPhase(
// Auto-save passing test code for fallback completion — but only if
// the test contains a bug-specific assertion (not just a discovery/debug test)
if(resultText.startsWith('TEST PASSED')){
consecutiveTestFailures=0
try{
constcode=readFileSync(browserTestPath,'utf-8')
consthasBugAssertion=
@@ -317,6 +466,13 @@ export async function runResearchPhase(
}
resultText+=
'\n\n⚠️ Test PASSED — call done() now with verdict REPRODUCED and the test code. Do NOT write more tests.'
}else{
consecutiveTestFailures++
if(consecutiveTestFailures>=MAX_FAILED_RUNS){
resultText+=`\n\n⛔ ${consecutiveTestFailures} consecutive test failures — you MUST call done() NOW with verdict NOT_REPRODUCIBLE and summary explaining why. Do not run more tests.`
}elseif(consecutiveTestFailures>=3){
resultText+=`\n\n⚠️${consecutiveTestFailures} failed runs. ${MAX_FAILED_RUNS-consecutiveTestFailures} remaining before forced done(NOT_REPRODUCIBLE).`
@@ -383,6 +539,8 @@ export async function runResearchPhase(
inspectTool,
readFixtureTool,
readTestTool,
downloadAttachmentTool,
loadWorkflowTool,
writeTestTool,
runTestTool,
doneTool
@@ -396,21 +554,42 @@ export async function runResearchPhase(
- inspect(selector?) — Read the accessibility tree to understand the current UI. Use to discover selectors, element names, and UI state.
- readFixture(path) — Read fixture source code from browser_tests/fixtures/. Use to discover available methods. E.g. "helpers/CanvasHelper.ts", "components/Topbar.ts", "ComfyPage.ts"
- readTest(path) — Read an existing test from browser_tests/tests/ to learn patterns. E.g. "workflow.spec.ts". Pass any name to list available files.
- downloadAttachment(url) — Fetch a URL (GitHub user-attachments, gist raw, pastebin raw) and return its text. Use when you need the body for inspection.
- loadWorkflow(jsonOrUrl) — Load a workflow into the canvas via window.app.loadGraphData(). Accepts either a URL (will be fetched) OR a JSON string. For a workflow.json attached to an issue, prefer passing the URL directly in ONE call. Many bugs only manifest in the user's specific graph state.
- writeTest(code) — Write a Playwright test file (.spec.ts)
- runTest() — Execute the test and get results (pass/fail + errors)
- done(verdict, summary, evidence, testCode) — Finish with the final test
## Workflow
1. Read the issue description carefully
2. Use inspect() to understand the current UI state and discover element selectors
3. If unsure about the fixture API, use readFixture() to read the relevant helper source code
4. If unsure about test patterns, use readTest() to read an existing test for reference
5. Write a Playwright test that:
- Performs the exact reproduction steps from the issue
- Asserts the BROKEN behavior (the bug) — so the test PASSES when the bug exists
6. Run the test with runTest()
7. If it fails: read the error, fix the test, run again (max 5 attempts)
8. Call done() with the final verdict and test code
1. Read the issue description carefully. Think about:
- What PRECONDITIONS are needed? (many nodes on canvas? specific layout? saved workflow? subgraph?)
- What HIDDEN ASSUMPTIONS exist? (e.g. "z-index bug" means nodes must overlap → need a crowded canvas)
- What specific UI STATE triggers the bug? (dirty workflow? collapsed node? specific menu open?)
2. **Environment setup (critical)** — many bugs only reproduce with the user's specific workflow. If the issue body contains a workflow JSON (code block or a link like \`[workflow.json](https://github.com/user-attachments/...)\`), you MUST load it:
- **URL attachment**: call loadWorkflow(url) directly with the attachment URL — ONE call, no download step needed.
- **Inline JSON code block**: call loadWorkflow(jsonString) with the raw JSON text from the issue body.
- downloadAttachment(url) is only needed when you specifically want to inspect a non-workflow file.
Skip this step only if the issue clearly does not involve a pre-existing workflow (e.g. pure menu/settings bugs).
3. FIRST: Use readTest() to read 1-2 existing tests similar to the bug you're reproducing:
- For menu/workflow bugs: readTest("workflow.spec.ts") or readTest("topbarMenu.spec.ts")
- For node/canvas bugs: readTest("nodeInteraction.spec.ts") or readTest("copyPaste.spec.ts")
- For settings bugs: readTest("settingDialogSearch.spec.ts")
- For subgraph bugs: readTest("subgraph.spec.ts")
4. Use inspect() to understand the current UI state and discover element selectors
5. If unsure about the fixture API, use readFixture("ComfyPage.ts") or relevant helper
6. Write a Playwright test that:
- FIRST sets up the preconditions (add multiple nodes, create specific layout, save workflow, etc.)
- THEN performs the exact reproduction steps from the issue
- FINALLY asserts the BROKEN behavior (the bug) — so the test PASSES when the bug exists
- Think like a tester: the bug may only appear under specific conditions that the reporter assumed were obvious
- If a workflow was loaded in step 2, the test MUST load the same workflow itself (fetch + loadGraphData inside the test) — the agent's setup does not persist into the separate test browser session
7. Run the test with runTest()
8. If it fails, ANALYZE the error before retrying (max 5 attempts):
- Is it a selector issue? Use inspect() to find the right element
- Is it a timing issue? The UI may need time to update — use nextFrame() or expect.poll()
- Is the precondition wrong? Maybe the bug only appears with MORE nodes, AFTER a save, etc.
- Try a DIFFERENT approach, not the same code with minor tweaks
9. Call done() with the final verdict and test code
## Test writing guidelines
- Import the project fixture: \`import { comfyPageFixture as test } from '../fixtures/ComfyPage'\`
@@ -423,10 +602,23 @@ export async function runResearchPhase(
- Use \`comfyPage.nextFrame()\` after interactions that trigger UI updates
- NEVER use \`page.waitForTimeout()\` — use Locator actions and retrying assertions instead
- ALWAYS call done() when finished, even if the test passed — do not keep iterating after a passing test
- CRITICAL: If your test FAILS 3 times in a row with the same or similar error, call done(NOT_REPRODUCIBLE) immediately. Do NOT keep retrying the same approach — try a completely different strategy or give up. Spending 20+ tool calls on failing tests is wasteful.
- Budget your turns: spend at most 3 turns on inspect/readFixture, 2 turns writing the first test, then max 3 fix attempts. If still failing after ~10 tool calls, call done().
- Use \`expect.poll()\` for async assertions: \`await expect.poll(() => comfyPage.nodeOps.getGraphNodesCount()).toBe(8)\`
- CRITICAL: Your assertions must be SPECIFIC TO THE BUG. A test that asserts \`expect(count).toBeGreaterThan(0)\` proves nothing — it would pass even without the bug. Instead assert the exact broken state, e.g. \`expect(clonedWidgets).toHaveLength(0)\` (missing widgets) or \`expect(zIndex).toBeLessThan(parentZIndex)\` (wrong z-order). If a test passes trivially, it's a false positive.
- NEVER write "debug", "discovery", or "inspect node types" tests. These waste turns and produce false REPRODUCED verdicts. If you need to discover node type names, use inspect() or readFixture() — not a passing test.
- If you cannot write a bug-specific assertion, call done() with verdict NOT_REPRODUCIBLE and explain why.
- Loading the user's workflow inside the test (when the issue attached one):
@@ -493,6 +689,7 @@ export async function runResearchPhase(
- \`.setupWorkflowsDirectory(structure)\` — setup test directory
- \`.deleteWorkflow(name)\`
- \`.isCurrentWorkflowModified()\` — check dirty state
- Available subgraph assets: loadWorkflow('subgraphs/basic-subgraph'), 'subgraphs/nested-subgraph', 'subgraphs/subgraph-with-promoted-text-widget', etc.
### Context Menu (comfyPage.contextMenu)
- \`.openFor(locator)\` — right-click locator and wait for menu
@@ -500,12 +697,26 @@ export async function runResearchPhase(
- \`.isVisible()\` — check if context menu is showing
- \`.assertHasItems(items)\` — assert menu contains items
### Queue & Assets (comfyPage.assets)
- \`comfyPage.runButton.click()\` — execute current workflow (backend runs with --cpu in CI)
- \`comfyPage.assets.mockOutputHistory(jobs)\` — mock queue history with fake job items
- \`comfyPage.assets.mockEmptyState()\` — clear all mocked state
- Queue overlay: \`page.getByTestId('queue-overlay-toggle')\` to open queue panel
### Subgraph (comfyPage.subgraph)
- \`.isInSubgraph()\` — check if currently viewing a subgraph
- \`.getNodeCount()\` — nodes in current graph view
'Write a Playwright E2E test that reproduces the reported bug. Use inspect() to discover selectors, readFixture() or readTest() if you need to understand the fixture API or see existing test patterns, writeTest() to write the test, runTest() to execute it. Iterate until it works or you determine the bug cannot be reproduced.',
**Cause**: The `ANTHROPIC_API_KEY` secret in the repo has exhausted its credits.
**Fix**: Top up the Anthropic API account linked to the key, or rotate to a new key in repo Settings → Secrets.
## Duplicate Username During QA Reproduce
**Symptom**: Reproduce tests fail before any assertions with `Failed to create user: {"error": "Duplicate username."}`.
**Cause**: The backend can retain an old `playwright-test-*` user while `/api/users` does not report it in the format the fixture expects, so setup falls through to `POST /api/users` and collides.
**Fix**: The Playwright fixture now retries lookup after a duplicate response and falls back to a unique username if the stale user still cannot be resolved. If this still appears, isolate the backend with `TEST_COMFYUI_DIR` or clear stale test users from the QA backend state.
## Agent Doesn't Perform Steps
**Symptom**: Agent opens menus and settings but never interacts with the canvas.
@@ -90,7 +96,8 @@
1.`loadDefaultWorkflow` failed (no nodes on canvas)
2. Agent ran out of turn budget (30 turns / 120s)
3. Gemini Flash (old agent) ignores prompt hints
**Fix**: Use hybrid agent (Claude Sonnet 4.6 + Gemini vision). Claude's superior reasoning follows instructions precisely.
**Fix**: Use hybrid agent (Claude Sonnet 4.6 + Gemini vision). Claude's superior reasoning follows instructions precisely.
'Read a fixture or helper file from browser_tests/fixtures/ to understand the API. Use this to discover available methods on comfyPage helpers before writing your test.',
{
path: z
.string()
.describe(
'Relative path within browser_tests/fixtures/, e.g. "helpers/CanvasHelper.ts" or "components/Topbar.ts" or "ComfyPage.ts"'
constsystemPrompt=`You are a senior QA engineer who writes Playwright E2E tests to reproduce reported bugs.
## Your tools
- inspect(selector?) — Read the accessibility tree to understand the current UI. Use to discover selectors, element names, and UI state.
- readFixture(path) — Read fixture source code from browser_tests/fixtures/. Use to discover available methods. E.g. "helpers/CanvasHelper.ts", "components/Topbar.ts", "ComfyPage.ts"
- readTest(path) — Read an existing test from browser_tests/tests/ to learn patterns. E.g. "workflow.spec.ts". Pass any name to list available files.
- writeTest(code) — Write a Playwright test file (.spec.ts)
- runTest() — Execute the test and get results (pass/fail + errors)
- done(verdict, summary, evidence, testCode) — Finish with the final test
## Workflow
1. Read the issue description carefully
2. Use inspect() to understand the current UI state and discover element selectors
3. If unsure about the fixture API, use readFixture() to read the relevant helper source code
4. If unsure about test patterns, use readTest() to read an existing test for reference
5. Write a Playwright test that:
- Performs the exact reproduction steps from the issue
- Asserts the BROKEN behavior (the bug) — so the test PASSES when the bug exists
6. Run the test with runTest()
7. If it fails: read the error, fix the test, run again (max 5 attempts)
8. Call done() with the final verdict and test code
## Test writing guidelines
- Import the project fixture: \`import { comfyPageFixture as test } from '../fixtures/ComfyPage'\`
- Import expect: \`import { expect } from '@playwright/test'\`
- The fixture provides \`comfyPage\` which has all the helpers listed below
- If the bug IS present, the test should PASS. If the bug is fixed, the test would FAIL.
- Keep tests focused and minimal — test ONLY the reported bug
- Write ONE test, not multiple. Focus on the single clearest reproduction.
- The test file will be placed in browser_tests/tests/qa-reproduce.spec.ts
- Use \`comfyPage.nextFrame()\` after interactions that trigger UI updates
- NEVER use \`page.waitForTimeout()\` — use Locator actions and retrying assertions instead
- ALWAYS call done() when finished, even if the test passed — do not keep iterating after a passing test
- Use \`expect.poll()\` for async assertions: \`await expect.poll(() => comfyPage.nodeOps.getGraphNodesCount()).toBe(8)\`
- CRITICAL: Your assertions must be SPECIFIC TO THE BUG. A test that asserts \`expect(count).toBeGreaterThan(0)\` proves nothing — it would pass even without the bug. Instead assert the exact broken state, e.g. \`expect(clonedWidgets).toHaveLength(0)\` (missing widgets) or \`expect(zIndex).toBeLessThan(parentZIndex)\` (wrong z-order). If a test passes trivially, it's a false positive.
- If you cannot write a bug-specific assertion, call done() with verdict NOT_REPRODUCIBLE and explain why.
## ComfyPage Fixture API Reference
### Core properties
- \`comfyPage.page\` — raw Playwright Page
- \`comfyPage.canvas\` — Locator for #graph-canvas
console.warn('Starting research phase (Claude writes E2E tests)...')
try{
forawait(constmessageofquery({
prompt:
'Write a Playwright E2E test that reproduces the reported bug. Use inspect() to discover selectors, readFixture() or readTest() if you need to understand the fixture API or see existing test patterns, writeTest() to write the test, runTest() to execute it. Iterate until it works or you determine the bug cannot be reproduced.',
return`You are a senior QA engineer analyzing a bug report for ComfyUI frontend — a node-based visual workflow editor for AI image generation (Vue 3 + TypeScript).
The UI has:
- A large canvas (1280x720 viewport) showing a node graph centered at ~(640, 400)
- Nodes are boxes with input/output slots connected by wires
- A hamburger menu (top-left C logo) with File, Edit, Help submenus
- Sidebars (Workflows, Node Library, Models)
- A topbar with workflow tabs and Queue button
- The default workflow loads with these nodes (approximate center coordinates):
- Load Checkpoint (~150, 300), CLIP Text Encode x2 (~450, 250 and ~450, 450)
return`You are a senior QA engineer analyzing a pull request for ComfyUI frontend (a Vue 3 + TypeScript web application for AI image generation workflows).
Your task: Generate TWO targeted QA test guides — one for BEFORE the PR (main branch) and one for AFTER (PR branch).
${allText}
## Available test actions
Each step must use one of these actions:
- "openMenu" — clicks the Comfy hamburger menu (top-left C logo)
- "hoverMenuItem" — hovers a top-level menu item to open submenu (label required)
- "clickMenuItem" — clicks an item in the visible submenu (label required)
- "fillDialog" — fills dialog input and presses Enter (text required)
- "pressKey" — presses a keyboard key (key required)
- "click" — clicks an element by visible text (text required)
- "wait" — waits briefly (ms required, max 3000)
- "screenshot" — takes a screenshot (name required)
## Output format
Return a JSON object with exactly two keys: "before" and "after", each containing:
{
"summary": "One sentence: what this PR changes",
"test_focus": "Specific behaviors to verify in this recording",
"prerequisites": ["e.g. Load default workflow"],
"steps": [
{
"action": "openMenu",
"description": "Open the main menu to check file options",
SUMM=$(sed -n '/^## Summary/,/^## /p'"$rpt" 2>/dev/null | head -15)
ifecho"$SUMM"| grep -iq 'INCONCLUSIVE';then
INCONC_COUNT=$((INCONC_COUNT +1))
elifecho"$SUMM"| grep -iq 'not reproduced\|could not reproduce\|could not be confirmed\|unable to reproduce\|fails\? to reproduce\|fails\? to perform\|was NOT\|NOT visible\|not observed\|fail.* to demonstrate\|does not demonstrate\|steps were not performed\|never.*tested\|never.*accessed\|not.* confirmed';then
'You are a senior QA engineer performing a BEFORE/AFTER comparison review.',
'',
'You are given TWO videos:',
'- **Video 1 (BEFORE)**: The main branch BEFORE the PR. This shows the OLD behavior.',
'- **Video 2 (AFTER)**: The PR branch AFTER the changes. This shows the NEW behavior.',
'',
'Both videos show the same test steps executed on different code versions.',
''
]
if(prContext){
lines.push('## PR Context',prContext,'')
}
lines.push(
'## Your Task',
`Platform: "${platformName}". After video: ${toProjectRelativePath(videoPath)}.`,
'',
'1. **BEFORE video**: Does it demonstrate the old behavior or bug that the PR aims to fix?',
' Describe what you observe — this establishes the baseline.',
'2. **AFTER video**: Does it prove the PR fix works? Is the intended new behavior visible?',
'3. **Comparison**: What specifically changed between before and after?',
'4. **Regressions**: Did the PR introduce any new problems visible in the AFTER video',
' that were NOT present in the BEFORE video?',
'',
'Note: Brief black frames during page transitions are NORMAL.',
'Note: Small cyan/purple dashed labels prefixed with "QA:" are annotations placed by the automated test script — they are NOT part of the application UI. Do not treat them as bugs or evidence.',
'Report only concrete, visible differences. Avoid speculation.',
'',
'Return markdown with these sections exactly:',
'## Summary',
'(What the PR changes, whether BEFORE confirms the old behavior, whether AFTER proves the fix)',
'',
'## Behavior Changes',
'Summarize ALL behavioral differences as a markdown TABLE:',
'| Behavior | Before (main) | After (PR) | Verdict |',
'',
'- **Behavior**: short name for the behavior (e.g. "Save shortcut label", "Menu hover style")',
'- **Before (main)**: how it works/looks in the BEFORE video',
'- **After (PR)**: how it works/looks in the AFTER video',
'- **Verdict**: `Fixed`, `Improved`, `Changed`, `Regression`, or `No Change`',
'',
'One row per distinct behavior. Include both changed AND unchanged key behaviors',
'that were tested, so reviewers can confirm nothing was missed.',
'',
'## Timeline Comparison',
'Present a chronological frame-by-frame comparison as a markdown TABLE:',
'| Time | Type | Severity | Before (main) | After (PR) |',
'',
'- **Time**: timestamp or range from the videos (e.g. `0:05-0:08`)',
'- **Type**: category such as `Visual`, `Behavior`, `Layout`, `Text`, `Animation`, `Menu`, `State`',
'The video shows the full test session — analyze it chronologically.',
'Focus on UI regressions, broken states, visual glitches, unreadable text, missing labels/i18n, and clear workflow failures.',
'Note: Brief black frames during page transitions are NORMAL and should NOT be reported as issues.',
'Note: Small cyan/purple dashed labels prefixed with "QA:" are annotations placed by the automated test script — they are NOT part of the application UI. Do not treat them as bugs or evidence.',
'Report only concrete, visible problems and avoid speculation.',
'If confidence is low, mark it explicitly.',
'',
'Return markdown with these sections exactly:',
'## Summary',
isIssueContext
?'(Explain what bug was reported and whether the video confirms it is reproducible)'
:prContext
?'(Explain what the PR intended and whether the video confirms it works)'
:'',
'## Confirmed Issues',
'For each confirmed issue, use this exact format (one block per issue):',
'',
'### [Short issue title]',
'`HIGH` `01:03` `Confidence: High`',
'',
'[Description of the issue — what went wrong and what was expected]',
'',
'**Evidence:** [What you observed in the video at the given timestamp]',
'',
'**Suggested Fix:** [Actionable recommendation]',
'',
'---',
'',
'The first line after the heading MUST be exactly three backtick-wrapped labels:',
'`SEVERITY` `TIMESTAMP` `Confidence: LEVEL`',
'Do NOT use a table for issues — use the block format above.',
'## Possible Issues (Needs Human Verification)',
'## Overall Risk',
'',
'## Verdict',
'End your report with this EXACT JSON block (no markdown fence):',
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.