feat: Opus escalation for INCONCLUSIVE issues

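Sonnet tries first. If its verdict is INCONCLUSIVE and an Anthropic API key is available, the research phase is automatically retried with claude-opus-4-6 (30 turns), and the Opus result replaces the Sonnet one. Disable with QA_OPUS_ESCALATION=0. Also adds an optional model param to ResearchOptions (default: claude-sonnet-4-6).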
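
The escalation rule described above can be sketched in isolation. This is a minimal illustration, not the code in qa-record.ts: `ResearchResult`, `RunOptions`, `RunResearch`, `shouldEscalate`, and `researchWithEscalation` are hypothetical names, and the runner is passed in as a function so the gate can be exercised without a browser or an API key.

```typescript
// Verdict names as they appear in the prompt text and diff of this change.
type Verdict = 'REPRODUCED' | 'NOT_REPRODUCIBLE' | 'INCONCLUSIVE'

// Hypothetical, trimmed-down shapes for illustration only.
interface ResearchResult {
  verdict: Verdict
  summary: string
}
interface RunOptions {
  model?: string
  maxTurns?: number
}
type RunResearch = (opts: RunOptions) => Promise<ResearchResult>
type Env = Record<string, string | undefined>

// Pure gate: escalate only on INCONCLUSIVE, with an API key present,
// and unless the flag is exactly the string '0'.
function shouldEscalate(verdict: Verdict, hasApiKey: boolean, env: Env): boolean {
  return (
    verdict === 'INCONCLUSIVE' &&
    hasApiKey &&
    env['QA_OPUS_ESCALATION'] !== '0'
  )
}

// First attempt uses the runner's default model; the retry overrides
// the model and the turn limit, and its result replaces the first one.
async function researchWithEscalation(
  run: RunResearch,
  hasApiKey: boolean,
  env: Env
): Promise<ResearchResult> {
  const first = await run({})
  if (!shouldEscalate(first.verdict, hasApiKey, env)) return first
  return run({ model: 'claude-opus-4-6', maxTurns: 30 })
}
```

Note that under this comparison only the literal value `0` turns escalation off; values such as `false` or an empty string leave it enabled.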
feat: Phase 1 improvements — concurrency, auto-trigger, better prompts
3 changed files with 36 additions and 7 deletions
--- a/.claude/skills/comfy-qa/scripts/qa-agent.ts
+++ b/.claude/skills/comfy-qa/scripts/qa-agent.ts
@@ -35,6 +35,7 @@ interface ResearchOptions {
  anthropicApiKey?: string
  maxTurns?: number
  timeBudgetMs?: number
+  model?: string
 }

 export type ReproMethod = 'e2e_test' | 'video' | 'both' | 'none'
@@ -402,9 +403,13 @@ export async function runResearchPhase(

 ## Workflow
 1. Read the issue description carefully
-2. Use inspect() to understand the current UI state and discover element selectors
-3. If unsure about the fixture API, use readFixture() to read the relevant helper source code
-4. If unsure about test patterns, use readTest() to read an existing test for reference
+2. FIRST: Use readTest() to read 1-2 existing tests similar to the bug you're reproducing:
+   - For menu/workflow bugs: readTest("workflow.spec.ts") or readTest("topbarMenu.spec.ts")
+   - For node/canvas bugs: readTest("nodeInteraction.spec.ts") or readTest("copyPaste.spec.ts")
+   - For settings bugs: readTest("settingDialogSearch.spec.ts")
+   - For subgraph bugs: readTest("subgraph.spec.ts")
+3. Use inspect() to understand the current UI state and discover element selectors
+4. If unsure about the fixture API, use readFixture("ComfyPage.ts") or relevant helper
 5. Write a Playwright test that:
   - Performs the exact reproduction steps from the issue
   - Asserts the BROKEN behavior (the bug) — so the test PASSES when the bug exists
@@ -423,6 +428,8 @@ export async function runResearchPhase(
 - Use \`comfyPage.nextFrame()\` after interactions that trigger UI updates
 - NEVER use \`page.waitForTimeout()\` — use Locator actions and retrying assertions instead
 - ALWAYS call done() when finished, even if the test passed — do not keep iterating after a passing test
+- CRITICAL: If your test FAILS 3 times in a row with the same or similar error, call done(NOT_REPRODUCIBLE) immediately. Do NOT keep retrying the same approach — try a completely different strategy or give up. Spending 20+ tool calls on failing tests is wasteful.
+- Budget your turns: spend at most 3 turns on inspect/readFixture, 2 turns writing the first test, then max 3 fix attempts. If still failing after ~10 tool calls, call done().
 - Use \`expect.poll()\` for async assertions: \`await expect.poll(() => comfyPage.nodeOps.getGraphNodesCount()).toBe(8)\`
 - CRITICAL: Your assertions must be SPECIFIC TO THE BUG. A test that asserts \`expect(count).toBeGreaterThan(0)\` proves nothing — it would pass even without the bug. Instead assert the exact broken state, e.g. \`expect(clonedWidgets).toHaveLength(0)\` (missing widgets) or \`expect(zIndex).toBeLessThan(parentZIndex)\` (wrong z-order). If a test passes trivially, it's a false positive.
 - NEVER write "debug", "discovery", or "inspect node types" tests. These waste turns and produce false REPRODUCED verdicts. If you need to discover node type names, use inspect() or readFixture() — not a passing test.
@@ -596,7 +603,7 @@ ${issueContext}`
      prompt:
        'Write a Playwright E2E test that reproduces the reported bug. Use inspect() to discover selectors, readFixture() or readTest() if you need to understand the fixture API or see existing test patterns, writeTest() to write the test, runTest() to execute it. Iterate until it works or you determine the bug cannot be reproduced.',
      options: {
-        model: 'claude-sonnet-4-6',
+        model: opts.model ?? 'claude-sonnet-4-6',
        systemPrompt,
        ...(anthropicApiKey ? { apiKey: anthropicApiKey } : {}),
        maxTurns,
--- a/.claude/skills/comfy-qa/scripts/qa-record.ts
+++ b/.claude/skills/comfy-qa/scripts/qa-record.ts
@@ -1952,7 +1952,7 @@ async function main() {
            // QA guide not available
          }
        }
-        const research = await runResearchPhase({
+        let research = await runResearchPhase({
          page,
          issueContext: issueCtx,
          qaGuide: qaGuideText,
@@ -1963,6 +1963,28 @@ async function main() {
        console.warn(
          `Research complete: ${research.verdict} — ${research.summary.slice(0, 100)}`
        )
+
+        // Opus escalation: if Sonnet couldn't reproduce, try Opus
+        if (
+          research.verdict === 'INCONCLUSIVE' &&
+          anthropicKey &&
+          process.env.QA_OPUS_ESCALATION !== '0'
+        ) {
+          console.warn('Escalating to claude-opus-4-6 for complex issue...')
+          research = await runResearchPhase({
+            page,
+            issueContext: issueCtx,
+            qaGuide: qaGuideText,
+            outputDir: opts.outputDir,
+            serverUrl: opts.serverUrl,
+            anthropicApiKey: anthropicKey,
+            model: 'claude-opus-4-6',
+            maxTurns: 30
+          })
+          console.warn(
+            `Opus result: ${research.verdict} — ${research.summary.slice(0, 100)}`
+          )
+        }
        console.warn(`Evidence: ${research.evidence.slice(0, 200)}`)

        // ═══ Phase 2: Record demo video with demowright ═══
--- a/.github/workflows/pr-qa.yaml
+++ b/.github/workflows/pr-qa.yaml
@@ -26,7 +26,7 @@ on:
        default: focused

 concurrency:
-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.event.issue.number || github.ref }}
+  group: qa-${{ github.event.pull_request.number || github.event.issue.number || github.ref_name }}
  cancel-in-progress: true

 jobs:
@@ -53,7 +53,7 @@ jobs:

          # Only run on label events if it's one of our labels
          if [ "$EVENT_ACTION" = "labeled" ] && \
-             [ "$LABEL" != "qa-changes" ] && [ "$LABEL" != "qa-full" ] && [ "$LABEL" != "qa-issue" ]; then
+             [ "$LABEL" != "qa-changes" ] && [ "$LABEL" != "qa-full" ] && [ "$LABEL" != "qa-issue" ] && [ "$LABEL" != "Potential Bug" ] && [ "$LABEL" != "verified bug" ]; then
             echo "skip=true" >> "$GITHUB_OUTPUT"
          fi
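
For readers less familiar with the workflow syntax, the two pr-qa.yaml changes can be restated as plain functions. This is a sketch under assumptions: `shouldSkip`, `concurrencyGroup`, and the `ctx` field names are hypothetical stand-ins for the shell test and the `${{ ... }}` expression, and only the label-related part of the gate step is modelled.

```typescript
// Labels accepted by the gate after this change (taken from the workflow diff).
const QA_TRIGGER_LABELS: ReadonlySet<string> = new Set([
  'qa-changes',
  'qa-full',
  'qa-issue',
  'Potential Bug',
  'verified bug'
])

// Mirrors the shell condition: skip only for "labeled" events
// whose label is not one of the QA trigger labels.
function shouldSkip(eventAction: string, label: string): boolean {
  return eventAction === 'labeled' && !QA_TRIGGER_LABELS.has(label)
}

// Mirrors the `||` chain in the concurrency group expression:
// the first truthy value among PR number, issue number, and ref name wins.
function concurrencyGroup(ctx: {
  prNumber?: number
  issueNumber?: number
  refName: string
}): string {
  return `qa-${ctx.prNumber || ctx.issueNumber || ctx.refName}`
}
```

The label comparison is case-sensitive, as in the shell test, so `potential bug` would not match `Potential Bug`.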
Author	SHA1	Message	Date
snomiao	a2da58eb0f	feat: Opus escalation for INCONCLUSIVE issues Sonnet tries first. If INCONCLUSIVE, automatically retries with claude-opus-4-6 (30 turns). Disable with QA_OPUS_ESCALATION=0. Also: model param added to ResearchOptions for flexibility.	2026-04-14 13:14:33 +00:00
snomiao	3154865ce2	feat: Phase 1 improvements — concurrency, auto-trigger, better prompts - B1: Fix concurrency group to use ref_name (parallel sno-qa-* branches) - D1: Auto-trigger QA on 'Potential Bug' and 'verified bug' labels - A4: Prompt agent to read existing tests first before writing - Turn budget enforcement from previous commit	2026-04-14 13:12:49 +00:00
snomiao	ff6034e2ee	fix: reduce INCONCLUSIVE rate — enforce turn budget and fail-fast - 3 consecutive test failures → call done(NOT_REPRODUCIBLE) - Turn budget: ~3 inspect, 2 write, 3 fix = ~10 tool calls max - Prevents 20+ tool call retry loops that waste CI time	2026-04-13 19:41:54 +00:00
snomiao	529ac3cea4	trigger: re-run cancelled batch 2	2026-04-13 18:42:20 +00:00
snomiao	f95eebf3db	trigger: re-run cancelled QA batches	2026-04-13 17:49:03 +00:00
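
The fail-fast rule from ff6034e2ee (three consecutive test failures lead to done(NOT_REPRODUCIBLE)) is enforced in this change only through the system prompt. If a harness-side guard were wanted as well, it might look like the sketch below; `createFailFastTracker` and `NextStep` are hypothetical names, and the sketch counts every consecutive failure rather than comparing error messages for similarity as the prompt asks the agent to do.

```typescript
// What the harness would tell the agent loop to do next.
type NextStep = 'done' | 'retry' | 'give_up'

// Counts consecutive failing test runs; a passing run resets the count.
function createFailFastTracker(maxConsecutiveFailures = 3) {
  let failures = 0
  return {
    record(passed: boolean): NextStep {
      if (passed) {
        failures = 0
        return 'done'
      }
      failures += 1
      // At the limit, the caller would report NOT_REPRODUCIBLE and stop.
      return failures >= maxConsecutiveFailures ? 'give_up' : 'retry'
    }
  }
}
```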