Compare commits

5 Commits

Author SHA1 Message Date
snomiao
a2da58eb0f feat: Opus escalation for INCONCLUSIVE issues
Sonnet tries first. If INCONCLUSIVE, automatically retries with
claude-opus-4-6 (30 turns). Disable with QA_OPUS_ESCALATION=0.
Also: model param added to ResearchOptions for flexibility.
2026-04-14 13:14:33 +00:00
snomiao
3154865ce2 feat: Phase 1 improvements — concurrency, auto-trigger, better prompts
- B1: Fix concurrency group to use ref_name (parallel sno-qa-* branches)
- D1: Auto-trigger QA on 'Potential Bug' and 'verified bug' labels
- A4: Prompt agent to read existing tests first before writing
- Turn budget enforcement from previous commit
2026-04-14 13:12:49 +00:00
snomiao
ff6034e2ee fix: reduce INCONCLUSIVE rate — enforce turn budget and fail-fast
- 3 consecutive test failures → call done(NOT_REPRODUCIBLE)
- Turn budget: ~3 inspect, 2 write, 3 fix = ~10 tool calls max
- Prevents 20+ tool call retry loops that waste CI time
2026-04-13 19:41:54 +00:00
snomiao
529ac3cea4 trigger: re-run cancelled batch 2
2026-04-13 18:42:20 +00:00
snomiao
f95eebf3db trigger: re-run cancelled QA batches
2026-04-13 17:49:03 +00:00
3 changed files with 36 additions and 7 deletions

@@ -35,6 +35,7 @@ interface ResearchOptions {
anthropicApiKey?: string
maxTurns?: number
timeBudgetMs?: number
+model?: string
}
export type ReproMethod = 'e2e_test' | 'video' | 'both' | 'none'
@@ -402,9 +403,13 @@ export async function runResearchPhase(
## Workflow
1. Read the issue description carefully
-2. Use inspect() to understand the current UI state and discover element selectors
-3. If unsure about the fixture API, use readFixture() to read the relevant helper source code
-4. If unsure about test patterns, use readTest() to read an existing test for reference
+2. FIRST: Use readTest() to read 1-2 existing tests similar to the bug you're reproducing:
+- For menu/workflow bugs: readTest("workflow.spec.ts") or readTest("topbarMenu.spec.ts")
+- For node/canvas bugs: readTest("nodeInteraction.spec.ts") or readTest("copyPaste.spec.ts")
+- For settings bugs: readTest("settingDialogSearch.spec.ts")
+- For subgraph bugs: readTest("subgraph.spec.ts")
+3. Use inspect() to understand the current UI state and discover element selectors
+4. If unsure about the fixture API, use readFixture("ComfyPage.ts") or relevant helper
5. Write a Playwright test that:
- Performs the exact reproduction steps from the issue
- Asserts the BROKEN behavior (the bug) — so the test PASSES when the bug exists
@@ -423,6 +428,8 @@ export async function runResearchPhase(
- Use \`comfyPage.nextFrame()\` after interactions that trigger UI updates
- NEVER use \`page.waitForTimeout()\` — use Locator actions and retrying assertions instead
- ALWAYS call done() when finished, even if the test passed — do not keep iterating after a passing test
+- CRITICAL: If your test FAILS 3 times in a row with the same or similar error, call done(NOT_REPRODUCIBLE) immediately. Do NOT keep retrying the same approach — try a completely different strategy or give up. Spending 20+ tool calls on failing tests is wasteful.
+- Budget your turns: spend at most 3 turns on inspect/readFixture, 2 turns writing the first test, then max 3 fix attempts. If still failing after ~10 tool calls, call done().
- Use \`expect.poll()\` for async assertions: \`await expect.poll(() => comfyPage.nodeOps.getGraphNodesCount()).toBe(8)\`
- CRITICAL: Your assertions must be SPECIFIC TO THE BUG. A test that asserts \`expect(count).toBeGreaterThan(0)\` proves nothing — it would pass even without the bug. Instead assert the exact broken state, e.g. \`expect(clonedWidgets).toHaveLength(0)\` (missing widgets) or \`expect(zIndex).toBeLessThan(parentZIndex)\` (wrong z-order). If a test passes trivially, it's a false positive.
- NEVER write "debug", "discovery", or "inspect node types" tests. These waste turns and produce false REPRODUCED verdicts. If you need to discover node type names, use inspect() or readFixture() — not a passing test.
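
The fail-fast and turn-budget rules in this hunk are given to the agent as prompt text; the diff does not show them being checked in code. The sketch below shows one way the same limits could be tracked programmatically. All names in it (`BudgetState`, `recordToolCall`, `recordTestRun`, `stopReason`) are hypothetical and do not appear in the source, and it ignores the "same or similar error" condition by counting every consecutive failure.

```typescript
// Hypothetical sketch: none of these names exist in the diff.
// Mirrors the prompt rules: stop after 3 consecutive test failures,
// or once roughly 10 tool calls have been spent.
interface BudgetState {
  toolCalls: number
  consecutiveFailures: number
}

const MAX_TOOL_CALLS = 10 // "~10 tool calls" in the prompt
const MAX_CONSECUTIVE_FAILURES = 3 // "FAILS 3 times in a row"

function newBudget(): BudgetState {
  return { toolCalls: 0, consecutiveFailures: 0 }
}

// Count a non-test tool call (inspect, readFixture, writeTest, ...).
function recordToolCall(state: BudgetState): BudgetState {
  return { ...state, toolCalls: state.toolCalls + 1 }
}

// Count a test run; a passing run resets the failure streak.
function recordTestRun(state: BudgetState, passed: boolean): BudgetState {
  return {
    toolCalls: state.toolCalls + 1,
    consecutiveFailures: passed ? 0 : state.consecutiveFailures + 1
  }
}

// Returns why the agent should call done(), or null to keep going.
function stopReason(
  state: BudgetState
): 'fail_fast' | 'budget_exhausted' | null {
  if (state.consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) return 'fail_fast'
  if (state.toolCalls >= MAX_TOOL_CALLS) return 'budget_exhausted'
  return null
}
```

A caller would update the state after each tool invocation and stop the loop as soon as `stopReason` returns a non-null value.
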
@@ -596,7 +603,7 @@ ${issueContext}`
prompt:
'Write a Playwright E2E test that reproduces the reported bug. Use inspect() to discover selectors, readFixture() or readTest() if you need to understand the fixture API or see existing test patterns, writeTest() to write the test, runTest() to execute it. Iterate until it works or you determine the bug cannot be reproduced.',
options: {
-model: 'claude-sonnet-4-6',
+model: opts.model ?? 'claude-sonnet-4-6',
systemPrompt,
...(anthropicApiKey ? { apiKey: anthropicApiKey } : {}),
maxTurns,
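
The `opts.model ?? 'claude-sonnet-4-6'` change relies on nullish coalescing to keep Sonnet as the default while allowing an override. A minimal sketch of that behavior in isolation follows; `ModelOptions`, `DEFAULT_MODEL` and `resolveModel` are illustrative names, and only the `model` field and the default string come from the diff.

```typescript
// Hypothetical sketch; only `model` and the default model string are from the diff.
interface ModelOptions {
  model?: string
}

const DEFAULT_MODEL = 'claude-sonnet-4-6'

function resolveModel(opts: ModelOptions): string {
  // `??` falls back only when the left side is null or undefined,
  // so an empty string would be passed through unchanged.
  return opts.model ?? DEFAULT_MODEL
}
```

Because `??` does not treat an empty string as missing, a caller that passes `model: ''` gets the empty string rather than the default, which differs from what `||` would do.
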

@@ -1952,7 +1952,7 @@ async function main() {
// QA guide not available
}
}
-const research = await runResearchPhase({
+let research = await runResearchPhase({
page,
issueContext: issueCtx,
qaGuide: qaGuideText,
@@ -1963,6 +1963,28 @@ async function main() {
console.warn(
`Research complete: ${research.verdict}${research.summary.slice(0, 100)}`
)
+// Opus escalation: if Sonnet couldn't reproduce, try Opus
+if (
+research.verdict === 'INCONCLUSIVE' &&
+anthropicKey &&
+process.env.QA_OPUS_ESCALATION !== '0'
+) {
+console.warn('Escalating to claude-opus-4-6 for complex issue...')
+research = await runResearchPhase({
+page,
+issueContext: issueCtx,
+qaGuide: qaGuideText,
+outputDir: opts.outputDir,
+serverUrl: opts.serverUrl,
+anthropicApiKey: anthropicKey,
+model: 'claude-opus-4-6',
+maxTurns: 30
+})
+console.warn(
+`Opus result: ${research.verdict}${research.summary.slice(0, 100)}`
+)
+}
console.warn(`Evidence: ${research.evidence.slice(0, 200)}`)
// ═══ Phase 2: Record demo video with demowright ═══
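
The escalation hunk above can be read as "run once with the default model, then retry once with Opus if the verdict is INCONCLUSIVE, a key is present, and the opt-out variable is not `0`". The sketch below restates that control flow with the decision pulled into a pure function. `ResearchResult`, `Env`, `shouldEscalate` and `researchWithEscalation` are hypothetical names, and the environment is passed in as a parameter here, whereas the diff reads `process.env` directly.

```typescript
// Hypothetical sketch: the names below are illustrative, not from the source.
interface ResearchResult {
  verdict: string
  summary: string
}

type Env = Record<string, string | undefined>

// Same three conditions as the diff: INCONCLUSIVE verdict, an API key,
// and QA_OPUS_ESCALATION not set to '0'.
function shouldEscalate(
  verdict: string,
  apiKey: string | undefined,
  env: Env
): boolean {
  return (
    verdict === 'INCONCLUSIVE' &&
    Boolean(apiKey) &&
    env['QA_OPUS_ESCALATION'] !== '0'
  )
}

// `run` stands in for a call to the research phase with optional overrides.
async function researchWithEscalation(
  run: (overrides: { model?: string; maxTurns?: number }) => Promise<ResearchResult>,
  apiKey: string | undefined,
  env: Env
): Promise<ResearchResult> {
  const first = await run({})
  if (!shouldEscalate(first.verdict, apiKey, env)) return first
  // One retry only, with the model and turn limit used in the diff.
  return run({ model: 'claude-opus-4-6', maxTurns: 30 })
}
```

Keeping the decision in a separate function makes the opt-out easy to check without running either model.
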

@@ -26,7 +26,7 @@ on:
default: focused
concurrency:
-group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.event.issue.number || github.ref }}
+group: qa-${{ github.event.pull_request.number || github.event.issue.number || github.ref_name }}
cancel-in-progress: true
jobs:
@@ -53,7 +53,7 @@ jobs:
# Only run on label events if it's one of our labels
if [ "$EVENT_ACTION" = "labeled" ] && \
-[ "$LABEL" != "qa-changes" ] && [ "$LABEL" != "qa-full" ] && [ "$LABEL" != "qa-issue" ]; then
+[ "$LABEL" != "qa-changes" ] && [ "$LABEL" != "qa-full" ] && [ "$LABEL" != "qa-issue" ] && [ "$LABEL" != "Potential Bug" ] && [ "$LABEL" != "verified bug" ]; then
echo "skip=true" >> "$GITHUB_OUTPUT"
fi
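
The shell condition in the last hunk skips the job when a `labeled` event carries a label outside the allowed set. The same rule is restated below in TypeScript as a sketch for reasoning about edge cases. `QA_TRIGGER_LABELS` and `shouldSkip` are hypothetical names; the five label strings are the ones compared in the diff.

```typescript
// Hypothetical sketch of the workflow's label filter; names are illustrative.
const QA_TRIGGER_LABELS = new Set<string>([
  'qa-changes',
  'qa-full',
  'qa-issue',
  'Potential Bug',
  'verified bug'
])

// Skip only when the event is a label event AND the label is not one of ours.
// Matching is exact and case-sensitive, like the shell `!=` comparisons.
function shouldSkip(eventAction: string, label: string): boolean {
  return eventAction === 'labeled' && !QA_TRIGGER_LABELS.has(label)
}
```

Because the comparison is exact, a label spelled `potential bug` would still be skipped under this rule.
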