mirror of
https://github.com/Comfy-Org/ComfyUI_frontend.git
synced 2026-04-20 14:30:41 +00:00
fix: prevent AI lies — assertion-based verdicts + blind reviewer
Agent: MUST use inspect() after every action; the verdict is based on DOM state, not opinions. "NEVER claim REPRODUCED unless inspect() confirms."

Reviewer: Two-phase prompt. Phase 1 describes what it SEES (blind, no context). Phase 2 compares observations against issue/PR context.

Anti-hallucination rules: "describe ONLY what you observe, NEVER infer."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -477,12 +477,21 @@ export async function runHybridAgent(opts: AgentOptions): Promise<{
- done(verdict, summary) — Finish with your conclusion.

## Strategy
1. Start by understanding the issue, then plan your reproduction steps.
2. Use perform() to take actions. After each action, use inspect() to verify state or observe() for visual confirmation.
1. Start by understanding the issue, then plan your FULL reproduction sequence before acting.
2. Use perform() to take actions. After EVERY action, use inspect() to verify the DOM state changed as expected.
3. If a setting change doesn't seem to take effect, try reload() then verify again.
4. Focus on the specific bug — don't explore randomly.
5. Take screenshots at key moments for the video evidence.
6. When you've confirmed or ruled out the bug, call done().
7. You MUST complete ALL reproduction steps. Do NOT stop after setup — the actual bug trigger is the most important part.

## Verification Rules (CRITICAL — prevents false results)
- NEVER claim REPRODUCED unless inspect() confirms the expected broken state exists
- After EVERY action, call inspect() to verify the DOM state. This is your source of truth.
- Your done() verdict MUST be supported by inspect() results, not by what you think happened
- If you perform an action but inspect() shows no state change, the action FAILED — try again or adapt
- Example: if you press Escape and inspect("Settings dialog") still returns visible → the dialog did NOT close → report that honestly
- ALWAYS include inspect() evidence in your done() summary: "inspect('X') returned {state} confirming Y"

## Control/Test Comparison (IMPORTANT)
When a bug is triggered by a specific setting, mode, or configuration:

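The verification rules above are enforced only through prompt text. As a rough illustration of the same idea in code, the sketch below shows how a harness could refuse a REPRODUCED verdict that the last inspect() result does not support. Everything here is an assumption for illustration: `Verdict`, `InspectRecord`, `BugExpectation`, and `gateVerdict` are hypothetical names, and the real return shapes of inspect() and done() are not visible in this diff.

```typescript
// Hypothetical types: the diff does not show the real shapes used by
// inspect() or done(), so these are illustrative stand-ins.
type Verdict = 'REPRODUCED' | 'NOT_REPRODUCED' | 'INCONCLUSIVE'

interface InspectRecord {
  selector: string // what was passed to inspect(), e.g. 'Settings dialog'
  state: string // what inspect() reported, e.g. 'visible' or 'hidden'
}

interface BugExpectation {
  selector: string // the element whose state proves or disproves the bug
  brokenState: string // the state that would confirm the bug
}

// Hypothetical helper: accept a claimed verdict only if inspect() evidence backs it.
function gateVerdict(
  claimed: Verdict,
  expectation: BugExpectation,
  history: InspectRecord[]
): { verdict: Verdict; evidence: string } {
  // Use the most recent inspect() result for the element under test.
  const last = [...history]
    .reverse()
    .find((record) => record.selector === expectation.selector)

  if (!last) {
    return {
      verdict: 'INCONCLUSIVE',
      evidence: `inspect('${expectation.selector}') was never called`
    }
  }

  const evidence = `inspect('${last.selector}') returned ${last.state}`

  // A REPRODUCED claim without the broken state in the DOM is downgraded.
  if (claimed === 'REPRODUCED' && last.state !== expectation.brokenState) {
    return { verdict: 'INCONCLUSIVE', evidence }
  }

  return { verdict: claimed, evidence }
}
```

With the Escape example from the rules, a history ending in `{ selector: 'Settings dialog', state: 'visible' }` keeps a REPRODUCED claim, while a history ending in `'hidden'` downgrades it. The returned `evidence` string follows the "inspect('X') returned {state}" pattern the prompt asks the agent to put in its done() summary.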
@@ -421,6 +421,12 @@ function buildSingleVideoPrompt(
): string {
  const lines = [
    'You are a senior QA engineer reviewing a UI test session recording.',
    '',
    '## ANTI-HALLUCINATION RULES (READ FIRST)',
    '- Describe ONLY what you can directly observe in the video frames',
    '- NEVER infer or assume what "must have happened" between frames',
    '- If a step is not visible in the video, say "NOT SHOWN" — do not guess',
    '- Your job is to be a CAMERA — report facts, not interpretations',
    ''
  ]

@@ -431,38 +437,40 @@ function buildSingleVideoPrompt(
  )

  if (prContext) {
    lines.push(
      '## Phase 1: Blind Observation (describe what you SEE)',
      'First, describe every UI interaction chronologically WITHOUT knowing the expected outcome:',
      '- What elements does the user click/hover/type?',
      '- What dialogs/menus open and close?',
      '- What keyboard indicators appear? (look for subtitle overlays)',
      '- What is the BEFORE state and AFTER state of each action?',
      '',
      '## Phase 2: Compare against expected behavior',
      'Now compare your observations against the context below.',
      'Only claim a match if your Phase 1 observations EXPLICITLY support it.',
      ''
    )

    if (isIssueContext) {
      lines.push(
        '## Issue Context',
        'This video attempts to reproduce a reported bug on the main branch.',
        'Your review MUST evaluate whether the reported bug is visible and reproducible.',
        '',
        prContext,
        '',
        '## Review Instructions',
        '1. Does the video demonstrate the reported bug occurring?',
        '2. Is the bug clearly visible and reproducible from the steps shown?',
        '3. Are there any other issues visible during the reproduction attempt?',
        '',
        '## CRITICAL: Honesty Requirements',
        '- If the video only shows login, idle canvas, or trivial menu interactions WITHOUT actually performing the reproduction steps, say "INCONCLUSIVE — reproduction steps were not performed".',
        '- Do NOT claim a bug is "confirmed" unless you can clearly see the bug behavior described in the issue.',
        '- Do NOT hallucinate findings. If the video does not show meaningful interaction, say so clearly.',
        '- Rate confidence as "Low" if the video does not actually demonstrate the bug scenario.',
        '## Comparison Questions',
        '1. Did the video perform the reproduction steps described in the issue?',
        '2. Did your Phase 1 observations show the reported bug behavior?',
        '3. If the steps were not performed or the bug was not visible, say INCONCLUSIVE.',
        ''
      )
    } else {
      lines.push(
        '## PR Context',
        'The video is a QA session testing a specific pull request.',
        'Your review MUST evaluate whether the PR achieves its stated purpose.',
        '',
        prContext,
        '',
        '## Review Instructions',
        "1. Does the video demonstrate the PR's intended behavior working correctly?",
        '2. Are there regressions or side effects caused by the PR changes?',
        '3. Does the observed behavior match what the PR claims to implement/fix?',
        '## Comparison Questions',
        '1. Did the video test the specific behavior the PR changes?',
        '2. Did your Phase 1 observations show the expected before/after difference?',
        '3. If the test was incomplete or inconclusive, say so honestly.',
        ''
      )
    }

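The reviewer prompt puts the anti-hallucination rules first, then the blind Phase 1 instructions, then Phase 2, and only after that the issue or PR text. The sketch below is a trimmed-down, self-contained illustration of that ordering, not the project's actual function: `buildTwoPhasePrompt` is a hypothetical name, the prompt strings are abbreviated from the diff, and joining the lines with `'\n'` is an assumption since the diff does not show how `lines` is consumed.

```typescript
// Hypothetical stand-in for buildSingleVideoPrompt; the real signature and
// the way `lines` is turned into a prompt are not visible in the diff.
function buildTwoPhasePrompt(
  prContext: string,
  isIssueContext: boolean
): string {
  const lines: string[] = [
    'You are a senior QA engineer reviewing a UI test session recording.',
    '',
    '## ANTI-HALLUCINATION RULES (READ FIRST)',
    '- Describe ONLY what you can directly observe in the video frames',
    '- NEVER infer or assume what "must have happened" between frames',
    '',
    '## Phase 1: Blind Observation (describe what you SEE)',
    '- What is the BEFORE state and AFTER state of each action?',
    '',
    '## Phase 2: Compare against expected behavior',
    'Only claim a match if your Phase 1 observations EXPLICITLY support it.',
    ''
  ]

  // The issue/PR text is appended last, so the observation instructions
  // precede any description of the expected outcome.
  lines.push(
    isIssueContext ? '## Issue Context' : '## PR Context',
    '',
    prContext,
    '',
    '## Comparison Questions',
    isIssueContext
      ? 'Did your Phase 1 observations show the reported bug behavior?'
      : 'Did your Phase 1 observations show the expected before/after difference?'
  )

  // Assumption: the prompt is the lines joined by newlines.
  return lines.join('\n')
}
```

Calling `buildTwoPhasePrompt('Dialog stays open after Escape', true)` yields a prompt in which the rules, Phase 1, Phase 2, and the issue context appear in that order. Because everything still lives in one prompt, the "blind" phase is an instruction to the model, not a guarantee that it has not read ahead.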