ComfyUI_frontend/docs/qa/models.md
snomiao 7520a47b5a feat: add automated QA pipeline with E2E test-driven bug reproduction
Three-phase pipeline triggered by labels (qa-changes, qa-full, qa-issue):
1. Research: Claude writes Playwright E2E tests to reproduce reported bugs
2. Reproduce: Deterministic replay with video recording
3. Report: Deploy results to Cloudflare Pages with badges

Key design decisions:
- Playwright assertions are source of truth (not AI vision)
- Agent has readFixture/readTest tools to discover project patterns
- Bug-specific assertions required (trivial assertions banned)
- Main branch dist cached by SHA to speed up before/after comparisons
- QA deps installed inline in CI (no package.json changes needed)

Verified across 48 runs (22 PRs + 26 issues) with 0 false positives.

Amp-Thread-ID: https://ampcode.com/threads/T-019d519b-004f-71ce-b970-96edd971fbe0
Co-authored-by: Amp <amp@ampcode.com>
2026-04-10 08:51:10 +00:00


QA Pipeline Model Selection

Current Configuration

| Script | Role | Model | Why |
| --- | --- | --- | --- |
| qa-analyze-pr.ts | PR/issue analysis, QA guide generation | gemini-3.1-pro-preview | Needs deep reasoning over PR diffs, screenshots, and issue threads |
| qa-record.ts | Playwright step generation | gemini-3.1-pro-preview | Step quality is critical: the model must understand ComfyUI's canvas UI and produce precise action sequences |
| qa-video-review.ts | Video comparison review | gemini-3-flash-preview | Video analysis with structured output; flash is sufficient and faster |
| qa-generate-test.ts | Regression test generation | gemini-3-flash-preview | Code generation from QA reports; flash handles this well |
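
As a minimal sketch, the defaults in the table above could be kept in one typed map so that each script looks up its model in a single place. The names `QaScript`, `DEFAULT_MODELS`, and `defaultModelFor` are hypothetical and are not taken from the actual scripts; only the script names and model IDs come from the table.

```typescript
// Hypothetical defaults map mirroring the configuration table.
// The type and function names are illustrative assumptions.
type QaScript =
  | "qa-analyze-pr.ts"
  | "qa-record.ts"
  | "qa-video-review.ts"
  | "qa-generate-test.ts";

const DEFAULT_MODELS: Record<QaScript, string> = {
  "qa-analyze-pr.ts": "gemini-3.1-pro-preview",
  "qa-record.ts": "gemini-3.1-pro-preview",
  "qa-video-review.ts": "gemini-3-flash-preview",
  "qa-generate-test.ts": "gemini-3-flash-preview",
};

// Returns the default model ID for a given QA script.
function defaultModelFor(script: QaScript): string {
  return DEFAULT_MODELS[script];
}
```

Typing the keys as a union means a newly added script fails to compile until it is given a default model.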

Model Comparison

Gemini 3.1 Pro vs GPT-5.4

| | Gemini 3.1 Pro Preview | GPT-5.4 |
| --- | --- | --- |
| Context window | 1M tokens | 1M tokens |
| Max output | 65K tokens | 128K tokens |
| Video input | Yes | No |
| Image input | Yes | Yes |
| Audio input | Yes | No |
| Pricing (input) | $2/1M tokens | $2.50/1M tokens |
| Pricing (output) | $12/1M tokens | $15/1M tokens |
| Function calling | Yes | Yes |
| Code execution | Yes | Yes (interpreter) |
| Structured output | Yes | Yes |

Why Gemini over GPT for QA:

  • Native video understanding (can review recordings directly)
  • Native multimodal input (screenshots, videos, audio from issue threads)
  • Lower cost at comparable quality: $2 vs. $2.50 per 1M input tokens and $12 vs. $15 per 1M output tokens, a 20% saving that adds up in high-volume CI usage
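
To make the cost argument concrete, the sketch below estimates per-run cost from the list prices in the comparison table. The workload of 200,000 input tokens and 10,000 output tokens per run is a hypothetical figure chosen for illustration, not a measurement from the pipeline, and the names `Pricing`, `PRICING`, and `estimateCostUsd` are likewise illustrative.

```typescript
// Prices in USD per 1M tokens, copied from the comparison table.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

const PRICING: Record<string, Pricing> = {
  "Gemini 3.1 Pro Preview": { inputPerMTok: 2.0, outputPerMTok: 12.0 },
  "GPT-5.4": { inputPerMTok: 2.5, outputPerMTok: 15.0 },
};

// Cost of one call: tokens are billed per million, input and output separately.
function estimateCostUsd(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPerMTok + (outputTokens / 1e6) * p.outputPerMTok;
}

// Hypothetical workload (an assumption, not measured data).
const INPUT_TOKENS = 200000;
const OUTPUT_TOKENS = 10000;

const geminiCost = estimateCostUsd(PRICING["Gemini 3.1 Pro Preview"], INPUT_TOKENS, OUTPUT_TOKENS);
const gptCost = estimateCostUsd(PRICING["GPT-5.4"], INPUT_TOKENS, OUTPUT_TOKENS);

console.log(`Gemini 3.1 Pro Preview: $${geminiCost.toFixed(2)} per run`);
console.log(`GPT-5.4: $${gptCost.toFixed(2)} per run`);
```

Because both the input and output prices are 20% lower for Gemini 3.1 Pro Preview, the ratio between the two estimates stays the same for any mix of input and output tokens.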

Gemini 3 Flash vs GPT-5.4 Mini

| | Gemini 3 Flash Preview | GPT-5.4 Mini |
| --- | --- | --- |
| Context window | 1M tokens | 1M tokens |
| Pricing (input) | $0.50/1M tokens | $0.40/1M tokens |
| Pricing (output) | $3/1M tokens | $1.60/1M tokens |
| Video input | Yes | No |

Why Gemini Flash for video review:

  • Video input support is required, and GPT models cannot process video files; this outweighs GPT-5.4 Mini's lower token prices
  • Good enough quality for structured comparison reports

Upgrade History

| Date | Change | Reason |
| --- | --- | --- |
| 2026-03-24 | gemini-2.5-flash → gemini-3.1-pro-preview (record) | Shallow step generation; pro model needed for complex ComfyUI interactions |
| 2026-03-24 | gemini-2.5-pro → gemini-3.1-pro-preview (analyze) | Keep analysis on latest pro |
| 2026-03-24 | gemini-2.5-flash → gemini-3-flash-preview (review, test-gen) | Latest flash for cost-efficient tasks |

Override

All scripts accept --model <name> to override the default. Pass any Gemini model ID.
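
One way such an override could be resolved is sketched below. This is an assumption about the behavior, not the scripts' actual argument parsing; `resolveModel` is a hypothetical helper, and it handles only the `--model <name>` form described above.

```typescript
// Hypothetical helper: returns the value following --model, or the default
// when the flag is absent. Not taken from the actual QA scripts.
function resolveModel(argv: readonly string[], defaultModel: string): string {
  const flagIndex = argv.indexOf("--model");
  if (flagIndex === -1) {
    return defaultModel;
  }
  const value = argv[flagIndex + 1];
  // Reject a dangling flag or a flag followed directly by another option.
  if (value === undefined || value.startsWith("--")) {
    throw new Error("--model requires a model ID");
  }
  return value;
}

// Example: the arguments a script might receive after its own name.
const overridden = resolveModel(["--model", "gemini-3-flash-preview"], "gemini-3.1-pro-preview");
const fallback = resolveModel([], "gemini-3.1-pro-preview");
console.log(overridden, fallback);
```

In a Node.js script the `argv` argument would typically be `process.argv.slice(2)`; the helper takes it as a parameter so it can be tested without a real command line.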