crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 07:48:50 +00:00

Files

unclecode 254ef0510b Fix anti-bot detection for large SPA block pages (403/503)

Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that
exceed 100KB, bypassing all size-based detection thresholds. As a result,
the fallback (Web Unlocker) never triggered for these sites.

Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked
  regardless of page size (false positives are cheap, fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns
  on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as Tier 1 pattern (Reddit et al.)
- Updated tests to reflect new detection philosophy
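
The detection rules listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual detector: every name here (`is_blocked`, `looks_like_data`, `strip_scripts_and_styles`, `TIER1_PATTERNS`, `DEEP_SCAN_MIN_BYTES`) is hypothetical, the 100KB deep-scan threshold is an assumption taken from the sizes mentioned in the commit message, and the "non-data" check is a crude guess at what distinguishes an HTML page from a JSON or XML payload.

```python
import re

# All names below are hypothetical and chosen for illustration only;
# they are not taken from the crawl4ai code base.
TIER1_PATTERNS = ("blocked by network security",)
DEEP_SCAN_MIN_BYTES = 100_000  # assumed cut-off for "large" pages

# Matches whole <script>...</script> and <style>...</style> elements.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts_and_styles(html: str) -> str:
    """Remove script and style elements so only markup and text remain."""
    return _SCRIPT_STYLE_RE.sub(" ", html)


def looks_like_data(body: str) -> bool:
    """Crude assumption: bodies starting like JSON or XML are data, not pages."""
    head = body.lstrip()[:64].lower()
    return head.startswith(("{", "[", "<?xml"))


def is_blocked(status: int, body: str) -> bool:
    """Return True when the response looks like an anti-bot block page."""
    # Rule 1: 403/503 with a non-data body is blocked regardless of size.
    if status in (403, 503) and not looks_like_data(body):
        return True

    # Rule 2: on large pages, strip CSS/JS first so that block text buried
    # under the SPA shell is what gets matched.
    text = body
    if len(body) >= DEEP_SCAN_MIN_BYTES:
        text = strip_scripts_and_styles(body)

    lowered = text.lower()
    return any(pattern in lowered for pattern in TIER1_PATTERNS)
```

In this sketch a 403 or 503 short-circuits before any size check, which mirrors the stated trade-off: a false positive only costs one extra fallback request, while a false negative returns a block page as if it were content.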

2026-02-20 10:07:59 +00:00

test_antibot_detector.py

Fix anti-bot detection for large SPA block pages (403/503)

2026-02-20 10:07:59 +00:00

test_chanel_cdp_proxy.py

Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode

2026-02-13 04:10:47 +00:00

test_persistent_proxy.py

Fix proxy auth for persistent browser contexts

2026-02-12 11:19:29 +00:00

test_proxy_config.py

#1057 : enhance ProxyConfig initialization to support dict and string formats