mirror of
https://github.com/unclecode/crawl4ai.git
synced 2026-06-10 07:48:50 +00:00
Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that exceed 100KB, so they bypass every size-based detection threshold. As a result, the fallback (Web Unlocker) never triggered for these sites.

Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked, regardless of page size (false positives are cheap; the fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as a Tier 1 pattern (Reddit et al.)
- Updated tests to reflect the new detection philosophy
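
A minimal sketch of the detection order described above, not the actual crawl4ai code. All names here (`is_blocked`, `looks_like_data`, `strip_scripts_and_styles`, `SCAN_WINDOW`, `DEEP_SCAN_THRESHOLD`) are hypothetical. It also assumes that "non-data" means the body is not a JSON or XML payload, and that pattern matching only looks at a bounded prefix of the text, which is one plausible reason block text behind large inline CSS/JS would otherwise be missed.

```python
import re

# Hypothetical constants; the real thresholds and pattern list may differ.
TIER1_PATTERNS = ("blocked by network security",)
DEEP_SCAN_THRESHOLD = 100_000  # characters; taken from the "100KB+" figure
SCAN_WINDOW = 10_000           # assumed: only this much text is pattern-matched

# Matches <script>...</script> and <style>...</style>, including their contents.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def strip_scripts_and_styles(html: str) -> str:
    """Remove script and style elements so only markup and visible text remain."""
    return _SCRIPT_STYLE_RE.sub(" ", html)


def looks_like_data(body: str) -> bool:
    """Assumed meaning of 'data': a JSON or XML payload rather than an HTML page."""
    return body.lstrip().startswith(("{", "[", "<?xml"))


def is_blocked(status: int, body: str) -> bool:
    # 403/503 with non-data HTML counts as blocked regardless of page size.
    if status in (403, 503) and not looks_like_data(body):
        return True

    # Deep scan: on large pages, drop scripts/styles before pattern matching
    # so block text after a large CSS/JS payload falls inside the scan window.
    text = body
    if len(body) > DEEP_SCAN_THRESHOLD:
        text = strip_scripts_and_styles(body)

    window = text[:SCAN_WINDOW].lower()
    return any(pattern in window for pattern in TIER1_PATTERNS)
```

In this sketch the status check runs first because it is the cheapest signal. A false positive there only costs one extra fallback request, which matches the trade-off stated in the first bullet.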