crawl4ai

unclecode/crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

Commit Graph

Author	SHA1	Message	Date

unclecode

254ef0510b

Fix anti-bot detection for large SPA block pages (403/503)

Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that
exceed 100KB+, bypassing all size-based detection thresholds. This caused
the fallback (Web Unlocker) to never trigger for these sites.

Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked
  regardless of page size (false positives are cheap, fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns
  on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as Tier 1 pattern (Reddit et al.)
- Updated tests to reflect new detection philosophy
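
The first three changes can be sketched as follows. This is a minimal illustration, not the code from the commit: the function names, the regex-based stripping, and the rule for what counts as "data" (JSON or XML bodies) are assumptions. Only the 403/503 rule and the quoted pattern come from the message above.

```python
import re

# Only this phrase is taken from the commit message; a real list would be longer.
TIER1_PATTERNS = ("blocked by network security",)

# Matches <script>...</script> and <style>...</style> including their contents.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def strip_scripts_and_styles(html: str) -> str:
    """Drop script/style elements so block text is not buried under CSS/JS."""
    return _SCRIPT_STYLE_RE.sub(" ", html)


def is_data_response(body: str) -> bool:
    """Assumption: bodies that start like JSON or XML count as data, not HTML."""
    return body.lstrip().startswith(("{", "[", "<?xml"))


def is_blocked(status: int, body: str) -> bool:
    # Rule 1: 403/503 with non-data HTML is blocked regardless of page size.
    if status in (403, 503) and not is_data_response(body):
        return True
    # Rule 2: deep scan, checking patterns only after stripping scripts/styles.
    text = strip_scripts_and_styles(body).lower()
    return any(pattern in text for pattern in TIER1_PATTERNS)
```

With this sketch, a 200 response whose only visible text is the block phrase is still flagged even when it carries hundreds of kilobytes of inline JavaScript.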

2026-02-20 10:07:59 +00:00

unclecode

72b546c48d

Add anti-bot detection, retry, and fallback system

Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.
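
The escalation order described above can be sketched without the library. Everything here is hypothetical: `crawl_with_escalation`, the `fetch` callback, and its `(html, blocked)` return shape are illustrative names, and the sketch assumes one initial round plus `max_retries` retry rounds. It does not model `is_fallback`.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

# Hypothetical callback types: fetch returns (html, blocked).
Fetch = Callable[[str, Optional[str]], Awaitable[Tuple[str, bool]]]
LastResort = Callable[[str], Awaitable[str]]


async def crawl_with_escalation(
    url: str,
    fetch: Fetch,
    main_proxy: Optional[str],
    fallback_proxies: List[str],
    max_retries: int = 0,
    fallback_fetch_function: Optional[LastResort] = None,
) -> Tuple[Optional[str], List[str]]:
    """Return (html or None, ordered log of what was tried)."""
    attempts: List[str] = []
    # Assumption: one initial round plus max_retries retry rounds.
    for _ in range(1 + max_retries):
        # Per round: main proxy first, then fallback proxies in order.
        for proxy in [main_proxy, *fallback_proxies]:
            attempts.append(proxy or "direct")
            html, blocked = await fetch(url, proxy)
            if not blocked:
                return html, attempts
    # After all rounds: the last-resort function, if one was supplied.
    if fallback_fetch_function is not None:
        attempts.append("fallback_fetch_function")
        return await fallback_fetch_function(url), attempts
    return None, attempts
```

Returning the attempt log alongside the HTML makes the escalation order easy to check in tests.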

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
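
A rough sketch of the tiered idea, under stated assumptions: the marker, the generic phrases, and the size threshold below are illustrative placeholders, not the values used in the commit.

```python
from typing import Optional

# Illustrative placeholders (all lowercase, matched against lowercased HTML).
STRUCTURAL_MARKERS = ('id="challenge-form"',)
GENERIC_PATTERNS = ("access denied", "too many requests")
SHORT_PAGE_MAX_CHARS = 5_000  # hypothetical cutoff for a "short error page"


def detect_block(html: str) -> Optional[str]:
    """Return the tier that fired ("structural" or "generic"), or None."""
    lowered = html.lower()
    # High-confidence structural markers trigger on pages of any size.
    if any(marker in lowered for marker in STRUCTURAL_MARKERS):
        return "structural"
    # Generic phrases count only on short pages, so a long article that
    # merely mentions "access denied" is not flagged.
    if len(html) <= SHORT_PAGE_MAX_CHARS and any(
        pattern in lowered for pattern in GENERIC_PATTERNS
    ):
        return "generic"
    return None
```

The size gate is what keeps the generic tier from producing false positives on ordinary content pages.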

2026-02-14 05:24:07 +00:00

2 Commits