crawl4ai/tests/test_cloud_bugs_batch.py at develop

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

unclecode 11b45760da fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization

- antibot_detector: add <pre> to content elements regex, detect
  browser-wrapped JSON in _looks_like_data() so httpbin-style
  responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
  patterns (/docs/*) instead of matching against full URL, which
  always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
  ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
  and edge cases
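
The prefix-match fix described above can be illustrated with a minimal sketch. This is not the actual URLPatternFilter code: `matches_prefix` is a hypothetical helper, and the assumption that a path-only pattern starts with `/` and ends with `*` is made here for illustration only.

```python
from urllib.parse import urlparse


def matches_prefix(url: str, pattern: str) -> bool:
    """Hypothetical helper: prefix-match a URL against a pattern ending in '*'."""
    prefix = pattern.rstrip("*")
    if prefix.startswith("/"):
        # Path-only pattern such as /docs/*: compare against the URL's path.
        # Comparing against the full URL can never succeed, because the full
        # URL starts with the scheme ("https://..."), not with "/".
        return urlparse(url).path.startswith(prefix)
    # Full-URL pattern: compare against the whole URL string.
    return url.startswith(prefix)


if __name__ == "__main__":
    print(matches_prefix("https://example.com/docs/intro", "/docs/*"))
    print(matches_prefix("https://example.com/docs/intro", "https://example.com/docs/*"))
    print(matches_prefix("https://example.com/blog/post", "/docs/*"))
```

Branching on the leading `/` keeps full-URL prefixes working unchanged while giving path-only patterns something they can actually match.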

2026-03-09 14:52:58 +00:00

22 KiB
