233 Commits

Author SHA1 Message Date
unclecode
9d5bcf78e2 feat: Add DomainMapper for comprehensive domain URL discovery
Add DomainMapper class that discovers all URLs under a domain using
8 sources: sitemap, Common Crawl, Wayback Machine, Certificate
Transparency (crt.sh), path probing, robots.txt mining, RSS/Atom
feeds, and homepage link extraction.

Key features:
- Subdomain discovery via crt.sh, Wayback, CC, and DNS guessing
- Soft-404 detection: fingerprints SPA sites and filters fake pages
- Per-host scanning with parallel execution across discovered hosts
- URL normalization, deduplication, and source attribution
- BM25 relevance scoring with head metadata extraction
- Nonsense filter for static assets, webpack chunks, Wayback garbage

For superdesign.dev: finds 171 URLs across 11 hosts in ~13s
(vs 4 URLs from AsyncUrlSeeder)

New files:
- crawl4ai/domain_mapper.py (DomainMapper class)
- crawl4ai/async_configs.py (DomainMapperConfig)
- docs/md_v2/core/domain-mapping.md (documentation)
- docs/examples/domain_mapper/domain_mapper_demo.py
- 67 tests across unit/integration/adversarial/regression

(cherry picked from commit 2d10534a8742177f1d5f521e3174ae66591d3533)
2026-06-01 12:58:23 +00:00
unclecode
fcaf08b3b3 merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7 2026-06-01 12:40:37 +00:00
Nasrin
be71585239 Merge pull request #1969 from cgseyhan/fix/async-logger-stderr-mcp-1968
fix: route AsyncLogger output to stderr by default (fixes #1968)
2026-05-25 12:20:01 +02:00
cemgo
944eb1e456 fix(logger): route AsyncLogger output to stderr by default
MCP stdio transport uses stdout for JSON-RPC messages. AsyncLogger
was writing Rich progress output to stdout (the default Console()
target), which caused clients to receive garbled JSON and log lines
interleaved in the same stream.

Changes:
- Pass stderr=True to Console() so all log output goes to stderr,
  which is the correct channel for library diagnostics and aligns
  with the behaviour of Python's own logging.StreamHandler.
- Add an injectable console parameter so downstream wrappers
  (e.g. mcp-crawl4ai, FastMCP integrations) can override the target
  stream without monkey-patching.
- Add import sys (used in docstring example).
- Add tests/test_async_logger_stderr.py with 7 tests covering the
  default-to-stderr behaviour, custom console injection, verbose=False
  suppression, file logging, and an end-to-end MCP scenario.

Fixes #1968
2026-05-14 14:13:30 +03:00
Nasrin
35ee366e28 Merge pull request #1901 from hafezparast/fix/maysam-arun-type-hint-1898
fix: correct arun() return type annotation (#1898)
2026-04-24 18:33:41 +02:00
Nasrin
936e4470eb Merge pull request #1845 from hafezparast/fix/maysam-mermaid-svg-text-1043
fix: preserve mermaid diagram text from SVGs during scraping (#1043)
2026-04-24 16:49:59 +02:00
unclecode
1e25edcb5c fix(security): block IPv6-mapped IPv4 SSRF bypass
Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed
validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which
is not in any IPv4 blocklist (127.0.0.0/8) nor IPv6 blocklist (::1/128).

Fix: added _expand_ip_candidates() helper that unwraps IPv4 from
IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible
(::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks
both the original IP and the unwrapped IPv4 form.

Added 6 new TestIPv6MappedBypass tests covering:
- Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping
- IPv4-compatible variant (::127.0.0.1)
- Regression test that plain ::1 still blocked

Also updated stale test assertion in test_eval_security_adversarial:
hasattr, type, __build_class__ were removed from hook builtins in
batch 2 but the test still expected hasattr to remain.

DO NOT PUSH until release day.
2026-04-20 10:10:59 +00:00
ntohidi
3d4bda122a fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917)
ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.

Replaced with set(False) which works across context boundaries. Safe
because deep_crawl_active is never nested — the guard on line 21
prevents re-entry.
2026-04-16 13:49:32 +08:00
hafezparast
c5612f7551 fix: correct arun() return type from RunManyReturn to CrawlResultContainer (#1898)
arun() always returns CrawlResultContainer, never AsyncGenerator. The
RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused
Pylance/Pyright to flag result.markdown as an error because AsyncGenerator
doesn't have that attribute.

Also adds test_type_annotations.py — 11 static analysis tests that catch
annotation mismatches (return types, missing annotations, export checks)
without needing pyright in CI. Would have caught this bug before it was
reported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 21:35:17 +08:00
Nasrin
3d02d75edb Merge pull request #1852 from hafezparast/feat/maysam-arun-many-config-list-1837
feat: expose arun_many config-list support in Docker API (#1837)
2026-04-06 10:26:44 +02:00
unclecode
e326da9166 fix(security): complete AST sandbox escape remediation (CVSS 9.8)
Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng).

- Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from
  extraction_strategy.py. Dead security-sensitive code is a liability.
  The eval path was already disabled; this removes the function itself.
- Fix hook_manager.py module injection: replace broken exec("import X", ns)
  pattern (silently failed due to missing __import__) with direct module
  injection. Sanitize asyncio to strip subprocess access (RCE vector).
- Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints
  unauthenticated).
- Expand adversarial test suite to 87 tests: hook sandbox escapes,
  asyncio.subprocess RCE verification, end-to-end exploit payload from
  vuln report, dead code deletion checks, codebase eval/exec audit.
2026-03-31 13:01:57 +00:00
unclecode
2fc39cbe89 fix(security): remove eval() from computed fields, harden config deserializer
- Disable eval() in _compute_field expression path (RCE vector via untrusted input).
  Expression key now logs warning and returns default; function key still works.
- Harden _safe_eval_config in server.py with name/attribute allowlists,
  block lambdas, generators, comprehensions in constructor args.
- Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors).
- Add 67 adversarial security tests covering all eval/exec attack surfaces.

Closes #1886, closes #1855
2026-03-31 12:02:43 +00:00
hafezparast
e9f832274e fix: validate markdown_generator type in CrawlerRunConfig to catch bad JSON format (#1880)
When the Docker API receives markdown_generator as JSON with "options"
instead of "params", from_serializable_dict silently passes the raw
dict through. This later crashes with a confusing "'dict' object has
no attribute 'generate_markdown'" deep in the crawl pipeline.

Add type validation for markdown_generator in CrawlerRunConfig.__init__
(matching existing extraction_strategy/chunking_strategy validation).
When a dict slips through, the error now clearly states:
- What type was expected vs received
- That "params" is the required key (not "options")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 07:39:28 +08:00
Nasrin
1a40ccf093 Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842
fix: improve browser None guard in create_browser_context (#1842)
2026-03-24 11:37:46 +01:00
Nasrin
6eb2530bd9 Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848
fix: skip non-allowlisted types in serialization/deserialization (#1848)
2026-03-24 11:36:03 +01:00
hafezparast
8995c1bbd6 feat: expose arun_many config-list support in Docker API (#1837)
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:56:53 +08:00
hafezparast
219416e49d fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:55:41 +08:00
hafezparast
e603e4a722 fix: skip non-allowlisted types in serialization/deserialization (#1848)
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:17:02 +08:00
hafezparast
2fd0f4c6a7 fix: preserve mermaid diagram text from SVGs during scraping (#1043)
Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 11:43:54 +08:00
hafezparast
310b52b663 fix: improve browser None guard in create_browser_context (#1842)
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 10:45:38 +08:00
unclecode
9b571bb947 feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.

Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
2026-03-16 14:03:43 +00:00
Nasrin
648f36b622 Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817
fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
2026-03-13 03:59:28 +01:00
Nasrin
6e4299577f Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484
fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
2026-03-13 03:38:15 +01:00
hafezparast
8de83a3590 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:00:33 +08:00
unclecode
a73bc1c076 fix: MCP SSE endpoint crash — mount via raw ASGI Route (#1594)
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.

Also update CONTRIBUTORS.md for PR #1829.
2026-03-12 11:22:48 +00:00
hafezparast
3f481e9e5c fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762)
- #1370: Freeze element dimensions via CSS before viewport resize in
  take_screenshot_scroller() to prevent responsive reflow on Elementor
  sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
  to abort pending loads; move event listener cleanup outside session_id
  guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
  set — call arun() directly per URL so the DeepCrawlDecorator can
  invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
  save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:17:13 +08:00
hafezparast
480d938f67 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:53:04 +08:00
Nasrin
d907e167a5 Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750
fix: screenshot respects scan_full_page=False (#1750)
2026-03-12 07:39:52 +01:00
Maysam Hafezparast
57b0d09934 fix: deduplicate BM25ContentFilter output (#1213) (#1824)
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:23:34 +08:00
hafezparast
6efbffe345 fix: screenshot respects scan_full_page=False (#1750)
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.

Includes 16 tests (8 unit + 8 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 12:04:45 +08:00
unclecode
11b45760da fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization
- antibot_detector: add <pre> to content elements regex, detect
  browser-wrapped JSON in _looks_like_data() so httpbin-style
  responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
  patterns (/docs/*) instead of matching against full URL, which
  always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
  ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
  and edge cases
2026-03-09 14:52:58 +00:00
unclecode
d788c28315 test: add comprehensive regression test suite (291 tests)
Full regression suite covering all major Crawl4AI subsystems:
- core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks)
- content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata)
- extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction)
- deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization)
- browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes)
- config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips)
- utilities (extract_xml_data, cache modes, content hashing)
- edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery)

Also adds /c4ai-check slash command for testing changes against the suite.
2026-03-08 03:20:52 +00:00
unclecode
3a75dd3f4c fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)
- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
2026-03-07 09:47:38 +00:00
unclecode
7c0cc3ed88 fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668)
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
2026-03-07 08:45:11 +00:00
unclecode
a4cc0a9f04 feat: add separate query_llm_config for adaptive crawler query expansion (#1682)
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
2026-02-25 12:26:39 +00:00
unclecode
c0912f7234 feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.

Inspired by #1689.
2026-02-25 07:12:28 +00:00
Nasrin
d419199a4c Merge pull request #1775 from unclecode/fix/issue-1748-screenshot-scroll-delay
Fix/issue 1748 screenshot scroll delay
2026-02-25 05:54:24 +01:00
Ahmed-tawfik94
cd81e3cd19 Fix scroll_delay ignored in take_screenshot_scroller for full-page screenshots 2026-02-25 06:52:53 +03:00
Nasrin
4f9cc0810b Merge pull request #1764 from PatD42/fix/table-gfm-pipes
Fix: Add leading/trailing pipes to GFM tables (pad_tables=False)
2026-02-25 03:32:54 +01:00
Nasrin
c4cdc02e27 Merge pull request #1761 from AtharvaJaiswal005/fix/total-score-missing-for-failed-head-extraction-1749
Fix total_score not calculated for links that fail head extraction
2026-02-25 02:25:22 +01:00
unclecode
1a9f68d825 Fix cascading context crash from duplicate add_init_script (#1768)
context.add_init_script() was called in both setup_context() and
_crawl_web(), causing unbounded script accumulation on shared contexts
under concurrent load. Chromium kills the overloaded context, cascading
"Target page, context or browser has been closed" to all concurrent crawls.

Add flag-based dedup: after injecting navigator_overrider or shadow-DOM
scripts, set _crawl4ai_nav_overrider_injected / _crawl4ai_shadow_dom_injected
on the context. Before injecting, check the flag. This preserves context-level
scope (popups/iframes covered) and the fallback for managed/persistent/CDP
paths where setup_context() runs without crawlerRunConfig.
2026-02-24 09:45:18 +00:00
unclecode
254ef0510b Fix anti-bot detection for large SPA block pages (403/503)
Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that
exceed 100KB+, bypassing all size-based detection thresholds. This caused
the fallback (Web Unlocker) to never trigger for these sites.

Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked
  regardless of page size (false positives are cheap, fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns
  on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as Tier 1 pattern (Reddit et al.)
- Updated tests to reflect new detection philosophy
2026-02-20 10:07:59 +00:00
unclecode
94a77eea30 Move test_repro_1640.py to tests/browser/ 2026-02-19 06:33:46 +00:00
unclecode
c9cb0160cf Add token usage tracking to generate_schema / agenerate_schema
generate_schema can make up to 5 internal LLM calls (field inference,
schema generation, validation retries) with no way to track token
consumption. Add an optional `usage: TokenUsage = None` parameter that
accumulates prompt/completion/total tokens across all calls in-place.

- _infer_target_json: accept and populate usage accumulator
- agenerate_schema: track usage after every aperform_completion call
  in the retry loop, forward usage to _infer_target_json
- generate_schema (sync): forward usage to agenerate_schema

Fully backward-compatible — omitting usage changes nothing.
2026-02-18 06:44:17 +00:00
unclecode
8576331d4e Add Shadow DOM flattening and reorder js_code execution pipeline
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
2026-02-18 06:43:00 +00:00
Patrick
c70ab31abd fix: add leading/trailing pipes to GFM tables (pad_tables=False)
When pad_tables=False (default), html2text generated table rows without
leading/trailing pipe delimiters, producing non-compliant GFM markdown:

  Before: A | B | C
  After:  | A | B | C |

Changes:
- Add leading pipe on first cell, spaced pipe between cells
- Add trailing pipe at end of each row
- Format separator as | --- | --- | instead of ---|---
- Ensure table starts on its own line (soft_br at <table>)
- Handle <caption> element to prevent inline merge with header row
- All changes guarded by `not self.pad_tables` — pad_tables mode unchanged

Includes 13 unit tests covering GFM compliance and pad_tables regression.

Fixes: #1731
2026-02-17 21:14:36 -05:00
unclecode
d267c650cb Add source (sibling selector) support to JSON extraction strategies
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
2026-02-17 09:04:40 +00:00
Atharva Jaiswal
094242d4a7 Fix total_score not calculated for links that fail head extraction
The _merge_head_data() function only called calculate_total_score() for
links present in url_to_head_data. Links that failed head extraction
(PDFs, timeouts, non-HTML) hit the else branch and were appended
unchanged, leaving total_score as None even when intrinsic_score was
available.

Added calculate_total_score() calls in both else branches (internal
and external links) so all links get a total_score computed from their
intrinsic_score when head data is unavailable.

Fixes #1749
2026-02-16 20:41:30 +05:30
unclecode
72b546c48d Add anti-bot detection, retry, and fallback system
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
2026-02-14 05:24:07 +00:00
unclecode
fdd989785f Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.

Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
  the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
  locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
  via SwiftShader when stealth mode is active (missing WebGL is a
  detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
2026-02-13 04:10:47 +00:00