crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 07:48:50 +00:00

Author	SHA1	Message	Date
unclecode	9d5bcf78e2	feat: Add DomainMapper for comprehensive domain URL discovery Add DomainMapper class that discovers all URLs under a domain using 8 sources: sitemap, Common Crawl, Wayback Machine, Certificate Transparency (crt.sh), path probing, robots.txt mining, RSS/Atom feeds, and homepage link extraction. Key features: - Subdomain discovery via crt.sh, Wayback, CC, and DNS guessing - Soft-404 detection: fingerprints SPA sites and filters fake pages - Per-host scanning with parallel execution across discovered hosts - URL normalization, deduplication, and source attribution - BM25 relevance scoring with head metadata extraction - Nonsense filter for static assets, webpack chunks, Wayback garbage For superdesign.dev: finds 171 URLs across 11 hosts in ~13s (vs 4 URLs from AsyncUrlSeeder) New files: - crawl4ai/domain_mapper.py (DomainMapper class) - crawl4ai/async_configs.py (DomainMapperConfig) - docs/md_v2/core/domain-mapping.md (documentation) - docs/examples/domain_mapper/domain_mapper_demo.py - 67 tests across unit/integration/adversarial/regression (cherry picked from commit 2d10534a8742177f1d5f521e3174ae66591d3533)	2026-06-01 12:58:23 +00:00
unclecode	fcaf08b3b3	merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7	2026-06-01 12:40:37 +00:00
Nasrin	be71585239	Merge pull request #1969 from cgseyhan/fix/async-logger-stderr-mcp-1968 fix: route AsyncLogger output to stderr by default (fixes #1968)	2026-05-25 12:20:01 +02:00
cemgo	944eb1e456	fix(logger): route AsyncLogger output to stderr by default MCP stdio transport uses stdout for JSON-RPC messages. AsyncLogger was writing Rich progress output to stdout (the default Console() target), which caused clients to receive garbled JSON and log lines interleaved in the same stream. Changes: - Pass stderr=True to Console() so all log output goes to stderr, which is the correct channel for library diagnostics and aligns with the behaviour of Python's own logging.StreamHandler. - Add an injectable console parameter so downstream wrappers (e.g. mcp-crawl4ai, FastMCP integrations) can override the target stream without monkey-patching. - Add import sys (used in docstring example). - Add tests/test_async_logger_stderr.py with 7 tests covering the default-to-stderr behaviour, custom console injection, verbose=False suppression, file logging, and an end-to-end MCP scenario. Fixes #1968	2026-05-14 14:13:30 +03:00
Nasrin	35ee366e28	Merge pull request #1901 from hafezparast/fix/maysam-arun-type-hint-1898 fix: correct arun() return type annotation (#1898)	2026-04-24 18:33:41 +02:00
Nasrin	936e4470eb	Merge pull request #1845 from hafezparast/fix/maysam-mermaid-svg-text-1043 fix: preserve mermaid diagram text from SVGs during scraping (#1043)	2026-04-24 16:49:59 +02:00
unclecode	1e25edcb5c	fix(security): block IPv6-mapped IPv4 SSRF bypass Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which is not in any IPv4 blocklist (127.0.0.0/8) nor IPv6 blocklist (::1/128). Fix: added _expand_ip_candidates() helper that unwraps IPv4 from IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible (::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks both the original IP and the unwrapped IPv4 form. Added 6 new TestIPv6MappedBypass tests covering: - Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping - IPv4-compatible variant (::127.0.0.1) - Regression test that plain ::1 still blocked Also updated stale test assertion in test_eval_security_adversarial: hasattr, type, __build_class__ were removed from hook builtins in batch 2 but the test still expected hasattr to remain. DO NOT PUSH until release day.	2026-04-20 10:10:59 +00:00
ntohidi	3d4bda122a	fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917 ) ContextVar.reset(token) requires the same Context that created the token. When Starlette's StreamingResponse consumes the async generator in a different Task, the Context changes and reset() raises ValueError. Replaced with set(False) which works across context boundaries. Safe because deep_crawl_active is never nested — the guard on line 21 prevents re-entry.	2026-04-16 13:49:32 +08:00
hafezparast	c5612f7551	fix: correct arun() return type from RunManyReturn to CrawlResultContainer (#1898 ) arun() always returns CrawlResultContainer, never AsyncGenerator. The RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused Pylance/Pyright to flag result.markdown as an error because AsyncGenerator doesn't have that attribute. Also adds test_type_annotations.py — 11 static analysis tests that catch annotation mismatches (return types, missing annotations, export checks) without needing pyright in CI. Would have caught this bug before it was reported. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 21:35:17 +08:00
Nasrin	3d02d75edb	Merge pull request #1852 from hafezparast/feat/maysam-arun-many-config-list-1837 feat: expose arun_many config-list support in Docker API (#1837)	2026-04-06 10:26:44 +02:00
unclecode	e326da9166	fix(security): complete AST sandbox escape remediation (CVSS 9.8) Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng). - Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from extraction_strategy.py. Dead security-sensitive code is a liability. The eval path was already disabled; this removes the function itself. - Fix hook_manager.py module injection: replace broken exec("import X", ns) pattern (silently failed due to missing __import__) with direct module injection. Sanitize asyncio to strip subprocess access (RCE vector). - Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints unauthenticated). - Expand adversarial test suite to 87 tests: hook sandbox escapes, asyncio.subprocess RCE verification, end-to-end exploit payload from vuln report, dead code deletion checks, codebase eval/exec audit.	2026-03-31 13:01:57 +00:00
unclecode	2fc39cbe89	fix(security): remove eval() from computed fields, harden config deserializer - Disable eval() in _compute_field expression path (RCE vector via untrusted input). Expression key now logs warning and returns default; function key still works. - Harden _safe_eval_config in server.py with name/attribute allowlists, block lambdas, generators, comprehensions in constructor args. - Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors). - Add 67 adversarial security tests covering all eval/exec attack surfaces. Closes #1886, closes #1855	2026-03-31 12:02:43 +00:00
hafezparast	e9f832274e	fix: validate markdown_generator type in CrawlerRunConfig to catch bad JSON format (#1880) When the Docker API receives markdown_generator as JSON with "options" instead of "params", from_serializable_dict silently passes the raw dict through. This later crashes with a confusing "'dict' object has no attribute 'generate_markdown'" deep in the crawl pipeline. Add type validation for markdown_generator in CrawlerRunConfig.__init__ (matching existing extraction_strategy/chunking_strategy validation). When a dict slips through, the error now clearly states: - What type was expected vs received - That "params" is the required key (not "options") Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 07:39:28 +08:00
Nasrin	1a40ccf093	Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842 fix: improve browser None guard in create_browser_context (#1842)	2026-03-24 11:37:46 +01:00
Nasrin	6eb2530bd9	Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848 fix: skip non-allowlisted types in serialization/deserialization (#1848)	2026-03-24 11:36:03 +01:00
hafezparast	8995c1bbd6	feat: expose arun_many config-list support in Docker API (#1837) The /crawl endpoint now accepts an optional crawler_configs field (list of CrawlerRunConfig dicts) alongside the existing crawler_config. When provided with multiple URLs, each config is deserialized and passed as a list to arun_many(), enabling per-URL configuration with url_matcher patterns. Single-URL requests and requests without crawler_configs are unchanged (backward compatible). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:56:53 +08:00
hafezparast	219416e49d	fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850 ) Starlette's Route wraps async functions in request_response(), calling handler(request) instead of handler(scope, receive, send). This broke the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable class instead of an async function — Route passes class instances through as raw ASGI apps without wrapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:55:41 +08:00
hafezparast	e603e4a722	fix: skip non-allowlisted types in serialization/deserialization (#1848) to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES (returns None), preventing objects like logging.Logger from being serialized as {"type": "Logger", "params": {...}} which then fails deserialization. from_serializable_dict returns None for unknown types instead of raising ValueError, handling payloads from older clients. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:17:02 +08:00
hafezparast	2fd0f4c6a7	fix: preserve mermaid diagram text from SVGs during scraping (#1043) Mermaid diagrams rendered as SVGs were completely stripped during HTML cleaning, losing all text content. Now detects SVGs with id="mermaid-*", extracts node/edge labels, and replaces the SVG with a fenced mermaid code block containing the diagram type and extracted text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 11:43:54 +08:00
hafezparast	310b52b663	fix: improve browser None guard in create_browser_context (#1842) The existing guard assumed self.browser=None only meant persistent context mode. In reality, the browser can be None because it was closed by the janitor, crashed, or never started. This caused a misleading error message. Now the guard distinguishes between persistent context and closed/crashed browser with appropriate messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:45:38 +08:00
unclecode	9b571bb947	feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.) The HTTP crawler strategy now checks Content-Type and Content-Disposition headers to detect non-HTML file responses. When a file download is detected, raw bytes are saved to disk and the path is returned via downloaded_files. Text-based files (CSV, JSON, XML) also populate the html field for backward compatibility. Binary files (PDF, images) set html to empty string — content is only available via downloaded_files. Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).	2026-03-16 14:03:43 +00:00
Nasrin	648f36b622	Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)	2026-03-13 03:59:28 +01:00
Nasrin	6e4299577f	Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)	2026-03-13 03:38:15 +01:00
hafezparast	8de83a3590	fix: css_selector ignored in LXML scraping for raw:// URLs (#1484 ) css_selector was skipped in _scrap() — only target_elements was applied. Now css_selector filters the DOM first, then target_elements narrows within that selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:00:33 +08:00
unclecode	a73bc1c076	fix: MCP SSE endpoint crash — mount via raw ASGI Route (#1594 ) Replace @app.get() with starlette.routing.Route() for the SSE handler. The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send) internally, which conflicts with Starlette's middleware wrapping. Also update CONTRIBUTORS.md for PR #1829.	2026-03-12 11:22:48 +00:00
hafezparast	3f481e9e5c	fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370 , #1818 , #1509 , #1762 ) - #1370: Freeze element dimensions via CSS before viewport resize in take_screenshot_scroller() to prevent responsive reflow on Elementor sites; restore original viewport after capture. - #1818: Call window.stop() on session-reused pages before navigation to abort pending loads; move event listener cleanup outside session_id guard so listeners don't accumulate across reuses. - #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is set — call arun() directly per URL so the DeepCrawlDecorator can invoke the strategy (dispatcher crashes on List[CrawlResult] return). - #1762: Add encoding="utf-8" to the remaining open() call in save_global_config() (cli.py line 58). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 18:17:13 +08:00
hafezparast	480d938f67	fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817) - #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional provider, temperature, base_url query params with fallback to server config. Consistent with /md and /llm/job endpoints. - #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT, REDIS_PASSWORD environment variable overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:53:04 +08:00
Nasrin	d907e167a5	Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750 fix: screenshot respects scan_full_page=False (#1750)	2026-03-12 07:39:52 +01:00
Maysam Hafezparast	57b0d09934	fix: deduplicate BM25ContentFilter output (#1213) (#1824) BM25ContentFilter.filter_content() returned duplicate text chunks when the same content appeared in multiple DOM elements. Added exact-text deduplication after threshold filtering, keeping the first occurrence in document order. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:23:34 +08:00
hafezparast	6efbffe345	fix: screenshot respects scan_full_page=False (#1750 ) take_screenshot() ignored the scan_full_page config flag — tall pages always got a full-page screenshot even when scan_full_page=False. Now passes scan_full_page through to take_screenshot() and uses viewport-only capture when False. Includes 16 tests (8 unit + 8 integration). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 12:04:45 +08:00
unclecode	11b45760da	fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization - antibot_detector: add <pre> to content elements regex, detect browser-wrapped JSON in _looks_like_data() so httpbin-style responses are not flagged as blocked - deep_crawling/filters: use urlparse().path for path-only prefix patterns (/docs/*) instead of matching against full URL, which always failed; full-URL prefixes still match correctly - async_configs: add PDFContentScrapingStrategy to ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it - __init__: export PDFContentScrapingStrategy for type resolution - tests: add 86-test suite covering all three fixes with adversarial and edge cases	2026-03-09 14:52:58 +00:00
unclecode	d788c28315	test: add comprehensive regression test suite (291 tests) Full regression suite covering all major Crawl4AI subsystems: - core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks) - content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata) - extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction) - deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization) - browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes) - config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips) - utilities (extract_xml_data, cache modes, content hashing) - edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery) Also adds /c4ai-check slash command for testing changes against the suite.	2026-03-08 03:20:52 +00:00
unclecode	3a75dd3f4c	fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758) - #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance) - #1489: Preserve query parameter key casing in normalize_url - #1374: Close NamedTemporaryFile handle before reopening (Windows fix) - #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1) - #1183: Fix extract_xml_data regex matching tag names in prose text - #1354: Make import_knowledge_base async (fix asyncio.run in running loop) - #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences) - #1031: Make Docker playground code editor resizable with overflow-auto - #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes - #1758: Change screenshot stitching format from BMP to PNG	2026-03-07 09:47:38 +00:00
unclecode	7c0cc3ed88	fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668) Bug fixes: - Verify redirect targets are alive before returning from URL seeder (#1622) - Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786) - Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796) Security/Docker: - Require api_token for /token endpoint when configured (#1795) - Deep-crawl streaming now mirrors Python library behavior via arun() (#1798) CI: - Bump GitHub Actions to latest versions - checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734) Features: - Support type-list pipeline in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290) - Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)	2026-03-07 08:45:11 +00:00
unclecode	a4cc0a9f04	feat: add separate query_llm_config for adaptive crawler query expansion (#1682 ) The embedding strategy uses two incompatible API call types: embedding calls (text-to-vector) and query expansion (chat completion). Previously both used a single embedding_llm_config, so setting an embedding model broke query expansion and vice versa. Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users can specify separate models for each call type. Fallback chain preserves backward compatibility: query_llm_config -> llm_config -> hardcoded defaults. Also fixes base_url and backoff params not being passed to perform_completion_with_backoff in query expansion, and simplifies _embedding_llm_config_dict to use LLMConfig.to_dict() (which includes the 3 backoff fields the manual extraction was missing). Inspired by PR #1683 from @sthakrar — thank you for identifying the issue and proposing the initial approach.	2026-02-25 12:26:39 +00:00
unclecode	c0912f7234	feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker domains and CSS resources at the browser context level. Refactor crawler pool with release_crawler() and active_requests tracking to prevent janitor from closing browsers with in-flight requests. Add proper finally blocks to all Docker API/server handlers. Update docs for new config options. Inspired by #1689.	2026-02-25 07:12:28 +00:00
Nasrin	d419199a4c	Merge pull request #1775 from unclecode/fix/issue-1748-screenshot-scroll-delay Fix/issue 1748 screenshot scroll delay	2026-02-25 05:54:24 +01:00
Ahmed-tawfik94	cd81e3cd19	Fix scroll_delay ignored in take_screenshot_scroller for full-page screenshots	2026-02-25 06:52:53 +03:00
Nasrin	4f9cc0810b	Merge pull request #1764 from PatD42/fix/table-gfm-pipes Fix: Add leading/trailing pipes to GFM tables (pad_tables=False)	2026-02-25 03:32:54 +01:00
Nasrin	c4cdc02e27	Merge pull request #1761 from AtharvaJaiswal005/fix/total-score-missing-for-failed-head-extraction-1749 Fix total_score not calculated for links that fail head extraction	2026-02-25 02:25:22 +01:00
unclecode	1a9f68d825	Fix cascading context crash from duplicate add_init_script (#1768) context.add_init_script() was called in both setup_context() and _crawl_web(), causing unbounded script accumulation on shared contexts under concurrent load. Chromium kills the overloaded context, cascading "Target page, context or browser has been closed" to all concurrent crawls. Add flag-based dedup: after injecting navigator_overrider or shadow-DOM scripts, set _crawl4ai_nav_overrider_injected / _crawl4ai_shadow_dom_injected on the context. Before injecting, check the flag. This preserves context-level scope (popups/iframes covered) and the fallback for managed/persistent/CDP paths where setup_context() runs without crawlerRunConfig.	2026-02-24 09:45:18 +00:00
unclecode	254ef0510b	Fix anti-bot detection for large SPA block pages (403/503) Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that exceed 100KB+, bypassing all size-based detection thresholds. This caused the fallback (Web Unlocker) to never trigger for these sites. Changes: - HTTP 403/503 with non-data HTML is now always treated as blocked regardless of page size (false positives are cheap, fallback rescues them) - Added Tier 1 deep scan: strips scripts/styles before checking patterns on large pages, catching block text buried under 100KB+ of CSS/JS - Added "blocked by network security" as Tier 1 pattern (Reddit et al.) - Updated tests to reflect new detection philosophy	2026-02-20 10:07:59 +00:00
unclecode	94a77eea30	Move test_repro_1640.py to tests/browser/	2026-02-19 06:33:46 +00:00
unclecode	c9cb0160cf	Add token usage tracking to generate_schema / agenerate_schema generate_schema can make up to 5 internal LLM calls (field inference, schema generation, validation retries) with no way to track token consumption. Add an optional `usage: TokenUsage = None` parameter that accumulates prompt/completion/total tokens across all calls in-place. - _infer_target_json: accept and populate usage accumulator - agenerate_schema: track usage after every aperform_completion call in the retry loop, forward usage to _infer_target_json - generate_schema (sync): forward usage to agenerate_schema Fully backward-compatible — omitting usage changes nothing.	2026-02-18 06:44:17 +00:00
unclecode	8576331d4e	Add Shadow DOM flattening and reorder js_code execution pipeline - Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes shadow DOM content into the light DOM before HTML capture. Uses a recursive serializer that resolves <slot> projections and strips only shadow-scoped <style> tags. Also injects an init script to force-open closed shadow roots via attachShadow patching. - Move `js_code` execution to after `wait_for` + `delay_before_return_html` so user scripts run on the fully-hydrated page. Add `js_code_before_wait` for the less common case of triggering loading before waiting. - Add JS snippet (flatten_shadow_dom.js), integration test, example, and documentation across all relevant doc files.	2026-02-18 06:43:00 +00:00
Patrick	c70ab31abd	fix: add leading/trailing pipes to GFM tables (pad_tables=False) When pad_tables=False (default), html2text generated table rows without leading/trailing pipe delimiters, producing non-compliant GFM markdown: Before: A \| B \| C After: \| A \| B \| C \| Changes: - Add leading pipe on first cell, spaced pipe between cells - Add trailing pipe at end of each row - Format separator as \| --- \| --- \| instead of ---\|--- - Ensure table starts on its own line (soft_br at <table>) - Handle <caption> element to prevent inline merge with header row - All changes guarded by `not self.pad_tables` — pad_tables mode unchanged Includes 13 unit tests covering GFM compliance and pad_tables regression. Fixes: #1731	2026-02-17 21:14:36 -05:00
unclecode	d267c650cb	Add source (sibling selector) support to JSON extraction strategies Many sites (e.g. Hacker News) split a single item's data across sibling elements. Field selectors only search descendants, making sibling data unreachable. The new "source" field key navigates to a sibling element before running the selector: {"source": "+ tr"} finds the next sibling <tr>, then extracts from there. - Add _resolve_source abstract method to JsonElementExtractionStrategy - Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS) - Modify _extract_field to resolve source before type dispatch - Update CSS and XPath LLM prompts with source docs and HN example - Default generate_schema validate=True so schemas are checked on creation - Add schema validation with feedback loop for auto-refinement - Add messages param to completion helpers for multi-turn refinement - Document source field and schema validation in docs - Add 14 unit tests covering CSS, XPath, backward compat, edge cases	2026-02-17 09:04:40 +00:00
Atharva Jaiswal	094242d4a7	Fix total_score not calculated for links that fail head extraction The _merge_head_data() function only called calculate_total_score() for links present in url_to_head_data. Links that failed head extraction (PDFs, timeouts, non-HTML) hit the else branch and were appended unchanged, leaving total_score as None even when intrinsic_score was available. Added calculate_total_score() calls in both else branches (internal and external links) so all links get a total_score computed from their intrinsic_score when head data is unavailable. Fixes #1749	2026-02-16 20:41:30 +05:30
unclecode	72b546c48d	Add anti-bot detection, retry, and fallback system Automatically detect when crawls are blocked by anti-bot systems (Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and escalate through configurable retry and fallback strategies. New features on CrawlerRunConfig: - max_retries: retry rounds when blocking is detected - fallback_proxy_configs: list of fallback proxies tried each round - fallback_fetch_function: async last-resort function returning raw HTML New field on ProxyConfig: - is_fallback: skip proxy on first attempt, activate only when blocked Escalation chain per round: main proxy → fallback proxies in order. After all rounds: fallback_fetch_function as last resort. Detection uses tiered heuristics — structural HTML markers (high confidence) trigger on any page, generic patterns only on short error pages to avoid false positives.	2026-02-14 05:24:07 +00:00
unclecode	fdd989785f	Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode Fix a bug where magic mode and per-request UA overrides would change the User-Agent header without updating the sec-ch-ua (browser hint) header to match. Anti-bot systems like Akamai detect this mismatch as a bot signal. Changes: - Regenerate browser_hint via UAGen.generate_client_hints() whenever the UA is changed at crawl time (magic mode or explicit override) - Re-apply updated headers to the page via set_extra_http_headers() - Skip per-crawl UA override for persistent contexts where the UA is locked at launch time by Playwright's protocol layer - Move --disable-gpu flags behind enable_stealth check so WebGL works via SwiftShader when stealth mode is active (missing WebGL is a detectable headless signal) - Clean up old test scripts, add clean anti-bot test	2026-02-13 04:10:47 +00:00
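
Note on commit 1e25edcb5c (IPv6-mapped IPv4 SSRF bypass): the unwrapping idea described in that message can be sketched with the standard-library `ipaddress` module. This is a minimal illustration under stated assumptions, not the project's code: the names `expand_ip_candidates`, `is_blocked`, and `BLOCKED_NETWORKS` are hypothetical, and the blocklist below is a small sample rather than the real one.

```python
import ipaddress

# Sample blocklist for illustration only (assumption, not the project's list).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(n)
    for n in (
        "127.0.0.0/8",     # IPv4 loopback
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
        "169.254.0.0/16",  # link-local (cloud metadata)
        "::1/128",         # IPv6 loopback
    )
]


def expand_ip_candidates(raw):
    """Return the parsed address plus any IPv4 address embedded in it."""
    ip = ipaddress.ip_address(raw)
    candidates = [ip]
    if isinstance(ip, ipaddress.IPv6Address):
        if ip.ipv4_mapped is not None:
            # ::ffff:X.Y.Z.W (IPv4-mapped)
            candidates.append(ip.ipv4_mapped)
        elif int(ip) >> 32 == 0 and int(ip) > 1:
            # ::X.Y.Z.W (IPv4-compatible): take the low 32 bits.
            # :: and ::1 are excluded so they keep their IPv6 meaning.
            candidates.append(ipaddress.IPv4Address(int(ip) & 0xFFFFFFFF))
    return candidates


def is_blocked(raw):
    """True if the address or its unwrapped IPv4 form is in the blocklist."""
    return any(
        candidate in network
        for candidate in expand_ip_candidates(raw)
        for network in BLOCKED_NETWORKS
    )
```

Membership tests between an IPv6 address and an IPv4 network are always false in `ipaddress`, which is why the original address alone slips past an IPv4 blocklist and the unwrapped form has to be checked as well.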


233 Commits
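
Note on commit 3d4bda122a (ContextVar `set(False)` instead of `reset(token)`): the behaviour that message describes can be reproduced with the standard-library `contextvars` module alone. The sketch below is a simplified stand-in, assuming a copied `Context` in place of the separate Task mentioned in the commit; the helper function names are hypothetical.

```python
import contextvars

# Stand-in for the flag named in the commit message.
deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def reset_in_other_context(token):
    """Try to reset with a token that was created in a different Context."""
    try:
        deep_crawl_active.reset(token)
    except ValueError:
        return "ValueError"
    return "ok"


def set_false_in_other_context():
    """Plain set() does not depend on which Context created any token."""
    deep_crawl_active.set(False)
    return deep_crawl_active.get()


# The token is created in the current Context.
token = deep_crawl_active.set(True)

# A copied Context plays the role of the other Task here (assumption).
other = contextvars.copy_context()
reset_result = other.run(reset_in_other_context, token)
set_result = other.run(set_false_in_other_context)
```

Here `reset_result` is `"ValueError"` and `set_result` is `False`. The trade-off is that `set(False)` does not restore a previous value the way `reset(token)` would, which is acceptable only when the flag is never nested, as the commit message states.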