crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

Author	SHA1	Message	Date
Ahmed-tawfik94	9cfeb4626d	Document scroll_delay parameter for full-page screenshot crawling	2026-02-25 06:52:59 +03:00
Ahmed-tawfik94	cd81e3cd19	Fix scroll_delay ignored in take_screenshot_scroller for full-page screenshots	2026-02-25 06:52:53 +03:00
Nasrin	4f9cc0810b	Merge pull request #1764 from PatD42/fix/table-gfm-pipes Fix: Add leading/trailing pipes to GFM tables (pad_tables=False)	2026-02-25 03:32:54 +01:00
Nasrin	c4cdc02e27	Merge pull request #1761 from AtharvaJaiswal005/fix/total-score-missing-for-failed-head-extraction-1749 Fix total_score not calculated for links that fail head extraction	2026-02-25 02:25:22 +01:00
unclecode	cbd36b74b2	Add stats dashboard page for LP summit - Create scripts/update_stats.py: fetches GitHub, PyPI, Docker Hub data and generates docs/md_v2/stats.md with Chart.js visualizations - Add stats.md with 5 metric cards and 5 charts (monthly downloads, star growth, cumulative downloads, daily trend, GitHub traffic) - Add "Growth" nav entry in mkdocs.yml - Update .gitignore to allow scripts/update_stats.py	2026-02-24 12:58:34 +00:00
Atharva Jaiswal	5b815c278e	Fix redirect URL mismatch in head data merging Store original_url in head extraction results and key _merge_head_data by both final and original URLs so redirected results match back to their original link.href	2026-02-24 16:02:16 +05:30
unclecode	1a9f68d825	Fix cascading context crash from duplicate add_init_script (#1768) context.add_init_script() was called in both setup_context() and _crawl_web(), causing unbounded script accumulation on shared contexts under concurrent load. Chromium kills the overloaded context, cascading "Target page, context or browser has been closed" to all concurrent crawls. Add flag-based dedup: after injecting navigator_overrider or shadow-DOM scripts, set _crawl4ai_nav_overrider_injected / _crawl4ai_shadow_dom_injected on the context. Before injecting, check the flag. This preserves context-level scope (popups/iframes covered) and the fallback for managed/persistent/CDP paths where setup_context() runs without crawlerRunConfig.	2026-02-24 09:45:18 +00:00
Nasrin	731388c65c	Merge pull request #1760 from nitesh-77/fix/async-chardet-block fix: run blocking chardet.detect in thread executor #1751	2026-02-24 09:42:21 +01:00
Nasrin	57be8b8732	Merge pull request #1759 from nitesh-77/fix/filterchain-tuple-attribute-error Fix AttributeError in FilterChain.add_filter (Tuple Immutability) #1753	2026-02-24 08:48:46 +01:00
Nasrin	7435a1654c	Merge pull request #1771 from hafezparast/claude/check-fork-sync-S9SSz Claude/check fork sync s9 s sz	2026-02-23 06:28:10 +01:00
Claude	0e9b677870	Fix MCP bridge httpx timeout: add configurable timeout parameter The httpx.AsyncClient() default 5s timeout causes TimeoutException on slow LLM-backed endpoints. The exception bypasses the HTTPStatusError handler, propagating as an unhandled error to the MCP framework. - Add `timeout` parameter to `attach_mcp()` (default None = no limit) - Pass timeout through to `_make_http_proxy()` and `httpx.AsyncClient()` - Catch `httpx.TimeoutException` and surface it as HTTP 504 Fixes #1769 https://claude.ai/code/session_01LpranMwFBtQU7kFrV5EHAB	2026-02-23 02:10:04 +00:00
unclecode	254ef0510b	Fix anti-bot detection for large SPA block pages (403/503) Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that exceed 100KB+, bypassing all size-based detection thresholds. This caused the fallback (Web Unlocker) to never trigger for these sites. Changes: - HTTP 403/503 with non-data HTML is now always treated as blocked regardless of page size (false positives are cheap, fallback rescues them) - Added Tier 1 deep scan: strips scripts/styles before checking patterns on large pages, catching block text buried under 100KB+ of CSS/JS - Added "blocked by network security" as Tier 1 pattern (Reddit et al.) - Updated tests to reflect new detection philosophy	2026-02-20 10:07:59 +00:00
UncleCode	aa7b05072d	Change security contact email and update date Updated contact email for security reports and revised last updated date.	2026-02-20 04:31:12 +01:00
unclecode	7226f8face	Extend try/finally to cover all post-get_page setup code (#1640) The try block that guards the finally (which calls release_page_with_context) started ~125 lines after get_page(). Code between them (add_init_script, set_extra_http_headers, execute_hook, capture setup) was unprotected. If any of those threw (e.g. add_init_script failing on a recycled context), the refcount was never decremented and the context leaked permanently. Move the try to start immediately after get_page() so the finally block covers all setup code. Found by @Martichou during testing.	2026-02-20 02:30:49 +00:00
unclecode	c854e2b899	Fix simulate_user destroying page content via ArrowDown keypress The simulate_user/magic block fired keyboard.press("ArrowDown") and mouse.click at fixed coords (100,100) on every page. This destroyed content on any site with keyboard event handlers (MkDocs, React Router, Vue, Angular, SPAs) and clicked interactive elements at that position. Replace with mouse movement (two-point trajectory) and mouse.wheel scroll — these generate the same anti-bot signals (mousemove, scroll events) without triggering JS framework navigation or clicking buttons. Also remove temporary RAW_DEBUG logging that was left from investigation.	2026-02-19 15:03:28 +00:00
unclecode	8df3541ac4	Skip anti-bot checks and fallback for raw: URLs raw: URLs contain caller-provided HTML (e.g. from cache), not content fetched from a web server. Anti-bot detection, proxy retries, and fallback fetching are meaningless for this content. - Skip is_blocked() in retry loop and final re-check for raw: URLs - Skip fallback_fetch_function invocation for raw: URLs - Add RAW_DEBUG logging in browser strategy for set_content/page.content	2026-02-19 14:05:56 +00:00
unclecode	94a77eea30	Move test_repro_1640.py to tests/browser/	2026-02-19 06:33:46 +00:00
unclecode	2060c7e965	Fix browser recycling deadlock under sustained concurrent load (#1640) Three bugs in the version-based browser recycling caused requests to hang after ~80-130 pages under concurrent load: 1. Race condition: _maybe_bump_browser_version() added ALL context signatures to _pending_cleanup, including those with refcount 0. Since no future release would trigger cleanup for idle sigs, they stayed in _pending_cleanup permanently. Fix: split sigs into active (refcount > 0, go to pending) and idle (refcount == 0, cleaned up immediately). 2. Finally block fragility: the first line of _crawl_web's finally block accessed page.context.browser.contexts, which throws if the browser crashed. This prevented release_page_with_context() from ever being called, permanently leaking the refcount. Fix: call release_page_with_context() first in its own try/except, then do best-effort cleanup of listeners and page. 3. Safety cap deadlock: when _pending_cleanup accumulated >= 3 stuck entries, _maybe_bump_browser_version() blocked get_page() forever with no timeout. Fix: 30-second timeout on the wait, after which stuck entries (refcount 0) are force-cleaned. Includes regression test covering all three bugs plus multi-config concurrent crawl scenarios.	2026-02-19 06:27:25 +00:00
unclecode	13048a106b	Add Tier 3 structural integrity check to anti-bot detector Catches silent blocks, anti-bot redirects, and incomplete renders that pass pattern-based detection (Tiers 1/2) but are structurally broken: - No <body> tag on pages under 50KB - Minimal visible text after stripping scripts/styles/tags - No semantic content elements (p, h1-6, article, section, li, td, a) - Script-heavy shells with scripts but no real content Uses signal scoring: 2+ signals = blocked, 1 signal on small page (<5KB) = blocked. Skips large pages and JSON/XML data responses.	2026-02-18 06:59:22 +00:00
unclecode	c9cb0160cf	Add token usage tracking to generate_schema / agenerate_schema generate_schema can make up to 5 internal LLM calls (field inference, schema generation, validation retries) with no way to track token consumption. Add an optional `usage: TokenUsage = None` parameter that accumulates prompt/completion/total tokens across all calls in-place. - _infer_target_json: accept and populate usage accumulator - agenerate_schema: track usage after every aperform_completion call in the retry loop, forward usage to _infer_target_json - generate_schema (sync): forward usage to agenerate_schema Fully backward-compatible — omitting usage changes nothing.	2026-02-18 06:44:17 +00:00
unclecode	8576331d4e	Add Shadow DOM flattening and reorder js_code execution pipeline - Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes shadow DOM content into the light DOM before HTML capture. Uses a recursive serializer that resolves <slot> projections and strips only shadow-scoped <style> tags. Also injects an init script to force-open closed shadow roots via attachShadow patching. - Move `js_code` execution to after `wait_for` + `delay_before_return_html` so user scripts run on the fully-hydrated page. Add `js_code_before_wait` for the less common case of triggering loading before waiting. - Add JS snippet (flatten_shadow_dom.js), integration test, example, and documentation across all relevant doc files.	2026-02-18 06:43:00 +00:00
Patrick	c70ab31abd	fix: add leading/trailing pipes to GFM tables (pad_tables=False) When pad_tables=False (default), html2text generated table rows without leading/trailing pipe delimiters, producing non-compliant GFM markdown: Before: A | B | C After: | A | B | C | Changes: - Add leading pipe on first cell, spaced pipe between cells - Add trailing pipe at end of each row - Format separator as | --- | --- | instead of ---|--- - Ensure table starts on its own line (soft_br at <table>) - Handle <caption> element to prevent inline merge with header row - All changes guarded by `not self.pad_tables`; pad_tables mode unchanged Includes 13 unit tests covering GFM compliance and pad_tables regression. Fixes: #1731	2026-02-17 21:14:36 -05:00
Otman404	6ea0e38325	Re-raise exceptions in MemoryAdaptiveDispatcher.run_urls after logging	2026-02-18 00:34:50 +00:00
unclecode	4fb02f8b50	Warn LLM against hashed/generated CSS class names in schema prompts Replace vague "handle dynamic class names appropriately" with explicit rule: never use auto-generated class names (.styles_card__xK9r2, etc.) as they break on every site rebuild. Prefer data-* attributes, semantic tags, ARIA attributes, and stable meaningful class names instead.	2026-02-17 12:02:58 +00:00
unclecode	d267c650cb	Add source (sibling selector) support to JSON extraction strategies Many sites (e.g. Hacker News) split a single item's data across sibling elements. Field selectors only search descendants, making sibling data unreachable. The new "source" field key navigates to a sibling element before running the selector: {"source": "+ tr"} finds the next sibling <tr>, then extracts from there. - Add _resolve_source abstract method to JsonElementExtractionStrategy - Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS) - Modify _extract_field to resolve source before type dispatch - Update CSS and XPath LLM prompts with source docs and HN example - Default generate_schema validate=True so schemas are checked on creation - Add schema validation with feedback loop for auto-refinement - Add messages param to completion helpers for multi-turn refinement - Document source field and schema validation in docs - Add 14 unit tests covering CSS, XPath, backward compat, edge cases	2026-02-17 09:04:40 +00:00
Otman404	87f57f1675	Fix return in finally block silently suppressing exceptions	2026-02-17 02:26:09 +00:00
Atharva Jaiswal	094242d4a7	Fix total_score not calculated for links that fail head extraction The _merge_head_data() function only called calculate_total_score() for links present in url_to_head_data. Links that failed head extraction (PDFs, timeouts, non-HTML) hit the else branch and were appended unchanged, leaving total_score as None even when intrinsic_score was available. Added calculate_total_score() calls in both else branches (internal and external links) so all links get a total_score computed from their intrinsic_score when head data is unavailable. Fixes #1749	2026-02-16 20:41:30 +05:30
nitesh-77	4298e26525	fix: run blocking chardet.detect in thread executor #1751	2026-02-16 06:33:09 +05:30
nitesh-77	cfa73084ea	fix: resolve AttributeError in FilterChain.add_filter by handling tuple immutability	2026-02-16 05:41:22 +05:30
unclecode	ccd24aa824	Fix fallback fetch: run when all proxies crash, skip re-check, never return None Three related fixes to the anti-bot proxy retry + fallback pipeline: 1. Allow fallback_fetch_function to run when crawl_result is None (all proxies threw exceptions like browser crashes). Previously fallback only ran when crawl_result existed but was blocked — exception-only failures bypassed it. 2. Skip is_blocked() re-check after successful fallback. Real unblocked pages may contain anti-bot script markers (e.g. PerimeterX JS on Walmart) that trigger false positives, overriding success=True back to False. 3. Always return a CrawlResult with crawl_stats, never None. When all proxies and fallback fail, create a minimal failed result so callers get stats about what was attempted instead of AttributeError on None. Also: if aprocess_html fails during fallback (dead browser can't run Page.evaluate for consent popup removal), fall back to raw HTML result instead of silently discarding the successfully-fetched fallback content.	2026-02-15 10:55:00 +00:00
unclecode	45d8e1450f	Fix proxy escalation: don't re-raise on first proxy exception when chain has alternatives When proxy_config is a list (escalation chain) and the first proxy throws an exception (timeout, connection error, browser crash), the retry loop now continues to the next proxy instead of immediately re-raising. Previously, exceptions on _p_idx==0 and _attempt==0 were always re-raised, which broke the entire escalation chain — ISP/Residential/fallback proxies were never tried. This made the proxy list effectively useless for sites where the first-tier proxy fails with an exception rather than a blocked response. The raise is preserved when there's only a single proxy and single attempt (len(proxy_list) <= 1 and max_attempts <= 1) so that simple non-chain crawls still get immediate error propagation.	2026-02-15 09:55:55 +00:00
unclecode	d028a889d0	Make proxy_config a property so direct assignment also normalizes Setting config.proxy_config = [ProxyConfig.DIRECT, ...] after construction now goes through the same normalization as __init__, converting "direct" sentinels to None. Fixes crash when proxy_config is assigned directly instead of passed to the constructor.	2026-02-14 13:16:36 +00:00
unclecode	879553955c	Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation Allow "direct" or None in proxy_config list to explicitly try without a proxy before escalating to proxy servers. The retry loop already handled None as direct — this exposes it as a clean user-facing API via ProxyConfig.DIRECT.	2026-02-14 10:25:07 +00:00
unclecode	875207287e	Unify proxy_config to accept list, add crawl_stats tracking - proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or a list of ProxyConfig tried in order (first-come-first-served) - Remove is_fallback from ProxyConfig and fallback_proxy_configs from CrawlerRunConfig — proxy escalation handled entirely by list order - Add _get_proxy_list() normalizer for the retry loop - Add CrawlResult.crawl_stats with attempts, retries, proxies_used, fallback_fetch_used, and resolved_by for billing and observability - Set success=False with error_message when all attempts are blocked - Simplify retry loop — no more is_fallback stashing logic - Update docs and tests to reflect new API	2026-02-14 07:53:46 +00:00
unclecode	72b546c48d	Add anti-bot detection, retry, and fallback system Automatically detect when crawls are blocked by anti-bot systems (Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and escalate through configurable retry and fallback strategies. New features on CrawlerRunConfig: - max_retries: retry rounds when blocking is detected - fallback_proxy_configs: list of fallback proxies tried each round - fallback_fetch_function: async last-resort function returning raw HTML New field on ProxyConfig: - is_fallback: skip proxy on first attempt, activate only when blocked Escalation chain per round: main proxy → fallback proxies in order. After all rounds: fallback_fetch_function as last resort. Detection uses tiered heuristics — structural HTML markers (high confidence) trigger on any page, generic patterns only on short error pages to avoid false positives.	2026-02-14 05:24:07 +00:00
unclecode	fdd989785f	Sync sec-ch-ua with User-Agent and keep WebGL alive in stealth mode Fix a bug where magic mode and per-request UA overrides would change the User-Agent header without updating the sec-ch-ua (browser hint) header to match. Anti-bot systems like Akamai detect this mismatch as a bot signal. Changes: - Regenerate browser_hint via UAGen.generate_client_hints() whenever the UA is changed at crawl time (magic mode or explicit override) - Re-apply updated headers to the page via set_extra_http_headers() - Skip per-crawl UA override for persistent contexts where the UA is locked at launch time by Playwright's protocol layer - Move --disable-gpu flags behind enable_stealth check so WebGL works via SwiftShader when stealth mode is active (missing WebGL is a detectable headless signal) - Clean up old test scripts, add clean anti-bot test checkpoint-pre-antibot-fallback	2026-02-13 04:10:47 +00:00
unclecode	112f44a97d	Fix proxy auth for persistent browser contexts Chromium's --proxy-server CLI flag silently ignores inline credentials (user:pass@server). For persistent contexts, crawl4ai was embedding credentials in this flag via ManagedBrowser.build_browser_flags(), causing proxy auth to fail and the browser to fall back to direct connection. Fix: Use Playwright's launch_persistent_context(proxy=...) API instead of subprocess + CDP when use_persistent_context=True. This handles proxy authentication properly via the HTTP CONNECT handshake. The non-persistent and CDP paths remain unchanged. Changes: - Strip credentials from --proxy-server flag in build_browser_flags() - Add launch_persistent_context() path in BrowserManager.start() - Add cleanup path in BrowserManager.close() - Guard create_browser_context() when self.browser is None - Add regression tests covering all 4 proxy/persistence combinations	2026-02-12 11:19:29 +00:00
unclecode	1a24ac785e	Refactor from_kwargs to respect set_defaults and use __init__ defaults Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and CrawlerRunConfig.from_kwargs() with a generic approach that filters input kwargs to valid __init__ params and passes them through. This: - Makes set_defaults() work with from_kwargs() (previously ignored) - Fixes default mismatches (word_count_threshold was 200 vs __init__=1, markdown_generator was None vs __init__=DefaultMarkdownGenerator()) - Eliminates ~160 lines of duplicated default values - Auto-supports new params without updating from_kwargs	2026-02-11 13:35:36 +00:00
unclecode	3fc7730aaf	Add remove_consent_popups flag and fix from_kwargs dict deserialization Add CrawlerRunConfig.remove_consent_popups (bool, default False) that targets GDPR/cookie consent popups from 70+ known CMP providers including OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics, Sourcepoint, Google FundingChoices, and many more. The JS strategy uses a 5-phase approach: 1. Click "Accept All" buttons (cleanest dismissal, sets cookies) 2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro) 3. Remove known CMP containers by selector (~120 selectors) 4. Handle iframe-based and shadow DOM CMPs 5. Restore body scroll and remove CMP body classes Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to auto-deserialize dict values using the existing from_serializable_dict() infrastructure. Previously, strategy objects like markdown_generator arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from JSON APIs were passed through as raw dicts, causing crashes when the crawler later called methods on them.	2026-02-11 12:46:47 +00:00
unclecode	44b8afb6dc	Improve schema generation prompt for sibling-based layouts	2026-02-10 08:34:22 +00:00
unclecode	fbc52813a4	Add tests, docs, and contributors for PRs #1463 and #1435 - Add tests for device_scale_factor (config + integration) - Add tests for redirected_status_code (model + redirect + raw HTML) - Document device_scale_factor in browser config docs and API reference - Document redirected_status_code in crawler result docs and API reference - Add TristanDonze and charlaie to CONTRIBUTORS.md - Update PR-TODOLIST with session results	2026-02-06 09:30:19 +00:00
unclecode	37a49c5315	Merge PR #1435: Add redirected_status_code to CrawlResult Applied manually due to conflicts (PR based on older code). Also fixed missing variable initialization for non-goto paths (file://, raw:, js_only) that would have caused NameError. Closes #1434	2026-02-06 09:23:54 +00:00
unclecode	0aacafed0a	Merge PR #1463: Add configurable device_scale_factor for screenshot quality	2026-02-06 09:19:42 +00:00
unclecode	719e83e105	Update PR todolist — refresh open PRs, add 6 new, classify - Added PRs #475, #462, #416, #335, #332, #312 - Flagged #475 as duplicate of merged #1296 - Corrected author for #1450 (rbushri) - Updated total count to ~63 open PRs - Updated date to 2026-02-06	2026-02-06 09:06:13 +00:00
unclecode	3401dd1620	Fix browser recycling under high concurrency — version-based approach The previous recycle logic waited for all refcounts to hit 0 before recycling, which never happened under sustained concurrent load (20+ crawls always had at least one active). New approach: - Add _browser_version to config signature — bump it to force new contexts - When threshold is hit: bump version, move old sigs to _pending_cleanup - New requests get new contexts automatically (different signature) - Old contexts drain naturally and get cleaned up when refcount hits 0 - Safety cap: max 3 pending browsers draining at once This means recycling now works under any load pattern — no blocking, no waiting for quiet moments. Old and new browsers coexist briefly during transitions. Includes 12 new tests covering version bumps, concurrent recycling, safety cap, and edge cases.	2026-02-05 07:48:12 +00:00
unclecode	c046918bb4	Add memory-saving mode, browser recycling, and CDP leak fixes - Add memory_saving_mode config: aggressive cache discard + V8 heap cap flags for high-volume crawling (1000+ pages) - Add max_pages_before_recycle config: automatic browser process recycling after N pages to reclaim leaked memory (recommended 500-1000) - Add default Chrome flags to disable unused features (OptimizationHints, MediaRouter, component updates, domain reliability) - Fix CDP session leak: detach CDP session after viewport adjustment - Fix session kill: only close context when refcount reaches 0, preventing use-after-close for shared contexts - Add browser lifecycle and memory tests	2026-02-04 02:00:53 +00:00
ntohidi	4e56f3e00d	Add contributing guide and update mkdocs navigation for community resources	2026-02-03 09:46:54 +01:00
unclecode	0bfcf080dd	Add contributors from PRs #1133, #729 Credit chrizzly2309 and complete-dope for identifying bugs that were resolved on develop.	2026-02-02 07:56:37 +00:00
unclecode	b962699c0d	Add contributors from PRs #973, #1073, #931 Credit danyQe, saipavanmeruga7797, and stevenaldinger for identifying bugs that were resolved on develop.	2026-02-02 07:14:12 +00:00
unclecode	ffd3face6b	Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py The first definition (with tags/questions fields) was immediately overwritten by the second simpler definition — pure dead code. Removes 61 lines of unused prompt text. Inspired by PR #931 (stevenaldinger).	2026-02-02 07:04:35 +00:00
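
Commits 87f57f1675 and 6ea0e38325 above both concern exceptions that were swallowed instead of reaching the caller. The sketch below illustrates the general Python behaviour that the first of them describes: a `return` inside a `finally` block discards any exception in flight. It is a generic, standalone illustration, not code from the crawl4ai repository; the function names `fetch_swallowing` and `fetch_propagating` are hypothetical.

```python
def fetch_swallowing():
    """Hypothetical example: `return` inside `finally` discards the exception."""
    try:
        raise ValueError("crawl failed")
    finally:
        # Returning here replaces the pending ValueError with a normal return.
        return "partial result"


def fetch_propagating():
    """Hypothetical example: cleanup in `finally` without `return`."""
    cleaned_up = []
    try:
        raise ValueError("crawl failed")
    finally:
        # Cleanup only; the ValueError continues to the caller afterwards.
        cleaned_up.append(True)


if __name__ == "__main__":
    # No exception is visible to the caller of the first function.
    print(fetch_swallowing())
    # The second function lets the caller see and handle the error.
    try:
        fetch_propagating()
    except ValueError as exc:
        print(f"caught: {exc}")
```

Recent Python versions may emit a SyntaxWarning for the first function, since `return` in `finally` is a known source of this kind of bug.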


1533 Commits