- Route deep-crawl stream=True requests with a single URL through AsyncWebCrawler.arun so each discovered page is streamed as its own CrawlResult
- Preserve existing arun_many + MemoryAdaptiveDispatcher behavior for non–deep-crawl or multi-URL streaming
- Add Docker REST tests for deep-crawl streaming success (single URL) and helpful error on multi-URL usage
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.
Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.
Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).
Inspired by PR #1683 from @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.
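
The fallback chain described above can be sketched as follows. This is only an illustration: `LLMSettings`, `DEFAULT_QUERY_LLM`, and `resolve_query_llm` are hypothetical names, and the default provider string is a placeholder rather than the library's actual hardcoded value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMSettings:
    """Hypothetical stand-in for LLMConfig, reduced to one field."""
    provider: str


# Placeholder default, not the value crawl4ai actually hardcodes.
DEFAULT_QUERY_LLM = LLMSettings(provider="example/default-chat-model")


def resolve_query_llm(
    query_llm_config: Optional[LLMSettings],
    llm_config: Optional[LLMSettings],
) -> LLMSettings:
    """Return the first configured option: query_llm_config -> llm_config -> default."""
    if query_llm_config is not None:
        return query_llm_config
    if llm_config is not None:
        return llm_config
    return DEFAULT_QUERY_LLM
```

Existing callers that never set `query_llm_config` fall through to the second or third branch, which is what keeps the change backward compatible.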
Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.
Inspired by #1689.
Store original_url in head extraction results and key _merge_head_data
by both final and original URLs so redirected results match back to
their original link.href
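
A minimal sketch of keying head data by both URLs, assuming each head result is a dict with a `url` (final, post-redirect) and an optional `original_url`. The function name and the `url` key are illustrative assumptions, not the actual `_merge_head_data` code.

```python
def index_head_results(head_results):
    """Map both the final URL and the pre-redirect URL to the same head data."""
    by_url = {}
    for item in head_results:
        by_url[item["url"]] = item
        original = item.get("original_url")
        if original:
            # Do not overwrite an entry that already exists for this URL.
            by_url.setdefault(original, item)
    return by_url


results = [
    {"url": "https://b.example/x", "original_url": "https://a.example/x", "title": "T"}
]
index = index_head_results(results)
```

A link whose href is `https://a.example/x` now finds its head data even though the request ended at `https://b.example/x`.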
context.add_init_script() was called in both setup_context() and
_crawl_web(), causing unbounded script accumulation on shared contexts
under concurrent load. Chromium kills the overloaded context, cascading
"Target page, context or browser has been closed" to all concurrent crawls.
Add flag-based dedup: after injecting navigator_overrider or shadow-DOM
scripts, set _crawl4ai_nav_overrider_injected / _crawl4ai_shadow_dom_injected
on the context. Before injecting, check the flag. This preserves context-level
scope (popups/iframes covered) and the fallback for managed/persistent/CDP
paths where setup_context() runs without crawlerRunConfig.
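
The flag-based dedup can be illustrated with a small stand-in. `FakeContext` and `inject_once` are hypothetical; the real Playwright `add_init_script()` is async in the async API, and this synchronous sketch only shows the check-then-mark pattern.

```python
class FakeContext:
    """Stand-in for a browser context; records every injected script."""

    def __init__(self):
        self.scripts = []

    def add_init_script(self, script):
        self.scripts.append(script)


def inject_once(context, flag, script):
    """Add an init script only if the context is not yet marked with `flag`."""
    if getattr(context, flag, False):
        return False
    context.add_init_script(script)
    setattr(context, flag, True)
    return True


ctx = FakeContext()
for _ in range(100):  # simulate 100 crawls sharing one context
    inject_once(ctx, "_crawl4ai_nav_overrider_injected", "/* navigator override */")
```

However many crawls share the context, the script list stays at one entry instead of growing with every call.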
The httpx.AsyncClient() default 5s timeout causes TimeoutException on
slow LLM-backed endpoints. The exception bypasses the HTTPStatusError
handler, propagating as an unhandled error to the MCP framework.
- Add `timeout` parameter to `attach_mcp()` (default None = no limit)
- Pass timeout through to `_make_http_proxy()` and `httpx.AsyncClient()`
- Catch `httpx.TimeoutException` and surface it as HTTP 504
Fixes #1769
https://claude.ai/code/session_01LpranMwFBtQU7kFrV5EHAB
Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that
exceed 100KB, bypassing all size-based detection thresholds. As a result,
the fallback (Web Unlocker) never triggered for these sites.
Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked
regardless of page size (false positives are cheap, fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns
on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as Tier 1 pattern (Reddit et al.)
- Updated tests to reflect new detection philosophy
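
A rough sketch of the two rules above, under stated assumptions: `looks_blocked`, `is_data_response`, and `SCAN_LIMIT` are hypothetical, and the scan window is an invented value used only to show why stripping scripts and styles first matters.

```python
import re

TIER1_PATTERNS = [re.compile(r"blocked by network security", re.I)]
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
SCAN_LIMIT = 5_000  # assumed scan window in characters, not the library's value


def looks_blocked(status_code, html, is_data_response=False):
    """Return True if the response looks like an anti-bot block page."""
    # Rule 1: 403/503 with non-data HTML is blocked regardless of size.
    if status_code in (403, 503) and not is_data_response:
        return True
    # Rule 2: drop <script>/<style> bodies so block text buried under
    # large CSS/JS payloads falls inside the scanned window.
    visible = _SCRIPT_STYLE.sub("", html)
    return any(p.search(visible[:SCAN_LIMIT]) for p in TIER1_PATTERNS)


page = (
    "<html><head><style>" + "a{color:red}" * 10_000 + "</style></head>"
    "<body>You have been blocked by network security.</body></html>"
)
```

Here `page` is well over 100KB, but after stripping the stylesheet the block message sits within the first few dozen characters.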
The try block that guards the finally (which calls
release_page_with_context) started ~125 lines after get_page().
Code between them — add_init_script, set_extra_http_headers,
execute_hook, capture setup — was unprotected. If any of those
threw (e.g. add_init_script failing on a recycled context), the
refcount was never decremented and the context leaked permanently.
Move the try to start immediately after get_page() so the finally
block covers all setup code.
Found by @Martichou during testing.
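
The shape of the fix, reduced to a toy example. `Pool` and `crawl` are hypothetical stand-ins; only the placement of `try` relative to `get_page()` mirrors the change described above.

```python
class Pool:
    """Toy page pool that tracks a single refcount."""

    def __init__(self):
        self.refcount = 0

    def get_page(self):
        self.refcount += 1
        return object()

    def release_page_with_context(self, page):
        self.refcount -= 1


def crawl(pool, setup):
    page = pool.get_page()
    try:  # starts right after get_page(), so setup failures are covered too
        setup(page)
        return "ok"
    finally:
        pool.release_page_with_context(page)


def failing_setup(page):
    raise RuntimeError("add_init_script failed on a recycled context")


pool = Pool()
try:
    crawl(pool, failing_setup)
except RuntimeError:
    pass  # the error still propagates, but the refcount was released
```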
The simulate_user/magic block fired keyboard.press("ArrowDown") and
mouse.click at fixed coords (100,100) on every page. This destroyed
content on any site with keyboard event handlers (MkDocs, React Router,
Vue, Angular, SPAs) and clicked interactive elements at that position.
Replace with mouse movement (two-point trajectory) and mouse.wheel
scroll — these generate the same anti-bot signals (mousemove, scroll
events) without triggering JS framework navigation or clicking buttons.
Also remove temporary RAW_DEBUG logging that was left from investigation.
raw: URLs contain caller-provided HTML (e.g. from cache), not content
fetched from a web server. Anti-bot detection, proxy retries, and
fallback fetching are meaningless for this content.
- Skip is_blocked() in retry loop and final re-check for raw: URLs
- Skip fallback_fetch_function invocation for raw: URLs
- Add RAW_DEBUG logging in browser strategy for set_content/page.content
Three bugs in the version-based browser recycling caused requests to
hang after ~80-130 pages under concurrent load:
1. Race condition: _maybe_bump_browser_version() added ALL context
signatures to _pending_cleanup, including those with refcount 0.
Since no future release would trigger cleanup for idle sigs, they
stayed in _pending_cleanup permanently. Fix: split sigs into
active (refcount > 0, go to pending) and idle (refcount == 0,
cleaned up immediately).
2. Finally block fragility: the first line of _crawl_web's finally
block accessed page.context.browser.contexts, which throws if the
browser crashed. This prevented release_page_with_context() from
ever being called, permanently leaking the refcount. Fix: call
release_page_with_context() first in its own try/except, then do
best-effort cleanup of listeners and page.
3. Safety cap deadlock: when _pending_cleanup accumulated >= 3 stuck
entries, _maybe_bump_browser_version() blocked get_page() forever
with no timeout. Fix: 30-second timeout on the wait, after which
stuck entries (refcount 0) are force-cleaned.
Includes regression test covering all three bugs plus multi-config
concurrent crawl scenarios.
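
The fix for the first bug amounts to partitioning signatures by refcount. A minimal sketch, with `split_signatures` and the sample signatures as hypothetical names:

```python
def split_signatures(refcounts):
    """Partition context signatures into active (in use) and idle (refcount 0)."""
    active = {sig for sig, count in refcounts.items() if count > 0}
    idle = {sig for sig, count in refcounts.items() if count == 0}
    return active, idle


refcounts = {"sig-a": 2, "sig-b": 0, "sig-c": 1}
# Active signatures wait for their last release; idle ones are cleaned up now.
pending_cleanup, cleaned_now = split_signatures(refcounts)
```

Only signatures that will see a future release go into the pending set, so nothing can remain there indefinitely.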
Catches silent blocks, anti-bot redirects, and incomplete renders that
pass pattern-based detection (Tiers 1/2) but are structurally broken:
- No <body> tag on pages under 50KB
- Minimal visible text after stripping scripts/styles/tags
- No semantic content elements (p, h1-6, article, section, li, td, a)
- Script-heavy shells with scripts but no real content
Uses signal scoring: 2+ signals = blocked, 1 signal on small page
(<5KB) = blocked. Skips large pages and JSON/XML data responses.
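
The signal scoring could look roughly like this. The 50KB and 5KB limits come from the description above (treated here as character counts); `MIN_VISIBLE_CHARS`, the function names, and the regexes are assumptions, and the skip for large pages and JSON/XML responses is omitted.

```python
import re

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
_TAGS = re.compile(r"<[^>]+>")
_SEMANTIC = re.compile(r"<(p|h[1-6]|article|section|li|td|a)\b", re.I)
MIN_VISIBLE_CHARS = 50  # assumed threshold for "minimal visible text"


def structural_signals(html):
    """Count structural warning signs in an HTML document."""
    signals = 0
    text = _TAGS.sub(" ", _SCRIPT_STYLE.sub(" ", html))
    visible = " ".join(text.split())
    if len(html) < 50_000 and not re.search(r"<body\b", html, re.I):
        signals += 1  # no <body> tag on a smallish page
    if len(visible) < MIN_VISIBLE_CHARS:
        signals += 1  # almost no visible text
    if not _SEMANTIC.search(html):
        signals += 1  # no semantic content elements
    if re.search(r"<script\b", html, re.I) and len(visible) < MIN_VISIBLE_CHARS:
        signals += 1  # scripts present but no real content
    return signals


def is_structurally_blocked(html):
    """2+ signals means blocked; 1 signal is enough on a page under 5KB."""
    count = structural_signals(html)
    return count >= 2 or (count == 1 and len(html) < 5_000)


shell = "<html><head><script>var a = 1;</script></head></html>"
article = (
    "<html><body><article><h1>Title</h1><p>"
    + "word " * 200
    + "</p></article></body></html>"
)
```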
generate_schema can make up to 5 internal LLM calls (field inference,
schema generation, validation retries) with no way to track token
consumption. Add an optional `usage: TokenUsage = None` parameter that
accumulates prompt/completion/total tokens across all calls in-place.
- _infer_target_json: accept and populate usage accumulator
- agenerate_schema: track usage after every aperform_completion call
in the retry loop, forward usage to _infer_target_json
- generate_schema (sync): forward usage to agenerate_schema
Fully backward-compatible — omitting usage changes nothing.
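
The in-place accumulation pattern, sketched with a simplified stand-in for `TokenUsage`. The `record_usage` helper and the token counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenUsage:
    """Simplified stand-in with the three counters named above."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


def record_usage(usage: Optional[TokenUsage], prompt: int, completion: int) -> None:
    """Add one call's token counts to the accumulator; do nothing if it is None."""
    if usage is None:
        return
    usage.prompt_tokens += prompt
    usage.completion_tokens += completion
    usage.total_tokens += prompt + completion


usage = TokenUsage()
for prompt, completion in [(120, 40), (300, 95)]:  # two simulated LLM calls
    record_usage(usage, prompt, completion)
record_usage(None, 10, 10)  # omitting the accumulator changes nothing
```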
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
shadow DOM content into the light DOM before HTML capture. Uses a
recursive serializer that resolves <slot> projections and strips
only shadow-scoped <style> tags. Also injects an init script to
force-open closed shadow roots via attachShadow patching.
- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
for the less common case of triggering loading before waiting.
- Add JS snippet (flatten_shadow_dom.js), integration test, example,
and documentation across all relevant doc files.
When pad_tables=False (default), html2text generated table rows without
leading/trailing pipe delimiters, producing non-compliant GFM markdown:
Before: A | B | C
After: | A | B | C |
Changes:
- Add leading pipe on first cell, spaced pipe between cells
- Add trailing pipe at end of each row
- Format separator as | --- | --- | instead of ---|---
- Ensure table starts on its own line (soft_br at <table>)
- Handle <caption> element to prevent inline merge with header row
- All changes guarded by `not self.pad_tables` — pad_tables mode unchanged
Includes 13 unit tests covering GFM compliance and pad_tables regression.
Fixes: #1731
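
For reference, the target row format can be expressed in a few lines. These helpers are illustrative only and are not the html2text code.

```python
def gfm_row(cells):
    """Format one table row with leading and trailing pipes."""
    return "| " + " | ".join(cells) + " |"


def gfm_separator(column_count):
    """Format the header separator row, e.g. | --- | --- |."""
    return gfm_row(["---"] * column_count)


table = "\n".join([gfm_row(["A", "B", "C"]), gfm_separator(3), gfm_row(["1", "2", "3"])])
```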
Replace vague "handle dynamic class names appropriately" with explicit
rule: never use auto-generated class names (.styles_card__xK9r2, etc.)
as they break on every site rebuild. Prefer data-* attributes, semantic
tags, ARIA attributes, and stable meaningful class names instead.
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.
- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
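
To make the "source" idea concrete, here is a small sketch built on the standard library's ElementTree rather than the actual extraction strategies. `resolve_source` and `extract_field` are simplified, hypothetical versions, and the selectors use ElementTree path syntax instead of CSS.

```python
import xml.etree.ElementTree as ET


def resolve_source(root, element, source):
    """Resolve a "+ tag" source: the immediately following sibling, if its tag matches."""
    tag = source.lstrip("+ ").strip()
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(element)
    if parent is None:
        return None
    siblings = list(parent)
    idx = siblings.index(element) + 1
    if idx < len(siblings) and siblings[idx].tag == tag:
        return siblings[idx]
    return None


def extract_field(root, element, field):
    """Navigate to the source element first (if any), then run the selector."""
    base = element
    if "source" in field:
        base = resolve_source(root, element, field["source"])
        if base is None:
            return None
    node = base.find(field["selector"])
    return node.text if node is not None else None


table = ET.fromstring(
    "<table>"
    "<tr class='athing'><td class='title'>Example story</td></tr>"
    "<tr><td class='score'>42 points</td></tr>"
    "</table>"
)
row = table.find("tr[@class='athing']")
title = extract_field(table, row, {"name": "title", "selector": "td[@class='title']"})
score = extract_field(
    table, row, {"name": "score", "selector": "td[@class='score']", "source": "+ tr"}
)
```

Without the `source` key the score selector finds nothing, because the score cell is not a descendant of the base row.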
The _merge_head_data() function only called calculate_total_score() for
links present in url_to_head_data. Links that failed head extraction
(PDFs, timeouts, non-HTML) hit the else branch and were appended
unchanged, leaving total_score as None even when intrinsic_score was
available.
Added calculate_total_score() calls in both else branches (internal
and external links) so all links get a total_score computed from their
intrinsic_score when head data is unavailable.
Fixes #1749
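
A simplified sketch of the corrected control flow, assuming links are plain dicts and the scoring function is passed in. The real `_merge_head_data()` and `calculate_total_score()` signatures are not shown in this message, so everything here is illustrative.

```python
def merge_head_data(links, url_to_head_data, calculate_total_score):
    """Attach head data where available and score every link either way."""
    merged = []
    for link in links:
        head = url_to_head_data.get(link["href"])
        if head is not None:
            link = {**link, "head_data": head}
        # Previously only the branch with head data computed a score.
        link = {**link, "total_score": calculate_total_score(link)}
        merged.append(link)
    return merged


def intrinsic_only(link):
    """Toy scorer: fall back to the intrinsic score."""
    return link.get("intrinsic_score")


out = merge_head_data([{"href": "a.pdf", "intrinsic_score": 3.0}], {}, intrinsic_only)
```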
Three related fixes to the anti-bot proxy retry + fallback pipeline:
1. Allow fallback_fetch_function to run when crawl_result is None (all proxies
threw exceptions like browser crashes). Previously fallback only ran when
crawl_result existed but was blocked — exception-only failures bypassed it.
2. Skip is_blocked() re-check after successful fallback. Real unblocked pages
may contain anti-bot script markers (e.g. PerimeterX JS on Walmart) that
trigger false positives, overriding success=True back to False.
3. Always return a CrawlResult with crawl_stats, never None. When all proxies
and fallback fail, create a minimal failed result so callers get stats
about what was attempted instead of AttributeError on None.
Also: if aprocess_html fails during fallback (dead browser can't run
Page.evaluate for consent popup removal), fall back to raw HTML result
instead of silently discarding the successfully-fetched fallback content.
When proxy_config is a list (escalation chain) and the first proxy throws
an exception (timeout, connection error, browser crash), the retry loop
now continues to the next proxy instead of immediately re-raising.
Previously, exceptions on _p_idx==0 and _attempt==0 were always re-raised,
which broke the entire escalation chain — ISP/Residential/fallback proxies
were never tried. This made the proxy list effectively useless for sites
where the first-tier proxy fails with an exception rather than a blocked
response.
The raise is preserved when there's only a single proxy and single attempt
(len(proxy_list) <= 1 and max_attempts <= 1) so that simple non-chain
crawls still get immediate error propagation.
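
The retry behaviour can be sketched like this. `fetch_with_escalation` and `fake_fetch` are hypothetical, and block detection is left out so that only the exception handling is visible.

```python
def fetch_with_escalation(proxy_list, fetch, max_attempts=1):
    """Try each proxy in order; re-raise immediately only for single-shot crawls."""
    proxies = list(proxy_list) or [None]  # None means a direct connection
    single_shot = len(proxies) <= 1 and max_attempts <= 1
    last_error = None
    for _attempt in range(max(1, max_attempts)):
        for proxy in proxies:
            try:
                return fetch(proxy)
            except Exception as exc:
                if single_shot:
                    raise  # simple crawls keep immediate error propagation
                last_error = exc  # otherwise move on to the next proxy
    raise last_error


tried = []


def fake_fetch(proxy):
    tried.append(proxy)
    if proxy == "datacenter":
        raise TimeoutError("first-tier proxy timed out")
    return f"html via {proxy}"


html = fetch_with_escalation(["datacenter", "residential"], fake_fetch)
```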
Setting config.proxy_config = [ProxyConfig.DIRECT, ...] after
construction now goes through the same normalization as __init__,
converting "direct" sentinels to None. Fixes crash when proxy_config
is assigned directly instead of passed to the constructor.
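
One way to get identical normalization on construction and on later assignment is a property setter. Whether crawl4ai uses a property or another hook is an assumption here; `RunConfig` and `DIRECT` are simplified stand-ins.

```python
DIRECT = "direct"  # stand-in for the ProxyConfig.DIRECT sentinel


class RunConfig:
    """Toy config whose proxy_config is normalized on every assignment."""

    def __init__(self, proxy_config=None):
        self.proxy_config = proxy_config  # routed through the setter below

    @property
    def proxy_config(self):
        return self._proxy_config

    @proxy_config.setter
    def proxy_config(self, value):
        # Convert "direct" sentinels to None, for single values and lists.
        if isinstance(value, list):
            value = [None if item == DIRECT else item for item in value]
        elif value == DIRECT:
            value = None
        self._proxy_config = value


cfg = RunConfig()
cfg.proxy_config = [DIRECT, "http://proxy.example:8080"]
```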
Allow "direct" or None in proxy_config list to explicitly try
without a proxy before escalating to proxy servers. The retry
loop already handled None as direct — this exposes it as a
clean user-facing API via ProxyConfig.DIRECT.
- proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or
a list of ProxyConfig tried in order until one succeeds
- Remove is_fallback from ProxyConfig and fallback_proxy_configs from
CrawlerRunConfig — proxy escalation handled entirely by list order
- Add _get_proxy_list() normalizer for the retry loop
- Add CrawlResult.crawl_stats with attempts, retries, proxies_used,
fallback_fetch_used, and resolved_by for billing and observability
- Set success=False with error_message when all attempts are blocked
- Simplify retry loop — no more is_fallback stashing logic
- Update docs and tests to reflect new API
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.
New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML
New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked
Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.
Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.
Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
via SwiftShader when stealth mode is active (missing WebGL is a
detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to direct
connection.
Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.
Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
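
Separating credentials from the server address might look like the sketch below. `split_proxy` is a hypothetical helper; the `server`, `username`, and `password` keys match the proxy settings that Playwright accepts.

```python
from urllib.parse import urlsplit


def split_proxy(proxy_url):
    """Return (credential-free server string, Playwright-style proxy settings)."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    settings = {"server": server}
    if parts.username:
        settings["username"] = parts.username
    if parts.password:
        settings["password"] = parts.password
    return server, settings


server, settings = split_proxy("http://user:pass@proxy.example:8080")
```

The bare `server` string is safe to pass as `--proxy-server`, while `settings` carries the credentials through the `proxy=` argument.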
Replace hardcoded parameter listings in BrowserConfig.from_kwargs() and
CrawlerRunConfig.from_kwargs() with a generic approach that filters
input kwargs to valid __init__ params and passes them through. This:
- Makes set_defaults() work with from_kwargs() (previously ignored)
- Fixes default mismatches (word_count_threshold was 200 vs __init__=1,
markdown_generator was None vs __init__=DefaultMarkdownGenerator())
- Eliminates ~160 lines of duplicated default values
- Auto-supports new params without updating from_kwargs
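
The generic approach can be sketched with `inspect.signature`. The `Config` class below is a toy with two parameters, not the real BrowserConfig or CrawlerRunConfig.

```python
import inspect


class Config:
    """Toy config: defaults live only in __init__."""

    def __init__(self, word_count_threshold=1, verbose=False):
        self.word_count_threshold = word_count_threshold
        self.verbose = verbose

    @classmethod
    def from_kwargs(cls, kwargs):
        # Keep only keys that __init__ accepts; everything else is dropped.
        valid = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in kwargs.items() if k in valid})


cfg = Config.from_kwargs({"verbose": True, "unknown_option": 123})
```

Because unspecified parameters fall back to the `__init__` defaults, there is no second copy of those defaults to drift out of sync.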
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.
The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes
Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
Applied manually due to conflicts (PR based on older code).
Also fixed missing variable initialization for non-goto paths
(file://, raw:, js_only) that would have caused NameError.
Closes #1434