Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.
Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.
Also update CONTRIBUTORS.md for PR #1829.
- #1370: Freeze element dimensions via CSS before viewport resize in
take_screenshot_scroller() to prevent responsive reflow on Elementor
sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
to abort pending loads; move event listener cleanup outside session_id
guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
set — call arun() directly per URL so the DeepCrawlDecorator can
invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
save_global_config() (cli.py line 58).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
provider, temperature, base_url query params with fallback to server config.
Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
REDIS_PASSWORD environment variable overrides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.
Includes 16 tests (8 unit + 8 integration).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- antibot_detector: add <pre> to content elements regex, detect
browser-wrapped JSON in _looks_like_data() so httpbin-style
responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
patterns (/docs/*) instead of matching against full URL, which
always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
and edge cases
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)
Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)
CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
build-push-action v6, setup-buildx v4, login v4 (#1734)
Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
for Unicode preservation in JSON output (#1668)
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.
Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.
Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).
Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.
Inspired by #1689.
context.add_init_script() was called in both setup_context() and
_crawl_web(), causing unbounded script accumulation on shared contexts
under concurrent load. Chromium kills the overloaded context, cascading
"Target page, context or browser has been closed" to all concurrent crawls.
Add flag-based dedup: after injecting navigator_overrider or shadow-DOM
scripts, set _crawl4ai_nav_overrider_injected / _crawl4ai_shadow_dom_injected
on the context. Before injecting, check the flag. This preserves context-level
scope (popups/iframes covered) and the fallback for managed/persistent/CDP
paths where setup_context() runs without crawlerRunConfig.
Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that
exceed 100KB+, bypassing all size-based detection thresholds. This caused
the fallback (Web Unlocker) to never trigger for these sites.
Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked
regardless of page size (false positives are cheap, fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns
on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as Tier 1 pattern (Reddit et al.)
- Updated tests to reflect new detection philosophy
generate_schema can make up to 5 internal LLM calls (field inference,
schema generation, validation retries) with no way to track token
consumption. Add an optional `usage: TokenUsage = None` parameter that
accumulates prompt/completion/total tokens across all calls in-place.
- _infer_target_json: accept and populate usage accumulator
- agenerate_schema: track usage after every aperform_completion call
in the retry loop, forward usage to _infer_target_json
- generate_schema (sync): forward usage to agenerate_schema
Fully backward-compatible — omitting usage changes nothing.
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
shadow DOM content into the light DOM before HTML capture. Uses a
recursive serializer that resolves <slot> projections and strips
only shadow-scoped <style> tags. Also injects an init script to
force-open closed shadow roots via attachShadow patching.
- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
for the less common case of triggering loading before waiting.
- Add JS snippet (flatten_shadow_dom.js), integration test, example,
and documentation across all relevant doc files.
When pad_tables=False (default), html2text generated table rows without
leading/trailing pipe delimiters, producing non-compliant GFM markdown:
Before: A | B | C
After: | A | B | C |
Changes:
- Add leading pipe on first cell, spaced pipe between cells
- Add trailing pipe at end of each row
- Format separator as | --- | --- | instead of ---|---
- Ensure table starts on its own line (soft_br at <table>)
- Handle <caption> element to prevent inline merge with header row
- All changes guarded by `not self.pad_tables` — pad_tables mode unchanged
Includes 13 unit tests covering GFM compliance and pad_tables regression.
Fixes: #1731
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.
- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
The _merge_head_data() function only called calculate_total_score() for
links present in url_to_head_data. Links that failed head extraction
(PDFs, timeouts, non-HTML) hit the else branch and were appended
unchanged, leaving total_score as None even when intrinsic_score was
available.
Added calculate_total_score() calls in both else branches (internal
and external links) so all links get a total_score computed from their
intrinsic_score when head data is unavailable.
Fixes#1749
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.
New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML
New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked
Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.
Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.
Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
via SwiftShader when stealth mode is active (missing WebGL is a
detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to direct
connection.
Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.
Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).
New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once
This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.
Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:
1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
affect the browser context (proxy_config, locale, timezone_id, geolocation,
override_navigator, simulate_user, magic). Switched from blacklist to
whitelist — non-context fields like word_count_threshold, css_selector,
screenshot, verbose no longer cause unnecessary context creation.
2. No eviction mechanism existed between close() calls. Added refcount
tracking (_context_refcounts, incremented under _contexts_lock in
get_page, decremented in release_page_with_context) and LRU eviction
(_evict_lru_context_locked) that caps contexts at _max_contexts=20,
evicting only idle contexts (refcount==0) oldest-first.
Also fixed: storage_state path leaked a temporary context every request
(now explicitly closed after clone_runtime_state).
Closes#943. Credit to @Martichou for the investigation in #1640.
Strip markdown code fences (```json ... ```) from LLM responses before
json.loads() in agenerate_schema(). Anthropic models wrap JSON output
in markdown fences when litellm silently drops the unsupported
response_format parameter, causing json.loads("") parse failures.
- Add _strip_markdown_fences() helper to extraction_strategy.py
- Apply fence stripping + empty response check in agenerate_schema()
- Separate JSONDecodeError for clearer error messages
- Add 34 tests: unit, real API integration (Anthropic/OpenAI/Groq
against quotes.toscrape.com), and regression parametrized
- Use class-level tracking keyed by normalized CDP URL
- All BrowserManager instances connecting to same browser share tracking
- For CDP connections, always create new pages (cross-connection page
sharing isn't reliable in Playwright)
- For managed browsers, page reuse works within same process
- Normalize CDP URLs to handle different formats (http, ws, query params)
When using create_isolated_context=False with concurrent crawls, multiple
tasks would reuse the same page (pages[0]) causing navigation race
conditions and "Page.content: Unable to retrieve content because the
page is navigating" errors.
Changes:
- Add _pages_in_use set to track pages currently being used by crawls
- Rewrite get_page() to only reuse pages that are not in use
- Create new pages when all existing pages are busy
- Add release_page() method to release pages after crawl completes
- Update cleanup paths to release pages before closing
This maintains context sharing (cookies, localStorage) while ensuring
each concurrent crawl gets its own isolated page for navigation.
Includes integration tests verifying:
- Single and sequential crawls still work
- Concurrent crawls don't cause race conditions
- High concurrency (10 simultaneous crawls) works
- Page tracking state remains consistent