ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.
Replaced with set(False), which works across context boundaries. This is
safe because deep_crawl_active is never nested: the guard on line 21
prevents re-entry.
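
A minimal sketch of the failure mode and the fix, using only the standard library. Here copy_context() stands in for the separate Task that consumes the generator, and both helper names are hypothetical; only the variable name deep_crawl_active comes from the message above.

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def reset_in_other_context(token):
    """Try reset(token) from a Context other than the one that made the token."""
    other = contextvars.copy_context()  # stands in for a different Task
    try:
        other.run(deep_crawl_active.reset, token)
    except ValueError:
        return "ValueError"
    return "ok"


def clear_in_other_context():
    """set(False) needs no token, so it works from any Context."""
    other = contextvars.copy_context()
    other.run(deep_crawl_active.set, False)
    return other[deep_crawl_active]


token = deep_crawl_active.set(True)  # the token is bound to the current Context
print(reset_in_other_context(token))  # reset() rejects the foreign token
print(clear_in_other_context())       # set(False) succeeds in the other Context
```

Because set(False) does not restore a previous value, it is only equivalent to reset() when the flag is never nested, which is the condition the message relies on.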
arun() always returns CrawlResultContainer, never AsyncGenerator. The
RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused
Pylance/Pyright to flag result.markdown as an error because AsyncGenerator
doesn't have that attribute.
Also adds test_type_annotations.py — 11 static analysis tests that catch
annotation mismatches (return types, missing annotations, export checks)
without needing pyright in CI. These tests would have caught this bug
before it was reported.
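
One way such a check could look, as a sketch rather than the contents of test_type_annotations.py: inspect the resolved return annotation at runtime and flag any AsyncGenerator member. The stand-in class, both arun_* functions, and the helper name are hypothetical.

```python
import collections.abc
from typing import AsyncGenerator, Union, get_args, get_origin, get_type_hints


class CrawlResultContainer:
    """Stand-in for the real result container."""
    markdown = ""


async def arun_fixed(url: str) -> CrawlResultContainer:
    return CrawlResultContainer()


async def arun_old(
    url: str,
) -> Union[CrawlResultContainer, AsyncGenerator[CrawlResultContainer, None]]:
    return CrawlResultContainer()


def return_may_be_async_generator(func) -> bool:
    """True if the declared return type is, or contains, an AsyncGenerator."""
    hint = get_type_hints(func).get("return")
    members = get_args(hint) if get_origin(hint) is Union else (hint,)
    return any(get_origin(m) is collections.abc.AsyncGenerator for m in members)
```

A test can then assert that the helper returns False for arun(), which needs no type checker installed.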
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng).
- Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from
extraction_strategy.py. Dead security-sensitive code is a liability.
The eval path was already disabled; this removes the function itself.
- Fix hook_manager.py module injection: replace broken exec("import X", ns)
pattern (silently failed due to missing __import__) with direct module
injection. Sanitize asyncio to strip subprocess access (RCE vector).
- Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints
unauthenticated).
- Expand adversarial test suite to 87 tests: hook sandbox escapes,
asyncio.subprocess RCE verification, end-to-end exploit payload from
vuln report, dead code deletion checks, codebase eval/exec audit.
When the Docker API receives markdown_generator as JSON with "options"
instead of "params", from_serializable_dict silently passes the raw
dict through. This later crashes with a confusing "'dict' object has
no attribute 'generate_markdown'" deep in the crawl pipeline.
Add type validation for markdown_generator in CrawlerRunConfig.__init__
(matching existing extraction_strategy/chunking_strategy validation).
When a dict slips through, the error now clearly states:
- What type was expected vs received
- That "params" is the required key (not "options")
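
A sketch of the kind of check described above. The class is a stand-in, the function name is hypothetical, and the exception type and exact wording are assumptions rather than the actual implementation.

```python
class MarkdownGenerationStrategy:
    """Stand-in for the real strategy base class."""

    def generate_markdown(self, html):
        return html


def validate_markdown_generator(value):
    """Accept None or a strategy instance; reject anything else with a clear error."""
    if value is None or isinstance(value, MarkdownGenerationStrategy):
        return value
    message = (
        "markdown_generator must be a MarkdownGenerationStrategy, "
        f"got {type(value).__name__}."
    )
    if isinstance(value, dict):
        # A raw dict usually means the JSON payload used the wrong key.
        message += ' In JSON payloads, "params" is the required key (not "options").'
    raise ValueError(message)
```

Failing in the constructor moves the error from deep inside the crawl pipeline to the point where the bad value enters.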
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.
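
The distinction can be illustrated without Starlette itself. The check below is a simplified model of the behavior described above, not Starlette's code; the handler names and gets_request_wrapper are hypothetical.

```python
import inspect


async def sse_function(request):
    """Function-style handler: would be called as handler(request)."""


class SseAsgiApp:
    async def __call__(self, scope, receive, send):
        """Raw ASGI entry point: would be called as app(scope, receive, send)."""


def gets_request_wrapper(endpoint) -> bool:
    """Model of the routing decision: plain functions and bound methods are
    wrapped, anything else is treated as an ASGI app and passed through."""
    return inspect.isfunction(endpoint) or inspect.ismethod(endpoint)


print(gets_request_wrapper(sse_function))   # function: wrapped
print(gets_request_wrapper(SseAsgiApp()))   # class instance: passed through
```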
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.
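
A simplified sketch of the allowlist behavior on both sides. The function names, the one-entry allowlist, and the BrowserConfig stand-in are illustrative assumptions; the real functions handle many more cases.

```python
import logging

ALLOWED_DESERIALIZE_TYPES = {"BrowserConfig"}  # illustrative subset


class BrowserConfig:
    """Stand-in config object that happens to hold a logger."""

    def __init__(self, headless=True, logger=None):
        self.headless = headless
        self.logger = logger


def to_serializable(obj):
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, (list, tuple)):
        return [to_serializable(item) for item in obj]
    if isinstance(obj, dict):
        return {key: to_serializable(value) for key, value in obj.items()}
    name = type(obj).__name__
    if name not in ALLOWED_DESERIALIZE_TYPES:
        return None  # e.g. logging.Logger is dropped instead of serialized
    params = {key: to_serializable(value) for key, value in vars(obj).items()}
    return {"type": name, "params": params}


def from_serializable(data, registry):
    if isinstance(data, dict) and "type" in data:
        cls = registry.get(data["type"])
        if cls is None:
            return None  # unknown type from an older client: no ValueError
        params = data.get("params", {})
        return cls(**{k: from_serializable(v, registry) for k, v in params.items()})
    return data


payload = to_serializable(BrowserConfig(logger=logging.getLogger("demo")))
print(payload)
```

The two sides stay symmetric: anything the serializer would emit, the deserializer can read back.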
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to an empty string; their content is only available via downloaded_files.
Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.
Also update CONTRIBUTORS.md for PR #1829.
- #1370: Freeze element dimensions via CSS before viewport resize in
take_screenshot_scroller() to prevent responsive reflow on Elementor
sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
to abort pending loads; move event listener cleanup outside session_id
guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
set — call arun() directly per URL so the DeepCrawlDecorator can
invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
save_global_config() (cli.py line 58).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #1611: The /llm GET endpoint hardcoded the server's LLM_PROVIDER. Added optional
provider, temperature, base_url query params with fallback to server config.
Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
REDIS_PASSWORD environment variable overrides.
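
A sketch of how the URL could be assembled for #1817. The function name, the defaults (localhost, 6379, db 0), and the use of the rediss scheme for SSL are assumptions; only the config field names and the three environment variables come from the message.

```python
import os
from urllib.parse import quote


def build_redis_url(redis_cfg, env=None):
    """Build a Redis URL from config fields, letting environment variables win."""
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", redis_cfg.get("host", "localhost"))
    port = int(env.get("REDIS_PORT", redis_cfg.get("port", 6379)))
    password = env.get("REDIS_PASSWORD", redis_cfg.get("password"))
    db = redis_cfg.get("db", 0)
    scheme = "rediss" if redis_cfg.get("ssl") else "redis"
    # Percent-encode the password so characters like "@" cannot break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{db}"


print(build_redis_url({"host": "cache", "port": 6380, "db": 2}, env={}))
```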
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.
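
The deduplication step amounts to an order-preserving filter. A minimal sketch, with a hypothetical helper name:

```python
def dedupe_chunks(chunks):
    """Drop exact-text duplicates, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique


print(dedupe_chunks(["intro", "pricing", "intro", "faq", "pricing"]))
```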
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.
Includes 16 tests (8 unit + 8 integration).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- antibot_detector: add <pre> to content elements regex, detect
browser-wrapped JSON in _looks_like_data() so httpbin-style
responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
patterns (/docs/*) instead of matching against full URL, which
always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
and edge cases
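
The deep_crawling/filters change can be sketched as follows. The helper name is hypothetical and fnmatchcase is an assumed matcher; the real filter may match differently.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def matches_prefix_pattern(url, pattern):
    """Match path-only patterns against the URL path, others against the full URL."""
    target = urlparse(url).path if pattern.startswith("/") else url
    return fnmatchcase(target, pattern)


url = "https://example.com/docs/intro"
print(matches_prefix_pattern(url, "/docs/*"))                     # path-only pattern
print(matches_prefix_pattern(url, "https://example.com/docs/*"))  # full-URL pattern
print(fnmatchcase(url, "/docs/*"))                                # old behavior: never matches
```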
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)
Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)
CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
build-push-action v6, setup-buildx v4, login v4 (#1734)
Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
for Unicode preservation in JSON output (#1668)
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.
Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.
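
The fallback chain can be read as a first-non-None lookup. In this sketch the function name and the placeholder default are hypothetical; the real defaults are not shown in this message.

```python
DEFAULT_QUERY_CONFIG = {"provider": "placeholder/chat-model"}  # hypothetical default


def resolve_query_llm_config(query_llm_config=None, llm_config=None):
    """Pick the config for query expansion: explicit, then shared, then default."""
    for candidate in (query_llm_config, llm_config):
        if candidate is not None:
            return candidate
    return dict(DEFAULT_QUERY_CONFIG)
```

Callers that never set query_llm_config keep resolving to the same config as before, which is what keeps the change backward compatible.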
Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).
Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent the janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.
Inspired by #1689.
context.add_init_script() was called in both setup_context() and
_crawl_web(), causing unbounded script accumulation on shared contexts
under concurrent load. Chromium kills the overloaded context, cascading
"Target page, context or browser has been closed" to all concurrent crawls.
Add flag-based dedup: after injecting navigator_overrider or shadow-DOM
scripts, set _crawl4ai_nav_overrider_injected / _crawl4ai_shadow_dom_injected
on the context. Before injecting, check the flag. This preserves context-level
scope (popups/iframes covered) and the fallback for managed/persistent/CDP
paths where setup_context() runs without crawlerRunConfig.
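
A sketch of the flag-based dedup. FakeContext is a synchronous stand-in for a browser context (the real add_init_script call is asynchronous), and inject_once is a hypothetical helper name; the flag name is the one given above.

```python
class FakeContext:
    """Stand-in for a browser context that records injected init scripts."""

    def __init__(self):
        self.init_scripts = []

    def add_init_script(self, script):
        self.init_scripts.append(script)


def inject_once(context, flag_name, script):
    """Inject the script only if this context has not been marked yet."""
    if getattr(context, flag_name, False):
        return False
    context.add_init_script(script)
    setattr(context, flag_name, True)  # mark after injecting, as described above
    return True


ctx = FakeContext()
for _ in range(100):  # simulate many crawls sharing one context
    inject_once(ctx, "_crawl4ai_nav_overrider_injected", "/* navigator overrides */")
print(len(ctx.init_scripts))  # stays at one script however many crawls run
```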
Modern block pages (Reddit, LinkedIn, etc.) serve full SPA shells that
exceed 100KB, bypassing all size-based detection thresholds. As a result,
the fallback (Web Unlocker) never triggered for these sites.
Changes:
- HTTP 403/503 with non-data HTML is now always treated as blocked
regardless of page size (false positives are cheap, fallback rescues them)
- Added Tier 1 deep scan: strips scripts/styles before checking patterns
on large pages, catching block text buried under 100KB+ of CSS/JS
- Added "blocked by network security" as Tier 1 pattern (Reddit et al.)
- Updated tests to reflect new detection philosophy
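
A simplified sketch of the two rules above. The names, the regex, and the looks_like_data heuristic are illustrative assumptions, and the deep scan here runs on every page rather than only on large ones.

```python
import re

# Remove <script> and <style> blocks so block text is not buried under CSS/JS.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TIER1_PATTERNS = ("blocked by network security",)


def looks_like_data(body):
    """Crude stand-in: treat bodies that start like JSON as data, not HTML."""
    return body.lstrip().startswith(("{", "["))


def is_blocked(status_code, body):
    # Rule 1: 403/503 with non-data HTML is blocked regardless of page size.
    if status_code in (403, 503) and not looks_like_data(body):
        return True
    # Rule 2: deep scan on the visible text only.
    visible = _SCRIPT_STYLE.sub(" ", body).lower()
    return any(pattern in visible for pattern in TIER1_PATTERNS)
```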
generate_schema can make up to 5 internal LLM calls (field inference,
schema generation, validation retries) with no way to track token
consumption. Add an optional `usage: TokenUsage = None` parameter that
accumulates prompt/completion/total tokens across all calls in-place.
- _infer_target_json: accept and populate usage accumulator
- agenerate_schema: track usage after every aperform_completion call
in the retry loop, forward usage to _infer_target_json
- generate_schema (sync): forward usage to agenerate_schema
Fully backward-compatible — omitting usage changes nothing.
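
The accumulator pattern, sketched with stand-ins: this TokenUsage dataclass, record_usage, and fake_generate_schema are hypothetical, and the token counts are made-up inputs rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Stand-in with the three counters named above."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0


def record_usage(usage, prompt_tokens, completion_tokens):
    """Add one call's token counts to the accumulator, if one was passed."""
    if usage is None:
        return
    usage.prompt_tokens += prompt_tokens
    usage.completion_tokens += completion_tokens
    usage.total_tokens += prompt_tokens + completion_tokens


def fake_generate_schema(usage=None):
    """Pretend to make two LLM calls and track each one in-place."""
    for prompt_tokens, completion_tokens in ((120, 30), (200, 80)):
        record_usage(usage, prompt_tokens, completion_tokens)
    return {"fields": []}


usage = TokenUsage()
fake_generate_schema(usage=usage)
print(usage)
```

Mutating the caller's object keeps the return value unchanged, so existing callers that ignore usage are unaffected.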
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
shadow DOM content into the light DOM before HTML capture. Uses a
recursive serializer that resolves <slot> projections and strips
only shadow-scoped <style> tags. Also injects an init script to
force-open closed shadow roots via attachShadow patching.
- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
for the less common case of triggering loading before waiting.
- Add JS snippet (flatten_shadow_dom.js), integration test, example,
and documentation across all relevant doc files.
When pad_tables=False (default), html2text generated table rows without
leading/trailing pipe delimiters, producing non-compliant GFM markdown:
Before: A | B | C
After: | A | B | C |
Changes:
- Add leading pipe on first cell, spaced pipe between cells
- Add trailing pipe at end of each row
- Format separator as | --- | --- | instead of ---|---
- Ensure table starts on its own line (soft_br at <table>)
- Handle <caption> element to prevent inline merge with header row
- All changes guarded by `not self.pad_tables` — pad_tables mode unchanged
Includes 13 unit tests covering GFM compliance and pad_tables regression.
Fixes: #1731
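
The target row format can be expressed in two small helpers. These names are hypothetical and not part of html2text; they only show the delimiter layout described above.

```python
def gfm_row(cells):
    """Format one table row with leading and trailing pipes."""
    return "| " + " | ".join(cells) + " |"


def gfm_separator(column_count):
    """Format the header separator row in the same style."""
    return "| " + " | ".join(["---"] * column_count) + " |"


print(gfm_row(["A", "B", "C"]))
print(gfm_separator(3))
```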
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.
- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
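
An illustrative schema using the new key. The "source" value is the one quoted above; the selectors and field names are assumptions about Hacker News markup and may need adjusting.

```python
# Hypothetical schema: the title lives in the item row, the score in the
# next sibling <tr>, reached through "source".
hn_schema = {
    "name": "HN stories",
    "baseSelector": "tr.athing",
    "fields": [
        {"name": "title", "selector": ".titleline > a", "type": "text"},
        {
            "name": "score",
            "source": "+ tr",       # move to the next sibling <tr> first
            "selector": ".score",   # then run the selector there
            "type": "text",
        },
    ],
}

sibling_fields = [f["name"] for f in hn_schema["fields"] if "source" in f]
print(sibling_fields)
```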
The _merge_head_data() function only called calculate_total_score() for
links present in url_to_head_data. Links that failed head extraction
(PDFs, timeouts, non-HTML) hit the else branch and were appended
unchanged, leaving total_score as None even when intrinsic_score was
available.
Added calculate_total_score() calls in both else branches (internal
and external links) so all links get a total_score computed from their
intrinsic_score when head data is unavailable.
Fixes #1749
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.
New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML
New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked
Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.
Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
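
The escalation order can be sketched as a worst-case plan, meaning the sequence tried if every attempt is blocked. The function and its parameters are hypothetical; is_fallback handling and the mapping from max_retries to the number of rounds are not modeled.

```python
def escalation_plan(main_proxy, fallback_proxies, rounds, has_fetch_function):
    """List attempts in order: per round, main proxy then fallbacks;
    after all rounds, the last-resort fetch function."""
    plan = []
    for round_no in range(1, rounds + 1):
        plan.append((round_no, main_proxy))
        plan.extend((round_no, proxy) for proxy in fallback_proxies)
    if has_fetch_function:
        plan.append(("final", "fallback_fetch_function"))
    return plan


for step in escalation_plan("main", ["fb1", "fb2"], rounds=2, has_fetch_function=True):
    print(step)
```

A real run would stop at the first attempt that is not detected as blocked.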
Fix a bug where magic mode and per-request UA overrides would change
the User-Agent header without updating the sec-ch-ua (browser hint)
header to match. Anti-bot systems like Akamai detect this mismatch
as a bot signal.
Changes:
- Regenerate browser_hint via UAGen.generate_client_hints() whenever
the UA is changed at crawl time (magic mode or explicit override)
- Re-apply updated headers to the page via set_extra_http_headers()
- Skip per-crawl UA override for persistent contexts where the UA is
locked at launch time by Playwright's protocol layer
- Move --disable-gpu flags behind enable_stealth check so WebGL works
via SwiftShader when stealth mode is active (missing WebGL is a
detectable headless signal)
- Clean up old test scripts, add clean anti-bot test
Chromium's --proxy-server CLI flag silently ignores inline credentials
(user:pass@server). For persistent contexts, crawl4ai was embedding
credentials in this flag via ManagedBrowser.build_browser_flags(),
causing proxy auth to fail and the browser to fall back to a direct
connection.
Fix: Use Playwright's launch_persistent_context(proxy=...) API instead
of subprocess + CDP when use_persistent_context=True. This handles
proxy authentication properly via the HTTP CONNECT handshake. The
non-persistent and CDP paths remain unchanged.
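
A sketch of separating credentials from the server address. The helper name is hypothetical; the returned dict uses the server/username/password keys that Playwright's proxy option accepts. IPv6 hosts are not handled.

```python
from urllib.parse import unquote, urlsplit


def split_proxy_credentials(proxy_url):
    """Return (credential-free server URL, Playwright-style proxy settings)."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    settings = {"server": server}
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return server, settings


server, settings = split_proxy_credentials("http://user:pass@proxy.example.com:8080")
print(server)    # safe to pass as --proxy-server
print(settings)  # suitable for a proxy=... argument
```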
Changes:
- Strip credentials from --proxy-server flag in build_browser_flags()
- Add launch_persistent_context() path in BrowserManager.start()
- Add cleanup path in BrowserManager.close()
- Guard create_browser_context() when self.browser is None
- Add regression tests covering all 4 proxy/persistence combinations
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
The previous recycle logic waited for all refcounts to hit 0 before
recycling, which never happened under sustained concurrent load (20+
crawls always had at least one active).
New approach:
- Add _browser_version to config signature — bump it to force new contexts
- When threshold is hit: bump version, move old sigs to _pending_cleanup
- New requests get new contexts automatically (different signature)
- Old contexts drain naturally and get cleaned up when refcount hits 0
- Safety cap: max 3 pending browsers draining at once
This means recycling now works under any load pattern — no blocking,
no waiting for quiet moments. Old and new browsers coexist briefly
during transitions.
Includes 12 new tests covering version bumps, concurrent recycling,
safety cap, and edge cases.
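
A toy model of the version-bump approach. ContextRecycler and all of its attributes are hypothetical, the safety cap counts draining signatures rather than browsers, and nothing is actually closed.

```python
class ContextRecycler:
    def __init__(self, threshold, max_pending=3):
        self.threshold = threshold      # contexts handed out before recycling
        self.max_pending = max_pending  # safety cap on draining signatures
        self.version = 0
        self.served = 0
        self.refcounts = {}             # signature -> in-flight crawls
        self.pending_cleanup = set()

    def acquire(self, config_key):
        if self.served >= self.threshold and len(self.pending_cleanup) < self.max_pending:
            self._recycle()
        signature = (config_key, self.version)  # version is part of the signature
        self.refcounts[signature] = self.refcounts.get(signature, 0) + 1
        self.served += 1
        return signature

    def release(self, signature):
        self.refcounts[signature] -= 1
        if self.refcounts[signature] == 0 and signature in self.pending_cleanup:
            # Drained: real code would close the old context here.
            self.pending_cleanup.discard(signature)
            del self.refcounts[signature]

    def _recycle(self):
        self.version += 1  # new requests now get a different signature
        self.served = 0
        for signature, count in list(self.refcounts.items()):
            if count == 0:
                del self.refcounts[signature]        # idle: clean up right away
            else:
                self.pending_cleanup.add(signature)  # busy: let it drain
```

No caller ever waits: requests arriving after the bump simply miss the old signature and get a fresh context.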
contexts_by_config accumulated browser contexts unboundedly in long-running
crawlers (Docker API). Two root causes fixed:
1. _make_config_signature() hashed ~60 CrawlerRunConfig fields but only 7
affect the browser context (proxy_config, locale, timezone_id, geolocation,
override_navigator, simulate_user, magic). Switched from blacklist to
whitelist — non-context fields like word_count_threshold, css_selector,
screenshot, verbose no longer cause unnecessary context creation.
2. No eviction mechanism existed between close() calls. Added refcount
tracking (_context_refcounts, incremented under _contexts_lock in
get_page, decremented in release_page_with_context) and LRU eviction
(_evict_lru_context_locked) that caps contexts at _max_contexts=20,
evicting only idle contexts (refcount==0) oldest-first.
Also fixed: the storage_state path leaked a temporary context on every request
(now explicitly closed after clone_runtime_state).
Closes #943. Credit to @Martichou for the investigation in #1640.
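
Both fixes can be sketched together. The seven field names come from the message; make_config_signature here takes a plain dict, ContextCache is a hypothetical stand-in, the hashing scheme is an assumption, and locking is omitted.

```python
import hashlib
import json
from collections import OrderedDict

CONTEXT_FIELDS = (
    "proxy_config", "locale", "timezone_id", "geolocation",
    "override_navigator", "simulate_user", "magic",
)


def make_config_signature(config):
    """Hash only the fields that affect the browser context (whitelist)."""
    relevant = {name: config.get(name) for name in CONTEXT_FIELDS}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


class ContextCache:
    def __init__(self, max_contexts=20):
        self.max_contexts = max_contexts
        self.contexts = OrderedDict()  # signature -> context, least recent first
        self.refcounts = {}

    def acquire(self, signature, factory):
        if signature in self.contexts:
            self.contexts.move_to_end(signature)  # mark as recently used
        else:
            self._evict_idle()
            self.contexts[signature] = factory()
        self.refcounts[signature] = self.refcounts.get(signature, 0) + 1
        return self.contexts[signature]

    def release(self, signature):
        self.refcounts[signature] -= 1

    def _evict_idle(self):
        """Evict idle contexts, oldest first, until there is room for one more."""
        for signature in list(self.contexts):
            if len(self.contexts) < self.max_contexts:
                break
            if self.refcounts.get(signature, 0) == 0:
                del self.contexts[signature]  # real code would close it
                self.refcounts.pop(signature, None)
```

If every cached context is busy, this sketch lets the cache exceed the cap instead of evicting a context that is in use.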