ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.
Replaced reset() with set(False), which works across context boundaries.
This is safe because deep_crawl_active is never nested: the guard on
line 21 prevents re-entry.
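
A minimal sketch of the failure mode, using two copied Contexts to stand in for the two Tasks. Only the name deep_crawl_active comes from the message above; the helper functions are illustrative.

```python
import contextvars

# Stand-in for the real flag; the default means "no deep crawl running".
deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)

def enter_crawl():
    # set() returns a Token bound to whichever Context is current right now.
    return deep_crawl_active.set(True)

def leave_crawl(token):
    try:
        deep_crawl_active.reset(token)
        return "reset"
    except ValueError:
        # The token was created in a different Context; a plain set() has
        # no such restriction.
        deep_crawl_active.set(False)
        return "set(False)"

ctx_a = contextvars.copy_context()  # plays the Task that started the crawl
ctx_b = contextvars.copy_context()  # plays the Task that consumes the stream

token = ctx_a.run(enter_crawl)
outcome = ctx_b.run(leave_crawl, token)           # cross-context cleanup
same_ctx_outcome = ctx_a.run(leave_crawl, token)  # same-context cleanup
```

The fallback only restores the right value because the flag is never nested; with nested set() calls, set(False) would discard the outer value.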
Remove broken re-import of load_nltk_punkt (already imported at module level).
Replace list(set(sens)) with plain return — set() destroyed document order
and silently dropped duplicate sentences.
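
For illustration, a small sketch of why the set() round-trip was lossy. The sample sentences are made up.

```python
sens = ["Second point.", "First point.", "Second point."]

# Old behaviour: the repeated sentence collapses and order is arbitrary.
deduped = list(set(sens))

# New behaviour: the list is returned as-is, repeats and order intact.
preserved = list(sens)
```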
arun() always returns CrawlResultContainer, never AsyncGenerator. The
RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused
Pylance/Pyright to flag result.markdown as an error because AsyncGenerator
doesn't have that attribute.
Also adds test_type_annotations.py — 11 static analysis tests that catch
annotation mismatches (return types, missing annotations, export checks)
without needing pyright in CI. These tests would have caught this bug
before it was reported.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
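
A rough sketch of the kind of annotation check described above, not the contents of test_type_annotations.py. The classes and functions are stand-ins defined inline, and returns_union is a hypothetical helper.

```python
import typing
from typing import AsyncGenerator, Union

class CrawlResultContainer:
    """Stand-in for the real result container."""
    markdown: str = ""

# The union that previously served as the return annotation.
RunManyReturn = Union[
    CrawlResultContainer, AsyncGenerator[CrawlResultContainer, None]
]

async def arun_old(url: str) -> RunManyReturn: ...
async def arun_new(url: str) -> CrawlResultContainer: ...

def returns_union(func) -> bool:
    """Return True if the function's return annotation is a Union."""
    hint = typing.get_type_hints(func).get("return")
    return typing.get_origin(hint) is Union
```

A test can then assert that returns_union() is False for any method whose callers access attributes directly on the result.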
Block mode returns an internal index/tags/content format that is
rarely useful. Schema mode returns clean structured JSON, either
matching a provided schema or inferred from the instruction.
Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng).
- Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from
extraction_strategy.py. Dead security-sensitive code is a liability.
The eval path was already disabled; this removes the function itself.
- Fix hook_manager.py module injection: replace broken exec("import X", ns)
pattern (silently failed due to missing __import__) with direct module
injection. Sanitize asyncio to strip subprocess access (RCE vector).
- Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints
unauthenticated).
- Expand adversarial test suite to 87 tests: hook sandbox escapes,
asyncio.subprocess RCE verification, end-to-end exploit payload from
vuln report, dead code deletion checks, codebase eval/exec audit.
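
The module-injection fix can be sketched as follows. This is an illustration under assumptions, not the hook_manager.py code: the namespace layout and the asyncio allowlist are hypothetical, and a facade like this is not a complete sandbox on its own.

```python
import asyncio
import json
import types

# Restricted namespace with empty builtins, so __import__ is unavailable.
ns = {"__builtins__": {}}

try:
    exec("import json", ns)
    import_worked = True
except ImportError:
    # The import statement needs __import__ from builtins, which is missing.
    import_worked = False

# Direct injection: place the module object into the namespace instead.
ns["json"] = json
exec("result = json.dumps({'ok': True})", ns)

# Expose only selected asyncio names (illustrative list), leaving out
# asyncio.subprocess and the create_subprocess_* helpers.
safe_asyncio = types.SimpleNamespace(
    sleep=asyncio.sleep,
    gather=asyncio.gather,
)
ns["asyncio"] = safe_asyncio
```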
When the Docker API receives markdown_generator as JSON with "options"
instead of "params", from_serializable_dict silently passes the raw
dict through. This later crashes with a confusing "'dict' object has
no attribute 'generate_markdown'" error deep in the crawl pipeline.
Add type validation for markdown_generator in CrawlerRunConfig.__init__
(matching existing extraction_strategy/chunking_strategy validation).
When a dict slips through, the error now clearly states:
- What type was expected vs received
- That "params" is the required key (not "options")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
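
A simplified sketch of the validation described above. Both classes are stand-ins, and the exception type and message wording are assumptions rather than the actual CrawlerRunConfig code.

```python
class MarkdownGenerationStrategy:
    """Stand-in base class for markdown generators."""
    def generate_markdown(self, html: str) -> str:
        return html

class CrawlerRunConfigSketch:
    """Hypothetical config that validates markdown_generator up front."""
    def __init__(self, markdown_generator=None):
        if markdown_generator is not None and not isinstance(
            markdown_generator, MarkdownGenerationStrategy
        ):
            hint = ""
            if isinstance(markdown_generator, dict):
                # A raw dict usually means the serialized form was malformed.
                hint = ' Serialized objects must use the "params" key, not "options".'
            raise ValueError(
                "markdown_generator must be a MarkdownGenerationStrategy, "
                f"got {type(markdown_generator).__name__}.{hint}"
            )
        self.markdown_generator = markdown_generator

try:
    CrawlerRunConfigSketch(markdown_generator={"type": "X", "options": {}})
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```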
litellm 1.82.7-1.82.8 on PyPI were compromised with malicious code.
PyPI quarantined the entire package (all versions uninstallable).
Switched to unclecode-litellm==1.81.13, a pre-compromise fork published
under our own PyPI account. Drop-in replacement — all imports unchanged.
litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious
code. PyPI has quarantined the entire package, blocking all installs.
Temporarily pin to our own fork at a known-safe version.
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
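
The distinction can be sketched with the standard library alone. It assumes Route decides whether to wrap an endpoint with a check along the lines of inspect.isfunction() or inspect.ismethod(); the endpoint names and would_be_wrapped are hypothetical.

```python
import inspect

async def sse_function(scope, receive, send):
    """Plain async function: would be treated as handler(request)."""

class SseEndpoint:
    """Callable class: an instance is neither a function nor a method."""
    async def __call__(self, scope, receive, send):
        pass

def would_be_wrapped(endpoint) -> bool:
    # Mirrors the assumed check that selects the request/response wrapper.
    return inspect.isfunction(endpoint) or inspect.ismethod(endpoint)

function_wrapped = would_be_wrapped(sse_function)
instance_wrapped = would_be_wrapped(SseEndpoint())
```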
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
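
A reduced sketch of the allowlist behaviour on both sides. The function names follow the message above, but the bodies, the CacheConfig class, and the REGISTRY mapping are simplified stand-ins.

```python
import logging

class CacheConfig:
    """Hypothetical allowlisted config type."""
    def __init__(self, ttl=60):
        self.ttl = ttl

REGISTRY = {"CacheConfig": CacheConfig}
ALLOWED_DESERIALIZE_TYPES = set(REGISTRY)

def to_serializable_dict(obj):
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, dict):
        return {k: to_serializable_dict(v) for k, v in obj.items()}
    name = type(obj).__name__
    if name not in ALLOWED_DESERIALIZE_TYPES:
        return None  # skip types the receiver could never rebuild
    params = {k: to_serializable_dict(v) for k, v in vars(obj).items()}
    return {"type": name, "params": params}

def from_serializable_dict(data):
    if isinstance(data, dict) and "type" in data:
        if data["type"] not in ALLOWED_DESERIALIZE_TYPES:
            return None  # tolerate unknown types from older clients
        params = data.get("params", {})
        kwargs = {k: from_serializable_dict(v) for k, v in params.items()}
        return REGISTRY[data["type"]](**kwargs)
    return data

logger_payload = to_serializable_dict(logging.getLogger("demo"))
config_payload = to_serializable_dict(CacheConfig(ttl=5))
```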
Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Default max_scroll_steps to 10 when not explicitly set (was None/unlimited)
- Wrap _handle_full_page_scan in asyncio.wait_for with page_timeout
- On timeout, log warning and continue with partial scroll instead of hanging
Previously, scan_full_page could hang indefinitely because:
1. max_scroll_steps defaulted to None (no limit)
2. Dynamic pages keep growing total_height on each scroll
3. No asyncio timeout wrapper to interrupt hung coroutines
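
The two safeguards, a step cap and a timeout wrapper, can be sketched like this. The function names, state dict, and timings are illustrative; only max_scroll_steps and page_timeout echo the message above.

```python
import asyncio

async def scroll_page(state, max_scroll_steps=10, step_delay=0.01):
    # Bounded loop: stops after max_scroll_steps even if the page keeps growing.
    for _ in range(max_scroll_steps):
        await asyncio.sleep(step_delay)  # stands in for one scroll and wait
        state["steps"] += 1

async def scan_with_timeout(page_timeout, max_scroll_steps=10, step_delay=0.01):
    state = {"steps": 0, "timed_out": False}
    try:
        await asyncio.wait_for(
            scroll_page(state, max_scroll_steps, step_delay),
            timeout=page_timeout,
        )
    except asyncio.TimeoutError:
        # Keep whatever was scrolled so far instead of hanging or failing.
        state["timed_out"] = True
    return state

fast = asyncio.run(scan_with_timeout(page_timeout=5, max_scroll_steps=3))
slow = asyncio.run(
    scan_with_timeout(page_timeout=0.01, max_scroll_steps=50, step_delay=0.2)
)
```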
When fallback_fetch_function was invoked but failed (exception or empty
response), the final is_blocked() re-check was skipped because
fallback_fetch_used=True. This left crawl_result.success=True even though
the result was a blocked page from the last proxy attempt.
Changed the condition to check resolved_by=='fallback_fetch' (set only on
success) instead of fallback_fetch_used (set before the try block).
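
A condensed sketch of the corrected control flow. The function signature and the toy is_blocked check are assumptions; only the two flag names come from the message above.

```python
def crawl_with_fallback(fallback_fetch, is_blocked, last_html):
    """Return (success, resolved_by) after one fallback attempt."""
    resolved_by = None
    fallback_fetch_used = True  # set before the try block, whatever happens
    try:
        html = fallback_fetch()
        if html:
            last_html = html
            resolved_by = "fallback_fetch"  # set only on success
    except Exception:
        pass  # a failed fallback leaves the last proxy result in place

    success = True
    # Re-check unless the fallback actually produced the content.
    if resolved_by != "fallback_fetch" and is_blocked(last_html):
        success = False
    return success, resolved_by

def failing_fetch():
    raise RuntimeError("fallback unavailable")

def is_blocked(html):
    return "captcha" in html

failed = crawl_with_fallback(failing_fetch, is_blocked, "<html>captcha</html>")
worked = crawl_with_fallback(lambda: "<html>ok</html>", is_blocked, "<html>captcha</html>")
```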
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.
Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
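
One way the header check could look, as a sketch only. classify_response and the two type sets are hypothetical; the real strategy may recognise more types and parse headers differently.

```python
HTML_TYPES = {"text/html", "application/xhtml+xml"}
TEXT_FILE_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}

def classify_response(headers):
    """Decide whether a response is a file download and whether the body
    should also populate the html field."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # A missing Content-Type is treated as HTML in this sketch.
    content_type = lowered.get("content-type", "text/html")
    content_type = content_type.split(";")[0].strip().lower()
    disposition = lowered.get("content-disposition", "").strip().lower()

    is_attachment = disposition.startswith("attachment")
    is_file = is_attachment or content_type not in HTML_TYPES
    is_text_file = is_file and content_type in TEXT_FILE_TYPES
    return {
        "is_file": is_file,
        # Text-based files keep html populated; binary files leave it empty.
        "populate_html": (not is_file) or is_text_file,
    }

page = classify_response({"Content-Type": "text/html; charset=utf-8"})
pdf = classify_response({"Content-Type": "application/pdf"})
csv_file = classify_response({
    "Content-Type": "text/csv",
    "Content-Disposition": 'attachment; filename="data.csv"',
})
```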
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).
Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.
Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches
The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.
Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
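
A toy version of the snapshot approach. The pool dictionaries and the threading lock are illustrative stand-ins, not the crawler_pool.py structures.

```python
import threading

LOCK = threading.Lock()  # stands in for the pool's global lock
HOT_POOL = {}            # hypothetical pool: signature -> browser
COLD_POOL = {}

def get_pool_snapshot():
    """Read pool state without taking LOCK; counts may be slightly stale."""
    return {
        "hot": HOT_POOL.copy(),    # shallow copy, safe to iterate later
        "cold": COLD_POOL.copy(),
        "hot_count": len(HOT_POOL),
    }

HOT_POOL["sig-1"] = object()

# Even while another operation holds the lock, the monitor is not blocked.
with LOCK:
    snapshot = get_pool_snapshot()

# Mutating the snapshot does not touch the live pool.
snapshot["hot"].clear()
```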
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the official Redis apt repository and pin redis-server to 7.2.7, which
patches the Lua use-after-free vulnerability. The REDIS_VERSION build arg
allows overriding the pin.
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.
Also update CONTRIBUTORS.md for PR #1829.
- #1370: Freeze element dimensions via CSS before viewport resize in
take_screenshot_scroller() to prevent responsive reflow on Elementor
sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
to abort pending loads; move event listener cleanup outside session_id
guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
set — call arun() directly per URL so the DeepCrawlDecorator can
invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
save_global_config() (cli.py line 58).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #1611: The /llm GET endpoint hardcoded the server's LLM_PROVIDER. Added
optional provider, temperature, and base_url query params with fallback to
server config, consistent with the /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
REDIS_PASSWORD environment variable overrides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
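
A sketch of how the Redis URL for #1817 might be assembled. build_redis_url, its defaults, and the precedence of environment variables over config are assumptions; the field and variable names come from the message above.

```python
import os
from urllib.parse import quote

def build_redis_url(redis_cfg, env=None):
    """Build a redis:// or rediss:// URL from config fields, letting
    REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD override the config."""
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", redis_cfg.get("host", "localhost"))
    port = int(env.get("REDIS_PORT", redis_cfg.get("port", 6379)))
    password = env.get("REDIS_PASSWORD", redis_cfg.get("password"))
    db = redis_cfg.get("db", 0)
    scheme = "rediss" if redis_cfg.get("ssl") else "redis"
    # Percent-encode the password so characters like "@" stay unambiguous.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{db}"

plain_url = build_redis_url({"host": "cache", "port": 6380, "db": 2}, env={})
secure_url = build_redis_url(
    {"ssl": True},
    env={"REDIS_HOST": "r", "REDIS_PASSWORD": "p@ss"},
)
```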
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
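
The deduplication step amounts to an order-preserving filter, sketched here. dedupe_preserving_order is a hypothetical helper, not the BM25ContentFilter code.

```python
def dedupe_preserving_order(chunks):
    """Drop exact-text repeats, keeping the first occurrence of each chunk."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique

filtered = dedupe_preserving_order(["intro", "pricing", "intro", "faq", "pricing"])
```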
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.
Includes 16 tests (8 unit + 8 integration).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>