litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious
code. PyPI has quarantined the entire package, blocking all installs.
Temporarily pin to our own fork at a known-safe version.
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint, which needs raw ASGI access. Fix: use a callable
class instead of an async function. Route passes class instances
through as raw ASGI apps without wrapping.
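The dispatch decision described above can be sketched with the standard
library alone. This is a stand-in that mimics the function-vs-instance check,
not Starlette's actual code; wrap_like_route and SseEndpoint are hypothetical
names used only for illustration.

```python
import asyncio
import inspect


def wrap_like_route(endpoint):
    """Mimic the dispatch decision described above (not Starlette's real code)."""
    if inspect.isfunction(endpoint) or inspect.ismethod(endpoint):
        # Plain functions are treated as request handlers: handler(request).
        async def app(scope, receive, send):
            await endpoint({"scope": scope})  # stand-in for a Request object
        return app
    # Anything else (e.g. a class instance) is used as a raw ASGI app.
    return endpoint


class SseEndpoint:
    """Hypothetical callable class that receives the raw ASGI triple."""

    def __init__(self):
        self.calls = []

    async def __call__(self, scope, receive, send):
        self.calls.append((scope["type"], receive, send))


endpoint = SseEndpoint()
app = wrap_like_route(endpoint)
asyncio.run(app({"type": "http"}, None, None))
```

Because the instance is not a function or bound method, it is returned
unchanged and sees (scope, receive, send) directly.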
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.
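A minimal sketch of the allow-list behaviour, assuming a simplified
serializer. ALLOWED_TYPES, ProxySettings, to_serializable and
from_serializable are illustrative stand-ins, not the project's real
to_serializable_dict / from_serializable_dict implementations.

```python
import logging

# Hypothetical allow-list; the real ALLOWED_DESERIALIZE_TYPES lives in the project.
ALLOWED_TYPES = {"ProxySettings"}


class ProxySettings:
    def __init__(self, server):
        self.server = server


def to_serializable(obj):
    """Serialize allow-listed objects; return None for anything else."""
    if isinstance(obj, (str, int, float, bool, type(None))):
        return obj
    name = type(obj).__name__
    if name not in ALLOWED_TYPES:
        return None  # e.g. logging.Logger is skipped rather than emitted
    return {"type": name, "params": dict(vars(obj))}


def from_serializable(data):
    """Rebuild allow-listed objects; return None for unknown types."""
    if not isinstance(data, dict) or "type" not in data:
        return data
    if data["type"] not in ALLOWED_TYPES:
        return None  # tolerate payloads from older clients
    return ProxySettings(**data["params"])
```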
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Default max_scroll_steps to 10 when not explicitly set (was None/unlimited)
- Wrap _handle_full_page_scan in asyncio.wait_for with page_timeout
- On timeout, log warning and continue with partial scroll instead of hanging
Previously, scan_full_page could hang indefinitely because:
1. max_scroll_steps defaulted to None (no limit)
2. Dynamic pages keep growing total_height on each scroll
3. No asyncio timeout wrapper to interrupt hung coroutines
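The timeout wrapper can be sketched roughly as follows. scroll_page and
scan_with_timeout are hypothetical stand-ins for the real scan code, and the
timeout here is in seconds as asyncio.wait_for expects (an assumption; the
project's page_timeout may use a different unit).

```python
import asyncio


async def scroll_page(progress, max_scroll_steps=10):
    """Stand-in for a full-page scan; each step simulates one scroll."""
    for step in range(max_scroll_steps):
        await asyncio.sleep(0.05)  # pretend the page needs time to load more content
        progress["steps"] = step + 1


async def scan_with_timeout(page_timeout, max_scroll_steps=10):
    """Run the scan, but give up after page_timeout seconds."""
    progress = {"steps": 0, "timed_out": False}
    try:
        await asyncio.wait_for(
            scroll_page(progress, max_scroll_steps), timeout=page_timeout
        )
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine; keep whatever was scrolled so far.
        progress["timed_out"] = True
    return progress
```

Progress is kept in a dict shared with the inner coroutine so that the
partial result survives cancellation.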
When fallback_fetch_function was invoked but failed (exception or empty
response), the final is_blocked() re-check was skipped because
fallback_fetch_used=True. This left crawl_result.success=True even though
the result was a blocked page from the last proxy attempt.
Changed the condition to check resolved_by=='fallback_fetch' (set only on
success) instead of fallback_fetch_used (set before the try block).
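The difference between the two flags can be illustrated with a toy version of
the final check. is_blocked and finalize below are simplified, hypothetical
helpers; only the resolved_by condition mirrors the change described above.

```python
def is_blocked(html):
    """Toy block detector (assumption): real detection is more involved."""
    return "Access Denied" in html


def finalize(html, fallback_fetch):
    """Return (success, resolved_by) after an optional fallback attempt."""
    resolved_by = None
    try:
        fetched = fallback_fetch()
        if fetched:
            html, resolved_by = fetched, "fallback_fetch"  # set only on success
    except Exception:
        pass  # failed fallback: keep the last proxy result and re-check it below
    if resolved_by != "fallback_fetch" and is_blocked(html):
        return False, resolved_by
    return True, resolved_by
```

A flag set before the try block would be True in all three cases (exception,
empty response, success), so it cannot tell them apart.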
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.
Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
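One way to express the header check, as a sketch only: classify_response, the
type sets, and the three return labels are assumptions for illustration and
do not reflect the strategy's actual internals.

```python
from email.message import Message

HTML_TYPES = {"text/html", "application/xhtml+xml"}
TEXT_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}


def classify_response(headers):
    """Return 'html', 'text_file', or 'binary_file' from response headers."""
    msg = Message()
    for key, value in headers.items():
        msg[key] = value
    content_type = msg.get_content_type()  # lower-cased, parameters stripped
    is_attachment = msg.get_content_disposition() == "attachment"
    if content_type in HTML_TYPES:
        # HTML served as an attachment is still a download, but a textual one.
        return "text_file" if is_attachment else "html"
    if content_type in TEXT_TYPES:
        return "text_file"  # would also populate the html field
    return "binary_file"  # html stays empty; bytes only on disk
```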
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).
Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.
Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches
The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.
Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
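A rough sketch of the snapshot idea, assuming the pool is held in plain
module-level dicts. HOT_POOL, COLD_POOL and the returned keys are
hypothetical, and threading.Lock is used here only to show that the reader
never waits on the lock.

```python
import threading

LOCK = threading.Lock()          # still guards slow start/close operations
HOT_POOL = {"sig-a": object()}   # hypothetical pool dictionaries
COLD_POOL = {}


def get_pool_snapshot():
    """Read pool state without taking LOCK; counts may be slightly stale."""
    hot, cold = HOT_POOL.copy(), COLD_POOL.copy()
    return {"hot": hot, "cold": cold, "total": len(hot) + len(cold)}
```

The copies are shallow, so the monitor must treat the contained objects as
read-only.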
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the official Redis apt repository and pin redis-server to 7.2.7, which
patches the Lua use-after-free vulnerability. The REDIS_VERSION build arg
allows override.
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.
Also update CONTRIBUTORS.md for PR #1829.
- #1370: Freeze element dimensions via CSS before viewport resize in
take_screenshot_scroller() to prevent responsive reflow on Elementor
sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
to abort pending loads; move event listener cleanup outside session_id
guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
set — call arun() directly per URL so the DeepCrawlDecorator can
invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
save_global_config() (cli.py line 58).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
provider, temperature, base_url query params with fallback to server config.
Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
REDIS_PASSWORD environment variable overrides.
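For #1817, building the URL from separate fields might look like the sketch
below. build_redis_url is a hypothetical helper and the defaults (localhost,
6379, db 0) are assumptions; only the field names and environment variables
come from the description above.

```python
import os
from urllib.parse import quote


def build_redis_url(cfg, env=os.environ):
    """Assemble a redis:// or rediss:// URL; env vars override config fields."""
    host = env.get("REDIS_HOST", cfg.get("host", "localhost"))
    port = int(env.get("REDIS_PORT", cfg.get("port", 6379)))
    password = env.get("REDIS_PASSWORD", cfg.get("password") or "")
    scheme = "rediss" if cfg.get("ssl") else "redis"
    # Percent-encode the password so characters like '@' do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{cfg.get('db', 0)}"
```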
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.
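Exact-text deduplication that keeps the first occurrence can be sketched as
below; dedupe_chunks is an illustrative helper, not the filter's actual code.

```python
def dedupe_chunks(chunks):
    """Drop exact duplicates, keeping the first occurrence in input order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```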
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.
Includes 16 tests (8 unit + 8 integration).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- antibot_detector: add <pre> to content elements regex, detect
browser-wrapped JSON in _looks_like_data() so httpbin-style
responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
patterns (/docs/*) instead of matching against full URL, which
always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
and edge cases
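The path-only prefix fix can be illustrated with a small sketch.
matches_prefix is a hypothetical helper that handles only trailing-* prefix
patterns; the real filter supports more pattern kinds.

```python
from urllib.parse import urlparse


def matches_prefix(url, pattern):
    """Match a trailing-* prefix pattern against the path or the full URL."""
    prefix = pattern.rstrip("*")
    if pattern.startswith("/"):
        # Path-only pattern such as /docs/*: compare against the URL path.
        return urlparse(url).path.startswith(prefix)
    # Full-URL pattern: compare against the whole URL as before.
    return url.startswith(prefix)
```

Matching /docs/* against the full URL fails because the URL starts with the
scheme and host, never with a slash.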
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
add a fallback for Docker environments where lsof is not installed
- Close #1472 (redirect chain already fixed), #1480 (links already
normalized), #1679 (duplicate of #1509)
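For #1512, adding __aiter__ as an async generator method is one simple way to
support async for. ResultContainer below is a hypothetical stand-in, not the
real CrawlResultContainer.

```python
import asyncio


class ResultContainer:
    """Hypothetical container wrapping a list of results."""

    def __init__(self, results):
        self._results = list(results)

    def __iter__(self):
        return iter(self._results)

    async def __aiter__(self):
        # An async generator method makes `async for` work on the container.
        for result in self._results:
            yield result


async def collect(container):
    """Consume the container with async iteration."""
    return [item async for item in container]
```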
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)
Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)
CI:
- Bump GitHub Actions to latest versions: checkout v6, setup-python v6,
build-push-action v6, setup-buildx v4, login v4 (#1734)
Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
for Unicode preservation in JSON output (#1668)
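The effect of the ensure-ascii setting (#1668) corresponds to the standard
json.dumps ensure_ascii parameter. The record below is made-up sample data.

```python
import json

record = {"title": "東京 café"}  # sample data for illustration

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the original characters are preserved in the output.
preserved = json.dumps(record, ensure_ascii=False)
```

Both strings decode to the same object; only the on-disk representation
differs.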