The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to an empty string; their content is available only via downloaded_files.
Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
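A minimal sketch of the header check described above, assuming a plain
dict of response headers. The function name classify_response and the
MIME type sets are hypothetical and are not taken from the strategy's
actual code.

```python
from email.message import Message

# Assumed MIME sets for illustration only; the real strategy may differ.
HTML_TYPES = {"text/html", "application/xhtml+xml"}
TEXT_FILE_TYPES = {"application/json", "application/xml"}


def classify_response(headers: dict[str, str]) -> str:
    """Return 'html', 'text_file', or 'binary_file' for a response."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value  # Message header lookups are case-insensitive

    content_type = msg.get_content_type()  # lower-cased, parameters stripped
    is_attachment = msg.get_content_disposition() == "attachment"

    if content_type in HTML_TYPES and not is_attachment:
        return "html"
    if content_type.startswith("text/") or content_type in TEXT_FILE_TYPES:
        return "text_file"  # bytes saved to disk, html field also populated
    return "binary_file"  # bytes saved to disk, html left empty
```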
The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.
Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
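A rough sketch of the lock-free snapshot idea. Only the names LOCK and
get_pool_snapshot() come from this change; the pool dict names and the
returned keys are assumptions made for illustration.

```python
import threading

LOCK = threading.Lock()  # still guards slow start/close operations

# Hypothetical pool registries keyed by browser signature.
HOT_POOL: dict[str, object] = {}
COLD_POOL: dict[str, object] = {}


def get_pool_snapshot() -> dict[str, dict[str, object]]:
    """Return shallow copies of the pool dicts without acquiring LOCK.

    The copies may be slightly stale, which is acceptable for monitoring.
    """
    return {"hot": HOT_POOL.copy(), "cold": COLD_POOL.copy()}
```

Because the reader never waits on LOCK, a long browser start or close
cannot stall the monitor; it only sees counts from a moment earlier.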
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add official Redis apt repository and pin redis-server to 7.2.7 which
patches the Lua use-after-free vulnerability. REDIS_VERSION build arg
allows override.
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.
Also update CONTRIBUTORS.md for PR #1829.
- #1370: Freeze element dimensions via CSS before viewport resize in
take_screenshot_scroller() to prevent responsive reflow on Elementor
sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
to abort pending loads; move event listener cleanup outside session_id
guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
set — call arun() directly per URL so the DeepCrawlDecorator can
invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
save_global_config() (cli.py line 58).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
provider, temperature, base_url query params with fallback to server config.
Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
REDIS_PASSWORD environment variable overrides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
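The Redis URL construction for #1817 could look roughly like the sketch
below. The helper name build_redis_url, the defaults, and the precedence
of environment variables over config are assumptions; only the config
field names and environment variable names come from the message above.

```python
import os
from urllib.parse import quote


def build_redis_url(cfg: dict, env=os.environ) -> str:
    """Build a Redis URL from discrete config fields, with env overrides."""
    host = env.get("REDIS_HOST", cfg.get("host", "localhost"))
    port = int(env.get("REDIS_PORT", cfg.get("port", 6379)))
    password = env.get("REDIS_PASSWORD", cfg.get("password") or "")
    db = cfg.get("db", 0)
    scheme = "rediss" if cfg.get("ssl") else "redis"  # rediss:// means TLS
    # Percent-encode the password so characters like '@' do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{db}"
```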
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
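The deduplication step described above amounts to an order-preserving
exact-match filter. A minimal sketch, with a hypothetical helper name
(the real change lives inside filter_content()):

```python
def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact duplicate chunks, keeping the first occurrence in order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in chunks:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique
```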
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.
Includes 16 tests (8 unit + 8 integration).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- antibot_detector: add <pre> to content elements regex, detect
browser-wrapped JSON in _looks_like_data() so httpbin-style
responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
patterns (/docs/*) instead of matching against full URL, which
always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
and edge cases
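To illustrate the deep_crawling/filters fix: a path-only prefix such as
/docs/* can never match the start of a full URL, so it has to be compared
against the parsed path. The function below is a simplified stand-in and
not the filter's actual implementation.

```python
from urllib.parse import urlparse


def matches_prefix(url: str, pattern: str) -> bool:
    """Match a trailing-wildcard prefix pattern against a URL."""
    prefix = pattern.rstrip("*")
    if prefix.startswith("/"):
        # Path-only pattern: compare against the URL's path component.
        return urlparse(url).path.startswith(prefix)
    # Full-URL pattern: compare against the whole URL as before.
    return url.startswith(prefix)
```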
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
add a fallback for Docker environments where lsof is not installed
- Close #1472 (redirect chain already fixed), #1480 (links already
normalized), #1679 (duplicate of #1509)
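For #1512, adding __aiter__ to a list-like container is enough to make
async for work. A toy sketch follows; ResultContainer is a stand-in class
and does not reproduce CrawlResultContainer's real fields.

```python
import asyncio


class ResultContainer:
    """Toy container that supports both for and async for."""

    def __init__(self, results):
        self._results = list(results)

    def __iter__(self):
        return iter(self._results)

    async def __aiter__(self):
        # An async generator method: calling it returns an async iterator.
        for result in self._results:
            yield result


async def collect(container):
    return [item async for item in container]
```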
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)
Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)
CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
build-push-action v6, setup-buildx v4, login v4 (#1734)
Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
for Unicode preservation in JSON output (#1668)
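The effect of the ensure-ascii setting can be shown with the standard
library alone. How the CLI flag is wired into the output path is not
shown here; dump_result is a hypothetical helper.

```python
import json


def dump_result(data, ensure_ascii: bool = True) -> str:
    """Serialize data, optionally keeping non-ASCII characters as-is."""
    return json.dumps(data, ensure_ascii=ensure_ascii)
```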
Adds None check before processing LLM response content in both extract()
and aextract(). When LLM returns no content (e.g. content filter, token
limit), returns an error block with finish_reason instead of crashing.
Also guards the except fallback path against None content.
Adds score_threshold parameter (default -inf for backward compatibility)
to BestFirstCrawlingStrategy, matching BFS and DFS strategies. URLs
scoring below the threshold are skipped.
Fixes #1801.
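The threshold behavior can be sketched as a filter over scored URLs. The
helper name and the (url, score) tuple shape are assumptions; the default
of -inf keeps every URL, which is what preserves backward compatibility.

```python
import math


def filter_by_score(scored_urls, score_threshold: float = -math.inf):
    """Keep only (url, score) pairs at or above the threshold."""
    return [(url, score) for url, score in scored_urls if score >= score_threshold]
```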
Wire existing _strip_markdown_fences() into the force_json_response
code path in both extract() and aextract(). LLMs frequently wrap JSON
in ```json fences which caused json.loads() to fail.
Inspired by PR #1787 (Br1an67).
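A sketch of what fence stripping before json.loads() might look like.
This is an illustrative stand-in, not the body of the project's
_strip_markdown_fences(); the regex and function name are assumptions.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick markdown fence

_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fences(text: str) -> str:
    """Return the fenced body if text is one fenced block, else text unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text


def parse_llm_json(text: str):
    """Parse JSON that may or may not be wrapped in a markdown fence."""
    return json.loads(strip_markdown_fences(text))
```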
Narrows the typed-object deserialization path to only match dicts with
"params" or {"type":"dict","value":{...}}, preventing crashes on normal
data dicts like JSON-Schema fragments that happen to have a "type" key.
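The narrowed check can be sketched as a small predicate. The function
name is hypothetical; the two accepted shapes are the ones named above.

```python
def looks_like_typed_object(data) -> bool:
    """True only for serialized typed objects, not arbitrary dicts with 'type'."""
    if not isinstance(data, dict) or "type" not in data:
        return False
    if "params" in data:
        return True  # shape: {"type": "...", "params": {...}}
    # shape: {"type": "dict", "value": {...}}
    return data["type"] == "dict" and isinstance(data.get("value"), dict)
```

A JSON-Schema fragment such as {"type": "object", "properties": {}} has
a "type" key but matches neither shape, so it is passed through as plain
data.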
AdaptiveCrawler.digest() unconditionally added external links to
pending_links, causing the crawler to follow links to entirely
different domains even though include_external=False was set in
LinkPreviewConfig.
Stop adding external links to pending_links in both the initial crawl
and subsequent crawl loops.
Fixes #1776
Add score_threshold parameter to BestFirstCrawlingStrategy, matching the
existing behavior in BFSDeepCrawlStrategy and DFSDeepCrawlStrategy.
URLs scoring below the threshold are now skipped during link discovery
instead of being unconditionally enqueued.
Fixes #1801