- antibot_detector: add <pre> to content elements regex, detect
browser-wrapped JSON in _looks_like_data() so httpbin-style
responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
patterns (/docs/*) instead of matching against full URL, which
always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
and edge cases
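For illustration, the path-only prefix fix can be sketched as below. This is a minimal sketch, not the actual filter code: `matches_prefix` is a hypothetical helper, and the pattern syntax (a trailing `*`) is assumed from the `/docs/*` example above.

```python
from urllib.parse import urlparse


def matches_prefix(url: str, pattern: str) -> bool:
    """Hypothetical helper: match a trailing-* prefix pattern against a URL."""
    prefix = pattern.rstrip("*")
    if pattern.startswith("/"):
        # Path-only pattern: compare against the URL path, not the full URL,
        # since a full URL never starts with "/".
        return urlparse(url).path.startswith(prefix)
    # Full-URL prefix: compare against the whole URL, as before.
    return url.startswith(prefix)
```

With this shape, `/docs/*` matches `https://example.com/docs/intro` but not `https://example.com/blog/docs/`.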
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
add a fallback for Docker environments where lsof is not installed
- Close #1472 (redirect chain already fixed), #1480 (links already
normalized), #1679 (duplicate of #1509)
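A rough sketch of what `__aiter__` support (#1512) enables, using a hypothetical `ResultContainer` as a stand-in for CrawlResultContainer; the real class has more behavior than shown here.

```python
import asyncio


class ResultContainer:
    """Hypothetical stand-in for a container of crawl results."""

    def __init__(self, results):
        self._results = list(results)

    def __iter__(self):
        # Plain `for` iteration.
        return iter(self._results)

    async def __aiter__(self):
        # Defining __aiter__ as an async generator makes `async for` work.
        for result in self._results:
            yield result


async def collect(container):
    # Consume the container with `async for`.
    return [item async for item in container]
```

Callers can then write `async for result in container:` whether or not the results were produced by a stream.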
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)
Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)
CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
build-push-action v6, setup-buildx v4, login v4 (#1734)
Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
for Unicode preservation in JSON output (#1668)
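The Unicode-preservation setting presumably maps onto the `ensure_ascii` argument of the standard library's `json.dumps`; that mapping is an assumption here. The sketch below only shows the stdlib behavior, with `render` as a hypothetical helper.

```python
import json


def render(data, ensure_ascii: bool = True) -> str:
    """Hypothetical helper: serialize with or without ASCII escaping."""
    return json.dumps(data, ensure_ascii=ensure_ascii)


sample = {"title": "日本語"}
escaped = render(sample)           # non-ASCII characters become \uXXXX escapes
preserved = render(sample, False)  # non-ASCII characters are kept as-is
```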
Adds None check before processing LLM response content in both extract()
and aextract(). When LLM returns no content (e.g. content filter, token
limit), returns an error block with finish_reason instead of crashing.
Also guards the except fallback path against None content.
Adds score_threshold parameter (default -inf for backward compatibility)
to BestFirstCrawlingStrategy, matching BFS and DFS strategies. URLs
scoring below the threshold are skipped.
Fixes #1801.
Wire existing _strip_markdown_fences() into the force_json_response
code path in both extract() and aextract(). LLMs frequently wrap JSON
in ```json fences which caused json.loads() to fail.
Inspired by PR #1787 (Br1an67).
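A minimal sketch of fence stripping before `json.loads()`. The helper below is hypothetical and is not the library's `_strip_markdown_fences()`; the exact rules the real helper applies are not shown in this message.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly in markdown

# Optional language tag after the opening fence, body captured lazily.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(text: str) -> str:
    """Return the fenced body if the whole text is one fenced block."""
    match = _FENCED.match(text)
    return (match.group(1) if match else text).strip()


def loads_lenient(text: str):
    # Parse JSON whether or not the LLM wrapped it in a fence.
    return json.loads(strip_fences(text))
```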
Narrows the typed-object deserialization path to only match dicts with
"params" or {"type":"dict","value":{...}}, preventing crashes on normal
data dicts like JSON-Schema fragments that happen to have a "type" key.
AdaptiveCrawler.digest() unconditionally added external links to
pending_links, causing the crawler to follow links to entirely
different domains even though include_external=False was set in
LinkPreviewConfig.
Stop adding external links to pending_links in both the initial crawl
and subsequent crawl loops.
Fixes #1776
Add score_threshold parameter to BestFirstCrawlingStrategy, matching the
existing behavior in BFSDeepCrawlStrategy and DFSDeepCrawlStrategy.
URLs scoring below the threshold are now skipped during link discovery
instead of being unconditionally enqueued.
Fixes #1801
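The threshold check can be sketched as follows. `enqueue_links` and the heap-based queue are assumptions for illustration, not the strategy's actual internals; only the `-inf` default and the skip-below-threshold rule come from the description above.

```python
import heapq
import math


def enqueue_links(queue, scored_links, score_threshold=-math.inf):
    """Hypothetical helper: push (url, score) pairs onto a best-first queue."""
    for url, score in scored_links:
        if score < score_threshold:
            # Below the threshold: skip instead of enqueueing.
            continue
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(queue, (-score, url))
    return queue
```

With the default of `-inf` nothing is filtered, which keeps the old behavior.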
The typed-object entry condition (`"type" in data`) was too broad: it
also matched plain business dicts that happen to carry a "type" key,
such as JsonCssExtractionStrategy field specs ({"type": "text"}) and
LLMExtractionStrategy JSON Schema fragments ({"type": "string"}).
These were never config objects, but the deserializer tried to treat
them as such, hit the ALLOWED_DESERIALIZE_TYPES allowlist, and raised
a ValueError — causing /crawl to return HTTP 500 for perfectly valid
extraction-strategy payloads.
Fix: narrow the entry condition to require "params" (or "type":"dict"
+ "value"), matching only the shapes that to_serializable_dict() actually
produces. Dicts with "type" but no "params"/"value" fall through to the
raw-dict path and are passed as plain data.
The RCE protection from commit 0104db6 is fully preserved: any real
class-instantiation attack still requires "type" + "params", still
enters the typed path, and is still blocked by the allowlist.
Fixes #1797
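The narrowed entry condition can be sketched roughly as below. `is_typed_object` is a hypothetical name, and the real deserializer does more than this predicate; the sketch only shows which dict shapes enter the typed path.

```python
def is_typed_object(data) -> bool:
    """Hypothetical predicate: does this dict look like a serialized config?"""
    if not isinstance(data, dict) or "type" not in data:
        return False
    if "params" in data:
        # {"type": "<ClassName>", "params": {...}} goes through the allowlist.
        return True
    # {"type": "dict", "value": {...}} is the wrapped plain-dict shape.
    return data["type"] == "dict" and "value" in data
```

Field specs such as `{"type": "text"}` and schema fragments such as `{"type": "string"}` now return False and are passed through as plain data.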
The previous regex [^\]]+ stopped at the first ], which broke
markdown links whose link text contains an embedded image.
The new pattern allows one level of nested [...] in the link text
and one level of nested (...) in the URL, correctly handling:
- Embedded images in link text
- Wikipedia-style URLs with parentheses
Fixes #711
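One way to write a pattern with a single level of nesting is sketched below. This is an illustrative regex under that assumption, not necessarily the exact pattern committed.

```python
import re

LINK_RE = re.compile(
    r"\[((?:[^\[\]]|\[[^\[\]]*\])*)\]"   # link text, one nested [...] level
    r"\(((?:[^()\s]|\([^()\s]*\))+)\)"   # URL, one nested (...) level
)


def parse_link(markdown: str):
    """Return (text, url) for the first link found, or None."""
    match = LINK_RE.search(markdown)
    return match.groups() if match else None
```

Because each alternation branch starts with a different character class, the pattern does not backtrack heavily on long inputs.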
When max_tokens is too small, the LLM may return None content with
finish_reason=MAX_TOKENS. This caused a crash in extraction with
'NoneType' object has no attribute 'startswith'.
Add a None check on LLM response content. When content is None,
return an error block including the finish_reason so callers can
diagnose the issue. Also guard the fallback split_and_parse path
against None content.
Fixes #1606
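The guard can be sketched as follows. `parse_llm_content` and the shape of the error block are assumptions for illustration; only the None check and the inclusion of finish_reason come from the description above.

```python
import json


def parse_llm_content(content, finish_reason):
    """Hypothetical helper: parse LLM output, tolerating missing content."""
    if content is None:
        # No content (e.g. content filter or token limit): report why
        # instead of calling string methods on None.
        return [{
            "error": True,
            "content": f"LLM returned no content (finish_reason: {finish_reason})",
        }]
    return json.loads(content)
```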
The is_external_url function compared the full netloc (including port)
against base_domain (which has port stripped by get_base_domain).
This caused URLs like http://localhost:8000/page to be wrongly
classified as external when base_domain is 'localhost'.
Strip the port from parsed.netloc before comparison.
Fixes #1503
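A simplified stand-in for the comparison is sketched below. It uses `urlparse().hostname`, which drops the port, rather than manually stripping it from `netloc` as the fix does; the subdomain rule and the treatment of relative URLs are assumptions.

```python
from urllib.parse import urlparse


def is_external(url: str, base_domain: str) -> bool:
    """Hypothetical simplified check, not the library's is_external_url."""
    host = urlparse(url).hostname  # lowercased, port removed
    if not host:
        # Relative URL: treated as internal in this sketch.
        return False
    base = base_domain.lower()
    return not (host == base or host.endswith("." + base))
```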
Add 'class' and 'id' to IMPORTANT_ATTRS so they are retained when
cleaning HTML attributes. This allows users to use cleaned_html for
further analysis that depends on CSS classes and element IDs.
Fixes #1601
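The effect on attribute cleaning can be sketched as below. The contents of the set other than 'class' and 'id' are assumed for illustration, and `clean_attrs` is a hypothetical helper rather than the library's cleaning routine.

```python
# Assumed example set; only "class" and "id" are confirmed by this change.
IMPORTANT_ATTRS = {"src", "href", "alt", "title", "class", "id"}


def clean_attrs(attrs: dict) -> dict:
    """Hypothetical helper: keep only attributes listed in IMPORTANT_ATTRS."""
    return {name: value for name, value in attrs.items() if name in IMPORTANT_ATTRS}
```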
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.
Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.
Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).
Inspired by PR #1683 from @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.
Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
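The fallback chain (query_llm_config -> llm_config -> hardcoded defaults) can be sketched as follows. `resolve_query_config` is a hypothetical helper, and the default value shown is a placeholder, not the library's actual hardcoded default.

```python
# Placeholder only; the real hardcoded defaults are not shown in this message.
PLACEHOLDER_DEFAULT = {"provider": "placeholder/default-model"}


def resolve_query_config(query_llm_config=None, llm_config=None):
    """Hypothetical helper: pick the config used for query expansion."""
    for candidate in (query_llm_config, llm_config):
        if candidate is not None:
            # First explicitly provided config wins.
            return candidate
    return PLACEHOLDER_DEFAULT
```

Existing callers that set only llm_config keep their current behavior, since query_llm_config defaults to None.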
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.
Inspired by #1689.