Commit Graph

1429 Commits

Author SHA1 Message Date
unclecode
3a75dd3f4c fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)
- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
2026-03-07 09:47:38 +00:00
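The deep-merge described for #1251 can be sketched as follows. This is a minimal illustration, not the project's load_config: the keys inside DEFAULT_CONFIG and the helper name deep_merge are assumptions made for the example. The point is that a partial user config overlays the defaults key by key, so a missing section no longer raises KeyError.

```python
from copy import deepcopy

# Hypothetical defaults; the real DEFAULT_CONFIG keys are not shown in the log.
DEFAULT_CONFIG = {
    "app": {"host": "0.0.0.0", "port": 8000},
    "rate_limiting": {"enabled": True},
}


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict: defaults overlaid with overrides, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing the whole section.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A user config that only overrides one nested key.
user_config = {"app": {"port": 11235}}
config = deep_merge(DEFAULT_CONFIG, user_config)
```

With this shape, config["rate_limiting"] is still available even though the user file never mentioned it.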
unclecode
0c9e3c427e Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved)
Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668
Closed as superseded: #1592
Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729
2026-03-07 08:49:32 +00:00
unclecode
7c0cc3ed88 fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668)
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
2026-03-07 08:45:11 +00:00
unclecode
11ed854155 Update CONTRIBUTORS for PR #462 2026-03-07 07:06:49 +00:00
unclecode
697c2b2a58 fix: add newline before opening code fence in html2text (#462)
From PR #462 by @jtanningbed
2026-03-07 07:06:41 +00:00
unclecode
3704758746 Update CONTRIBUTORS for PR #1770 2026-03-07 07:01:54 +00:00
unclecode
04e83aa3c7 docs: modernize deprecated API usage across shipped docs (#1770)
Update docs/examples to use current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache mode descriptions in arun.md

From PR #1770 by @maksimzayats
2026-03-07 07:01:06 +00:00
unclecode
31d0de23df Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list 2026-03-07 06:50:26 +00:00
unclecode
db98aefb03 Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793, #1792, #1794, #1784, #1730 2026-03-07 06:47:03 +00:00
unclecode
761664d29e fix: add TTL expiry for Redis task data to prevent memory growth (#1730)
From PR #1730 by @hoi
2026-03-07 06:17:58 +00:00
unclecode
e47e810aca fix: handle UnicodeEncodeError in URL seeder and strip zero-width chars (#1784)
From PR #1784 by @Br1an67
2026-03-07 06:16:41 +00:00
unclecode
1029815fd4 fix: add Windows support for crawler monitor keyboard input (#1794)
From PR #1794 by @Br1an67
2026-03-07 06:16:12 +00:00
unclecode
d229beeaf8 fix: add wait_for_images option to screenshot endpoint (#1792)
From PR #1792 by @Br1an67
2026-03-07 06:15:54 +00:00
unclecode
c73aa271ac fix: make link_preview_timeout configurable in AdaptiveConfig (#1793)
From PR #1793 by @Br1an67
2026-03-07 06:15:44 +00:00
unclecode
91330ef179 fix: add explicit utf-8 encoding to CLI file output (#1789)
From PR #1789 by @Br1an67
2026-03-07 06:15:32 +00:00
unclecode
d6a8f57fdd docs: fix css_selector type from list to string in examples (#1308)
From PR #1308 by @dominicx
2026-03-07 06:15:14 +00:00
unclecode
e6c2a65625 docs: fix return type annotations to use RunManyReturn (#1716)
From PR #1716 by @YuriNachos
2026-03-07 06:14:49 +00:00
unclecode
5601861555 docs: add missing CacheMode import in quickstart example (#1715)
From PR #1715 by @YuriNachos
2026-03-07 06:13:32 +00:00
unclecode
72cc17c113 docs: fix docstring param name crawler_config -> config (#1494)
From PR #1494 by @AkosLukacs
2026-03-07 06:13:18 +00:00
unclecode
814bc4df47 Update CONTRIBUTORS for PRs #1782, #1788, #1783, #1179 2026-03-07 04:15:49 +00:00
unclecode
93f2f03fab Merge PR #1783: fix: strip port from URL domain in is_external_url comparison
Strip port number from netloc before domain comparison so that
example.com:8080 correctly matches base domain example.com.
2026-03-07 04:15:35 +00:00
unclecode
5f65d2d1fd Merge PR #1788: fix: guard against None LLM content and propagate finish_reason
Adds None check before processing LLM response content in both extract()
and aextract(). When LLM returns no content (e.g. content filter, token
limit), returns an error block with finish_reason instead of crashing.
Also guards the except fallback path against None content.
2026-03-07 04:15:22 +00:00
unclecode
122be00076 Merge PR #1782: fix: preserve class and id attributes in cleaned_html
Add "class" and "id" to IMPORTANT_ATTRS so they survive HTML cleaning.
CSS-based extraction strategies need these attributes to match selectors.
2026-03-07 04:14:21 +00:00
unclecode
4bde952ade Update CONTRIBUTORS for PRs #1787, #1790, #1804 2026-03-07 04:00:36 +00:00
unclecode
ff2ea3429a Merge PR #1804: feat: add score_threshold support to BestFirstCrawlingStrategy
Adds score_threshold parameter (default -inf for backward compatibility)
to BestFirstCrawlingStrategy, matching BFS and DFS strategies. URLs
scoring below the threshold are skipped.
Fixes #1801.
2026-03-07 03:59:28 +00:00
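A rough sketch of the threshold behaviour described above, assuming links arrive as (url, score) pairs. The function name filter_by_score is hypothetical and this is not the strategy's actual code. A default of -inf keeps every URL, which is why the change is backward compatible.

```python
import math


def filter_by_score(scored_urls, score_threshold=-math.inf):
    """Keep (url, score) pairs whose score is not below the threshold."""
    return [(url, score) for url, score in scored_urls if score >= score_threshold]


candidates = [("https://example.com/a", 0.9), ("https://example.com/b", 0.2)]
kept_by_default = filter_by_score(candidates)           # nothing is skipped
kept_with_threshold = filter_by_score(candidates, 0.5)  # low-scoring URL is skipped
```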
unclecode
9ec2969d99 Merge PR #1790: fix: handle nested brackets and parentheses in LINK_PATTERN regex
Improves LINK_PATTERN regex in markdown citation conversion to correctly
handle Wikipedia-style URLs with parentheses and text with nested brackets.
2026-03-07 03:59:17 +00:00
unclecode
bd0f6e1bd5 fix: strip markdown fences in force_json_response path (LLM extraction)
Wire existing _strip_markdown_fences() into the force_json_response
code path in both extract() and aextract(). LLMs frequently wrap JSON
in ```json fences which caused json.loads() to fail.

Inspired by PR #1787 (Br1an67).
2026-03-07 03:59:00 +00:00
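The failure mode above is easy to reproduce: json.loads() rejects a payload that starts with a code fence. Below is a minimal sketch of fence stripping; the function strip_markdown_fences is an illustrative stand-in, not the project's _strip_markdown_fences(), whose implementation is not shown in the log.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Optional language tag after the opening fence, body captured lazily.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fences(text: str) -> str:
    """Return the body of a fenced block, or the stripped text if it is not fenced."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


raw = FENCE + 'json\n{"title": "Example"}\n' + FENCE
parsed = json.loads(strip_markdown_fences(raw))
```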
unclecode
d4588904b3 Update PR-TODOLIST and CONTRIBUTORS for merged PRs #1805, #1763, #1803 2026-03-07 03:40:36 +00:00
unclecode
b008671345 Merge PR #1803: fix from_serializable_dict to ignore plain data dicts with "type" key
Narrows the typed-object deserialization path to only match dicts with
"params" or {"type":"dict","value":{...}}, preventing crashes on normal
data dicts like JSON-Schema fragments that happen to have a "type" key.
2026-03-07 03:21:33 +00:00
unclecode
fdb3f8fd98 Merge PR #1763: fix: return in finally block silently suppressing exceptions
Moves return out of finally block and adds raise in except block so
QUEUE_ERROR exceptions properly propagate in MemoryAdaptiveDispatcher.
2026-03-07 03:21:22 +00:00
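The underlying Python behaviour can be shown in isolation. The two functions below are illustrative only and do not reproduce MemoryAdaptiveDispatcher: the first shows how a return inside finally discards an in-flight exception, the second shows the corrected shape with the return moved after the try statement.

```python
def swallow():
    try:
        raise RuntimeError("queue error")
    finally:
        # Returning here discards the RuntimeError that is propagating.
        return "done"


def propagate():
    status = None
    try:
        raise RuntimeError("queue error")
    except RuntimeError:
        # Re-raise so the caller sees the failure.
        raise
    finally:
        # Cleanup still runs, but no longer hides the exception.
        status = "cleanup ran"
    return status
```

Calling swallow() returns normally, while propagate() raises RuntimeError to the caller.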
unclecode
8a677a9db1 Merge PR #1805: fix: prevent AdaptiveCrawler from crawling external domains
Removes external links from being added to pending_links in digest(),
since _crawl_with_preview() always sets include_external=False.
Fixes #1776.
2026-03-07 03:21:11 +00:00
nightcityblade
78434eadac fix: prevent AdaptiveCrawler from crawling external domains
AdaptiveCrawler.digest() unconditionally added external links to
pending_links, causing the crawler to follow links to entirely
different domains even though include_external=False was set in
LinkPreviewConfig.

Remove external links from being added to pending_links in both the
initial crawl and subsequent crawl loops.

Fixes #1776
2026-03-07 10:57:42 +08:00
nightcityblade
379591047d fix: add score_threshold support to BestFirstCrawlingStrategy
Add score_threshold parameter to BestFirstCrawlingStrategy, matching the
existing behavior in BFSDeepCrawlStrategy and DFSDeepCrawlStrategy.

URLs scoring below the threshold are now skipped during link discovery
instead of being unconditionally enqueued.

Fixes #1801
2026-03-07 10:55:09 +08:00
Soham Kukreti
71a6526459 fix(docker): narrow from_serializable_dict to ignore plain data dicts with "type" key
The typed-object entry condition (`"type" in data`) was too broad: it
also matched plain business dicts that happen to carry a "type" key,
such as JsonCssExtractionStrategy field specs ({"type": "text"}) and
LLMExtractionStrategy JSON Schema fragments ({"type": "string"}).
These were never config objects, but the deserializer tried to treat
them as such, hit the ALLOWED_DESERIALIZE_TYPES allowlist, and raised
a ValueError — causing /crawl to return HTTP 500 for perfectly valid
extraction-strategy payloads.

Fix: narrow the entry condition to require "params" (or "type":"dict"
+ "value"), matching only the shapes that to_serializable_dict() actually
produces. Dicts with "type" but no "params"/"value" fall through to the
raw-dict path and are passed as plain data.

The RCE protection from commit 0104db6 is fully preserved: any real
class-instantiation attack still requires "type" + "params", still
enters the typed path, and is still blocked by the allowlist.

Fixes #1797
2026-03-06 13:10:35 +05:30
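The narrowed entry condition can be expressed as a small predicate. This is a sketch under the assumptions stated in the commit message; the helper name is_typed_payload is hypothetical and the real from_serializable_dict does considerably more (allowlist check, instantiation).

```python
def is_typed_payload(data) -> bool:
    """True only for the shapes a serializer would emit for typed objects."""
    if not isinstance(data, dict) or "type" not in data:
        return False
    if "params" in data:
        # {"type": "SomeConfig", "params": {...}}: goes on to the allowlist check.
        return True
    # {"type": "dict", "value": {...}}: a wrapped plain dict.
    return data["type"] == "dict" and "value" in data


field_spec = {"type": "text"}                            # plain data
schema_fragment = {"type": "string"}                     # plain data
typed_object = {"type": "BrowserConfig", "params": {}}   # typed path
wrapped_dict = {"type": "dict", "value": {"a": 1}}       # typed path
```

Plain dicts that merely carry a "type" key return False and would fall through to the raw-dict path.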
ntohidi
0273b27821 Fix MediaItem crash on non-numeric width values (e.g. "100%", "auto")
Add BeforeValidator to coerce width to int or None, preventing Pydantic
validation errors when HTML contains non-integer width attributes.

Fixes #1635
2026-03-02 09:51:59 +08:00
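A dependency-free sketch of the coercion step. The function name coerce_to_int_or_none is an assumption; in Pydantic v2 a function like this could be attached to a field as Annotated[Optional[int], BeforeValidator(coerce_to_int_or_none)], which is left out here so the example runs without Pydantic installed.

```python
from typing import Any, Optional


def coerce_to_int_or_none(value: Any) -> Optional[int]:
    """Return an int for integer-like input, otherwise None."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    try:
        return int(str(value).strip())
    except ValueError:
        # Values such as "100%" or "auto" are not integers.
        return None
```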
ntohidi
0d151eba82 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2026-03-02 09:42:28 +08:00
Br1an67
669b466667 fix: handle nested brackets and parentheses in LINK_PATTERN regex
The previous regex [^\]]+ stopped at the first ], which broke
markdown links containing embedded images.

The new pattern allows one level of nested [...] in the link text
and one level of nested (...) in the URL, correctly handling:
- Embedded images in link text
- Wikipedia-style URLs with parentheses

Fixes #711
2026-03-02 01:24:02 +08:00
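One way to allow a single level of nesting is sketched below. LINK_RE is written for illustration and is not the project's LINK_PATTERN, which is not shown in the log. Each alternative either consumes one non-bracket character or one complete inner pair.

```python
import re

LINK_RE = re.compile(
    r"\[((?:[^\[\]]|\[[^\[\]]*\])+)\]"   # link text, allowing one nested [...]
    r"\(((?:[^()\s]|\([^()]*\))+)\)"     # URL, allowing one nested (...)
)

# A link whose text is an embedded image.
image_link = LINK_RE.search("[![alt](img.png)](https://example.com)")

# A Wikipedia-style URL that contains parentheses.
wiki_link = LINK_RE.search(
    "[Python](https://en.wikipedia.org/wiki/Python_(programming_language))"
)
```

Here image_link.group(1) is the whole image markup and wiki_link.group(2) keeps the closing parenthesis of the article name.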
Br1an67
b138c949b5 fix: guard against None LLM content and propagate finish_reason
When max_tokens is too small, the LLM may return None content with
finish_reason=MAX_TOKENS. This caused a crash in extraction with
'NoneType' object has no attribute 'startswith'.

Add a None check on LLM response content. When content is None,
return an error block including the finish_reason so callers can
diagnose the issue. Also guard the fallback split_and_parse path
against None content.

Fixes #1606
2026-03-02 01:18:47 +08:00
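The guard can be sketched like this. The function name parse_llm_content and the exact shape of the error block are assumptions for illustration; the log only says that an error block including finish_reason is returned.

```python
import json


def parse_llm_content(content, finish_reason):
    """Parse JSON content, or return an error block when the LLM sent none."""
    if content is None:
        # Without this check, content.startswith(...) would raise AttributeError.
        return [{
            "index": 0,
            "error": True,
            "content": f"LLM returned no content (finish_reason={finish_reason})",
        }]
    return json.loads(content)


empty = parse_llm_content(None, "MAX_TOKENS")
normal = parse_llm_content('[{"name": "widget"}]', "stop")
```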
Br1an67
20488620cd fix: strip port from URL domain in is_external_url comparison
The is_external_url function compared the full netloc (including port)
against base_domain (which has port stripped by get_base_domain).
This caused URLs like http://localhost:8000/page to be wrongly
classified as external when base_domain is 'localhost'.

Strip the port from parsed.netloc before comparison.

Fixes #1503
2026-03-02 00:48:50 +08:00
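A minimal sketch of the comparison using the standard library, where urlparse(...).hostname already excludes the port. The function below is a simplified stand-in for is_external_url; treating subdomains as internal and host-less URLs as internal are assumptions of this sketch.

```python
from urllib.parse import urlparse


def is_external_url(url: str, base_domain: str) -> bool:
    """Compare the port-free hostname against the base domain."""
    host = urlparse(url).hostname  # lowercased, without ":port"
    if not host:
        return False  # relative link or no host: treated as internal here
    base = base_domain.lower()
    return not (host == base or host.endswith("." + base))
```

With this shape, http://localhost:8000/page is internal for base domain localhost, and example.com:8080 matches example.com.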
Br1an67
500d047654 fix: preserve class and id attributes in cleaned_html
Add 'class' and 'id' to IMPORTANT_ATTRS so they are retained when
cleaning HTML attributes. This allows users to use cleaned_html for
further analysis that depends on CSS classes and element IDs.

Fixes #1601
2026-03-02 00:43:23 +08:00
unclecode
0a45c1056d feat: add separate query_llm_config for adaptive crawler query expansion (#1682)
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.
2026-02-27 20:31:51 +08:00
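The fallback chain named above (query_llm_config -> llm_config -> hardcoded defaults) can be sketched as a small resolver. Plain dicts stand in for LLMConfig objects, and both the function name and the default value are hypothetical.

```python
from typing import Optional

# Hypothetical default; the real hardcoded defaults are not shown in the log.
DEFAULT_QUERY_CONFIG = {"provider": "example/chat-model"}


def resolve_query_config(
    query_llm_config: Optional[dict], llm_config: Optional[dict]
) -> dict:
    """Pick the most specific config available for query expansion."""
    if query_llm_config is not None:
        return query_llm_config
    if llm_config is not None:
        return llm_config  # existing single-config setups keep working
    return dict(DEFAULT_QUERY_CONFIG)


chat = {"provider": "example/chat-large"}
shared = {"provider": "example/shared"}
```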
unclecode
a4cc0a9f04 feat: add separate query_llm_config for adaptive crawler query expansion (#1682)
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
2026-02-25 12:26:39 +00:00
unclecode
8f2c2e1f90 docs: add mzyfree to contributors for PR #1689 2026-02-25 07:29:28 +00:00
unclecode
c0912f7234 feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.

Inspired by #1689.
2026-02-25 07:12:28 +00:00
Nasrin
8d35d17d01 Merge pull request #1722 from YuriNachos/fix/issue-1652-md-docstring
fix: Add docstring to MCP tool 'md' endpoint
2026-02-25 06:00:09 +01:00
Nasrin
d419199a4c Merge pull request #1775 from unclecode/fix/issue-1748-screenshot-scroll-delay
Fix/issue 1748 screenshot scroll delay
2026-02-25 05:54:24 +01:00
Ahmed-tawfik94
9cfeb4626d Document scroll_delay parameter for full-page screenshot crawling 2026-02-25 06:52:59 +03:00
Ahmed-tawfik94
cd81e3cd19 Fix scroll_delay ignored in take_screenshot_scroller for full-page screenshots 2026-02-25 06:52:53 +03:00
Nasrin
4f9cc0810b Merge pull request #1764 from PatD42/fix/table-gfm-pipes
Fix: Add leading/trailing pipes to GFM tables (pad_tables=False)
2026-02-25 03:32:54 +01:00
Nasrin
c4cdc02e27 Merge pull request #1761 from AtharvaJaiswal005/fix/total-score-missing-for-failed-head-extraction-1749
Fix total_score not calculated for links that fail head extraction
2026-02-25 02:25:22 +01:00