crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 07:48:50 +00:00

Author	SHA1	Message	Date
unclecode	11b45760da	fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization - antibot_detector: add <pre> to content elements regex, detect browser-wrapped JSON in _looks_like_data() so httpbin-style responses are not flagged as blocked - deep_crawling/filters: use urlparse().path for path-only prefix patterns (/docs/*) instead of matching against full URL, which always failed; full-URL prefixes still match correctly - async_configs: add PDFContentScrapingStrategy to ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it - __init__: export PDFContentScrapingStrategy for type resolution - tests: add 86-test suite covering all three fixes with adversarial and edge cases	2026-03-09 14:52:58 +00:00
unclecode	55956a874d	fix: 3 bug fixes (#1487, #1512, #1666) + close 3 already-fixed issues - #1487: Move virtual scroll after wait_for so dynamic containers exist - #1512: Add __aiter__ to CrawlResultContainer for async for support - #1666: Kill process group on cleanup to prevent zombie child processes, add lsof fallback for Docker environments without lsof installed - Close #1472 (redirect chain already fixed), #1480 (links already normalized), #1679 (duplicate of #1509)	2026-03-08 08:44:04 +00:00
unclecode	a7e6da0b19	Merge fix/batch-easy-issues-10: 10 bug fixes + regression test suite Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758 Regression tests: 291 tests covering all major subsystems	2026-03-08 03:20:56 +00:00
unclecode	d788c28315	test: add comprehensive regression test suite (291 tests) Full regression suite covering all major Crawl4AI subsystems: - core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks) - content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata) - extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction) - deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization) - browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes) - config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips) - utilities (extract_xml_data, cache modes, content hashing) - edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery) Also adds /c4ai-check slash command for testing changes against the suite.	2026-03-08 03:20:52 +00:00
unclecode	3a75dd3f4c	fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758) - #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance) - #1489: Preserve query parameter key casing in normalize_url - #1374: Close NamedTemporaryFile handle before reopening (Windows fix) - #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1) - #1183: Fix extract_xml_data regex matching tag names in prose text - #1354: Make import_knowledge_base async (fix asyncio.run in running loop) - #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences) - #1031: Make Docker playground code editor resizable with overflow-auto - #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes - #1758: Change screenshot stitching format from BMP to PNG	2026-03-07 09:47:38 +00:00
unclecode	0c9e3c427e	Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved) Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668 Closed as superseded: #1592 Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729	2026-03-07 08:49:32 +00:00
unclecode	7c0cc3ed88	fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668) Bug fixes: - Verify redirect targets are alive before returning from URL seeder (#1622) - Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786) - Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796) Security/Docker: - Require api_token for /token endpoint when configured (#1795) - Deep-crawl streaming now mirrors Python library behavior via arun() (#1798) CI: - Bump GitHub Actions to latest versions - checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734) Features: - Support type-list pipeline in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290) - Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)	2026-03-07 08:45:11 +00:00
unclecode	11ed854155	Update CONTRIBUTORS for PR #462	2026-03-07 07:06:49 +00:00
unclecode	697c2b2a58	fix: add newline before opening code fence in html2text (#462) From PR #462 by @jtanningbed	2026-03-07 07:06:41 +00:00
unclecode	3704758746	Update CONTRIBUTORS for PR #1770	2026-03-07 07:01:54 +00:00
unclecode	04e83aa3c7	docs: modernize deprecated API usage across shipped docs (#1770) Update docs/examples to use current API: - proxy → proxy_config in BrowserConfig - result.fit_markdown → result.markdown.fit_markdown - result.fit_html → result.markdown.fit_html - markdown_v2 deprecation notes updated - bypass_cache → cache_mode=CacheMode.BYPASS - LLMExtractionStrategy now uses llm_config=LLMConfig(...) - CrawlerConfig → CrawlerRunConfig - cache_mode string values → CacheMode enum - Fix missing CacheMode import in local-files.md - Fix indentation in app-detail.html example - Fix tautological cache mode descriptions in arun.md From PR #1770 by @maksimzayats	2026-03-07 07:01:06 +00:00
unclecode	31d0de23df	Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list	2026-03-07 06:50:26 +00:00
unclecode	db98aefb03	Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793, #1792, #1794, #1784, #1730	2026-03-07 06:47:03 +00:00
unclecode	761664d29e	fix: add TTL expiry for Redis task data to prevent memory growth (#1730) From PR #1730 by @hoi	2026-03-07 06:17:58 +00:00
unclecode	e47e810aca	fix: handle UnicodeEncodeError in URL seeder and strip zero-width chars (#1784) From PR #1784 by @Br1an67	2026-03-07 06:16:41 +00:00
unclecode	1029815fd4	fix: add Windows support for crawler monitor keyboard input (#1794) From PR #1794 by @Br1an67	2026-03-07 06:16:12 +00:00
unclecode	d229beeaf8	fix: add wait_for_images option to screenshot endpoint (#1792) From PR #1792 by @Br1an67	2026-03-07 06:15:54 +00:00
unclecode	c73aa271ac	fix: make link_preview_timeout configurable in AdaptiveConfig (#1793) From PR #1793 by @Br1an67	2026-03-07 06:15:44 +00:00
unclecode	91330ef179	fix: add explicit utf-8 encoding to CLI file output (#1789) From PR #1789 by @Br1an67	2026-03-07 06:15:32 +00:00
unclecode	d6a8f57fdd	docs: fix css_selector type from list to string in examples (#1308) From PR #1308 by @dominicx	2026-03-07 06:15:14 +00:00
unclecode	e6c2a65625	docs: fix return type annotations to use RunManyReturn (#1716) From PR #1716 by @YuriNachos	2026-03-07 06:14:49 +00:00
unclecode	5601861555	docs: add missing CacheMode import in quickstart example (#1715) From PR #1715 by @YuriNachos	2026-03-07 06:13:32 +00:00
unclecode	72cc17c113	docs: fix docstring param name crawler_config -> config (#1494) From PR #1494 by @AkosLukacs	2026-03-07 06:13:18 +00:00
unclecode	814bc4df47	Update CONTRIBUTORS for PRs #1782, #1788, #1783, #1179	2026-03-07 04:15:49 +00:00
unclecode	93f2f03fab	Merge PR #1783: fix: strip port from URL domain in is_external_url comparison Strip port number from netloc before domain comparison so that example.com:8080 correctly matches base domain example.com.	2026-03-07 04:15:35 +00:00
unclecode	5f65d2d1fd	Merge PR #1788: fix: guard against None LLM content and propagate finish_reason Adds None check before processing LLM response content in both extract() and aextract(). When LLM returns no content (e.g. content filter, token limit), returns an error block with finish_reason instead of crashing. Also guards the except fallback path against None content.	2026-03-07 04:15:22 +00:00
unclecode	122be00076	Merge PR #1782: fix: preserve class and id attributes in cleaned_html Add "class" and "id" to IMPORTANT_ATTRS so they survive HTML cleaning. CSS-based extraction strategies need these attributes to match selectors.	2026-03-07 04:14:21 +00:00
unclecode	4bde952ade	Update CONTRIBUTORS for PRs #1787, #1790, #1804	2026-03-07 04:00:36 +00:00
unclecode	ff2ea3429a	Merge PR #1804: feat: add score_threshold support to BestFirstCrawlingStrategy Adds score_threshold parameter (default -inf for backward compatibility) to BestFirstCrawlingStrategy, matching BFS and DFS strategies. URLs scoring below the threshold are skipped. Fixes #1801.	2026-03-07 03:59:28 +00:00
unclecode	9ec2969d99	Merge PR #1790: fix: handle nested brackets and parentheses in LINK_PATTERN regex Improves LINK_PATTERN regex in markdown citation conversion to correctly handle Wikipedia-style URLs with parentheses and text with nested brackets.	2026-03-07 03:59:17 +00:00
unclecode	bd0f6e1bd5	fix: strip markdown fences in force_json_response path (LLM extraction) Wire existing _strip_markdown_fences() into the force_json_response code path in both extract() and aextract(). LLMs frequently wrap JSON in ```json fences which caused json.loads() to fail. Inspired by PR #1787 (Br1an67).	2026-03-07 03:59:00 +00:00
unclecode	d4588904b3	Update PR-TODOLIST and CONTRIBUTORS for merged PRs #1805, #1763, #1803	2026-03-07 03:40:36 +00:00
unclecode	b008671345	Merge PR #1803: fix from_serializable_dict to ignore plain data dicts with "type" key Narrows the typed-object deserialization path to only match dicts with "params" or {"type":"dict","value":{...}}, preventing crashes on normal data dicts like JSON-Schema fragments that happen to have a "type" key.	2026-03-07 03:21:33 +00:00
unclecode	fdb3f8fd98	Merge PR #1763: fix: return in finally block silently suppressing exceptions Moves return out of finally block and adds raise in except block so QUEUE_ERROR exceptions properly propagate in MemoryAdaptiveDispatcher.	2026-03-07 03:21:22 +00:00
unclecode	8a677a9db1	Merge PR #1805: fix: prevent AdaptiveCrawler from crawling external domains Removes external links from being added to pending_links in digest(), since _crawl_with_preview() always sets include_external=False. Fixes #1776.	2026-03-07 03:21:11 +00:00
nightcityblade	78434eadac	fix: prevent AdaptiveCrawler from crawling external domains AdaptiveCrawler.digest() unconditionally added external links to pending_links, causing the crawler to follow links to entirely different domains even though include_external=False was set in LinkPreviewConfig. Remove external links from being added to pending_links in both the initial crawl and subsequent crawl loops. Fixes #1776	2026-03-07 10:57:42 +08:00
nightcityblade	379591047d	fix: add score_threshold support to BestFirstCrawlingStrategy Add score_threshold parameter to BestFirstCrawlingStrategy, matching the existing behavior in BFSDeepCrawlStrategy and DFSDeepCrawlStrategy. URLs scoring below the threshold are now skipped during link discovery instead of being unconditionally enqueued. Fixes #1801	2026-03-07 10:55:09 +08:00
Soham Kukreti	71a6526459	fix(docker): narrow from_serializable_dict to ignore plain data dicts with "type" key The typed-object entry condition (`"type" in data`) was too broad: it also matched plain business dicts that happen to carry a "type" key, such as JsonCssExtractionStrategy field specs ({"type": "text"}) and LLMExtractionStrategy JSON Schema fragments ({"type": "string"}). These were never config objects, but the deserializer tried to treat them as such, hit the ALLOWED_DESERIALIZE_TYPES allowlist, and raised a ValueError — causing /crawl to return HTTP 500 for perfectly valid extraction-strategy payloads. Fix: narrow the entry condition to require "params" (or "type":"dict" + "value"), matching only the shapes that to_serializable_dict() actually produces. Dicts with "type" but no "params"/"value" fall through to the raw-dict path and are passed as plain data. The RCE protection from commit `0104db6` is fully preserved: any real class-instantiation attack still requires "type" + "params", still enters the typed path, and is still blocked by the allowlist. Fixes #1797	2026-03-06 13:10:35 +05:30
ntohidi	0273b27821	Fix MediaItem crash on non-numeric width values (e.g. "100%", "auto") Add BeforeValidator to coerce width to int or None, preventing Pydantic validation errors when HTML contains non-integer width attributes. Fixes #1635	2026-03-02 09:51:59 +08:00
ntohidi	0d151eba82	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2026-03-02 09:42:28 +08:00
Br1an67	669b466667	fix: handle nested brackets and parentheses in LINK_PATTERN regex The previous regex [^\]]+ stopped at the first ] which broke markdown links containing embedded images like: The new pattern allows one level of nested [...] in the link text and one level of nested (...) in the URL, correctly handling: - Embedded images in link text - Wikipedia-style URLs with parentheses Fixes #711	2026-03-02 01:24:02 +08:00
Br1an67	b138c949b5	fix: guard against None LLM content and propagate finish_reason When max_tokens is too small, the LLM may return None content with finish_reason=MAX_TOKENS. This caused a crash in extraction with 'NoneType' object has no attribute 'startswith'. Add a None check on LLM response content. When content is None, return an error block including the finish_reason so callers can diagnose the issue. Also guard the fallback split_and_parse path against None content. Fixes #1606	2026-03-02 01:18:47 +08:00
Br1an67	20488620cd	fix: strip port from URL domain in is_external_url comparison The is_external_url function compared the full netloc (including port) against base_domain (which has port stripped by get_base_domain). This caused URLs like http://localhost:8000/page to be wrongly classified as external when base_domain is 'localhost'. Strip the port from parsed.netloc before comparison. Fixes #1503	2026-03-02 00:48:50 +08:00
Br1an67	500d047654	fix: preserve class and id attributes in cleaned_html Add 'class' and 'id' to IMPORTANT_ATTRS so they are retained when cleaning HTML attributes. This allows users to use cleaned_html for further analysis that depends on CSS classes and element IDs. Fixes #1601	2026-03-02 00:43:23 +08:00
unclecode	0a45c1056d	feat: add separate query_llm_config for adaptive crawler query expansion (#1682) The embedding strategy uses two incompatible API call types: embedding calls (text-to-vector) and query expansion (chat completion). Previously both used a single embedding_llm_config, so setting an embedding model broke query expansion and vice versa. Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users can specify separate models for each call type. Fallback chain preserves backward compatibility: query_llm_config -> llm_config -> hardcoded defaults. Also fixes base_url and backoff params not being passed to perform_completion_with_backoff in query expansion, and simplifies _embedding_llm_config_dict to use LLMConfig.to_dict() (which includes the 3 backoff fields the manual extraction was missing). Inspired by PR #1683 from @Vaccarini-Lorenzo — thank you for identifying the issue and proposing the initial approach.	2026-02-27 20:31:51 +08:00
unclecode	a4cc0a9f04	feat: add separate query_llm_config for adaptive crawler query expansion (#1682) The embedding strategy uses two incompatible API call types: embedding calls (text-to-vector) and query expansion (chat completion). Previously both used a single embedding_llm_config, so setting an embedding model broke query expansion and vice versa. Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users can specify separate models for each call type. Fallback chain preserves backward compatibility: query_llm_config -> llm_config -> hardcoded defaults. Also fixes base_url and backoff params not being passed to perform_completion_with_backoff in query expansion, and simplifies _embedding_llm_config_dict to use LLMConfig.to_dict() (which includes the 3 backoff fields the manual extraction was missing). Inspired by PR #1683 from @sthakrar — thank you for identifying the issue and proposing the initial approach.	2026-02-25 12:26:39 +00:00
unclecode	8f2c2e1f90	docs: add mzyfree to contributors for PR #1689	2026-02-25 07:29:28 +00:00
unclecode	c0912f7234	feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker domains and CSS resources at the browser context level. Refactor crawler pool with release_crawler() and active_requests tracking to prevent janitor from closing browsers with in-flight requests. Add proper finally blocks to all Docker API/server handlers. Update docs for new config options. Inspired by #1689.	2026-02-25 07:12:28 +00:00
Nasrin	8d35d17d01	Merge pull request #1722 from YuriNachos/fix/issue-1652-md-docstring fix: Add docstring to MCP tool 'md' endpoint	2026-02-25 06:00:09 +01:00
Nasrin	d419199a4c	Merge pull request #1775 from unclecode/fix/issue-1748-screenshot-scroll-delay Fix/issue 1748 screenshot scroll delay	2026-02-25 05:54:24 +01:00
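
Note on fdb3f8fd98 (PR #1763): the bug it describes comes from a general Python rule, namely that a return inside a finally block discards any exception still propagating. The sketch below reproduces that shape and the fix the message describes (return moved out of finally, raise added in except). The function names and the log list are hypothetical and are not taken from MemoryAdaptiveDispatcher.

```python
def run_task_buggy(log, fail):
    """Return inside finally: a pending exception is silently discarded."""
    try:
        if fail:
            raise RuntimeError("QUEUE_ERROR")
    finally:
        log.append("cleanup")
        return "done"  # replaces the in-flight RuntimeError, if any


def run_task_fixed(log, fail):
    """Cleanup stays in finally; the return sits after the try statement."""
    try:
        if fail:
            raise RuntimeError("QUEUE_ERROR")
    except RuntimeError as exc:
        log.append(f"error: {exc}")
        raise  # re-raise so the caller sees the failure
    finally:
        log.append("cleanup")
    return "done"  # reached only when nothing was raised
```

With fail=True the first function returns normally and the caller never learns about the error, while the second still runs its cleanup and then propagates the RuntimeError.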


1533 Commits
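
Note on 71a6526459 (PR #1803): the narrowed entry condition described in that message can be sketched as a small predicate. This is an illustration of the stated rule only, assuming the two shapes the message names ("type" plus "params", or "type":"dict" plus "value"). The helper name looks_like_typed_object is hypothetical and is not part of crawl4ai.

```python
def looks_like_typed_object(data) -> bool:
    """Return True only for shapes a serializer of typed objects would emit.

    Hypothetical helper: plain data dicts that merely carry a "type" key
    (field specs, JSON Schema fragments) are treated as ordinary data.
    """
    if not isinstance(data, dict) or "type" not in data:
        return False
    if "params" in data:
        return True  # typed object: still subject to any allowlist check
    return data["type"] == "dict" and "value" in data  # wrapped plain dict


# Plain data that happens to contain "type" falls through as raw data.
field_spec = {"type": "text"}
schema_fragment = {"type": "string"}

# Shapes that enter the typed path.
typed_config = {"type": "BrowserConfig", "params": {"headless": True}}
wrapped_dict = {"type": "dict", "value": {"a": 1}}
```

Under this rule a payload needs "params" to reach class instantiation at all, so an allowlist applied on the typed path continues to cover every such request.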