crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-12 00:38:00 +00:00

Author	SHA1	Message	Date
unclecode	9b571bb947	feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.) The HTTP crawler strategy now checks Content-Type and Content-Disposition headers to detect non-HTML file responses. When a file download is detected, raw bytes are saved to disk and the path is returned via downloaded_files. Text-based files (CSV, JSON, XML) also populate the html field for backward compatibility. Binary files (PDF, images) set html to empty string — content is only available via downloaded_files. Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).	2026-03-16 14:03:43 +00:00
ntohidi	f6ab207e25	fix: remove shared LOCK contention in monitor to prevent pod deadlock (#1754) The monitor's update_timeline(), get_health_summary(), and get_browser_list() all acquired the crawler pool's global LOCK to read pool stats. That same lock is held during slow browser start/close operations (get_crawler, janitor, close_all), causing the monitor to block indefinitely and the pod to become unresponsive after sustained crawling. Replaced all three lock acquisitions in monitor.py with a lock-free get_pool_snapshot() in crawler_pool.py that returns shallow dict copies. Under CPython's GIL, dict.copy() and len() are atomic, so this is safe for read-only monitoring with at most slightly stale counts.	2026-03-13 12:17:52 +08:00
Nasrin	648f36b622	Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)	2026-03-13 03:59:28 +01:00
Nasrin	6e4299577f	Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)	2026-03-13 03:38:15 +01:00
hafezparast	8de83a3590	fix: css_selector ignored in LXML scraping for raw:// URLs (#1484) css_selector was skipped in _scrap(); only target_elements was applied. Now css_selector filters the DOM first, then target_elements narrows within that selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:00:33 +08:00
unclecode	bf1158a61b	fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671) Add official Redis apt repository and pin redis-server to 7.2.7, which patches the Lua use-after-free vulnerability. REDIS_VERSION build arg allows override.	2026-03-12 11:24:42 +00:00
unclecode	a73bc1c076	fix: MCP SSE endpoint crash; mount via raw ASGI Route (#1594) Replace @app.get() with starlette.routing.Route() for the SSE handler. The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send) internally, which conflicts with Starlette's middleware wrapping. Also update CONTRIBUTORS.md for PR #1829.	2026-03-12 11:22:48 +00:00
hafezparast	3f481e9e5c	fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762) - #1370: Freeze element dimensions via CSS before viewport resize in take_screenshot_scroller() to prevent responsive reflow on Elementor sites; restore original viewport after capture. - #1818: Call window.stop() on session-reused pages before navigation to abort pending loads; move event listener cleanup outside session_id guard so listeners don't accumulate across reuses. - #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is set; call arun() directly per URL so the DeepCrawlDecorator can invoke the strategy (dispatcher crashes on List[CrawlResult] return). - #1762: Add encoding="utf-8" to the remaining open() call in save_global_config() (cli.py line 58). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 18:17:13 +08:00
hafezparast	480d938f67	fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817) - #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional provider, temperature, base_url query params with fallback to server config. Consistent with /md and /llm/job endpoints. - #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT, REDIS_PASSWORD environment variable overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:53:04 +08:00
Nasrin	d907e167a5	Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750 fix: screenshot respects scan_full_page=False (#1750)	2026-03-12 07:39:52 +01:00
Maysam Hafezparast	57b0d09934	fix: deduplicate BM25ContentFilter output (#1213) (#1824) BM25ContentFilter.filter_content() returned duplicate text chunks when the same content appeared in multiple DOM elements. Added exact-text deduplication after threshold filtering, keeping the first occurrence in document order. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:23:34 +08:00
unclecode	35034f551b	docs: add hafezparast to CONTRIBUTORS.md Recognized for identifying and confirming the PDFContentScrapingStrategy deserialization fix (#1815).	2026-03-12 05:43:48 +00:00
hafezparast	6efbffe345	fix: screenshot respects scan_full_page=False (#1750) take_screenshot() ignored the scan_full_page config flag: tall pages always got a full-page screenshot even when scan_full_page=False. Now passes scan_full_page through to take_screenshot() and uses viewport-only capture when False. Includes 16 tests (8 unit + 8 integration). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 12:04:45 +08:00
unclecode	11b45760da	fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization - antibot_detector: add <pre> to content elements regex, detect browser-wrapped JSON in _looks_like_data() so httpbin-style responses are not flagged as blocked - deep_crawling/filters: use urlparse().path for path-only prefix patterns (/docs/*) instead of matching against full URL, which always failed; full-URL prefixes still match correctly - async_configs: add PDFContentScrapingStrategy to ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it - __init__: export PDFContentScrapingStrategy for type resolution - tests: add 86-test suite covering all three fixes with adversarial and edge cases	2026-03-09 14:52:58 +00:00
unclecode	55956a874d	fix: 3 bug fixes (#1487, #1512, #1666) + close 3 already-fixed issues - #1487: Move virtual scroll after wait_for so dynamic containers exist - #1512: Add __aiter__ to CrawlResultContainer for async for support - #1666: Kill process group on cleanup to prevent zombie child processes, add lsof fallback for Docker environments without lsof installed - Close #1472 (redirect chain already fixed), #1480 (links already normalized), #1679 (duplicate of #1509)	2026-03-08 08:44:04 +00:00
unclecode	a7e6da0b19	Merge fix/batch-easy-issues-10: 10 bug fixes + regression test suite Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758 Regression tests: 291 tests covering all major subsystems	2026-03-08 03:20:56 +00:00
unclecode	d788c28315	test: add comprehensive regression test suite (291 tests) Full regression suite covering all major Crawl4AI subsystems: - core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks) - content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata) - extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction) - deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization) - browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes) - config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips) - utilities (extract_xml_data, cache modes, content hashing) - edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery) Also adds /c4ai-check slash command for testing changes against the suite.	2026-03-08 03:20:52 +00:00
unclecode	3a75dd3f4c	fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758) - #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance) - #1489: Preserve query parameter key casing in normalize_url - #1374: Close NamedTemporaryFile handle before reopening (Windows fix) - #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1) - #1183: Fix extract_xml_data regex matching tag names in prose text - #1354: Make import_knowledge_base async (fix asyncio.run in running loop) - #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences) - #1031: Make Docker playground code editor resizable with overflow-auto - #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes - #1758: Change screenshot stitching format from BMP to PNG	2026-03-07 09:47:38 +00:00
unclecode	0c9e3c427e	Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved) Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668 Closed as superseded: #1592 Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729	2026-03-07 08:49:32 +00:00
unclecode	7c0cc3ed88	fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668) Bug fixes: - Verify redirect targets are alive before returning from URL seeder (#1622) - Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786) - Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796) Security/Docker: - Require api_token for /token endpoint when configured (#1795) - Deep-crawl streaming now mirrors Python library behavior via arun() (#1798) CI: - Bump GitHub Actions to latest versions - checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734) Features: - Support type-list pipeline in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290) - Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)	2026-03-07 08:45:11 +00:00
unclecode	11ed854155	Update CONTRIBUTORS for PR #462	2026-03-07 07:06:49 +00:00
unclecode	697c2b2a58	fix: add newline before opening code fence in html2text (#462) From PR #462 by @jtanningbed	2026-03-07 07:06:41 +00:00
unclecode	3704758746	Update CONTRIBUTORS for PR #1770	2026-03-07 07:01:54 +00:00
unclecode	04e83aa3c7	docs: modernize deprecated API usage across shipped docs (#1770) Update docs/examples to use current API: - proxy → proxy_config in BrowserConfig - result.fit_markdown → result.markdown.fit_markdown - result.fit_html → result.markdown.fit_html - markdown_v2 deprecation notes updated - bypass_cache → cache_mode=CacheMode.BYPASS - LLMExtractionStrategy now uses llm_config=LLMConfig(...) - CrawlerConfig → CrawlerRunConfig - cache_mode string values → CacheMode enum - Fix missing CacheMode import in local-files.md - Fix indentation in app-detail.html example - Fix tautological cache mode descriptions in arun.md From PR #1770 by @maksimzayats	2026-03-07 07:01:06 +00:00
unclecode	31d0de23df	Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list	2026-03-07 06:50:26 +00:00
unclecode	db98aefb03	Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793, #1792, #1794, #1784, #1730	2026-03-07 06:47:03 +00:00
unclecode	761664d29e	fix: add TTL expiry for Redis task data to prevent memory growth (#1730) From PR #1730 by @hoi	2026-03-07 06:17:58 +00:00
unclecode	e47e810aca	fix: handle UnicodeEncodeError in URL seeder and strip zero-width chars (#1784) From PR #1784 by @Br1an67	2026-03-07 06:16:41 +00:00
unclecode	1029815fd4	fix: add Windows support for crawler monitor keyboard input (#1794) From PR #1794 by @Br1an67	2026-03-07 06:16:12 +00:00
unclecode	d229beeaf8	fix: add wait_for_images option to screenshot endpoint (#1792) From PR #1792 by @Br1an67	2026-03-07 06:15:54 +00:00
unclecode	c73aa271ac	fix: make link_preview_timeout configurable in AdaptiveConfig (#1793) From PR #1793 by @Br1an67	2026-03-07 06:15:44 +00:00
unclecode	91330ef179	fix: add explicit utf-8 encoding to CLI file output (#1789) From PR #1789 by @Br1an67	2026-03-07 06:15:32 +00:00
unclecode	d6a8f57fdd	docs: fix css_selector type from list to string in examples (#1308) From PR #1308 by @dominicx	2026-03-07 06:15:14 +00:00
unclecode	e6c2a65625	docs: fix return type annotations to use RunManyReturn (#1716) From PR #1716 by @YuriNachos	2026-03-07 06:14:49 +00:00
unclecode	5601861555	docs: add missing CacheMode import in quickstart example (#1715) From PR #1715 by @YuriNachos	2026-03-07 06:13:32 +00:00
unclecode	72cc17c113	docs: fix docstring param name crawler_config -> config (#1494) From PR #1494 by @AkosLukacs	2026-03-07 06:13:18 +00:00
unclecode	814bc4df47	Update CONTRIBUTORS for PRs #1782, #1788, #1783, #1179	2026-03-07 04:15:49 +00:00
unclecode	93f2f03fab	Merge PR #1783: fix: strip port from URL domain in is_external_url comparison Strip port number from netloc before domain comparison so that example.com:8080 correctly matches base domain example.com.	2026-03-07 04:15:35 +00:00
unclecode	5f65d2d1fd	Merge PR #1788: fix: guard against None LLM content and propagate finish_reason Adds None check before processing LLM response content in both extract() and aextract(). When LLM returns no content (e.g. content filter, token limit), returns an error block with finish_reason instead of crashing. Also guards the except fallback path against None content.	2026-03-07 04:15:22 +00:00
unclecode	122be00076	Merge PR #1782: fix: preserve class and id attributes in cleaned_html Add "class" and "id" to IMPORTANT_ATTRS so they survive HTML cleaning. CSS-based extraction strategies need these attributes to match selectors.	2026-03-07 04:14:21 +00:00
unclecode	4bde952ade	Update CONTRIBUTORS for PRs #1787, #1790, #1804	2026-03-07 04:00:36 +00:00
unclecode	ff2ea3429a	Merge PR #1804: feat: add score_threshold support to BestFirstCrawlingStrategy Adds score_threshold parameter (default -inf for backward compatibility) to BestFirstCrawlingStrategy, matching BFS and DFS strategies. URLs scoring below the threshold are skipped. Fixes #1801.	2026-03-07 03:59:28 +00:00
unclecode	9ec2969d99	Merge PR #1790: fix: handle nested brackets and parentheses in LINK_PATTERN regex Improves LINK_PATTERN regex in markdown citation conversion to correctly handle Wikipedia-style URLs with parentheses and text with nested brackets.	2026-03-07 03:59:17 +00:00
unclecode	bd0f6e1bd5	fix: strip markdown fences in force_json_response path (LLM extraction) Wire existing _strip_markdown_fences() into the force_json_response code path in both extract() and aextract(). LLMs frequently wrap JSON in ```json fences which caused json.loads() to fail. Inspired by PR #1787 (Br1an67).	2026-03-07 03:59:00 +00:00
unclecode	d4588904b3	Update PR-TODOLIST and CONTRIBUTORS for merged PRs #1805, #1763, #1803	2026-03-07 03:40:36 +00:00
unclecode	b008671345	Merge PR #1803: fix from_serializable_dict to ignore plain data dicts with "type" key Narrows the typed-object deserialization path to only match dicts with "params" or {"type":"dict","value":{...}}, preventing crashes on normal data dicts like JSON-Schema fragments that happen to have a "type" key.	2026-03-07 03:21:33 +00:00
unclecode	fdb3f8fd98	Merge PR #1763: fix: return in finally block silently suppressing exceptions Moves return out of finally block and adds raise in except block so QUEUE_ERROR exceptions properly propagate in MemoryAdaptiveDispatcher.	2026-03-07 03:21:22 +00:00
unclecode	8a677a9db1	Merge PR #1805: fix: prevent AdaptiveCrawler from crawling external domains Removes external links from being added to pending_links in digest(), since _crawl_with_preview() always sets include_external=False. Fixes #1776.	2026-03-07 03:21:11 +00:00
nightcityblade	78434eadac	fix: prevent AdaptiveCrawler from crawling external domains AdaptiveCrawler.digest() unconditionally added external links to pending_links, causing the crawler to follow links to entirely different domains even though include_external=False was set in LinkPreviewConfig. Remove external links from being added to pending_links in both the initial crawl and subsequent crawl loops. Fixes #1776	2026-03-07 10:57:42 +08:00
nightcityblade	379591047d	fix: add score_threshold support to BestFirstCrawlingStrategy Add score_threshold parameter to BestFirstCrawlingStrategy, matching the existing behavior in BFSDeepCrawlStrategy and DFSDeepCrawlStrategy. URLs scoring below the threshold are now skipped during link discovery instead of being unconditionally enqueued. Fixes #1801	2026-03-07 10:55:09 +08:00
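
Note on commit fdb3f8fd98 (PR #1763) in the log above: the message describes a `return` inside a `finally` block silently suppressing exceptions. The Python sketch below illustrates that language behavior in isolation. It is not the MemoryAdaptiveDispatcher code; `dispatch_buggy`, `dispatch_fixed`, and `failing_task` are hypothetical names invented for this illustration, and the real fix may be shaped differently.

```python
def dispatch_buggy(task):
    """Hypothetical pre-fix shape: `return` sits inside `finally`."""
    result = None
    try:
        result = task()
    finally:
        # Returning here discards any exception that is still propagating,
        # so the caller never learns that task() failed.
        return result


def dispatch_fixed(task):
    """Hypothetical post-fix shape: re-raise, and return outside `finally`."""
    try:
        result = task()
    except Exception:
        # Logging or bookkeeping could go here; the failure still propagates.
        raise
    finally:
        pass  # cleanup only, no return
    return result


def failing_task():
    # Stand-in for a task that hits a queue error.
    raise RuntimeError("queue error")


if __name__ == "__main__":
    print(dispatch_buggy(failing_task))  # the failure is hidden from the caller
    try:
        dispatch_fixed(failing_task)
    except RuntimeError as exc:
        print(f"propagated: {exc}")
```

Recent Python versions may emit a SyntaxWarning for the `return` in `dispatch_buggy`, which is the same hazard the commit addresses.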


1446 Commits