crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

Author	SHA1	Message	Date
unclecode	01c685cd3a	fix: pin litellm to safe fork (v1.81.13) due to PyPI supply chain compromise litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious code. PyPI has quarantined the entire package, blocking all installs. Temporarily pin to our own fork at a known-safe version.	2026-03-24 14:03:26 +00:00
Nasrin	1a40ccf093	Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842 fix: improve browser None guard in create_browser_context (#1842)	2026-03-24 11:37:46 +01:00
Nasrin	6eb2530bd9	Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848 fix: skip non-allowlisted types in serialization/deserialization (#1848)	2026-03-24 11:36:03 +01:00
Nasrin	fb24ee592e	Merge pull request #1851 from hafezparast/fix/maysam-mcp-sse-asgi-1850 fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)	2026-03-24 11:17:35 +01:00
ntohidi	3846b738cf	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into main	2026-03-24 18:10:40 +08:00
UncleCode	1a597cb97f	Merge pull request #1836 from unclecode/release/v0.8.5 Release v0.8.5	2026-03-24 11:06:58 +01:00
hafezparast	219416e49d	fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850) Starlette's Route wraps async functions in request_response(), calling handler(request) instead of handler(scope, receive, send). This broke the MCP SSE endpoint, which needs raw ASGI access. Fix: use a callable class instead of an async function; Route passes class instances through as raw ASGI apps without wrapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:55:41 +08:00
hafezparast	e603e4a722	fix: skip non-allowlisted types in serialization/deserialization (#1848) to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES (returns None), preventing objects like logging.Logger from being serialized as {"type": "Logger", "params": {...}} which then fails deserialization. from_serializable_dict returns None for unknown types instead of raising ValueError, handling payloads from older clients. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:17:02 +08:00
hafezparast	310b52b663	fix: improve browser None guard in create_browser_context (#1842) The existing guard assumed self.browser=None only meant persistent context mode. In reality, the browser can be None because it was closed by the janitor, crashed, or never started. This caused a misleading error message. Now the guard distinguishes between persistent context and closed/crashed browser with appropriate messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:45:38 +08:00
ntohidi	37da8b8f97	fix: pin redis-tools version to match redis-server in Dockerfile docker-rebuild-v0.8.5	2026-03-21 14:26:23 +08:00
ntohidi	29d27ed1ae	fix: install curl and gnupg in Dockerfile to support Redis repository addition	2026-03-21 14:17:27 +08:00
unclecode	c4389adddf	fix: Prevent scan_full_page from hanging on dynamic/infinite-scroll pages - Default max_scroll_steps to 10 when not explicitly set (was None/unlimited) - Wrap _handle_full_page_scan in asyncio.wait_for with page_timeout - On timeout, log warning and continue with partial scroll instead of hanging Previously, scan_full_page could hang indefinitely because: 1. max_scroll_steps defaulted to None (no limit) 2. Dynamic pages keep growing total_height on each scroll 3. No asyncio timeout wrapper to interrupt hung coroutines	2026-03-18 15:36:12 +00:00
unclecode	3ecd852011	fix: Re-check is_blocked() when fallback fetch fails When fallback_fetch_function was invoked but failed (exception or empty response), the final is_blocked() re-check was skipped because fallback_fetch_used=True. This left crawl_result.success=True even though the result was a blocked page from the last proxy attempt. Changed the condition to check resolved_by=='fallback_fetch' (set only on success) instead of fallback_fetch_used (set before the try block).	2026-03-18 14:36:57 +00:00
ntohidi	4bf17796d4	feat: add version 0.8.5 release highlights including anti-bot detection, shadow DOM support, and critical security fixes to README v0.8.5	2026-03-18 11:23:20 +08:00
unclecode	9b571bb947	feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.) The HTTP crawler strategy now checks Content-Type and Content-Disposition headers to detect non-HTML file responses. When a file download is detected, raw bytes are saved to disk and the path is returned via downloaded_files. Text-based files (CSV, JSON, XML) also populate the html field for backward compatibility. Binary files (PDF, images) set html to empty string — content is only available via downloaded_files. Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).	2026-03-16 14:03:43 +00:00
ntohidi	bb6406a2d0	release: Crawl4AI v0.8.5 Bump version to 0.8.5 across all references (Dockerfile, README, Docker README, blog index, __version__.py). Add release notes, blog post, demo verification script (13 real-crawl tests), and releases directory entry. Key highlights: - Anti-bot detection with 3-tier proxy escalation - Shadow DOM flattening - Deep crawl cancellation - Config defaults API - 60+ bug fixes and critical security patches	2026-03-16 18:46:05 +08:00
ntohidi	f6ab207e25	fix: remove shared LOCK contention in monitor to prevent pod deadlock (#1754) The monitor's update_timeline(), get_health_summary(), and get_browser_list() all acquired the crawler pool's global LOCK to read pool stats. That same lock is held during slow browser start/close operations (get_crawler, janitor, close_all), causing the monitor to block indefinitely and the pod to become unresponsive after sustained crawling. Replaced all three lock acquisitions in monitor.py with a lock-free get_pool_snapshot() in crawler_pool.py that returns shallow dict copies. Under CPython's GIL, dict.copy() and len() are atomic, so this is safe for read-only monitoring with at most slightly stale counts.	2026-03-13 12:17:52 +08:00
Nasrin	648f36b622	Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)	2026-03-13 03:59:28 +01:00
Nasrin	6e4299577f	Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)	2026-03-13 03:38:15 +01:00
hafezparast	8de83a3590	fix: css_selector ignored in LXML scraping for raw:// URLs (#1484) css_selector was skipped in _scrap(); only target_elements was applied. Now css_selector filters the DOM first, then target_elements narrows within that selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:00:33 +08:00
unclecode	bf1158a61b	fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671) Add official Redis apt repository and pin redis-server to 7.2.7 which patches the Lua use-after-free vulnerability. REDIS_VERSION build arg allows override.	2026-03-12 11:24:42 +00:00
unclecode	a73bc1c076	fix: MCP SSE endpoint crash; mount via raw ASGI Route (#1594) Replace @app.get() with starlette.routing.Route() for the SSE handler. The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send) internally, which conflicts with Starlette's middleware wrapping. Also update CONTRIBUTORS.md for PR #1829.	2026-03-12 11:22:48 +00:00
hafezparast	3f481e9e5c	fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762) - #1370: Freeze element dimensions via CSS before viewport resize in take_screenshot_scroller() to prevent responsive reflow on Elementor sites; restore original viewport after capture. - #1818: Call window.stop() on session-reused pages before navigation to abort pending loads; move event listener cleanup outside session_id guard so listeners don't accumulate across reuses. - #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is set; call arun() directly per URL so the DeepCrawlDecorator can invoke the strategy (dispatcher crashes on List[CrawlResult] return). - #1762: Add encoding="utf-8" to the remaining open() call in save_global_config() (cli.py line 58). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 18:17:13 +08:00
hafezparast	480d938f67	fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817) - #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional provider, temperature, base_url query params with fallback to server config. Consistent with /md and /llm/job endpoints. - #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT, REDIS_PASSWORD environment variable overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:53:04 +08:00
Nasrin	d907e167a5	Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750 fix: screenshot respects scan_full_page=False (#1750)	2026-03-12 07:39:52 +01:00
Maysam Hafezparast	57b0d09934	fix: deduplicate BM25ContentFilter output (#1213) (#1824) BM25ContentFilter.filter_content() returned duplicate text chunks when the same content appeared in multiple DOM elements. Added exact-text deduplication after threshold filtering, keeping the first occurrence in document order. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:23:34 +08:00
unclecode	35034f551b	docs: add hafezparast to CONTRIBUTORS.md Recognized for identifying and confirming the PDFContentScrapingStrategy deserialization fix (#1815).	2026-03-12 05:43:48 +00:00
hafezparast	6efbffe345	fix: screenshot respects scan_full_page=False (#1750) take_screenshot() ignored the scan_full_page config flag; tall pages always got a full-page screenshot even when scan_full_page=False. Now passes scan_full_page through to take_screenshot() and uses viewport-only capture when False. Includes 16 tests (8 unit + 8 integration). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 12:04:45 +08:00
unclecode	11b45760da	fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization - antibot_detector: add <pre> to content elements regex, detect browser-wrapped JSON in _looks_like_data() so httpbin-style responses are not flagged as blocked - deep_crawling/filters: use urlparse().path for path-only prefix patterns (/docs/*) instead of matching against full URL, which always failed; full-URL prefixes still match correctly - async_configs: add PDFContentScrapingStrategy to ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it - __init__: export PDFContentScrapingStrategy for type resolution - tests: add 86-test suite covering all three fixes with adversarial and edge cases	2026-03-09 14:52:58 +00:00
unclecode	55956a874d	fix: 3 bug fixes (#1487, #1512, #1666) + close 3 already-fixed issues - #1487: Move virtual scroll after wait_for so dynamic containers exist - #1512: Add __aiter__ to CrawlResultContainer for async for support - #1666: Kill process group on cleanup to prevent zombie child processes, add lsof fallback for Docker environments without lsof installed - Close #1472 (redirect chain already fixed), #1480 (links already normalized), #1679 (duplicate of #1509)	2026-03-08 08:44:04 +00:00
unclecode	a7e6da0b19	Merge fix/batch-easy-issues-10: 10 bug fixes + regression test suite Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758 Regression tests: 291 tests covering all major subsystems	2026-03-08 03:20:56 +00:00
unclecode	d788c28315	test: add comprehensive regression test suite (291 tests) Full regression suite covering all major Crawl4AI subsystems: - core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks) - content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata) - extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction) - deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization) - browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes) - config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips) - utilities (extract_xml_data, cache modes, content hashing) - edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery) Also adds /c4ai-check slash command for testing changes against the suite.	2026-03-08 03:20:52 +00:00
unclecode	3a75dd3f4c	fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758) - #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance) - #1489: Preserve query parameter key casing in normalize_url - #1374: Close NamedTemporaryFile handle before reopening (Windows fix) - #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1) - #1183: Fix extract_xml_data regex matching tag names in prose text - #1354: Make import_knowledge_base async (fix asyncio.run in running loop) - #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences) - #1031: Make Docker playground code editor resizable with overflow-auto - #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes - #1758: Change screenshot stitching format from BMP to PNG	2026-03-07 09:47:38 +00:00
unclecode	0c9e3c427e	Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved) Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668 Closed as superseded: #1592 Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729	2026-03-07 08:49:32 +00:00
unclecode	7c0cc3ed88	fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668) Bug fixes: - Verify redirect targets are alive before returning from URL seeder (#1622) - Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786) - Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796) Security/Docker: - Require api_token for /token endpoint when configured (#1795) - Deep-crawl streaming now mirrors Python library behavior via arun() (#1798) CI: - Bump GitHub Actions to latest versions - checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734) Features: - Support type-list pipeline in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290) - Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)	2026-03-07 08:45:11 +00:00
unclecode	11ed854155	Update CONTRIBUTORS for PR #462	2026-03-07 07:06:49 +00:00
unclecode	697c2b2a58	fix: add newline before opening code fence in html2text (#462) From PR #462 by @jtanningbed	2026-03-07 07:06:41 +00:00
unclecode	3704758746	Update CONTRIBUTORS for PR #1770	2026-03-07 07:01:54 +00:00
unclecode	04e83aa3c7	docs: modernize deprecated API usage across shipped docs (#1770) Update docs/examples to use current API: - proxy → proxy_config in BrowserConfig - result.fit_markdown → result.markdown.fit_markdown - result.fit_html → result.markdown.fit_html - markdown_v2 deprecation notes updated - bypass_cache → cache_mode=CacheMode.BYPASS - LLMExtractionStrategy now uses llm_config=LLMConfig(...) - CrawlerConfig → CrawlerRunConfig - cache_mode string values → CacheMode enum - Fix missing CacheMode import in local-files.md - Fix indentation in app-detail.html example - Fix tautological cache mode descriptions in arun.md From PR #1770 by @maksimzayats	2026-03-07 07:01:06 +00:00
unclecode	31d0de23df	Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list	2026-03-07 06:50:26 +00:00
unclecode	db98aefb03	Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793, #1792, #1794, #1784, #1730	2026-03-07 06:47:03 +00:00
unclecode	761664d29e	fix: add TTL expiry for Redis task data to prevent memory growth (#1730) From PR #1730 by @hoi	2026-03-07 06:17:58 +00:00
unclecode	e47e810aca	fix: handle UnicodeEncodeError in URL seeder and strip zero-width chars (#1784) From PR #1784 by @Br1an67	2026-03-07 06:16:41 +00:00
unclecode	1029815fd4	fix: add Windows support for crawler monitor keyboard input (#1794) From PR #1794 by @Br1an67	2026-03-07 06:16:12 +00:00
unclecode	d229beeaf8	fix: add wait_for_images option to screenshot endpoint (#1792) From PR #1792 by @Br1an67	2026-03-07 06:15:54 +00:00
unclecode	c73aa271ac	fix: make link_preview_timeout configurable in AdaptiveConfig (#1793) From PR #1793 by @Br1an67	2026-03-07 06:15:44 +00:00
unclecode	91330ef179	fix: add explicit utf-8 encoding to CLI file output (#1789) From PR #1789 by @Br1an67	2026-03-07 06:15:32 +00:00
unclecode	d6a8f57fdd	docs: fix css_selector type from list to string in examples (#1308) From PR #1308 by @dominicx	2026-03-07 06:15:14 +00:00
unclecode	e6c2a65625	docs: fix return type annotations to use RunManyReturn (#1716) From PR #1716 by @YuriNachos	2026-03-07 06:14:49 +00:00
unclecode	5601861555	docs: add missing CacheMode import in quickstart example (#1715) From PR #1715 by @YuriNachos	2026-03-07 06:13:32 +00:00
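
Note on commit c4389adddf: the message describes two safeguards for scan_full_page, a default cap of 10 scroll steps and an asyncio.wait_for wrapper that gives up after the page timeout. The following is a minimal sketch of that pattern only, not Crawl4AI's actual code. The names scroll_page, scan_with_timeout, and the scroll_once callback are hypothetical; scroll_once is assumed to scroll one step and return the page's current total height.

```python
import asyncio

# Default step cap taken from the commit message; the constant name is illustrative.
DEFAULT_MAX_SCROLL_STEPS = 10


async def scroll_page(scroll_once, max_scroll_steps=None):
    """Scroll until the page height stops growing or the step cap is reached.

    Returns the number of scroll steps performed.
    """
    steps = DEFAULT_MAX_SCROLL_STEPS if max_scroll_steps is None else max_scroll_steps
    last_height = -1
    done = 0
    for _ in range(steps):
        height = await scroll_once()  # hypothetical: scroll one step, report height
        done += 1
        if height == last_height:
            break  # the page stopped growing, nothing more to load
        last_height = height
    return done


async def scan_with_timeout(scroll_once, timeout_s, max_scroll_steps=None):
    """Bound the whole scan by a timeout; return None if it was cut short."""
    try:
        return await asyncio.wait_for(
            scroll_page(scroll_once, max_scroll_steps), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # Mirrors the described behavior: warn and continue with a partial scroll.
        return None
```

The step cap handles pages whose height keeps growing on every scroll, while the timeout handles a single scroll call that never returns; either safeguard alone would leave one of the two hang modes open.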

1461 Commits
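
Note on commit 11b45760da: the URLPatternFilter fix matches path-only prefix patterns such as /docs/* against urlparse().path rather than the full URL, which could never start with "/". A rough sketch of that idea, assuming a hypothetical helper matches_prefix that is not part of the Crawl4AI API:

```python
from urllib.parse import urlparse


def matches_prefix(url: str, pattern: str) -> bool:
    """Prefix match for patterns ending in '*' (hypothetical helper).

    Path-only patterns ("/docs/*") are compared with the URL's path;
    full-URL patterns ("https://example.com/docs/*") with the whole URL.
    """
    prefix = pattern[:-1] if pattern.endswith("*") else pattern
    if prefix.startswith("/"):
        # A full URL begins with its scheme, so compare against the path only.
        return urlparse(url).path.startswith(prefix)
    return url.startswith(prefix)
```

For example, matches_prefix("https://example.com/docs/intro", "/docs/*") holds, while the same pattern does not match "https://example.com/blog/docs/".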