Commit Graph

1461 Commits

Author SHA1 Message Date
unclecode
01c685cd3a fix: pin litellm to safe fork (v1.81.13) due to PyPI supply chain compromise
litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious
code. PyPI has quarantined the entire package, blocking all installs.
Temporarily pin to our own fork at a known-safe version.
2026-03-24 14:03:26 +00:00
Nasrin
1a40ccf093 Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842
fix: improve browser None guard in create_browser_context (#1842)
2026-03-24 11:37:46 +01:00
Nasrin
6eb2530bd9 Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848
fix: skip non-allowlisted types in serialization/deserialization (#1848)
2026-03-24 11:36:03 +01:00
Nasrin
fb24ee592e Merge pull request #1851 from hafezparast/fix/maysam-mcp-sse-asgi-1850
fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
2026-03-24 11:17:35 +01:00
ntohidi
3846b738cf Merge branch 'develop' of https://github.com/unclecode/crawl4ai into main
2026-03-24 18:10:40 +08:00
UncleCode
1a597cb97f Merge pull request #1836 from unclecode/release/v0.8.5
Release v0.8.5
2026-03-24 11:06:58 +01:00
hafezparast
219416e49d fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:55:41 +08:00
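The wrapping rule described in 219416e49d can be illustrated with a toy model. This is not Starlette code: `mount` is a hypothetical stand-in that only mimics the behavior the commit message describes (plain functions get wrapped, other callables pass through as raw ASGI apps), and `SSEEndpoint` is an invented name.

```python
import asyncio
import inspect


def mount(endpoint):
    """Toy stand-in for the routing rule described in the commit (hypothetical)."""
    if inspect.isfunction(endpoint) or inspect.ismethod(endpoint):
        # Plain functions are wrapped and called with a single request-like object.
        async def wrapped(scope, receive, send):
            await endpoint({"scope": scope})
        return wrapped
    # Any other callable is passed through and keeps raw ASGI access.
    return endpoint


class SSEEndpoint:
    """Callable class instance: receives (scope, receive, send) directly."""

    async def __call__(self, scope, receive, send):
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"event: ping\n\n"})


async def demo():
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.disconnect"}

    app = mount(SSEEndpoint())
    await app({"type": "http"}, receive, send)
    return sent


messages = asyncio.run(demo())
```

Because the instance is neither a function nor a bound method, `mount` returns it unchanged, which is the property the fix relies on.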
hafezparast
e603e4a722 fix: skip non-allowlisted types in serialization/deserialization (#1848)
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:17:02 +08:00
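A minimal sketch of the allowlist behavior described in e603e4a722. The function and constant names come from the commit message, but the bodies below are simplified assumptions, and `BrowserConfig` here is a stand-in class, not the library's.

```python
import logging


class BrowserConfig:
    """Stand-in config class for illustration only."""

    def __init__(self, headless=True, logger=None):
        self.headless = headless
        self.logger = logger


# Assumed shape: type name -> class. The real allowlist may differ.
ALLOWED_DESERIALIZE_TYPES = {"BrowserConfig": BrowserConfig}


def to_serializable_dict(obj):
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, (list, tuple)):
        return [to_serializable_dict(v) for v in obj]
    if isinstance(obj, dict):
        return {k: to_serializable_dict(v) for k, v in obj.items()}
    name = type(obj).__name__
    if name not in ALLOWED_DESERIALIZE_TYPES:
        return None  # e.g. logging.Logger is skipped instead of serialized
    params = {k: to_serializable_dict(v) for k, v in vars(obj).items()}
    return {"type": name, "params": params}


def from_serializable_dict(data):
    if isinstance(data, list):
        return [from_serializable_dict(v) for v in data]
    if isinstance(data, dict) and set(data) == {"type", "params"}:
        cls = ALLOWED_DESERIALIZE_TYPES.get(data["type"])
        if cls is None:
            return None  # unknown type from an older client: no ValueError
        return cls(**{k: from_serializable_dict(v) for k, v in data["params"].items()})
    if isinstance(data, dict):
        return {k: from_serializable_dict(v) for k, v in data.items()}
    return data


payload = to_serializable_dict(BrowserConfig(logger=logging.getLogger("crawler")))
restored = from_serializable_dict(payload)
```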
hafezparast
310b52b663 fix: improve browser None guard in create_browser_context (#1842)
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 10:45:38 +08:00
ntohidi
37da8b8f97 fix: pin redis-tools version to match redis-server in Dockerfile (tag: docker-rebuild-v0.8.5)
2026-03-21 14:26:23 +08:00
ntohidi
29d27ed1ae fix: install curl and gnupg in Dockerfile to support Redis repository addition
2026-03-21 14:17:27 +08:00
unclecode
c4389adddf fix: Prevent scan_full_page from hanging on dynamic/infinite-scroll pages
- Default max_scroll_steps to 10 when not explicitly set (was None/unlimited)
- Wrap _handle_full_page_scan in asyncio.wait_for with page_timeout
- On timeout, log warning and continue with partial scroll instead of hanging

Previously, scan_full_page could hang indefinitely because:
1. max_scroll_steps defaulted to None (no limit)
2. Dynamic pages keep growing total_height on each scroll
3. No asyncio timeout wrapper to interrupt hung coroutines
2026-03-18 15:36:12 +00:00
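The two safeguards listed in c4389adddf (a default step cap and an `asyncio.wait_for` wrapper) can be sketched as follows. `scroll_full_page`, `scan_with_timeout`, and `InfinitePage` are hypothetical names for this illustration; only `max_scroll_steps` and the page timeout idea come from the commit message.

```python
import asyncio


async def scroll_full_page(get_height, scroll_once, max_scroll_steps=10):
    """Scroll until the height stops growing or the step cap is reached."""
    steps = 0
    last_height = await get_height()
    while steps < max_scroll_steps:
        await scroll_once()
        steps += 1
        height = await get_height()
        if height == last_height:
            break  # page stopped growing
        last_height = height
    return steps


async def scan_with_timeout(get_height, scroll_once, page_timeout_s, max_scroll_steps=10):
    """Bound the whole scan; on timeout, give up and continue with a partial scroll."""
    try:
        return await asyncio.wait_for(
            scroll_full_page(get_height, scroll_once, max_scroll_steps),
            timeout=page_timeout_s,
        )
    except asyncio.TimeoutError:
        return None  # caller logs a warning and proceeds


class InfinitePage:
    """Fake page whose height grows on every scroll (infinite-scroll behavior)."""

    def __init__(self, delay=0.0):
        self.height = 1000
        self.delay = delay

    async def get_height(self):
        return self.height

    async def scroll_once(self):
        await asyncio.sleep(self.delay)
        self.height += 1000


fast = InfinitePage()
fast_steps = asyncio.run(scan_with_timeout(fast.get_height, fast.scroll_once, 1.0))
slow = InfinitePage(delay=0.2)
slow_steps = asyncio.run(scan_with_timeout(slow.get_height, slow.scroll_once, 0.05))
```

With the cap in place the fast page stops after ten steps even though it never stops growing; the slow page hits the timeout first and yields `None`.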
unclecode
3ecd852011 fix: Re-check is_blocked() when fallback fetch fails
When fallback_fetch_function was invoked but failed (exception or empty
response), the final is_blocked() re-check was skipped because
fallback_fetch_used=True. This left crawl_result.success=True even though
the result was a blocked page from the last proxy attempt.

Changed the condition to check resolved_by=='fallback_fetch' (set only on
success) instead of fallback_fetch_used (set before the try block).
2026-03-18 14:36:57 +00:00
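A rough sketch of the condition change in 3ecd852011, assuming a simplified control flow. `resolve` and `looks_blocked` are invented for illustration; only `resolved_by`, `fallback_fetch`, and the "set only on success" rule are taken from the commit message.

```python
def looks_blocked(html):
    """Hypothetical block detector used only for this example."""
    return "captcha" in html


def resolve(html, is_blocked, fallback_fetch_function=None):
    resolved_by = None
    if is_blocked(html) and fallback_fetch_function is not None:
        try:
            fetched = fallback_fetch_function()
            if fetched:
                # Marked only when the fallback actually produced content.
                html, resolved_by = fetched, "fallback_fetch"
        except Exception:
            pass  # failed fallback: the blocked page is still the result
    success = True
    # Re-check unless the fallback succeeded (not merely "was attempted").
    if resolved_by != "fallback_fetch" and is_blocked(html):
        success = False
    return {"html": html, "success": success, "resolved_by": resolved_by}


def failing_fallback():
    raise RuntimeError("fallback unavailable")


empty_result = resolve("captcha page", looks_blocked, lambda: "")
error_result = resolve("captcha page", looks_blocked, failing_fallback)
ok_result = resolve("captcha page", looks_blocked, lambda: "<html>ok</html>")
```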
ntohidi
4bf17796d4 feat: add version 0.8.5 release highlights including anti-bot detection, shadow DOM support, and critical security fixes to README (tag: v0.8.5)
2026-03-18 11:23:20 +08:00
unclecode
9b571bb947 feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.

Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
2026-03-16 14:03:43 +00:00
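The header check described in 9b571bb947 might look roughly like this. It is a sketch using only the standard library; `classify_response`, the two type sets, and the returned dict shape are assumptions, not the HTTP strategy's actual code.

```python
from email.message import Message

HTML_TYPES = {"text/html", "application/xhtml+xml"}
# Text-based downloads that would also populate the html field.
TEXT_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}


def classify_response(headers):
    """Decide from response headers whether the body is a file download."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    # Treat a missing Content-Type as HTML rather than as a download.
    ctype = msg.get_content_type() if "Content-Type" in msg else "text/html"
    is_attachment = msg.get_content_disposition() == "attachment"
    if ctype in HTML_TYPES and not is_attachment:
        return {"is_download": False, "populate_html": True, "filename": None}
    return {
        "is_download": True,
        "populate_html": ctype in TEXT_TYPES,  # binary files leave html empty
        "filename": msg.get_filename(),
    }


csv_info = classify_response({
    "Content-Type": "text/csv; charset=utf-8",
    "Content-Disposition": 'attachment; filename="report.csv"',
})
pdf_info = classify_response({"Content-Type": "application/pdf"})
html_info = classify_response({"Content-Type": "text/html"})
```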
ntohidi
bb6406a2d0 release: Crawl4AI v0.8.5
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).

Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.

Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches
2026-03-16 18:46:05 +08:00
ntohidi
f6ab207e25 fix: remove shared LOCK contention in monitor to prevent pod deadlock (#1754)
The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.

Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
2026-03-13 12:17:52 +08:00
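The lock-free read described in f6ab207e25 can be sketched as below. `get_pool_snapshot` and `LOCK` are named in the commit message; the `ACTIVE` and `IDLE` dicts and the snapshot keys are assumptions made for this example.

```python
import threading

LOCK = threading.Lock()  # held during slow browser start/close operations
ACTIVE = {}              # hypothetical pool: signature -> crawler
IDLE = {}                # hypothetical pool: signature -> crawler


def get_pool_snapshot():
    """Return shallow copies without taking LOCK; counts may be slightly stale."""
    return {
        "active": ACTIVE.copy(),
        "idle": IDLE.copy(),
        "active_count": len(ACTIVE),
    }


ACTIVE["browser-1"] = object()

# Simulate a slow operation holding the lock: the monitor read still returns,
# which would deadlock here if the snapshot tried to acquire the same lock.
with LOCK:
    snap = get_pool_snapshot()

ACTIVE["browser-2"] = object()  # later changes do not affect the snapshot
```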
Nasrin
648f36b622 Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817
fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
2026-03-13 03:59:28 +01:00
Nasrin
6e4299577f Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484
fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
2026-03-13 03:38:15 +01:00
hafezparast
8de83a3590 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:00:33 +08:00
unclecode
bf1158a61b fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671)
Add official Redis apt repository and pin redis-server to 7.2.7 which
patches the Lua use-after-free vulnerability. REDIS_VERSION build arg
allows override.
2026-03-12 11:24:42 +00:00
unclecode
a73bc1c076 fix: MCP SSE endpoint crash — mount via raw ASGI Route (#1594)
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.

Also update CONTRIBUTORS.md for PR #1829.
2026-03-12 11:22:48 +00:00
hafezparast
3f481e9e5c fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762)
- #1370: Freeze element dimensions via CSS before viewport resize in
  take_screenshot_scroller() to prevent responsive reflow on Elementor
  sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
  to abort pending loads; move event listener cleanup outside session_id
  guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
  set — call arun() directly per URL so the DeepCrawlDecorator can
  invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
  save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:17:13 +08:00
hafezparast
480d938f67 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:53:04 +08:00
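A possible shape for the Redis URL construction in 480d938f67 (#1817). The config field names and environment variable names are from the commit message; `build_redis_url`, the defaults, and the password quoting are assumptions.

```python
import os
from urllib.parse import quote


def build_redis_url(cfg, env=os.environ):
    """Build a Redis URL from config fields, with environment overrides."""
    host = env.get("REDIS_HOST", cfg.get("host", "localhost"))
    port = int(env.get("REDIS_PORT", cfg.get("port", 6379)))
    password = env.get("REDIS_PASSWORD", cfg.get("password"))
    db = cfg.get("db", 0)
    scheme = "rediss" if cfg.get("ssl") else "redis"  # rediss:// means TLS
    # Percent-encode the password so characters like '@' do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{db}"


from_config = build_redis_url(
    {"host": "cache", "port": 6380, "password": "p@ss", "db": 2, "ssl": True},
    env={},
)
from_env = build_redis_url(
    {"host": "cache"},
    env={"REDIS_HOST": "redis.internal", "REDIS_PORT": "7000"},
)
```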
Nasrin
d907e167a5 Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750
fix: screenshot respects scan_full_page=False (#1750)
2026-03-12 07:39:52 +01:00
Maysam Hafezparast
57b0d09934 fix: deduplicate BM25ContentFilter output (#1213) (#1824)
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:23:34 +08:00
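The deduplication step in 57b0d09934 (exact text match, first occurrence kept, document order preserved) reduces to a small order-preserving filter. `dedupe_chunks` is a hypothetical helper name; the real change lives inside `BM25ContentFilter.filter_content()`.

```python
def dedupe_chunks(chunks):
    """Drop exact-duplicate text chunks, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique


filtered = dedupe_chunks(["Intro", "Pricing table", "Intro", "Contact", "Pricing table"])
```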
unclecode
35034f551b docs: add hafezparast to CONTRIBUTORS.md
Recognized for identifying and confirming the PDFContentScrapingStrategy
deserialization fix (#1815).
2026-03-12 05:43:48 +00:00
hafezparast
6efbffe345 fix: screenshot respects scan_full_page=False (#1750)
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.

Includes 16 tests (8 unit + 8 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 12:04:45 +08:00
unclecode
11b45760da fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization
- antibot_detector: add <pre> to content elements regex, detect
  browser-wrapped JSON in _looks_like_data() so httpbin-style
  responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
  patterns (/docs/*) instead of matching against full URL, which
  always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
  ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
  and edge cases
2026-03-09 14:52:58 +00:00
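The URLPatternFilter fix in 11b45760da can be illustrated with a simplified matcher. `matches_prefix` is an invented function, and it handles only trailing-`*` prefix patterns; the actual filter supports more pattern types.

```python
from urllib.parse import urlparse


def matches_prefix(pattern, url):
    """Match a trailing-* prefix pattern against the path or the full URL."""
    prefix = pattern.rstrip("*")
    if pattern.startswith("/"):
        # Path-only pattern such as /docs/*: compare against the URL path,
        # since a full URL never starts with "/docs/".
        return urlparse(url).path.startswith(prefix)
    # Full-URL pattern: compare against the whole URL as before.
    return url.startswith(prefix)


path_hit = matches_prefix("/docs/*", "https://example.com/docs/api")
path_miss = matches_prefix("/docs/*", "https://example.com/blog/docs/api")
full_hit = matches_prefix("https://example.com/docs/*", "https://example.com/docs/api")
```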
unclecode
55956a874d fix: 3 bug fixes (#1487, #1512, #1666) + close 3 already-fixed issues
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
  add lsof fallback for Docker environments without lsof installed
- Close #1472 (redirect chain already fixed), #1480 (links already
  normalized), #1679 (duplicate of #1509)
2026-03-08 08:44:04 +00:00
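For #1512 in 55956a874d, adding `__aiter__` is what makes `async for` work on a container. The class below is a simplified stand-in that shares only the name with the library's `CrawlResultContainer`; its internals are assumptions.

```python
import asyncio


class CrawlResultContainer:
    """Simplified stand-in: wraps a list of results."""

    def __init__(self, results):
        self._results = list(results)

    def __iter__(self):
        return iter(self._results)

    async def __aiter__(self):
        # Async generator: lets callers write `async for result in container`.
        for result in self._results:
            yield result


async def collect(container):
    return [item async for item in container]


collected = asyncio.run(collect(CrawlResultContainer(["page-1", "page-2"])))
```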
unclecode
a7e6da0b19 Merge fix/batch-easy-issues-10: 10 bug fixes + regression test suite
Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758
Regression tests: 291 tests covering all major subsystems
2026-03-08 03:20:56 +00:00
unclecode
d788c28315 test: add comprehensive regression test suite (291 tests)
Full regression suite covering all major Crawl4AI subsystems:
- core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks)
- content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata)
- extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction)
- deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization)
- browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes)
- config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips)
- utilities (extract_xml_data, cache modes, content hashing)
- edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery)

Also adds /c4ai-check slash command for testing changes against the suite.
2026-03-08 03:20:52 +00:00
unclecode
3a75dd3f4c fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)
- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
2026-03-07 09:47:38 +00:00
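The #1251 fix in 3a75dd3f4c (a `DEFAULT_CONFIG` deep-merged with the user's file) can be sketched like this. The keys and values in `DEFAULT_CONFIG` are placeholders, and `load_config` here takes an already-parsed dict rather than reading a file.

```python
import copy

# Placeholder defaults; the real DEFAULT_CONFIG has different contents.
DEFAULT_CONFIG = {
    "redis": {"host": "localhost", "port": 6379},
    "rate_limiting": {"enabled": True},
}


def deep_merge(base, override):
    """Recursively overlay `override` on `base` without mutating either."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(user_config):
    """Missing sections fall back to defaults instead of raising KeyError."""
    return deep_merge(DEFAULT_CONFIG, user_config)


config = load_config({"redis": {"host": "cache.internal"}})
```

A shallow `dict.update` would replace the whole `redis` section and lose `port`; the recursive merge keeps it.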
unclecode
0c9e3c427e Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved)
Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668
Closed as superseded: #1592
Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729
2026-03-07 08:49:32 +00:00
unclecode
7c0cc3ed88 fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668)
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
2026-03-07 08:45:11 +00:00
unclecode
11ed854155 Update CONTRIBUTORS for PR #462
2026-03-07 07:06:49 +00:00
unclecode
697c2b2a58 fix: add newline before opening code fence in html2text (#462)
From PR #462 by @jtanningbed
2026-03-07 07:06:41 +00:00
unclecode
3704758746 Update CONTRIBUTORS for PR #1770
2026-03-07 07:01:54 +00:00
unclecode
04e83aa3c7 docs: modernize deprecated API usage across shipped docs (#1770)
Update docs/examples to use current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache mode descriptions in arun.md

From PR #1770 by @maksimzayats
2026-03-07 07:01:06 +00:00
unclecode
31d0de23df Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list
2026-03-07 06:50:26 +00:00
unclecode
db98aefb03 Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793, #1792, #1794, #1784, #1730
2026-03-07 06:47:03 +00:00
unclecode
761664d29e fix: add TTL expiry for Redis task data to prevent memory growth (#1730)
From PR #1730 by @hoi
2026-03-07 06:17:58 +00:00
unclecode
e47e810aca fix: handle UnicodeEncodeError in URL seeder and strip zero-width chars (#1784)
From PR #1784 by @Br1an67
2026-03-07 06:16:41 +00:00
unclecode
1029815fd4 fix: add Windows support for crawler monitor keyboard input (#1794)
From PR #1794 by @Br1an67
2026-03-07 06:16:12 +00:00
unclecode
d229beeaf8 fix: add wait_for_images option to screenshot endpoint (#1792)
From PR #1792 by @Br1an67
2026-03-07 06:15:54 +00:00
unclecode
c73aa271ac fix: make link_preview_timeout configurable in AdaptiveConfig (#1793)
From PR #1793 by @Br1an67
2026-03-07 06:15:44 +00:00
unclecode
91330ef179 fix: add explicit utf-8 encoding to CLI file output (#1789)
From PR #1789 by @Br1an67
2026-03-07 06:15:32 +00:00
unclecode
d6a8f57fdd docs: fix css_selector type from list to string in examples (#1308)
From PR #1308 by @dominicx
2026-03-07 06:15:14 +00:00
unclecode
e6c2a65625 docs: fix return type annotations to use RunManyReturn (#1716)
From PR #1716 by @YuriNachos
2026-03-07 06:14:49 +00:00
unclecode
5601861555 docs: add missing CacheMode import in quickstart example (#1715)
From PR #1715 by @YuriNachos
2026-03-07 06:13:32 +00:00