1446 Commits

Author SHA1 Message Date
unclecode
9b571bb947 feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.

Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
2026-03-16 14:03:43 +00:00
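
The header check described in 9b571bb947 could look roughly like the sketch below. It is not the project's code: `classify_response`, `HTML_TYPES`, and `TEXT_TYPES` are hypothetical names, and treating a missing Content-Type as HTML is an assumption made for the example.

```python
from email.message import Message

# Hypothetical type sets; the commit only names CSV, JSON and XML as text-based.
HTML_TYPES = {"text/html", "application/xhtml+xml"}
TEXT_TYPES = {"text/csv", "application/json", "application/xml"}


def classify_response(headers):
    """Return 'html', 'text_file', or 'binary_file' for a response's headers."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value  # Message lookups are case-insensitive

    # Assumption: a response without Content-Type is treated as HTML.
    ctype = msg.get_content_type() if "Content-Type" in msg else "text/html"
    disposition = msg.get_content_disposition()  # 'attachment', 'inline', or None

    if ctype in HTML_TYPES and disposition != "attachment":
        return "html"
    if ctype in TEXT_TYPES or ctype.startswith("text/"):
        return "text_file"  # bytes saved to disk and also exposed as text
    return "binary_file"  # bytes saved to disk only


print(classify_response({"Content-Type": "application/pdf"}))
```
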
ntohidi
f6ab207e25 fix: remove shared LOCK contention in monitor to prevent pod deadlock (#1754)
The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.

Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
2026-03-13 12:17:52 +08:00
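
A minimal sketch of the lock-free read described in f6ab207e25. The pool names (`HOT_POOL`, `COLD_POOL`) and the use of `threading.Lock` are assumptions for illustration; the real module's state and lock type may differ.

```python
import threading

# Hypothetical pool state guarded by a global lock for writes.
LOCK = threading.Lock()
HOT_POOL = {}
COLD_POOL = {}


def get_pool_snapshot():
    """Return shallow copies of the pools without acquiring LOCK."""
    hot = HOT_POOL.copy()
    cold = COLD_POOL.copy()
    return {"hot": hot, "cold": cold, "total": len(hot) + len(cold)}


HOT_POOL["browser-1"] = object()
with LOCK:
    # A slow start/close would hold LOCK here; the monitor read still returns.
    snapshot = get_pool_snapshot()
print(snapshot["total"])
```

The snapshot is a copy, so the monitor can iterate it freely; the counts may lag behind the live pool by one update.
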
Nasrin
648f36b622 Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817
fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
2026-03-13 03:59:28 +01:00
Nasrin
6e4299577f Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484
fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
2026-03-13 03:38:15 +01:00
hafezparast
8de83a3590 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:00:33 +08:00
unclecode
bf1158a61b fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671)
Add official Redis apt repository and pin redis-server to 7.2.7 which
patches the Lua use-after-free vulnerability. REDIS_VERSION build arg
allows override.
2026-03-12 11:24:42 +00:00
unclecode
a73bc1c076 fix: MCP SSE endpoint crash — mount via raw ASGI Route (#1594)
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.

Also update CONTRIBUTORS.md for PR #1829.
2026-03-12 11:22:48 +00:00
hafezparast
3f481e9e5c fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762)
- #1370: Freeze element dimensions via CSS before viewport resize in
  take_screenshot_scroller() to prevent responsive reflow on Elementor
  sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
  to abort pending loads; move event listener cleanup outside session_id
  guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
  set — call arun() directly per URL so the DeepCrawlDecorator can
  invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
  save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:17:13 +08:00
hafezparast
480d938f67 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:53:04 +08:00
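
The #1817 change in 480d938f67 (building the Redis URL from discrete fields) might be sketched as follows. `build_redis_url` and the default values are hypothetical; only the config field names and the three environment variables come from the commit message.

```python
import os
from urllib.parse import quote


def build_redis_url(cfg):
    """Build a redis:// or rediss:// URL; environment variables win over config."""
    host = os.environ.get("REDIS_HOST") or cfg.get("host", "localhost")
    port = int(os.environ.get("REDIS_PORT") or cfg.get("port", 6379))
    password = os.environ.get("REDIS_PASSWORD") or cfg.get("password")
    db = cfg.get("db", 0)
    scheme = "rediss" if cfg.get("ssl") else "redis"
    # Percent-encode the password so characters like '@' do not break the URL.
    auth = ":" + quote(str(password), safe="") + "@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{db}"


print(build_redis_url({"host": "cache", "port": 6380, "db": 2}))
```
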
Nasrin
d907e167a5 Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750
fix: screenshot respects scan_full_page=False (#1750)
2026-03-12 07:39:52 +01:00
Maysam Hafezparast
57b0d09934 fix: deduplicate BM25ContentFilter output (#1213) (#1824)
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:23:34 +08:00
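
Exact-text deduplication that keeps the first occurrence, as described in 57b0d09934, can be sketched in a few lines. `dedupe_chunks` is a hypothetical helper, not the filter's actual method.

```python
def dedupe_chunks(chunks):
    """Drop exact-text duplicates, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique


print(dedupe_chunks(["intro", "pricing", "intro", "faq", "pricing"]))
```
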
unclecode
35034f551b docs: add hafezparast to CONTRIBUTORS.md
Recognized for identifying and confirming the PDFContentScrapingStrategy
deserialization fix (#1815).
2026-03-12 05:43:48 +00:00
hafezparast
6efbffe345 fix: screenshot respects scan_full_page=False (#1750)
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.

Includes 16 tests (8 unit + 8 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 12:04:45 +08:00
unclecode
11b45760da fix: anti-bot false positive on browser JSON, URLPatternFilter prefix match, PDF deserialization
- antibot_detector: add <pre> to content elements regex, detect
  browser-wrapped JSON in _looks_like_data() so httpbin-style
  responses are not flagged as blocked
- deep_crawling/filters: use urlparse().path for path-only prefix
  patterns (/docs/*) instead of matching against full URL, which
  always failed; full-URL prefixes still match correctly
- async_configs: add PDFContentScrapingStrategy to
  ALLOWED_DESERIALIZE_TYPES so /crawl API can deserialize it
- __init__: export PDFContentScrapingStrategy for type resolution
- tests: add 86-test suite covering all three fixes with adversarial
  and edge cases
2026-03-09 14:52:58 +00:00
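
The URLPatternFilter fix in 11b45760da hinges on matching path-only patterns against the URL path rather than the full URL. A rough sketch, with `matches_prefix` as a hypothetical stand-in and `fnmatchcase` swapped in for whatever matching the filter actually uses:

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def matches_prefix(url, pattern):
    """Match path-only patterns against the path, other patterns against the full URL."""
    # Assumption: a pattern starting with '/' is a path-only pattern.
    target = urlparse(url).path if pattern.startswith("/") else url
    return fnmatchcase(target, pattern)


print(matches_prefix("https://example.com/docs/intro", "/docs/*"))
```

Matching `/docs/*` against the full URL can never succeed, because the URL starts with the scheme rather than a slash.
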
unclecode
55956a874d fix: 3 bug fixes (#1487, #1512, #1666) + close 3 already-fixed issues
- #1487: Move virtual scroll after wait_for so dynamic containers exist
- #1512: Add __aiter__ to CrawlResultContainer for async for support
- #1666: Kill process group on cleanup to prevent zombie child processes,
  add a fallback for Docker environments where lsof is not installed
- Close #1472 (redirect chain already fixed), #1480 (links already
  normalized), #1679 (duplicate of #1509)
2026-03-08 08:44:04 +00:00
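
For #1512 in 55956a874d, adding `__aiter__` is what lets a container be used with `async for`. The class below is a hypothetical stand-in for illustration, not CrawlResultContainer itself.

```python
import asyncio


class ResultContainer:
    """Hypothetical list-backed container supporting both for and async for."""

    def __init__(self, results):
        self._results = list(results)

    def __iter__(self):
        return iter(self._results)

    async def __aiter__(self):
        # Written as an async generator, so calling it returns an async iterator.
        for result in self._results:
            yield result


async def collect(container):
    return [item async for item in container]


print(asyncio.run(collect(ResultContainer(["a", "b"]))))
```
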
unclecode
a7e6da0b19 Merge fix/batch-easy-issues-10: 10 bug fixes + regression test suite
Bug fixes: #1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758
Regression tests: 291 tests covering all major subsystems
2026-03-08 03:20:56 +00:00
unclecode
d788c28315 test: add comprehensive regression test suite (291 tests)
Full regression suite covering all major Crawl4AI subsystems:
- core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks)
- content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata)
- extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction)
- deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization)
- browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes)
- config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips)
- utilities (extract_xml_data, cache modes, content hashing)
- edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery)

Also adds /c4ai-check slash command for testing changes against the suite.
2026-03-08 03:20:52 +00:00
unclecode
3a75dd3f4c fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)
- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
2026-03-07 09:47:38 +00:00
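
The #1251 item in 3a75dd3f4c (defaults plus deep-merge) could be sketched as below. The contents of `DEFAULT_CONFIG` and the `deep_merge` helper are hypothetical; the point is that a partial user config no longer removes sibling keys.

```python
import copy

# Hypothetical defaults; the real DEFAULT_CONFIG is not shown in the commit.
DEFAULT_CONFIG = {
    "redis": {"host": "localhost", "port": 6379},
    "logging": {"level": "INFO"},
}


def deep_merge(base, override):
    """Return a new dict where override wins, merging nested dicts recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


config = deep_merge(DEFAULT_CONFIG, {"redis": {"host": "cache"}})
print(config["redis"]["port"])  # still present after a partial override
```
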
unclecode
0c9e3c427e Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved)
Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668
Closed as superseded: #1592
Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729
2026-03-07 08:49:32 +00:00
unclecode
7c0cc3ed88 fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668)
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
2026-03-07 08:45:11 +00:00
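
The Unicode-preservation setting from #1668 in 7c0cc3ed88 maps naturally onto the `ensure_ascii` parameter of the standard library's `json.dumps`. The `dump_result` wrapper and its default value are assumptions for illustration.

```python
import json


def dump_result(data, ensure_ascii=True):
    """Serialize data; ensure_ascii=False keeps non-ASCII characters as-is."""
    return json.dumps(data, ensure_ascii=ensure_ascii)


print(dump_result({"title": "café"}))
print(dump_result({"title": "café"}, ensure_ascii=False))
```
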
unclecode
11ed854155 Update CONTRIBUTORS for PR #462
2026-03-07 07:06:49 +00:00
unclecode
697c2b2a58 fix: add newline before opening code fence in html2text (#462)
From PR #462 by @jtanningbed
2026-03-07 07:06:41 +00:00
unclecode
3704758746 Update CONTRIBUTORS for PR #1770
2026-03-07 07:01:54 +00:00
unclecode
04e83aa3c7 docs: modernize deprecated API usage across shipped docs (#1770)
Update docs/examples to use current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache mode descriptions in arun.md

From PR #1770 by @maksimzayats
2026-03-07 07:01:06 +00:00
unclecode
31d0de23df Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list
2026-03-07 06:50:26 +00:00
unclecode
db98aefb03 Update CONTRIBUTORS for PRs #1494, #1715, #1716, #1308, #1789, #1793, #1792, #1794, #1784, #1730
2026-03-07 06:47:03 +00:00
unclecode
761664d29e fix: add TTL expiry for Redis task data to prevent memory growth (#1730)
From PR #1730 by @hoi
2026-03-07 06:17:58 +00:00
unclecode
e47e810aca fix: handle UnicodeEncodeError in URL seeder and strip zero-width chars (#1784)
From PR #1784 by @Br1an67
2026-03-07 06:16:41 +00:00
unclecode
1029815fd4 fix: add Windows support for crawler monitor keyboard input (#1794)
From PR #1794 by @Br1an67
2026-03-07 06:16:12 +00:00
unclecode
d229beeaf8 fix: add wait_for_images option to screenshot endpoint (#1792)
From PR #1792 by @Br1an67
2026-03-07 06:15:54 +00:00
unclecode
c73aa271ac fix: make link_preview_timeout configurable in AdaptiveConfig (#1793)
From PR #1793 by @Br1an67
2026-03-07 06:15:44 +00:00
unclecode
91330ef179 fix: add explicit utf-8 encoding to CLI file output (#1789)
From PR #1789 by @Br1an67
2026-03-07 06:15:32 +00:00
unclecode
d6a8f57fdd docs: fix css_selector type from list to string in examples (#1308)
From PR #1308 by @dominicx
2026-03-07 06:15:14 +00:00
unclecode
e6c2a65625 docs: fix return type annotations to use RunManyReturn (#1716)
From PR #1716 by @YuriNachos
2026-03-07 06:14:49 +00:00
unclecode
5601861555 docs: add missing CacheMode import in quickstart example (#1715)
From PR #1715 by @YuriNachos
2026-03-07 06:13:32 +00:00
unclecode
72cc17c113 docs: fix docstring param name crawler_config -> config (#1494)
From PR #1494 by @AkosLukacs
2026-03-07 06:13:18 +00:00
unclecode
814bc4df47 Update CONTRIBUTORS for PRs #1782, #1788, #1783, #1179
2026-03-07 04:15:49 +00:00
unclecode
93f2f03fab Merge PR #1783: fix: strip port from URL domain in is_external_url comparison
Strip port number from netloc before domain comparison so that
example.com:8080 correctly matches base domain example.com.
2026-03-07 04:15:35 +00:00
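
The port-stripping comparison from PR #1783 can be illustrated with `urlparse().hostname`, which excludes the port. `is_external` is a hypothetical helper, and treating subdomains as internal is an assumption of this sketch rather than confirmed project behavior.

```python
from urllib.parse import urlparse


def is_external(url, base_domain):
    """Return True when url's host is outside base_domain, ignoring any port."""
    host = urlparse(url).hostname or ""  # hostname drops ':8080' and lowercases
    base = base_domain.lower()
    # Assumption: subdomains of base_domain count as internal.
    return not (host == base or host.endswith("." + base))


print(is_external("http://example.com:8080/page", "example.com"))
```
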
unclecode
5f65d2d1fd Merge PR #1788: fix: guard against None LLM content and propagate finish_reason
Adds None check before processing LLM response content in both extract()
and aextract(). When LLM returns no content (e.g. content filter, token
limit), returns an error block with finish_reason instead of crashing.
Also guards the except fallback path against None content.
2026-03-07 04:15:22 +00:00
unclecode
122be00076 Merge PR #1782: fix: preserve class and id attributes in cleaned_html
Add "class" and "id" to IMPORTANT_ATTRS so they survive HTML cleaning.
CSS-based extraction strategies need these attributes to match selectors.
2026-03-07 04:14:21 +00:00
unclecode
4bde952ade Update CONTRIBUTORS for PRs #1787, #1790, #1804
2026-03-07 04:00:36 +00:00
unclecode
ff2ea3429a Merge PR #1804: feat: add score_threshold support to BestFirstCrawlingStrategy
Adds score_threshold parameter (default -inf for backward compatibility)
to BestFirstCrawlingStrategy, matching BFS and DFS strategies. URLs
scoring below the threshold are skipped.
Fixes #1801.
2026-03-07 03:59:28 +00:00
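
A default of negative infinity keeps existing behavior because every score passes the check. A small sketch of the idea behind PR #1804, with `filter_by_score` as a hypothetical helper rather than the strategy's real method:

```python
import math


def filter_by_score(scored_urls, score_threshold=-math.inf):
    """Keep URLs whose score is at or above the threshold."""
    return [url for url, score in scored_urls if score >= score_threshold]


candidates = [("https://example.com/a", 0.9), ("https://example.com/b", 0.2)]
print(filter_by_score(candidates))
print(filter_by_score(candidates, score_threshold=0.5))
```
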
unclecode
9ec2969d99 Merge PR #1790: fix: handle nested brackets and parentheses in LINK_PATTERN regex
Improves LINK_PATTERN regex in markdown citation conversion to correctly
handle Wikipedia-style URLs with parentheses and text with nested brackets.
2026-03-07 03:59:17 +00:00
unclecode
bd0f6e1bd5 fix: strip markdown fences in force_json_response path (LLM extraction)
Wire existing _strip_markdown_fences() into the force_json_response
code path in both extract() and aextract(). LLMs frequently wrap JSON
in ```json fences which caused json.loads() to fail.

Inspired by PR #1787 (Br1an67).
2026-03-07 03:59:00 +00:00
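
The failure mode fixed in bd0f6e1bd5 is easy to reproduce: `json.loads()` rejects a payload wrapped in a Markdown code fence. The sketch below uses a hypothetical `strip_fences` helper, not the project's `_strip_markdown_fences()`, and builds the fence marker programmatically so the example itself stays readable inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime

# Optional language tag after the opening fence, then the payload, then the closing fence.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(text):
    """Return the payload inside a single surrounding code fence, if any."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def parse_llm_json(text):
    return json.loads(strip_fences(text))


wrapped = FENCE + 'json\n{"name": "widget", "price": 3}\n' + FENCE
print(parse_llm_json(wrapped))
```
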
unclecode
d4588904b3 Update PR-TODOLIST and CONTRIBUTORS for merged PRs #1805, #1763, #1803
2026-03-07 03:40:36 +00:00
unclecode
b008671345 Merge PR #1803: fix from_serializable_dict to ignore plain data dicts with "type" key
Narrows the typed-object deserialization path to only match dicts with
"params" or {"type":"dict","value":{...}}, preventing crashes on normal
data dicts like JSON-Schema fragments that happen to have a "type" key.
2026-03-07 03:21:33 +00:00
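
The narrowing described in PR #1803 amounts to a stricter test for "is this dict a serialized typed object?". A sketch of such a predicate, with `looks_like_typed_object` as a hypothetical name and the exact conditions inferred from the commit message only:

```python
def looks_like_typed_object(data):
    """True only for {'type': ..., 'params': ...} or {'type': 'dict', 'value': {...}}."""
    if not isinstance(data, dict) or "type" not in data:
        return False
    if "params" in data:
        return True
    return data["type"] == "dict" and isinstance(data.get("value"), dict)


# A JSON-Schema fragment has a "type" key but is plain data.
print(looks_like_typed_object({"type": "object", "properties": {}}))
```
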
unclecode
fdb3f8fd98 Merge PR #1763: fix: return in finally block silently suppressing exceptions
Moves return out of finally block and adds raise in except block so
QUEUE_ERROR exceptions properly propagate in MemoryAdaptiveDispatcher.
2026-03-07 03:21:22 +00:00
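
The bug class behind PR #1763 is a general Python behavior: a `return` inside `finally` discards any exception in flight. The two functions below are illustrative only and do not mirror MemoryAdaptiveDispatcher's code.

```python
def swallow():
    try:
        raise RuntimeError("queue error")
    finally:
        return "done"  # the RuntimeError is silently discarded here


def propagate():
    result = "done"
    try:
        raise RuntimeError("queue error")
    except RuntimeError:
        raise  # re-raise so callers see the failure
    finally:
        pass  # cleanup only, no return
    return result


print(swallow())
try:
    propagate()
except RuntimeError as exc:
    print("propagated:", exc)
```
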
unclecode
8a677a9db1 Merge PR #1805: fix: prevent AdaptiveCrawler from crawling external domains
Removes external links from being added to pending_links in digest(),
since _crawl_with_preview() always sets include_external=False.
Fixes #1776.
2026-03-07 03:21:11 +00:00
nightcityblade
78434eadac fix: prevent AdaptiveCrawler from crawling external domains
AdaptiveCrawler.digest() unconditionally added external links to
pending_links, causing the crawler to follow links to entirely
different domains even though include_external=False was set in
LinkPreviewConfig.

Remove external links from being added to pending_links in both the
initial crawl and subsequent crawl loops.

Fixes #1776
2026-03-07 10:57:42 +08:00
nightcityblade
379591047d fix: add score_threshold support to BestFirstCrawlingStrategy
Add score_threshold parameter to BestFirstCrawlingStrategy, matching the
existing behavior in BFSDeepCrawlStrategy and DFSDeepCrawlStrategy.

URLs scoring below the threshold are now skipped during link discovery
instead of being unconditionally enqueued.

Fixes #1801
2026-03-07 10:55:09 +08:00