1533 Commits

Author SHA1 Message Date
Nasrin
4e86399bfa Merge pull request #1913 from unclecode/fix/nlp-sentence-chunking-1909
fix(chunking): preserve sentence order in NlpSentenceChunking
2026-04-16 10:24:59 +02:00
ntohidi
3d4bda122a fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917)
ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.

Replaced with set(False) which works across context boundaries. Safe
because deep_crawl_active is never nested — the guard on line 21
prevents re-entry.
2026-04-16 13:49:32 +08:00
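Note on 3d4bda122a: the failure mode described in this commit can be reproduced with the standard library alone. The sketch below is an illustration, not the project's code. The variable name `deep_crawl_active` comes from the commit message; `enter_crawl` is a hypothetical helper, and `copy_context().run()` is an assumed stand-in for the Task switch that the commit attributes to Starlette's StreamingResponse.

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def enter_crawl():
    # Set the flag and hand back the token, as a guard would on entry.
    return deep_crawl_active.set(True)


# Create the token inside a copied Context (stand-in for another Task).
other_ctx = contextvars.copy_context()
token = other_ctx.run(enter_crawl)

# reset() in the current Context fails because the token belongs to other_ctx.
try:
    deep_crawl_active.reset(token)
    reset_failed = False
except ValueError:
    reset_failed = True

# set(False) has no such restriction and works in any Context.
deep_crawl_active.set(False)
```

Because the flag is never nested (per the commit message), overwriting it with `set(False)` loses nothing that `reset(token)` would have restored.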
ntohidi
7bfc547bce fix: preserve rowspan/colspan in cleaned_html (#1920)
Add rowspan and colspan to IMPORTANT_ATTRS so they survive
attribute stripping in remove_unwanted_attributes_fast().
2026-04-16 12:42:36 +08:00
unclecode
c9914691db chore: add GitHub Security Advisory payloads for release day 2026-04-15 06:08:42 +00:00
unclecode
c45ccf20f6 fix: credit wulonchia by requested handle 2026-04-15 05:45:15 +00:00
unclecode
0f20f8bb83 fix(security): batch 2 - JWT secret, eval removal, execute_js, hook sandbox
Fixes for 4 vulnerabilities reported by by111/August829 (2026-04-14):

1. Hardcoded JWT secret (CVSS 9.8): Removed "mysecret" default from
   auth.py. Added weak secret validation (blocklist + min 32 chars).
   Auto-generates ephemeral key when none set.

2. eval() in /config/dump (CVSS 9.1): Replaced eval-based config
   parsing with JSON input {type, params} validated by Pydantic.
   Added authentication. Deleted _safe_eval_config and all AST
   allowlist code.

3. /execute_js endpoint (CVSS 8.1): Disabled by default via
   CRAWL4AI_EXECUTE_JS_ENABLED env var. Added SSRF blocklist on
   destination URL. Removed --disable-web-security from default
   browser args.

4. Hook sandbox escape (CVSS 9.8): Strip __builtins__, __loader__,
   __spec__ from injected module proxies. Removed type, hasattr,
   __build_class__ from allowed builtins.

Also added SECURITY-CREDITS.md tracking all reporters.
30 adversarial tests added.

DO NOT PUSH until release day.
2026-04-15 05:42:14 +00:00
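Note on 0f20f8bb83, item 1: a minimal sketch of the secret handling the commit describes (blocklist, 32-character minimum, ephemeral key when unset). The function name `resolve_jwt_secret` and every blocklist entry other than "mysecret" are assumptions made for illustration; the actual code in auth.py may differ.

```python
import secrets

# Hypothetical blocklist; the commit only confirms "mysecret" as the old default.
WEAK_SECRETS = {"mysecret", "secret", "changeme", "password"}
MIN_SECRET_LENGTH = 32  # minimum length stated in the commit message


def resolve_jwt_secret(configured):
    """Return (secret, is_ephemeral); raise ValueError for weak secrets."""
    if not configured:
        # Nothing configured: generate a random key valid for this process only.
        return secrets.token_urlsafe(48), True
    if configured.lower() in WEAK_SECRETS:
        raise ValueError("JWT secret is on the weak-secret blocklist")
    if len(configured) < MIN_SECRET_LENGTH:
        raise ValueError(
            f"JWT secret must be at least {MIN_SECRET_LENGTH} characters"
        )
    return configured, False
```

An ephemeral key means tokens stop validating after a restart, which is likely the intended trade-off when no secret has been set.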
unclecode
7976b45817 fix(security): patch 4 vulns - file write, SSRF, monitor auth, XSS
Fixes for 4 vulnerabilities reported by Jeongbean Jeon (2026-04-13):

1. Arbitrary File Write (CVSS 9.1): /screenshot and /pdf output_path
   now validated via validate_output_path() restricting writes to
   CRAWL4AI_OUTPUT_DIR. Pydantic validator rejects '..' at schema level.

2. SSRF via Webhook (CVSS 8.6): validate_webhook_url() blocks private
   IPs (RFC 1918, loopback, link-local, cloud metadata), dangerous
   hostnames (localhost, metadata.google.internal, host.docker.internal).
   Validated at job submission + send time. follow_redirects=False set.

3. Monitor Auth Bypass (CVSS 6.5): monitor_router now mounted with
   dependencies=[Depends(token_dep)]. WebSocket /ws endpoint checks
   CRAWL4AI_API_TOKEN from query params.

4. Stored XSS (CVSS 6.1): Server-side html.escape() on URLs and errors
   in monitor.py. Client-side escapeHtml() wrapping all innerHTML
   template injections in index.html (active/completed/error lists +
   WebSocket updates).

33 adversarial security tests added.

DO NOT PUSH until release day. Merge to develop + tag + advisory together.
2026-04-13 11:29:54 +00:00
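Note on 7976b45817, item 2: the webhook check can be sketched with `ipaddress` and `urllib.parse`. This is a simplified illustration under stated assumptions, not the project's `validate_webhook_url()`: the name `is_webhook_url_allowed` is hypothetical, only IP literals and the hostnames listed in the commit are checked, and no DNS resolution is performed.

```python
import ipaddress
from urllib.parse import urlparse

# Hostnames named in the commit message.
BLOCKED_HOSTNAMES = {
    "localhost",
    "metadata.google.internal",
    "host.docker.internal",
}


def is_webhook_url_allowed(url):
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host in BLOCKED_HOSTNAMES:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A fuller validator would also resolve the
        # name and check every resulting address.
        return True
    # Covers RFC 1918 ranges, loopback, and link-local (including the
    # 169.254.169.254 cloud metadata address).
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```

Checking at submission and again at send time, as the commit does, narrows the window in which a hostname could be re-pointed at an internal address.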
ntohidi
c837c0d9cb fix(chunking): preserve sentence order in NlpSentenceChunking (#1909)
Remove broken re-import of load_nltk_punkt (already imported at module level).
Replace list(set(sens)) with plain return — set() destroyed document order
and silently dropped duplicate sentences.
2026-04-11 17:27:18 +08:00
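Note on c837c0d9cb: the effect of `list(set(sens))` is easy to see in isolation. In this sketch a naive regex splitter stands in for NLTK's punkt tokenizer, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter used only for this illustration (not NLTK punkt).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_buggy(text):
    # set() discards duplicates and gives no ordering guarantee.
    return list(set(split_sentences(text)))


def chunk_fixed(text):
    # Plain return keeps document order and repeated sentences.
    return split_sentences(text)


text = "Stop. Look both ways. Stop. Then cross."
fixed = chunk_fixed(text)
buggy = chunk_buggy(text)
```

Here `fixed` keeps all four sentences in order, while `buggy` holds only three and may list them in any order.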
hafezparast
c5612f7551 fix: correct arun() return type from RunManyReturn to CrawlResultContainer (#1898)
arun() always returns CrawlResultContainer, never AsyncGenerator. The
RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused
Pylance/Pyright to flag result.markdown as an error because AsyncGenerator
doesn't have that attribute.

Also adds test_type_annotations.py — 11 static analysis tests that catch
annotation mismatches (return types, missing annotations, export checks)
without needing pyright in CI. Would have caught this bug before it was
reported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 21:35:17 +08:00
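Note on c5612f7551: a return annotation can be checked at test time without a type checker by reading it through `typing.get_type_hints`. The classes below are simplified stand-ins that only borrow names from the commit message, and `return_annotation` is a hypothetical helper; the real test_type_annotations.py may take a different approach.

```python
from typing import AsyncGenerator, Union, get_type_hints


class CrawlResultContainer:
    markdown = ""


# The union that made type checkers reject result.markdown.
RunManyReturn = Union[
    CrawlResultContainer, AsyncGenerator[CrawlResultContainer, None]
]


class Crawler:
    async def arun(self, url: str) -> CrawlResultContainer:
        return CrawlResultContainer()

    async def arun_many(self, urls: list) -> RunManyReturn:
        return CrawlResultContainer()


def return_annotation(func):
    """Resolve and return the declared return type of a function."""
    return get_type_hints(func)["return"]
```

A test can then assert that `return_annotation(Crawler.arun)` is exactly `CrawlResultContainer`, which fails if the wider union is reintroduced.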
Nasrin
3d02d75edb Merge pull request #1852 from hafezparast/feat/maysam-arun-many-config-list-1837
feat: expose arun_many config-list support in Docker API (#1837)
2026-04-06 10:26:44 +02:00
unclecode
ec560f13d2 fix: default LLMExtractionStrategy extraction_type to schema
Block mode returns an internal index/tags/content format that is
rarely useful. Schema mode returns clean structured JSON, either
matching a provided schema or inferred from the instruction.
2026-04-04 09:26:35 +00:00
unclecode
e326da9166 fix(security): complete AST sandbox escape remediation (CVSS 9.8)
Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng).

- Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from
  extraction_strategy.py. Dead security-sensitive code is a liability.
  The eval path was already disabled; this removes the function itself.
- Fix hook_manager.py module injection: replace broken exec("import X", ns)
  pattern (silently failed due to missing __import__) with direct module
  injection. Sanitize asyncio to strip subprocess access (RCE vector).
- Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints
  unauthenticated).
- Expand adversarial test suite to 87 tests: hook sandbox escapes,
  asyncio.subprocess RCE verification, end-to-end exploit payload from
  vuln report, dead code deletion checks, codebase eval/exec audit.
2026-03-31 13:01:57 +00:00
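Note on e326da9166: the broken `exec("import X", ns)` pattern and the direct-injection replacement can be demonstrated with the standard library. This is a sketch under assumptions: the empty `__builtins__` namespace and the helper `without_attrs` are illustrative, and the set of names stripped from asyncio is a guess at what "strip subprocess access" covers.

```python
import asyncio
import json
import types

# Sandbox namespace without builtins, so __import__ is unavailable.
ns = {"__builtins__": {}}

try:
    exec("import json", ns)
    import_worked = True
except ImportError:
    # The import statement looks up __import__ in builtins and fails.
    import_worked = False

# Direct injection: put the already-imported module object in the namespace.
ns["json"] = json
exec("encoded = json.dumps({'ok': True})", ns)


def without_attrs(module, blocked):
    """Return a namespace with a module's public names minus `blocked`."""
    public = {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_") and name not in blocked
    }
    return types.SimpleNamespace(**public)


safe_asyncio = without_attrs(
    asyncio,
    {"subprocess", "create_subprocess_exec", "create_subprocess_shell"},
)
```

If the `ImportError` is swallowed by a broad `except`, the module never appears in the namespace, which matches the "silently failed" behavior the commit describes.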
unclecode
2fc39cbe89 fix(security): remove eval() from computed fields, harden config deserializer
- Disable eval() in _compute_field expression path (RCE vector via untrusted input).
  Expression key now logs warning and returns default; function key still works.
- Harden _safe_eval_config in server.py with name/attribute allowlists,
  block lambdas, generators, comprehensions in constructor args.
- Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors).
- Add 67 adversarial security tests covering all eval/exec attack surfaces.

Closes #1886, closes #1855
2026-03-31 12:02:43 +00:00
UncleCode
1debe5f5fc Merge pull request #1885 from unclecode/develop
docs: update version references to 0.8.6
2026-03-30 09:59:58 +07:00
ntohidi
bcbccbea2f docs: update version references to 0.8.6 in README and Docker docs 2026-03-30 10:57:13 +08:00
Nasrin
7e7533ec7c Merge pull request #1882 from hafezparast/fix/crawler-config-dict-validation-1880
fix: validate markdown_generator type to catch bad JSON format (#1880)
2026-03-30 04:50:32 +02:00
hafezparast
e9f832274e fix: validate markdown_generator type in CrawlerRunConfig to catch bad JSON format (#1880)
When the Docker API receives markdown_generator as JSON with "options"
instead of "params", from_serializable_dict silently passes the raw
dict through. This later crashes with a confusing "'dict' object has
no attribute 'generate_markdown'" deep in the crawl pipeline.

Add type validation for markdown_generator in CrawlerRunConfig.__init__
(matching existing extraction_strategy/chunking_strategy validation).
When a dict slips through, the error now clearly states:
- What type was expected vs received
- That "params" is the required key (not "options")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 07:39:28 +08:00
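Note on e9f832274e: failing early in the constructor might look roughly like the sketch below. `MarkdownGenerator` and `RunConfig` are simplified, hypothetical stand-ins for the project's classes, and the error wording is invented for illustration.

```python
class MarkdownGenerator:
    def generate_markdown(self, html):
        return html


class RunConfig:
    def __init__(self, markdown_generator=None):
        if markdown_generator is not None and not isinstance(
            markdown_generator, MarkdownGenerator
        ):
            hint = ""
            if isinstance(markdown_generator, dict) and "params" not in markdown_generator:
                # Typical cause: JSON sent with "options" instead of "params".
                hint = ' Serialized objects need a "params" key (not "options").'
            raise ValueError(
                "markdown_generator must be a MarkdownGenerator, got "
                f"{type(markdown_generator).__name__}.{hint}"
            )
        self.markdown_generator = markdown_generator
```

Raising at construction time points at the malformed payload instead of at an attribute lookup deep in the crawl pipeline.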
unclecode
af648e104f fix: bump Dockerfile version to 0.8.6
docker-rebuild-v0.8.6
2026-03-24 15:19:18 +00:00
unclecode
4e4a996878 fix: replace litellm with unclecode-litellm due to PyPI supply chain compromise
litellm 1.82.7-1.82.8 on PyPI were compromised with malicious code.
PyPI quarantined the entire package (all versions uninstallable).
Switched to unclecode-litellm==1.81.13, a pre-compromise fork published
under our own PyPI account. Drop-in replacement — all imports unchanged.
v0.8.6
2026-03-24 14:49:36 +00:00
unclecode
f4bda05178 release: bump version to 0.8.6
Pin litellm to safe fork due to PyPI supply chain compromise
(versions 1.82.7-1.82.8 compromised, entire package quarantined).
2026-03-24 14:13:41 +00:00
unclecode
01c685cd3a fix: pin litellm to safe fork (v1.81.13) due to PyPI supply chain compromise
litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious
code. PyPI has quarantined the entire package, blocking all installs.
Temporarily pin to our own fork at a known-safe version.
2026-03-24 14:03:26 +00:00
Nasrin
1a40ccf093 Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842
fix: improve browser None guard in create_browser_context (#1842)
2026-03-24 11:37:46 +01:00
Nasrin
6eb2530bd9 Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848
fix: skip non-allowlisted types in serialization/deserialization (#1848)
2026-03-24 11:36:03 +01:00
Nasrin
fb24ee592e Merge pull request #1851 from hafezparast/fix/maysam-mcp-sse-asgi-1850
fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
2026-03-24 11:17:35 +01:00
ntohidi
3846b738cf Merge branch 'develop' of https://github.com/unclecode/crawl4ai into main 2026-03-24 18:10:40 +08:00
UncleCode
1a597cb97f Merge pull request #1836 from unclecode/release/v0.8.5
Release v0.8.5
2026-03-24 11:06:58 +01:00
hafezparast
8995c1bbd6 feat: expose arun_many config-list support in Docker API (#1837)
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:56:53 +08:00
hafezparast
219416e49d fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:55:41 +08:00
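Note on 219416e49d: the difference between an async function and a callable class instance can be shown without Starlette. The sketch assumes a router that tells the two apart with a check like `inspect.isfunction`; the names `sse_function`, `SseApp`, and `run_app` are hypothetical, and the stub `receive`/`send` callables only imitate an ASGI server.

```python
import asyncio
import inspect


async def sse_function(request):
    # Plain async function: per the commit, Route wraps this and calls
    # handler(request), so raw ASGI arguments never reach it.
    return "wrapped"


class SseApp:
    """Raw ASGI app: an instance is called as app(scope, receive, send)."""

    async def __call__(self, scope, receive, send):
        await send({
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/event-stream")],
        })
        await send({"type": "http.response.body", "body": b"data: hello\n\n"})


def run_app(app):
    """Drive an ASGI app once with stub callables and collect sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/sse"}
    asyncio.run(app(scope, receive, send))
    return sent
```

`inspect.isfunction(sse_function)` is true while `inspect.isfunction(SseApp())` is false, which is one way a router could decide to leave the instance unwrapped.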
hafezparast
e603e4a722 fix: skip non-allowlisted types in serialization/deserialization (#1848)
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:17:02 +08:00
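Note on e603e4a722: the allowlist behavior on both sides of the round trip can be sketched as follows. `CacheConfig`, `ALLOWED_TYPES`, and both function names are hypothetical simplifications; only the `{"type": ..., "params": ...}` shape and the return-None behavior come from the commit message.

```python
import logging


class CacheConfig:
    def __init__(self, ttl=60):
        self.ttl = ttl


# Stand-in for ALLOWED_DESERIALIZE_TYPES.
ALLOWED_TYPES = {"CacheConfig": CacheConfig}


def to_serializable(obj):
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, dict):
        return {k: to_serializable(v) for k, v in obj.items()}
    name = type(obj).__name__
    if name not in ALLOWED_TYPES:
        return None  # e.g. logging.Logger: skipped instead of serialized
    return {"type": name, "params": to_serializable(vars(obj))}


def from_serializable(data):
    if isinstance(data, dict) and "type" in data:
        cls = ALLOWED_TYPES.get(data["type"])
        if cls is None:
            return None  # unknown type from an older client: ignore it
        params = data.get("params", {})
        return cls(**{k: from_serializable(v) for k, v in params.items()})
    if isinstance(data, dict):
        return {k: from_serializable(v) for k, v in data.items()}
    return data
```

Returning `None` on both sides keeps the two directions symmetric: nothing is emitted that the receiving side would refuse.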
hafezparast
2fd0f4c6a7 fix: preserve mermaid diagram text from SVGs during scraping (#1043)
Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 11:43:54 +08:00
hafezparast
310b52b663 fix: improve browser None guard in create_browser_context (#1842)
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 10:45:38 +08:00
ntohidi
37da8b8f97 fix: pin redis-tools version to match redis-server in Dockerfile
docker-rebuild-v0.8.5
2026-03-21 14:26:23 +08:00
ntohidi
29d27ed1ae fix: install curl and gnupg in Dockerfile to support Redis repository addition 2026-03-21 14:17:27 +08:00
unclecode
c4389adddf fix: Prevent scan_full_page from hanging on dynamic/infinite-scroll pages
- Default max_scroll_steps to 10 when not explicitly set (was None/unlimited)
- Wrap _handle_full_page_scan in asyncio.wait_for with page_timeout
- On timeout, log warning and continue with partial scroll instead of hanging

Previously, scan_full_page could hang indefinitely because:
1. max_scroll_steps defaulted to None (no limit)
2. Dynamic pages keep growing total_height on each scroll
3. No asyncio timeout wrapper to interrupt hung coroutines
2026-03-18 15:36:12 +00:00
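Note on c4389adddf: the two safeguards (a default step cap and an `asyncio.wait_for` wrapper) can be sketched against a simulated page whose height never stops growing. All function names here are hypothetical, and the loop is a simplification of a real scroll routine.

```python
import asyncio

DEFAULT_MAX_SCROLL_STEPS = 10  # default named in the commit message


async def scroll_page(get_height, max_scroll_steps=None, delay=0.0):
    limit = DEFAULT_MAX_SCROLL_STEPS if max_scroll_steps is None else max_scroll_steps
    steps, last_height = 0, -1
    while steps < limit:
        height = get_height()
        if height == last_height:
            break  # page stopped growing
        last_height = height
        steps += 1
        await asyncio.sleep(delay)  # stand-in for the scroll pause
    return steps


async def scan_with_timeout(get_height, page_timeout, **kwargs):
    try:
        return await asyncio.wait_for(
            scroll_page(get_height, **kwargs), timeout=page_timeout
        )
    except asyncio.TimeoutError:
        # Log and continue with whatever was scrolled so far.
        print("full-page scan timed out; continuing with partial scroll")
        return None


def growing_page():
    """Simulate infinite scroll: the height grows on every read."""
    height = 0

    def get_height():
        nonlocal height
        height += 1000
        return height

    return get_height
```

With the default cap the loop ends after 10 steps even though the height keeps changing; with a very large cap, the timeout is what ends the scan.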
unclecode
3ecd852011 fix: Re-check is_blocked() when fallback fetch fails
When fallback_fetch_function was invoked but failed (exception or empty
response), the final is_blocked() re-check was skipped because
fallback_fetch_used=True. This left crawl_result.success=True even though
the result was a blocked page from the last proxy attempt.

Changed the condition to check resolved_by=='fallback_fetch' (set only on
success) instead of fallback_fetch_used (set before the try block).
2026-03-18 14:36:57 +00:00
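Note on 3ecd852011: the fix hinges on testing a marker that is set only when the fallback succeeds, instead of a flag set before the attempt. The sketch below uses a placeholder `is_blocked` check and a dict in place of the real crawl result; only `resolved_by` and the value `'fallback_fetch'` are taken from the commit message.

```python
def is_blocked(html):
    return "Access Denied" in html  # placeholder detection rule


def failing_fetch():
    raise RuntimeError("fallback service unavailable")


def crawl_with_fallback(proxy_html, fallback_fetch):
    result = {"html": proxy_html, "success": True, "resolved_by": None}
    if is_blocked(proxy_html):
        try:
            html = fallback_fetch()
            if html:
                result["html"] = html
                result["resolved_by"] = "fallback_fetch"  # set only on success
        except Exception:
            pass  # fallback failed; the blocked proxy result is kept
    # Re-check unless the fallback actually supplied the content.
    if result["resolved_by"] != "fallback_fetch" and is_blocked(result["html"]):
        result["success"] = False
    return result
```

A flag set before the `try` block cannot distinguish "attempted" from "succeeded", which is why the old condition skipped the re-check after a failed fallback.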
ntohidi
4bf17796d4 feat: add version 0.8.5 release highlights including anti-bot detection, shadow DOM support, and critical security fixes to README
v0.8.5
2026-03-18 11:23:20 +08:00
unclecode
9b571bb947 feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.

Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
2026-03-16 14:03:43 +00:00
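Note on 9b571bb947: header-based detection along the lines the commit describes could look like this. The function name, the return labels, and the exact set of text types are assumptions, and the sketch expects header names to be lowercased already.

```python
# Text-based types that would also populate the html field (per the commit:
# CSV, JSON, XML). The exact MIME strings are assumed.
TEXT_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}
HTML_TYPES = {"text/html", "application/xhtml+xml", ""}


def classify_response(headers):
    """Return 'html', 'text_file', or 'binary_file' from response headers."""
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    disposition = headers.get("content-disposition", "").lower()
    is_attachment = disposition.startswith("attachment")

    if content_type in HTML_TYPES and not is_attachment:
        return "html"
    if content_type in TEXT_TYPES or content_type.startswith("text/"):
        return "text_file"  # saved to disk and mirrored into html
    return "binary_file"  # saved to disk only; html stays empty
```

Splitting on `;` drops parameters such as `charset=utf-8` before the comparison.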
ntohidi
bb6406a2d0 release: Crawl4AI v0.8.5
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).

Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.

Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches
2026-03-16 18:46:05 +08:00
ntohidi
f6ab207e25 fix: remove shared LOCK contention in monitor to prevent pod deadlock (#1754)
The monitor's update_timeline(), get_health_summary(), and
get_browser_list() all acquired the crawler pool's global LOCK to read
pool stats. That same lock is held during slow browser start/close
operations (get_crawler, janitor, close_all), causing the monitor to
block indefinitely and the pod to become unresponsive after sustained
crawling.

Replaced all three lock acquisitions in monitor.py with a lock-free
get_pool_snapshot() in crawler_pool.py that returns shallow dict copies.
Under CPython's GIL, dict.copy() and len() are atomic — safe for
read-only monitoring with at most slightly stale counts.
2026-03-13 12:17:52 +08:00
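Note on f6ab207e25: a lock-free snapshot in the spirit of `get_pool_snapshot()` might be as small as the sketch below. The pool state `active_browsers` and the returned keys are hypothetical, and the atomicity argument is the commit's own (it relies on CPython's GIL and may not carry over to free-threaded builds).

```python
import threading

LOCK = threading.Lock()  # held by slow start/close operations
active_browsers = {}     # hypothetical pool state: signature -> info


def get_pool_snapshot():
    # Deliberately does not acquire LOCK: a shallow copy and len() are
    # enough for read-only monitoring, at the cost of slightly stale counts.
    return {
        "active": active_browsers.copy(),
        "active_count": len(active_browsers),
    }


# The snapshot returns even while another operation holds the lock.
active_browsers["sig-1"] = {"pages": 2}
with LOCK:
    snapshot = get_pool_snapshot()
```

Because the copy is shallow, callers should treat the nested values as read-only.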
Nasrin
648f36b622 Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817
fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
2026-03-13 03:59:28 +01:00
Nasrin
6e4299577f Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484
fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
2026-03-13 03:38:15 +01:00
hafezparast
8de83a3590 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)
css_selector was skipped in _scrap() — only target_elements was
applied. Now css_selector filters the DOM first, then target_elements
narrows within that selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:00:33 +08:00
unclecode
bf1158a61b fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671)
Add official Redis apt repository and pin redis-server to 7.2.7 which
patches the Lua use-after-free vulnerability. REDIS_VERSION build arg
allows override.
2026-03-12 11:24:42 +00:00
unclecode
a73bc1c076 fix: MCP SSE endpoint crash — mount via raw ASGI Route (#1594)
Replace @app.get() with starlette.routing.Route() for the SSE handler.
The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send)
internally, which conflicts with Starlette's middleware wrapping.

Also update CONTRIBUTORS.md for PR #1829.
2026-03-12 11:22:48 +00:00
hafezparast
3f481e9e5c fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762)
- #1370: Freeze element dimensions via CSS before viewport resize in
  take_screenshot_scroller() to prevent responsive reflow on Elementor
  sites; restore original viewport after capture.
- #1818: Call window.stop() on session-reused pages before navigation
  to abort pending loads; move event listener cleanup outside session_id
  guard so listeners don't accumulate across reuses.
- #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is
  set — call arun() directly per URL so the DeepCrawlDecorator can
  invoke the strategy (dispatcher crashes on List[CrawlResult] return).
- #1762: Add encoding="utf-8" to the remaining open() call in
  save_global_config() (cli.py line 58).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:17:13 +08:00
hafezparast
480d938f67 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:53:04 +08:00
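Note on 480d938f67, #1817: assembling the Redis URL from separate fields with environment overrides can be sketched as below. The function name, the defaults, and the `env` parameter (added to make the sketch testable) are assumptions; the field and variable names follow the commit message.

```python
import os
from urllib.parse import quote


def build_redis_url(cfg, env=None):
    env = os.environ if env is None else env
    # Environment variables take precedence over config file values.
    host = env.get("REDIS_HOST", cfg.get("host", "localhost"))
    port = int(env.get("REDIS_PORT", cfg.get("port", 6379)))
    password = env.get("REDIS_PASSWORD", cfg.get("password"))
    db = cfg.get("db", 0)
    scheme = "rediss" if cfg.get("ssl") else "redis"
    # Percent-encode the password so characters like "@" do not break the URL.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}/{db}"
```

For example, a config with host `cache`, port 6380, password `p@ss`, db 2, and `ssl` enabled yields `rediss://:p%40ss@cache:6380/2`.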
Nasrin
d907e167a5 Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750
fix: screenshot respects scan_full_page=False (#1750)
2026-03-12 07:39:52 +01:00
Maysam Hafezparast
57b0d09934 fix: deduplicate BM25ContentFilter output (#1213) (#1824)
BM25ContentFilter.filter_content() returned duplicate text chunks when
the same content appeared in multiple DOM elements. Added exact-text
deduplication after threshold filtering, keeping the first occurrence
in document order.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 14:23:34 +08:00
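Note on 57b0d09934: exact-text deduplication that keeps the first occurrence in document order is a short loop. The function name is hypothetical, and the sketch operates on plain strings instead of the filter's internal chunk objects.

```python
def dedupe_keep_first(chunks):
    """Drop exact duplicates while keeping first occurrences in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```

Unlike `list(set(chunks))`, this keeps the order in which the chunks appeared in the document.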
unclecode
35034f551b docs: add hafezparast to CONTRIBUTORS.md
Recognized for identifying and confirming the PDFContentScrapingStrategy
deserialization fix (#1815).
2026-03-12 05:43:48 +00:00
hafezparast
6efbffe345 fix: screenshot respects scan_full_page=False (#1750)
take_screenshot() ignored the scan_full_page config flag — tall pages
always got a full-page screenshot even when scan_full_page=False.
Now passes scan_full_page through to take_screenshot() and uses
viewport-only capture when False.

Includes 16 tests (8 unit + 8 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 12:04:45 +08:00