1531 Commits

Author SHA1 Message Date
unclecode
5d3d0fe2d7 docs: 0.8.8 release notes, changelog, README; pre-announce next secure-by-default release
- CHANGELOG: 0.8.8 entry (backward-compatible security patch - SSRF filter gaps,
  output_path symlink hardening, LLM credential exfiltration, CRLF/webhook
  header injection) plus a Coming next note giving advance notice of the next
  secure-by-default Docker server release with breaking changes (~1-2 weeks),
  framed as hardening only.
- docs/blog/release-v0.8.8.md: release notes.
- README: badge + highlights -> 0.8.8, prior release demoted.

Credit Geo (geo-chen) for the base_url credential-exfiltration report.
2026-06-04 00:59:17 +00:00
unclecode
aa81e8fe7d security: non-breaking hardening patch (0.8.8)
Backward-compatible fixes for the Docker server - features keep working, only
the unsafe behavior is closed. (The secure-by-default redesign is the later
major.)

- SSRF: replace the explicit blocklist with a single rule (reject any resolved
  IP for which ip.is_global is false), also evaluated on embedded IPv4
  transition forms, closing the gaps: IPv6 unspecified ::, NAT64 64:ff9b::/96,
  6to4 2002::/16, and v4-mapped addresses. Error messages are now opaque (no
  resolved-IP leak).
- output_path arbitrary write: harden validate_output_path with realpath
  containment (defeats a symlinked path component) and write via O_NOFOLLOW
  (write_output_file). output_path stays supported.
- LLM base_url key exfil: ignore a request-supplied base_url in /md, /llm,
  /llm/job; the endpoint is always server-derived. Field still accepted (no
  4xx) for compatibility.
- env:SECRET_KEY exfil gadget: LLMConfig refuses env: resolution of protected
  names (SECRET/PASSWORD/PRIVATE substrings, CRAWL4AI*/AWS_SECRET* prefixes,
  SECRET_KEY/REDIS_PASSWORD/TOKEN). Normal provider keys (OPENAI_API_KEY, ...)
  unaffected.
- CRLF log injection: CRLFSafeFilter strips CR/LF/control from log records.
- Webhook header injection: sanitize_webhook_headers (name pattern, no control
  chars, deny hop-by-hop/sensitive) at send time + a WebhookConfig validator
  for early 422.
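
The single-rule SSRF check from the first bullet can be sketched roughly as follows. This is only an illustration built on Python's standard `ipaddress` module, not the project's actual code; the names `embedded_ipv4_candidates` and `is_destination_allowed` are hypothetical.

```python
import ipaddress

# NAT64 well-known prefix: the IPv4 address sits in the low 32 bits.
NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def embedded_ipv4_candidates(ip):
    """Return the address itself plus any IPv4 address embedded in it."""
    candidates = [ip]
    if isinstance(ip, ipaddress.IPv6Address):
        if ip.ipv4_mapped is not None:   # ::ffff:a.b.c.d
            candidates.append(ip.ipv4_mapped)
        if ip.sixtofour is not None:     # 2002::/16 (6to4)
            candidates.append(ip.sixtofour)
        if ip in NAT64_PREFIX:           # 64:ff9b::/96 (NAT64)
            candidates.append(ipaddress.IPv4Address(int(ip) & 0xFFFFFFFF))
    return candidates


def is_destination_allowed(address: str) -> bool:
    """Allow the destination only if every candidate form is globally routable."""
    ip = ipaddress.ip_address(address)
    return all(candidate.is_global for candidate in embedded_ipv4_candidates(ip))
```

With this shape, `::`, `::ffff:127.0.0.1`, `64:ff9b::7f00:1`, and `2002:7f00:1::` are all rejected without listing each range explicitly.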

Bump 0.8.7 -> 0.8.8 (__version__ + Dockerfile C4AI_VER). 30 new behavioral
tests; existing 111 security tests + 112 library config tests still pass.

NOT included (breaking -> deferred to the major): auth-by-default, trust
boundary, declarative hooks, output_path removal, base_url/provider removal,
loopback bind, redis password, TLS-verify-on, CORS, bounded queue. The
exec-hook RCE and unauth-by-default criticals have no non-breaking fix and are
closed only in the major (hooks are already off by default).
2026-06-02 12:39:04 +00:00
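
The output_path hardening in the commit above (realpath containment plus an O_NOFOLLOW write) might look something like the sketch below. The function name `write_contained` and its signature are assumptions for illustration, not the project's `validate_output_path` or `write_output_file`; it also assumes a POSIX platform where `os.O_NOFOLLOW` is available.

```python
import os


def write_contained(base_dir: str, relative_path: str, data: bytes) -> str:
    """Write data under base_dir, refusing paths that escape it via symlinks or '..'."""
    base = os.path.realpath(base_dir)
    # realpath resolves symlinked components, so an escape becomes visible here.
    target = os.path.realpath(os.path.join(base, relative_path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError("output path escapes the allowed directory")
    # O_NOFOLLOW makes the open fail if the final component is a symlink,
    # which guards against a link being swapped in after the check above.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_NOFOLLOW
    fd = os.open(target, flags, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return target
```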
unclecode
72fd78e59e chore: gitignore out/ local pipeline output 2026-06-02 05:14:11 +00:00
unclecode
7259d734a1 security(credits): credit IcySun & Yashon (co-reporter) per their request 2026-06-01 14:44:43 +00:00
unclecode
7b280d36b1 chore: gitignore .security/ - advisory payloads stay private, published via GHSA only (tags: v0.8.7, docker-rebuild-v0.8.7) 2026-06-01 14:09:23 +00:00
unclecode
4bf6071226 chore: bump version to 0.8.7 2026-06-01 14:04:51 +00:00
unclecode
30187e6dc7 docs: 0.8.7 release notes, changelog, README highlights; finalize security credits (q1uf3ng confirmed, Velayutham Selvaraj) 2026-06-01 14:04:11 +00:00
unclecode
d705d7c4f1 security(credits): acknowledge independent reporters Velayutham S and IcySun 2026-06-01 13:25:47 +00:00
unclecode
71e1667bd1 security(advisory): fold direct /crawl,/md,/llm SSRF + IPv6-mapped bypass into advisory-2; credit secsys_codex 2026-06-01 13:13:57 +00:00
unclecode
bd20700626 merge: domain mapper + 4 core fixes cherry-picked from cloud (OSS-safe, audited) for 0.8.7 2026-06-01 13:05:41 +00:00
unclecode
c06a3dcf56 fix(logger): default Console width=200 for non-TTY contexts
Rich's Console() falls back to width=80 when stdout isn't a TTY (Docker
container logs, CI, captured output). That truncates diagnostic lines
that carry structured fields like "[ANTIBOT] tier N/M proxy=... status=
validator=... reason=..." — exactly the information you want when
debugging anti-bot escalation from logs alone.

Setting width=200 only affects the non-TTY fallback; real terminals
still auto-detect their actual width. Pure ergonomics improvement for
cloud/container deployments where everyone reads logs via `docker logs`
or log aggregators rather than interactively.

(cherry picked from commit 398851562ca102d73f979d7ceab589079817cc0d)
2026-06-01 12:59:32 +00:00
unclecode
62c619d454 fix(prefetch): honor <head><base href> in quick_extract_links (#752)
Prefetch link extraction (used by cloud scan/site) resolved every relative href
against the page URL and ignored the <base> tag — so a page with
<base href> mis-resolved its relative links, while the full LXML scraping path
(non-prefetch) handled it correctly. Parse <head><base href> and urljoin it onto
the page URL as the resolution base, mirroring LXMLWebScrapingStrategy.
base_domain stays computed from the page origin so internal/external
classification is unaffected.

(cherry picked from commit 09fe40c606ea15704d3452dafbf9ad043a2f5ae6)
2026-06-01 12:59:00 +00:00
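
The `<base href>` handling described in the commit above can be illustrated with the standard library alone. This is a simplified sketch, not `quick_extract_links` itself: the class and function names are hypothetical, and for brevity it accepts the first `<base>` tag anywhere in the document rather than only inside `<head>`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseAndLinks(HTMLParser):
    """Collect the first <base href> and every <a href> in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def extract_links(html: str, page_url: str) -> list:
    parser = _BaseAndLinks()
    parser.feed(html)
    # The base href may itself be relative, so join it onto the page URL first.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]
```

For a page at `https://example.com/blog/post` containing `<base href="/docs/">`, a link to `guide.html` resolves to `https://example.com/docs/guide.html` instead of `https://example.com/blog/guide.html`.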
unclecode
7059170614 fix(async_webcrawler): skip block validator when downloaded_files is set
The post-fallback final block-check ran the validator on `crawl_result.html`
unconditionally. For binary downloads (PDFs, archives, executables), html
is empty by design — the content lives in `downloaded_files`. The keyword
detector sees `is_blocked(200, "")` → "Near-empty content (0 bytes) with
HTTP 200" → false BLOCKED verdict → success flipped to False even though
the file was retrieved cleanly.

Reproduced consistently with federalregister.gov PDFs:
  * downloaded_files: ['<S3 url>'] (file is valid, opens fine)
  * success: False
  * error_message: "Blocked by anti-bot protection: keyword:Near-empty
    content (0 bytes) with HTTP 200"

Fix: extend the validator-skip condition to include binary downloads.
The validator only knows about HTML — when downloaded_files is populated,
the response succeeded via a non-HTML path and the blocker check has no
basis to fire.

Sibling skips already in place: raw: URLs (caller-provided content) and
fallback-fetch successes.

Cloud-side bug report: unclecode/crawl4ai-cloud#710.

(cherry picked from commit d364fa1435b41b6f21355fe068b4a0955fb25dd6)
2026-06-01 12:59:00 +00:00
unclecode
a615d697af fix(async_webcrawler): success=True for binary downloads
`crawl_result.success = bool(html)` flipped success to False whenever
html was empty — including the case where a binary file (PDF, archive,
executable) was downloaded successfully into `downloaded_files`. The
sibling validator-skip patch in the prior commit handled the false
"Blocked by anti-bot" error message, but `success` was being set to
False *before* that, so the result still surfaced as a failed crawl
even though the file was retrieved cleanly.

Fix: success is True when html is non-empty OR `downloaded_files` is
populated. Applies to both the live-fetch path (~line 527) and the
cached path (~line 740) so a cached PDF replays correctly.

After this + the validator-skip patch:
  * sync /v1/crawl PDF: success=True, error_message="", downloaded_files=[<S3 URL>]
  * async /v1/crawl/async PDF: same
  * Non-binary captcha pages still flag blocked (no regression)

Cloud bug: unclecode/crawl4ai-cloud#710 + #711 (combined PDF cluster).

(cherry picked from commit 09f898f410bf27235601b680f996d7b80afc96eb)
2026-06-01 12:58:30 +00:00
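
Taken together, the two binary-download fixes above reduce to a pair of small predicates. The sketch below uses a hypothetical `Result` dataclass as a stand-in for `CrawlResult` and covers only the download case; the existing skips for raw: URLs and fallback-fetch successes are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    """Hypothetical stand-in carrying only the two fields that matter here."""
    html: str = ""
    downloaded_files: List[str] = field(default_factory=list)


def is_success(result: Result) -> bool:
    # A crawl succeeded if it produced HTML or retrieved at least one file.
    return bool(result.html) or bool(result.downloaded_files)


def should_run_block_validator(result: Result) -> bool:
    # The validator only understands HTML, so skip it for binary downloads.
    return not result.downloaded_files
```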
unclecode
858c827145 feat: add per-source timeout to DomainMapper
Each discovery source now has its own timeout (source_timeout, default 30s).
Slow sources (wayback, crt.sh) get killed individually instead of
blocking the entire scan. Fast sources always return results.

(cherry picked from commit 3fe7f3c29b4b0b2006d091d895ba6fdba1fc8f87)
2026-06-01 12:58:23 +00:00
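
A per-source timeout of this kind is commonly built on `asyncio.wait_for`. The sketch below is one way to do it under that assumption; `run_source` and `discover` are hypothetical names, and only the `source_timeout` parameter and its 30s default come from the commit message.

```python
import asyncio


async def run_source(name, source, timeout):
    """Run one discovery source, returning an empty list if it times out."""
    try:
        return name, await asyncio.wait_for(source(), timeout)
    except asyncio.TimeoutError:
        return name, []


async def discover(sources, source_timeout=30.0):
    """Run all sources concurrently; a slow source cannot block the others."""
    pairs = await asyncio.gather(
        *(run_source(name, source, source_timeout) for name, source in sources.items())
    )
    return dict(pairs)
```

Each source is cancelled on its own deadline, so results from fast sources are returned even when a slow one never answers.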
unclecode
ed60c6628a feat: add include_subdomains flag to DomainMapperConfig
When include_subdomains=False, DomainMapper skips all subdomain
discovery (crt.sh, DNS guessing, Wayback/CC host expansion) and
only scans the exact domain provided. Default is True (existing
behavior unchanged).

Used by crawl4ai-cloud /v1/scan endpoint to give users control
over subdomain discovery scope.

(cherry picked from commit 4d60dba53cb6da1ad5e8261d0652ab8c771f8474)
2026-06-01 12:58:23 +00:00
unclecode
9d5bcf78e2 feat: Add DomainMapper for comprehensive domain URL discovery
Add DomainMapper class that discovers all URLs under a domain using
8 sources: sitemap, Common Crawl, Wayback Machine, Certificate
Transparency (crt.sh), path probing, robots.txt mining, RSS/Atom
feeds, and homepage link extraction.

Key features:
- Subdomain discovery via crt.sh, Wayback, CC, and DNS guessing
- Soft-404 detection: fingerprints SPA sites and filters fake pages
- Per-host scanning with parallel execution across discovered hosts
- URL normalization, deduplication, and source attribution
- BM25 relevance scoring with head metadata extraction
- Nonsense filter for static assets, webpack chunks, Wayback garbage

For superdesign.dev: finds 171 URLs across 11 hosts in ~13s
(vs 4 URLs from AsyncUrlSeeder)

New files:
- crawl4ai/domain_mapper.py (DomainMapper class)
- crawl4ai/async_configs.py (DomainMapperConfig)
- docs/md_v2/core/domain-mapping.md (documentation)
- docs/examples/domain_mapper/domain_mapper_demo.py
- 67 tests across unit/integration/adversarial/regression

(cherry picked from commit 2d10534a8742177f1d5f521e3174ae66591d3533)
2026-06-01 12:58:23 +00:00
unclecode
fcaf08b3b3 merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7 2026-06-01 12:40:37 +00:00
Nasrin
820cbb59f9 Merge pull request #1960 from NaabZer/bugfix/stealth-import-mismatch
fix(stealth): use Stealth class from playwright-stealth 2.x
2026-05-25 12:50:56 +02:00
Nasrin
1ef552f351 Merge pull request #1967 from unclecode/fix/mcp-ensure-ascii-cjk-encoding
Preserve native Unicode in MCP tool responses by disabling ASCII escaping
2026-05-25 12:26:37 +02:00
Nasrin
be71585239 Merge pull request #1969 from cgseyhan/fix/async-logger-stderr-mcp-1968
fix: route AsyncLogger output to stderr by default (fixes #1968)
2026-05-25 12:20:01 +02:00
Nasrin
532b105fc7 Merge pull request #1975 from nightcityblade/fix/issue-1973
Thanks @nightcityblade for the quick fix! Clean and straightforward CSS change.
2026-05-25 11:59:25 +02:00
nightcityblade
791aae3a21 fix: allow assistant toolbar to scroll
Fixes unclecode/crawl4ai#1973
2026-05-20 23:07:47 +08:00
cemgo
944eb1e456 fix(logger): route AsyncLogger output to stderr by default
MCP stdio transport uses stdout for JSON-RPC messages. AsyncLogger
was writing Rich progress output to stdout (the default Console()
target), which caused clients to receive garbled JSON and log lines
interleaved in the same stream.

Changes:
- Pass stderr=True to Console() so all log output goes to stderr,
  which is the correct channel for library diagnostics and aligns
  with the behaviour of Python's own logging.StreamHandler.
- Add an injectable console parameter so downstream wrappers
  (e.g. mcp-crawl4ai, FastMCP integrations) can override the target
  stream without monkey-patching.
- Add import sys (used in docstring example).
- Add tests/test_async_logger_stderr.py with 7 tests covering the
  default-to-stderr behaviour, custom console injection, verbose=False
  suppression, file logging, and an end-to-end MCP scenario.

Fixes #1968
2026-05-14 14:13:30 +03:00
Soham Kukreti
76f56af2dd fix: use ensure_ascii=False in MCP bridge json.dumps to preserve CJK characters 2026-05-13 14:10:19 +05:30
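
The effect of `ensure_ascii=False` in the commit above is easy to see with a small payload (the payload itself is made up for illustration):

```python
import json

payload = {"title": "你好"}

# Default behaviour: non-ASCII characters are emitted as \uXXXX escapes.
escaped = json.dumps(payload)

# With ensure_ascii=False the characters are kept as-is.
native = json.dumps(payload, ensure_ascii=False)

print(escaped)
print(native)
```

Both strings decode to the same object; only the readability of the serialized text differs.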
NaabZer
5568a9ad38 Change browser_adapter to use Stealth import instead of stealth_* 2026-05-07 18:10:40 +09:00
Nasrin
dfb525edec Merge pull request #1951 from unclecode/fix/bedrock-provider-prefix
Add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works
2026-05-06 10:37:05 +02:00
Nasrin
47a4c256c9 Merge pull request #1952 from hafezparast/fix/maysam-silent-scrape-failure-1949
fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949)
2026-05-05 07:52:44 +02:00
Nasrin
a45c678ee4 Merge pull request #1939 from unclecode/fix/preserve-tail-text-1938
fix: preserve .tail text when removing empty elements (#1938)
2026-05-05 07:46:47 +02:00
hafezparast
5e5519b1c6 fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949)
Two issues caused silent COMPLETE ✗ with no diagnostic output:

1. When crawl_result.success=False (anti-bot detection, empty HTML, etc.),
   the error_message was set on the CrawlResult but never logged — users
   saw only [COMPLETE] ✗ with zero explanation. Fix: emit an [ERROR] log
   containing error_message before the COMPLETE line whenever success=False.

2. The SCRAPE log in aprocess_html always emitted success=True regardless
   of whether scraping produced any content. Fix: use bool(cleaned_html)
   so SCRAPE reflects the actual outcome.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:18:03 +08:00
Soham Kukreti
660f49c879 fix: add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works
LLMConfig.__init__ checks PROVIDER_MODELS_PREFIXES when api_token=None.
  If the provider prefix isn't found, it silently falls through to the else
  branch and overwrites self.provider with DEFAULT_PROVIDER (openai/gpt-4o),
  meaning any bedrock/* model string was being replaced before the LLM call
  was even made.

  This broke the supported Bedrock auth methods whenever api_token was not passed to LLMConfig.

  Only passing api_token=<bearer_token> explicitly worked, because the
  truthy api_token bypassed the prefix check entirely.

  Adding "bedrock": None to PROVIDER_MODELS_PREFIXES keeps
  self.provider intact so the correct Bedrock provider is used. The actual
  auth (SigV4 signing or Bearer header) is handled downstream based on what credentials are
  available in the environment.
2026-04-30 22:13:54 +05:30
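
The fallthrough described in the commit above can be reconstructed roughly as below. This is a guess at the shape of the logic based only on the commit message, not a copy of `LLMConfig.__init__`; `resolve_provider` is a hypothetical helper and the prefix tables are illustrative subsets.

```python
DEFAULT_PROVIDER = "openai/gpt-4o"


def resolve_provider(provider: str, api_token, prefixes: dict) -> str:
    """Return the provider that would end up being used."""
    if api_token:
        # An explicit token bypasses the prefix check entirely.
        return provider
    for prefix in prefixes:
        if provider.startswith(prefix):
            return provider
    # Unknown prefix: the provider is silently replaced by the default.
    return DEFAULT_PROVIDER


before_fix = {"openai": "OPENAI_API_KEY"}   # illustrative subset
after_fix = {**before_fix, "bedrock": None}  # the entry this commit adds
```

Under these assumptions, a `bedrock/...` provider with no api_token resolves to the default with `before_fix` and stays intact with `after_fix`.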
Nasrin
388ce3f033 Merge pull request #1940 from hafezparast/fix/mermaid-sequence-fence-1043
fix: broaden mermaid SVG text extraction and prevent nested fences (#1043)
2026-04-30 13:34:00 +02:00
hafezparast
dba38c7886 fix: broaden mermaid SVG text extraction and prevent nested fences (#1043)
Three improvements over the initial fix:

1. Add SVG text/tspan fallback for sequence, gantt, and git diagrams
   which don't use foreignObject / .nodeLabel spans but instead
   render labels via native SVG <text> elements.

2. Detect when a mermaid SVG is already wrapped in an outer <pre>
   (e.g. deepwiki.com) and replace it with a plain <span> instead
   of a new <pre> block, avoiding invalid nested markdown fences.

3. Add data-language attribute support to CustomHTML2Text so that
   pre[data-language=mermaid] emits a proper ```mermaid fence in
   the markdown output, instead of an unlabelled ``` block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 01:36:52 +08:00
Nasrin
4d139247b9 Merge pull request #1934 from hafezparast/fix/maysam-dispatcher-semaphore-count-1927
fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927)
2026-04-24 18:53:56 +02:00
ntohidi
e8f1af7c16 fix: preserve .tail text when removing empty elements (#1938)
remove_empty_elements_fast() was dropping trailing text attached to
elements via lxml .tail when removing empty elements. Now appends
the tail to the previous sibling or parent before removal.
2026-04-24 18:51:24 +02:00
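
The tail-preservation idea in the commit above can be shown with the standard library's `xml.etree.ElementTree`, which shares the `.text`/`.tail` model with lxml. This is a stand-in sketch, not `remove_empty_elements_fast()`: the function name is hypothetical and the emptiness test is deliberately simplistic.

```python
import xml.etree.ElementTree as ET


def remove_empty_children(parent):
    """Drop childless, textless descendants of parent while keeping their tail text."""
    previous = None
    for child in list(parent):
        remove_empty_children(child)
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if not is_empty:
            previous = child
            continue
        if child.tail:
            if previous is not None:
                # Hand the tail to the previous sibling...
                previous.tail = (previous.tail or "") + child.tail
            else:
                # ...or to the parent's text when there is no previous sibling.
                parent.text = (parent.text or "") + child.tail
        parent.remove(child)
```

Removing an element also removes its `.tail`, which is why the text has to be moved before the call to `remove`.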
ntohidi
04985ea15e docs: update arun() docstring to match CrawlResultContainer return type 2026-04-24 18:35:57 +02:00
Nasrin
35ee366e28 Merge pull request #1901 from hafezparast/fix/maysam-arun-type-hint-1898
fix: correct arun() return type annotation (#1898)
2026-04-24 18:33:41 +02:00
Nasrin
244fbf7b58 Merge pull request #1929 from atomic-carpenter/listen-on-all-addressess
docker: listen on all addresses
2026-04-24 17:59:48 +02:00
hafezparast
4e72f31011 fix: use semaphore_count default of 10 to match CrawlerRunConfig default 2026-04-24 23:57:17 +08:00
Nasrin
d595679d25 Merge pull request #1925 from sevenmoonlightsteps/fix/docker-llm-table-extraction-allowlist
fix: add LLMTableExtraction to Docker API deserialization allowlist
2026-04-24 17:11:56 +02:00
Nasrin
936e4470eb Merge pull request #1845 from hafezparast/fix/maysam-mermaid-svg-text-1043
fix: preserve mermaid diagram text from SVGs during scraping (#1043)
2026-04-24 16:49:59 +02:00
Nasrin
5e56e34840 Merge pull request #1922 from unclecode/fix/deep-crawl-streaming-contextvar-1917
fix(deep-crawl): ContextVar crash in streaming deep crawl (#1917)
2026-04-24 16:36:08 +02:00
hafezparast
d3c92ee3df fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927)
When arun_many() created a MemoryAdaptiveDispatcher automatically,
max_session_permit was always 20 (the class default), silently ignoring
the user's semaphore_count setting in CrawlerRunConfig.

Now reads semaphore_count from the primary config and passes it as
max_session_permit. The max(1, ... or 5) guard handles zero/None/negative
values safely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:59:52 +08:00
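
The `max(1, ... or 5)` guard mentioned above behaves as in this small sketch. The helper name is hypothetical, and the fallback of 5 is the value stated in this commit message.

```python
def resolve_max_session_permit(semaphore_count):
    """Map a user-supplied semaphore_count onto a usable permit count."""
    # `or 5` replaces None and 0 with the fallback; max(1, ...) clamps negatives.
    return max(1, semaphore_count or 5)


for value in (None, 0, -3, 8):
    print(value, "->", resolve_max_session_permit(value))
```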
Atomic Carpenter
e06b19ca09 docker: listen on all addresses
Fixes: https://github.com/unclecode/crawl4ai/issues/1928
2026-04-22 00:47:08 +02:00
unclecode
1e25edcb5c fix(security): block IPv6-mapped IPv4 SSRF bypass
Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed
validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which
is in neither the IPv4 blocklist (127.0.0.0/8) nor the IPv6 blocklist (::1/128).

Fix: added _expand_ip_candidates() helper that unwraps IPv4 from
IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible
(::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks
both the original IP and the unwrapped IPv4 form.

Added 6 new TestIPv6MappedBypass tests covering:
- Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping
- IPv4-compatible variant (::127.0.0.1)
- Regression test that plain ::1 is still blocked

Also updated stale test assertion in test_eval_security_adversarial:
hasattr, type, __build_class__ were removed from hook builtins in
batch 2 but the test still expected hasattr to remain.

DO NOT PUSH until release day.
2026-04-20 10:10:59 +00:00
unclecode
f77c0a856f fix(security): SSRF protection on all crawl/md/llm URL entry points
Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints
pass user URLs to crawler.arun() with no private IP validation.

- Add validate_url_destination() to utils.py with opt-out via
  CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need
  to crawl internal services.
- Integrate into validate_url_scheme() (covers all server.py endpoints).
- Add validation at all 4 URL entry points in api.py (handle_llm_qa,
  handle_markdown_request, create_new_task, handle_crawl_request).
- raw: URLs bypass check (inline HTML, no network fetch).
- 16 adversarial + source coverage tests added.
- secsys_codex added to SECURITY-CREDITS.md.

DO NOT PUSH until release day.
2026-04-20 09:42:43 +00:00
unclecode
0e92b5e239 docs: add Privacy Policy, Terms of Service, and Support pages
Add legal pages required for Google Workspace Marketplace listing
verification. Pages cover the whole Crawl4AI Cloud business (OSS
library, hosted API, dashboard, integrations, Workspace add-ons),
not specific to any single product.

- privacy.md: data collection, usage, retention, Workspace Limited Use
- terms.md: account, billing, acceptable use, IP, governing law (SG)
- support.md: email, docs, GitHub, Discord, security disclosure
2026-04-20 02:24:21 +00:00
Gab
c8c2dc319f fix: add LLMTableExtraction to Docker API deserialization allowlist 2026-04-17 15:43:56 -04:00
Nasrin
4e86399bfa Merge pull request #1913 from unclecode/fix/nlp-sentence-chunking-1909
fix(chunking): preserve sentence order in NlpSentenceChunking
2026-04-16 10:24:59 +02:00
ntohidi
3d4bda122a fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917)
ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.

Replaced with set(False) which works across context boundaries. Safe
because deep_crawl_active is never nested — the guard on line 21
prevents re-entry.
2026-04-16 13:49:32 +08:00
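
The ContextVar behaviour behind this fix can be reproduced without Starlette by using two explicit contexts in place of two tasks. The sketch below is a standalone illustration; only the variable name `deep_crawl_active` comes from the commit message, and the helper functions are hypothetical.

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def reset_in_other_context() -> str:
    """Create a token in one Context and try to reset it from another."""
    ctx_a = contextvars.copy_context()
    token = ctx_a.run(deep_crawl_active.set, True)
    ctx_b = contextvars.copy_context()

    def attempt():
        try:
            deep_crawl_active.reset(token)
        except ValueError:
            return "ValueError"
        return "reset ok"

    return ctx_b.run(attempt)


def clear_in_other_context() -> bool:
    """set(False) needs no token, so it works in whichever Context is current."""
    ctx = contextvars.copy_context()
    ctx.run(deep_crawl_active.set, False)
    return ctx.run(deep_crawl_active.get)
```

`reset(token)` fails because the token remembers the Context it was created in, while `set(False)` carries no such tie.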