Rich's Console() falls back to width=80 when stdout isn't a TTY (Docker
container logs, CI, captured output). That truncates diagnostic lines
that carry structured fields like "[ANTIBOT] tier N/M proxy=... status=
validator=... reason=..." — exactly the information you want when
debugging anti-bot escalation from logs alone.
Setting width=200 only affects the non-TTY fallback; real terminals
still auto-detect their actual width. Pure ergonomics improvement for
cloud/container deployments where everyone reads logs via `docker logs`
or log aggregators rather than interactively.
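
A minimal sketch of the width selection described above, using only the standard library. The helper name `resolve_console_width` is hypothetical; the assumption is that its result would be passed to Rich as `Console(width=...)`, where `None` leaves auto-detection in place.

```python
import io
from typing import Optional, TextIO

NON_TTY_WIDTH = 200  # fallback width named in this commit message


def resolve_console_width(stream: TextIO, fallback: int = NON_TTY_WIDTH) -> Optional[int]:
    """Return None for a real terminal (auto-detect), else a wide fixed width."""
    # Hypothetical helper: streams without isatty() are treated as non-TTY.
    is_tty = getattr(stream, "isatty", lambda: False)()
    return None if is_tty else fallback


# A captured stream (as under `docker logs` or CI) is not a TTY.
captured = io.StringIO()
print(resolve_console_width(captured))  # 200
```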
(cherry picked from commit 398851562ca102d73f979d7ceab589079817cc0d)
Prefetch link extraction (used by cloud scan/site) resolved every relative href
against the page URL and ignored the <base> tag, so relative links on a page with
<base href> were mis-resolved, while the full LXML scraping path (non-prefetch)
handled them correctly. Fix: parse <head><base href> and urljoin it onto
the page URL as the resolution base, mirroring LXMLWebScrapingStrategy.
base_domain stays computed from the page origin so internal/external
classification is unaffected.
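
The resolution rule can be sketched with the standard library alone. This is an illustration, not the prefetch code itself: `resolution_base` and `resolve_link` are hypothetical names, and `html.parser` stands in for whatever parser the prefetch path uses.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _BaseHrefParser(HTMLParser):
    """Records the href of the first <base> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.base_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def resolution_base(page_url: str, html: str) -> str:
    """Page URL, adjusted by <base href> when one is present."""
    parser = _BaseHrefParser()
    parser.feed(html)
    if parser.base_href:
        # A relative <base href> is itself resolved against the page URL.
        return urljoin(page_url, parser.base_href)
    return page_url


def resolve_link(page_url: str, html: str, href: str) -> str:
    return urljoin(resolution_base(page_url, html), href)
```

With a page at `https://example.com/a/b.html` that declares `<base href="/docs/">`, the relative href `guide.html` resolves to `https://example.com/docs/guide.html` rather than `https://example.com/a/guide.html`.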
(cherry picked from commit 09fe40c606ea15704d3452dafbf9ad043a2f5ae6)
The post-fallback final block-check ran the validator on `crawl_result.html`
unconditionally. For binary downloads (PDFs, archives, executables), html
is empty by design — the content lives in `downloaded_files`. The keyword
detector sees `is_blocked(200, "")` → "Near-empty content (0 bytes) with
HTTP 200" → false BLOCKED verdict → success flipped to False even though
the file was retrieved cleanly.
Reproduced consistently with federalregister.gov PDFs:
* downloaded_files: ['<S3 url>'] (file is valid, opens fine)
* success: False
* error_message: "Blocked by anti-bot protection: keyword:Near-empty
content (0 bytes) with HTTP 200"
Fix: extend the validator-skip condition to include binary downloads.
The validator only knows about HTML — when downloaded_files is populated,
the response succeeded via a non-HTML path and the blocker check has no
basis to fire.
Sibling skips already in place: raw: URLs (caller-provided content) and
fallback-fetch successes.
Cloud-side bug report: unclecode/crawl4ai-cloud#710.
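
A sketch of the skip condition as described, not the actual crawler code. The function name and the `fallback_succeeded` flag are hypothetical; only the three skip cases come from this message.

```python
from typing import List, Optional


def should_run_block_check(
    url: str,
    downloaded_files: Optional[List[str]],
    fallback_succeeded: bool,
) -> bool:
    """Decide whether the HTML-only anti-bot validator has anything to inspect."""
    if url.startswith("raw:"):
        return False  # caller-provided content, nothing was fetched
    if fallback_succeeded:
        return False  # fallback fetch already produced a good result
    if downloaded_files:
        return False  # binary download: html is empty by design
    return True
```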
(cherry picked from commit d364fa1435b41b6f21355fe068b4a0955fb25dd6)
`crawl_result.success = bool(html)` flipped success to False whenever
html was empty — including the case where a binary file (PDF, archive,
executable) was downloaded successfully into `downloaded_files`. The
sibling validator-skip patch in the prior commit handled the false
"Blocked by anti-bot" error message, but `success` was being set to
False *before* that, so the result still surfaced as a failed crawl
even though the file was retrieved cleanly.
Fix: success is True when html is non-empty OR `downloaded_files` is
populated. Applies to both the live-fetch path (~line 527) and the
cached path (~line 740) so a cached PDF replays correctly.
After this + the validator-skip patch:
* sync /v1/crawl PDF: success=True, error_message="", downloaded_files=[<S3 URL>]
* async /v1/crawl/async PDF: same
* Non-binary captcha pages still flag blocked (no regression)
Cloud bug: unclecode/crawl4ai-cloud#710 + #711 (combined PDF cluster).
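
The new success rule, written out as a standalone function for illustration. `compute_success` is a hypothetical name; in the codebase the expression is applied inline on the live-fetch and cached paths.

```python
from typing import List, Optional


def compute_success(html: Optional[str], downloaded_files: Optional[List[str]]) -> bool:
    """Success when there is HTML or at least one downloaded file."""
    return bool(html) or bool(downloaded_files)
```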
(cherry picked from commit 09f898f410bf27235601b680f996d7b80afc96eb)
Each discovery source now has its own timeout (source_timeout, default 30s).
Slow sources (wayback, crt.sh) get killed individually instead of
blocking the entire scan. Fast sources always return results.
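
One way to get this behaviour is to wrap each source in its own `asyncio.wait_for`. The sketch below assumes each source is an async callable returning a list of URLs; `run_sources` and that shape are illustrative, not DomainMapper's internals.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Source = Callable[[], Awaitable[List[str]]]


async def run_sources(
    sources: Dict[str, Source], source_timeout: float = 30.0
) -> Dict[str, List[str]]:
    """Run all sources concurrently; a slow source times out on its own."""

    async def guarded(name: str, fetch: Source):
        try:
            return name, await asyncio.wait_for(fetch(), timeout=source_timeout)
        except asyncio.TimeoutError:
            # Only this source is cancelled; the others keep their results.
            return name, []

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in sources.items()))
    return dict(pairs)
```

Because the timeout wraps each coroutine separately, the scan finishes after at most roughly `source_timeout` seconds regardless of how many sources stall.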
(cherry picked from commit 3fe7f3c29b4b0b2006d091d895ba6fdba1fc8f87)
When include_subdomains=False, DomainMapper skips all subdomain
discovery (crt.sh, DNS guessing, Wayback/CC host expansion) and
only scans the exact domain provided. Default is True (existing
behavior unchanged).
Used by crawl4ai-cloud /v1/scan endpoint to give users control
over subdomain discovery scope.
(cherry picked from commit 4d60dba53cb6da1ad5e8261d0652ab8c771f8474)
Add DomainMapper class that discovers all URLs under a domain using
8 sources: sitemap, Common Crawl, Wayback Machine, Certificate
Transparency (crt.sh), path probing, robots.txt mining, RSS/Atom
feeds, and homepage link extraction.
Key features:
- Subdomain discovery via crt.sh, Wayback, CC, and DNS guessing
- Soft-404 detection: fingerprints SPA sites and filters fake pages
- Per-host scanning with parallel execution across discovered hosts
- URL normalization, deduplication, and source attribution
- BM25 relevance scoring with head metadata extraction
- Nonsense filter for static assets, webpack chunks, Wayback garbage
For superdesign.dev: finds 171 URLs across 11 hosts in ~13s
(vs 4 URLs from AsyncUrlSeeder)
New files:
- crawl4ai/domain_mapper.py (DomainMapper class)
- crawl4ai/async_configs.py (DomainMapperConfig)
- docs/md_v2/core/domain-mapping.md (documentation)
- docs/examples/domain_mapper/domain_mapper_demo.py
- 67 tests across unit/integration/adversarial/regression
(cherry picked from commit 2d10534a8742177f1d5f521e3174ae66591d3533)
MCP stdio transport uses stdout for JSON-RPC messages. AsyncLogger
was writing Rich progress output to stdout (the default Console()
target), which caused clients to receive garbled JSON and log lines
interleaved in the same stream.
Changes:
- Pass stderr=True to Console() so all log output goes to stderr,
which is the correct channel for library diagnostics and aligns
with the behaviour of Python's own logging.StreamHandler.
- Add an injectable console parameter so downstream wrappers
(e.g. mcp-crawl4ai, FastMCP integrations) can override the target
stream without monkey-patching.
- Add import sys (used in docstring example).
- Add tests/test_async_logger_stderr.py with 7 tests covering the
default-to-stderr behaviour, custom console injection, verbose=False
suppression, file logging, and an end-to-end MCP scenario.
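
The pattern in miniature, using a plain stream in place of a Rich Console so the sketch has no dependencies. `TinyLogger` is a hypothetical stand-in for AsyncLogger: it defaults to stderr and accepts an injected target, which is the behaviour the changes above describe.

```python
import io
import sys
from typing import Optional, TextIO


class TinyLogger:
    """Stand-in logger: diagnostics go to stderr unless a stream is injected."""

    def __init__(self, stream: Optional[TextIO] = None, verbose: bool = True) -> None:
        self.stream = stream  # None means "use sys.stderr at write time"
        self.verbose = verbose

    def info(self, message: str) -> None:
        if not self.verbose:
            return
        target = self.stream if self.stream is not None else sys.stderr
        print(message, file=target)


# stdout stays clean for protocol traffic such as JSON-RPC.
TinyLogger().info("[FETCH] ok")            # written to stderr
buffer = io.StringIO()
TinyLogger(stream=buffer).info("[FETCH]")  # written to the injected stream
```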
Fixes #1968
Two issues caused silent COMPLETE ✗ with no diagnostic output:
1. When crawl_result.success=False (anti-bot detection, empty HTML, etc.),
the error_message was set on the CrawlResult but never logged — users
saw only [COMPLETE] ✗ with zero explanation. Fix: emit an [ERROR] log
containing error_message before the COMPLETE line whenever success=False.
2. The SCRAPE log in aprocess_html always emitted success=True regardless
of whether scraping produced any content. Fix: use bool(cleaned_html)
so SCRAPE reflects the actual outcome.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LLMConfig.__init__ checks PROVIDER_MODELS_PREFIXES when api_token=None.
If the provider prefix isn't found, it silently falls through to the else
branch and overwrites self.provider with DEFAULT_PROVIDER (openai/gpt-4o),
meaning any bedrock/* model string was being replaced before the LLM call
was even made.
This broke the supported Bedrock auth methods whenever api_token was not
passed to LLMConfig. Only passing api_token=<bearer_token> explicitly worked,
because the truthy api_token bypassed the prefix check entirely.
Adding "bedrock": None to PROVIDER_MODELS_PREFIXES keeps
self.provider intact so the correct Bedrock provider is used. The actual
auth (SigV4 signing or Bearer header) is handled downstream based on what credentials are
available in the environment.
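
A simplified model of the fall-through, for illustration only. `resolve_provider` is a hypothetical function that condenses the branch in LLMConfig.__init__; the "openai" entry and the Bedrock model string are assumed examples, while `"bedrock": None` is the entry this commit adds.

```python
from typing import Dict, Optional

DEFAULT_PROVIDER = "openai/gpt-4o"


def resolve_provider(
    provider: str, api_token: Optional[str], prefixes: Dict[str, Optional[str]]
) -> str:
    """Keep the provider if a token is given or its prefix is known."""
    if api_token:
        return provider  # truthy token skips the prefix check
    for prefix in prefixes:
        if provider.startswith(prefix):
            return provider
    return DEFAULT_PROVIDER  # unknown prefix: provider is overwritten


before = {"openai": "OPENAI_API_KEY"}   # assumed example entry
after = {**before, "bedrock": None}     # entry added by this commit

model = "bedrock/example-model"         # hypothetical model string
print(resolve_provider(model, None, before))  # openai/gpt-4o
print(resolve_provider(model, None, after))   # bedrock/example-model
```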
Three improvements over the initial fix:
1. Add SVG text/tspan fallback for sequence, gantt, and git diagrams
which don't use foreignObject / .nodeLabel spans but instead
render labels via native SVG <text> elements.
2. Detect when a mermaid SVG is already wrapped in an outer <pre>
(e.g. deepwiki.com) and replace it with a plain <span> instead
of a new <pre> block, avoiding invalid nested markdown fences.
3. Add data-language attribute support to CustomHTML2Text so that
pre[data-language=mermaid] emits a proper ```mermaid fence in
the markdown output, instead of an unlabelled ``` block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
remove_empty_elements_fast() was dropping trailing text attached to
elements via lxml .tail when removing empty elements. Now appends
the tail to the previous sibling or parent before removal.
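
The tail handling, sketched with `xml.etree.ElementTree`, which shares the `.text` / `.tail` model with lxml. `remove_preserving_tail` is a hypothetical helper, not the body of remove_empty_elements_fast().

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent: ET.Element, child: ET.Element) -> None:
    """Remove child but keep the text that follows it in the document."""
    tail = child.tail or ""
    if tail:
        index = list(parent).index(child)
        if index > 0:
            # Text after the child now follows the previous sibling.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + tail
        else:
            # No previous sibling: the text joins the parent's leading text.
            parent.text = (parent.text or "") + tail
    parent.remove(child)


root = ET.fromstring("<p>Hello <span></span>world</p>")
remove_preserving_tail(root, root[0])
print(ET.tostring(root, encoding="unicode"))  # <p>Hello world</p>
```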
When arun_many() created a MemoryAdaptiveDispatcher automatically,
max_session_permit was always 20 (the class default), silently ignoring
the user's semaphore_count setting in CrawlerRunConfig.
Now reads semaphore_count from the primary config and passes it as
max_session_permit. The max(1, ... or 5) guard handles zero/None/negative
values safely.
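
How the guard behaves for each edge case, shown as a standalone function. The name `resolve_max_session_permit` is hypothetical; the expression is the one quoted above.

```python
from typing import Optional


def resolve_max_session_permit(semaphore_count: Optional[int]) -> int:
    # None and 0 are falsy, so `or 5` substitutes 5;
    # negatives are truthy and get clamped to 1 by max().
    return max(1, semaphore_count or 5)


for value in (None, 0, -3, 10):
    print(value, resolve_max_session_permit(value))
# None 5
# 0 5
# -3 1
# 10 10
```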
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed
validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which
matches neither the IPv4 blocklist (127.0.0.0/8) nor the IPv6 blocklist (::1/128).
Fix: added _expand_ip_candidates() helper that unwraps IPv4 from
IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible
(::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks
both the original IP and the unwrapped IPv4 form.
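
A sketch of the unwrapping step with the standard `ipaddress` module. It follows the description above but is not the shipped helper: the function names lack the leading underscore on purpose, and the network list is an illustrative subset rather than the real blocklist.

```python
import ipaddress
from typing import List, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]

# Illustrative subset of blocked ranges (assumption, not the project's list).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(net)
    for net in ("127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12",
                "192.168.0.0/16", "169.254.0.0/16", "::1/128")
]


def expand_ip_candidates(ip: IPAddress) -> List[IPAddress]:
    """Return the address plus any IPv4 address embedded in it."""
    candidates: List[IPAddress] = [ip]
    if isinstance(ip, ipaddress.IPv6Address):
        if ip.ipv4_mapped is not None:        # ::ffff:X.Y.Z.W
            candidates.append(ip.ipv4_mapped)
        elif int(ip) >> 32 == 0:              # ::X.Y.Z.W (high 96 bits zero)
            candidates.append(ipaddress.IPv4Address(int(ip) & 0xFFFFFFFF))
    return candidates


def is_blocked(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    # Membership tests across IP versions are simply False, so mixing is safe.
    return any(c in net for c in expand_ip_candidates(ip) for net in BLOCKED_NETWORKS)
```

Checking both forms keeps the existing IPv6 rules intact (plain `::1` still matches `::1/128`) while closing the mapped-address gap.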
Added 6 new TestIPv6MappedBypass tests covering:
- Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping
- IPv4-compatible variant (::127.0.0.1)
- Regression test that plain ::1 is still blocked
Also updated stale test assertion in test_eval_security_adversarial:
hasattr, type, __build_class__ were removed from hook builtins in
batch 2 but the test still expected hasattr to remain.
DO NOT PUSH until release day.
Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints
pass user URLs to crawler.arun() with no private IP validation.
- Add validate_url_destination() to utils.py with opt-out via
CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need
to crawl internal services.
- Integrate into validate_url_scheme() (covers all server.py endpoints).
- Add validation at all 4 URL entry points in api.py (handle_llm_qa,
handle_markdown_request, create_new_task, handle_crawl_request).
- raw: URLs bypass check (inline HTML, no network fetch).
- 16 adversarial + source coverage tests added.
- secsys_codex added to SECURITY-CREDITS.md.
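
A rough sketch of the check's shape, assuming the rules listed above. It shares a name with the helper added to utils.py but is not its implementation; the exact error type, the env-var parsing, and the use of `is_private` / `is_loopback` / `is_link_local` are assumptions.

```python
import ipaddress
import os
import socket
from urllib.parse import urlsplit


def allow_internal_urls() -> bool:
    # Opt-out for users who need to crawl internal services.
    return os.environ.get("CRAWL4AI_ALLOW_INTERNAL_URLS", "").strip().lower() == "true"


def is_internal_address(address: str) -> bool:
    ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop any IPv6 scope id
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped  # judge ::ffff:a.b.c.d by its IPv4 form
    return ip.is_private or ip.is_loopback or ip.is_link_local


def validate_url_destination(url: str) -> None:
    """Raise ValueError if the URL's host resolves to an internal address."""
    if url.startswith("raw:") or allow_internal_urls():
        return  # inline HTML needs no fetch; opt-out disables the check
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    for info in socket.getaddrinfo(host, None):
        if is_internal_address(info[4][0]):
            raise ValueError(f"URL resolves to an internal address: {url!r}")
```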
DO NOT PUSH until release day.
Add legal pages required for Google Workspace Marketplace listing
verification. Pages cover the whole Crawl4AI Cloud business (OSS
library, hosted API, dashboard, integrations, Workspace add-ons),
not specific to any single product.
- privacy.md: data collection, usage, retention, Workspace Limited Use
- terms.md: account, billing, acceptable use, IP, governing law (SG)
- support.md: email, docs, GitHub, Discord, security disclosure
ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.
Replaced with set(False), which works across context boundaries. This is
safe because deep_crawl_active is never nested: the guard on line 21
prevents re-entry.
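
The failure can be reproduced without Starlette. In this sketch a copied Context stands in for the other Task's context; the `demo` function and its helper names are illustrative.

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def demo():
    token = deep_crawl_active.set(True)   # token belongs to the current Context
    other = contextvars.copy_context()    # stands in for a different Task

    def cleanup_with_reset():
        deep_crawl_active.reset(token)    # token is from another Context

    def cleanup_with_set():
        deep_crawl_active.set(False)      # no token involved
        return deep_crawl_active.get()

    try:
        other.run(cleanup_with_reset)
        reset_failed = False
    except ValueError:
        reset_failed = True

    value_after_set = other.run(cleanup_with_set)
    deep_crawl_active.reset(token)        # tidy up in the original Context
    return reset_failed, value_after_set


print(demo())  # (True, False)
```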