Two issues caused silent COMPLETE ✗ with no diagnostic output:
1. When crawl_result.success=False (anti-bot detection, empty HTML, etc.),
the error_message was set on the CrawlResult but never logged — users
saw only [COMPLETE] ✗ with zero explanation. Fix: emit an [ERROR] log
containing error_message before the COMPLETE line whenever success=False.
2. The SCRAPE log in aprocess_html always emitted success=True regardless
of whether scraping produced any content. Fix: use bool(cleaned_html)
so SCRAPE reflects the actual outcome.
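A minimal sketch of the two fixes above. The helper names and the exact log line layout are assumptions made for illustration, not the library's actual logger calls:

```python
def scrape_status(cleaned_html):
    """SCRAPE success now mirrors whether any cleaned HTML was produced."""
    return bool(cleaned_html)


def completion_lines(url, success, error_message=None):
    """Build the log lines for one crawl; an [ERROR] line precedes a failed COMPLETE."""
    lines = []
    if not success:
        # Hypothetical format: surface error_message instead of dropping it.
        lines.append(f"[ERROR] {url} | {error_message or 'no error_message set'}")
    lines.append(f"[COMPLETE] {'✓' if success else '✗'} {url}")
    return lines


failed = completion_lines("https://example.com", False, "Blocked by anti-bot protection")
```

With this shape a failed crawl always carries its reason on the line directly above the COMPLETE marker.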
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LLMConfig.__init__ checks PROVIDER_MODELS_PREFIXES when api_token=None.
If the provider prefix isn't found, it silently falls through to the else
branch and overwrites self.provider with DEFAULT_PROVIDER (openai/gpt-4o),
meaning any bedrock/* model string was being replaced before the LLM call
was even made.
This broke supported Bedrock auth methods whenever api_token was not passed to LLMConfig.
Only passing api_token=<bearer_token> explicitly worked, because the
truthy api_token bypassed the prefix check entirely.
Adding "bedrock": None to PROVIDER_MODELS_PREFIXES keeps
self.provider intact so the correct Bedrock provider is used. The actual
auth (SigV4 signing or Bearer header) is handled downstream based on what credentials are
available in the environment.
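The fall-through described above can be sketched as follows. This is a simplified stand-in, not the real LLMConfig.__init__: the table contents, the resolve_provider helper, and the model string "bedrock/example-model" are assumptions for illustration:

```python
import os

DEFAULT_PROVIDER = "openai/gpt-4o"

# Assumed shape: provider prefix -> token picked up from the environment.
PROVIDER_MODELS_PREFIXES = {
    "openai": os.getenv("OPENAI_API_KEY"),
    "bedrock": None,  # the fix: a known prefix with no single token
}


def resolve_provider(provider, api_token=None):
    """Return (provider, api_token) following the branch order in the message."""
    if api_token:
        # A truthy token bypasses the prefix check entirely.
        return provider, api_token
    prefix = provider.split("/", 1)[0]
    if prefix in PROVIDER_MODELS_PREFIXES:
        # Known prefix: keep the provider, use whatever token the table holds.
        return provider, PROVIDER_MODELS_PREFIXES[prefix]
    # Unknown prefix: the else branch replaces the provider with the default.
    return DEFAULT_PROVIDER, os.getenv("OPENAI_API_KEY")
```

Without the "bedrock" entry, the same call would return DEFAULT_PROVIDER, which is the silent overwrite this change removes.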
Three improvements over the initial fix:
1. Add SVG text/tspan fallback for sequence, gantt, and git diagrams
which don't use foreignObject / .nodeLabel spans but instead
render labels via native SVG <text> elements.
2. Detect when a mermaid SVG is already wrapped in an outer <pre>
(e.g. deepwiki.com) and replace it with a plain <span> instead
of a new <pre> block, avoiding invalid nested markdown fences.
3. Add data-language attribute support to CustomHTML2Text so that
pre[data-language=mermaid] emits a proper ```mermaid fence in
the markdown output, instead of an unlabelled ``` block.
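A rough sketch of the text/tspan fallback from item 1, using the standard library parser on a hand-written SVG. The function names, the sample markup, and the layout of the emitted block are assumptions; the real implementation may differ:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def extract_svg_text_labels(svg_markup):
    """Collect label text from native SVG <text> elements, including nested <tspan>s."""
    root = ET.fromstring(svg_markup)
    labels = []
    for text_el in root.iter(SVG_NS + "text"):
        # itertext() walks the element and its tspans; split/join collapses whitespace.
        label = " ".join("".join(text_el.itertext()).split())
        if label:
            labels.append(label)
    return labels


def to_mermaid_fence(diagram_type, labels):
    """Wrap the diagram type and labels in a labelled mermaid fence (assumed layout)."""
    fence = "`" * 3
    return "\n".join([fence + "mermaid", diagram_type, *labels, fence])


sample_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" id="mermaid-1">'
    "<text><tspan>Alice</tspan></text>"
    "<text>Bob</text>"
    "<text><tspan>Hello </tspan><tspan>Bob</tspan></text>"
    "</svg>"
)
labels = extract_svg_text_labels(sample_svg)
block = to_mermaid_fence("sequenceDiagram", labels)
```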
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
remove_empty_elements_fast() was dropping trailing text attached to
elements via lxml .tail when removing empty elements. Now appends
the tail to the previous sibling or parent before removal.
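The tail handling can be illustrated with the standard library's ElementTree, which shares lxml's .text/.tail model. This is a sketch of the idea rather than the code in remove_empty_elements_fast(); the helper name is hypothetical:

```python
import xml.etree.ElementTree as ET


def remove_preserving_tail(parent, child):
    """Remove child from parent, moving child.tail onto a neighbour first."""
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)
    if tail:
        if index > 0:
            # Text after the removed element now follows the previous sibling.
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + tail
        else:
            # No previous sibling: the text joins the parent's leading text.
            parent.text = (parent.text or "") + tail
    parent.remove(child)


first = ET.fromstring("<p>Hello <span></span>world</p>")
remove_preserving_tail(first, first[0])

second = ET.fromstring("<p><b>a</b><i></i> tail</p>")
remove_preserving_tail(second, second[1])
```

A bare parent.remove(child) would have dropped "world" and " tail" in these two cases.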
When arun_many() created a MemoryAdaptiveDispatcher automatically,
max_session_permit was always 20 (the class default), silently ignoring
the user's semaphore_count setting in CrawlerRunConfig.
Now reads semaphore_count from the primary config and passes it as
max_session_permit. The max(1, ... or 5) guard handles zero/None/negative
values safely.
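The guard's behaviour on edge cases, as a small sketch (the wrapper function name is hypothetical; only the max(1, ... or 5) expression comes from the change):

```python
def resolve_max_session_permit(semaphore_count):
    """Map a user-supplied semaphore_count to a usable max_session_permit."""
    # `or 5` replaces None and 0 with 5; max(1, ...) lifts negatives to 1.
    return max(1, semaphore_count or 5)


permits = {value: resolve_max_session_permit(value) for value in (None, 0, -3, 1, 12)}
```

None and 0 fall back to 5, a negative value is clamped to 1, and any positive setting is passed through unchanged.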
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add legal pages required for Google Workspace Marketplace listing
verification. The pages cover the whole Crawl4AI Cloud business (OSS
library, hosted API, dashboard, integrations, Workspace add-ons)
rather than any single product.
- privacy.md: data collection, usage, retention, Workspace Limited Use
- terms.md: account, billing, acceptable use, IP, governing law (SG)
- support.md: email, docs, GitHub, Discord, security disclosure
ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.
Replaced with set(False) which works across context boundaries. Safe
because deep_crawl_active is never nested — the guard on line 21
prevents re-entry.
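The failure mode can be reproduced without Starlette by creating the token in one Context and resetting it from another. The sketch below uses contextvars.copy_context() to stand in for the second Task; the variable name matches the message, everything else is illustrative:

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)

# Set the flag inside a separate Context, as the task that started the crawl would.
producer_ctx = contextvars.copy_context()
token = producer_ctx.run(deep_crawl_active.set, True)

# Resetting with that token from a different Context raises ValueError.
try:
    deep_crawl_active.reset(token)
    reset_error = None
except ValueError as exc:
    reset_error = exc

# A plain set() has no such restriction: it writes to whichever Context is current.
deep_crawl_active.set(False)
```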
Remove broken re-import of load_nltk_punkt (already imported at module level).
Replace list(set(sens)) with plain return — set() destroyed document order
and silently dropped duplicate sentences.
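A small illustration of what list(set(sens)) did to the sentence list. The sample sentences and function names are made up for the example:

```python
def sentences_old(sens):
    # Previous behaviour: set() drops duplicates and gives no ordering guarantee.
    return list(set(sens))


def sentences_new(sens):
    # Current behaviour: return the sentences as produced, in document order.
    return sens


sens = ["B comes first.", "A comes second.", "B comes first."]
old_result = sentences_old(sens)
new_result = sentences_new(sens)
```

The old result has two items in arbitrary order; the new result keeps all three in their original positions.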
arun() always returns CrawlResultContainer, never AsyncGenerator. The
RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused
Pylance/Pyright to flag result.markdown as an error because AsyncGenerator
doesn't have that attribute.
Also adds test_type_annotations.py — 11 static analysis tests that catch
annotation mismatches (return types, missing annotations, export checks)
without needing pyright in CI. Would have caught this bug before it was
reported.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Block mode returns an internal index/tags/content format that is
rarely useful. Schema mode returns clean structured JSON, either
matching a provided schema or inferred from the instruction.
Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng).
- Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from
extraction_strategy.py. Dead security-sensitive code is a liability.
The eval path was already disabled; this removes the function itself.
- Fix hook_manager.py module injection: replace broken exec("import X", ns)
pattern (silently failed due to missing __import__) with direct module
injection. Sanitize asyncio to strip subprocess access (RCE vector).
- Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints
unauthenticated).
- Expand adversarial test suite to 87 tests: hook sandbox escapes,
asyncio.subprocess RCE verification, end-to-end exploit payload from
vuln report, dead code deletion checks, codebase eval/exec audit.
When the Docker API receives markdown_generator as JSON with "options"
instead of "params", from_serializable_dict silently passes the raw
dict through. This later crashes with a confusing "'dict' object has
no attribute 'generate_markdown'" deep in the crawl pipeline.
Add type validation for markdown_generator in CrawlerRunConfig.__init__
(matching existing extraction_strategy/chunking_strategy validation).
When a dict slips through, the error now clearly states:
- What type was expected vs received
- That "params" is the required key (not "options")
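A sketch of the kind of check this adds. The base class below is a local stand-in, and the error wording is an assumption; the actual message in CrawlerRunConfig.__init__ may read differently:

```python
class MarkdownGenerationStrategy:
    """Local stand-in for the strategy base class."""

    def generate_markdown(self, html):
        return html


def validate_markdown_generator(value):
    """Fail early, with a pointed message, when a raw dict slips through."""
    if value is None or isinstance(value, MarkdownGenerationStrategy):
        return value
    message = (
        "markdown_generator must be a MarkdownGenerationStrategy instance, "
        f"got {type(value).__name__}."
    )
    if isinstance(value, dict):
        # The common cause: the JSON used "options" where "params" is required.
        message += ' When sending JSON, put the arguments under "params", not "options".'
    raise ValueError(message)


try:
    validate_markdown_generator({"type": "DefaultMarkdownGenerator", "options": {}})
    validation_error = None
except ValueError as exc:
    validation_error = str(exc)
```

Raising at construction time replaces the late AttributeError with an error that names both the received type and the expected key.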
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
litellm 1.82.7-1.82.8 on PyPI were compromised with malicious code.
PyPI quarantined the entire package (all versions uninstallable).
Switched to unclecode-litellm==1.81.13, a pre-compromise fork published
under our own PyPI account. Drop-in replacement — all imports unchanged.
litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious
code. PyPI has quarantined the entire package, blocking all installs.
Temporarily pin to our own fork at a known-safe version.
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.
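The distinction can be sketched without importing Starlette. Route decides whether to wrap an endpoint by checking if it is a plain function or method, so the example below only shows that an instance with an async __call__ is not a function and can be driven as a raw ASGI app. The class name, the fake scope, and the response body are illustrative:

```python
import asyncio
import inspect


async def function_endpoint(scope, receive, send):
    """An async function: this is the form that gets wrapped as handler(request)."""


class RawASGIEndpoint:
    """A callable class: not a function, so it is treated as a raw ASGI app."""

    async def __call__(self, scope, receive, send):
        await send({
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/event-stream")],
        })
        await send({"type": "http.response.body", "body": b"data: hello\n\n"})


async def drive(app):
    """Call an ASGI app directly with stub receive/send callables."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/sse"}, receive, send)
    return sent


messages = asyncio.run(drive(RawASGIEndpoint()))
```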
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), so objects like logging.Logger are no longer
serialized as {"type": "Logger", "params": {...}}, a payload that then fails
deserialization. from_serializable_dict now returns None for unknown types
instead of raising ValueError, which handles payloads from older clients.
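A simplified sketch of the allowlist behaviour on both sides. The toy config class, the registry, and the *_sketch function names are assumptions; only the {"type": ..., "params": ...} shape and the return-None behaviour come from the change:

```python
import logging


class CrawlerRunConfig:
    """Toy stand-in with one plain field and one non-serializable field."""

    def __init__(self, verbose=False, logger=None):
        self.verbose = verbose
        self.logger = logger


_REGISTRY = {"CrawlerRunConfig": CrawlerRunConfig}
ALLOWED_DESERIALIZE_TYPES = set(_REGISTRY)


def to_serializable_sketch(obj):
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, (list, tuple)):
        return [to_serializable_sketch(item) for item in obj]
    if isinstance(obj, dict):
        return {key: to_serializable_sketch(value) for key, value in obj.items()}
    type_name = type(obj).__name__
    if type_name not in ALLOWED_DESERIALIZE_TYPES:
        return None  # e.g. logging.Logger: skipped instead of emitted
    params = {key: to_serializable_sketch(value) for key, value in vars(obj).items()}
    return {"type": type_name, "params": params}


def from_serializable_sketch(data):
    if isinstance(data, list):
        return [from_serializable_sketch(item) for item in data]
    if isinstance(data, dict) and "type" in data:
        cls = _REGISTRY.get(data["type"])
        if cls is None:
            return None  # unknown type: tolerated rather than raising ValueError
        params = data.get("params", {})
        return cls(**{key: from_serializable_sketch(value) for key, value in params.items()})
    return data


config = CrawlerRunConfig(verbose=True, logger=logging.getLogger("sketch"))
payload = to_serializable_sketch(config)
restored = from_serializable_sketch(payload)
```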
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>