Commit Graph

1501 Commits

Author SHA1 Message Date
Nasrin
532b105fc7 Merge pull request #1975 from nightcityblade/fix/issue-1973
Thanks @nightcityblade for the quick fix! Clean and straightforward CSS change.
2026-05-25 11:59:25 +02:00
nightcityblade
791aae3a21 fix: allow assistant toolbar to scroll
Fixes unclecode/crawl4ai#1973
2026-05-20 23:07:47 +08:00
Nasrin
dfb525edec Merge pull request #1951 from unclecode/fix/bedrock-provider-prefix
Add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works
2026-05-06 10:37:05 +02:00
Nasrin
47a4c256c9 Merge pull request #1952 from hafezparast/fix/maysam-silent-scrape-failure-1949
fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949)
2026-05-05 07:52:44 +02:00
Nasrin
a45c678ee4 Merge pull request #1939 from unclecode/fix/preserve-tail-text-1938
fix: preserve .tail text when removing empty elements (#1938)
2026-05-05 07:46:47 +02:00
hafezparast
5e5519b1c6 fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949)
Two issues caused silent COMPLETE ✗ with no diagnostic output:

1. When crawl_result.success=False (anti-bot detection, empty HTML, etc.),
   the error_message was set on the CrawlResult but never logged — users
   saw only [COMPLETE] ✗ with zero explanation. Fix: emit an [ERROR] log
   containing error_message before the COMPLETE line whenever success=False.

2. The SCRAPE log in aprocess_html always emitted success=True regardless
   of whether scraping produced any content. Fix: use bool(cleaned_html)
   so SCRAPE reflects the actual outcome.
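
A minimal sketch of the two fixes above, assuming a stand-in result object and the standard `logging` module rather than Crawl4AI's own logger. `FakeResult` and `log_outcome` are hypothetical names used only for illustration.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("crawl_sketch")


@dataclass
class FakeResult:
    """Hypothetical stand-in for a crawl result; not the real CrawlResult."""
    success: bool
    cleaned_html: str = ""
    error_message: str = ""


def log_outcome(result: FakeResult) -> list:
    """Return the log lines in the order they would be emitted."""
    lines = []
    # Fix 2: SCRAPE reflects whether scraping actually produced content.
    scrape_ok = bool(result.cleaned_html)
    lines.append(f"[SCRAPE] {'✓' if scrape_ok else '✗'}")
    # Fix 1: surface the failure reason before the COMPLETE line.
    if not result.success:
        lines.append(f"[ERROR] {result.error_message}")
    lines.append(f"[COMPLETE] {'✓' if result.success else '✗'}")
    for line in lines:
        logger.info(line)
    return lines
```

With a failed result, the `[ERROR]` line appears between `[SCRAPE]` and `[COMPLETE]`, so the reason is visible before the final status.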

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:18:03 +08:00
Soham Kukreti
660f49c879 fix: add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works
LLMConfig.__init__ checks PROVIDER_MODELS_PREFIXES when api_token=None.
  If the provider prefix isn't found, it silently falls through to the else
  branch and overwrites self.provider with DEFAULT_PROVIDER (openai/gpt-4o),
  meaning any bedrock/* model string was being replaced before the LLM call
  was even made.

  This broke the supported Bedrock auth methods whenever api_token was not passed to LLMConfig.

  Only passing api_token=<bearer_token> explicitly worked, because the
  truthy api_token bypassed the prefix check entirely.

  Adding "bedrock": None to PROVIDER_MODELS_PREFIXES keeps
  self.provider intact so the correct Bedrock provider is used. The actual
  auth (SigV4 signing or Bearer header) is handled downstream based on what credentials are
  available in the environment.
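
The fall-through described above can be sketched roughly as follows. This is not the real `LLMConfig.__init__`: `resolve_provider` is a hypothetical helper, and the table is a shortened, illustrative version of `PROVIDER_MODELS_PREFIXES`.

```python
DEFAULT_PROVIDER = "openai/gpt-4o"  # default named in the commit message

# Hypothetical, shortened table; the values here are placeholders.
PROVIDER_MODELS_PREFIXES = {
    "openai": "token-from-env",
    "bedrock": None,  # the fix: a known prefix with no single token
}


def resolve_provider(provider, api_token=None, prefixes=PROVIDER_MODELS_PREFIXES):
    """Return the provider string that survives the prefix check."""
    if api_token:
        # A truthy token bypasses the prefix check entirely.
        return provider
    for prefix in prefixes:
        if provider.startswith(prefix):
            return provider
    # Unknown prefix: the provider is replaced by the default.
    return DEFAULT_PROVIDER
```

Without the `"bedrock"` key, a `bedrock/*` string with no token would come back as `openai/gpt-4o`; with it, the string is left intact.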
2026-04-30 22:13:54 +05:30
Nasrin
388ce3f033 Merge pull request #1940 from hafezparast/fix/mermaid-sequence-fence-1043
fix: broaden mermaid SVG text extraction and prevent nested fences (#1043)
2026-04-30 13:34:00 +02:00
hafezparast
dba38c7886 fix: broaden mermaid SVG text extraction and prevent nested fences (#1043)
Three improvements over the initial fix:

1. Add SVG text/tspan fallback for sequence, gantt, and git diagrams
   which don't use foreignObject / .nodeLabel spans but instead
   render labels via native SVG <text> elements.

2. Detect when a mermaid SVG is already wrapped in an outer <pre>
   (e.g. deepwiki.com) and replace it with a plain <span> instead
   of a new <pre> block, avoiding invalid nested markdown fences.

3. Add data-language attribute support to CustomHTML2Text so that
   pre[data-language=mermaid] emits a proper ```mermaid fence in
   the markdown output, instead of an unlabelled ``` block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 01:36:52 +08:00
Nasrin
4d139247b9 Merge pull request #1934 from hafezparast/fix/maysam-dispatcher-semaphore-count-1927
fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927)
2026-04-24 18:53:56 +02:00
ntohidi
e8f1af7c16 fix: preserve .tail text when removing empty elements (#1938)
remove_empty_elements_fast() was dropping trailing text attached to
elements via lxml .tail when removing empty elements. Now appends
the tail to the previous sibling or parent before removal.
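
A small sketch of the tail-preserving removal, written against the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra dependencies. `remove_empty_keep_tail` is a hypothetical helper that only looks at direct children; it is not the project's `remove_empty_elements_fast()`.

```python
import xml.etree.ElementTree as ET


def remove_empty_keep_tail(parent, tag):
    """Remove empty <tag> children of parent without losing their .tail text."""
    previous = None
    for child in list(parent):
        is_empty = (
            child.tag == tag
            and len(child) == 0
            and not (child.text or "").strip()
        )
        if not is_empty:
            previous = child
            continue
        tail = child.tail or ""
        if previous is not None:
            # Append the tail to the previous sibling's tail.
            previous.tail = (previous.tail or "") + tail
        else:
            # No previous sibling: the tail joins the parent's text.
            parent.text = (parent.text or "") + tail
        parent.remove(child)


root = ET.fromstring("<p>Hello <b>bold</b><span></span> world</p>")
remove_empty_keep_tail(root, "span")
result = ET.tostring(root, encoding="unicode")
```

Removing the `<span>` without the tail handling would drop " world", because the trailing text belongs to the removed element.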
2026-04-24 18:51:24 +02:00
ntohidi
04985ea15e docs: update arun() docstring to match CrawlResultContainer return type
2026-04-24 18:35:57 +02:00
Nasrin
35ee366e28 Merge pull request #1901 from hafezparast/fix/maysam-arun-type-hint-1898
fix: correct arun() return type annotation (#1898)
2026-04-24 18:33:41 +02:00
Nasrin
244fbf7b58 Merge pull request #1929 from atomic-carpenter/listen-on-all-addressess
docker: listen on all addresses
2026-04-24 17:59:48 +02:00
hafezparast
4e72f31011 fix: use semaphore_count default of 10 to match CrawlerRunConfig default
2026-04-24 23:57:17 +08:00
Nasrin
d595679d25 Merge pull request #1925 from sevenmoonlightsteps/fix/docker-llm-table-extraction-allowlist
fix: add LLMTableExtraction to Docker API deserialization allowlist
2026-04-24 17:11:56 +02:00
Nasrin
936e4470eb Merge pull request #1845 from hafezparast/fix/maysam-mermaid-svg-text-1043
fix: preserve mermaid diagram text from SVGs during scraping (#1043)
2026-04-24 16:49:59 +02:00
Nasrin
5e56e34840 Merge pull request #1922 from unclecode/fix/deep-crawl-streaming-contextvar-1917
fix(deep-crawl): ContextVar crash in streaming deep crawl (#1917)
2026-04-24 16:36:08 +02:00
hafezparast
d3c92ee3df fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927)
When arun_many() created a MemoryAdaptiveDispatcher automatically,
max_session_permit was always 20 (the class default), silently ignoring
the user's semaphore_count setting in CrawlerRunConfig.

Now reads semaphore_count from the primary config and passes it as
max_session_permit. The max(1, ... or 5) guard handles zero/None/negative
values safely.
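
The guard quoted above behaves as in this sketch. `resolve_max_session_permit` is a hypothetical name; only the `max(1, ... or 5)` expression comes from the commit message.

```python
def resolve_max_session_permit(semaphore_count):
    """Apply the guard from the commit message: max(1, ... or 5)."""
    # `or 5` replaces falsy values (None, 0); max() lifts negatives to 1.
    return max(1, semaphore_count or 5)
```

Falsy inputs fall back to 5, negative inputs are clamped to 1, and any positive value passes through unchanged.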

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:59:52 +08:00
Atomic Carpenter
e06b19ca09 docker: listen on all addresses
Fixes: https://github.com/unclecode/crawl4ai/issues/1928
2026-04-22 00:47:08 +02:00
unclecode
0e92b5e239 docs: add Privacy Policy, Terms of Service, and Support pages
Add legal pages required for Google Workspace Marketplace listing
verification. Pages cover the whole Crawl4AI Cloud business (OSS
library, hosted API, dashboard, integrations, Workspace add-ons),
not specific to any single product.

- privacy.md: data collection, usage, retention, Workspace Limited Use
- terms.md: account, billing, acceptable use, IP, governing law (SG)
- support.md: email, docs, GitHub, Discord, security disclosure
2026-04-20 02:24:21 +00:00
Gab
c8c2dc319f fix: add LLMTableExtraction to Docker API deserialization allowlist
2026-04-17 15:43:56 -04:00
Nasrin
4e86399bfa Merge pull request #1913 from unclecode/fix/nlp-sentence-chunking-1909
fix(chunking): preserve sentence order in NlpSentenceChunking
2026-04-16 10:24:59 +02:00
ntohidi
3d4bda122a fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917)
ContextVar.reset(token) requires the same Context that created the token.
When Starlette's StreamingResponse consumes the async generator in a
different Task, the Context changes and reset() raises ValueError.

Replaced with set(False) which works across context boundaries. Safe
because deep_crawl_active is never nested — the guard on line 21
prevents re-entry.
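
The failure mode can be reproduced with the standard library alone. In this sketch, two copied contexts stand in for the two Tasks; only the variable name `deep_crawl_active` is taken from the commit message.

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)

# Set the flag inside one Context and keep the token.
ctx_a = contextvars.copy_context()
token = ctx_a.run(deep_crawl_active.set, True)

# Try to reset from a different Context, as happens when another Task
# finishes consuming the generator.
ctx_b = contextvars.copy_context()
try:
    ctx_b.run(deep_crawl_active.reset, token)
    reset_failed = False
except ValueError:
    reset_failed = True

# set(False) does not depend on the Context that created the token.
ctx_b.run(deep_crawl_active.set, False)
value_in_b = ctx_b.run(deep_crawl_active.get)
```

`reset()` fails because the token remembers the Context it was created in, while `set(False)` succeeds in whichever Context it runs in.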
2026-04-16 13:49:32 +08:00
ntohidi
7bfc547bce fix: preserve rowspan/colspan in cleaned_html (#1920)
Add rowspan and colspan to IMPORTANT_ATTRS so they survive
attribute stripping in remove_unwanted_attributes_fast().
2026-04-16 12:42:36 +08:00
ntohidi
c837c0d9cb fix(chunking): preserve sentence order in NlpSentenceChunking (#1909)
Remove broken re-import of load_nltk_punkt (already imported at module level).
Replace list(set(sens)) with plain return — set() destroyed document order
and silently dropped duplicate sentences.
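
A short illustration of why the `set()` round trip was a problem, using made-up sentences rather than real tokenizer output.

```python
sentences = ["First.", "Second.", "First.", "Third."]

# Old behaviour: set() drops the repeated sentence and guarantees no order.
deduplicated = list(set(sentences))

# New behaviour: return the sentences as the tokenizer produced them.
preserved = list(sentences)
```

`deduplicated` has three items in an unspecified order, while `preserved` keeps all four in document order.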
2026-04-11 17:27:18 +08:00
hafezparast
c5612f7551 fix: correct arun() return type from RunManyReturn to CrawlResultContainer (#1898)
arun() always returns CrawlResultContainer, never AsyncGenerator. The
RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused
Pylance/Pyright to flag result.markdown as an error because AsyncGenerator
doesn't have that attribute.

Also adds test_type_annotations.py — 11 static analysis tests that catch
annotation mismatches (return types, missing annotations, export checks)
without needing pyright in CI. Would have caught this bug before it was
reported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 21:35:17 +08:00
Nasrin
3d02d75edb Merge pull request #1852 from hafezparast/feat/maysam-arun-many-config-list-1837
feat: expose arun_many config-list support in Docker API (#1837)
2026-04-06 10:26:44 +02:00
unclecode
ec560f13d2 fix: default LLMExtractionStrategy extraction_type to schema
Block mode returns an internal index/tags/content format that is
rarely useful. Schema mode returns clean structured JSON, either
matching a provided schema or inferred from the instruction.
2026-04-04 09:26:35 +00:00
unclecode
e326da9166 fix(security): complete AST sandbox escape remediation (CVSS 9.8)
Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng).

- Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from
  extraction_strategy.py. Dead security-sensitive code is a liability.
  The eval path was already disabled; this removes the function itself.
- Fix hook_manager.py module injection: replace broken exec("import X", ns)
  pattern (silently failed due to missing __import__) with direct module
  injection. Sanitize asyncio to strip subprocess access (RCE vector).
- Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints
  unauthenticated).
- Expand adversarial test suite to 87 tests: hook sandbox escapes,
  asyncio.subprocess RCE verification, end-to-end exploit payload from
  vuln report, dead code deletion checks, codebase eval/exec audit.
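
The difference between the broken `exec("import X", ns)` pattern and direct module injection can be shown in a few lines. This sketch assumes the sandbox namespace was built with a builtins mapping that lacks `__import__`; it is not the actual hook_manager.py code, and it does not show the asyncio sanitizing step.

```python
import math

# A restricted namespace with no __import__ in its builtins
# (an assumption about how the sandbox namespace was built).
namespace = {"__builtins__": {}}

try:
    exec("import math", namespace)
    import_worked = True
except ImportError:
    # The import statement needs __import__, which is missing here.
    import_worked = False

# Direct injection: hand the already-imported module object to the namespace.
namespace["math"] = math
exec("result = math.sqrt(16)", namespace)
```

Injecting the module object avoids the import machinery altogether, so the sandboxed code can use `math` without `__import__` ever being exposed.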
2026-03-31 13:01:57 +00:00
unclecode
2fc39cbe89 fix(security): remove eval() from computed fields, harden config deserializer
- Disable eval() in _compute_field expression path (RCE vector via untrusted input).
  Expression key now logs warning and returns default; function key still works.
- Harden _safe_eval_config in server.py with name/attribute allowlists,
  block lambdas, generators, comprehensions in constructor args.
- Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors).
- Add 67 adversarial security tests covering all eval/exec attack surfaces.

Closes #1886, closes #1855
2026-03-31 12:02:43 +00:00
UncleCode
1debe5f5fc Merge pull request #1885 from unclecode/develop
docs: update version references to 0.8.6
2026-03-30 09:59:58 +07:00
ntohidi
bcbccbea2f docs: update version references to 0.8.6 in README and Docker docs
2026-03-30 10:57:13 +08:00
Nasrin
7e7533ec7c Merge pull request #1882 from hafezparast/fix/crawler-config-dict-validation-1880
fix: validate markdown_generator type to catch bad JSON format (#1880)
2026-03-30 04:50:32 +02:00
hafezparast
e9f832274e fix: validate markdown_generator type in CrawlerRunConfig to catch bad JSON format (#1880)
When the Docker API receives markdown_generator as JSON with "options"
instead of "params", from_serializable_dict silently passes the raw
dict through. This later crashes with a confusing "'dict' object has
no attribute 'generate_markdown'" deep in the crawl pipeline.

Add type validation for markdown_generator in CrawlerRunConfig.__init__
(matching existing extraction_strategy/chunking_strategy validation).
When a dict slips through, the error now clearly states:
- What type was expected vs received
- That "params" is the required key (not "options")
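
A rough sketch of the kind of check described above. `MarkdownGeneratorBase` and `validate_markdown_generator` are hypothetical stand-ins, and the error text is illustrative rather than the message the library actually raises.

```python
class MarkdownGeneratorBase:
    """Hypothetical stand-in for the real markdown generator base class."""

    def generate_markdown(self, html):
        return html


def validate_markdown_generator(value):
    """Accept None or a generator instance; reject everything else early."""
    if value is None or isinstance(value, MarkdownGeneratorBase):
        return value
    hint = ""
    if isinstance(value, dict):
        # A raw dict usually means the JSON used the wrong key.
        hint = ' Serialized objects must use the "params" key, not "options".'
    raise TypeError(
        "markdown_generator must be a MarkdownGeneratorBase instance, "
        f"got {type(value).__name__}.{hint}"
    )
```

Failing at construction time replaces the later attribute error with a message that names both the expected type and the required key.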

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 07:39:28 +08:00
unclecode
af648e104f fix: bump Dockerfile version to 0.8.6
docker-rebuild-v0.8.6
2026-03-24 15:19:18 +00:00
unclecode
4e4a996878 fix: replace litellm with unclecode-litellm due to PyPI supply chain compromise
litellm 1.82.7-1.82.8 on PyPI were compromised with malicious code.
PyPI quarantined the entire package (all versions uninstallable).
Switched to unclecode-litellm==1.81.13, a pre-compromise fork published
under our own PyPI account. Drop-in replacement — all imports unchanged.
v0.8.6
2026-03-24 14:49:36 +00:00
unclecode
f4bda05178 release: bump version to 0.8.6
Pin litellm to safe fork due to PyPI supply chain compromise
(versions 1.82.7-1.82.8 compromised, entire package quarantined).
2026-03-24 14:13:41 +00:00
unclecode
01c685cd3a fix: pin litellm to safe fork (v1.81.13) due to PyPI supply chain compromise
litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious
code. PyPI has quarantined the entire package, blocking all installs.
Temporarily pin to our own fork at a known-safe version.
2026-03-24 14:03:26 +00:00
Nasrin
1a40ccf093 Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842
fix: improve browser None guard in create_browser_context (#1842)
2026-03-24 11:37:46 +01:00
Nasrin
6eb2530bd9 Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848
fix: skip non-allowlisted types in serialization/deserialization (#1848)
2026-03-24 11:36:03 +01:00
Nasrin
fb24ee592e Merge pull request #1851 from hafezparast/fix/maysam-mcp-sse-asgi-1850
fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
2026-03-24 11:17:35 +01:00
ntohidi
3846b738cf Merge branch 'develop' of https://github.com/unclecode/crawl4ai into main
2026-03-24 18:10:40 +08:00
UncleCode
1a597cb97f Merge pull request #1836 from unclecode/release/v0.8.5
Release v0.8.5
2026-03-24 11:06:58 +01:00
hafezparast
8995c1bbd6 feat: expose arun_many config-list support in Docker API (#1837)
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:56:53 +08:00
hafezparast
219416e49d fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)
Starlette's Route wraps async functions in request_response(), calling
handler(request) instead of handler(scope, receive, send). This broke
the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable
class instead of an async function — Route passes class instances
through as raw ASGI apps without wrapping.
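
The shape of a callable class with the raw ASGI signature is sketched below. It deliberately does not import Starlette: `RawASGIEndpoint` and the `drive` helper are hypothetical, and the statement that Route passes class instances through unwrapped comes from the commit message, not from this example.

```python
import asyncio


class RawASGIEndpoint:
    """Callable class with the raw ASGI signature (scope, receive, send)."""

    async def __call__(self, scope, receive, send):
        await send({
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/event-stream")],
        })
        await send({"type": "http.response.body", "body": b"data: hello\n\n"})


async def drive(app):
    """Tiny hypothetical driver that calls the app the way a server would."""
    sent = []

    async def receive():
        return {"type": "http.disconnect"}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/sse"}, receive, send)
    return sent


messages = asyncio.run(drive(RawASGIEndpoint()))
```

Because the instance itself is the ASGI application, it receives `scope`, `receive`, and `send` directly instead of a request object.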

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:55:41 +08:00
hafezparast
e603e4a722 fix: skip non-allowlisted types in serialization/deserialization (#1848)
to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES
(returns None), preventing objects like logging.Logger from being
serialized as {"type": "Logger", "params": {...}} which then fails
deserialization. from_serializable_dict returns None for unknown types
instead of raising ValueError, handling payloads from older clients.
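
The allowlist behaviour on both sides might look like the sketch below. `Point`, `to_serializable`, and `from_serializable` are hypothetical, and the allowlist is a one-entry placeholder for the real `ALLOWED_DESERIALIZE_TYPES`.

```python
import logging

# Hypothetical allowlist; the real ALLOWED_DESERIALIZE_TYPES is larger.
ALLOWED_DESERIALIZE_TYPES = {"Point"}


class Point:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y


def to_serializable(obj):
    """Serialize allowlisted objects; return None for everything else."""
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    name = type(obj).__name__
    if name not in ALLOWED_DESERIALIZE_TYPES:
        return None  # e.g. a logging.Logger is skipped, not serialized
    return {"type": name, "params": dict(vars(obj))}


def from_serializable(data):
    """Rebuild allowlisted objects; return None for unknown types."""
    if not isinstance(data, dict) or "type" not in data:
        return data
    if data["type"] not in ALLOWED_DESERIALIZE_TYPES:
        return None  # tolerate payloads from older clients
    return Point(**data["params"])
```

Keeping both directions on the same allowlist means nothing is emitted that the receiving side would later refuse.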

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 08:17:02 +08:00
hafezparast
2fd0f4c6a7 fix: preserve mermaid diagram text from SVGs during scraping (#1043)
Mermaid diagrams rendered as SVGs were completely stripped during HTML
cleaning, losing all text content. Now detects SVGs with id="mermaid-*",
extracts node/edge labels, and replaces the SVG with a fenced mermaid
code block containing the diagram type and extracted text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 11:43:54 +08:00
hafezparast
310b52b663 fix: improve browser None guard in create_browser_context (#1842)
The existing guard assumed self.browser=None only meant persistent context mode.
In reality, the browser can be None because it was closed by the janitor, crashed,
or never started. This caused a misleading error message. Now the guard distinguishes
between persistent context and closed/crashed browser with appropriate messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 10:45:38 +08:00
ntohidi
37da8b8f97 fix: pin redis-tools version to match redis-server in Dockerfile
docker-rebuild-v0.8.5
2026-03-21 14:26:23 +08:00