crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-11 00:08:01 +00:00

Author	SHA1	Message	Date
Nasrin	532b105fc7	Merge pull request #1975 from nightcityblade/fix/issue-1973 Thanks @nightcityblade for the quick fix! Clean and straightforward CSS change.	2026-05-25 11:59:25 +02:00
nightcityblade	791aae3a21	fix: allow assistant toolbar to scroll Fixes unclecode/crawl4ai#1973	2026-05-20 23:07:47 +08:00
Nasrin	dfb525edec	Merge pull request #1951 from unclecode/fix/bedrock-provider-prefix Add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works	2026-05-06 10:37:05 +02:00
Nasrin	47a4c256c9	Merge pull request #1952 from hafezparast/fix/maysam-silent-scrape-failure-1949 fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949)	2026-05-05 07:52:44 +02:00
Nasrin	a45c678ee4	Merge pull request #1939 from unclecode/fix/preserve-tail-text-1938 fix: preserve .tail text when removing empty elements (#1938)	2026-05-05 07:46:47 +02:00
hafezparast	5e5519b1c6	fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949) Two issues caused silent COMPLETE ✗ with no diagnostic output: 1. When crawl_result.success=False (anti-bot detection, empty HTML, etc.), the error_message was set on the CrawlResult but never logged; users saw only [COMPLETE] ✗ with zero explanation. Fix: emit an [ERROR] log containing error_message before the COMPLETE line whenever success=False. 2. The SCRAPE log in aprocess_html always emitted success=True regardless of whether scraping produced any content. Fix: use bool(cleaned_html) so SCRAPE reflects the actual outcome. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 16:18:03 +08:00
Soham Kukreti	660f49c879	fix: add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works LLMConfig.__init__ checks PROVIDER_MODELS_PREFIXES when api_token=None. If the provider prefix isn't found, it silently falls through to the else branch and overwrites self.provider with DEFAULT_PROVIDER (openai/gpt-4o), meaning any bedrock/* model string was being replaced before the LLM call was even made. This broke supported Bedrock auth methods when api_token is not passed in the LLMConfig. Only passing api_token=<bearer_token> explicitly worked, because the truthy api_token bypassed the prefix check entirely. Adding "bedrock": None to PROVIDER_MODELS_PREFIXES keeps self.provider intact so the correct Bedrock provider is used. The actual auth (SigV4 signing or Bearer header) is handled downstream based on what credentials are available in the environment.	2026-04-30 22:13:54 +05:30
Nasrin	388ce3f033	Merge pull request #1940 from hafezparast/fix/mermaid-sequence-fence-1043 fix: broaden mermaid SVG text extraction and prevent nested fences (#1043)	2026-04-30 13:34:00 +02:00
hafezparast	dba38c7886	fix: broaden mermaid SVG text extraction and prevent nested fences (#1043) Three improvements over the initial fix: 1. Add SVG text/tspan fallback for sequence, gantt, and git diagrams which don't use foreignObject / .nodeLabel spans but instead render labels via native SVG <text> elements. 2. Detect when a mermaid SVG is already wrapped in an outer <pre> (e.g. deepwiki.com) and replace it with a plain <span> instead of a new <pre> block, avoiding invalid nested markdown fences. 3. Add data-language attribute support to CustomHTML2Text so that pre[data-language=mermaid] emits a proper ```mermaid fence in the markdown output, instead of an unlabelled ``` block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 01:36:52 +08:00
Nasrin	4d139247b9	Merge pull request #1934 from hafezparast/fix/maysam-dispatcher-semaphore-count-1927 fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927)	2026-04-24 18:53:56 +02:00
ntohidi	e8f1af7c16	fix: preserve .tail text when removing empty elements (#1938) remove_empty_elements_fast() was dropping trailing text attached to elements via lxml .tail when removing empty elements. Now appends the tail to the previous sibling or parent before removal.	2026-04-24 18:51:24 +02:00
ntohidi	04985ea15e	docs: update arun() docstring to match CrawlResultContainer return type	2026-04-24 18:35:57 +02:00
Nasrin	35ee366e28	Merge pull request #1901 from hafezparast/fix/maysam-arun-type-hint-1898 fix: correct arun() return type annotation (#1898)	2026-04-24 18:33:41 +02:00
Nasrin	244fbf7b58	Merge pull request #1929 from atomic-carpenter/listen-on-all-addressess docker: listen on all addresses	2026-04-24 17:59:48 +02:00
hafezparast	4e72f31011	fix: use semaphore_count default of 10 to match CrawlerRunConfig default	2026-04-24 23:57:17 +08:00
Nasrin	d595679d25	Merge pull request #1925 from sevenmoonlightsteps/fix/docker-llm-table-extraction-allowlist fix: add LLMTableExtraction to Docker API deserialization allowlist	2026-04-24 17:11:56 +02:00
Nasrin	936e4470eb	Merge pull request #1845 from hafezparast/fix/maysam-mermaid-svg-text-1043 fix: preserve mermaid diagram text from SVGs during scraping (#1043)	2026-04-24 16:49:59 +02:00
Nasrin	5e56e34840	Merge pull request #1922 from unclecode/fix/deep-crawl-streaming-contextvar-1917 fix(deep-crawl): ContextVar crash in streaming deep crawl (#1917)	2026-04-24 16:36:08 +02:00
hafezparast	d3c92ee3df	fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927) When arun_many() created a MemoryAdaptiveDispatcher automatically, max_session_permit was always 20 (the class default), silently ignoring the user's semaphore_count setting in CrawlerRunConfig. Now reads semaphore_count from the primary config and passes it as max_session_permit. The max(1, ... or 5) guard handles zero/None/negative values safely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 10:59:52 +08:00
Atomic Carpenter	e06b19ca09	docker: listen on all addresses Fixes: https://github.com/unclecode/crawl4ai/issues/1928	2026-04-22 00:47:08 +02:00
unclecode	0e92b5e239	docs: add Privacy Policy, Terms of Service, and Support pages Add legal pages required for Google Workspace Marketplace listing verification. Pages cover the whole Crawl4AI Cloud business (OSS library, hosted API, dashboard, integrations, Workspace add-ons), not specific to any single product. - privacy.md: data collection, usage, retention, Workspace Limited Use - terms.md: account, billing, acceptable use, IP, governing law (SG) - support.md: email, docs, GitHub, Discord, security disclosure	2026-04-20 02:24:21 +00:00
Gab	c8c2dc319f	fix: add LLMTableExtraction to Docker API deserialization allowlist	2026-04-17 15:43:56 -04:00
Nasrin	4e86399bfa	Merge pull request #1913 from unclecode/fix/nlp-sentence-chunking-1909 fix(chunking): preserve sentence order in NlpSentenceChunking	2026-04-16 10:24:59 +02:00
ntohidi	3d4bda122a	fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917) ContextVar.reset(token) requires the same Context that created the token. When Starlette's StreamingResponse consumes the async generator in a different Task, the Context changes and reset() raises ValueError. Replaced with set(False) which works across context boundaries. Safe because deep_crawl_active is never nested: the guard on line 21 prevents re-entry.	2026-04-16 13:49:32 +08:00
ntohidi	7bfc547bce	fix: preserve rowspan/colspan in cleaned_html (#1920) Add rowspan and colspan to IMPORTANT_ATTRS so they survive attribute stripping in remove_unwanted_attributes_fast().	2026-04-16 12:42:36 +08:00
ntohidi	c837c0d9cb	fix(chunking): preserve sentence order in NlpSentenceChunking (#1909) Remove broken re-import of load_nltk_punkt (already imported at module level). Replace list(set(sens)) with plain return, because set() destroyed document order and silently dropped duplicate sentences.	2026-04-11 17:27:18 +08:00
hafezparast	c5612f7551	fix: correct arun() return type from RunManyReturn to CrawlResultContainer (#1898) arun() always returns CrawlResultContainer, never AsyncGenerator. The RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused Pylance/Pyright to flag result.markdown as an error because AsyncGenerator doesn't have that attribute. Also adds test_type_annotations.py, a set of 11 static analysis tests that catch annotation mismatches (return types, missing annotations, export checks) without needing pyright in CI. Would have caught this bug before it was reported. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 21:35:17 +08:00
Nasrin	3d02d75edb	Merge pull request #1852 from hafezparast/feat/maysam-arun-many-config-list-1837 feat: expose arun_many config-list support in Docker API (#1837)	2026-04-06 10:26:44 +02:00
unclecode	ec560f13d2	fix: default LLMExtractionStrategy extraction_type to schema Block mode returns an internal index/tags/content format that is rarely useful. Schema mode returns clean structured JSON, either matching a provided schema or inferred from the instruction.	2026-04-04 09:26:35 +00:00
unclecode	e326da9166	fix(security): complete AST sandbox escape remediation (CVSS 9.8) Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng). - Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from extraction_strategy.py. Dead security-sensitive code is a liability. The eval path was already disabled; this removes the function itself. - Fix hook_manager.py module injection: replace broken exec("import X", ns) pattern (silently failed due to missing __import__) with direct module injection. Sanitize asyncio to strip subprocess access (RCE vector). - Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints unauthenticated). - Expand adversarial test suite to 87 tests: hook sandbox escapes, asyncio.subprocess RCE verification, end-to-end exploit payload from vuln report, dead code deletion checks, codebase eval/exec audit.	2026-03-31 13:01:57 +00:00
unclecode	2fc39cbe89	fix(security): remove eval() from computed fields, harden config deserializer - Disable eval() in _compute_field expression path (RCE vector via untrusted input). Expression key now logs warning and returns default; function key still works. - Harden _safe_eval_config in server.py with name/attribute allowlists, block lambdas, generators, comprehensions in constructor args. - Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors). - Add 67 adversarial security tests covering all eval/exec attack surfaces. Closes #1886, closes #1855	2026-03-31 12:02:43 +00:00
UncleCode	1debe5f5fc	Merge pull request #1885 from unclecode/develop docs: update version references to 0.8.6	2026-03-30 09:59:58 +07:00
ntohidi	bcbccbea2f	docs: update version references to 0.8.6 in README and Docker docs	2026-03-30 10:57:13 +08:00
Nasrin	7e7533ec7c	Merge pull request #1882 from hafezparast/fix/crawler-config-dict-validation-1880 fix: validate markdown_generator type to catch bad JSON format (#1880)	2026-03-30 04:50:32 +02:00
hafezparast	e9f832274e	fix: validate markdown_generator type in CrawlerRunConfig to catch bad JSON format (#1880) When the Docker API receives markdown_generator as JSON with "options" instead of "params", from_serializable_dict silently passes the raw dict through. This later crashes with a confusing "'dict' object has no attribute 'generate_markdown'" deep in the crawl pipeline. Add type validation for markdown_generator in CrawlerRunConfig.__init__ (matching existing extraction_strategy/chunking_strategy validation). When a dict slips through, the error now clearly states: - What type was expected vs received - That "params" is the required key (not "options") Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 07:39:28 +08:00
unclecode	af648e104f	fix: bump Dockerfile version to 0.8.6 docker-rebuild-v0.8.6	2026-03-24 15:19:18 +00:00
unclecode	4e4a996878	fix: replace litellm with unclecode-litellm due to PyPI supply chain compromise litellm 1.82.7-1.82.8 on PyPI were compromised with malicious code. PyPI quarantined the entire package (all versions uninstallable). Switched to unclecode-litellm==1.81.13, a pre-compromise fork published under our own PyPI account. Drop-in replacement — all imports unchanged. v0.8.6	2026-03-24 14:49:36 +00:00
unclecode	f4bda05178	release: bump version to 0.8.6 Pin litellm to safe fork due to PyPI supply chain compromise (versions 1.82.7-1.82.8 compromised, entire package quarantined).	2026-03-24 14:13:41 +00:00
unclecode	01c685cd3a	fix: pin litellm to safe fork (v1.81.13) due to PyPI supply chain compromise litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious code. PyPI has quarantined the entire package, blocking all installs. Temporarily pin to our own fork at a known-safe version.	2026-03-24 14:03:26 +00:00
Nasrin	1a40ccf093	Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842 fix: improve browser None guard in create_browser_context (#1842)	2026-03-24 11:37:46 +01:00
Nasrin	6eb2530bd9	Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848 fix: skip non-allowlisted types in serialization/deserialization (#1848)	2026-03-24 11:36:03 +01:00
Nasrin	fb24ee592e	Merge pull request #1851 from hafezparast/fix/maysam-mcp-sse-asgi-1850 fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)	2026-03-24 11:17:35 +01:00
ntohidi	3846b738cf	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into main	2026-03-24 18:10:40 +08:00
UncleCode	1a597cb97f	Merge pull request #1836 from unclecode/release/v0.8.5 Release v0.8.5	2026-03-24 11:06:58 +01:00
hafezparast	8995c1bbd6	feat: expose arun_many config-list support in Docker API (#1837) The /crawl endpoint now accepts an optional crawler_configs field (list of CrawlerRunConfig dicts) alongside the existing crawler_config. When provided with multiple URLs, each config is deserialized and passed as a list to arun_many(), enabling per-URL configuration with url_matcher patterns. Single-URL requests and requests without crawler_configs are unchanged (backward compatible). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:56:53 +08:00
hafezparast	219416e49d	fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850) Starlette's Route wraps async functions in request_response(), calling handler(request) instead of handler(scope, receive, send). This broke the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable class instead of an async function, since Route passes class instances through as raw ASGI apps without wrapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:55:41 +08:00
hafezparast	e603e4a722	fix: skip non-allowlisted types in serialization/deserialization (#1848) to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES (returns None), preventing objects like logging.Logger from being serialized as {"type": "Logger", "params": {...}} which then fails deserialization. from_serializable_dict returns None for unknown types instead of raising ValueError, handling payloads from older clients. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:17:02 +08:00
hafezparast	2fd0f4c6a7	fix: preserve mermaid diagram text from SVGs during scraping (#1043) Mermaid diagrams rendered as SVGs were completely stripped during HTML cleaning, losing all text content. Now detects SVGs with id="mermaid-*", extracts node/edge labels, and replaces the SVG with a fenced mermaid code block containing the diagram type and extracted text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 11:43:54 +08:00
hafezparast	310b52b663	fix: improve browser None guard in create_browser_context (#1842) The existing guard assumed self.browser=None only meant persistent context mode. In reality, the browser can be None because it was closed by the janitor, crashed, or never started. This caused a misleading error message. Now the guard distinguishes between persistent context and closed/crashed browser with appropriate messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:45:38 +08:00
ntohidi	37da8b8f97	fix: pin redis-tools version to match redis-server in Dockerfile docker-rebuild-v0.8.5	2026-03-21 14:26:23 +08:00
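
Note on 3d4bda122a: the behaviour described in that commit message can be reproduced with the standard-library contextvars module alone. The following is a minimal sketch, not code from the repository. The variable name deep_crawl_active is taken from the commit message; the helper functions and the use of copy_context() as a stand-in for the separate Tasks are assumptions made for illustration.

```python
import contextvars

# Stand-in for the flag named in the commit message (default assumed to be False).
deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def reset_in_other_context() -> str:
    """Create a token in one Context and try to reset it from another."""
    ctx_a = contextvars.copy_context()  # plays the role of the Task that set the flag
    ctx_b = contextvars.copy_context()  # plays the role of the Task that cleans up
    token = ctx_a.run(deep_crawl_active.set, True)
    try:
        # reset() checks that the token belongs to the current Context.
        ctx_b.run(deep_crawl_active.reset, token)
    except ValueError:
        return "ValueError"
    return "ok"


def set_false_in_other_context() -> bool:
    """Clear the flag with set(False), which needs no token."""
    ctx_b = contextvars.copy_context()
    ctx_b.run(deep_crawl_active.set, True)
    ctx_b.run(deep_crawl_active.set, False)
    return ctx_b.run(deep_crawl_active.get)


reset_outcome = reset_in_other_context()
flag_after_set = set_false_in_other_context()
```

Under these assumptions reset_outcome records the ValueError, while set(False) clears the flag without a token. As the commit message notes, this trade is only safe when the flag is never nested, because set(False) cannot restore an earlier True value.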

1501 Commits
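
Note on c837c0d9cb: the ordering problem described in that commit message is a general property of Python sets and can be shown without NLTK. This is a small sketch under that assumption; the function names and sample sentences are hypothetical and are not taken from NlpSentenceChunking.

```python
from typing import List


def chunk_unordered(sentences: List[str]) -> List[str]:
    # Pattern named in the commit message: set() drops duplicates
    # and gives no guarantee about the resulting order.
    return list(set(sentences))


def chunk_ordered(sentences: List[str]) -> List[str]:
    # Plain return of the sentence list: order and duplicates are kept.
    return list(sentences)


sens = ["Zebra first.", "Apple second.", "Zebra first."]
ordered = chunk_ordered(sens)
unordered = chunk_unordered(sens)
```

Here ordered keeps all three sentences in document order, while unordered holds only two entries in an unspecified order.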