crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

Author	SHA1	Message	Date
Nasrin	4e86399bfa	Merge pull request #1913 from unclecode/fix/nlp-sentence-chunking-1909 fix(chunking): preserve sentence order in NlpSentenceChunking	2026-04-16 10:24:59 +02:00
ntohidi	3d4bda122a	fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917) ContextVar.reset(token) requires the same Context that created the token. When Starlette's StreamingResponse consumes the async generator in a different Task, the Context changes and reset() raises ValueError. Replaced with set(False) which works across context boundaries. Safe because deep_crawl_active is never nested: the guard on line 21 prevents re-entry.	2026-04-16 13:49:32 +08:00
ntohidi	7bfc547bce	fix: preserve rowspan/colspan in cleaned_html (#1920) Add rowspan and colspan to IMPORTANT_ATTRS so they survive attribute stripping in remove_unwanted_attributes_fast().	2026-04-16 12:42:36 +08:00
unclecode	c9914691db	chore: add GitHub Security Advisory payloads for release day	2026-04-15 06:08:42 +00:00
unclecode	c45ccf20f6	fix: credit wulonchia by requested handle	2026-04-15 05:45:15 +00:00
unclecode	0f20f8bb83	fix(security): batch 2 - JWT secret, eval removal, execute_js, hook sandbox Fixes for 4 vulnerabilities reported by by111/August829 (2026-04-14): 1. Hardcoded JWT secret (CVSS 9.8): Removed "mysecret" default from auth.py. Added weak secret validation (blocklist + min 32 chars). Auto-generates ephemeral key when none set. 2. eval() in /config/dump (CVSS 9.1): Replaced eval-based config parsing with JSON input {type, params} validated by Pydantic. Added authentication. Deleted _safe_eval_config and all AST allowlist code. 3. /execute_js endpoint (CVSS 8.1): Disabled by default via CRAWL4AI_EXECUTE_JS_ENABLED env var. Added SSRF blocklist on destination URL. Removed --disable-web-security from default browser args. 4. Hook sandbox escape (CVSS 9.8): Strip __builtins__, __loader__, __spec__ from injected module proxies. Removed type, hasattr, __build_class__ from allowed builtins. Also added SECURITY-CREDITS.md tracking all reporters. 30 adversarial tests added. DO NOT PUSH until release day.	2026-04-15 05:42:14 +00:00
unclecode	7976b45817	fix(security): patch 4 vulns - file write, SSRF, monitor auth, XSS Fixes for 4 vulnerabilities reported by Jeongbean Jeon (2026-04-13): 1. Arbitrary File Write (CVSS 9.1): /screenshot and /pdf output_path now validated via validate_output_path() restricting writes to CRAWL4AI_OUTPUT_DIR. Pydantic validator rejects '..' at schema level. 2. SSRF via Webhook (CVSS 8.6): validate_webhook_url() blocks private IPs (RFC 1918, loopback, link-local, cloud metadata), dangerous hostnames (localhost, metadata.google.internal, host.docker.internal). Validated at job submission + send time. follow_redirects=False set. 3. Monitor Auth Bypass (CVSS 6.5): monitor_router now mounted with dependencies=[Depends(token_dep)]. WebSocket /ws endpoint checks CRAWL4AI_API_TOKEN from query params. 4. Stored XSS (CVSS 6.1): Server-side html.escape() on URLs and errors in monitor.py. Client-side escapeHtml() wrapping all innerHTML template injections in index.html (active/completed/error lists + WebSocket updates). 33 adversarial security tests added. DO NOT PUSH until release day. Merge to develop + tag + advisory together.	2026-04-13 11:29:54 +00:00
ntohidi	c837c0d9cb	fix(chunking): preserve sentence order in NlpSentenceChunking (#1909) Remove broken re-import of load_nltk_punkt (already imported at module level). Replace list(set(sens)) with plain return; set() destroyed document order and silently dropped duplicate sentences.	2026-04-11 17:27:18 +08:00
hafezparast	c5612f7551	fix: correct arun() return type from RunManyReturn to CrawlResultContainer (#1898) arun() always returns CrawlResultContainer, never AsyncGenerator. The RunManyReturn type (Union[CrawlResultContainer, AsyncGenerator]) caused Pylance/Pyright to flag result.markdown as an error because AsyncGenerator doesn't have that attribute. Also adds test_type_annotations.py: 11 static analysis tests that catch annotation mismatches (return types, missing annotations, export checks) without needing pyright in CI. Would have caught this bug before it was reported. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 21:35:17 +08:00
Nasrin	3d02d75edb	Merge pull request #1852 from hafezparast/feat/maysam-arun-many-config-list-1837 feat: expose arun_many config-list support in Docker API (#1837)	2026-04-06 10:26:44 +02:00
unclecode	ec560f13d2	fix: default LLMExtractionStrategy extraction_type to schema Block mode returns an internal index/tags/content format that is rarely useful. Schema mode returns clean structured JSON, either matching a provided schema or inferred from the instruction.	2026-04-04 09:26:35 +00:00
unclecode	e326da9166	fix(security): complete AST sandbox escape remediation (CVSS 9.8) Addresses the gi_frame.f_back chain exploit reported by Song Binglin (q1uf3ng). - Delete _safe_eval_expression() and _SAFE_EVAL_BUILTINS entirely from extraction_strategy.py. Dead security-sensitive code is a liability. The eval path was already disabled; this removes the function itself. - Fix hook_manager.py module injection: replace broken exec("import X", ns) pattern (silently failed due to missing __import__) with direct module injection. Sanitize asyncio to strip subprocess access (RCE vector). - Add startup warning when CRAWL4AI_API_TOKEN is unset (all endpoints unauthenticated). - Expand adversarial test suite to 87 tests: hook sandbox escapes, asyncio.subprocess RCE verification, end-to-end exploit payload from vuln report, dead code deletion checks, codebase eval/exec audit.	2026-03-31 13:01:57 +00:00
unclecode	2fc39cbe89	fix(security): remove eval() from computed fields, harden config deserializer - Disable eval() in _compute_field expression path (RCE vector via untrusted input). Expression key now logs warning and returns default; function key still works. - Harden _safe_eval_config in server.py with name/attribute allowlists, block lambdas, generators, comprehensions in constructor args. - Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors). - Add 67 adversarial security tests covering all eval/exec attack surfaces. Closes #1886, closes #1855	2026-03-31 12:02:43 +00:00
UncleCode	1debe5f5fc	Merge pull request #1885 from unclecode/develop docs: update version references to 0.8.6	2026-03-30 09:59:58 +07:00
ntohidi	bcbccbea2f	docs: update version references to 0.8.6 in README and Docker docs	2026-03-30 10:57:13 +08:00
Nasrin	7e7533ec7c	Merge pull request #1882 from hafezparast/fix/crawler-config-dict-validation-1880 fix: validate markdown_generator type to catch bad JSON format (#1880)	2026-03-30 04:50:32 +02:00
hafezparast	e9f832274e	fix: validate markdown_generator type in CrawlerRunConfig to catch bad JSON format (#1880) When the Docker API receives markdown_generator as JSON with "options" instead of "params", from_serializable_dict silently passes the raw dict through. This later crashes with a confusing "'dict' object has no attribute 'generate_markdown'" deep in the crawl pipeline. Add type validation for markdown_generator in CrawlerRunConfig.__init__ (matching existing extraction_strategy/chunking_strategy validation). When a dict slips through, the error now clearly states: - What type was expected vs received - That "params" is the required key (not "options") Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 07:39:28 +08:00
unclecode	af648e104f	fix: bump Dockerfile version to 0.8.6 docker-rebuild-v0.8.6	2026-03-24 15:19:18 +00:00
unclecode	4e4a996878	fix: replace litellm with unclecode-litellm due to PyPI supply chain compromise litellm 1.82.7-1.82.8 on PyPI were compromised with malicious code. PyPI quarantined the entire package (all versions uninstallable). Switched to unclecode-litellm==1.81.13, a pre-compromise fork published under our own PyPI account. Drop-in replacement — all imports unchanged. v0.8.6	2026-03-24 14:49:36 +00:00
unclecode	f4bda05178	release: bump version to 0.8.6 Pin litellm to safe fork due to PyPI supply chain compromise (versions 1.82.7-1.82.8 compromised, entire package quarantined).	2026-03-24 14:13:41 +00:00
unclecode	01c685cd3a	fix: pin litellm to safe fork (v1.81.13) due to PyPI supply chain compromise litellm versions 1.82.7 and 1.82.8 on PyPI were compromised with malicious code. PyPI has quarantined the entire package, blocking all installs. Temporarily pin to our own fork at a known-safe version.	2026-03-24 14:03:26 +00:00
Nasrin	1a40ccf093	Merge pull request #1844 from hafezparast/fix/maysam-browser-none-guard-1842 fix: improve browser None guard in create_browser_context (#1842)	2026-03-24 11:37:46 +01:00
Nasrin	6eb2530bd9	Merge pull request #1849 from hafezparast/fix/maysam-serialize-skip-non-config-1848 fix: skip non-allowlisted types in serialization/deserialization (#1848)	2026-03-24 11:36:03 +01:00
Nasrin	fb24ee592e	Merge pull request #1851 from hafezparast/fix/maysam-mcp-sse-asgi-1850 fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850)	2026-03-24 11:17:35 +01:00
ntohidi	3846b738cf	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into main	2026-03-24 18:10:40 +08:00
UncleCode	1a597cb97f	Merge pull request #1836 from unclecode/release/v0.8.5 Release v0.8.5	2026-03-24 11:06:58 +01:00
hafezparast	8995c1bbd6	feat: expose arun_many config-list support in Docker API (#1837) The /crawl endpoint now accepts an optional crawler_configs field (list of CrawlerRunConfig dicts) alongside the existing crawler_config. When provided with multiple URLs, each config is deserialized and passed as a list to arun_many(), enabling per-URL configuration with url_matcher patterns. Single-URL requests and requests without crawler_configs are unchanged (backward compatible). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:56:53 +08:00
hafezparast	219416e49d	fix: MCP SSE endpoint crash on Starlette >=0.50 (#1850) Starlette's Route wraps async functions in request_response(), calling handler(request) instead of handler(scope, receive, send). This broke the MCP SSE endpoint which needs raw ASGI access. Fix: use a callable class instead of an async function; Route passes class instances through as raw ASGI apps without wrapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:55:41 +08:00
hafezparast	e603e4a722	fix: skip non-allowlisted types in serialization/deserialization (#1848) to_serializable_dict now skips types not in ALLOWED_DESERIALIZE_TYPES (returns None), preventing objects like logging.Logger from being serialized as {"type": "Logger", "params": {...}} which then fails deserialization. from_serializable_dict returns None for unknown types instead of raising ValueError, handling payloads from older clients. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 08:17:02 +08:00
hafezparast	2fd0f4c6a7	fix: preserve mermaid diagram text from SVGs during scraping (#1043) Mermaid diagrams rendered as SVGs were completely stripped during HTML cleaning, losing all text content. Now detects SVGs with id="mermaid-*", extracts node/edge labels, and replaces the SVG with a fenced mermaid code block containing the diagram type and extracted text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 11:43:54 +08:00
hafezparast	310b52b663	fix: improve browser None guard in create_browser_context (#1842) The existing guard assumed self.browser=None only meant persistent context mode. In reality, the browser can be None because it was closed by the janitor, crashed, or never started. This caused a misleading error message. Now the guard distinguishes between persistent context and closed/crashed browser with appropriate messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:45:38 +08:00
ntohidi	37da8b8f97	fix: pin redis-tools version to match redis-server in Dockerfile docker-rebuild-v0.8.5	2026-03-21 14:26:23 +08:00
ntohidi	29d27ed1ae	fix: install curl and gnupg in Dockerfile to support Redis repository addition	2026-03-21 14:17:27 +08:00
unclecode	c4389adddf	fix: Prevent scan_full_page from hanging on dynamic/infinite-scroll pages - Default max_scroll_steps to 10 when not explicitly set (was None/unlimited) - Wrap _handle_full_page_scan in asyncio.wait_for with page_timeout - On timeout, log warning and continue with partial scroll instead of hanging Previously, scan_full_page could hang indefinitely because: 1. max_scroll_steps defaulted to None (no limit) 2. Dynamic pages keep growing total_height on each scroll 3. No asyncio timeout wrapper to interrupt hung coroutines	2026-03-18 15:36:12 +00:00
unclecode	3ecd852011	fix: Re-check is_blocked() when fallback fetch fails When fallback_fetch_function was invoked but failed (exception or empty response), the final is_blocked() re-check was skipped because fallback_fetch_used=True. This left crawl_result.success=True even though the result was a blocked page from the last proxy attempt. Changed the condition to check resolved_by=='fallback_fetch' (set only on success) instead of fallback_fetch_used (set before the try block).	2026-03-18 14:36:57 +00:00
ntohidi	4bf17796d4	feat: add version 0.8.5 release highlights including anti-bot detection, shadow DOM support, and critical security fixes to README v0.8.5	2026-03-18 11:23:20 +08:00
unclecode	9b571bb947	feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.) The HTTP crawler strategy now checks Content-Type and Content-Disposition headers to detect non-HTML file responses. When a file download is detected, raw bytes are saved to disk and the path is returned via downloaded_files. Text-based files (CSV, JSON, XML) also populate the html field for backward compatibility. Binary files (PDF, images) set html to empty string — content is only available via downloaded_files. Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).	2026-03-16 14:03:43 +00:00
ntohidi	bb6406a2d0	release: Crawl4AI v0.8.5 Bump version to 0.8.5 across all references (Dockerfile, README, Docker README, blog index, __version__.py). Add release notes, blog post, demo verification script (13 real-crawl tests), and releases directory entry. Key highlights: - Anti-bot detection with 3-tier proxy escalation - Shadow DOM flattening - Deep crawl cancellation - Config defaults API - 60+ bug fixes and critical security patches	2026-03-16 18:46:05 +08:00
ntohidi	f6ab207e25	fix: remove shared LOCK contention in monitor to prevent pod deadlock (#1754) The monitor's update_timeline(), get_health_summary(), and get_browser_list() all acquired the crawler pool's global LOCK to read pool stats. That same lock is held during slow browser start/close operations (get_crawler, janitor, close_all), causing the monitor to block indefinitely and the pod to become unresponsive after sustained crawling. Replaced all three lock acquisitions in monitor.py with a lock-free get_pool_snapshot() in crawler_pool.py that returns shallow dict copies. Under CPython's GIL, dict.copy() and len() are atomic, so they are safe for read-only monitoring with at most slightly stale counts.	2026-03-13 12:17:52 +08:00
Nasrin	648f36b622	Merge pull request #1827 from hafezparast/fix/maysam-llm-provider-redis-config-1611-1817 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)	2026-03-13 03:59:28 +01:00
Nasrin	6e4299577f	Merge pull request #1833 from hafezparast/fix/maysam-css-selector-raw-1484 fix: css_selector ignored in LXML scraping for raw:// URLs (#1484)	2026-03-13 03:38:15 +01:00
hafezparast	8de83a3590	fix: css_selector ignored in LXML scraping for raw:// URLs (#1484) css_selector was skipped in _scrap(); only target_elements was applied. Now css_selector filters the DOM first, then target_elements narrows within that selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 20:00:33 +08:00
unclecode	bf1158a61b	fix: upgrade Redis to 7.2.7 for CVE-2025-49844 (CVSS 10.0) (#1671) Add official Redis apt repository and pin redis-server to 7.2.7 which patches the Lua use-after-free vulnerability. REDIS_VERSION build arg allows override.	2026-03-12 11:24:42 +00:00
unclecode	a73bc1c076	fix: MCP SSE endpoint crash, mount via raw ASGI Route (#1594) Replace @app.get() with starlette.routing.Route() for the SSE handler. The MCP SDK's SseServerTransport calls raw ASGI (scope, receive, send) internally, which conflicts with Starlette's middleware wrapping. Also update CONTRIBUTORS.md for PR #1829.	2026-03-12 11:22:48 +00:00
hafezparast	3f481e9e5c	fix: screenshot distortion, deep crawl timeout/arun_many, CLI encoding (#1370, #1818, #1509, #1762) - #1370: Freeze element dimensions via CSS before viewport resize in take_screenshot_scroller() to prevent responsive reflow on Elementor sites; restore original viewport after capture. - #1818: Call window.stop() on session-reused pages before navigation to abort pending loads; move event listener cleanup outside session_id guard so listeners don't accumulate across reuses. - #1509: Bypass dispatcher in arun_many() when deep_crawl_strategy is set; call arun() directly per URL so the DeepCrawlDecorator can invoke the strategy (dispatcher crashes on List[CrawlResult] return). - #1762: Add encoding="utf-8" to the remaining open() call in save_global_config() (cli.py line 58). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 18:17:13 +08:00
hafezparast	480d938f67	fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817) - #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional provider, temperature, base_url query params with fallback to server config. Consistent with /md and /llm/job endpoints. - #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT, REDIS_PASSWORD environment variable overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:53:04 +08:00
Nasrin	d907e167a5	Merge pull request #1823 from hafezparast/fix/maysam-screenshot-scan-full-page-1750 fix: screenshot respects scan_full_page=False (#1750)	2026-03-12 07:39:52 +01:00
Maysam Hafezparast	57b0d09934	fix: deduplicate BM25ContentFilter output (#1213) (#1824) BM25ContentFilter.filter_content() returned duplicate text chunks when the same content appeared in multiple DOM elements. Added exact-text deduplication after threshold filtering, keeping the first occurrence in document order. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:23:34 +08:00
unclecode	35034f551b	docs: add hafezparast to CONTRIBUTORS.md Recognized for identifying and confirming the PDFContentScrapingStrategy deserialization fix (#1815).	2026-03-12 05:43:48 +00:00
hafezparast	6efbffe345	fix: screenshot respects scan_full_page=False (#1750) take_screenshot() ignored the scan_full_page config flag, so tall pages always got a full-page screenshot even when scan_full_page=False. Now passes scan_full_page through to take_screenshot() and uses viewport-only capture when False. Includes 16 tests (8 unit + 8 integration). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 12:04:45 +08:00
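
Commit 3d4bda122a in the log above rests on a `contextvars` detail: a token can only be reset in the Context that created it. The following is a minimal standard-library sketch of that behavior, not the project's code. The variable name is borrowed from the commit message, the helper functions are hypothetical, and `copy_context().run(...)` is used here as a stand-in for the generator being consumed in a different Task.

```python
import contextvars

# Name taken from the commit message; the real definition may differ.
deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def try_reset(token):
    """Attempt reset(token) and report whether it raised ValueError."""
    try:
        deep_crawl_active.reset(token)
    except ValueError:
        return "ValueError"
    return "ok"


def clear_flag():
    """set(False) needs no token, so it works in whichever Context is current."""
    deep_crawl_active.set(False)
    return deep_crawl_active.get()


token = deep_crawl_active.set(True)   # token is tied to the current Context
other = contextvars.copy_context()    # a different Context, as another Task would have

reset_outcome = other.run(try_reset, token)   # reset() is rejected in the other Context
flag_after_set = other.run(clear_flag)        # set(False) succeeds there
```

Note that `set(False)` only changes the value in the Context where it runs; the original Context keeps its own value, which is why the commit message stresses that the flag is never nested.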


1533 Commits
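
The chunking fix in commit c837c0d9cb (replacing `list(set(sens))` with a plain return) can be illustrated with a small sketch. The two functions below are hypothetical stand-ins, not the `NlpSentenceChunking` implementation, and the sample sentences are made up for the example.

```python
sentences = ["It rained.", "We stayed in.", "It rained.", "Then it cleared."]


def chunk_with_set(sens):
    # Behavior described as buggy in the commit message:
    # set() drops the repeated sentence and does not guarantee order.
    return list(set(sens))


def chunk_in_order(sens):
    # Plain return: every sentence is kept, in document order.
    return list(sens)


deduped = chunk_with_set(sentences)
ordered = chunk_in_order(sentences)
```

Here `deduped` has one fewer element than the input because the repeated sentence collapses into one, while `ordered` is identical to the input list.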