crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-11 00:08:01 +00:00

Author	SHA1	Message	Date
unclecode	aa81e8fe7d	security: non-breaking hardening patch (0.8.8) Backward-compatible fixes for the Docker server - features keep working, only the unsafe behavior is closed. (The secure-by-default redesign is the later major.) - SSRF: replace the explicit blocklist with the one rule (reject any resolved IP where not ip.is_global) evaluated on embedded IPv4 transition forms too, closing the gaps - IPv6 unspecified ::, NAT64 64:ff9b::/96, 6to4 2002::/16, v4-mapped. Error messages are now opaque (no resolved-IP leak). - output_path arbitrary write: harden validate_output_path with realpath containment (defeats a symlinked path component) and write via O_NOFOLLOW (write_output_file). output_path stays supported. - LLM base_url key exfil: ignore a request-supplied base_url in /md, /llm, /llm/job; the endpoint is always server-derived. Field still accepted (no 4xx) for compatibility. - env:SECRET_KEY exfil gadget: LLMConfig refuses env: resolution of protected names (SECRET/PASSWORD/PRIVATE substrings, CRAWL4AI/AWS_SECRET prefixes, SECRET_KEY/REDIS_PASSWORD/TOKEN). Normal provider keys (OPENAI_API_KEY, ...) unaffected. - CRLF log injection: CRLFSafeFilter strips CR/LF/control from log records. - Webhook header injection: sanitize_webhook_headers (name pattern, no control chars, deny hop-by-hop/sensitive) at send time + a WebhookConfig validator for early 422. Bump 0.8.7 -> 0.8.8 (__version__ + Dockerfile C4AI_VER). 30 new behavioral tests; existing 111 security tests + 112 library config tests still pass. NOT included (breaking -> deferred to the major): auth-by-default, trust boundary, declarative hooks, output_path removal, base_url/provider removal, loopback bind, redis password, TLS-verify-on, CORS, bounded queue. The exec-hook RCE and unauth-by-default criticals have no non-breaking fix and are closed only in the major (hooks are already off by default).	2026-06-02 12:39:04 +00:00
unclecode	fcaf08b3b3	merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7	2026-06-01 12:40:37 +00:00
unclecode	1e25edcb5c	fix(security): block IPv6-mapped IPv4 SSRF bypass Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which is not in any IPv4 blocklist (127.0.0.0/8) nor IPv6 blocklist (::1/128). Fix: added _expand_ip_candidates() helper that unwraps IPv4 from IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible (::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks both the original IP and the unwrapped IPv4 form. Added 6 new TestIPv6MappedBypass tests covering: - Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping - IPv4-compatible variant (::127.0.0.1) - Regression test that plain ::1 still blocked Also updated stale test assertion in test_eval_security_adversarial: hasattr, type, __build_class__ were removed from hook builtins in batch 2 but the test still expected hasattr to remain. DO NOT PUSH until release day.	2026-04-20 10:10:59 +00:00
unclecode	f77c0a856f	fix(security): SSRF protection on all crawl/md/llm URL entry points Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints pass user URLs to crawler.arun() with no private IP validation. - Add validate_url_destination() to utils.py with opt-out via CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need to crawl internal services. - Integrate into validate_url_scheme() (covers all server.py endpoints). - Add validation at all 4 URL entry points in api.py (handle_llm_qa, handle_markdown_request, create_new_task, handle_crawl_request). - raw: URLs bypass check (inline HTML, no network fetch). - 16 adversarial + source coverage tests added. - secsys_codex added to SECURITY-CREDITS.md. DO NOT PUSH until release day.	2026-04-20 09:42:43 +00:00
Gab	c8c2dc319f	fix: add LLMTableExtraction to Docker API deserialization allowlist	2026-04-17 15:43:56 -04:00
unclecode	0f20f8bb83	fix(security): batch 2 - JWT secret, eval removal, execute_js, hook sandbox Fixes for 4 vulnerabilities reported by by111/August829 (2026-04-14): 1. Hardcoded JWT secret (CVSS 9.8): Removed "mysecret" default from auth.py. Added weak secret validation (blocklist + min 32 chars). Auto-generates ephemeral key when none set. 2. eval() in /config/dump (CVSS 9.1): Replaced eval-based config parsing with JSON input {type, params} validated by Pydantic. Added authentication. Deleted _safe_eval_config and all AST allowlist code. 3. /execute_js endpoint (CVSS 8.1): Disabled by default via CRAWL4AI_EXECUTE_JS_ENABLED env var. Added SSRF blocklist on destination URL. Removed --disable-web-security from default browser args. 4. Hook sandbox escape (CVSS 9.8): Strip __builtins__, __loader__, __spec__ from injected module proxies. Removed type, hasattr, __build_class__ from allowed builtins. Also added SECURITY-CREDITS.md tracking all reporters. 30 adversarial tests added. DO NOT PUSH until release day.	2026-04-15 05:42:14 +00:00
unclecode	7976b45817	fix(security): patch 4 vulns - file write, SSRF, monitor auth, XSS Fixes for 4 vulnerabilities reported by Jeongbean Jeon (2026-04-13): 1. Arbitrary File Write (CVSS 9.1): /screenshot and /pdf output_path now validated via validate_output_path() restricting writes to CRAWL4AI_OUTPUT_DIR. Pydantic validator rejects '..' at schema level. 2. SSRF via Webhook (CVSS 8.6): validate_webhook_url() blocks private IPs (RFC 1918, loopback, link-local, cloud metadata), dangerous hostnames (localhost, metadata.google.internal, host.docker.internal). Validated at job submission + send time. follow_redirects=False set. 3. Monitor Auth Bypass (CVSS 6.5): monitor_router now mounted with dependencies=[Depends(token_dep)]. WebSocket /ws endpoint checks CRAWL4AI_API_TOKEN from query params. 4. Stored XSS (CVSS 6.1): Server-side html.escape() on URLs and errors in monitor.py. Client-side escapeHtml() wrapping all innerHTML template injections in index.html (active/completed/error lists + WebSocket updates). 33 adversarial security tests added. DO NOT PUSH until release day. Merge to develop + tag + advisory together.	2026-04-13 11:29:54 +00:00
unclecode	2fc39cbe89	fix(security): remove eval() from computed fields, harden config deserializer - Disable eval() in _compute_field expression path (RCE vector via untrusted input). Expression key now logs warning and returns default; function key still works. - Harden _safe_eval_config in server.py with name/attribute allowlists, block lambdas, generators, comprehensions in constructor args. - Remove getattr/setattr from hook_manager allowed builtins (sandbox escape vectors). - Add 67 adversarial security tests covering all eval/exec attack surfaces. Closes #1886, closes #1855	2026-03-31 12:02:43 +00:00
unclecode	0104db6de2	Fix critical RCE via deserialization and eval() in /crawl endpoint - Replace raw eval() in _compute_field() with AST-validated _safe_eval_expression() that blocks __import__, dunder attribute access, and import statements while preserving safe transforms - Add ALLOWED_DESERIALIZE_TYPES allowlist to from_serializable_dict() preventing arbitrary class instantiation from API input - Update security contact email and add v0.8.1 security fixes to SECURITY.md with researcher acknowledgment - Add 17 security tests covering both fixes	2026-01-30 08:46:32 +00:00
unclecode	f24396c23e	Fix critical RCE and LFI vulnerabilities in Docker API deployment Security fixes for vulnerabilities reported by ProjectDiscovery: 1. Remote Code Execution via Hooks (CVE pending) - Remove __import__ from allowed_builtins in hook_manager.py - Prevents arbitrary module imports (os, subprocess, etc.) - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var 2. Local File Inclusion via file:// URLs (CVE pending) - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html - Block file://, javascript:, data: and other dangerous schemes - Only allow http://, https://, and raw: (where appropriate) 3. Security hardening - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks) - Add security warning comments in config.yml - Add validate_url_scheme() helper for consistent validation Testing: - Add unit tests (test_security_fixes.py) - 16 tests - Add integration tests (run_security_tests.py) for live server Affected endpoints: - POST /crawl (hooks disabled by default) - POST /crawl/stream (hooks disabled by default) - POST /execute_js (URL validation added) - POST /screenshot (URL validation added) - POST /pdf (URL validation added) - POST /html (URL validation added) Breaking changes: - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function - file:// URLs no longer work on API endpoints (use library directly)	2026-01-12 04:14:37 +00:00
unclecode	aba4036ab6	Add demo and test scripts for monitor dashboard activity - Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity. - Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics. - Added a logo image to the static assets for branding purposes.	2025-10-17 22:43:06 +08:00
unclecode	b97eaeea4c	feat(docker): implement smart browser pool with 10x memory efficiency Major refactoring to eliminate memory leaks and enable high-scale crawling: - Smart 3-Tier Browser Pool: - Permanent browser (always-ready default config) - Hot pool (configs used 3+ times, longer TTL) - Cold pool (new/rare configs, short TTL) - Auto-promotion: cold → hot after 3 uses - 100% pool reuse achieved in tests - Container-Aware Memory Detection: - Read cgroup v1/v2 memory limits (not host metrics) - Accurate memory pressure detection in Docker - Memory-based browser creation blocking - Adaptive Janitor: - Dynamic cleanup intervals (10s/30s/60s based on memory) - Tiered TTLs: cold 30-300s, hot 120-600s - Aggressive cleanup at high memory pressure - Unified Pool Usage: - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm) - Fixed config signature mismatch (permanent browser matches endpoints) - get_default_browser_config() helper for consistency - Configuration: - Reduced idle_ttl: 1800s → 300s (30min → 5min) - Fixed port: 11234 → 11235 (match Gunicorn) Performance Results (from stress tests): - Memory: 10x reduction (500-700MB × N → 270MB permanent) - Latency: 30-50x faster (<100ms pool hits vs 3-5s startup) - Reuse: 100% for default config, 60%+ for variants - Capacity: 100+ concurrent requests (vs ~20 before) - Leak: 0 MB/cycle (stable across tests) Test Infrastructure: - 7-phase sequential test suite (tests/) - Docker stats integration + log analysis - Pool promotion verification - Memory leak detection - Full endpoint coverage Fixes memory issues reported in production deployments.	2025-10-17 20:38:39 +08:00
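
Note on the SSRF commits (1e25edcb5c, aa81e8fe7d): the log describes unwrapping IPv4 addresses embedded in IPv6 forms and then rejecting any address that is not globally routable. The following is a minimal sketch of that idea using only Python's standard ipaddress module. The function names expand_ip_candidates and is_safe_destination are hypothetical and are not taken from the crawl4ai source; the project's own helper (_expand_ip_candidates) is not shown in this log and may differ.

```python
import ipaddress

# Well-known NAT64 prefix named in the 0.8.8 commit message.
NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def expand_ip_candidates(ip_text):
    """Return the parsed address plus any IPv4 address embedded in it.

    Hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(ip_text)
    candidates = [ip]
    if isinstance(ip, ipaddress.IPv6Address):
        # IPv4-mapped form, ::ffff:a.b.c.d
        if ip.ipv4_mapped is not None:
            candidates.append(ip.ipv4_mapped)
        # 6to4 form, 2002:AABB:CCDD::/48
        if ip.sixtofour is not None:
            candidates.append(ip.sixtofour)
        # NAT64 and IPv4-compatible (::a.b.c.d) forms keep the IPv4
        # address in the low 32 bits.
        if ip in NAT64_PREFIX or (int(ip) >> 32) == 0:
            candidates.append(ipaddress.IPv4Address(int(ip) & 0xFFFFFFFF))
    return candidates


def is_safe_destination(ip_text):
    """Apply the single rule: every candidate must be globally routable."""
    return all(c.is_global for c in expand_ip_candidates(ip_text))


if __name__ == "__main__":
    for addr in ("8.8.8.8", "::ffff:127.0.0.1", "64:ff9b::7f00:1", "::"):
        print(addr, is_safe_destination(addr))
```

In a real server the check would be applied to every address returned by DNS resolution, not to the hostname string, and the error returned to the client would not echo the resolved IP.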

12 Commits
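
Note on the output_path hardening (aa81e8fe7d, 7976b45817): the log mentions realpath containment and writing via O_NOFOLLOW. A minimal sketch of that pattern follows, assuming a POSIX system. The function names mirror those in the commit message, but the bodies, signatures, and the base_dir parameter are assumptions made for illustration, not the project's code.

```python
import os


def validate_output_path(base_dir, requested):
    """Resolve `requested` under `base_dir` and reject anything outside it.

    Sketch only: realpath resolves symlinked path components, so a link
    that points outside the base directory fails the containment check.
    """
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, requested))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("output path escapes the allowed directory")
    return candidate


def write_output_file(path, data):
    """Write bytes without following a symlink at the final path component."""
    # O_NOFOLLOW is missing on some platforms; fall back to 0 there.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_NOFOLLOW", 0)
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
```

The two checks are complementary: realpath covers intermediate components at validation time, while O_NOFOLLOW covers the final component at write time. A race between validation and write is still possible for intermediate components, so this should be read as a sketch of the idea, not a complete defense.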