crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-11 00:08:01 +00:00

Author	SHA1	Message	Date
unclecode	aa81e8fe7d	security: non-breaking hardening patch (0.8.8) Backward-compatible fixes for the Docker server - features keep working, only the unsafe behavior is closed. (The secure-by-default redesign is the later major.) - SSRF: replace the explicit blocklist with the one rule (reject any resolved IP where not ip.is_global) evaluated on embedded IPv4 transition forms too, closing the gaps - IPv6 unspecified ::, NAT64 64:ff9b::/96, 6to4 2002::/16, v4-mapped. Error messages are now opaque (no resolved-IP leak). - output_path arbitrary write: harden validate_output_path with realpath containment (defeats a symlinked path component) and write via O_NOFOLLOW (write_output_file). output_path stays supported. - LLM base_url key exfil: ignore a request-supplied base_url in /md, /llm, /llm/job; the endpoint is always server-derived. Field still accepted (no 4xx) for compatibility. - env:SECRET_KEY exfil gadget: LLMConfig refuses env: resolution of protected names (SECRET/PASSWORD/PRIVATE substrings, CRAWL4AI/AWS_SECRET prefixes, SECRET_KEY/REDIS_PASSWORD/TOKEN). Normal provider keys (OPENAI_API_KEY, ...) unaffected. - CRLF log injection: CRLFSafeFilter strips CR/LF/control from log records. - Webhook header injection: sanitize_webhook_headers (name pattern, no control chars, deny hop-by-hop/sensitive) at send time + a WebhookConfig validator for early 422. Bump 0.8.7 -> 0.8.8 (__version__ + Dockerfile C4AI_VER). 30 new behavioral tests; existing 111 security tests + 112 library config tests still pass. NOT included (breaking -> deferred to the major): auth-by-default, trust boundary, declarative hooks, output_path removal, base_url/provider removal, loopback bind, redis password, TLS-verify-on, CORS, bounded queue. The exec-hook RCE and unauth-by-default criticals have no non-breaking fix and are closed only in the major (hooks are already off by default).	2026-06-02 12:39:04 +00:00
unclecode	1e25edcb5c	fix(security): block IPv6-mapped IPv4 SSRF bypass Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which is not in any IPv4 blocklist (127.0.0.0/8) nor IPv6 blocklist (::1/128). Fix: added _expand_ip_candidates() helper that unwraps IPv4 from IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible (::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks both the original IP and the unwrapped IPv4 form. Added 6 new TestIPv6MappedBypass tests covering: - Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping - IPv4-compatible variant (::127.0.0.1) - Regression test that plain ::1 still blocked Also updated stale test assertion in test_eval_security_adversarial: hasattr, type, __build_class__ were removed from hook builtins in batch 2 but the test still expected hasattr to remain. DO NOT PUSH until release day.	2026-04-20 10:10:59 +00:00
unclecode	f77c0a856f	fix(security): SSRF protection on all crawl/md/llm URL entry points Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints pass user URLs to crawler.arun() with no private IP validation. - Add validate_url_destination() to utils.py with opt-out via CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need to crawl internal services. - Integrate into validate_url_scheme() (covers all server.py endpoints). - Add validation at all 4 URL entry points in api.py (handle_llm_qa, handle_markdown_request, create_new_task, handle_crawl_request). - raw: URLs bypass check (inline HTML, no network fetch). - 16 adversarial + source coverage tests added. - secsys_codex added to SECURITY-CREDITS.md. DO NOT PUSH until release day.	2026-04-20 09:42:43 +00:00
unclecode	0f20f8bb83	fix(security): batch 2 - JWT secret, eval removal, execute_js, hook sandbox Fixes for 4 vulnerabilities reported by by111/August829 (2026-04-14): 1. Hardcoded JWT secret (CVSS 9.8): Removed "mysecret" default from auth.py. Added weak secret validation (blocklist + min 32 chars). Auto-generates ephemeral key when none set. 2. eval() in /config/dump (CVSS 9.1): Replaced eval-based config parsing with JSON input {type, params} validated by Pydantic. Added authentication. Deleted _safe_eval_config and all AST allowlist code. 3. /execute_js endpoint (CVSS 8.1): Disabled by default via CRAWL4AI_EXECUTE_JS_ENABLED env var. Added SSRF blocklist on destination URL. Removed --disable-web-security from default browser args. 4. Hook sandbox escape (CVSS 9.8): Strip __builtins__, __loader__, __spec__ from injected module proxies. Removed type, hasattr, __build_class__ from allowed builtins. Also added SECURITY-CREDITS.md tracking all reporters. 30 adversarial tests added. DO NOT PUSH until release day.	2026-04-15 05:42:14 +00:00
unclecode	7976b45817	fix(security): patch 4 vulns - file write, SSRF, monitor auth, XSS Fixes for 4 vulnerabilities reported by Jeongbean Jeon (2026-04-13): 1. Arbitrary File Write (CVSS 9.1): /screenshot and /pdf output_path now validated via validate_output_path() restricting writes to CRAWL4AI_OUTPUT_DIR. Pydantic validator rejects '..' at schema level. 2. SSRF via Webhook (CVSS 8.6): validate_webhook_url() blocks private IPs (RFC 1918, loopback, link-local, cloud metadata), dangerous hostnames (localhost, metadata.google.internal, host.docker.internal). Validated at job submission + send time. follow_redirects=False set. 3. Monitor Auth Bypass (CVSS 6.5): monitor_router now mounted with dependencies=[Depends(token_dep)]. WebSocket /ws endpoint checks CRAWL4AI_API_TOKEN from query params. 4. Stored XSS (CVSS 6.1): Server-side html.escape() on URLs and errors in monitor.py. Client-side escapeHtml() wrapping all innerHTML template injections in index.html (active/completed/error lists + WebSocket updates). 33 adversarial security tests added. DO NOT PUSH until release day. Merge to develop + tag + advisory together.	2026-04-13 11:29:54 +00:00
unclecode	3a75dd3f4c	fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758) - #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance) - #1489: Preserve query parameter key casing in normalize_url - #1374: Close NamedTemporaryFile handle before reopening (Windows fix) - #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1) - #1183: Fix extract_xml_data regex matching tag names in prose text - #1354: Make import_knowledge_base async (fix asyncio.run in running loop) - #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences) - #1031: Make Docker playground code editor resizable with overflow-auto - #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes - #1758: Change screenshot stitching format from BMP to PNG	2026-03-07 09:47:38 +00:00
unclecode	761664d29e	fix: add TTL expiry for Redis task data to prevent memory growth (#1730) From PR #1730 by @hoi	2026-03-07 06:17:58 +00:00
unclecode	b97eaeea4c	feat(docker): implement smart browser pool with 10x memory efficiency Major refactoring to eliminate memory leaks and enable high-scale crawling: - Smart 3-Tier Browser Pool: - Permanent browser (always-ready default config) - Hot pool (configs used 3+ times, longer TTL) - Cold pool (new/rare configs, short TTL) - Auto-promotion: cold → hot after 3 uses - 100% pool reuse achieved in tests - Container-Aware Memory Detection: - Read cgroup v1/v2 memory limits (not host metrics) - Accurate memory pressure detection in Docker - Memory-based browser creation blocking - Adaptive Janitor: - Dynamic cleanup intervals (10s/30s/60s based on memory) - Tiered TTLs: cold 30-300s, hot 120-600s - Aggressive cleanup at high memory pressure - Unified Pool Usage: - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm) - Fixed config signature mismatch (permanent browser matches endpoints) - get_default_browser_config() helper for consistency - Configuration: - Reduced idle_ttl: 1800s → 300s (30min → 5min) - Fixed port: 11234 → 11235 (match Gunicorn) Performance Results (from stress tests): - Memory: 10x reduction (500-700MB × N → 270MB permanent) - Latency: 30-50x faster (<100ms pool hits vs 3-5s startup) - Reuse: 100% for default config, 60%+ for variants - Capacity: 100+ concurrent requests (vs ~20 before) - Leak: 0 MB/cycle (stable across tests) Test Infrastructure: - 7-phase sequential test suite (tests/) - Docker stats integration + log analysis - Pool promotion verification - Memory leak detection - Full endpoint coverage Fixes memory issues reported in production deployments.	2025-10-17 20:38:39 +08:00
ntohidi	159207b86f	feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035 Implement hierarchical configuration for LLM parameters with support for: - Temperature control (0.0-2.0) to adjust response creativity - Custom base_url for proxy servers and alternative endpoints - 4-tier priority: request params > provider env > global env > defaults Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.	2025-08-26 16:44:07 +08:00
ntohidi	95051020f4	fix(docker): Fix LLM API key handling for multi-provider support Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers due to a hardcoded api_key_env fallback in config.yml. This caused authentication errors when using non-OpenAI providers like Gemini. Changes: - Remove api_key_env from config.yml to let litellm handle provider-specific env vars - Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys - Update validate_llm_provider() to trust litellm's built-in key detection - Update documentation to reflect the new automatic key handling The fix leverages litellm's existing capability to automatically find the correct environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.) without manual configuration. ref #1291	2025-08-21 14:01:04 +08:00
ntohidi	ff6ea41ac3	feat(docker): add flexible LLM provider configuration - Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini) - Add optional 'provider' parameter to API endpoints for per-request overrides - Implement provider validation to ensure API keys exist - Update documentation and examples with new configuration options Closes the need to hardcode providers in config.yml	2025-08-05 14:09:54 +08:00
UncleCode	94e9959fe0	feat(docker-api): add job-based polling endpoints for crawl and LLM tasks Implements new asynchronous endpoints for handling long-running crawl and LLM tasks: - POST /crawl/job and GET /crawl/job/{task_id} for crawl operations - POST /llm/job and GET /llm/job/{task_id} for LLM operations - Added Redis-based task management with configurable TTL - Moved schema definitions to dedicated schemas.py - Added example polling client demo_docker_polling.py This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.	2025-05-01 21:24:52 +08:00
UncleCode	2864015469	feat(docker): implement supervisor and secure API endpoints Add supervisor configuration for managing Redis and Gunicorn processes Replace direct process management with supervisord Add secure and token-free API server variants Implement JWT authentication for protected endpoints Update datetime handling in async dispatcher Add email domain verification BREAKING CHANGE: Server startup now uses supervisord instead of direct process management	2025-02-17 20:31:20 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00
UncleCode	7b1ef07c41	refactor(docker): remove unused models and utilities for cleaner codebase	2025-02-01 20:10:13 +08:00
UncleCode	53ac3ec0b4	feat(docker): add Docker service integration and config serialization Add Docker service integration with FastAPI server and client implementation. Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support Docker service communication. Clean up imports and improve error handling. - Add Crawl4aiDockerClient class - Implement config serialization/deserialization - Add FastAPI server with streaming support - Add health check endpoint - Clean up imports and type hints	2025-01-31 18:00:16 +08:00

16 Commits
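
The SSRF commits above (1e25edcb5c and aa81e8fe7d) describe a single rule: reject any address that is not globally routable, and apply the same rule to an IPv4 address embedded in an IPv6 transition form. The following is a minimal sketch of that idea using only the Python standard library. It is not the repository's code: the names expand_ip_candidates and is_destination_allowed are hypothetical, DNS resolution is left out (only literal addresses are checked), and the set of transition forms covered is taken from the commit messages.

```python
import ipaddress

# NAT64 well-known prefix named in the 0.8.8 commit message.
NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def expand_ip_candidates(ip):
    """Return the address plus any IPv4 address embedded in it (hypothetical helper)."""
    candidates = [ip]
    if isinstance(ip, ipaddress.IPv6Address):
        # IPv4-mapped form, ::ffff:X.Y.Z.W
        if ip.ipv4_mapped is not None:
            candidates.append(ip.ipv4_mapped)
        # 6to4 form, 2002:XXYY:ZZWW::/48
        if ip.sixtofour is not None:
            candidates.append(ip.sixtofour)
        value = int(ip)
        # NAT64 and IPv4-compatible (::X.Y.Z.W) forms keep IPv4 in the low 32 bits.
        if ip in NAT64_PREFIX or value >> 32 == 0:
            candidates.append(ipaddress.IPv4Address(value & 0xFFFFFFFF))
    return candidates


def is_destination_allowed(address: str) -> bool:
    """Allow an address only if it and every embedded IPv4 form are global."""
    ip = ipaddress.ip_address(address)
    return all(candidate.is_global for candidate in expand_ip_candidates(ip))


if __name__ == "__main__":
    for addr in ["8.8.8.8", "127.0.0.1", "::ffff:127.0.0.1", "64:ff9b::7f00:1", "::"]:
        print(addr, is_destination_allowed(addr))
```

A real server would run this check on every address returned by name resolution, not only on literals, and would return an opaque error on rejection, as the 0.8.8 message notes.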