crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

Author	SHA1	Message	Date
unclecode	fcaf08b3b3	merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7	2026-06-01 12:40:37 +00:00
unclecode	f77c0a856f	fix(security): SSRF protection on all crawl/md/llm URL entry points Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints pass user URLs to crawler.arun() with no private IP validation. - Add validate_url_destination() to utils.py with opt-out via CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need to crawl internal services. - Integrate into validate_url_scheme() (covers all server.py endpoints). - Add validation at all 4 URL entry points in api.py (handle_llm_qa, handle_markdown_request, create_new_task, handle_crawl_request). - raw: URLs bypass check (inline HTML, no network fetch). - 16 adversarial + source coverage tests added. - secsys_codex added to SECURITY-CREDITS.md. DO NOT PUSH until release day.	2026-04-20 09:42:43 +00:00
hafezparast	8995c1bbd6	feat: expose arun_many config-list support in Docker API (#1837 ) The /crawl endpoint now accepts an optional crawler_configs field (list of CrawlerRunConfig dicts) alongside the existing crawler_config. When provided with multiple URLs, each config is deserialized and passed as a list to arun_many(), enabling per-URL configuration with url_matcher patterns. Single-URL requests and requests without crawler_configs are unchanged (backward compatible). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:56:53 +08:00
hafezparast	480d938f67	fix: /llm per-request provider override, Redis config from host/port/password (#1611 , #1817 ) - #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional provider, temperature, base_url query params with fallback to server config. Consistent with /md and /llm/job endpoints. - #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT, REDIS_PASSWORD environment variable overrides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:53:04 +08:00
unclecode	7c0cc3ed88	fix: batch merge of community PRs (#1622 , #1786 , #1796 , #1795 , #1798 , #1734 , #1290 , #1668 ) Bug fixes: - Verify redirect targets are alive before returning from URL seeder (#1622) - Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786) - Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796) Security/Docker: - Require api_token for /token endpoint when configured (#1795) - Deep-crawl streaming now mirrors Python library behavior via arun() (#1798) CI: - Bump GitHub Actions to latest versions - checkout v6, setup-python v6, build-push-action v6, setup-buildx v4, login v4 (#1734) Features: - Support type-list pipeline in JsonCssExtractionStrategy for chained extraction like ["attribute", "regex"] (#1290) - Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting for Unicode preservation in JSON output (#1668)	2026-03-07 08:45:11 +00:00
unclecode	761664d29e	fix: add TTL expiry for Redis task data to prevent memory growth (#1730 ) From PR #1730 by @hoi	2026-03-07 06:17:58 +00:00
unclecode	c0912f7234	feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker domains and CSS resources at the browser context level. Refactor crawler pool with release_crawler() and active_requests tracking to prevent janitor from closing browsers with in-flight requests. Add proper finally blocks to all Docker API/server handlers. Update docs for new config options. Inspired by #1689.	2026-02-25 07:12:28 +00:00
Soham Kukreti	7a133e22cc	feat: make LLM backoff configurable end-to-end - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides	2025-11-28 18:50:04 +05:30
ntohidi	89cc29fe44	Merge branch 'fix/docker' into develop	2025-11-12 17:06:31 +01:00
ntohidi	c7b7475b92	fix: remove duplicate comma in webhook_config parameter	2025-10-22 13:12:42 +02:00
ntohidi	b71d624168	Merge branch 'implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp' into develop	2025-10-22 13:12:25 +02:00
ntohidi	d670dcde0a	feat: add webhook support for /llm/job endpoint Add comprehensive webhook notification support for the /llm/job endpoint, following the same pattern as the existing /crawl/job implementation. Changes: - Add webhook_config field to LlmJobPayload model (job.py) - Implement webhook notifications in process_llm_extraction() with 4 notification points: success, provider validation failure, extraction failure, and general exceptions (api.py) - Store webhook_config in Redis task data for job tracking - Initialize WebhookDeliveryService with exponential backoff retry logic Documentation: - Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks - Update Flask webhook handler to support both crawl and llm_extraction tasks - Add TypeScript client examples for LLM jobs - Add comprehensive examples to docker_webhook_example.py with schema support - Clarify data structure differences between webhook and API responses Testing: - Add test_llm_webhook_feature.py with 7 validation tests (all passing) - Verify pattern consistency with /crawl/job implementation - Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)	2025-10-22 13:03:09 +02:00
Claude	8a37710313	feat: add webhook notifications for crawl job completion Implements webhook support for the crawl job API to eliminate polling requirements. Changes: - Added WebhookConfig and WebhookPayload schemas to schemas.py - Created webhook.py with WebhookDeliveryService class - Integrated webhook notifications in api.py handle_crawl_job - Updated job.py CrawlJobPayload to accept webhook_config - Added webhook configuration section to config.yml - Included comprehensive usage examples in WEBHOOK_EXAMPLES.md Features: - Webhook notifications on job completion (success/failure) - Configurable data inclusion in webhook payload - Custom webhook headers support - Global default webhook URL configuration - Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s) - 30-second timeout per webhook call Usage: POST /crawl/job with optional webhook_config: - webhook_url: URL to receive notifications - webhook_data_in_payload: include full results (default: false) - webhook_headers: custom headers for authentication Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-21 16:17:40 +00:00
unclecode	aba4036ab6	Add demo and test scripts for monitor dashboard activity - Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity. - Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics. - Added a logo image to the static assets for branding purposes.	2025-10-17 22:43:06 +08:00
unclecode	b97eaeea4c	feat(docker): implement smart browser pool with 10x memory efficiency Major refactoring to eliminate memory leaks and enable high-scale crawling: - Smart 3-Tier Browser Pool: - Permanent browser (always-ready default config) - Hot pool (configs used 3+ times, longer TTL) - Cold pool (new/rare configs, short TTL) - Auto-promotion: cold → hot after 3 uses - 100% pool reuse achieved in tests - Container-Aware Memory Detection: - Read cgroup v1/v2 memory limits (not host metrics) - Accurate memory pressure detection in Docker - Memory-based browser creation blocking - Adaptive Janitor: - Dynamic cleanup intervals (10s/30s/60s based on memory) - Tiered TTLs: cold 30-300s, hot 120-600s - Aggressive cleanup at high memory pressure - Unified Pool Usage: - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm) - Fixed config signature mismatch (permanent browser matches endpoints) - get_default_browser_config() helper for consistency - Configuration: - Reduced idle_ttl: 1800s → 300s (30min → 5min) - Fixed port: 11234 → 11235 (match Gunicorn) Performance Results (from stress tests): - Memory: 10x reduction (500-700MB × N → 270MB permanent) - Latency: 30-50x faster (<100ms pool hits vs 3-5s startup) - Reuse: 100% for default config, 60%+ for variants - Capacity: 100+ concurrent requests (vs ~20 before) - Leak: 0 MB/cycle (stable across tests) Test Infrastructure: - 7-phase sequential test suite (tests/) - Docker stats integration + log analysis - Pool promotion verification - Memory leak detection - Full endpoint coverage Fixes memory issues reported in production deployments.	2025-10-17 20:38:39 +08:00
Martin Sjöborg	8d30662647	fix: remove this import as it causes python to treat "json" as a variable in the except block	2025-10-02 09:19:15 +02:00
ntohidi	fef715a891	Merge branch 'feature/docker-hooks' into develop	2025-09-25 14:11:46 +08:00
AHMET YILMAZ	a1950afd98	#1505 fix(api): update config handling to only set base config if not provided by user	2025-09-22 17:19:27 +08:00
Soham Kukreti	2ad3fb5fc8	feat(docker): improve docker error handling - Return comprehensive error messages along with status codes for api internal errors. - Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints - Add sanitization to ensure fit_html is always JSON-serializable (string or None) - Add comprehensive error handling test suite.	2025-08-26 23:18:35 +05:30
ntohidi	159207b86f	feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035 Implement hierarchical configuration for LLM parameters with support for: - Temperature control (0.0-2.0) to adjust response creativity - Custom base_url for proxy servers and alternative endpoints - 4-tier priority: request params > provider env > global env > defaults Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.	2025-08-26 16:44:07 +08:00
ntohidi	95051020f4	fix(docker): Fix LLM API key handling for multi-provider support Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers due to a hardcoded api_key_env fallback in config.yml. This caused authentication errors when using non-OpenAI providers like Gemini. Changes: - Remove api_key_env from config.yml to let litellm handle provider-specific env vars - Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys - Update validate_llm_provider() to trust litellm's built-in key detection - Update documentation to reflect the new automatic key handling The fix leverages litellm's existing capability to automatically find the correct environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.) without manual configuration. ref #1291	2025-08-21 14:01:04 +08:00
Nasrin	ef174a4c7a	Merge pull request #1104 from emmanuel-ferdman/main fix(docker-api): migrate to modern datetime library API	2025-08-20 10:57:39 +08:00
Soham Kukreti	f30811b524	fix: Check for raw: and raw:// URLs before auto-appending https:// prefix - Add raw HTML URL validation alongside http/https checks - Fix URL preprocessing logic to handle raw: and raw:// prefixes - Update error message and add comprehensive test cases	2025-08-11 22:10:53 +05:30
ntohidi	be63c98db3	feat(docker): add user-provided hooks support to Docker API Implements comprehensive hooks functionality allowing users to provide custom Python functions as strings that execute at specific points in the crawling pipeline. Key Features: - Support for all 8 crawl4ai hook points: • on_browser_created: Initialize browser settings • on_page_context_created: Configure page context • before_goto: Pre-navigation setup • after_goto: Post-navigation processing • on_user_agent_updated: User agent modification handling • on_execution_started: Crawl execution initialization • before_retrieve_html: Pre-extraction processing • before_return_html: Final HTML processing Implementation Details: - Created UserHookManager for validation, compilation, and safe execution - Added IsolatedHookWrapper for error isolation and timeout protection - AST-based validation ensures code structure correctness - Sandboxed execution with restricted builtins for security - Configurable timeout (1-120 seconds) prevents infinite loops - Comprehensive error handling ensures hooks don't crash main process - Execution tracking with detailed statistics and logging API Changes: - Added HookConfig schema with code and timeout fields - Extended CrawlRequest with optional hooks parameter - Added /hooks/info endpoint for hook discovery - Updated /crawl and /crawl/stream endpoints to support hooks Safety Features: - Malformed hooks return clear validation errors - Hook errors are isolated and reported without stopping crawl - Execution statistics track success/failure/timeout rates - All hook results are JSON-serializable Testing: - Comprehensive test suite covering all 8 hooks - Error handling and timeout scenarios validated - Authentication, performance, and content extraction examples - 100% success rate in production testing Documentation: - Added extensive hooks section to docker-deployment.md - Security warnings about user-provided code risks - Real-world examples using httpbin.org, GitHub, BBC - Best practices and troubleshooting guide ref #1377	2025-08-11 13:25:17 +08:00
ntohidi	ff6ea41ac3	feat(docker): add flexible LLM provider configuration - Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini) - Add optional 'provider' parameter to API endpoints for per-request overrides - Implement provider validation to ensure API keys exist - Update documentation and examples with new configuration options Closes the need to hardcode providers in config.yml	2025-08-05 14:09:54 +08:00
Emmanuel Ferdman	8e3c411a3e	Merge branch 'main' into main	2025-07-29 14:05:35 +03:00
ntohidi	1b6a31f88f	fix: encode PDF results to base64 in /crawl endpoint. ref #1301	2025-07-23 13:52:18 +02:00
ntohidi	b55e27d2ef	fix: chanegd error variable name handle_crawl_request, docker api	2025-05-26 11:08:23 +02:00
Emmanuel Ferdman	1e1c887a2f	fix(docker-api): migrate to modern datetime library API Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2025-05-13 00:04:58 -07:00
UncleCode	94e9959fe0	feat(docker-api): add job-based polling endpoints for crawl and LLM tasks Implements new asynchronous endpoints for handling long-running crawl and LLM tasks: - POST /crawl/job and GET /crawl/job/{task_id} for crawl operations - POST /llm/job and GET /llm/job/{task_id} for LLM operations - Added Redis-based task management with configurable TTL - Moved schema definitions to dedicated schemas.py - Added example polling client demo_docker_polling.py This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.	2025-05-01 21:24:52 +08:00
unclecode	f3ebb38edf	Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md	2025-04-22 14:56:47 +08:00
UncleCode	a58c8000aa	refactor(server): migrate to pool-based crawler management Replace crawler_manager.py with simpler crawler_pool.py implementation: - Add global page semaphore for hard concurrency cap - Implement browser pool with idle cleanup - Add playground UI for testing and stress testing - Update API handlers to use pooled crawlers - Enhance logging levels and symbols BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach	2025-04-20 20:14:26 +08:00
Aravind Karnam	b27bb367e8	merge next. Resolve conflicts. Fix some import errors and error handling in server.py	2025-04-19 20:27:47 +05:30
UncleCode	16b2318242	feat(api): implement crawler pool manager for improved resource handling Adds a new CrawlerManager class to handle browser instance pooling and failover: - Implements auto-scaling based on system resources - Adds primary/backup crawler management - Integrates memory monitoring and throttling - Adds streaming support with memory tracking - Updates API endpoints to use pooled crawlers BREAKING CHANGE: API endpoints now require CrawlerManager initialization	2025-04-18 22:26:24 +08:00
Aravind Karnam	eed7f88f29	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-04-17 10:50:02 +05:30
UncleCode	3179d6ad0c	fix(core): improve error handling and stability in core components Enhance error handling and stability across multiple components: - Add safety checks in async_configs.py for type and params existence - Fix browser manager initialization and cleanup logic - Add default LLM config fallback in extraction strategy - Add comprehensive Docker deployment guide and server tests BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances	2025-04-11 20:58:39 +08:00
Aravind Karnam	84883be513	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-18 15:12:21 +05:30
UncleCode	7884a98be7	feat(crawler): add experimental parameters support and optimize browser handling Add experimental parameters dictionary to CrawlerRunConfig to support beta features Make CSP nonce headers optional via experimental config Remove default cookie injection Clean up browser context creation code Improve code formatting in API handler BREAKING CHANGE: Default cookie injection has been removed from page initialization	2025-03-14 14:39:24 +08:00
Aravind Karnam	a3954dd4c6	refactor: Move the checking of protocol and prepending protocol inside api handlers	2025-03-14 09:39:10 +05:30
UncleCode	6e3c048328	feat(api): refactor crawl request handling to streamline single and multiple URL processing	2025-03-13 22:30:38 +08:00
UncleCode	b750542e6d	feat(crawler): optimize single URL handling and add performance comparison Add special handling for single URL requests in Docker API to use arun() instead of arun_many() Add new example script demonstrating performance differences between sequential and parallel crawling Update cache mode from aggressive to bypass in examples and tests Remove unused dependencies (zstandard, msgpack) BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples	2025-03-13 22:15:15 +08:00
UncleCode	baee4949d3	refactor(llm): rename LlmConfig to LLMConfig for consistency Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions. Update all imports and usages to use the new name. Update documentation and examples to reflect the change. BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.	2025-03-05 14:17:04 +08:00
Aravind	a9e24307cc	Release prep (#749 ) * fix: Update export of URLPatternFilter * chore: Add dependancy for cchardet in requirements * docs: Update example for deep crawl in release note for v0.5 * Docs: update the example for memory dispatcher * docs: updated example for crawl strategies * Refactor: Removed wrapping in if __name__==main block since this is a markdown file. * chore: removed cchardet from dependancy list, since unclecode is planning to remove it * docs: updated the example for proxy rotation to a working example * feat: Introduced ProxyConfig param * Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1 * chore: update and test new dependancies * feat:Make PyPDF2 a conditional dependancy * updated tutorial and release note for v0.5 * docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename * refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult * fix: Bug in serialisation of markdown in acache_url * Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown * fix: remove deprecated markdown_v2 from docker * Refactor: remove deprecated fit_markdown and fit_html from result * refactor: fix cache retrieval for markdown as a string * chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown	2025-02-28 19:53:35 +08:00
UncleCode	392c923980	feat(docker): add JWT authentication and improve server architecture Add JWT token-based authentication to Docker server and client. Refactor server architecture for better code organization and error handling. Move Dockerfile to root deploy directory and update configuration. Add comprehensive documentation and examples. BREAKING CHANGE: Docker server now requires authentication by default. Endpoints require JWT tokens when security.jwt_enabled is true in config.	2025-02-18 22:07:13 +08:00
UncleCode	2864015469	feat(docker): implement supervisor and secure API endpoints Add supervisor configuration for managing Redis and Gunicorn processes Replace direct process management with supervisord Add secure and token-free API server variants Implement JWT authentication for protected endpoints Update datetime handling in async dispatcher Add email domain verification BREAKING CHANGE: Server startup now uses supervisord instead of direct process management	2025-02-17 20:31:20 +08:00
UncleCode	04bc643cec	feat(api): improve cache handling and add API tests Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure results are still cached for future use. Also adds error handling for non-JSON LLM responses and comprehensive API test suite. - Changes default cache fallback from BYPASS to WRITE_ONLY - Adds error handling for LLM JSON parsing - Introduces new test suite for API endpoints	2025-02-02 20:53:31 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00

47 Commits