Commit Graph

47 Commits

Author SHA1 Message Date
unclecode
fcaf08b3b3 merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7 2026-06-01 12:40:37 +00:00
unclecode
f77c0a856f fix(security): SSRF protection on all crawl/md/llm URL entry points
Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints
pass user URLs to crawler.arun() with no private IP validation.

- Add validate_url_destination() to utils.py with opt-out via
  CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need
  to crawl internal services.
- Integrate into validate_url_scheme() (covers all server.py endpoints).
- Add validation at all 4 URL entry points in api.py (handle_llm_qa,
  handle_markdown_request, create_new_task, handle_crawl_request).
- raw: URLs bypass check (inline HTML, no network fetch).
- 16 adversarial + source coverage tests added.
- secsys_codex added to SECURITY-CREDITS.md.

DO NOT PUSH until release day.
2026-04-20 09:42:43 +00:00
hafezparast
8995c1bbd6 feat: expose arun_many config-list support in Docker API (#1837)
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:56:53 +08:00
hafezparast
480d938f67 fix: /llm per-request provider override, Redis config from host/port/password (#1611, #1817)
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
  provider, temperature, base_url query params with fallback to server config.
  Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
  URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
  REDIS_PASSWORD environment variable overrides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:53:04 +08:00
unclecode
7c0cc3ed88 fix: batch merge of community PRs (#1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668)
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)

Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)

CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
  build-push-action v6, setup-buildx v4, login v4 (#1734)

Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
  extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
  for Unicode preservation in JSON output (#1668)
2026-03-07 08:45:11 +00:00
unclecode
761664d29e fix: add TTL expiry for Redis task data to prevent memory growth (#1730)
From PR #1730 by @hoi
2026-03-07 06:17:58 +00:00
unclecode
c0912f7234 feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.

Inspired by #1689.
2026-02-25 07:12:28 +00:00
Soham Kukreti
7a133e22cc feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
ntohidi
89cc29fe44 Merge branch 'fix/docker' into develop 2025-11-12 17:06:31 +01:00
ntohidi
c7b7475b92 fix: remove duplicate comma in webhook_config parameter 2025-10-22 13:12:42 +02:00
ntohidi
b71d624168 Merge branch 'implement-webhook-crawl-feature-011CULZY1Jy8N5MUkZqXkRVp' into develop 2025-10-22 13:12:25 +02:00
ntohidi
d670dcde0a feat: add webhook support for /llm/job endpoint
Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.

Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
  notification points: success, provider validation failure, extraction
  failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses

Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)
2025-10-22 13:03:09 +02:00
Claude
8a37710313 feat: add webhook notifications for crawl job completion
Implements webhook support for the crawl job API to eliminate polling requirements.

Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md

Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call

Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication

Generated with Claude Code https://claude.com/claude-code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-21 16:17:40 +00:00
unclecode
aba4036ab6 Add demo and test scripts for monitor dashboard activity
- Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity.
- Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics.
- Added a logo image to the static assets for branding purposes.
2025-10-17 22:43:06 +08:00
unclecode
b97eaeea4c feat(docker): implement smart browser pool with 10x memory efficiency
Major refactoring to eliminate memory leaks and enable high-scale crawling:

- **Smart 3-Tier Browser Pool**:
  - Permanent browser (always-ready default config)
  - Hot pool (configs used 3+ times, longer TTL)
  - Cold pool (new/rare configs, short TTL)
  - Auto-promotion: cold → hot after 3 uses
  - 100% pool reuse achieved in tests

- **Container-Aware Memory Detection**:
  - Read cgroup v1/v2 memory limits (not host metrics)
  - Accurate memory pressure detection in Docker
  - Memory-based browser creation blocking

- **Adaptive Janitor**:
  - Dynamic cleanup intervals (10s/30s/60s based on memory)
  - Tiered TTLs: cold 30-300s, hot 120-600s
  - Aggressive cleanup at high memory pressure

- **Unified Pool Usage**:
  - All endpoints now use pool (/html, /screenshot, /pdf, /execute_js, /md, /llm)
  - Fixed config signature mismatch (permanent browser matches endpoints)
  - get_default_browser_config() helper for consistency

- **Configuration**:
  - Reduced idle_ttl: 1800s → 300s (30min → 5min)
  - Fixed port: 11234 → 11235 (match Gunicorn)

**Performance Results** (from stress tests):
- Memory: 10x reduction (500-700MB × N → 270MB permanent)
- Latency: 30-50x faster (<100ms pool hits vs 3-5s startup)
- Reuse: 100% for default config, 60%+ for variants
- Capacity: 100+ concurrent requests (vs ~20 before)
- Leak: 0 MB/cycle (stable across tests)

**Test Infrastructure**:
- 7-phase sequential test suite (tests/)
- Docker stats integration + log analysis
- Pool promotion verification
- Memory leak detection
- Full endpoint coverage

Fixes memory issues reported in production deployments.
2025-10-17 20:38:39 +08:00
Martin Sjöborg
8d30662647 fix: remove this import as it causes python to treat "json" as a variable in the except block 2025-10-02 09:19:15 +02:00
ntohidi
fef715a891 Merge branch 'feature/docker-hooks' into develop 2025-09-25 14:11:46 +08:00
AHMET YILMAZ
a1950afd98 #1505 fix(api): update config handling to only set base config if not provided by user 2025-09-22 17:19:27 +08:00
Soham Kukreti
2ad3fb5fc8 feat(docker): improve docker error handling
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.
2025-08-26 23:18:35 +05:30
ntohidi
159207b86f feat(docker): Add temperature and base_url parameters for LLM configuration. ref #1035
Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.
2025-08-26 16:44:07 +08:00
ntohidi
95051020f4 fix(docker): Fix LLM API key handling for multi-provider support
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291
2025-08-21 14:01:04 +08:00
Nasrin
ef174a4c7a Merge pull request #1104 from emmanuel-ferdman/main
fix(docker-api): migrate to modern datetime library API
2025-08-20 10:57:39 +08:00
Soham Kukreti
f30811b524 fix: Check for raw: and raw:// URLs before auto-appending https:// prefix
- Add raw HTML URL validation alongside http/https checks
- Fix URL preprocessing logic to handle raw: and raw:// prefixes
- Update error message and add comprehensive test cases
2025-08-11 22:10:53 +05:30
ntohidi
be63c98db3 feat(docker): add user-provided hooks support to Docker API
Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
2025-08-11 13:25:17 +08:00
ntohidi
ff6ea41ac3 feat(docker): add flexible LLM provider configuration
- Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini)
- Add optional 'provider' parameter to API endpoints for per-request overrides
- Implement provider validation to ensure API keys exist
- Update documentation and examples with new configuration options

Closes the need to hardcode providers in config.yml
2025-08-05 14:09:54 +08:00
Emmanuel Ferdman
8e3c411a3e Merge branch 'main' into main 2025-07-29 14:05:35 +03:00
ntohidi
1b6a31f88f fix: encode PDF results to base64 in /crawl endpoint. ref #1301 2025-07-23 13:52:18 +02:00
ntohidi
b55e27d2ef fix: chanegd error variable name handle_crawl_request, docker api 2025-05-26 11:08:23 +02:00
Emmanuel Ferdman
1e1c887a2f fix(docker-api): migrate to modern datetime library API
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-13 00:04:58 -07:00
UncleCode
94e9959fe0 feat(docker-api): add job-based polling endpoints for crawl and LLM tasks
Implements new asynchronous endpoints for handling long-running crawl and LLM tasks:
- POST /crawl/job and GET /crawl/job/{task_id} for crawl operations
- POST /llm/job and GET /llm/job/{task_id} for LLM operations
- Added Redis-based task management with configurable TTL
- Moved schema definitions to dedicated schemas.py
- Added example polling client demo_docker_polling.py

This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.
2025-05-01 21:24:52 +08:00
unclecode
f3ebb38edf Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md 2025-04-22 14:56:47 +08:00
UncleCode
a58c8000aa refactor(server): migrate to pool-based crawler management
Replace crawler_manager.py with simpler crawler_pool.py implementation:
- Add global page semaphore for hard concurrency cap
- Implement browser pool with idle cleanup
- Add playground UI for testing and stress testing
- Update API handlers to use pooled crawlers
- Enhance logging levels and symbols

BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach
2025-04-20 20:14:26 +08:00
Aravind Karnam
b27bb367e8 merge next. Resolve conflicts. Fix some import errors and error handling in server.py 2025-04-19 20:27:47 +05:30
UncleCode
16b2318242 feat(api): implement crawler pool manager for improved resource handling
Adds a new CrawlerManager class to handle browser instance pooling and failover:
- Implements auto-scaling based on system resources
- Adds primary/backup crawler management
- Integrates memory monitoring and throttling
- Adds streaming support with memory tracking
- Updates API endpoints to use pooled crawlers

BREAKING CHANGE: API endpoints now require CrawlerManager initialization
2025-04-18 22:26:24 +08:00
Aravind Karnam
eed7f88f29 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-04-17 10:50:02 +05:30
UncleCode
3179d6ad0c fix(core): improve error handling and stability in core components
Enhance error handling and stability across multiple components:
- Add safety checks in async_configs.py for type and params existence
- Fix browser manager initialization and cleanup logic
- Add default LLM config fallback in extraction strategy
- Add comprehensive Docker deployment guide and server tests

BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances
2025-04-11 20:58:39 +08:00
Aravind Karnam
84883be513 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-18 15:12:21 +05:30
UncleCode
7884a98be7 feat(crawler): add experimental parameters support and optimize browser handling
Add experimental parameters dictionary to CrawlerRunConfig to support beta features
Make CSP nonce headers optional via experimental config
Remove default cookie injection
Clean up browser context creation code
Improve code formatting in API handler

BREAKING CHANGE: Default cookie injection has been removed from page initialization
2025-03-14 14:39:24 +08:00
Aravind Karnam
a3954dd4c6 refactor: Move the checking of protocol and prepending protocol inside api handlers 2025-03-14 09:39:10 +05:30
UncleCode
6e3c048328 feat(api): refactor crawl request handling to streamline single and multiple URL processing 2025-03-13 22:30:38 +08:00
UncleCode
b750542e6d feat(crawler): optimize single URL handling and add performance comparison
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)

BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
2025-03-13 22:15:15 +08:00
UncleCode
baee4949d3 refactor(llm): rename LlmConfig to LLMConfig for consistency
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.

BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
2025-03-05 14:17:04 +08:00
Aravind
a9e24307cc Release prep (#749)
* fix: Update export of URLPatternFilter

* chore: Add dependancy for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.

* chore: removed cchardet from dependancy list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependancies

* feat:Make PyPDF2 a conditional dependancy

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
2025-02-28 19:53:35 +08:00
UncleCode
392c923980 feat(docker): add JWT authentication and improve server architecture
Add JWT token-based authentication to Docker server and client.
Refactor server architecture for better code organization and error handling.
Move Dockerfile to root deploy directory and update configuration.
Add comprehensive documentation and examples.

BREAKING CHANGE: Docker server now requires authentication by default.
Endpoints require JWT tokens when security.jwt_enabled is true in config.
2025-02-18 22:07:13 +08:00
UncleCode
2864015469 feat(docker): implement supervisor and secure API endpoints
Add supervisor configuration for managing Redis and Gunicorn processes
Replace direct process management with supervisord
Add secure and token-free API server variants
Implement JWT authentication for protected endpoints
Update datetime handling in async dispatcher
Add email domain verification

BREAKING CHANGE: Server startup now uses supervisord instead of direct process management
2025-02-17 20:31:20 +08:00
UncleCode
04bc643cec feat(api): improve cache handling and add API tests
Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure
results are still cached for future use. Also adds error handling for non-JSON
LLM responses and comprehensive API test suite.

- Changes default cache fallback from BYPASS to WRITE_ONLY
- Adds error handling for LLM JSON parsing
- Introduces new test suite for API endpoints
2025-02-02 20:53:31 +08:00
UncleCode
33a21d6a7a refactor(docker): improve server architecture and configuration
Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64

BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure
2025-02-02 20:19:51 +08:00