0.8.8's SSRF check validated the crawl target URL but not the proxy address, so
an unauthenticated /crawl, /crawl/stream, or /crawl/job could route the browser
through a proxy pointing at an internal IP and reach internal services / cloud
metadata. Reported by Geo (geo-chen).
Fix (backward compatible): validate every proxy destination with the same
not-is_global check used for crawl URLs, before the browser is built -
browser_config.proxy, browser_config.proxy_config.server,
crawler_config.proxy_config.server - and strip proxy/DNS-redirecting flags
(--proxy-server / --proxy-pac-url / --proxy-bypass-list / --host-resolver-rules)
from extra_args. A legitimate public proxy still works; configure proxies via
proxy_config (validated), not raw extra_args flags. _enforce_proxy_safety is
called in both crawl handlers (and covers /crawl/job transitively); HTTPException
passthrough added so the 400 is not masked as a 500.
Bump 0.8.8 -> 0.8.9 (__version__ + Dockerfile). 20 new tests; full security
suite 161 pass. Changelog, release blog, README, SECURITY-CREDITS updated.
This vector was already fixed in the upcoming secure-by-default release; 0.8.9
brings it forward because it is an unauthenticated SSRF.
Backward-compatible fixes for the Docker server - features keep working, only
the unsafe behavior is closed. (The secure-by-default redesign is the later
major.)
- SSRF: replace the explicit blocklist with the one rule (reject any resolved
IP where not ip.is_global) evaluated on embedded IPv4 transition forms too,
closing the gaps - IPv6 unspecified ::, NAT64 64:ff9b::/96, 6to4 2002::/16,
v4-mapped. Error messages are now opaque (no resolved-IP leak).
- output_path arbitrary write: harden validate_output_path with realpath
containment (defeats a symlinked path component) and write via O_NOFOLLOW
(write_output_file). output_path stays supported.
- LLM base_url key exfil: ignore a request-supplied base_url in /md, /llm,
/llm/job; the endpoint is always server-derived. Field still accepted (no
4xx) for compatibility.
- env:SECRET_KEY exfil gadget: LLMConfig refuses env: resolution of protected
names (SECRET/PASSWORD/PRIVATE substrings, CRAWL4AI*/AWS_SECRET* prefixes,
SECRET_KEY/REDIS_PASSWORD/TOKEN). Normal provider keys (OPENAI_API_KEY, ...)
unaffected.
- CRLF log injection: CRLFSafeFilter strips CR/LF/control from log records.
- Webhook header injection: sanitize_webhook_headers (name pattern, no control
chars, deny hop-by-hop/sensitive) at send time + a WebhookConfig validator
for early 422.
Bump 0.8.7 -> 0.8.8 (__version__ + Dockerfile C4AI_VER). 30 new behavioral
tests; existing 111 security tests + 112 library config tests still pass.
NOT included (breaking -> deferred to the major): auth-by-default, trust
boundary, declarative hooks, output_path removal, base_url/provider removal,
loopback bind, redis password, TLS-verify-on, CORS, bounded queue. The
exec-hook RCE and unauth-by-default criticals have no non-breaking fix and are
closed only in the major (hooks are already off by default).
Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints
pass user URLs to crawler.arun() with no private IP validation.
- Add validate_url_destination() to utils.py with opt-out via
CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need
to crawl internal services.
- Integrate into validate_url_scheme() (covers all server.py endpoints).
- Add validation at all 4 URL entry points in api.py (handle_llm_qa,
handle_markdown_request, create_new_task, handle_crawl_request).
- raw: URLs bypass check (inline HTML, no network fetch).
- 16 adversarial + source coverage tests added.
- secsys_codex added to SECURITY-CREDITS.md.
DO NOT PUSH until release day.
The /crawl endpoint now accepts an optional crawler_configs field
(list of CrawlerRunConfig dicts) alongside the existing crawler_config.
When provided with multiple URLs, each config is deserialized and passed
as a list to arun_many(), enabling per-URL configuration with url_matcher
patterns. Single-URL requests and requests without crawler_configs are
unchanged (backward compatible).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- #1611: /llm GET endpoint hardcoded server's LLM_PROVIDER. Added optional
provider, temperature, base_url query params with fallback to server config.
Consistent with /md and /llm/job endpoints.
- #1817: Redis connection used non-existent config["redis"]["uri"]. Now builds
URL from host/port/password/db/ssl config fields with REDIS_HOST, REDIS_PORT,
REDIS_PASSWORD environment variable overrides.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug fixes:
- Verify redirect targets are alive before returning from URL seeder (#1622)
- Wire mean_delay/max_range from CrawlerRunConfig into dispatcher rate limiter (#1786)
- Use DOMParser instead of innerHTML in process_iframes to prevent XSS (#1796)
Security/Docker:
- Require api_token for /token endpoint when configured (#1795)
- Deep-crawl streaming now mirrors Python library behavior via arun() (#1798)
CI:
- Bump GitHub Actions to latest versions - checkout v6, setup-python v6,
build-push-action v6, setup-buildx v4, login v4 (#1734)
Features:
- Support type-list pipeline in JsonCssExtractionStrategy for chained
extraction like ["attribute", "regex"] (#1290)
- Add --json-ensure-ascii CLI flag and JSON_ENSURE_ASCII config setting
for Unicode preservation in JSON output (#1668)
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.
Inspired by #1689.
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
through LLMExtractionStrategy, LLMContentFilter, table extraction, and
Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
and document them in the md_v2 guides
Add comprehensive webhook notification support for the /llm/job endpoint,
following the same pattern as the existing /crawl/job implementation.
Changes:
- Add webhook_config field to LlmJobPayload model (job.py)
- Implement webhook notifications in process_llm_extraction() with 4
notification points: success, provider validation failure, extraction
failure, and general exceptions (api.py)
- Store webhook_config in Redis task data for job tracking
- Initialize WebhookDeliveryService with exponential backoff retry logic
Documentation:
- Add Example 6 to WEBHOOK_EXAMPLES.md showing LLM extraction with webhooks
- Update Flask webhook handler to support both crawl and llm_extraction tasks
- Add TypeScript client examples for LLM jobs
- Add comprehensive examples to docker_webhook_example.py with schema support
- Clarify data structure differences between webhook and API responses
Testing:
- Add test_llm_webhook_feature.py with 7 validation tests (all passing)
- Verify pattern consistency with /crawl/job implementation
- Add implementation guide (WEBHOOK_LLM_JOB_IMPLEMENTATION.md)
Implements webhook support for the crawl job API to eliminate polling requirements.
Changes:
- Added WebhookConfig and WebhookPayload schemas to schemas.py
- Created webhook.py with WebhookDeliveryService class
- Integrated webhook notifications in api.py handle_crawl_job
- Updated job.py CrawlJobPayload to accept webhook_config
- Added webhook configuration section to config.yml
- Included comprehensive usage examples in WEBHOOK_EXAMPLES.md
Features:
- Webhook notifications on job completion (success/failure)
- Configurable data inclusion in webhook payload
- Custom webhook headers support
- Global default webhook URL configuration
- Exponential backoff retry logic (5 attempts: 1s, 2s, 4s, 8s, 16s)
- 30-second timeout per webhook call
Usage:
POST /crawl/job with optional webhook_config:
- webhook_url: URL to receive notifications
- webhook_data_in_payload: include full results (default: false)
- webhook_headers: custom headers for authentication
Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude <noreply@anthropic.com>
- Introduced a demo script (`demo_monitor_dashboard.py`) to showcase various monitoring features through simulated activity.
- Implemented a test script (`test_monitor_demo.py`) to generate dashboard activity and verify monitor health and endpoint statistics.
- Added a logo image to the static assets for branding purposes.
- Return comprehensive error messages along with status codes for api internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.
Implement hierarchical configuration for LLM parameters with support for:
- Temperature control (0.0-2.0) to adjust response creativity
- Custom base_url for proxy servers and alternative endpoints
- 4-tier priority: request params > provider env > global env > defaults
Add helper functions in utils.py, update API schemas and handlers,
support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
and provide comprehensive documentation with examples.
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.
Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling
The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.
ref #1291
- Add raw HTML URL validation alongside http/https checks
- Fix URL preprocessing logic to handle raw: and raw:// prefixes
- Update error message and add comprehensive test cases
- Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini)
- Add optional 'provider' parameter to API endpoints for per-request overrides
- Implement provider validation to ensure API keys exist
- Update documentation and examples with new configuration options
Closes the need to hardcode providers in config.yml
Implements new asynchronous endpoints for handling long-running crawl and LLM tasks:
- POST /crawl/job and GET /crawl/job/{task_id} for crawl operations
- POST /llm/job and GET /llm/job/{task_id} for LLM operations
- Added Redis-based task management with configurable TTL
- Moved schema definitions to dedicated schemas.py
- Added example polling client demo_docker_polling.py
This change allows clients to handle long-running operations asynchronously through a polling pattern rather than holding connections open.
Replace crawler_manager.py with simpler crawler_pool.py implementation:
- Add global page semaphore for hard concurrency cap
- Implement browser pool with idle cleanup
- Add playground UI for testing and stress testing
- Update API handlers to use pooled crawlers
- Enhance logging levels and symbols
BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach
Adds a new CrawlerManager class to handle browser instance pooling and failover:
- Implements auto-scaling based on system resources
- Adds primary/backup crawler management
- Integrates memory monitoring and throttling
- Adds streaming support with memory tracking
- Updates API endpoints to use pooled crawlers
BREAKING CHANGE: API endpoints now require CrawlerManager initialization
Enhance error handling and stability across multiple components:
- Add safety checks in async_configs.py for type and params existence
- Fix browser manager initialization and cleanup logic
- Add default LLM config fallback in extraction strategy
- Add comprehensive Docker deployment guide and server tests
BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances
Add experimental parameters dictionary to CrawlerRunConfig to support beta features
Make CSP nonce headers optional via experimental config
Remove default cookie injection
Clean up browser context creation code
Improve code formatting in API handler
BREAKING CHANGE: Default cookie injection has been removed from page initialization
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)
BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.
BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
* fix: Update export of URLPatternFilter
* chore: Add dependancy for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* Docs: update the example for memory dispatcher
* docs: updated example for crawl strategies
* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.
* chore: removed cchardet from dependancy list, since unclecode is planning to remove it
* docs: updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1
* chore: update and test new dependancies
* feat:Make PyPDF2 a conditional dependancy
* updated tutorial and release note for v0.5
* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename
* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown
* fix: remove deprecated markdown_v2 from docker
* Refactor: remove deprecated fit_markdown and fit_html from result
* refactor: fix cache retrieval for markdown as a string
* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
Add JWT token-based authentication to Docker server and client.
Refactor server architecture for better code organization and error handling.
Move Dockerfile to root deploy directory and update configuration.
Add comprehensive documentation and examples.
BREAKING CHANGE: Docker server now requires authentication by default.
Endpoints require JWT tokens when security.jwt_enabled is true in config.
Add supervisor configuration for managing Redis and Gunicorn processes
Replace direct process management with supervisord
Add secure and token-free API server variants
Implement JWT authentication for protected endpoints
Update datetime handling in async dispatcher
Add email domain verification
BREAKING CHANGE: Server startup now uses supervisord instead of direct process management
Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure
results are still cached for future use. Also adds error handling for non-JSON
LLM responses and comprehensive API test suite.
- Changes default cache fallback from BYPASS to WRITE_ONLY
- Adds error handling for LLM JSON parsing
- Introduces new test suite for API endpoints
Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64
BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure