Commit Graph

441 Commits

Author SHA1 Message Date
Nasrin
532b105fc7 Merge pull request #1975 from nightcityblade/fix/issue-1973
Thanks @nightcityblade for the quick fix! Clean and straightforward CSS change.
2026-05-25 11:59:25 +02:00
nightcityblade
791aae3a21 fix: allow assistant toolbar to scroll
Fixes unclecode/crawl4ai#1973
2026-05-20 23:07:47 +08:00
unclecode
0e92b5e239 docs: add Privacy Policy, Terms of Service, and Support pages
Add legal pages required for Google Workspace Marketplace listing
verification. Pages cover the whole Crawl4AI Cloud business (OSS
library, hosted API, dashboard, integrations, Workspace add-ons),
not specific to any single product.

- privacy.md: data collection, usage, retention, Workspace Limited Use
- terms.md: account, billing, acceptable use, IP, governing law (SG)
- support.md: email, docs, GitHub, Discord, security disclosure
2026-04-20 02:24:21 +00:00
ntohidi
bb6406a2d0 release: Crawl4AI v0.8.5
Bump version to 0.8.5 across all references (Dockerfile, README,
Docker README, blog index, __version__.py).

Add release notes, blog post, demo verification script (13 real-crawl
tests), and releases directory entry.

Key highlights:
- Anti-bot detection with 3-tier proxy escalation
- Shadow DOM flattening
- Deep crawl cancellation
- Config defaults API
- 60+ bug fixes and critical security patches
2026-03-16 18:46:05 +08:00
unclecode
3a75dd3f4c fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)
- #1520: Preserve trailing slashes in URL normalization (RFC 3986 compliance)
- #1489: Preserve query parameter key casing in normalize_url
- #1374: Close NamedTemporaryFile handle before reopening (Windows fix)
- #1424: Fix CosineStrategy returning empty results (delimiter fallback + at_least_k >= 1)
- #1183: Fix extract_xml_data regex matching tag names in prose text
- #1354: Make import_knowledge_base async (fix asyncio.run in running loop)
- #880: Fix 404 sample_ecommerce.html gist URL in docs (6 occurrences)
- #1031: Make Docker playground code editor resizable with overflow-auto
- #1251: Add DEFAULT_CONFIG with deep-merge in load_config to prevent KeyError crashes
- #1758: Change screenshot stitching format from BMP to PNG
2026-03-07 09:47:38 +00:00
unclecode
04e83aa3c7 docs: modernize deprecated API usage across shipped docs (#1770)
Update docs/examples to use current API:
- proxy → proxy_config in BrowserConfig
- result.fit_markdown → result.markdown.fit_markdown
- result.fit_html → result.markdown.fit_html
- markdown_v2 deprecation notes updated
- bypass_cache → cache_mode=CacheMode.BYPASS
- LLMExtractionStrategy now uses llm_config=LLMConfig(...)
- CrawlerConfig → CrawlerRunConfig
- cache_mode string values → CacheMode enum
- Fix missing CacheMode import in local-files.md
- Fix indentation in app-detail.html example
- Fix tautological cache mode descriptions in arun.md

From PR #1770 by @maksimzayats
2026-03-07 07:01:06 +00:00
unclecode
d6a8f57fdd docs: fix css_selector type from list to string in examples (#1308)
From PR #1308 by @dominicx
2026-03-07 06:15:14 +00:00
unclecode
e6c2a65625 docs: fix return type annotations to use RunManyReturn (#1716)
From PR #1716 by @YuriNachos
2026-03-07 06:14:49 +00:00
unclecode
5601861555 docs: add missing CacheMode import in quickstart example (#1715)
From PR #1715 by @YuriNachos
2026-03-07 06:13:32 +00:00
unclecode
a4cc0a9f04 feat: add separate query_llm_config for adaptive crawler query expansion (#1682)
The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
2026-02-25 12:26:39 +00:00
unclecode
c0912f7234 feat: add avoid_ads/avoid_css resource filtering and pool release lifecycle
Add opt-in BrowserConfig flags (avoid_ads, avoid_css) for blocking ad/tracker
domains and CSS resources at the browser context level. Refactor crawler pool
with release_crawler() and active_requests tracking to prevent janitor from
closing browsers with in-flight requests. Add proper finally blocks to all
Docker API/server handlers. Update docs for new config options.

Inspired by #1689.
2026-02-25 07:12:28 +00:00
Nasrin
d419199a4c Merge pull request #1775 from unclecode/fix/issue-1748-screenshot-scroll-delay
Fix/issue 1748 screenshot scroll delay
2026-02-25 05:54:24 +01:00
Ahmed-tawfik94
9cfeb4626d Document scroll_delay parameter for full-page screenshot crawling 2026-02-25 06:52:59 +03:00
unclecode
cbd36b74b2 Add stats dashboard page for LP summit
- Create scripts/update_stats.py: fetches GitHub, PyPI, Docker Hub data
  and generates docs/md_v2/stats.md with Chart.js visualizations
- Add stats.md with 5 metric cards and 5 charts (monthly downloads,
  star growth, cumulative downloads, daily trend, GitHub traffic)
- Add "Growth" nav entry in mkdocs.yml
- Update .gitignore to allow scripts/update_stats.py
2026-02-24 12:58:34 +00:00
unclecode
c9cb0160cf Add token usage tracking to generate_schema / agenerate_schema
generate_schema can make up to 5 internal LLM calls (field inference,
schema generation, validation retries) with no way to track token
consumption. Add an optional `usage: TokenUsage = None` parameter that
accumulates prompt/completion/total tokens across all calls in-place.

- _infer_target_json: accept and populate usage accumulator
- agenerate_schema: track usage after every aperform_completion call
  in the retry loop, forward usage to _infer_target_json
- generate_schema (sync): forward usage to agenerate_schema

Fully backward-compatible — omitting usage changes nothing.
2026-02-18 06:44:17 +00:00
unclecode
8576331d4e Add Shadow DOM flattening and reorder js_code execution pipeline
- Add `flatten_shadow_dom` option to CrawlerRunConfig that serializes
  shadow DOM content into the light DOM before HTML capture. Uses a
  recursive serializer that resolves <slot> projections and strips
  only shadow-scoped <style> tags. Also injects an init script to
  force-open closed shadow roots via attachShadow patching.

- Move `js_code` execution to after `wait_for` + `delay_before_return_html`
  so user scripts run on the fully-hydrated page. Add `js_code_before_wait`
  for the less common case of triggering loading before waiting.

- Add JS snippet (flatten_shadow_dom.js), integration test, example,
  and documentation across all relevant doc files.
2026-02-18 06:43:00 +00:00
unclecode
d267c650cb Add source (sibling selector) support to JSON extraction strategies
Many sites (e.g. Hacker News) split a single item's data across sibling
elements. Field selectors only search descendants, making sibling data
unreachable. The new "source" field key navigates to a sibling element
before running the selector: {"source": "+ tr"} finds the next sibling
<tr>, then extracts from there.

- Add _resolve_source abstract method to JsonElementExtractionStrategy
- Implement in all 4 subclasses (CSS/BS4, XPath/lxml, two lxml/CSS)
- Modify _extract_field to resolve source before type dispatch
- Update CSS and XPath LLM prompts with source docs and HN example
- Default generate_schema validate=True so schemas are checked on creation
- Add schema validation with feedback loop for auto-refinement
- Add messages param to completion helpers for multi-turn refinement
- Document source field and schema validation in docs
- Add 14 unit tests covering CSS, XPath, backward compat, edge cases
2026-02-17 09:04:40 +00:00
unclecode
879553955c Add ProxyConfig.DIRECT sentinel for direct-then-proxy escalation
Allow "direct" or None in proxy_config list to explicitly try
without a proxy before escalating to proxy servers. The retry
loop already handled None as direct — this exposes it as a
clean user-facing API via ProxyConfig.DIRECT.
2026-02-14 10:25:07 +00:00
unclecode
875207287e Unify proxy_config to accept list, add crawl_stats tracking
- proxy_config on CrawlerRunConfig now accepts a single ProxyConfig or
  a list of ProxyConfig tried in order (first-come-first-served)
- Remove is_fallback from ProxyConfig and fallback_proxy_configs from
  CrawlerRunConfig — proxy escalation handled entirely by list order
- Add _get_proxy_list() normalizer for the retry loop
- Add CrawlResult.crawl_stats with attempts, retries, proxies_used,
  fallback_fetch_used, and resolved_by for billing and observability
- Set success=False with error_message when all attempts are blocked
- Simplify retry loop — no more is_fallback stashing logic
- Update docs and tests to reflect new API
2026-02-14 07:53:46 +00:00
unclecode
72b546c48d Add anti-bot detection, retry, and fallback system
Automatically detect when crawls are blocked by anti-bot systems
(Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.) and
escalate through configurable retry and fallback strategies.

New features on CrawlerRunConfig:
- max_retries: retry rounds when blocking is detected
- fallback_proxy_configs: list of fallback proxies tried each round
- fallback_fetch_function: async last-resort function returning raw HTML

New field on ProxyConfig:
- is_fallback: skip proxy on first attempt, activate only when blocked

Escalation chain per round: main proxy → fallback proxies in order.
After all rounds: fallback_fetch_function as last resort.

Detection uses tiered heuristics — structural HTML markers (high
confidence) trigger on any page, generic patterns only on short
error pages to avoid false positives.
2026-02-14 05:24:07 +00:00
unclecode
3fc7730aaf Add remove_consent_popups flag and fix from_kwargs dict deserialization
Add CrawlerRunConfig.remove_consent_popups (bool, default False) that
targets GDPR/cookie consent popups from 70+ known CMP providers including
OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Usercentrics,
Sourcepoint, Google FundingChoices, and many more.

The JS strategy uses a 5-phase approach:
1. Click "Accept All" buttons (cleanest dismissal, sets cookies)
2. Try CMP JavaScript APIs (__tcfapi, Didomi, Cookiebot, Osano, Klaro)
3. Remove known CMP containers by selector (~120 selectors)
4. Handle iframe-based and shadow DOM CMPs
5. Restore body scroll and remove CMP body classes

Also fix from_kwargs() in CrawlerRunConfig and BrowserConfig to
auto-deserialize dict values using the existing from_serializable_dict()
infrastructure. Previously, strategy objects like markdown_generator
arriving as {"type": "DefaultMarkdownGenerator", "params": {...}} from
JSON APIs were passed through as raw dicts, causing crashes when the
crawler later called methods on them.
2026-02-11 12:46:47 +00:00
unclecode
fbc52813a4 Add tests, docs, and contributors for PRs #1463 and #1435
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
2026-02-06 09:30:19 +00:00
ntohidi
4e56f3e00d Add contributing guide and update mkdocs navigation for community resources 2026-02-03 09:46:54 +01:00
unclecode
5be0d2d75e Add contributor and docs for force_viewport_screenshot feature
- Add TheRedRad to CONTRIBUTORS.md for PR #1694
- Document force_viewport_screenshot in API parameters reference
- Add viewport screenshot note in browser-crawler-config guide
- Add viewport-only screenshot example in screenshot docs
2026-02-01 01:10:20 +00:00
unclecode
55a2cc8181 Document set_defaults/get_defaults/reset_defaults in config guides 2026-01-31 11:46:53 +00:00
unclecode
fbfbc6995c Fix deep crawl cancellation example to use DFS for precise control 2026-01-22 06:25:34 +00:00
unclecode
1e2b7fe7e6 Add documentation and example for deep crawl cancellation
- Add Section 11 "Cancellation Support for Deep Crawls" to deep-crawling.md
- Document should_cancel callback, cancel() method, and cancelled property
- Include complete example for cloud platform job cancellation
- Add docs/examples/deep_crawl_cancellation.py with 6 comprehensive examples
- Update summary section to mention cancellation feature
2026-01-22 06:10:54 +00:00
Nasrin
f6f7f1b551 Release v0.8.0: Crash Recovery, Prefetch Mode & Security Fixes (#1712)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* Add browser_context_id and target_id parameters to BrowserConfig

Enable Crawl4AI to connect to pre-created CDP browser contexts, which is
essential for cloud browser services that pre-create isolated contexts.

Changes:
- Add browser_context_id and target_id parameters to BrowserConfig
- Update from_kwargs() and to_dict() methods
- Modify BrowserManager.start() to use existing context when provided
- Add _get_page_by_target_id() helper method
- Update get_page() to handle pre-existing targets
- Add test for browser_context_id functionality

This enables cloud services to:
1. Create isolated CDP contexts before Crawl4AI connects
2. Pass context/target IDs to BrowserConfig
3. Have Crawl4AI reuse existing contexts instead of creating new ones

* Add cdp_cleanup_on_close flag to prevent memory leaks in cloud/server scenarios

* Fix: add cdp_cleanup_on_close to from_kwargs

* Fix: find context by target_id for concurrent CDP connections

* Fix: use target_id to find correct page in get_page

* Fix: use CDP to find context by browserContextId for concurrent sessions

* Revert context matching attempts - Playwright cannot see CDP-created contexts

* Add create_isolated_context flag for concurrent CDP crawls

When True, forces creation of a new browser context instead of reusing
the default context. Essential for concurrent crawls on the same browser
to prevent navigation conflicts.

* Add context caching to create_isolated_context branch

Uses contexts_by_config cache (same as non-CDP mode) to reuse contexts
for multiple URLs with same config. Still creates new page per crawl
for navigation isolation. Benefits batch/deep crawls.

* Add init_scripts support to BrowserConfig for pre-page-load JS injection

This adds the ability to inject JavaScript that runs before any page loads,
useful for stealth evasions (canvas/audio fingerprinting, userAgentData).

- Add init_scripts parameter to BrowserConfig (list of JS strings)
- Apply init_scripts in setup_context() via context.add_init_script()
- Update from_kwargs() and to_dict() for serialization

* Fix CDP connection handling: support WS URLs and proper cleanup

Changes to browser_manager.py:

1. _verify_cdp_ready(): Support multiple URL formats
   - WebSocket URLs (ws://, wss://): Skip HTTP verification, Playwright handles directly
   - HTTP URLs with query params: Properly parse with urlparse to preserve query string
   - Fixes issue where naive f"{cdp_url}/json/version" broke WS URLs and query params

2. close(): Proper cleanup when cdp_cleanup_on_close=True
   - Close all sessions (pages)
   - Close all contexts
   - Call browser.close() to disconnect (doesn't terminate browser, just releases connection)
   - Wait 1 second for CDP connection to fully release
   - Stop Playwright instance to prevent memory leaks

This enables:
- Connecting to specific browsers via WS URL
- Reusing the same browser with multiple sequential connections
- No user wait needed between connections (internal 1s delay handles it)

Added tests/browser/test_cdp_cleanup_reuse.py with comprehensive tests.

* Update gitignore

* Some debugging for caching

* Add _generate_screenshot_from_html for raw: and file:// URLs

Implements the missing method that was being called but never defined.
Now raw: and file:// URLs can generate screenshots by:
1. Loading HTML into a browser page via page.set_content()
2. Taking screenshot using existing take_screenshot() method
3. Cleaning up the page afterward

This enables cached HTML to be rendered with screenshots in crawl4ai-cloud.

* Add PDF and MHTML support for raw: and file:// URLs

- Replace _generate_screenshot_from_html with _generate_media_from_html
- New method handles screenshot, PDF, and MHTML in one browser session
- Update raw: and file:// URL handlers to use new method
- Enables cached HTML to generate all media types

* Add crash recovery for deep crawl strategies

Add optional resume_state and on_state_change parameters to all deep
crawl strategies (BFS, DFS, Best-First) for cloud deployment crash
recovery.

Features:
- resume_state: Pass saved state to resume from checkpoint
- on_state_change: Async callback fired after each URL for real-time
  state persistence to external storage (Redis, DB, etc.)
- export_state(): Get last captured state manually
- Zero overhead when features are disabled (None defaults)

State includes visited URLs, pending queue/stack, depths, and
pages_crawled count. All state is JSON-serializable.

* Fix: HTTP strategy raw: URL parsing truncates at # character

The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.

* Add base_url parameter to CrawlerRunConfig for raw HTML processing

When processing raw: HTML (e.g., from cache), the URL parameter is meaningless
for markdown link resolution. This adds a base_url parameter that can be set
explicitly to provide proper URL resolution context.

Changes:
- Add base_url parameter to CrawlerRunConfig.__init__
- Add base_url to CrawlerRunConfig.from_kwargs
- Update aprocess_html to use base_url for markdown generation

Usage:
  config = CrawlerRunConfig(base_url='https://example.com')
  result = await crawler.arun(url='raw:{html}', config=config)

* Add prefetch mode for two-phase deep crawling

- Add `prefetch` parameter to CrawlerRunConfig
- Add `quick_extract_links()` function for fast link extraction
- Add short-circuit in aprocess_html() for prefetch mode
- Add 42 tests (unit, integration, regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Updates on proxy rotation and proxy configuration

* Add proxy support to HTTP crawler strategy

* Add browser pipeline support for raw:/file:// URLs

- Add process_in_browser parameter to CrawlerRunConfig
- Route raw:/file:// URLs through _crawl_web() when browser operations needed
- Use page.set_content() instead of goto() for local content
- Fix cookie handling for non-HTTP URLs in browser_manager
- Auto-detect browser requirements: js_code, wait_for, screenshot, etc.
- Maintain fast path for raw:/file:// without browser params

Fixes #310

* Add smart TTL cache for sitemap URL seeder

- Add cache_ttl_hours and validate_sitemap_lastmod params to SeedingConfig
- New JSON cache format with metadata (version, created_at, lastmod, url_count)
- Cache validation by TTL expiry and sitemap lastmod comparison
- Auto-migration from old .jsonl to new .json format
- Fixes bug where incomplete cache was used indefinitely

* Update URL seeder docs with smart TTL cache parameters

- Add cache_ttl_hours and validate_sitemap_lastmod to parameter table
- Document smart TTL cache validation with examples
- Add cache-related troubleshooting entries
- Update key features summary

* Add MEMORY.md to gitignore

* Docs: Add multi-sample schema generation section

Add documentation explaining how to pass multiple HTML samples
to generate_schema() for stable selectors that work across pages
with varying DOM structures.

Includes:
- Problem explanation (fragile nth-child selectors)
- Solution with code example
- Key points for multi-sample queries
- Comparison table of fragile vs stable selectors

* Fix critical RCE and LFI vulnerabilities in Docker API deployment

Security fixes for vulnerabilities reported by ProjectDiscovery:

1. Remote Code Execution via Hooks (CVE pending)
   - Remove __import__ from allowed_builtins in hook_manager.py
   - Prevents arbitrary module imports (os, subprocess, etc.)
   - Hooks now disabled by default via CRAWL4AI_HOOKS_ENABLED env var

2. Local File Inclusion via file:// URLs (CVE pending)
   - Add URL scheme validation to /execute_js, /screenshot, /pdf, /html
   - Block file://, javascript:, data: and other dangerous schemes
   - Only allow http://, https://, and raw: (where appropriate)

3. Security hardening
   - Add CRAWL4AI_HOOKS_ENABLED=false as default (opt-in for hooks)
   - Add security warning comments in config.yml
   - Add validate_url_scheme() helper for consistent validation

Testing:
   - Add unit tests (test_security_fixes.py) - 16 tests
   - Add integration tests (run_security_tests.py) for live server

Affected endpoints:
   - POST /crawl (hooks disabled by default)
   - POST /crawl/stream (hooks disabled by default)
   - POST /execute_js (URL validation added)
   - POST /screenshot (URL validation added)
   - POST /pdf (URL validation added)
   - POST /html (URL validation added)

Breaking changes:
   - Hooks require CRAWL4AI_HOOKS_ENABLED=true to function
   - file:// URLs no longer work on API endpoints (use library directly)

* Enhance authentication flow by implementing JWT token retrieval and adding authorization headers to API requests

* Add release notes for v0.7.9, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

* Add release notes for v0.8.0, detailing breaking changes, security fixes, new features, bug fixes, and documentation updates

Documentation for v0.8.0 release:

- SECURITY.md: Security policy and vulnerability reporting guidelines
- RELEASE_NOTES_v0.8.0.md: Comprehensive release notes
- migration/v0.8.0-upgrade-guide.md: Step-by-step migration guide
- security/GHSA-DRAFT-RCE-LFI.md: GitHub security advisory drafts
- CHANGELOG.md: Updated with v0.8.0 changes

Breaking changes documented:
- Docker API hooks disabled by default (CRAWL4AI_HOOKS_ENABLED)
- file:// URLs blocked on Docker API endpoints

Security fixes credited to Neo by ProjectDiscovery

* Add examples for deep crawl crash recovery and prefetch mode in documentation

* Release v0.8.0: The v0.8.0 Update

- Updated version to 0.8.0
- Added comprehensive demo and release notes
- Updated all documentation

* Update security researcher acknowledgment with a hyperlink for Neo by ProjectDiscovery

* Add async agenerate_schema method for schema generation

- Extract prompt building to shared _build_schema_prompt() method
- Add agenerate_schema() async version using aperform_completion_with_backoff
- Refactor generate_schema() to use shared prompt builder
- Fixes Gemini/Vertex AI compatibility in async contexts (FastAPI)

* Fix: Enable litellm.drop_params for O-series/GPT-5 model compatibility

O-series (o1, o3) and GPT-5 models only support temperature=1.
Setting litellm.drop_params=True auto-drops unsupported parameters
instead of throwing UnsupportedParamsError.

Fixes temperature=0.01 error for these models in LLM extraction.

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: unclecode <unclecode@kidocode.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-17 14:19:15 +01:00
Nasrin
a87e8c1c9e Release/v0.7.8 (#1662)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* announcement: add application form for cloud API closed beta

* Release v0.7.8: Stability & Bug Fix Release

- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.

* docs: add section for Crawl4AI Cloud API closed beta with application link

* fix: add disk cleanup step to Docker workflow

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
2025-12-11 11:04:52 +01:00
Aravind
0024c82cdc Sponsors/new (#1637) 2025-11-24 13:29:33 +01:00
ntohidi
6244f56f36 Release v0.7.7
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
2025-11-14 10:23:31 +01:00
Nasrin
f3146de969 Merge pull request #1609 from unclecode/fix/update-config-documentation
Update browser and crawler run config documentation to match async_configs.py implementation
2025-11-13 21:52:53 +08:00
Soham Kukreti
d6b6d11a2d docs: update browser and crawler run config documentation to match async_configs.py implementation
Updated browser-crawler-config.md and parameters.md to ensure complete
accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.

Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)

- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth

- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental

- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity,
  Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain
  Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
2025-11-13 14:54:16 +05:30
Nasrin
466be69e72 Merge pull request #1607 from unclecode/fix/dfs_deep_crawling
Fix/dfs deep crawling
2025-11-13 16:43:47 +08:00
ntohidi
998c809e08 Rename folder name for NSTProxy integration examples for crawl4ai 2025-11-13 09:36:39 +01:00
ntohidi
d0fb53540d Update proxy-security documentation 2025-11-13 09:23:44 +01:00
Nasrin
8116b15b63 Merge pull request #1596 from unclecode/docs-proxy-security
#1591 enhance proxy configuration with security, SSL analysis, and rotation examples
2025-11-13 16:22:28 +08:00
AHMET YILMAZ
fe353c4e27 Refactor proxy configuration documentation for clarity and consistency 2025-11-13 11:20:24 +08:00
ntohidi
89cc29fe44 Merge branch 'fix/docker' into develop 2025-11-12 17:06:31 +01:00
Nasrin
cdcb8836b7 Merge pull request #1605 from Nstproxy/feat/nstproxy
feat: Add Nstproxy Proxies
2025-11-12 23:56:14 +08:00
AHMET YILMAZ
1bd3de6a47 #1510 : Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking 2025-11-12 17:44:43 +08:00
nstproxy
80452166c8 feat: Add Nstproxy Proxies 2025-11-12 16:25:39 +08:00
AHMET YILMAZ
2e8f8c9b49 #1551 : Fix casing and variable name consistency for LLMConfig in documentation 2025-11-10 15:38:14 +08:00
AHMET YILMAZ
263ac890fd #1591
: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
2025-11-10 11:42:07 +08:00
unclecode
1a22fb4d4f docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation
Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities

- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track

- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
2025-11-09 13:31:52 +08:00
CapSolver
57aeb70f00 Add CapSolver Captcha Solver 2025-11-06 15:37:31 +08:00
Nasrin
6534ece026 Merge pull request #1532 from unclecode/fix/update-documentation
Standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA
2025-11-05 23:37:05 +08:00
ntohidi
c0f1865287 feat(api): update marketplace version and build date in root endpoint response 2025-10-26 11:35:39 +01:00
ntohidi
46ef1116c4 fix(app-detail): enhance tab functionality, hide documentation and support tabs in marketplace 2025-10-26 11:21:29 +01:00
Nasrin
4df83893ac Merge pull request #1560 from unclecode/fix/marketplace
Fix/marketplace
2025-10-23 22:17:06 +08:00