Commit Graph

493 Commits

Author SHA1 Message Date
Martin Sjöborg
35dd206925 fix: always return a list, even if we catch an exception 2025-10-02 09:21:44 +02:00
ntohidi
77559f3373 feat(StealthAdapter): fix stealth features for Playwright integration. ref #1481 2025-09-18 15:39:06 +08:00
Nasrin
3899ac3d3b Merge pull request #1464 from unclecode/fix/proxy_deprecation
Fix/proxy deprecation
2025-09-16 15:48:45 +08:00
Nasrin
23431d8109 Merge pull request #1389 from unclecode/fix/deep-crawl-scoring
fix(deep-crawl): BestFirst priority inversion
2025-09-16 15:45:54 +08:00
AHMET YILMAZ
1717827732 refactor(BrowserConfig): change deprecation warning for 'proxy' parameter to UserWarning 2025-09-12 11:10:38 +08:00
ntohidi
3bc56dd028 fix: allow custom LLM providers for adaptive crawler embedding config. ref: #1291
- Change embedding_llm_config from Dict to Union[LLMConfig, Dict] for type safety
  - Add backward-compatible conversion property _embedding_llm_config_dict
  - Replace all hardcoded OpenAI embedding configs with configurable options
  - Fix LLMConfig object attribute access in query expansion logic
  - Add comprehensive example demonstrating multiple provider configurations
  - Update documentation with both LLMConfig object and dictionary usage patterns

  Users can now specify any LLM provider for query expansion in embedding strategy:
  - New: embedding_llm_config=LLMConfig(provider='anthropic/claude-3', api_token='key')
  - Old: embedding_llm_config={'provider': 'openai/gpt-4', 'api_token': 'key'} (still works)
2025-09-09 12:49:55 +08:00
ntohidi
487839640f fix: raise error on last attempt failure in perform_completion_with_backoff. ref #989 2025-09-02 16:49:01 +08:00
Nasrin
ae67d66b81 Merge pull request #1454 from nafeqq-1306/docstring-changes
issue #1329: Docs are not detected due to triplequotes not being first line
2025-09-02 11:59:59 +08:00
Nasrin
5e7fcb17e1 Merge pull request #1448 from unclecode/fix/https-reditrect
feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling
2025-09-01 16:11:25 +08:00
nafeqq-1306
9749e2832d issue #1329 refactor(crawler): move unwanted properties to CrawlerRunConfig class 2025-08-29 10:20:47 +08:00
ntohidi
f566c5a376 feat: add preserve_https_for_internal_links flag to maintain HTTPS during crawling. Ref #1410
Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
2025-08-28 17:38:40 +08:00
AHMET YILMAZ
f7a3366f72 #1375 : refactor(proxy) Deprecate 'proxy' parameter in BrowserConfig and enhance proxy string parsing
- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
2025-08-28 17:21:49 +08:00
Nasrin
4e1c4bd24e Merge pull request #1436 from unclecode/fix/docker-filter
fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy
2025-08-27 11:08:42 +08:00
ntohidi
38f3ea42a7 fix(logger): ensure logger is a Logger instance in crawling strategies. ref #1437 2025-08-26 12:06:56 +08:00
ntohidi
102352eac4 fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy (ref #1419)
- Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly
2025-08-25 14:04:08 +08:00
ntohidi
40ab287c90 fix(utils): Improve URL normalization by avoiding quote/unquote to preserve '+' signs. ref #1332 2025-08-22 12:05:21 +08:00
ntohidi
f4a432829e fix(crawler): Removed the incorrect reference in browser_config variable #1310 2025-08-18 10:59:14 +08:00
UncleCode
22c7932ba3 chore(version): update version to 0.7.4 2025-08-17 19:22:23 +08:00
UncleCode
2ab0bf27c2 refactor(utils): move memory utilities to utils and update imports 2025-08-17 19:14:55 +08:00
ntohidi
d30dc9fdc1 fix(http-crawler): bring back HTTP crawler strategy 2025-08-16 09:27:23 +08:00
ntohidi
a51545c883 feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables
BREAKING CHANGE: Table extraction now uses Strategy Design Pattern

This epic commit introduces a game-changing approach to table extraction in Crawl4AI:

 NEW FEATURES:
- LLMTableExtraction: AI-powered extraction for complex HTML tables with rowspan/colspan
- Smart Chunking: Automatically splits massive tables into optimal chunks at row boundaries
- Parallel Processing: Processes multiple chunks simultaneously for blazing-fast extraction
- Intelligent Merging: Seamlessly combines chunk results into complete tables
- Header Preservation: Each chunk maintains context with original headers
- Auto-retry Logic: Built-in resilience with configurable retry attempts

🏗️ ARCHITECTURE:
- Strategy Design Pattern for pluggable table extraction strategies
- ThreadPoolExecutor for concurrent chunk processing
- Token-based chunking with configurable thresholds
- Handles tables without headers gracefully

 PERFORMANCE:
- Process 1000+ row tables without timeout
- Parallel processing with up to 5 concurrent chunks
- Smart token estimation prevents LLM context overflow
- Optimized for providers like Groq for massive tables

🔧 CONFIGURATION:
- enable_chunking: Auto-handle large tables (default: True)
- chunk_token_threshold: When to split (default: 3000 tokens)
- min_rows_per_chunk: Meaningful chunk sizes (default: 10)
- max_parallel_chunks: Concurrent processing (default: 5)

📚 BACKWARD COMPATIBILITY:
- Existing code continues to work unchanged
- DefaultTableExtraction remains the default strategy
- Progressive enhancement approach

This is the future of web table extraction - handling everything from simple tables to massive, complex data grids with merged cells and nested structures. The chunking is completely transparent to users while providing unprecedented scalability.
2025-08-14 18:21:24 +08:00
Nasrin
926e41aab8 Merge pull request #1378 from unclecode/fix/exit_with_q
Cross Platform fix for browser profiler
2025-08-13 14:16:47 +08:00
Nasrin
b92be4ef66 Merge pull request #1371 from unclecode/bug/proxy_config
#1057 : enhance ProxyConfig initialization to support dict and string…
2025-08-12 16:55:52 +08:00
ntohidi
dfcfd8ae57 fix(dispatcher): enable true concurrency for fast-completing tasks in arun_many. REF: #560
The MemoryAdaptiveDispatcher was processing tasks sequentially despite
  max_session_permit > 1 due to fetching only one task per event loop iteration.
  This particularly affected raw:// URLs which complete in microseconds.

  Changes:
  - Replace single task fetch with greedy slot filling using get_nowait()
  - Fill all available slots (up to max_session_permit) immediately
  - Break on empty queue instead of waiting with timeout

  This ensures proper parallelization for all task types, especially
  ultra-fast operations like raw HTML processing.
2025-08-12 16:51:22 +08:00
ntohidi
955110a8b0 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-12 12:22:25 +08:00
ntohidi
8146d477e9 Merge branch 'main' into develop 2025-08-11 18:56:15 +08:00
ntohidi
96c4b0de67 fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198
- Add _page_lock and guarded creation; handle empty context.pages safely
  - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many
2025-08-11 18:55:43 +08:00
Nasrin
57c14db7cb Merge pull request #1381 from unclecode/fix/base-tag-link-resolution
fix: Implement base tag support in link extraction (#1147)
2025-08-11 18:32:32 +08:00
ntohidi
88a9fbbb7e fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253
Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
2025-08-11 18:16:57 +08:00
Soham Kukreti
18ad3ef159 fix: Implement base tag support in link extraction (#1147)
- Extract base href from <head><base> tag using XPath in _process_element method
- Use base URL as the primary URL for link normalization when present
- Add error handling with logging for malformed or problematic base tags
- Maintain backward compatibility when no base tag is present
- Add test to verify the functionality of the base tag extraction.
2025-08-08 20:11:57 +05:30
AHMET YILMAZ
b61b2ee676 feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling 2025-08-08 11:18:34 +08:00
AHMET YILMAZ
89cf5aba2b #1057 : enhance ProxyConfig initialization to support dict and string formats 2025-08-06 18:34:58 +08:00
ntohidi
6b0b5301ba Release v0.7.3:
- Updated version to 0.7.3
- Added release notes
- Updated documentation
2025-08-06 17:52:01 +08:00
Nasrin
64f37792a7 Merge pull request #1170 from prokopis3/fix/create-profile
fix(browser_profiler): cross-platform 'q' to quit - create profile
2025-08-06 16:29:14 +08:00
ntohidi
437395e490 Merge branch 'feat/undetected-browser' into develop-future 2025-08-06 15:03:30 +08:00
ntohidi
7a6ad547f0 Squashed commit of the following:
commit 2def6524cdacb69c72760bf55a41089257c0bb07
Author: ntohidi <nasrin@kidocode.com>
Date:   Mon Aug 4 18:59:10 2025 +0800

    refactor: consolidate WebScrapingStrategy to use LXML implementation only

    BREAKING CHANGE: None - full backward compatibility maintained

    This commit simplifies the content scraping architecture by removing the
    redundant BeautifulSoup-based WebScrapingStrategy implementation and making
    it an alias for LXMLWebScrapingStrategy.

    Changes:
    - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
    - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy
    - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
    - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
    - Maintain 100% backward compatibility - existing code continues to work

    Code changes:
    - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
    - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
    - crawl4ai/__init__.py: Update imports to show alias relationship
    - crawl4ai/types.py: Update type definitions
    - crawl4ai/legacy/web_crawler.py: Update import to use alias
    - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
    - docs/examples/scraping_strategies_performance.py: Update to use single strategy

    Documentation updates:
    - docs/md_v2/core/content-selection.md: Update scraping modes section
    - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
    - CHANGELOG.md: Document the refactoring under [Unreleased]

    Benefits:
    - 10-20x faster HTML parsing for large documents
    - Reduced memory usage and simplified codebase
    - Consistent parsing behavior
    - No migration required for existing users

    All existing code using WebScrapingStrategy continues to work without
    modification, while benefiting from LXML's superior performance.
2025-08-04 19:02:01 +08:00
ntohidi
307fe28b32 fix: Correct URL matcher fallback behavior and improve memory monitoring
Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
2025-08-03 16:50:54 +08:00
ntohidi
a03e68fa2f feat: Add URL-specific crawler configurations for multi-URL crawling
Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
2025-08-02 19:10:36 +08:00
Charlie C
508b6fc233 fix: Enable following redirects in sitemap fetching for seeder 2025-07-31 12:06:10 +08:00
UncleCode
48647300b4 chore: Bump version to 0.7.2 2025-07-25 17:42:48 +08:00
UncleCode
9546773a07 fix: Move sentence-transformers to optional dependencies
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
2025-07-24 21:24:40 +08:00
Nasrin
b8c261780f Merge pull request #1319 from volumetric/fix_for_bug_#1310
Removed the incorrect reference in browser_config variable
2025-07-23 12:45:12 +02:00
Nasrin
004d514f33 Merge pull request #1265 from unclecode/feature/nasrin-cli-deep-crawl
Feature/CLI - deep-crawl: Add --deep-crawl CLI option with BFS/DFS/Best-First strategies and fix serialization error. ref #874
2025-07-23 09:40:33 +02:00
Vinit Agrawal
3a9e2c716e Remvoed the incorrect reference in browser_config variable 2025-07-18 10:01:00 +05:30
ntohidi
26bad799e4 chore: update version to 0.7.1 2025-07-17 11:37:41 +02:00
ntohidi
cf8badfe27 feat: cleanup unused code and enhance documentation for v0.7.1
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 11:35:16 +02:00
unclecode
6a728cbe5b feat: add stealth mode and enhance undetected browser support
- Add playwright-stealth integration with enable_stealth parameter in BrowserConfig
- Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern
- Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching
- Update install.py to install both playwright and patchright browsers automatically
- Add comprehensive documentation for anti-bot features (stealth mode + undetected browser)
- Create examples demonstrating stealth mode usage and comparison tests
- Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies
- Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium)
- Add dependency checker tool in tests/check_dependencies.py

Breaking changes: None - all existing functionality preserved

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 16:59:10 +08:00
unclecode
5c33cbcca2 feat: add undetected browser support with adapter pattern 2025-07-14 17:29:50 +08:00
UncleCode
0c8bb742b7 Release v0.7.0-r1: The Adaptive Intelligence Update
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder

Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
2025-07-12 18:51:13 +08:00
ntohidi
0ebce590f8 Merge branch '2025-JUN-1' into next-MAY 2025-07-09 09:41:03 +02:00