Commit Graph

  • de43505ae4 feat: update version to 0.3.742 0.3.742 unclecode 2024-11-24 19:36:30 +08:00
  • d7c5b900b8 feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose unclecode 2024-11-24 19:35:53 +08:00
  • edad7b6a74 chore: remove Railway deployment configuration and related documentation unclecode 2024-11-24 18:48:39 +08:00
  • 829a1f7992 feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fix the issue that occurs when the HTML passed to the BM25 content filter does not contain any HTML elements. UncleCode 2024-11-23 19:45:41 +08:00
  • d729aa7d5e refactor: Add group ID for images extracted from srcset. UncleCode 2024-11-23 18:00:32 +08:00
  • 2226ef53c8 fix: Exempting the start_url from can_process_url Aravind Karnam 2024-11-23 14:59:14 +05:30
  • 3d52b551f2 Merge pull request #8 from aravindkarnam/main aravind 2024-11-23 13:57:36 +05:30
  • f8e85b1499 Fixed a bug in _process_links, handled condition for when url_scorer is passed as None, renamed the scrapper folder to scraper. Aravind Karnam 2024-11-23 13:52:34 +05:30
  • c1797037c0 Fixed a few bugs, import errors and changed to asyncio wait_for instead of timeout to support python versions < 3.11 Aravind Karnam 2024-11-23 12:39:25 +05:30
  • 0d0cef3438 feat: add enhanced markdown generation example with citations and file output UncleCode 2024-11-22 20:14:58 +08:00
  • d7a112fefe Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-22 19:56:56 +08:00
  • a5decaa7cf Merge branch '0.3.74' UncleCode 2024-11-22 19:55:52 +08:00
  • 8dea3f470f chore: update README to include new features and improvements for version 0.3.74 0.3.74 UncleCode 2024-11-22 18:50:12 +08:00
  • e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 UncleCode 2024-11-22 18:49:22 +08:00
  • 24ad2fe2dd feat: enhance Markdown generation to include fit_html attribute UncleCode 2024-11-22 18:47:17 +08:00
  • 571dda6549 Update README UncleCode 2024-11-22 18:27:43 +08:00
  • 006bee4a5a feat: enhance image processing capabilities - Enhanced image processing with srcset support and validation checks for better image selection. UncleCode 2024-11-22 16:00:17 +08:00
  • dbb751c8f0 In this commit, we introduce the new concept of MarkdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Links are extracted and replaced by citation reference numbers, and a reference section at the very end lists all the links with their descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction. UncleCode 2024-11-21 18:21:43 +08:00
  • 3439f7886d fix: crawler strategy exception handling and fixes (#271) 程序员阿江(Relakkes) 2024-11-20 20:30:25 +08:00
  • d418a04602 Fix #260: prevent passing duplicated kwargs to scrapping_strategy (#269) Darwing Medina 2024-11-20 04:52:11 -06:00
  • 8179cae765 feat: adding test file to my branch feature/content-filter-nasrin-1 feature/content-filter ntohidikplay 2024-11-19 13:23:25 +01:00
  • fde35f644d feat: adding test file to my branch ntohidikplay 2024-11-19 13:02:52 +01:00
  • 7047422e48 Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0.3.74 UncleCode 2024-11-19 19:33:08 +08:00
  • 2bdec1fa5a chore: add manage-collab.sh to .gitignore UncleCode 2024-11-19 19:33:04 +08:00
  • b654c49e55 Update .gitignore to exclude additional scripts and files UncleCode 2024-11-19 19:32:06 +08:00
  • f2cb7d506d Delete test3.txt UncleCode 2024-11-19 19:12:14 +08:00
  • a6dad3fc6d test: trying to push to 0.3.74 ntohidikplay 2024-11-19 12:09:33 +01:00
  • fbcff85ecb Remove test files UncleCode 2024-11-19 19:03:23 +08:00
  • 788c67c29a Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-19 19:02:44 +08:00
  • 2f19d38693 Update .gitignore to include .gitboss/ and todo_executor.md UncleCode 2024-11-19 19:02:41 +08:00
  • 3aae30ed2a test1: trying to push to main ntohidikplay 2024-11-19 11:57:07 +01:00
  • 5eeb682719 Delete test.txt unclecode-patch-1 UncleCode 2024-11-19 18:55:11 +08:00
  • 593c7ad307 test: trying to push to main ntohidikplay 2024-11-19 11:45:26 +01:00
  • 73658c758a chore: update .gitignore to include manage-collab.sh UncleCode 2024-11-19 16:10:43 +08:00
  • b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 UncleCode 2024-11-18 21:15:04 +08:00
  • 852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose UncleCode 2024-11-18 21:00:06 +08:00
  • 152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements fix(version): update version number to 0.3.74 refactor(async_webcrawler): enhance logging and add domain-based request delay UncleCode 2024-11-17 21:09:26 +08:00
  • df63a40606 feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity UncleCode 2024-11-17 19:44:45 +08:00
  • a59c107b23 Update changelog for 0.3.74 UncleCode 2024-11-17 18:42:43 +08:00
  • f9fe6f89fe feat(database): implement version management and migration checks during initialization UncleCode 2024-11-17 18:09:33 +08:00
  • 2a82455b3d feat(crawl): implement direct crawl functionality and introduce CacheMode for improved caching control UncleCode 2024-11-17 17:17:34 +08:00
  • 3a524a3bdd fix(docs): remove unnecessary blank line in README for improved readability UncleCode 2024-11-17 16:00:39 +08:00
  • 3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior chore(requirements): add colorama dependency refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code fix(docs): update example scripts for clarity and consistency UncleCode 2024-11-17 15:30:56 +08:00
  • 4b45b28f25 feat(docs): enhance deployment documentation with one-click setup, API security details, and Docker Compose examples UncleCode 2024-11-16 18:44:47 +08:00
  • 9139ef3125 feat(docker): update Dockerfile for improved installation process and enhance deployment documentation with Docker Compose setup and API token security UncleCode 2024-11-16 18:19:44 +08:00
  • 6360d0545a feat(api): add API token authentication and update Dockerfile description UncleCode 2024-11-16 18:08:56 +08:00
  • 1961adb530 refactor(docker): remove shared memory size configuration to streamline Dockerfile UncleCode 2024-11-16 17:35:27 +08:00
  • 79feab89c4 refactor(deploy): remove memory utilization alert configuration from deployment template UncleCode 2024-11-16 17:28:42 +08:00
  • 5d0b13294c feat(deploy): change instance size to professional-xs and update memory utilization alert window to 300 seconds UncleCode 2024-11-16 17:25:07 +08:00
  • 67edc2d641 feat(deploy): update instance size to professional-xs and add memory utilization alert parameters UncleCode 2024-11-16 17:23:32 +08:00
  • 6b569cceb5 feat(deploy): update branch to 0.3.74 and change instance size to basic-xs UncleCode 2024-11-16 17:21:45 +08:00
  • 6f2fe5954f feat(deploy): update instance size to professional-xs and add memory utilization alert UncleCode 2024-11-16 17:12:41 +08:00
  • fca1319b7d feat(docker): add MkDocs installation and build step for documentation UncleCode 2024-11-16 17:10:30 +08:00
  • f77f06a3bd feat(deploy): add deployment configuration and templates for crawl4ai UncleCode 2024-11-16 16:43:31 +08:00
  • e62c807295 feat(deploy): add Railway deployment configuration and setup instructions UncleCode 2024-11-16 16:38:13 +08:00
  • 90df6921b7 feat(crawl_sync): add synchronous crawl endpoint and corresponding test UncleCode 2024-11-16 15:34:30 +08:00
  • 5098442086 refactor: migrate versioning to __version__.py and remove deprecated _version.py UncleCode 2024-11-16 15:30:24 +08:00
  • d0014c6793 New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing. UncleCode 2024-11-16 14:54:41 +08:00
  • 60670b2af6 Merge pull request #7 from aravindkarnam/main aravind 2024-11-15 20:43:54 +05:30
  • ae7ebc0bd8 chore: update .gitignore and enhance changelog with major feature additions and examples UncleCode 2024-11-15 20:16:13 +08:00
  • 1f269f9834 test(content_filter): add comprehensive tests for BM25ContentFilter functionality UncleCode 2024-11-15 18:11:11 +08:00
  • 7f1ae5adcf Update changelog UncleCode 2024-11-14 22:51:51 +08:00
  • 3d00fee6c2 - In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crawl result object. - Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page that are relevant to its title, descriptions, and keywords, or that match a given user query. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files. UncleCode 2024-11-14 22:50:59 +08:00
  • 17913f5acf feat(crawler): support local files and raw HTML input in AsyncWebCrawler UncleCode 2024-11-13 20:00:29 +08:00
  • 3a2cb7dacf test: Add comprehensive unit tests for AsyncExecutor functionality 0.3.75 UncleCode 2024-11-13 19:46:05 +08:00
  • c38ac29edb perf(crawler): major performance improvements & raw HTML support UncleCode 2024-11-13 19:40:40 +08:00
  • 38044d4afe Merge pull request #255 from maheshpec/feature/configure-cache-directory UncleCode 2024-11-13 09:43:29 +01:00
  • 61b93ebf36 Update change log UncleCode 2024-11-13 15:38:30 +08:00
  • bf91adf3f8 fix: Resolve unexpected BrowserContext closure during crawl in Docker UncleCode 2024-11-13 15:37:16 +08:00
  • 00026b5f8b feat(config): Adding a configurable way of setting the cache directory for constrained environments Mahesh 2024-11-12 14:52:51 -07:00
  • 8c22396d8b Merge pull request #234 from devatnull/patch-1 UncleCode 2024-11-12 08:37:14 +01:00
  • b6d6631b12 Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth. UncleCode 2024-11-12 12:10:58 +08:00
  • a098483cbb Update Roadmap UncleCode 2024-11-09 20:40:30 +08:00
  • f9a297e08d Add Docker example script for testing Crawl4AI functionality UncleCode 2024-11-08 19:39:05 +08:00
  • bcdd80911f Remove some old files. UncleCode 2024-11-08 19:08:58 +08:00
  • 0d357ab7d2 feat(scraper): Enhance URL filtering and scoring systems scraper-uc UncleCode 2024-11-08 19:02:28 +08:00
  • bae4665949 feat(scraper): Enhance URL filtering and scoring systems UncleCode 2024-11-08 18:45:12 +08:00
  • d11c004fbb Enhanced BFS Strategy: Improved monitoring, resource management & configuration UncleCode 2024-11-08 15:57:23 +08:00
  • b120965b6a Fixed issues with the ManagedBrowser, including its inability to connect to the user directory and inability to create new pages within the ManagedBrowser context; all issues are now resolved. UncleCode 2024-11-07 20:15:03 +08:00
  • 16f918621f Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-07 19:30:22 +08:00
  • f7574230a1 Update API server request object. text_docker file and Readme UncleCode 2024-11-07 19:29:31 +08:00
  • 3d1c9a8434 Reviewing the BFS strategy. UncleCode 2024-11-07 18:54:53 +08:00
  • 2879344d9c Update README.md devatnull 2024-11-06 17:36:46 +03:00
  • 9f5eef1f38 Refactored the CustomHTML2Text class in content_scrapping_strategy.py to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead. UncleCode 2024-11-06 21:50:09 +08:00
  • be472c624c Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process. UncleCode 2024-11-06 21:09:47 +08:00
  • 06b21dcc50 Update .gitignore to include new directories for issues and documentation UncleCode 2024-11-06 18:44:03 +08:00
  • c5aa1bec18 Merge pull request #229 from bizrockman/main UncleCode 2024-11-06 07:31:07 +01:00
  • 0f0f60527d Merge pull request #172 from aravindkarnam/scraper UncleCode 2024-11-06 07:00:44 +01:00
  • 11721eb0ce Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-11-05 13:02:59 +00:00
  • b51263664e feat(api): add CORS support and static file serving, update root redirect UncleCode 2024-11-05 21:02:47 +08:00
  • 1222e456fb Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-11-05 12:58:30 +00:00
  • 1e7db0d293 docs(README): update release notes for version 0.3.73 with new features and improvements UncleCode 2024-11-05 20:12:20 +08:00
  • 2a54f3c048 refactor(core): remove main_v0.py file and associated functionality UncleCode 2024-11-05 20:11:07 +08:00
  • 1c20b815b3 docs(README): update Docker usage instructions and add deployment options UncleCode 2024-11-05 20:10:24 +08:00
  • 43a2b26f63 Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-11-05 20:08:20 +08:00
  • 3cf19a1bc2 chore(version): bump version to 0.3.73 0.3.73 UncleCode 2024-11-05 20:05:58 +08:00
  • 67a23c3182 feat(core): Release v0.3.73 with Browser Takeover and Docker Support UncleCode 2024-11-05 20:04:18 +08:00
  • 796dbaf08c Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_Extraction_Strategies_Cosine.md bizrockman 2024-11-04 20:19:43 +01:00
  • 3a3c88a2d0 Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Extraction_Strategies_LLM.md bizrockman 2024-11-04 20:19:20 +01:00
  • 870296fa7e Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_1_Extraction_Strategies_JSON_CSS.md bizrockman 2024-11-04 20:18:58 +01:00