Commit Graph

  • 44ce12c62c Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy Aravind Karnam 2024-09-09 13:13:34 +05:30
  • eb131bebdf Create series of quickstart files. unclecode 2024-09-04 15:33:24 +08:00
  • 5c15837677 chore: Update README, generate new notbook for quickstart unclecode 2024-09-04 14:46:22 +08:00
  • 2fada16abb chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy unclecode 2024-09-03 23:32:27 +08:00
  • c37614cbc8 Add Async Version, JsonCss Extrator unclecode 2024-09-03 01:27:00 +08:00
  • 3116f95c1a Merge branch 'pull-84' into staging unclecode 2024-09-01 16:44:06 +08:00
  • b0e8b66666 Merge branch 'proxy-support' into staging unclecode 2024-09-01 16:35:14 +08:00
  • 3caf48c9be refactor: Update LocalSeleniumCrawlerStrategy to execute JS code if provided proxy-support unclecode 2024-09-01 16:34:51 +08:00
  • 3c6ebb73ae Update web_crawler.py pull-84 Umut CAN 2024-08-30 15:30:06 +03:00
  • 0d9b638636 Merge pull request #75 from aravindkarnam/main UncleCode 2024-08-30 12:54:15 +02:00
  • 2ba70b9501 add use proxy and llm baseurl examples datehoer 2024-08-27 10:14:54 +08:00
  • 16f98cebc0 replace base64 image url to '' datehoer 2024-08-27 09:44:35 +08:00
  • fe9ff498ce add proxy and add ai base_url datehoer 2024-08-26 16:12:49 +08:00
  • eba831ca30 fix spelling mistake Datehoer 2024-08-26 15:29:23 +08:00
  • dec3d44224 refactor: Update extraction strategy to handle schema extraction with non-empty schema unclecode 2024-08-19 15:37:07 +08:00
  • 9ed1551125 Added support to source tags wrapped inside video and audio tags. Extended the text extraction to video and audio elements in media. https://github.com/unclecode/crawl4ai/issues/71 Aravind Karnam 2024-08-14 10:59:49 +05:30
  • 14e537fdd3 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-08-04 06:57:16 +00:00
  • e5e6a34e80 ## [v0.2.77] - 2024-08-04 v0.2.77 unclecode 2024-08-04 14:54:18 +08:00
  • 64b33af0e0 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-08-02 08:04:54 +00:00
  • 897e766728 Update README unclecode 2024-08-02 16:04:14 +08:00
  • 9200a6731d ## [v0.2.76] - 2024-08-02 unclecode 2024-08-02 16:02:42 +08:00
  • 61c166ab19 refactor: Update Crawl4AI version to v0.2.76 main-75 unclecode 2024-08-02 15:55:53 +08:00
  • 659c8cd953 refactor: Update image description minimum word threshold in get_content_of_website_optimized unclecode 2024-08-02 15:55:32 +08:00
  • 9ee988753d refactor: Update image description minimum word threshold in get_content_of_website_optimized unclecode 2024-08-02 14:53:11 +08:00
  • 8ae6c43ca4 refactor: Update Dockerfile to install Crawl4AI with specified options unclecode 2024-08-01 20:13:06 +08:00
  • b6713870ef refactor: Update Dockerfile to install Crawl4AI with specified options unclecode 2024-08-01 17:56:19 +08:00
  • 40477493d3 refactor: Remove image format dot in get_content_of_website_optimized unclecode 2024-07-31 16:15:55 +08:00
  • efcf3ac6eb Update LocalSeleniumCrawlerStrategy to resolve ChromeDriver version mismatch issue Kevin Moturi 2024-07-27 06:11:57 -05:00
  • 9e43f7beda refactor: Temporarily disable fetching image file size in get_content_of_website_optimized unclecode 2024-07-31 13:29:23 +08:00
  • aa9412e1b4 refactor: Set image_size to 0 in get_content_of_website_optimized unclecode 2024-07-23 13:08:53 +08:00
  • cf6c835e18 moved score threshold to config.py & replaced the separator for tag.get_text in find_closest_parent_with_useful_text fn from period(.) to space( ) to keep the text more neutral. image-description Aravind Karnam 2024-07-21 15:18:23 +05:30
  • e5ecf291f3 Implemented filtering for images and grabbing the contextual text from nearest parent Aravind Karnam 2024-07-21 15:03:17 +05:30
  • 9d0cafcfa6 fixed import error in model_loader.py Aravind Karnam 2024-07-21 14:55:58 +05:30
  • 7715623430 chore: Fix typos and update .gitignore v0.0.75 unclecode 2024-07-19 17:42:39 +08:00
  • f5a4e80e2c chore: Fix typo in chunking_strategy.py and crawler_strategy.py unclecode 2024-07-19 17:40:31 +08:00
  • 8463aabedf chore: Remove .test_pads/ directory from .gitignore main-img-captionify unclecode 2024-07-19 17:09:29 +08:00
  • 7f30144ef2 chore: Remove .tests/ directory from .gitignore unclecode 2024-07-09 15:10:18 +08:00
  • fa5516aad6 chore: Refactor setup.py to use pathlib and shutil for folder creation and removal, to remove cache folder in cross platform manner. unclecode 2024-07-09 13:25:00 +08:00
  • 1afcdb6996 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-08 12:24:13 +00:00
  • ca0336af9e feat: Add error handling for rate limit exceeded in form submission unclecode 2024-07-08 20:24:00 +08:00
  • ca625b3152 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-08 12:02:19 +00:00
  • 65ed1aeade feat: Add rate limiting functionality with custom handlers unclecode 2024-07-08 20:02:12 +08:00
  • 6521b4745f Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-08 08:35:49 +00:00
  • 4d283ab386 ## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 v0.2.74 unclecode 2024-07-08 16:33:25 +08:00
  • 2101540819 chore: Update version to 0.2.74 in setup.py v0.2.74 unclecode 2024-07-08 16:30:28 +08:00
  • 9d98393606 Prepare branch for release 0.2.74 unclecode 2024-07-08 16:30:14 +08:00
  • 6f99368744 Add UTF encoding to resolve the windows machone "charmap" error. unclecode 2024-07-08 16:18:07 +08:00
  • ea2f83ac10 feat: Add delay after fetching URL in crawler hooks unclecode 2024-07-08 15:59:59 +08:00
  • 7f41ff4a74 The after_get_url hook is executed after getting the URL, allowing for further customization. unclecode 2024-07-06 14:28:01 +08:00
  • 236bdb4035 feat: Add MaxRetryError exception handling in LocalSeleniumCrawlerStrategy unclecode 2024-07-06 14:08:30 +08:00
  • 1368248254 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy unclecode 2024-07-05 17:59:26 +08:00
  • b0ec54b9e9 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy unclecode 2024-07-05 17:37:25 +08:00
  • fb6ed5f000 feat: Sanitize input and handle encoding issues in LLMExtractionStrategy unclecode 2024-07-05 17:30:58 +08:00
  • 597fe8bdb7 chore: Delete existing database file and initialize new database unclecode 2024-07-05 17:04:57 +08:00
  • 241862bfe6 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-07-03 07:27:37 +00:00
  • 3ff2a0d0e7 Merge branch 'main' of https://github.com/unclecode/crawl4ai v0.2.73 unclecode 2024-07-03 15:26:47 +08:00
  • 3cd1b3719f Bump version to v0.2.73, update documentation, and resolve installation issues unclecode 2024-07-03 15:26:43 +08:00
  • 9926eb9f95 feat: Bump version to v0.2.73 and update documentation unclecode 2024-07-03 15:19:22 +08:00
  • 3abaa82501 Merge pull request #37 from shivkumar0757/fix-readme-encoding UncleCode 2024-07-01 07:31:07 +02:00
  • 88d8cd8650 feat: Add page load check for LocalSeleniumCrawlerStrategy unclecode 2024-07-01 00:07:32 +08:00
  • a08f21d66c Fix UnicodeDecodeError by reading README.md with UTF-8 encoding shiv 2024-06-30 20:27:33 +05:30
  • f2491b6c1a Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-29 16:34:15 +00:00
  • d58286989c UPDATE DOCUMENTS unclecode 2024-06-30 00:34:02 +08:00
  • 886622cb1e Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-29 16:23:44 +00:00
  • b58af3349c chore: Update installation instructions with support for different modes v0.2.72 unclecode 2024-06-30 00:22:17 +08:00
  • 940df4631f Update ChangeLog unclecode 2024-06-30 00:18:40 +08:00
  • 685706e0aa Update version, and change log main-v0.2.72 unclecode 2024-06-30 00:17:43 +08:00
  • 7b0979e134 Update Redme and Docker file unclecode 2024-06-30 00:15:43 +08:00
  • 61ae2de841 1/Update setup.py to support following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Docker file 3/ Update documentation as well. unclecode 2024-06-30 00:15:29 +08:00
  • 5b28eed2c0 Add a temporary solution for when we can't crawl websites in headless mode. unclecode 2024-06-29 23:25:50 +08:00
  • f8a11779fe Update change log unclecode 2024-06-26 16:48:36 +08:00
  • 13dc254438 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-26 07:35:06 +00:00
  • d11a83c232 ## [0.2.71] 2024-06-26 • Refactored crawler_strategy.py to handle exceptions and improve error messages • Improved get_content_of_website_optimized function in utils.py for better performance • Updated utils.py with latest changes • Migrated to ChromeDriverManager for resolving Chrome driver download issues v0.2.71 main-1 unclecode 2024-06-26 15:34:15 +08:00
  • 3255c7a3fa Update CHANGELOG.md with recent commits unclecode 2024-06-26 15:20:34 +08:00
  • 4756d0a532 Refactor crawler_strategy.py to handle exceptions and improve error messages unclecode 2024-06-26 15:04:33 +08:00
  • 7ba2142363 chore: Refactor get_content_of_website_optimized function in utils.py unclecode 2024-06-26 14:43:09 +08:00
  • 096929153f Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-26 05:45:25 +00:00
  • 96d1eb0d0d Some updated ins utils.py image-filterizer unclecode 2024-06-26 13:03:03 +08:00
  • 144cfa0eda Switch to ChromeDriverManager due some issues with download the chrome driver unclecode 2024-06-26 13:00:17 +08:00
  • a0dff192ae Update README for speed example unclecode 2024-06-24 23:06:12 +08:00
  • 1fffeeedd2 Update Readme: Showcase the speed unclecode 2024-06-24 23:02:08 +08:00
  • f51b078042 Update reame example. unclecode 2024-06-24 22:54:29 +08:00
  • b6023a51fb Add star chart unclecode 2024-06-24 22:47:46 +08:00
  • 7e95c38acb Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-24 14:40:48 +00:00
  • 78cfad8b2f chore: Update version to 0.2.7 and improve extraction function speed v0.2.7 unclecode 2024-06-24 22:39:56 +08:00
  • c697bf23e4 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-22 16:37:27 +00:00
  • b951d34ed0 chore: Update fetch URL to use HTTPS Unclecode 2024-06-22 16:37:21 +00:00
  • 68b3dff74a Update CSS unclecode 2024-06-23 00:36:03 +08:00
  • bfc4abd6e8 Update documents unclecode 2024-06-22 20:57:03 +08:00
  • c8a10dc455 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-22 12:54:41 +00:00
  • 8c77a760fc Fixed: - Redirect "/" to mkdocs unclecode 2024-06-22 20:54:32 +08:00
  • 9e0ded8da0 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-22 12:41:52 +00:00
  • b9bf8ac9d7 Fix mounting the "/" to mkdocs site folder unclecode 2024-06-22 20:41:39 +08:00
  • 48c27899b7 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-06-22 12:38:14 +00:00
  • d6182bedd7 chore: - Add demo page to the new mkdocs - Set website home page to mkdocs unclecode 2024-06-22 20:36:01 +08:00
  • 2217904876 Update .gitignore unclecode 2024-06-22 18:12:12 +08:00
  • 2c2362b4d3 issue 19 is resolved - Update Dockerfile to install mkdocs and build documentation v0.2.6 unclecode 2024-06-22 17:18:00 +08:00
  • 612ed3fef2 chore: Update print statement to use markdown format unclecode 2024-06-21 19:10:13 +08:00
  • fb2a6d0d04 chore: Update documentation link in README.md unclecode 2024-06-21 18:05:18 +08:00
  • 19d3d39115 Update Marge the DOCS branch unclecode 2024-06-21 18:04:13 +08:00