Commit Graph

  • a28046c233 Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to episode_08_Media_Handling_Images_Videos_and_Audio.md bizrockman 2024-11-04 20:18:26 +01:00
  • 0bba0e074f Preventing NoneType has no attribute get Errors bizrockman 2024-11-04 20:12:24 +01:00
  • c4c6227962 Creating the API server component UncleCode 2024-11-04 20:33:15 +08:00
  • e6c914d2fa Refactor version management and remove deprecated gitignore.dev file UncleCode 2024-11-04 16:51:59 +08:00
  • be8f4fc59a Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73 UncleCode 2024-11-04 14:12:07 +08:00
  • fbdf870fbf Update CHANGELOG unclecode 2024-11-04 14:10:27 +08:00
  • 7b0cca41b4 Update gitignore UncleCode 2024-11-04 13:48:26 +08:00
  • 33d0e9ec8c Update dev gitignore UncleCode 2024-11-04 13:42:37 +08:00
  • 42f1c67ca8 Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0.3.73 UncleCode 2024-11-04 13:39:39 +08:00
  • e28c49a8fe Refactor .gitignore.dev file: Add ignore patterns for various files and directories UncleCode 2024-11-04 13:39:38 +08:00
  • 54d5a3a259 Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies. unclecode 2024-11-04 13:22:13 +08:00
  • de6b43f334 Merge pull request #215 from mjvankampen/build/flexible-requirements UncleCode 2024-11-03 08:30:06 +01:00
  • 07f508bd0c Merge pull request #218 from timoa/main UncleCode 2024-11-03 06:59:30 +01:00
  • 62a86dbe8d Refactor mission section in README and add mission diagram UncleCode 2024-10-31 16:38:56 +08:00
  • 492ada0ed4 Add mission diagram to MISSION.md UncleCode 2024-10-31 15:26:43 +08:00
  • d8eef02867 Add link to mission statement in README UncleCode 2024-10-31 15:23:58 +08:00
  • 6c7235d6a7 Add mission.md file UncleCode 2024-10-31 15:22:00 +08:00
  • 0a09d78fa5 chore(docs): fix documentation links + markdown lint Damien Laureaux 2024-10-31 05:50:22 +01:00
  • e8aaa57cb2 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-10-30 12:59:34 +00:00
  • 19c3f3efb2 Refactor tutorial markdown files: Update numbering and formatting UncleCode 2024-10-30 20:58:07 +08:00
  • a661b3173d Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-10-30 12:47:07 +00:00
  • e97e8df6ba Update README: Fix typo in project name UncleCode 2024-10-30 20:45:20 +08:00
  • cb6f5323ae Update README UncleCode 2024-10-30 20:44:57 +08:00
  • 47464cedec Update README UncleCode 2024-10-30 20:42:27 +08:00
  • 982d203d91 Merge branch '0.3.73' UncleCode 2024-10-30 20:40:09 +08:00
  • 9307c19f35 Update documents, upload new version of quickstart. UncleCode 2024-10-30 20:39:35 +08:00
  • 605a82793b fix dev requirements and lock playwright due to failing tests Mark Jan van Kampen 2024-10-30 10:41:37 +01:00
  • df9ee44d42 build: make requirements more flexible Mark Jan van Kampen 2024-10-30 10:03:22 +01:00
  • e9f7d5e73a Merge branch '0.3.73' UncleCode 2024-10-30 00:16:49 +08:00
  • 3529c2e732 Update new tutorial documents and added to the docs folder. UncleCode 2024-10-30 00:16:18 +08:00
  • d9e0b7abab Fix README badge UncleCode 2024-10-28 15:14:16 +08:00
  • b2800fefc6 Add badges to README UncleCode 2024-10-28 15:10:12 +08:00
  • d913e20edc Update Readme UncleCode 2024-10-28 15:09:37 +08:00
  • b781b6df96 Merge branch 'main' of https://github.com/unclecode/crawl4ai Unclecode 2024-10-27 11:42:23 +00:00
  • c2a71a5abe Update Docs folder, prepare branch for new version 0.3.73 v.3.72 UncleCode 2024-10-27 19:35:13 +08:00
  • d61615e0b0 Merge branch '0.3.72' UncleCode 2024-10-27 19:33:05 +08:00
  • ac9d83c72f Update gitignore main-0.3.7 UncleCode 2024-10-27 19:29:04 +08:00
  • ff9149b5c9 Merge branch 'main' of https://github.com/unclecode/crawl4ai UncleCode 2024-10-27 19:28:05 +08:00
  • 4239654722 Update Documentation 0.3.72 UncleCode 2024-10-27 19:24:46 +08:00
  • 38474bd66a Update version UncleCode 2024-10-24 20:24:21 +08:00
  • bcfe83f702 feat: enhance crawler with overlay removal and improved screenshot capabilities UncleCode 2024-10-24 20:22:47 +08:00
  • 32f57c49d6 Merge pull request #194 from IdrisHanafi/feat/customize-crawl-base-directory UncleCode 2024-10-24 13:09:27 +02:00
  • 60ba131ac8 [v0.3.72] Enhance content extraction and proxy support UncleCode 2024-10-22 20:19:22 +08:00
  • a5f627ba1a feat: customize crawl base directory Idris Hanafi 2024-10-21 17:58:39 -04:00
  • 04d16e6d2b Fix Base64 image parsing in WebScrappingStrategy (issue 182) UncleCode 2024-10-20 19:25:25 +08:00
  • 1dd36f9035 Refactor content scrapping strategy and improve error handling UncleCode 2024-10-20 19:11:18 +08:00
  • 6ec4cb33ca Enhance Markdown generation and external content control UncleCode 2024-10-20 18:56:58 +08:00
  • e7cd8a1c2d Update Changelog UncleCode 2024-10-19 18:37:12 +08:00
  • 4e2852d5ff [v0.3.71] Enhance chunking strategies and improve overall performance UncleCode 2024-10-19 18:36:59 +08:00
  • b309bc34e1 Fix the model name in quick start example 0.3.7 UncleCode 2024-10-18 15:32:25 +08:00
  • b8147b64e0 chore: Bump version to 0.3.71 and improve error handling UncleCode 2024-10-18 13:31:12 +08:00
  • aab6ea022e Update requirements and switch to 0.3.8 UncleCode 2024-10-18 12:51:23 +08:00
  • dd17ed0e63 Rename some flags name, introducing magic flag. UncleCode 2024-10-18 12:35:09 +08:00
  • dbb587d681 Update gitignore UncleCode 2024-10-17 21:38:48 +08:00
  • 768aa06ceb feat(crawler): Enhance stealth and flexibility, improve error handling UncleCode 2024-10-17 21:37:48 +08:00
  • 8105fd178e Removed stubs for remove_from_future_crawls, since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url), since retry with exponential backoff via tenacity takes care of it. Aravind Karnam 2024-10-17 15:42:43 +05:30
  • ce7fce4b16 1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches 2. Moved visited.add(url) to before the task is put in the queue rather than after the crawl is completed. This ensures that duplicate crawls don't happen when the same URL is found at a different depth and gets queued again because the first crawl is not yet completed and the visited set is not yet updated. 3. Renamed the yield_results attribute to stream, since that name is commonly used in other AI libraries for intermediate results. Aravind Karnam 2024-10-17 12:25:17 +05:30
  • de28b59aca removed unused imports Aravind Karnam 2024-10-16 22:36:48 +05:30
  • 04d8b47b92 Exposed min_crawl_delay for BFSScraperStrategy Aravind Karnam 2024-10-16 22:34:54 +05:30
  • 2943feeecf 1. Added a flag to yield each crawl result as it becomes ready, with returning only the final scraper result as the other option 2. Removed the ascrape_many method, as I'm currently not focusing on it in the first cut of the scraper 3. Added some error handling for cases where robots.txt cannot be fetched or parsed. Aravind Karnam 2024-10-16 22:05:29 +05:30
  • 8a7d29ce85 updated some comments and removed content type checking functionality from core as it's implemented as a filter Aravind Karnam 2024-10-16 15:59:37 +05:30
  • 159bd875bd Merge pull request #5 from aravindkarnam/main aravind 2024-10-16 10:41:22 +05:30
  • 9ffa34b697 Update README v0.3.6 unclecode/issue167 unclecode/issue157 unclecode 2024-10-14 22:58:27 +08:00
  • 740802c491 Merge branch '0.3.6' unclecode 2024-10-14 22:55:24 +08:00
  • b9ac96c332 Merge branch 'main' of https://github.com/unclecode/crawl4ai unclecode 2024-10-14 22:54:23 +08:00
  • d06535388a Update gitignore unclecode 2024-10-14 22:53:56 +08:00
  • 5b84ac9186 Merge branch '0.3.5' of https://github.com/unclecode/crawl4ai into 0.3.5 0.3.5 unclecode 2024-10-14 22:53:09 +08:00
  • 7ea5603576 Update gitignore unclecode 2024-10-14 22:52:00 +08:00
  • 2b73bdf6b0 Update changelog 0.3.6 unclecode 2024-10-14 21:04:02 +08:00
  • 6aa803d712 Update gitignore unclecode 2024-10-14 21:03:40 +08:00
  • 320afdea64 feat: Enhance crawler flexibility and LLM extraction capabilities unclecode 2024-10-14 21:03:28 +08:00
  • ccbe72cfc1 Merge pull request #135 from hitesh22rana/fix/docs-example UncleCode 2024-10-13 14:39:07 +08:00
  • b9bbd42373 Update Quickstart examples unclecode 2024-10-13 14:37:45 +08:00
  • 68e9144ce3 feat: Enhance crawling control and LLM extraction flexibility unclecode 2024-10-12 14:48:22 +08:00
  • 9b2b267820 CHANGELOG UPDATE unclecode 2024-10-12 13:42:56 +08:00
  • ff3524d9b1 feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts unclecode 2024-10-12 13:42:42 +08:00
  • b99d20b725 Add pypi_build.sh to .gitignore unclecode 2024-10-08 18:10:57 +08:00
  • 768b93140f docs: fixed css_selector for example hitesh22rana 2024-10-05 00:25:41 +09:00
  • d743adac68 Fixed some bugs in robots.txt processing Aravind Karnam 2024-10-03 15:58:57 +05:30
  • 7fe220dbd5 1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing 2. Introduced a dictionary for depth tracking across various tasks 3. Removed redundancy with crawled_urls variable. Instead created a list with visited set variable in returned object. Aravind Karnam 2024-10-03 11:17:11 +05:30
  • 65e013d9d1 Merge pull request #3 from aravindkarnam/main aravind 2024-10-03 09:52:12 +05:30
  • 4750810a67 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities unclecode 2024-10-02 17:34:56 +08:00
  • e0e0db4247 Bump version to 0.3.4 0.3.4 unclecode 2024-09-29 17:07:52 +08:00
  • bccadec887 Remove dependency on psutil, PyYaml, and extend requests version range unclecode 2024-09-29 17:07:06 +08:00
  • 0759503e50 Extend numpy version range to support Python 3.9 v0.3.3 unclecode 2024-09-29 00:08:02 +08:00
  • 7f1c020746 Update README to add link to previous version in branch V0.2.76 unclecode 2024-09-28 00:31:53 +08:00
  • 7afa11a02f Update .gitignore to include test_env/ and tmp/ directories v0.2.76 unclecode 2024-09-28 00:12:58 +08:00
  • 5d4e92db7d Update quickstart_async.py to improve performance and add Firecrawl simulation v0.3.0 staging unclecode 2024-09-28 00:11:39 +08:00
  • 8b6e88c85c Update .gitignore to ignore temporary and test directories unclecode 2024-09-26 15:09:49 +08:00
  • 64190dd0c4 Update README unclecode 2024-09-25 17:26:13 +08:00
  • 7100bcdf04 Add session based crawling documentation unclecode 2024-09-25 17:16:55 +08:00
  • 10cdad039d Update documents and README unclecode 2024-09-25 16:52:11 +08:00
  • f1eee09cf4 Update README, add manifest, make selenium optional library unclecode 2024-09-25 16:35:14 +08:00
  • 4d48bd31ca Push async version last changes for merge to main branch unclecode 2024-09-24 20:52:08 +08:00
  • 7f3e2e47ed Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt Aravind Karnam 2024-09-19 12:34:12 +05:30
  • 78f26ac263 Merge pull request #2 from aravindkarnam/staging aravind 2024-09-18 18:16:23 +05:30
  • d628bc4034 Refactor content_scrapping_strategy.py to remove excluded tags unclecode 2024-09-12 17:35:45 +08:00
  • b179aa9b6f Refactor website content and setup.py descriptions for consistent terminology unclecode 2024-09-12 16:50:52 +08:00
  • 30807f5535 Remove excluded tags from website content unclecode 2024-09-12 16:11:20 +08:00
  • 396f430022 Refactor AsyncCrawlerStrategy to return AsyncCrawlResponse unclecode 2024-09-12 15:49:49 +08:00
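
Commit ce7fce4b16 above describes two ideas: using asyncio.wait so results are yielded as each crawl finishes, and adding a URL to the visited set when it is queued rather than when its crawl completes. The following is a minimal sketch of that pattern, not the repository's actual code. The LINKS table and the fetch, crawl, and collect functions are hypothetical names made up for illustration, and the network request is replaced by a short sleep.

```python
import asyncio

# Hypothetical link graph standing in for real pages (an assumption for
# illustration only; this is not data or code from the repository).
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}


async def fetch(url):
    """Pretend to crawl a page and return the links found on it."""
    await asyncio.sleep(0.01)
    return url, LINKS.get(url, [])


async def crawl(start):
    """Yield each URL as soon as its own crawl finishes."""
    # A URL is marked visited when it is queued, not when it completes,
    # so the same URL found on two pages is never queued twice.
    visited = {start}
    pending = {asyncio.create_task(fetch(start))}
    while pending:
        # Wake up as soon as any task is done instead of waiting for all.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            url, links = task.result()
            yield url
            for link in links:
                if link not in visited:
                    visited.add(link)
                    pending.add(asyncio.create_task(fetch(link)))


async def collect(start):
    """Drain the async generator into a list."""
    return [url async for url in crawl(start)]


crawled = asyncio.run(collect("a"))
```

In this sketch, "c" and "d" are each linked from more than one page but are crawled only once, because the visited check happens at queue time. With asyncio.gather the loop would instead wait for a whole batch before handing back any result.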