crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 15:58:15 +00:00

Files

unclecode 9b571bb947 feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)

The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to an empty string; their content is only available via downloaded_files.

Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
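
The behaviour described in this commit message can be illustrated with a minimal, standard-library-only sketch. It is not the code from the commit: only the names `html`, `downloaded_files`, and `downloads_path` come from the message above. The function `handle_response`, the `TEXT_TYPES` and `HTML_TYPES` sets, the dict-shaped return value, and the fallback file name `download` are assumptions made for illustration.

```python
from email.message import Message
from pathlib import Path
from urllib.parse import urlparse

# Assumed classification sets; the real strategy may use different lists.
HTML_TYPES = {"text/html", "application/xhtml+xml"}
TEXT_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}


def handle_response(url: str, headers: dict, body: bytes, downloads_path: str) -> dict:
    """Hypothetical helper: classify a response and save file downloads to disk."""
    # email.message.Message parses MIME-style headers case-insensitively.
    msg = Message()
    for name, value in headers.items():
        msg[name] = value

    # Treat a missing Content-Type as HTML (an assumption of this sketch).
    content_type = msg.get_content_type() if "Content-Type" in msg else "text/html"
    is_attachment = msg.get_content_disposition() == "attachment"

    # Ordinary HTML page: nothing is written to disk.
    if content_type in HTML_TYPES and not is_attachment:
        return {"html": body.decode("utf-8", errors="replace"), "downloaded_files": None}

    # Prefer the Content-Disposition filename, then the last URL path segment.
    name = Path(msg.get_filename() or urlparse(url).path).name
    if name in ("", ".", ".."):
        name = "download"  # hypothetical fallback name

    target_dir = Path(downloads_path).expanduser()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(body)  # raw bytes are saved unchanged

    # Text-based files also populate html; binary files leave it empty.
    is_text = content_type.startswith("text/") or content_type in TEXT_TYPES
    html = body.decode("utf-8", errors="replace") if is_text else ""
    return {"html": html, "downloaded_files": [str(target)]}
```

In this sketch a CSV response returns both its decoded text in `html` and the saved path in `downloaded_files`, while a PDF response returns an empty `html` and only the saved path. Passing `~/.crawl4ai/downloads/` as `downloads_path` mirrors the default named in the commit message.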

2026-03-16 14:03:43 +00:00

sample_wikipedia.html

perf(crawler): major performance improvements & raw HTML support

2024-11-13 19:40:40 +08:00

test_0.4.2_browser_manager.py

Release/v0.7.6 (#1556)

2025-10-22 20:41:06 +08:00

test_0.4.2_config_params.py

Update all documentation to import extraction strategies directly from crawl4ai.

2025-06-10 18:08:27 +08:00

test_async_doanloader.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_basic_crawling.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_browser_lifecycle.py

Add memory-saving mode, browser recycling, and CDP leak fixes

2026-02-04 02:00:53 +00:00

test_browser_memory.py

Add memory-saving mode, browser recycling, and CDP leak fixes

2026-02-04 02:00:53 +00:00

test_browser_recycle_v2.py

Fix browser recycling under high concurrency — version-based approach

2026-02-05 07:48:12 +00:00

test_caching.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_chunking_and_extraction_strategies.py

Update all documentation to import extraction strategies directly from crawl4ai.

2025-06-10 18:08:27 +08:00

test_content_extraction.py

fix: Implement base tag support in link extraction (#1147)

2025-08-08 20:11:57 +05:30

test_content_filter_bm25.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_content_filter_prune.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_content_scraper_strategy.py

Squashed commit of the following:

2025-08-04 19:02:01 +08:00

test_crawler_strategy.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_database_operations.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_dispatchers.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_edge_cases.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_error_handling.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_evaluation_scraping_methods_performance.configs.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_http_file_download.py

feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)

2026-03-16 14:03:43 +00:00

test_markdown_genertor.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_parameters_and_options.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_performance.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00

test_redirect_url_resolution.py

Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

2025-12-02 13:00:54 +01:00

test_screenshot.py

Apply Ruff Corrections

2025-01-13 19:19:58 +08:00