9 Commits

Author SHA1 Message Date
unclecode
9b571bb947 feat: HTTP strategy detects and saves file downloads (CSV, PDF, etc.)
The HTTP crawler strategy now checks Content-Type and Content-Disposition
headers to detect non-HTML file responses. When a file download is
detected, raw bytes are saved to disk and the path is returned via
downloaded_files. Text-based files (CSV, JSON, XML) also populate the
html field for backward compatibility. Binary files (PDF, images) set
html to empty string — content is only available via downloaded_files.

Adds downloads_path to HTTPCrawlerConfig (defaults to ~/.crawl4ai/downloads/).
2026-03-16 14:03:43 +00:00
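
For readers skimming the log, the detection flow described in 9b571bb947 can be sketched roughly as below. This is a minimal illustration using only the standard library, not the code from the commit: `classify_response`, `handle_response`, the result dict, the MIME-type sets, and the `download.bin` fallback name are all hypothetical, and the headers are assumed to arrive as a plain dict with canonical header names.

```python
from email.message import Message
from pathlib import Path

# Assumed classification sets; the real strategy may use different ones.
HTML_TYPES = {"text/html", "application/xhtml+xml"}
TEXT_FILE_TYPES = {"text/csv", "application/json", "application/xml", "text/xml"}


def classify_response(headers):
    """Return (is_download, is_text, filename) based on response headers."""
    msg = Message()
    for name in ("Content-Type", "Content-Disposition"):
        if headers.get(name):
            msg[name] = headers[name]
    # Treat a missing Content-Type as HTML so ordinary pages are unaffected.
    ctype = msg.get_content_type() if "Content-Type" in msg else "text/html"
    attachment = msg.get_content_disposition() == "attachment"
    is_download = attachment or ctype not in HTML_TYPES
    return is_download, ctype in TEXT_FILE_TYPES, msg.get_filename()


def handle_response(headers, body, downloads_path, fallback_name="download.bin"):
    """Save file downloads to disk; populate html only for text content."""
    is_download, is_text, filename = classify_response(headers)
    result = {"html": "", "downloaded_files": []}
    if not is_download:
        result["html"] = body.decode("utf-8", errors="replace")
        return result

    target_dir = Path(downloads_path).expanduser()
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component of a server-supplied file name.
    name = Path(filename or fallback_name).name
    if name in ("", ".", ".."):
        name = fallback_name
    target = target_dir / name
    target.write_bytes(body)  # raw bytes, regardless of type
    result["downloaded_files"].append(str(target))

    # Text-based files also fill html; binary files leave it empty.
    if is_text:
        result["html"] = body.decode("utf-8", errors="replace")
    return result
```

Under these assumptions, a CSV attachment would end up both on disk and in `html`, while a PDF would only appear in `downloaded_files`.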
unclecode
0c9e3c427e Update CONTRIBUTORS and PR-TODOLIST for batch 5 (15 PRs resolved)
Batch 5 merged: #1622, #1786, #1796, #1795, #1798, #1734, #1290, #1668
Closed as superseded: #1592
Closed as won't merge: #999, #1180, #1425, #1702, #1707, #1729
2026-03-07 08:49:32 +00:00
unclecode
31d0de23df Update PR-TODOLIST for batch 4 merge (10 PRs) and refresh open PR list
2026-03-07 06:50:26 +00:00
unclecode
fbc52813a4 Add tests, docs, and contributors for PRs #1463 and #1435
- Add tests for device_scale_factor (config + integration)
- Add tests for redirected_status_code (model + redirect + raw HTML)
- Document device_scale_factor in browser config docs and API reference
- Document redirected_status_code in crawler result docs and API reference
- Add TristanDonze and charlaie to CONTRIBUTORS.md
- Update PR-TODOLIST with session results
2026-02-06 09:30:19 +00:00
unclecode
719e83e105 Update PR todolist — refresh open PRs, add 6 new, classify
- Added PRs #475, #462, #416, #335, #332, #312
- Flagged #475 as duplicate of merged #1296
- Corrected author for #1450 (rbushri)
- Updated total count to ~63 open PRs
- Updated date to 2026-02-06
2026-02-06 09:06:13 +00:00
unclecode
ffd3face6b Remove duplicate PROMPT_EXTRACT_BLOCKS definition in prompts.py
The first definition (with tags/questions fields) was immediately
overwritten by the second simpler definition — pure dead code.
Removes 61 lines of unused prompt text.

Inspired by PR #931 (stevenaldinger).
2026-02-02 07:04:35 +00:00
unclecode
bb523b6c6c Merge PRs #1077, #1281 — bs4 deprecation and proxy auth fix
- PR #1077: Fix bs4 deprecation warning (text -> string)
- PR #1281: Fix proxy auth ERR_INVALID_AUTH_CREDENTIALS
- Comment on PR #1081 guiding author on needed DFS/BFF fixes
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 07:06:39 +00:00
unclecode
a56dd07559 Merge PRs #1667, #1296, #1364 — CLI deep-crawl, env var, script tags
- PR #1667: Fix deep-crawl CLI outputting only the first page
- PR #1296: Fix VersionManager ignoring CRAWL4_AI_BASE_DIRECTORY
- PR #1364: Fix script tag removal losing adjacent text
- Fix: restore .crawl4ai subfolder in VersionManager path
- Close #1150 (already fixed on develop)
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 06:53:53 +00:00
unclecode
dc4ae73221 Merge PRs #1714, #1721, #1719, #1717 and fix base tag pipeline
- PR #1714: Replace tf-playwright-stealth with playwright-stealth
- PR #1721: Respect <base> tag in html2text for relative links
- PR #1719: Include GoogleSearchCrawler script.js in package data
- PR #1717: Allow local embeddings by removing OpenAI fallback
- Fix: Extract <base href> from raw HTML before head gets stripped
- Close duplicates: #1703, #1698, #1697, #1710, #1720
- Update CONTRIBUTORS.md and PR-TODOLIST.md
2026-02-01 05:41:33 +00:00
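
The `<base>` handling mentioned in dc4ae73221 (PR #1721 and the follow-up fix) comes down to reading the base URL from the raw HTML before the head is discarded, then resolving relative links against it. A rough standard-library sketch of that idea follows; `BaseHrefFinder`, `effective_base_url`, and `resolve_link` are hypothetical names for illustration and do not reflect the project's actual pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Record the href of the first <base> tag seen in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base_url(page_url, raw_html):
    """Return the URL that relative links should be resolved against."""
    finder = BaseHrefFinder()
    finder.feed(raw_html)  # must run on the raw HTML, before <head> is stripped
    if finder.base_href:
        # A <base href> may itself be relative to the page URL.
        return urljoin(page_url, finder.base_href)
    return page_url


def resolve_link(page_url, raw_html, link):
    """Resolve a possibly relative link, honouring <base href> if present."""
    return urljoin(effective_base_url(page_url, raw_html), link)
```

For example, with a page at `https://example.com/docs/page.html` that declares `<base href="https://cdn.example.com/assets/">`, the relative link `img/a.png` would resolve under the CDN path rather than under `/docs/`.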