crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-11 00:08:01 +00:00

Author	SHA1	Message	Date
unclecode	72fd78e59e	chore: gitignore out/ local pipeline output	2026-06-02 05:14:11 +00:00
unclecode	7259d734a1	security(credits): credit IcySun & Yashon (co-reporter) per their request	2026-06-01 14:44:43 +00:00
unclecode	7b280d36b1	chore: gitignore .security/ - advisory payloads stay private, published via GHSA only (tags: v0.8.7, docker-rebuild-v0.8.7)	2026-06-01 14:09:23 +00:00
unclecode	4bf6071226	chore: bump version to 0.8.7	2026-06-01 14:04:51 +00:00
unclecode	30187e6dc7	docs: 0.8.7 release notes, changelog, README highlights; finalize security credits (q1uf3ng confirmed, Velayutham Selvaraj)	2026-06-01 14:04:11 +00:00
unclecode	d705d7c4f1	security(credits): acknowledge independent reporters Velayutham S and IcySun	2026-06-01 13:25:47 +00:00
unclecode	71e1667bd1	security(advisory): fold direct /crawl,/md,/llm SSRF + IPv6-mapped bypass into advisory-2; credit secsys_codex	2026-06-01 13:13:57 +00:00
unclecode	bd20700626	merge: domain mapper + 4 core fixes cherry-picked from cloud (OSS-safe, audited) for 0.8.7	2026-06-01 13:05:41 +00:00
unclecode	c06a3dcf56	fix(logger): default Console width=200 for non-TTY contexts Rich's Console() falls back to width=80 when stdout isn't a TTY (Docker container logs, CI, captured output). That truncates diagnostic lines that carry structured fields like "[ANTIBOT] tier N/M proxy=... status= validator=... reason=..." — exactly the information you want when debugging anti-bot escalation from logs alone. Setting width=200 only affects the non-TTY fallback; real terminals still auto-detect their actual width. Pure ergonomics improvement for cloud/container deployments where everyone reads logs via `docker logs` or log aggregators rather than interactively. (cherry picked from commit 398851562ca102d73f979d7ceab589079817cc0d)	2026-06-01 12:59:32 +00:00
unclecode	62c619d454	fix(prefetch): honor <head><base href> in quick_extract_links (#752) Prefetch link extraction (used by cloud scan/site) resolved every relative href against the page URL and ignored the <base> tag — so a page with <base href> mis-resolved its relative links, while the full LXML scraping path (non-prefetch) handled it correctly. Parse <head><base href> and urljoin it onto the page URL as the resolution base, mirroring LXMLWebScrapingStrategy. base_domain stays computed from the page origin so internal/external classification is unaffected. (cherry picked from commit 09fe40c606ea15704d3452dafbf9ad043a2f5ae6)	2026-06-01 12:59:00 +00:00
unclecode	7059170614	fix(async_webcrawler): skip block validator when downloaded_files is set The post-fallback final block-check ran the validator on `crawl_result.html` unconditionally. For binary downloads (PDFs, archives, executables), html is empty by design — the content lives in `downloaded_files`. The keyword detector sees `is_blocked(200, "")` → "Near-empty content (0 bytes) with HTTP 200" → false BLOCKED verdict → success flipped to False even though the file was retrieved cleanly. Reproduced consistently with federalregister.gov PDFs: * downloaded_files: ['<S3 url>'] (file is valid, opens fine) * success: False * error_message: "Blocked by anti-bot protection: keyword:Near-empty content (0 bytes) with HTTP 200" Fix: extend the validator-skip condition to include binary downloads. The validator only knows about HTML — when downloaded_files is populated, the response succeeded via a non-HTML path and the blocker check has no basis to fire. Sibling skips already in place: raw: URLs (caller-provided content) and fallback-fetch successes. Cloud-side bug report: unclecode/crawl4ai-cloud#710. (cherry picked from commit d364fa1435b41b6f21355fe068b4a0955fb25dd6)	2026-06-01 12:59:00 +00:00
unclecode	a615d697af	fix(async_webcrawler): success=True for binary downloads `crawl_result.success = bool(html)` flipped success to False whenever html was empty — including the case where a binary file (PDF, archive, executable) was downloaded successfully into `downloaded_files`. The sibling validator-skip patch in the prior commit handled the false "Blocked by anti-bot" error message, but `success` was being set to False before that, so the result still surfaced as a failed crawl even though the file was retrieved cleanly. Fix: success is True when html is non-empty OR `downloaded_files` is populated. Applies to both the live-fetch path (~line 527) and the cached path (~line 740) so a cached PDF replays correctly. After this + the validator-skip patch: * sync /v1/crawl PDF: success=True, error_message="", downloaded_files=[<S3 URL>] * async /v1/crawl/async PDF: same * Non-binary captcha pages still flag blocked (no regression) Cloud bug: unclecode/crawl4ai-cloud#710 + #711 (combined PDF cluster). (cherry picked from commit 09f898f410bf27235601b680f996d7b80afc96eb)	2026-06-01 12:58:30 +00:00
unclecode	858c827145	feat: add per-source timeout to DomainMapper Each discovery source now has its own timeout (source_timeout, default 30s). Slow sources (wayback, crt.sh) get killed individually instead of blocking the entire scan. Fast sources always return results. (cherry picked from commit 3fe7f3c29b4b0b2006d091d895ba6fdba1fc8f87)	2026-06-01 12:58:23 +00:00
unclecode	ed60c6628a	feat: add include_subdomains flag to DomainMapperConfig When include_subdomains=False, DomainMapper skips all subdomain discovery (crt.sh, DNS guessing, Wayback/CC host expansion) and only scans the exact domain provided. Default is True (existing behavior unchanged). Used by crawl4ai-cloud /v1/scan endpoint to give users control over subdomain discovery scope. (cherry picked from commit 4d60dba53cb6da1ad5e8261d0652ab8c771f8474)	2026-06-01 12:58:23 +00:00
unclecode	9d5bcf78e2	feat: Add DomainMapper for comprehensive domain URL discovery Add DomainMapper class that discovers all URLs under a domain using 8 sources: sitemap, Common Crawl, Wayback Machine, Certificate Transparency (crt.sh), path probing, robots.txt mining, RSS/Atom feeds, and homepage link extraction. Key features: - Subdomain discovery via crt.sh, Wayback, CC, and DNS guessing - Soft-404 detection: fingerprints SPA sites and filters fake pages - Per-host scanning with parallel execution across discovered hosts - URL normalization, deduplication, and source attribution - BM25 relevance scoring with head metadata extraction - Nonsense filter for static assets, webpack chunks, Wayback garbage For superdesign.dev: finds 171 URLs across 11 hosts in ~13s (vs 4 URLs from AsyncUrlSeeder) New files: - crawl4ai/domain_mapper.py (DomainMapper class) - crawl4ai/async_configs.py (DomainMapperConfig) - docs/md_v2/core/domain-mapping.md (documentation) - docs/examples/domain_mapper/domain_mapper_demo.py - 67 tests across unit/integration/adversarial/regression (cherry picked from commit 2d10534a8742177f1d5f521e3174ae66591d3533)	2026-06-01 12:58:23 +00:00
unclecode	fcaf08b3b3	merge: slot April 2026 security batch (Docker API vulns, SSRF, JWT, file-write, XSS, execute_js) into develop for 0.8.7	2026-06-01 12:40:37 +00:00
Nasrin	820cbb59f9	Merge pull request #1960 from NaabZer/bugfix/stealth-import-mismatch fix(stealth): use Stealth class from playwright-stealth 2.x	2026-05-25 12:50:56 +02:00
Nasrin	1ef552f351	Merge pull request #1967 from unclecode/fix/mcp-ensure-ascii-cjk-encoding Preserve native Unicode in MCP tool responses by disabling ASCII escaping	2026-05-25 12:26:37 +02:00
Nasrin	be71585239	Merge pull request #1969 from cgseyhan/fix/async-logger-stderr-mcp-1968 fix: route AsyncLogger output to stderr by default (fixes #1968)	2026-05-25 12:20:01 +02:00
Nasrin	532b105fc7	Merge pull request #1975 from nightcityblade/fix/issue-1973 Thanks @nightcityblade for the quick fix! Clean and straightforward CSS change.	2026-05-25 11:59:25 +02:00
nightcityblade	791aae3a21	fix: allow assistant toolbar to scroll Fixes unclecode/crawl4ai#1973	2026-05-20 23:07:47 +08:00
cemgo	944eb1e456	fix(logger): route AsyncLogger output to stderr by default MCP stdio transport uses stdout for JSON-RPC messages. AsyncLogger was writing Rich progress output to stdout (the default Console() target), which caused clients to receive garbled JSON and log lines interleaved in the same stream. Changes: - Pass stderr=True to Console() so all log output goes to stderr, which is the correct channel for library diagnostics and aligns with the behaviour of Python's own logging.StreamHandler. - Add an injectable console parameter so downstream wrappers (e.g. mcp-crawl4ai, FastMCP integrations) can override the target stream without monkey-patching. - Add import sys (used in docstring example). - Add tests/test_async_logger_stderr.py with 7 tests covering the default-to-stderr behaviour, custom console injection, verbose=False suppression, file logging, and an end-to-end MCP scenario. Fixes #1968	2026-05-14 14:13:30 +03:00
Soham Kukreti	76f56af2dd	fix: use ensure_ascii=False in MCP bridge json.dumps to preserve CJK characters	2026-05-13 14:10:19 +05:30
NaabZer	5568a9ad38	Change browser_adapter to use Stealth import instead of stealth_*	2026-05-07 18:10:40 +09:00
Nasrin	dfb525edec	Merge pull request #1951 from unclecode/fix/bedrock-provider-prefix Add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works	2026-05-06 10:37:05 +02:00
Nasrin	47a4c256c9	Merge pull request #1952 from hafezparast/fix/maysam-silent-scrape-failure-1949 fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949)	2026-05-05 07:52:44 +02:00
Nasrin	a45c678ee4	Merge pull request #1939 from unclecode/fix/preserve-tail-text-1938 fix: preserve .tail text when removing empty elements (#1938)	2026-05-05 07:46:47 +02:00
hafezparast	5e5519b1c6	fix: log failure reason before COMPLETE and fix misleading SCRAPE ✓ (#1949) Two issues caused silent COMPLETE ✗ with no diagnostic output: 1. When crawl_result.success=False (anti-bot detection, empty HTML, etc.), the error_message was set on the CrawlResult but never logged — users saw only [COMPLETE] ✗ with zero explanation. Fix: emit an [ERROR] log containing error_message before the COMPLETE line whenever success=False. 2. The SCRAPE log in aprocess_html always emitted success=True regardless of whether scraping produced any content. Fix: use bool(cleaned_html) so SCRAPE reflects the actual outcome. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 16:18:03 +08:00
Soham Kukreti	660f49c879	fix: add bedrock to PROVIDER_MODELS_PREFIXES so AWS credential auth works LLMConfig.__init__ checks PROVIDER_MODELS_PREFIXES when api_token=None. If the provider prefix isn't found, it silently falls through to the else branch and overwrites self.provider with DEFAULT_PROVIDER (openai/gpt-4o), meaning any bedrock/* model string was being replaced before the LLM call was even made. This broke supported Bedrock auth methods when api_token is not passed in the LLMConfig Only passing api_token=<bearer_token> explicitly worked, because the truthy api_token bypassed the prefix check entirely. Adding "bedrock": None to PROVIDER_MODELS_PREFIXES keeps self.provider intact so the correct Bedrock provider is used. The actual auth (SigV4 signing or Bearer header) is handled downstream based on what credentials are available in the environment.	2026-04-30 22:13:54 +05:30
Nasrin	388ce3f033	Merge pull request #1940 from hafezparast/fix/mermaid-sequence-fence-1043 fix: broaden mermaid SVG text extraction and prevent nested fences (#1043)	2026-04-30 13:34:00 +02:00
hafezparast	dba38c7886	fix: broaden mermaid SVG text extraction and prevent nested fences (#1043) Three improvements over the initial fix: 1. Add SVG text/tspan fallback for sequence, gantt, and git diagrams which don't use foreignObject / .nodeLabel spans but instead render labels via native SVG <text> elements. 2. Detect when a mermaid SVG is already wrapped in an outer <pre> (e.g. deepwiki.com) and replace it with a plain <span> instead of a new <pre> block, avoiding invalid nested markdown fences. 3. Add data-language attribute support to CustomHTML2Text so that pre[data-language=mermaid] emits a proper ```mermaid fence in the markdown output, instead of an unlabelled ``` block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 01:36:52 +08:00
Nasrin	4d139247b9	Merge pull request #1934 from hafezparast/fix/maysam-dispatcher-semaphore-count-1927 fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927)	2026-04-24 18:53:56 +02:00
ntohidi	e8f1af7c16	fix: preserve .tail text when removing empty elements (#1938) remove_empty_elements_fast() was dropping trailing text attached to elements via lxml .tail when removing empty elements. Now appends the tail to the previous sibling or parent before removal.	2026-04-24 18:51:24 +02:00
ntohidi	04985ea15e	docs: update arun() docstring to match CrawlResultContainer return type	2026-04-24 18:35:57 +02:00
Nasrin	35ee366e28	Merge pull request #1901 from hafezparast/fix/maysam-arun-type-hint-1898 fix: correct arun() return type annotation (#1898)	2026-04-24 18:33:41 +02:00
Nasrin	244fbf7b58	Merge pull request #1929 from atomic-carpenter/listen-on-all-addressess docker: listen on all addresses	2026-04-24 17:59:48 +02:00
hafezparast	4e72f31011	fix: use semaphore_count default of 10 to match CrawlerRunConfig default	2026-04-24 23:57:17 +08:00
Nasrin	d595679d25	Merge pull request #1925 from sevenmoonlightsteps/fix/docker-llm-table-extraction-allowlist fix: add LLMTableExtraction to Docker API deserialization allowlist	2026-04-24 17:11:56 +02:00
Nasrin	936e4470eb	Merge pull request #1845 from hafezparast/fix/maysam-mermaid-svg-text-1043 fix: preserve mermaid diagram text from SVGs during scraping (#1043)	2026-04-24 16:49:59 +02:00
Nasrin	5e56e34840	Merge pull request #1922 from unclecode/fix/deep-crawl-streaming-contextvar-1917 fix(deep-crawl): ContextVar crash in streaming deep crawl (#1917)	2026-04-24 16:36:08 +02:00
hafezparast	d3c92ee3df	fix: wire semaphore_count into auto-created MemoryAdaptiveDispatcher (#1927) When arun_many() created a MemoryAdaptiveDispatcher automatically, max_session_permit was always 20 (the class default), silently ignoring the user's semaphore_count setting in CrawlerRunConfig. Now reads semaphore_count from the primary config and passes it as max_session_permit. The max(1, ... or 5) guard handles zero/None/negative values safely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 10:59:52 +08:00
Atomic Carpenter	e06b19ca09	docker: listen on all addresses Fixes: https://github.com/unclecode/crawl4ai/issues/1928	2026-04-22 00:47:08 +02:00
unclecode	1e25edcb5c	fix(security): block IPv6-mapped IPv4 SSRF bypass Caught during internal review. `http://[::ffff:127.0.0.1]/` bypassed validate_webhook_url because getaddrinfo returns ::ffff:7f00:1, which is not in any IPv4 blocklist (127.0.0.0/8) nor IPv6 blocklist (::1/128). Fix: added _expand_ip_candidates() helper that unwraps IPv4 from IPv4-mapped (::ffff:X.Y.Z.W, via .ipv4_mapped) and IPv4-compatible (::X.Y.Z.W, via low-32-bits) IPv6 addresses. Blocklist now checks both the original IP and the unwrapped IPv4 form. Added 6 new TestIPv6MappedBypass tests covering: - Loopback, RFC 1918, link-local (cloud metadata) via ::ffff: mapping - IPv4-compatible variant (::127.0.0.1) - Regression test that plain ::1 still blocked Also updated stale test assertion in test_eval_security_adversarial: hasattr, type, __build_class__ were removed from hook builtins in batch 2 but the test still expected hasattr to remain. DO NOT PUSH until release day.	2026-04-20 10:10:59 +00:00
unclecode	f77c0a856f	fix(security): SSRF protection on all crawl/md/llm URL entry points Reported by secsys_codex (2026-04-18): /md, /crawl, /llm endpoints pass user URLs to crawler.arun() with no private IP validation. - Add validate_url_destination() to utils.py with opt-out via CRAWL4AI_ALLOW_INTERNAL_URLS=true env var for users who need to crawl internal services. - Integrate into validate_url_scheme() (covers all server.py endpoints). - Add validation at all 4 URL entry points in api.py (handle_llm_qa, handle_markdown_request, create_new_task, handle_crawl_request). - raw: URLs bypass check (inline HTML, no network fetch). - 16 adversarial + source coverage tests added. - secsys_codex added to SECURITY-CREDITS.md. DO NOT PUSH until release day.	2026-04-20 09:42:43 +00:00
unclecode	0e92b5e239	docs: add Privacy Policy, Terms of Service, and Support pages Add legal pages required for Google Workspace Marketplace listing verification. Pages cover the whole Crawl4AI Cloud business (OSS library, hosted API, dashboard, integrations, Workspace add-ons), not specific to any single product. - privacy.md: data collection, usage, retention, Workspace Limited Use - terms.md: account, billing, acceptable use, IP, governing law (SG) - support.md: email, docs, GitHub, Discord, security disclosure	2026-04-20 02:24:21 +00:00
Gab	c8c2dc319f	fix: add LLMTableExtraction to Docker API deserialization allowlist	2026-04-17 15:43:56 -04:00
Nasrin	4e86399bfa	Merge pull request #1913 from unclecode/fix/nlp-sentence-chunking-1909 fix(chunking): preserve sentence order in NlpSentenceChunking	2026-04-16 10:24:59 +02:00
ntohidi	3d4bda122a	fix(deep-crawl): use set(False) instead of reset(token) for ContextVar (#1917) ContextVar.reset(token) requires the same Context that created the token. When Starlette's StreamingResponse consumes the async generator in a different Task, the Context changes and reset() raises ValueError. Replaced with set(False) which works across context boundaries. Safe because deep_crawl_active is never nested — the guard on line 21 prevents re-entry.	2026-04-16 13:49:32 +08:00
ntohidi	7bfc547bce	fix: preserve rowspan/colspan in cleaned_html (#1920) Add rowspan and colspan to IMPORTANT_ATTRS so they survive attribute stripping in remove_unwanted_attributes_fast().	2026-04-16 12:42:36 +08:00
unclecode	c9914691db	chore: add GitHub Security Advisory payloads for release day	2026-04-15 06:08:42 +00:00
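
Commit 1e25edcb5c describes unwrapping IPv4 addresses embedded in IPv6 literals before the blocklist check. The sketch below is one way that idea might look using only the standard `ipaddress` module. The function names `expand_ip_candidates` and `is_blocked_destination` and the `BLOCKED_NETWORKS` list are illustrative assumptions, not the project's actual `_expand_ip_candidates()` helper or blocklist, whose bodies are not shown in this log.

```python
import ipaddress

# Illustrative blocklist only (assumption); the project's real list is not shown here.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),     # IPv4 loopback
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("169.254.0.0/16"),  # link-local (cloud metadata)
    ipaddress.ip_network("::1/128"),         # IPv6 loopback
]


def expand_ip_candidates(raw):
    """Return the parsed address plus any IPv4 address embedded in it."""
    ip = ipaddress.ip_address(raw)
    candidates = [ip]
    if isinstance(ip, ipaddress.IPv6Address):
        if ip.ipv4_mapped is not None:
            # IPv4-mapped form, ::ffff:X.Y.Z.W
            candidates.append(ip.ipv4_mapped)
        elif 1 < int(ip) < 2**32:
            # IPv4-compatible form, ::X.Y.Z.W (low 32 bits carry the IPv4 address).
            # :: and ::1 are skipped; ::1 is matched by its own blocklist entry.
            candidates.append(ipaddress.IPv4Address(int(ip)))
    return candidates


def is_blocked_destination(raw):
    """True if the address, or its unwrapped IPv4 form, falls in a blocked network."""
    return any(
        candidate in network
        for candidate in expand_ip_candidates(raw)
        for network in BLOCKED_NETWORKS
    )


for address in ("::ffff:127.0.0.1", "::127.0.0.1", "::1", "93.184.216.34"):
    print(address, is_blocked_destination(address))
```

Checking both the original and the unwrapped form matters because membership tests between an IPv6 address and an IPv4 network are always false in `ipaddress`, which is the gap the commit message describes.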
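
Commit 3d4bda122a replaces `ContextVar.reset(token)` with `set(False)` because the token is tied to the Context that created it. The following minimal sketch reproduces that behaviour with two copied contexts standing in for the producer and consumer tasks. The helper names and the two-context setup are assumptions made for illustration; only the variable name `deep_crawl_active` comes from the commit message.

```python
import contextvars

deep_crawl_active = contextvars.ContextVar("deep_crawl_active", default=False)


def clear_flag_with_reset(token):
    """Earlier approach: only works inside the Context that created the token."""
    try:
        deep_crawl_active.reset(token)
        return "reset ok"
    except ValueError:
        return "ValueError"


def clear_flag_with_set():
    """Approach described in the commit: set(False) needs no token."""
    deep_crawl_active.set(False)
    return deep_crawl_active.get()


# Two separate contexts stand in for the task that starts the deep crawl
# and the task that later consumes the streaming generator.
producer_ctx = contextvars.copy_context()
consumer_ctx = contextvars.copy_context()

token = producer_ctx.run(deep_crawl_active.set, True)
reset_outcome = consumer_ctx.run(clear_flag_with_reset, token)
set_outcome = consumer_ctx.run(clear_flag_with_set)
print(reset_outcome, set_outcome)
```

As the commit notes, dropping the token is only safe when the flag is never nested, since `set(False)` cannot restore an earlier `True` value the way `reset(token)` would.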


1529 Commits