crawl4ai/docs/md_v2 at develop - crawl4ai - Public git mirror

unclecode/crawl4ai

mirror of https://github.com/unclecode/crawl4ai.git synced 2026-06-10 07:48:50 +00:00

unclecode 9d5bcf78e2 feat: Add DomainMapper for comprehensive domain URL discovery

Add DomainMapper class that discovers all URLs under a domain using
8 sources: sitemap, Common Crawl, Wayback Machine, Certificate
Transparency (crt.sh), path probing, robots.txt mining, RSS/Atom
feeds, and homepage link extraction.

Key features:
- Subdomain discovery via crt.sh, Wayback Machine, Common Crawl, and DNS guessing
- Soft-404 detection: fingerprints SPA sites and filters fake pages
- Per-host scanning with parallel execution across discovered hosts
- URL normalization, deduplication, and source attribution
- BM25 relevance scoring with head metadata extraction
- Nonsense filter for static assets, webpack chunks, Wayback garbage
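
The normalization, deduplication, and source attribution step listed above can be illustrated with a minimal sketch. This is not the code in crawl4ai/domain_mapper.py: the function names `normalize_url` and `merge_sources` are hypothetical, and the normalization rules (lowercase host, drop fragment and default port, trim trailing slash) are assumptions chosen for illustration. It uses only the Python standard library.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical helper: the rules below are illustrative assumptions,
# not the rules DomainMapper actually applies.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop fragment and default port, trim trailing slash."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is explicit and not the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # The empty last element discards the fragment.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def merge_sources(found):
    """Deduplicate (source, url) pairs; map each normalized URL to its sorted sources."""
    merged = {}
    for source, url in found:
        merged.setdefault(normalize_url(url), set()).add(source)
    return {url: sorted(sources) for url, sources in merged.items()}


if __name__ == "__main__":
    hits = [
        ("sitemap", "https://example.com/docs/"),
        ("wayback", "https://EXAMPLE.com/docs#intro"),
        ("robots", "https://example.com/about"),
    ]
    for url, sources in merge_sources(hits).items():
        print(url, sources)
```

Keeping a set of source names per normalized URL means a URL reported by several discovery sources is stored once while the record of where it was seen is preserved.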

On superdesign.dev, it finds 171 URLs across 11 hosts in ~13s
(vs 4 URLs from AsyncUrlSeeder)

New files:
- crawl4ai/domain_mapper.py (DomainMapper class)
- crawl4ai/async_configs.py (DomainMapperConfig)
- docs/md_v2/core/domain-mapping.md (documentation)
- docs/examples/domain_mapper/domain_mapper_demo.py
- 67 tests across unit/integration/adversarial/regression

(cherry picked from commit 2d10534a8742177f1d5f521e3174ae66591d3533)

2026-06-01 12:58:23 +00:00

docs: modernize deprecated API usage across shipped docs (#1770)

2026-03-07 07:01:06 +00:00

fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)

2026-03-07 09:47:38 +00:00

fix: allow assistant toolbar to scroll

2026-05-20 23:07:47 +08:00

feat(docs): update documentation and disable Ask AI feature

2025-04-23 19:02:39 +08:00

docs: modernize deprecated API usage across shipped docs (#1770)

2026-03-07 07:01:06 +00:00

Merge branch 'vr0.4.3b2'

2025-01-22 20:51:46 +08:00

release: Crawl4AI v0.8.5

2026-03-16 18:46:05 +08:00

Release/v0.7.6 (#1556)

2025-10-22 20:41:06 +08:00

feat: Add DomainMapper for comprehensive domain URL discovery

2026-06-01 12:58:23 +00:00

fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)

2026-03-07 09:47:38 +00:00

feat(favicon): add new favicon images for improved branding

2025-05-17 19:03:51 +08:00

docs: modernize deprecated API usage across shipped docs (#1770)

2026-03-07 07:01:06 +00:00

feat: 🚀 Introduce revolutionary LLMTableExtraction with intelligent chunking for massive tables

2025-08-14 18:21:24 +08:00

feat(theme): enable dark color mode in mkdocs configuration

2025-05-16 21:44:56 +08:00

complete-sdk-reference.md

fix: batch fix for 10 open issues (#1520, #1489, #1374, #1424, #1183, #1354, #880, #1031, #1251, #1758)

2026-03-07 09:47:38 +00:00

CONTRIBUTING.md

Add contributing guide and update mkdocs navigation for community resources

2026-02-03 09:46:54 +01:00

favicon.ico

feat(favicon): add new favicon images for improved branding

2025-05-17 19:03:51 +08:00

index.md

Release/v0.7.8 (#1662)

2025-12-11 11:04:52 +01:00

privacy.md

docs: add Privacy Policy, Terms of Service, and Support pages

2026-04-20 02:24:21 +00:00

stats.md

Add stats dashboard page for LP summit

2026-02-24 12:58:34 +00:00

support.md

docs: add Privacy Policy, Terms of Service, and Support pages

2026-04-20 02:24:21 +00:00

terms.md

docs: add Privacy Policy, Terms of Service, and Support pages

2026-04-20 02:24:21 +00:00