mirror of
https://github.com/unclecode/crawl4ai.git
synced 2026-06-10 15:58:15 +00:00
Adaptive Crawling Examples
This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
Examples Overview
1. basic_usage.py
- Simple introduction to adaptive crawling
- Uses default statistical strategy
- Shows how to get crawl statistics and relevant content
2. embedding_strategy.py ⭐ NEW
- Demonstrates the embedding-based strategy for semantic understanding
- Shows query expansion and irrelevance detection
- Includes configuration for both local and API-based embeddings
3. embedding_vs_statistical.py ⭐ NEW
- Direct comparison between statistical and embedding strategies
- Helps you choose the right strategy for your use case
- Shows performance and accuracy trade-offs
4. embedding_configuration.py ⭐ NEW
- Advanced configuration options for embedding strategy
- Parameter tuning guide for different scenarios
- Examples for research, exploration, and quality-focused crawling
5. advanced_configuration.py
- Shows various configuration options for both strategies
- Demonstrates threshold tuning and performance optimization
6. custom_strategies.py
- How to implement your own crawling strategy
- Extends the base CrawlStrategy class
- Advanced use case for specialized requirements
7. export_import_kb.py
- Export crawled knowledge base to JSONL
- Import and continue crawling from saved state
- Useful for building persistent knowledge bases
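JSONL stores one JSON object per line, which is what makes it convenient for saving a knowledge base and loading it again later. The sketch below shows an export and import round trip using only the standard library. The record fields (url, content) and the helper names are hypothetical, chosen for illustration; they are not Crawl4AI's schema or API, so see export_import_kb.py for the actual calls.

```python
import json
import tempfile
from pathlib import Path


def export_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def import_jsonl(path):
    """Read records back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; a real knowledge base will carry different fields.
SAMPLE_RECORDS = [
    {"url": "https://example.com/a", "content": "async context managers"},
    {"url": "https://example.com/b", "content": "coroutines and tasks"},
]


def round_trip(records):
    """Export to a temporary file, then import it again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "knowledge_base.jsonl"
        export_jsonl(records, path)
        return import_jsonl(path)


if __name__ == "__main__":
    # The imported records should match what was exported.
    print(round_trip(SAMPLE_RECORDS) == SAMPLE_RECORDS)
```

Because each line is an independent record, a file like this can be appended to between crawl sessions without rewriting earlier entries.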
Quick Start
For your first adaptive crawling experience, run:
python basic_usage.py
To try the new embedding strategy with semantic understanding:
python embedding_strategy.py
To compare strategies and see which works best for your use case:
python embedding_vs_statistical.py
Strategy Selection Guide
Use Statistical Strategy (Default) When:
- Working with technical documentation
- Queries contain specific terms or code
- Speed is critical
- No API access available
Use Embedding Strategy When:
- Queries are conceptual or ambiguous
- Need semantic understanding beyond exact matches
- Want to detect irrelevant content
- Working with diverse content sources
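The criteria above can be read as a small decision procedure. The helper below is only an illustrative sketch: the function name, its parameters, and the order in which the criteria are checked are assumptions made here, not part of Crawl4AI.

```python
def suggest_strategy(speed_critical, has_embedding_backend, query_is_conceptual):
    """Return 'statistical' or 'embedding' following the guide's criteria.

    All three arguments are booleans supplied by the caller; this helper
    is hypothetical and does not inspect queries or the environment.
    """
    # Without a local model or API access, fall back to the default.
    if not has_embedding_backend:
        return "statistical"
    # The guide favors the statistical strategy when speed is critical.
    if speed_critical:
        return "statistical"
    # Conceptual or ambiguous queries benefit from semantic matching.
    return "embedding" if query_is_conceptual else "statistical"


if __name__ == "__main__":
    print(suggest_strategy(speed_critical=False,
                           has_embedding_backend=True,
                           query_is_conceptual=True))
```

When the choice is not clear-cut, running embedding_vs_statistical.py on a representative query is likely a better guide than any fixed rule.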
Requirements
- Crawl4AI installed
- For embedding strategy with local models: sentence-transformers
- For embedding strategy with OpenAI: Set OPENAI_API_KEY environment variable
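Before running the embedding examples, it can help to check which of these prerequisites are in place. The following is a minimal sketch using only the standard library; the function name embedding_backends is hypothetical, and the check only tests that the sentence_transformers package is importable and that OPENAI_API_KEY is non-empty, not that a model loads or that the key is valid.

```python
import importlib.util
import os


def embedding_backends(env=None):
    """Report which embedding prerequisites appear to be satisfied."""
    env = os.environ if env is None else env
    return {
        # find_spec returns None when the package cannot be imported.
        "local": importlib.util.find_spec("sentence_transformers") is not None,
        # A non-empty OPENAI_API_KEY is needed for OpenAI embeddings.
        "openai": bool(env.get("OPENAI_API_KEY")),
    }


if __name__ == "__main__":
    for backend, available in embedding_backends().items():
        print(f"{backend}: {'available' if available else 'missing'}")
```

If both entries report missing, the statistical strategy examples still run, since they need neither dependency.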