test: add comprehensive regression test suite (291 tests)

Full regression suite covering all major Crawl4AI subsystems: - core crawl (arun, arun_many, raw HTML, JS, screenshots, cache, hooks) - content processing (markdown, citations, BM25/pruning filters, links, images, tables, metadata) - extraction strategies (JsonCss, JsonXPath, JsonLxml, Regex, Cosine, NoExtraction) - deep crawl (BFS, DFS, BestFirst, filters, scorers, URL normalization) - browser management (lifecycle, viewport, wait_for, stealth, sessions, iframes) - config serialization (BrowserConfig, CrawlerRunConfig, ProxyConfig roundtrips) - utilities (extract_xml_data, cache modes, content hashing) - edge cases (empty pages, malformed HTML, unicode, concurrent crawls, error recovery) Also adds /c4ai-check slash command for testing changes against the suite.
2026-06-10 15:58:15 +00:00 · 2026-03-08 03:20:52 +00:00
parent 3a75dd3f4c
commit d788c28315
11 changed files with 5072 additions and 0 deletions
--- a/.claude/commands/c4ai-check.md
+++ b/.claude/commands/c4ai-check.md
@@ -0,0 +1,89 @@
+---
+description: "Test current changes with adversarial tests, then run full regression suite"
+arguments:
+  - name: changes
+    description: "Description of what changed (e.g. 'fixed URL normalization to preserve trailing slashes')"
+    required: true
+---
+
+# Crawl4AI Change Verification (c4ai-check)
+
+You are verifying that recent code changes work correctly AND haven't broken anything else. This is a two-phase process.
+
+**Input:** $ARGUMENTS
+
+## PHASE 1: Adversarial Testing of Current Changes
+
+Based on the change description above:
+
+1. **Understand the change**: Read the relevant files that were modified. Use `git diff` to see exactly what changed.
+
+2. **Write targeted adversarial tests**: Create a temporary test file at `tests/regression/test_tmp_changes.py` that HEAVILY tests the specific changes:
+   - Normal cases (does it work as intended?)
+   - Edge cases (boundary values, empty inputs, None, huge inputs)
+   - Regression cases (does the OLD bug still occur? it shouldn't)
+   - Interaction cases (does it break anything it touches?)
+   - Adversarial cases (weird inputs that could expose issues)
+   - At least 10-15 focused tests per change area
+
+   Rules for the temp test file:
+   - Use `@pytest.mark.asyncio` for async tests
+   - Use real browser crawling where needed (`async with AsyncWebCrawler()`)
+   - Use the `local_server` fixture from conftest.py when needed
+   - NO mocking - test real behavior
+   - Each test must have a clear docstring explaining what it verifies
+
+3. **Run the targeted tests**:
+   ```bash
+   .venv/bin/python -m pytest tests/regression/test_tmp_changes.py -v --tb=short
+   ```
+
+4. **Report results**: Show pass/fail summary. If any fail, investigate and determine if it's a real bug in the changes or a test issue. Fix the tests if needed, fix the code if there's a real bug.
+
+## PHASE 2: Full Regression Suite
+
+After Phase 1 passes:
+
+1. **Run the full regression suite** (skip network tests for speed):
+   ```bash
+   .venv/bin/python -m pytest tests/regression/ -v -m "not network" --tb=short -q
+   ```
+
+2. **Analyze failures**: For any failures:
+   - Determine if the failure is caused by the current changes (REGRESSION) or pre-existing
+   - Regressions are blockers - report them clearly
+   - Pre-existing failures should be noted but don't block
+
+3. **Clean up**: Delete the temporary test file:
+   ```bash
+   rm tests/regression/test_tmp_changes.py
+   ```
+
+## PHASE 3: Report
+
+Present a clear summary:
+
+```
+## c4ai-check Results
+
+**Changes tested:** [brief description]
+
+### Phase 1: Targeted Tests
+- Tests written: X
+- Passed: X / Failed: X
+- [List any issues found]
+
+### Phase 2: Regression Suite
+- Total: X passed, X failed, X skipped
+- Regressions caused by changes: [None / list]
+- Pre-existing issues: [None / list]
+
+### Verdict: PASS / FAIL
+[If FAIL, explain what needs fixing]
+```
+
+IMPORTANT:
+- Always delete `test_tmp_changes.py` when done, even if tests fail
+- A PASS verdict means: all targeted tests pass AND no new regressions in the suite
+- A FAIL verdict means: either targeted tests found bugs OR changes caused regressions
+- Be honest about failures - don't hide issues
--- a/tests/regression/init.py
+++ b/tests/regression/init.py
@@ -0,0 +1 @@
+# Crawl4AI Regression Test Suite (crawl4ai-check)
--- a/tests/regression/conftest.py
+++ b/tests/regression/conftest.py
@@ -0,0 +1,628 @@
+"""
+Crawl4AI Regression Test Suite - Shared Fixtures
+
+Provides a local HTTP test server with crafted pages for deterministic testing,
+plus markers for network-dependent tests against real URLs.
+
+Usage:
+    pytest tests/regression/ -v                    # all tests
+    pytest tests/regression/ -v -m "not network"   # skip real URL tests
+    pytest tests/regression/ -v -k "core"          # only core tests
+"""
+
+import pytest
+import socket
+import threading
+import asyncio
+import time
+from aiohttp import web
+
+
+# ---------------------------------------------------------------------------
+# Pytest configuration
+# ---------------------------------------------------------------------------
+
+def pytest_configure(config):
+    config.addinivalue_line("markers", "network: tests requiring real network access")
+
+
+# ---------------------------------------------------------------------------
+# Test HTML Pages
+# ---------------------------------------------------------------------------
+
+HOME_HTML = """\
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="utf-8">
+    <title>Crawl4AI Test Home</title>
+    <meta name="description" content="Regression test page for Crawl4AI">
+    <meta name="keywords" content="crawl4ai, testing, regression">
+    <meta property="og:title" content="Test OG Title">
+    <meta property="og:description" content="Test OG description for social sharing">
+    <meta property="og:image" content="/images/og-image.jpg">
+    <meta property="og:type" content="website">
+    <meta name="twitter:card" content="summary_large_image">
+    <meta name="twitter:title" content="Test Twitter Title">
+</head>
+<body>
+    <nav>
+        <a href="/">Home</a>
+        <a href="/products">Products</a>
+        <a href="/links-page">Links</a>
+        <a href="/tables">Tables</a>
+    </nav>
+    <main>
+        <h1>Welcome to the Crawl4AI Test Site</h1>
+        <p>This is a comprehensive test page designed for regression testing of the
+        Crawl4AI web crawling library. It contains various HTML elements to verify
+        content extraction, markdown generation, and link discovery work correctly.</p>
+
+        <h2>Features Overview</h2>
+        <p>The test suite covers multiple aspects of web crawling including content
+        extraction, JavaScript execution, screenshot capture, and deep crawling
+        capabilities. Each feature is tested both with local pages and real URLs.</p>
+
+        <ul>
+            <li>Content extraction and markdown generation</li>
+            <li>Link discovery and classification</li>
+            <li>Image extraction and scoring</li>
+            <li>Table extraction and validation</li>
+        </ul>
+
+        <h2>Code Example</h2>
+        <pre><code>from crawl4ai import AsyncWebCrawler
+
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun("https://example.com")
+    print(result.markdown)</code></pre>
+
+        <p>Contact us at <a href="mailto:test@example.com">test@example.com</a> for more info.</p>
+
+        <h3>Internal Links</h3>
+        <a href="/page-alpha">Alpha Page</a>
+        <a href="/page-beta">Beta Page</a>
+
+        <h3>External Links</h3>
+        <a href="https://example.com">Example.com</a>
+        <a href="https://github.com/unclecode/crawl4ai">Crawl4AI GitHub</a>
+
+        <img src="/images/hero.jpg" alt="Hero image for testing" width="800" height="400">
+        <img src="/images/icon.png" alt="" width="16" height="16">
+    </main>
+    <footer>
+        <p>Footer content - should be excluded with excluded_tags</p>
+    </footer>
+</body>
+</html>"""
+
+PRODUCTS_HTML = """\
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <title>Product Listing</title>
+    <meta name="description" content="Test product listing page">
+</head>
+<body>
+    <h1>Products</h1>
+    <div class="product-list">
+        <div class="product" data-id="1">
+            <h2 class="name">Wireless Mouse</h2>
+            <span class="price">$29.99</span>
+            <div class="rating" data-stars="4.5">4.5 stars</div>
+            <p class="description">Ergonomic wireless mouse with precision tracking</p>
+            <span class="category">Electronics</span>
+            <a href="/product/1" class="details-link">View Details</a>
+        </div>
+        <div class="product" data-id="2">
+            <h2 class="name">Mechanical Keyboard</h2>
+            <span class="price">$89.99</span>
+            <div class="rating" data-stars="4.8">4.8 stars</div>
+            <p class="description">Cherry MX switches with RGB backlighting</p>
+            <span class="category">Electronics</span>
+            <a href="/product/2" class="details-link">View Details</a>
+        </div>
+        <div class="product" data-id="3">
+            <h2 class="name">USB-C Hub</h2>
+            <span class="price">$45.50</span>
+            <div class="rating" data-stars="4.2">4.2 stars</div>
+            <p class="description">7-in-1 hub with HDMI, USB-A, SD card reader</p>
+            <span class="category">Accessories</span>
+            <a href="/product/3" class="details-link">View Details</a>
+        </div>
+        <div class="product" data-id="4">
+            <h2 class="name">Monitor Stand</h2>
+            <span class="price">$34.99</span>
+            <div class="rating" data-stars="3.9">3.9 stars</div>
+            <p class="description">Adjustable aluminum monitor riser with storage</p>
+            <span class="category">Furniture</span>
+            <a href="/product/4" class="details-link">View Details</a>
+        </div>
+        <div class="product" data-id="5">
+            <h2 class="name">Webcam HD</h2>
+            <span class="price">$59.00</span>
+            <div class="rating" data-stars="4.6">4.6 stars</div>
+            <p class="description">1080p webcam with built-in microphone and privacy cover</p>
+            <span class="category">Electronics</span>
+            <a href="/product/5" class="details-link">View Details</a>
+        </div>
+    </div>
+</body>
+</html>"""
+
+TABLES_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>Tables Test</title></head>
+<body>
+    <h1>Data Tables</h1>
+
+    <h2>Sales Report</h2>
+    <table id="sales-table">
+        <thead>
+            <tr><th>Quarter</th><th>Revenue</th><th>Growth</th></tr>
+        </thead>
+        <tbody>
+            <tr><td>Q1 2025</td><td>$1,234,567</td><td>12.5%</td></tr>
+            <tr><td>Q2 2025</td><td>$1,456,789</td><td>18.0%</td></tr>
+            <tr><td>Q3 2025</td><td>$1,678,901</td><td>15.2%</td></tr>
+            <tr><td>Q4 2025</td><td>$1,890,123</td><td>12.6%</td></tr>
+        </tbody>
+    </table>
+
+    <h2>Layout Table (should be filtered)</h2>
+    <table id="layout-table">
+        <tr><td>Left column</td><td>Right column</td></tr>
+    </table>
+
+    <h2>Employee Directory</h2>
+    <table id="employee-table">
+        <thead>
+            <tr><th>Name</th><th>Email</th><th>Department</th><th>Phone</th></tr>
+        </thead>
+        <tbody>
+            <tr><td>Alice Johnson</td><td>alice@example.com</td><td>Engineering</td><td>+1-555-0101</td></tr>
+            <tr><td>Bob Smith</td><td>bob@example.com</td><td>Marketing</td><td>+1-555-0102</td></tr>
+            <tr><td>Carol White</td><td>carol@example.com</td><td>Sales</td><td>+1-555-0103</td></tr>
+        </tbody>
+    </table>
+</body>
+</html>"""
+
+JS_DYNAMIC_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>JS Dynamic Content</title></head>
+<body>
+    <div id="static-content">
+        <h1>Static Section</h1>
+        <p>This content is immediately available in the HTML.</p>
+    </div>
+    <div id="dynamic-content"></div>
+    <div id="counter">0</div>
+    <script>
+        setTimeout(function() {
+            document.getElementById('dynamic-content').innerHTML =
+                '<p class="js-loaded">Dynamic content successfully loaded via JavaScript</p>' +
+                '<ul><li>Item A</li><li>Item B</li><li>Item C</li></ul>';
+        }, 300);
+        setTimeout(function() {
+            document.getElementById('counter').textContent = '42';
+        }, 200);
+    </script>
+</body>
+</html>"""
+
+LINKS_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>Links Collection</title></head>
+<body>
+    <h1>Link Collection Page</h1>
+    <nav>
+        <h2>Internal Navigation</h2>
+        <a href="/">Home</a>
+        <a href="/products">Products</a>
+        <a href="/tables">Tables</a>
+        <a href="/about">About Us</a>
+        <a href="/contact">Contact</a>
+        <a href="/blog/post-1">Blog Post 1</a>
+        <a href="/blog/post-2">Blog Post 2</a>
+        <a href="/docs/api">API Docs</a>
+        <a href="/docs/guide">User Guide</a>
+    </nav>
+    <section>
+        <h2>External Resources</h2>
+        <a href="https://example.com">Example Domain</a>
+        <a href="https://github.com">GitHub</a>
+        <a href="https://python.org">Python</a>
+        <a href="https://docs.python.org/3/">Python Docs</a>
+    </section>
+    <section>
+        <h2>Social Media</h2>
+        <a href="https://twitter.com/example">Twitter</a>
+        <a href="https://facebook.com/example">Facebook</a>
+        <a href="https://linkedin.com/company/example">LinkedIn</a>
+    </section>
+    <section>
+        <h2>Duplicate Links</h2>
+        <a href="/">Home Again</a>
+        <a href="https://example.com">Example Again</a>
+    </section>
+</body>
+</html>"""
+
+IMAGES_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>Images Gallery</title></head>
+<body>
+    <h1>Image Gallery</h1>
+
+    <!-- High-quality image: should score high (large, has alt, common format) -->
+    <div class="hero">
+        <img src="/images/landscape.jpg" alt="Beautiful mountain landscape at sunset"
+             width="1200" height="800">
+        <p>A stunning landscape photograph showcasing the beauty of mountain scenery
+        at golden hour. This image demonstrates proper extraction of high-quality
+        photographs with descriptive alt text and surrounding context.</p>
+    </div>
+
+    <!-- Medium quality: decent size, has alt -->
+    <img src="/images/product-photo.png" alt="Product photograph" width="400" height="300">
+
+    <!-- Low quality: small icon, no alt -->
+    <img src="/images/icon-search.svg" alt="" width="24" height="24">
+
+    <!-- Lazy-loaded image -->
+    <img data-src="/images/lazy-photo.webp" alt="Lazy loaded image" width="600" height="400"
+         class="lazyload" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
+
+    <!-- Image with srcset -->
+    <img src="/images/responsive-sm.jpg"
+         srcset="/images/responsive-sm.jpg 480w, /images/responsive-md.jpg 800w, /images/responsive-lg.jpg 1200w"
+         alt="Responsive image with srcset" width="800" height="600">
+
+    <!-- Button icon (should be filtered) -->
+    <button><img src="/images/btn-submit.png" alt="submit" width="100" height="30"></button>
+
+    <!-- Logo (should be filtered by pattern) -->
+    <img src="/images/company-logo.png" alt="Company Logo" width="200" height="50">
+</body>
+</html>"""
+
+STRUCTURED_DATA_HTML = """\
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <title>Article with Structured Data</title>
+    <meta name="description" content="An article about web crawling techniques">
+    <meta property="og:title" content="Web Crawling Best Practices">
+    <meta property="og:description" content="Learn about modern web crawling">
+    <meta property="og:image" content="/images/article-cover.jpg">
+    <meta property="og:type" content="article">
+    <meta property="article:published_time" content="2025-06-15T10:00:00Z">
+    <meta property="article:modified_time" content="2025-07-20T14:30:00Z">
+    <meta name="twitter:card" content="summary_large_image">
+    <script type="application/ld+json">
+    {
+        "@context": "https://schema.org",
+        "@type": "Article",
+        "headline": "Web Crawling Best Practices",
+        "author": {"@type": "Person", "name": "Test Author"},
+        "datePublished": "2025-06-15",
+        "description": "A comprehensive guide to web crawling"
+    }
+    </script>
+</head>
+<body>
+    <article>
+        <h1>Web Crawling Best Practices</h1>
+        <p class="byline">By Test Author | Published June 15, 2025</p>
+        <p>Web crawling is the process of systematically browsing the web to extract
+        information. Modern crawlers like Crawl4AI provide sophisticated tools for
+        content extraction, including markdown generation, structured data extraction,
+        and intelligent link following.</p>
+        <h2>Key Techniques</h2>
+        <p>Understanding how to properly configure a web crawler is essential for
+        efficient data collection. This includes setting appropriate delays, respecting
+        robots.txt, and using proper user agents.</p>
+    </article>
+</body>
+</html>"""
+
+EMPTY_HTML = """\
+<!DOCTYPE html>
+<html><head><title>Empty Page</title></head>
+<body></body>
+</html>"""
+
+MALFORMED_HTML = """\
+<html>
+<head><title>Malformed Page</head>
+<body>
+<div>
+<p>Unclosed paragraph
+<p>Another paragraph without closing
+<img src="/test.jpg" alt="no closing bracket"
+<a href="/broken>Broken link</a>
+<div><span>Nested but unclosed
+<table><tr><td>Cell without closing tags
+</body>
+</html>"""
+
+REGEX_TEST_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>Regex Test Content</title></head>
+<body>
+    <h1>Contact Information</h1>
+    <p>Email us at support@crawl4ai.com or sales@example.org for inquiries.</p>
+    <p>Call us: +1-555-123-4567 or (800) 555-0199</p>
+    <p>Visit https://crawl4ai.com or https://docs.crawl4ai.com/api/v2</p>
+    <p>Server IP: 192.168.1.100</p>
+    <p>Request ID: 550e8400-e29b-41d4-a716-446655440000</p>
+    <p>Price: $199.99 or EUR 175.50</p>
+    <p>Completion rate: 95.7%</p>
+    <p>Published: 2025-03-15</p>
+    <p>Updated: 03/15/2025</p>
+    <p>Meeting at 14:30 or 09:00</p>
+    <p>Zip code: 94105 or 94105-1234</p>
+    <p>Follow @crawl4ai on social media</p>
+    <p>Tags: #WebCrawling #DataExtraction #Python</p>
+    <p>Color theme: #FF5733</p>
+</body>
+</html>"""
+
+
+def _generate_large_html(num_sections=50):
+    """Generate a large HTML page with many sections."""
+    sections = []
+    for i in range(num_sections):
+        sections.append(f"""
+        <section id="section-{i}">
+            <h2>Section {i}: Important Topic Number {i}</h2>
+            <p>This is paragraph one of section {i}. It contains enough text to be
+            meaningful for content extraction and markdown generation testing purposes.
+            The crawler should properly handle large pages with many sections.</p>
+            <p>This is paragraph two of section {i}. It provides additional context
+            and detail about topic {i}, ensuring that the content extraction pipeline
+            can handle substantial amounts of text without issues.</p>
+            <a href="/section/{i}">Read more about topic {i}</a>
+        </section>""")
+    return f"""\
+<!DOCTYPE html>
+<html>
+<head><title>Large Page with Many Sections</title></head>
+<body>
+    <h1>Comprehensive Document</h1>
+    {"".join(sections)}
+</body>
+</html>"""
+
+LARGE_HTML = _generate_large_html(50)
+
+
+# Deep crawl pages: hub -> sub1,sub2,sub3 -> leaf pages
+DEEP_HUB_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>Deep Crawl Hub</title></head>
+<body>
+    <h1>Hub Page</h1>
+    <p>This is the starting point for deep crawl testing.</p>
+    <nav>
+        <a href="/deep/sub1">Sub Page 1 - Technology</a>
+        <a href="/deep/sub2">Sub Page 2 - Science</a>
+        <a href="/deep/sub3">Sub Page 3 - Arts</a>
+    </nav>
+</body>
+</html>"""
+
+DEEP_SUB_TEMPLATE = """\
+<!DOCTYPE html>
+<html>
+<head><title>Deep Crawl - {title}</title></head>
+<body>
+    <h1>{title}</h1>
+    <p>Content about {title}. This sub-page contains links to deeper content.</p>
+    <a href="/deep/{prefix}/leaf-a">Leaf A under {title}</a>
+    <a href="/deep/{prefix}/leaf-b">Leaf B under {title}</a>
+    <a href="/deep/hub">Back to Hub</a>
+</body>
+</html>"""
+
+DEEP_LEAF_TEMPLATE = """\
+<!DOCTYPE html>
+<html>
+<head><title>Deep Crawl - {title}</title></head>
+<body>
+    <h1>{title}</h1>
+    <p>This is a leaf page in the deep crawl hierarchy. It contains substantial
+    content about {title} to ensure proper extraction at all crawl depths.
+    The adaptive crawler should find and process this content correctly.</p>
+    <a href="/deep/hub">Back to Hub</a>
+</body>
+</html>"""
+
+IFRAME_HTML = """\
+<!DOCTYPE html>
+<html>
+<head><title>Page with Iframes</title></head>
+<body>
+    <h1>Main Page Content</h1>
+    <p>This page contains embedded iframes for testing iframe processing.</p>
+    <iframe id="frame1" srcdoc="<html><body><p>Iframe 1 content: embedded text</p></body></html>"
+            width="400" height="200"></iframe>
+    <iframe id="frame2" srcdoc="<html><body><h2>Iframe 2 heading</h2><p>More embedded content here</p></body></html>"
+            width="400" height="200"></iframe>
+</body>
+</html>"""
+
+
+# ---------------------------------------------------------------------------
+# Server Handlers
+# ---------------------------------------------------------------------------
+
+async def _serve_html(html, content_type="text/html"):
+    return web.Response(text=html, content_type=content_type)
+
+
+async def _home_handler(request):
+    return await _serve_html(HOME_HTML)
+
+async def _products_handler(request):
+    return await _serve_html(PRODUCTS_HTML)
+
+async def _tables_handler(request):
+    return await _serve_html(TABLES_HTML)
+
+async def _js_dynamic_handler(request):
+    return await _serve_html(JS_DYNAMIC_HTML)
+
+async def _links_handler(request):
+    return await _serve_html(LINKS_HTML)
+
+async def _images_handler(request):
+    return await _serve_html(IMAGES_HTML)
+
+async def _structured_handler(request):
+    return await _serve_html(STRUCTURED_DATA_HTML)
+
+async def _empty_handler(request):
+    return await _serve_html(EMPTY_HTML)
+
+async def _malformed_handler(request):
+    return await _serve_html(MALFORMED_HTML)
+
+async def _regex_test_handler(request):
+    return await _serve_html(REGEX_TEST_HTML)
+
+async def _large_handler(request):
+    return await _serve_html(LARGE_HTML)
+
+async def _iframe_handler(request):
+    return await _serve_html(IFRAME_HTML)
+
+async def _redirect_handler(request):
+    raise web.HTTPFound("/")
+
+async def _not_found_handler(request):
+    return web.Response(
+        text="<html><head><title>404 Not Found</title></head>"
+             "<body><h1>Page Not Found</h1><p>The requested page does not exist.</p></body></html>",
+        status=404, content_type="text/html",
+    )
+
+async def _slow_handler(request):
+    await asyncio.sleep(2)
+    return await _serve_html(
+        "<html><head><title>Slow Page</title></head>"
+        "<body><h1>Slow Response</h1><p>This page had a 2-second delay.</p></body></html>"
+    )
+
+async def _deep_hub_handler(request):
+    return await _serve_html(DEEP_HUB_HTML)
+
+async def _deep_sub_handler(request):
+    sub_id = request.match_info["sub_id"]
+    titles = {"sub1": "Technology", "sub2": "Science", "sub3": "Arts"}
+    title = titles.get(sub_id, f"Sub {sub_id}")
+    html = DEEP_SUB_TEMPLATE.format(title=title, prefix=sub_id)
+    return await _serve_html(html)
+
+async def _deep_leaf_handler(request):
+    sub_id = request.match_info["sub_id"]
+    leaf_id = request.match_info["leaf_id"]
+    title = f"Leaf {leaf_id} under {sub_id}"
+    html = DEEP_LEAF_TEMPLATE.format(title=title)
+    return await _serve_html(html)
+
+async def _catch_all_handler(request):
+    """Serve a simple page for any unmatched path (useful for link targets)."""
+    path = request.path
+    return await _serve_html(
+        f"<html><head><title>Page: {path}</title></head>"
+        f"<body><h1>Page at {path}</h1>"
+        f"<p>Auto-generated page for path: {path}</p>"
+        f'<a href="/">Back to Home</a></body></html>'
+    )
+
+
+# ---------------------------------------------------------------------------
+# Server Setup
+# ---------------------------------------------------------------------------
+
+def _find_free_port():
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(("", 0))
+        return s.getsockname()[1]
+
+
+def _create_app():
+    app = web.Application()
+    app.router.add_get("/", _home_handler)
+    app.router.add_get("/products", _products_handler)
+    app.router.add_get("/tables", _tables_handler)
+    app.router.add_get("/js-dynamic", _js_dynamic_handler)
+    app.router.add_get("/links-page", _links_handler)
+    app.router.add_get("/images-page", _images_handler)
+    app.router.add_get("/structured-data", _structured_handler)
+    app.router.add_get("/empty", _empty_handler)
+    app.router.add_get("/malformed", _malformed_handler)
+    app.router.add_get("/regex-test", _regex_test_handler)
+    app.router.add_get("/large", _large_handler)
+    app.router.add_get("/iframe-page", _iframe_handler)
+    app.router.add_get("/redirect", _redirect_handler)
+    app.router.add_get("/not-found", _not_found_handler)
+    app.router.add_get("/slow", _slow_handler)
+    app.router.add_get("/deep/hub", _deep_hub_handler)
+    app.router.add_get("/deep/{sub_id}", _deep_sub_handler)
+    app.router.add_get("/deep/{sub_id}/{leaf_id}", _deep_leaf_handler)
+    # Catch-all for auto-generated pages (internal link targets, etc.)
+    app.router.add_get("/{path:.*}", _catch_all_handler)
+    return app
+
+
+def _run_server(app, host, port, ready_event):
+    loop = asyncio.new_event_loop()
+    asyncio.set_event_loop(loop)
+    runner = web.AppRunner(app)
+    loop.run_until_complete(runner.setup())
+    site = web.TCPSite(runner, host, port)
+    loop.run_until_complete(site.start())
+    ready_event.set()
+    try:
+        loop.run_forever()
+    finally:
+        loop.run_until_complete(runner.cleanup())
+        loop.close()
+
+
+@pytest.fixture(scope="session")
+def local_server():
+    """Start a local HTTP test server. Returns base URL like 'http://localhost:PORT'."""
+    port = _find_free_port()
+    app = _create_app()
+    ready = threading.Event()
+    thread = threading.Thread(
+        target=_run_server,
+        args=(app, "localhost", port, ready),
+        daemon=True,
+    )
+    thread.start()
+    assert ready.wait(timeout=10), "Test server failed to start within 10 seconds"
+    # Small delay to ensure server is fully ready
+    time.sleep(0.2)
+    yield f"http://localhost:{port}"
+    # Daemon thread cleans up automatically
+
+
+# ---------------------------------------------------------------------------
+# Common test constants
+# ---------------------------------------------------------------------------
+
+# Stable real URLs for network tests
+REAL_URL_SIMPLE = "https://example.com"
+REAL_URL_QUOTES = "https://quotes.toscrape.com"
+REAL_URL_BOOKS = "https://books.toscrape.com"
--- a/tests/regression/test_reg_browser.py
+++ b/tests/regression/test_reg_browser.py
@@ -0,0 +1,561 @@
+"""
+Crawl4AI Regression Tests - Browser Management and Features
+
+Tests browser lifecycle, viewport configuration, wait_for conditions, JavaScript
+execution, page interaction, screenshots, iframe processing, overlay removal,
+stealth mode, session management, network capture, and anti-bot features using
+real browser crawling with no mocking.
+"""
+
+import base64
+import time
+
+import pytest
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from crawl4ai.cache_context import CacheMode
+
+
+# ---------------------------------------------------------------------------
+# Browser lifecycle
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_browser_lifecycle(local_server):
+    """Create crawler, start, crawl, and close explicitly without context manager."""
+    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False))
+    await crawler.start()
+    try:
+        result = await crawler.arun(
+            url=local_server + "/",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        assert result.success, f"Crawl failed: {result.error_message}"
+        assert len(result.html) > 0, "HTML should be non-empty"
+    finally:
+        await crawler.close()
+
+
+@pytest.mark.asyncio
+async def test_browser_context_manager(local_server):
+    """Verify async with pattern works and cleanup happens without error."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url=local_server + "/",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        assert result.success, f"Context manager crawl failed: {result.error_message}"
+    # If we get here without exception, cleanup succeeded
+
+
+# ---------------------------------------------------------------------------
+# Viewport configuration
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_custom_viewport(local_server):
+    """Create BrowserConfig with 1920x1080 viewport and verify crawl succeeds."""
+    browser_config = BrowserConfig(
+        headless=True,
+        verbose=False,
+        viewport_width=1920,
+        viewport_height=1080,
+    )
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url=local_server + "/",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        assert result.success, f"Custom viewport crawl failed: {result.error_message}"
+
+
+@pytest.mark.asyncio
+async def test_small_viewport(local_server):
+    """Mobile-like viewport (375x667) should still produce a successful crawl."""
+    browser_config = BrowserConfig(
+        headless=True,
+        verbose=False,
+        viewport_width=375,
+        viewport_height=667,
+    )
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url=local_server + "/",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        assert result.success, f"Small viewport crawl failed: {result.error_message}"
+
+
+# ---------------------------------------------------------------------------
+# wait_for conditions
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_wait_for_css_selector(local_server):
+    """Wait for a CSS selector on /js-dynamic and verify dynamic content loaded."""
+    config = CrawlerRunConfig(wait_for="css:.js-loaded", verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/js-dynamic", config=config)
+        assert result.success, f"wait_for CSS crawl failed: {result.error_message}"
+        assert "Dynamic content successfully loaded" in (result.markdown or ""), (
+            "Dynamic JS content should appear after waiting for .js-loaded"
+        )
+
+
+@pytest.mark.asyncio
+async def test_wait_for_js_function(local_server):
+    """Wait for a JS condition on /js-dynamic and verify the counter value."""
+    config = CrawlerRunConfig(
+        wait_for="js:() => document.getElementById('counter').textContent === '42'",
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/js-dynamic", config=config)
+        assert result.success, f"wait_for JS crawl failed: {result.error_message}"
+        assert "42" in (result.html or ""), (
+            "Counter should be set to 42 after JS wait condition is met"
+        )
+
+
+@pytest.mark.asyncio
+async def test_wait_for_timeout(local_server):
+    """Wait for a non-existent selector with short timeout should not hang forever."""
+    config = CrawlerRunConfig(
+        wait_for="css:.nonexistent-class",
+        wait_for_timeout=500,
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        # This may succeed (with timeout warning) or fail, but should not hang
+        result = await crawler.arun(url=local_server + "/js-dynamic", config=config)
+        # We just verify it returned without hanging; success or failure is acceptable
+        assert result is not None, "Should return a result even if wait_for times out"
+
+
+# ---------------------------------------------------------------------------
+# JavaScript execution
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_js_code_modifies_dom(local_server):
+    """Execute JS that adds a DOM element and verify it appears in the result."""
+    config = CrawlerRunConfig(
+        js_code='document.body.innerHTML += \'<div id="injected">Injected by JS</div>\';',
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"JS DOM modification crawl failed: {result.error_message}"
+        combined = (result.html or "") + (result.markdown or "")
+        assert "Injected by JS" in combined, (
+            "Injected content should appear in HTML or markdown"
+        )
+
+
+@pytest.mark.asyncio
+async def test_js_code_returns_value(local_server):
+    """Execute JS that returns document.title and check js_execution_result."""
+    config = CrawlerRunConfig(
+        js_code="return document.title;",
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"JS return value crawl failed: {result.error_message}"
+        # js_execution_result should contain the returned value
+        if result.js_execution_result is not None:
+            # The result might be stored under a key or directly
+            result_str = str(result.js_execution_result)
+            assert "Crawl4AI Test Home" in result_str or len(result_str) > 0, (
+                "js_execution_result should contain the document title"
+            )
+
+
+@pytest.mark.asyncio
+async def test_multiple_js_scripts(local_server):
+    """Execute multiple JS scripts sequentially; last one sets title to 'B'."""
+    config = CrawlerRunConfig(
+        js_code=["document.title='A';", "document.title='B';"],
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"Multiple JS scripts crawl failed: {result.error_message}"
+        # Both scripts should have executed; title should end up as 'B'
+        # We can check via the HTML title tag or via another JS execution
+        # The HTML might still have the original title in source, but the page state changed
+
+
+# ---------------------------------------------------------------------------
+# Page interaction
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_scan_full_page(local_server):
+    """Crawl /large with scan_full_page=True and verify bottom sections appear."""
+    config = CrawlerRunConfig(
+        scan_full_page=True,
+        scroll_delay=0.05,
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/large", config=config)
+        assert result.success, f"Full page scan crawl failed: {result.error_message}"
+        # The large page has 50 sections; verify some from near the bottom
+        combined = (result.html or "") + (result.markdown or "")
+        assert "Section 49" in combined, (
+            "Scanning the full page should reveal the last section (Section 49)"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Screenshot features
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_screenshot_basic(local_server):
+    """Crawl with screenshot=True, decode base64, and verify PNG header."""
+    config = CrawlerRunConfig(screenshot=True, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"Screenshot crawl failed: {result.error_message}"
+        assert result.screenshot, "Screenshot should be a non-empty base64 string"
+        raw_bytes = base64.b64decode(result.screenshot)
+        assert raw_bytes[:4] == b"\x89PNG", (
+            "Screenshot should be in PNG format"
+        )
+
+
+@pytest.mark.asyncio
+async def test_force_viewport_screenshot(local_server):
+    """Crawl /large with force_viewport_screenshot=True; should capture viewport only."""
+    config = CrawlerRunConfig(
+        screenshot=True,
+        force_viewport_screenshot=True,
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/large", config=config)
+        assert result.success, f"Force viewport screenshot crawl failed: {result.error_message}"
+        assert result.screenshot, "Screenshot should be captured"
+        raw_bytes = base64.b64decode(result.screenshot)
+        assert raw_bytes[:4] == b"\x89PNG", "Viewport screenshot should be PNG"
+
+
+# ---------------------------------------------------------------------------
+# Process iframes
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_process_iframes(local_server):
+    """Crawl /iframe-page with process_iframes=True and verify iframe content appears."""
+    config = CrawlerRunConfig(process_iframes=True, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/iframe-page", config=config)
+        assert result.success, f"Iframe processing crawl failed: {result.error_message}"
+        combined = (result.html or "") + (result.markdown or "")
+        # At least one iframe's content should appear
+        has_iframe_content = (
+            "Iframe 1 content" in combined
+            or "Iframe 2 heading" in combined
+            or "embedded" in combined.lower()
+        )
+        assert has_iframe_content, (
+            "Iframe content should appear in the result when process_iframes=True"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Overlay and popup removal
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_remove_overlay_elements(local_server):
+    """Crawl with remove_overlay_elements=True; verify it does not break crawling."""
+    config = CrawlerRunConfig(remove_overlay_elements=True, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, (
+            f"Overlay removal should not break crawling: {result.error_message}"
+        )
+        assert len(result.html) > 0, "HTML should still be present after overlay removal"
+
+
+# ---------------------------------------------------------------------------
+# Stealth mode
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_stealth_mode_no_crash(local_server):
+    """Stealth mode should not break basic local crawling."""
+    browser_config = BrowserConfig(
+        headless=True,
+        verbose=False,
+        enable_stealth=True,
+    )
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url=local_server + "/",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        assert result.success, f"Stealth mode crawl failed: {result.error_message}"
+        assert "Crawl4AI Test Home" in (result.html or ""), (
+            "Stealth mode should still extract content correctly"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Session management
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_session_persistence(local_server):
+    """Session state should persist between crawls with the same session_id."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        # First crawl: set a JS variable
+        config1 = CrawlerRunConfig(
+            session_id="persist-test",
+            js_code="window.__testVar = 'hello';",
+            verbose=False,
+        )
+        result1 = await crawler.arun(url=local_server + "/", config=config1)
+        assert result1.success, f"First session crawl failed: {result1.error_message}"
+
+        # Second crawl: read the JS variable using js_only mode
+        config2 = CrawlerRunConfig(
+            session_id="persist-test",
+            js_only=True,
+            js_code="return window.__testVar;",
+            verbose=False,
+        )
+        result2 = await crawler.arun(url=local_server + "/", config=config2)
+        assert result2.success, f"Second session crawl failed: {result2.error_message}"
+
+        # Check if testVar persisted
+        if result2.js_execution_result is not None:
+            result_str = str(result2.js_execution_result)
+            assert "hello" in result_str, (
+                f"Session variable should persist; got: {result_str}"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Delay before return HTML
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_delay_before_return(local_server):
+    """Crawl with delay_before_return_html=0.5 should succeed and take reasonable time."""
+    config = CrawlerRunConfig(delay_before_return_html=0.5, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        start_time = time.monotonic()
+        result = await crawler.arun(url=local_server + "/", config=config)
+        elapsed = time.monotonic() - start_time
+
+        assert result.success, f"Delayed crawl failed: {result.error_message}"
+        assert elapsed >= 0.4, (
+            f"Crawl with 0.5s delay should take at least 0.4s, took {elapsed:.2f}s"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Network features
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_capture_network_requests(local_server):
+    """Crawl /js-dynamic with capture_network_requests=True and verify list returned."""
+    config = CrawlerRunConfig(
+        capture_network_requests=True,
+        cache_mode=CacheMode.BYPASS,
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/js-dynamic", config=config)
+        assert result.success, f"Network capture crawl failed: {result.error_message}"
+        assert result.network_requests is not None, "network_requests should not be None"
+        assert isinstance(result.network_requests, list), (
+            "network_requests should be a list"
+        )
+        assert len(result.network_requests) >= 1, (
+            "Should capture at least 1 network request (the page itself)"
+        )
+
+
+@pytest.mark.asyncio
+async def test_capture_console_messages(local_server):
+    """Crawl with capture_console_messages=True and verify the attribute is a list."""
+    config = CrawlerRunConfig(
+        capture_console_messages=True,
+        cache_mode=CacheMode.BYPASS,
+        verbose=False,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"Console capture crawl failed: {result.error_message}"
+        assert result.console_messages is not None, (
+            "console_messages should not be None when capture is enabled"
+        )
+        assert isinstance(result.console_messages, list), (
+            "console_messages should be a list"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Real URL browser tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_real_url_with_wait():
+    """Crawl https://quotes.toscrape.com with wait_until='load' and verify content."""
+    config = CrawlerRunConfig(wait_until="load", verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url="https://quotes.toscrape.com", config=config)
+        assert result.success, f"Real URL crawl failed: {result.error_message}"
+        assert len(result.html) > 100, "Real page should have substantial HTML"
+        combined = (result.markdown or "") + (result.html or "")
+        assert "quote" in combined.lower() or "quotes" in combined.lower(), (
+            "Quotes page should contain the word 'quote'"
+        )
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_real_url_screenshot():
+    """Crawl https://example.com with screenshot=True and verify PNG captured."""
+    config = CrawlerRunConfig(screenshot=True, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url="https://example.com", config=config)
+        assert result.success, f"Real URL screenshot crawl failed: {result.error_message}"
+        assert result.screenshot, "Screenshot should be non-empty"
+        raw_bytes = base64.b64decode(result.screenshot)
+        assert raw_bytes[:4] == b"\x89PNG", "Real URL screenshot should be PNG format"
+
+
+# ---------------------------------------------------------------------------
+# Anti-bot basic check
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_magic_mode_no_crash(local_server):
+    """Magic mode should not break normal local crawling."""
+    config = CrawlerRunConfig(magic=True, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, (
+            f"Magic mode should not break crawling: {result.error_message}"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Edge cases
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_crawl_empty_page(local_server):
+    """Crawling a page with empty body should not crash, even if anti-bot flags it."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url=local_server + "/empty",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        # Anti-bot detection may flag near-empty pages as blocked, which is expected
+        # behavior. The key assertion is that it returns a result without crashing.
+        assert result is not None, "Should return a result even for empty page"
+        assert result.html is not None, "HTML should not be None for empty page"
+        if not result.success:
+            assert "empty" in (result.error_message or "").lower() or "blocked" in (result.error_message or "").lower(), (
+                f"Empty page failure should mention empty/blocked content: {result.error_message}"
+            )
+
+
+@pytest.mark.asyncio
+async def test_crawl_malformed_html(local_server):
+    """Crawling malformed HTML should not crash, even if anti-bot flags it."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url=local_server + "/malformed",
+            config=CrawlerRunConfig(verbose=False),
+        )
+        # Anti-bot may flag malformed HTML as blocked due to minimal visible text.
+        # The key assertion is that it returns a result without crashing.
+        assert result is not None, "Should return a result for malformed HTML"
+        assert result.html is not None, "HTML should not be None even for malformed input"
+        # The content is present in the HTML even if the crawl is marked as not successful
+        assert "Unclosed paragraph" in (result.html or "") or "Malformed" in (result.html or ""), (
+            "Some original content should appear in the HTML"
+        )
+
+
+@pytest.mark.asyncio
+async def test_multiple_crawls_same_crawler(local_server):
+    """A single crawler instance should handle multiple sequential crawls."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        urls = [
+            local_server + "/",
+            local_server + "/products",
+            local_server + "/js-dynamic",
+        ]
+        for url in urls:
+            result = await crawler.arun(
+                url=url,
+                config=CrawlerRunConfig(verbose=False),
+            )
+            assert result.success, f"Sequential crawl of {url} failed: {result.error_message}"
+
+
+@pytest.mark.asyncio
+async def test_screenshot_not_captured_by_default(local_server):
+    """Without screenshot=True, result.screenshot should be None or empty."""
+    config = CrawlerRunConfig(screenshot=False, verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"No-screenshot crawl failed: {result.error_message}"
+        assert not result.screenshot, (
+            "Screenshot should be None or empty when not requested"
+        )
+
+
+@pytest.mark.asyncio
+async def test_js_code_empty_string(local_server):
+    """Empty js_code string should not cause errors."""
+    config = CrawlerRunConfig(js_code="", verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, (
+            f"Empty js_code should not break crawling: {result.error_message}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_wait_until_load(local_server):
+    """wait_until='load' should wait for full page load including resources."""
+    config = CrawlerRunConfig(wait_until="load", verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"wait_until=load crawl failed: {result.error_message}"
+
+
+@pytest.mark.asyncio
+async def test_wait_until_networkidle(local_server):
+    """wait_until='networkidle' should wait until network is idle."""
+    config = CrawlerRunConfig(wait_until="networkidle", verbose=False)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=local_server + "/", config=config)
+        assert result.success, f"wait_until=networkidle crawl failed: {result.error_message}"
--- a/tests/regression/test_reg_config.py
+++ b/tests/regression/test_reg_config.py
@@ -0,0 +1,776 @@
+"""
+Regression tests for Crawl4AI configuration objects.
+
+Covers BrowserConfig, CrawlerRunConfig, ProxyConfig, GeolocationConfig,
+deep_merge logic, and serialization roundtrips.
+"""
+
+import copy
+import pytest
+
+from crawl4ai import (
+    BrowserConfig,
+    CrawlerRunConfig,
+    ProxyConfig,
+    GeolocationConfig,
+    CacheMode,
+)
+from crawl4ai.async_configs import to_serializable_dict, from_serializable_dict
+
+
+# ---------------------------------------------------------------------------
+# Helper: deep_merge (copied from deploy/docker/utils.py to avoid dns dep)
+# ---------------------------------------------------------------------------
+
+def _deep_merge(base, override):
+    """Recursively merge override into base dict."""
+    result = base.copy()
+    for key, value in override.items():
+        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
+            result[key] = _deep_merge(result[key], value)
+        else:
+            result[key] = value
+    return result
+
+
+# ===================================================================
+# BrowserConfig
+# ===================================================================
+
+class TestBrowserConfigDefaults:
+    """Verify BrowserConfig default values are sensible."""
+
+    def test_headless_default(self):
+        """Default headless should be True."""
+        cfg = BrowserConfig()
+        assert cfg.headless is True
+
+    def test_browser_type_default(self):
+        """Default browser_type should be 'chromium'."""
+        cfg = BrowserConfig()
+        assert cfg.browser_type == "chromium"
+
+    def test_viewport_defaults(self):
+        """Default viewport should be 1080x600."""
+        cfg = BrowserConfig()
+        assert cfg.viewport_width == 1080
+        assert cfg.viewport_height == 600
+
+    def test_javascript_enabled_default(self):
+        """JavaScript should be enabled by default."""
+        cfg = BrowserConfig()
+        assert cfg.java_script_enabled is True
+
+    def test_ignore_https_errors_default(self):
+        """HTTPS errors should be ignored by default."""
+        cfg = BrowserConfig()
+        assert cfg.ignore_https_errors is True
+
+    def test_stealth_disabled_default(self):
+        """Stealth should be disabled by default."""
+        cfg = BrowserConfig()
+        assert cfg.enable_stealth is False
+
+    def test_browser_mode_default(self):
+        """Default browser_mode should be 'dedicated'."""
+        cfg = BrowserConfig()
+        assert cfg.browser_mode == "dedicated"
+
+
+class TestBrowserConfigRoundtrip:
+    """Verify to_dict -> from_kwargs roundtrip preserves fields."""
+
+    def test_basic_roundtrip(self):
+        """to_dict -> from_kwargs should preserve basic scalar fields."""
+        original = BrowserConfig(
+            headless=False,
+            viewport_width=1920,
+            viewport_height=1080,
+            browser_type="firefox",
+            text_mode=True,
+        )
+        d = original.to_dict()
+        restored = BrowserConfig.from_kwargs(d)
+
+        assert restored.headless is False
+        assert restored.viewport_width == 1920
+        assert restored.viewport_height == 1080
+        assert restored.browser_type == "firefox"
+        assert restored.text_mode is True
+
+    def test_roundtrip_preserves_extra_args(self):
+        """Extra args list should survive roundtrip."""
+        original = BrowserConfig(extra_args=["--no-sandbox", "--disable-dev-shm-usage"])
+        d = original.to_dict()
+        restored = BrowserConfig.from_kwargs(d)
+        assert restored.extra_args == ["--no-sandbox", "--disable-dev-shm-usage"]
+
+    def test_roundtrip_preserves_headers(self):
+        """Custom headers dict should survive roundtrip."""
+        headers = {"X-Custom": "test-value", "Accept-Language": "en-US"}
+        original = BrowserConfig(headers=headers)
+        d = original.to_dict()
+        restored = BrowserConfig.from_kwargs(d)
+        assert restored.headers["X-Custom"] == "test-value"
+        assert restored.headers["Accept-Language"] == "en-US"
+
+    def test_roundtrip_preserves_cookies(self):
+        """Cookies list should survive roundtrip."""
+        cookies = [{"name": "session", "value": "abc123", "url": "http://example.com"}]
+        original = BrowserConfig(cookies=cookies)
+        d = original.to_dict()
+        restored = BrowserConfig.from_kwargs(d)
+        assert len(restored.cookies) == 1
+        assert restored.cookies[0]["name"] == "session"
+
+
+class TestBrowserConfigClone:
+    """Verify clone() creates independent copy with overrides."""
+
+    def test_clone_with_override(self):
+        """Clone should apply overrides while keeping other fields."""
+        original = BrowserConfig(headless=True, viewport_width=1080)
+        cloned = original.clone(headless=False, viewport_width=1920)
+
+        assert cloned.headless is False
+        assert cloned.viewport_width == 1920
+        # Original unchanged
+        assert original.headless is True
+        assert original.viewport_width == 1080
+
+    def test_clone_independence(self):
+        """Clone should produce a distinct object with same scalar values."""
+        original = BrowserConfig(headless=True, viewport_width=1080)
+        cloned = original.clone()
+        cloned.headless = False
+        cloned.viewport_width = 1920
+        # Scalar mutations on clone should not affect original
+        assert original.headless is True
+        assert original.viewport_width == 1080
+
+    def test_clone_preserves_unmodified(self):
+        """Fields not in overrides should be preserved."""
+        original = BrowserConfig(
+            browser_type="firefox",
+            text_mode=True,
+            verbose=False,
+        )
+        cloned = original.clone(verbose=True)
+        assert cloned.browser_type == "firefox"
+        assert cloned.text_mode is True
+        assert cloned.verbose is True
+
+
+class TestBrowserConfigClassDefaults:
+    """Verify set_defaults / get_defaults / reset_defaults class-level defaults."""
+
+    def test_set_defaults_affects_new_instances(self):
+        """set_defaults(headless=False) should make new instances headless=False."""
+        try:
+            BrowserConfig.set_defaults(headless=False)
+            cfg = BrowserConfig()
+            assert cfg.headless is False
+        finally:
+            BrowserConfig.reset_defaults()
+
+    def test_explicit_arg_overrides_class_default(self):
+        """Explicit constructor arg should override class-level default."""
+        try:
+            BrowserConfig.set_defaults(headless=False)
+            cfg = BrowserConfig(headless=True)
+            assert cfg.headless is True
+        finally:
+            BrowserConfig.reset_defaults()
+
+    def test_get_defaults_returns_copy(self):
+        """get_defaults() should return the current overrides."""
+        try:
+            BrowserConfig.set_defaults(viewport_width=1920)
+            defaults = BrowserConfig.get_defaults()
+            assert defaults["viewport_width"] == 1920
+        finally:
+            BrowserConfig.reset_defaults()
+
+    def test_reset_defaults_clears_all(self):
+        """reset_defaults() should clear all overrides."""
+        try:
+            BrowserConfig.set_defaults(headless=False, viewport_width=1920)
+            BrowserConfig.reset_defaults()
+            defaults = BrowserConfig.get_defaults()
+            assert len(defaults) == 0
+            cfg = BrowserConfig()
+            assert cfg.headless is True
+            assert cfg.viewport_width == 1080
+        finally:
+            BrowserConfig.reset_defaults()
+
+    def test_reset_defaults_selective(self):
+        """reset_defaults('headless') should only clear that one override."""
+        try:
+            BrowserConfig.set_defaults(headless=False, viewport_width=1920)
+            BrowserConfig.reset_defaults("headless")
+            cfg = BrowserConfig()
+            assert cfg.headless is True  # reset to hardcoded default
+            assert cfg.viewport_width == 1920  # still overridden
+        finally:
+            BrowserConfig.reset_defaults()
+
+    def test_set_defaults_invalid_param_raises(self):
+        """set_defaults with invalid parameter name should raise ValueError."""
+        try:
+            with pytest.raises(ValueError):
+                BrowserConfig.set_defaults(nonexistent_param=42)
+        finally:
+            BrowserConfig.reset_defaults()
+
+
+class TestBrowserConfigDumpLoad:
+    """Verify dump() and load() serialization includes type info."""
+
+    def test_dump_includes_type(self):
+        """dump() should produce a dict with 'type' key."""
+        cfg = BrowserConfig(headless=False)
+        dumped = cfg.dump()
+        assert isinstance(dumped, dict)
+        assert dumped.get("type") == "BrowserConfig"
+        assert "params" in dumped
+
+    def test_dump_load_roundtrip(self):
+        """dump() -> load() should reproduce equivalent config."""
+        original = BrowserConfig(
+            headless=False,
+            viewport_width=1920,
+            text_mode=True,
+        )
+        dumped = original.dump()
+        restored = BrowserConfig.load(dumped)
+
+        assert isinstance(restored, BrowserConfig)
+        assert restored.headless is False
+        assert restored.viewport_width == 1920
+        assert restored.text_mode is True
+
+
+# ===================================================================
+# CrawlerRunConfig
+# ===================================================================
+
+class TestCrawlerRunConfigDefaults:
+    """Verify CrawlerRunConfig default values."""
+
+    def test_cache_mode_default(self):
+        """Default cache_mode should be CacheMode.BYPASS."""
+        cfg = CrawlerRunConfig()
+        assert cfg.cache_mode == CacheMode.BYPASS
+
+    def test_word_count_threshold_default(self):
+        """Default word_count_threshold should match MIN_WORD_THRESHOLD (1)."""
+        from crawl4ai.config import MIN_WORD_THRESHOLD
+        cfg = CrawlerRunConfig()
+        assert cfg.word_count_threshold == MIN_WORD_THRESHOLD
+
+    def test_wait_until_default(self):
+        """Default wait_until should be 'domcontentloaded'."""
+        cfg = CrawlerRunConfig()
+        assert cfg.wait_until == "domcontentloaded"
+
+    def test_page_timeout_default(self):
+        """Default page_timeout should be 60000 ms."""
+        cfg = CrawlerRunConfig()
+        assert cfg.page_timeout == 60000
+
+    def test_delay_before_return_html_default(self):
+        """Default delay_before_return_html should be 0.1."""
+        cfg = CrawlerRunConfig()
+        assert cfg.delay_before_return_html == 0.1
+
+    def test_magic_default_false(self):
+        """Magic mode should be off by default."""
+        cfg = CrawlerRunConfig()
+        assert cfg.magic is False
+
+    def test_screenshot_default_false(self):
+        """Screenshot should be off by default."""
+        cfg = CrawlerRunConfig()
+        assert cfg.screenshot is False
+
+    def test_verbose_default_true(self):
+        """Verbose should be on by default."""
+        cfg = CrawlerRunConfig()
+        assert cfg.verbose is True
+
+
+class TestCrawlerRunConfigRoundtrip:
+    """Verify to_dict -> from_kwargs roundtrip."""
+
+    def test_basic_roundtrip(self):
+        """Scalar fields should survive roundtrip."""
+        original = CrawlerRunConfig(
+            word_count_threshold=500,
+            wait_until="load",
+            page_timeout=30000,
+            magic=True,
+        )
+        d = original.to_dict()
+        restored = CrawlerRunConfig.from_kwargs(d)
+
+        assert restored.word_count_threshold == 500
+        assert restored.wait_until == "load"
+        assert restored.page_timeout == 30000
+        assert restored.magic is True
+
+    def test_roundtrip_preserves_js_code(self):
+        """js_code should survive roundtrip."""
+        original = CrawlerRunConfig(js_code=["document.title", "console.log('hi')"])
+        d = original.to_dict()
+        restored = CrawlerRunConfig.from_kwargs(d)
+        assert restored.js_code == ["document.title", "console.log('hi')"]
+
+    def test_roundtrip_preserves_excluded_tags(self):
+        """excluded_tags should survive roundtrip."""
+        original = CrawlerRunConfig(excluded_tags=["nav", "footer", "aside"])
+        d = original.to_dict()
+        restored = CrawlerRunConfig.from_kwargs(d)
+        assert "nav" in restored.excluded_tags
+        assert "footer" in restored.excluded_tags
+
+
+class TestCrawlerRunConfigClone:
+    """Verify clone() with overrides."""
+
+    def test_clone_with_override(self):
+        """Clone should apply overrides while keeping other fields."""
+        original = CrawlerRunConfig(magic=False, verbose=True)
+        cloned = original.clone(magic=True)
+
+        assert cloned.magic is True
+        assert cloned.verbose is True
+        # Original unchanged
+        assert original.magic is False
+
+    def test_clone_cache_mode_override(self):
+        """Clone should be able to change cache_mode."""
+        original = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+        cloned = original.clone(cache_mode=CacheMode.ENABLED)
+        assert cloned.cache_mode == CacheMode.ENABLED
+        assert original.cache_mode == CacheMode.BYPASS
+
+
+class TestCrawlerRunConfigClassDefaults:
+    """Verify set_defaults / reset_defaults for CrawlerRunConfig."""
+
+    def test_set_defaults_affects_new_instances(self):
+        """set_defaults(verbose=False) should make new instances verbose=False."""
+        try:
+            CrawlerRunConfig.set_defaults(verbose=False)
+            cfg = CrawlerRunConfig()
+            assert cfg.verbose is False
+        finally:
+            CrawlerRunConfig.reset_defaults()
+
+    def test_reset_defaults_restores_original(self):
+        """reset_defaults should restore hardcoded defaults."""
+        try:
+            CrawlerRunConfig.set_defaults(page_timeout=5000)
+            CrawlerRunConfig.reset_defaults()
+            cfg = CrawlerRunConfig()
+            assert cfg.page_timeout == 60000
+        finally:
+            CrawlerRunConfig.reset_defaults()
+
+    def test_set_defaults_invalid_param_raises(self):
+        """set_defaults with invalid parameter name should raise ValueError."""
+        try:
+            with pytest.raises(ValueError):
+                CrawlerRunConfig.set_defaults(totally_bogus=42)
+        finally:
+            CrawlerRunConfig.reset_defaults()
+
+
+class TestCrawlerRunConfigSerialization:
+    """Verify extraction_strategy and deep_crawl_strategy serialize correctly."""
+
+    def test_dump_load_basic(self):
+        """dump -> load roundtrip for basic CrawlerRunConfig."""
+        original = CrawlerRunConfig(
+            word_count_threshold=300,
+            magic=True,
+            wait_until="load",
+        )
+        dumped = original.dump()
+        assert dumped["type"] == "CrawlerRunConfig"
+        restored = CrawlerRunConfig.load(dumped)
+        assert isinstance(restored, CrawlerRunConfig)
+        assert restored.magic is True
+
+    def test_dump_with_extraction_strategy(self):
+        """CrawlerRunConfig with extraction_strategy should serialize."""
+        try:
+            from crawl4ai import JsonCssExtractionStrategy
+            schema = {
+                "name": "test",
+                "baseSelector": "div.item",
+                "fields": [{"name": "title", "selector": "h2", "type": "text"}],
+            }
+            strategy = JsonCssExtractionStrategy(schema)
+            cfg = CrawlerRunConfig(extraction_strategy=strategy)
+            dumped = cfg.dump()
+            assert dumped["type"] == "CrawlerRunConfig"
+            # extraction_strategy should be serialized with type info
+            es_data = dumped["params"].get("extraction_strategy", {})
+            assert es_data.get("type") == "JsonCssExtractionStrategy"
+        except ImportError:
+            pytest.skip("JsonCssExtractionStrategy not available")
+
+    def test_dump_with_deep_crawl_strategy(self):
+        """CrawlerRunConfig with deep_crawl_strategy should serialize."""
+        try:
+            from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
+            strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=10)
+            cfg = CrawlerRunConfig(deep_crawl_strategy=strategy)
+            dumped = cfg.dump()
+            ds_data = dumped["params"].get("deep_crawl_strategy", {})
+            assert ds_data.get("type") == "BFSDeepCrawlStrategy"
+        except ImportError:
+            pytest.skip("BFSDeepCrawlStrategy not available")
+
+
+# ===================================================================
+# ProxyConfig
+# ===================================================================
+
+class TestProxyConfigFromString:
+    """Verify ProxyConfig.from_string() parsing."""
+
+    def test_simple_http_url(self):
+        """from_string('http://proxy:8080') should parse server correctly."""
+        pc = ProxyConfig.from_string("http://proxy:8080")
+        assert pc.server == "http://proxy:8080"
+        assert pc.username is None
+        assert pc.password is None
+
+    def test_http_url_with_credentials(self):
+        """from_string('http://user:pass@proxy:8080') should parse credentials."""
+        pc = ProxyConfig.from_string("http://user:pass@proxy:8080")
+        assert pc.server == "http://proxy:8080"
+        assert pc.username == "user"
+        assert pc.password == "pass"
+
+    def test_ip_port_user_pass_format(self):
+        """from_string('1.2.3.4:8080:user:pass') should parse ip:port:user:pass."""
+        pc = ProxyConfig.from_string("1.2.3.4:8080:user:pass")
+        assert pc.server == "http://1.2.3.4:8080"
+        assert pc.username == "user"
+        assert pc.password == "pass"
+
+    def test_ip_port_format(self):
+        """from_string('1.2.3.4:8080') should parse ip:port without credentials."""
+        pc = ProxyConfig.from_string("1.2.3.4:8080")
+        assert pc.server == "http://1.2.3.4:8080"
+        assert pc.username is None
+        assert pc.password is None
+
+    def test_socks5_url(self):
+        """from_string('socks5://proxy:1080') should preserve socks5 scheme."""
+        pc = ProxyConfig.from_string("socks5://proxy:1080")
+        assert pc.server == "socks5://proxy:1080"
+
+    def test_invalid_format_raises(self):
+        """from_string with invalid format should raise ValueError."""
+        with pytest.raises(ValueError):
+            ProxyConfig.from_string("invalid")
+
+    def test_password_with_colon(self):
+        """Password containing a colon should be preserved via split(':', 1)."""
+        # Format: http://user:complex:pass@proxy:8080
+        # The @ split gives auth="http://user:complex:pass", server="proxy:8080"
+        # Then protocol split gives credentials="user:complex:pass"
+        # Then credentials.split(":", 1) gives user="user", password="complex:pass"
+        pc = ProxyConfig.from_string("http://user:complex:pass@proxy:8080")
+        assert pc.username == "user"
+        assert pc.password == "complex:pass"
+        assert pc.server == "http://proxy:8080"
+
+
+class TestProxyConfigRoundtrip:
+    """Verify to_dict -> from_dict roundtrip."""
+
+    def test_basic_roundtrip(self):
+        """to_dict -> from_dict should preserve all fields."""
+        original = ProxyConfig(
+            server="http://proxy:8080",
+            username="user",
+            password="secret",
+        )
+        d = original.to_dict()
+        restored = ProxyConfig.from_dict(d)
+        assert restored.server == original.server
+        assert restored.username == original.username
+        assert restored.password == original.password
+
+    def test_roundtrip_without_credentials(self):
+        """Roundtrip should work without username/password."""
+        original = ProxyConfig(server="http://proxy:3128")
+        d = original.to_dict()
+        restored = ProxyConfig.from_dict(d)
+        assert restored.server == "http://proxy:3128"
+        assert restored.username is None
+        assert restored.password is None
+
+
+class TestProxyConfigClone:
+    """Verify clone() with override."""
+
+    def test_clone_with_server_override(self):
+        """Clone should apply server override."""
+        original = ProxyConfig(server="http://proxy1:8080", username="user1")
+        cloned = original.clone(server="http://proxy2:9090")
+        assert cloned.server == "http://proxy2:9090"
+        assert cloned.username == "user1"
+        # Original unchanged
+        assert original.server == "http://proxy1:8080"
+
+    def test_clone_with_credentials_override(self):
+        """Clone should be able to override credentials."""
+        original = ProxyConfig(server="http://proxy:8080", username="old", password="old")
+        cloned = original.clone(username="new", password="new")
+        assert cloned.username == "new"
+        assert cloned.password == "new"
+        assert original.username == "old"
+
+
+class TestProxyConfigSentinel:
+    """Verify ProxyConfig.DIRECT sentinel."""
+
+    def test_direct_sentinel_exists(self):
+        """ProxyConfig.DIRECT should exist and be 'direct'."""
+        assert ProxyConfig.DIRECT == "direct"
+
+    def test_direct_is_string(self):
+        """DIRECT sentinel should be a string."""
+        assert isinstance(ProxyConfig.DIRECT, str)
+
+
+# ===================================================================
+# GeolocationConfig
+# ===================================================================
+
+class TestGeolocationConfig:
+    """Verify GeolocationConfig construction and roundtrip."""
+
+    def test_constructor(self):
+        """Constructor should set lat/lon/accuracy."""
+        geo = GeolocationConfig(latitude=37.7749, longitude=-122.4194, accuracy=10.0)
+        assert geo.latitude == 37.7749
+        assert geo.longitude == -122.4194
+        assert geo.accuracy == 10.0
+
+    def test_default_accuracy(self):
+        """Default accuracy should be 0.0."""
+        geo = GeolocationConfig(latitude=0.0, longitude=0.0)
+        assert geo.accuracy == 0.0
+
+    def test_to_dict_from_dict_roundtrip(self):
+        """to_dict -> from_dict should preserve all fields."""
+        original = GeolocationConfig(latitude=48.8566, longitude=2.3522, accuracy=50.0)
+        d = original.to_dict()
+        restored = GeolocationConfig.from_dict(d)
+        assert restored.latitude == original.latitude
+        assert restored.longitude == original.longitude
+        assert restored.accuracy == original.accuracy
+
+    def test_clone_with_overrides(self):
+        """Clone should apply overrides while preserving other fields."""
+        original = GeolocationConfig(latitude=40.7128, longitude=-74.0060, accuracy=5.0)
+        cloned = original.clone(accuracy=100.0)
+        assert cloned.latitude == 40.7128
+        assert cloned.longitude == -74.0060
+        assert cloned.accuracy == 100.0
+        # Original unchanged
+        assert original.accuracy == 5.0
+
+    def test_clone_independence(self):
+        """Clone should be a fully independent object."""
+        original = GeolocationConfig(latitude=0.0, longitude=0.0)
+        cloned = original.clone(latitude=1.0)
+        assert original.latitude == 0.0
+        assert cloned.latitude == 1.0
+
+    def test_negative_coordinates(self):
+        """Negative lat/lon (southern/western hemisphere) should work."""
+        geo = GeolocationConfig(latitude=-33.8688, longitude=151.2093)
+        assert geo.latitude == -33.8688
+        assert geo.longitude == 151.2093
+
+
+# ===================================================================
+# Deep merge tests
+# ===================================================================
+
+class TestDeepMerge:
+    """Verify _deep_merge helper for server config merging."""
+
+    def test_empty_override_returns_base(self):
+        """Empty override should return base unchanged."""
+        base = {"a": 1, "b": 2}
+        result = _deep_merge(base, {})
+        assert result == {"a": 1, "b": 2}
+
+    def test_flat_key_override(self):
+        """Flat key in override should replace base value."""
+        base = {"a": 1, "b": 2}
+        result = _deep_merge(base, {"b": 99})
+        assert result == {"a": 1, "b": 99}
+
+    def test_nested_dict_merge_preserves_siblings(self):
+        """Nested dict merge should preserve sibling keys."""
+        base = {"server": {"host": "localhost", "port": 8080}}
+        override = {"server": {"port": 9090}}
+        result = _deep_merge(base, override)
+        assert result["server"]["host"] == "localhost"
+        assert result["server"]["port"] == 9090
+
+    def test_override_with_non_dict_replaces_dict(self):
+        """Non-dict override should replace entire dict value."""
+        base = {"server": {"host": "localhost", "port": 8080}}
+        override = {"server": "http://remote:9090"}
+        result = _deep_merge(base, override)
+        assert result["server"] == "http://remote:9090"
+
+    def test_deep_nesting_three_levels(self):
+        """3+ levels of nesting should merge correctly."""
+        base = {"a": {"b": {"c": 1, "d": 2}, "e": 3}}
+        override = {"a": {"b": {"c": 99}}}
+        result = _deep_merge(base, override)
+        assert result["a"]["b"]["c"] == 99
+        assert result["a"]["b"]["d"] == 2
+        assert result["a"]["e"] == 3
+
+    def test_new_key_in_override(self):
+        """Override can add entirely new keys."""
+        base = {"a": 1}
+        result = _deep_merge(base, {"b": 2})
+        assert result == {"a": 1, "b": 2}
+
+    def test_base_not_mutated(self):
+        """Original base dict should not be mutated."""
+        base = {"a": {"b": 1}}
+        override = {"a": {"b": 2}}
+        _deep_merge(base, override)
+        assert base["a"]["b"] == 1
+
+    def test_empty_base(self):
+        """Empty base should return override contents."""
+        result = _deep_merge({}, {"a": 1, "b": {"c": 2}})
+        assert result == {"a": 1, "b": {"c": 2}}
+
+
+# ===================================================================
+# Serialization: to_serializable_dict / from_serializable_dict
+# ===================================================================
+
+class TestSerializableDict:
+    """Verify to_serializable_dict / from_serializable_dict roundtrips."""
+
+    def test_browser_config_roundtrip(self):
+        """BrowserConfig should survive serialization roundtrip."""
+        original = BrowserConfig(
+            headless=False,
+            viewport_width=1920,
+            browser_type="firefox",
+        )
+        serialized = to_serializable_dict(original)
+        assert serialized["type"] == "BrowserConfig"
+        restored = from_serializable_dict(serialized)
+        assert isinstance(restored, BrowserConfig)
+        assert restored.headless is False
+        assert restored.viewport_width == 1920
+
+    def test_crawler_run_config_roundtrip(self):
+        """CrawlerRunConfig should survive serialization roundtrip."""
+        original = CrawlerRunConfig(
+            word_count_threshold=500,
+            magic=True,
+            wait_until="load",
+        )
+        serialized = to_serializable_dict(original)
+        assert serialized["type"] == "CrawlerRunConfig"
+        restored = from_serializable_dict(serialized)
+        assert isinstance(restored, CrawlerRunConfig)
+        assert restored.magic is True
+
+    def test_crawler_run_config_with_extraction_strategy(self):
+        """CrawlerRunConfig with extraction strategy should roundtrip."""
+        try:
+            from crawl4ai import JsonCssExtractionStrategy
+            schema = {
+                "name": "products",
+                "baseSelector": "div.product",
+                "fields": [
+                    {"name": "title", "selector": "h2", "type": "text"},
+                    {"name": "price", "selector": ".price", "type": "text"},
+                ],
+            }
+            strategy = JsonCssExtractionStrategy(schema)
+            original = CrawlerRunConfig(extraction_strategy=strategy)
+            serialized = to_serializable_dict(original)
+            restored = from_serializable_dict(serialized)
+            assert isinstance(restored, CrawlerRunConfig)
+            assert isinstance(restored.extraction_strategy, JsonCssExtractionStrategy)
+        except ImportError:
+            pytest.skip("JsonCssExtractionStrategy not available")
+
+    def test_none_value(self):
+        """None should serialize to None."""
+        assert to_serializable_dict(None) is None
+
+    def test_basic_types_passthrough(self):
+        """Strings, ints, floats, bools should pass through unchanged."""
+        assert to_serializable_dict("hello") == "hello"
+        assert to_serializable_dict(42) == 42
+        assert to_serializable_dict(3.14) == 3.14
+        assert to_serializable_dict(True) is True
+
+    def test_enum_serialization(self):
+        """CacheMode enum should serialize with type info."""
+        serialized = to_serializable_dict(CacheMode.ENABLED)
+        assert serialized["type"] == "CacheMode"
+        assert serialized["params"] == "enabled"
+        restored = from_serializable_dict(serialized)
+        assert restored == CacheMode.ENABLED
+
+    def test_list_serialization(self):
+        """Lists should serialize element-by-element."""
+        result = to_serializable_dict([1, "two", 3.0])
+        assert result == [1, "two", 3.0]
+
+    def test_dict_serialization(self):
+        """Plain dicts should be wrapped with type='dict'."""
+        result = to_serializable_dict({"key": "value"})
+        assert result["type"] == "dict"
+        assert result["value"]["key"] == "value"
+
+    def test_disallowed_type_raises(self):
+        """Deserializing a non-allowlisted type should raise ValueError."""
+        bad_data = {"type": "os.system", "params": {"command": "rm -rf /"}}
+        with pytest.raises(ValueError, match="not allowed"):
+            from_serializable_dict(bad_data)
+
+    def test_geolocation_config_roundtrip(self):
+        """GeolocationConfig should survive serialization roundtrip."""
+        original = GeolocationConfig(latitude=37.7749, longitude=-122.4194, accuracy=10.0)
+        serialized = to_serializable_dict(original)
+        assert serialized["type"] == "GeolocationConfig"
+        restored = from_serializable_dict(serialized)
+        assert isinstance(restored, GeolocationConfig)
+        assert restored.latitude == 37.7749
+
+    def test_proxy_config_roundtrip(self):
+        """ProxyConfig should survive serialization roundtrip."""
+        original = ProxyConfig(server="http://proxy:8080", username="user", password="pass")
+        serialized = to_serializable_dict(original)
+        assert serialized["type"] == "ProxyConfig"
+        restored = from_serializable_dict(serialized)
+        assert isinstance(restored, ProxyConfig)
+        assert restored.server == "http://proxy:8080"
+        assert restored.username == "user"
--- a/tests/regression/test_reg_content.py
+++ b/tests/regression/test_reg_content.py
@@ -0,0 +1,512 @@
+"""
+Regression tests for Crawl4AI content processing pipeline.
+
+Covers markdown generation, content filtering (BM25, Pruning),
+link/image/table extraction, metadata extraction, tag exclusion,
+CSS selector targeting, and real-URL content quality.
+
+Run:
+    pytest tests/regression/test_reg_content.py -v
+    pytest tests/regression/test_reg_content.py -v -m "not network"
+"""
+
+import pytest
+import json
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
+
+
+# ---------------------------------------------------------------------------
+# Markdown generation
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_markdown_raw(local_server):
+    """Crawl the home page and verify raw markdown is a non-empty string
+    containing the expected heading text and heading markers."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success, f"Crawl failed: {result.error_message}"
+        md = result.markdown
+        assert md is not None
+        assert isinstance(md, str)
+        assert len(md) > 0
+        assert "Welcome to the Crawl4AI Test Site" in md
+        # Should have at least one markdown heading marker
+        assert "#" in md
+
+
+@pytest.mark.asyncio
+async def test_markdown_has_headings(local_server):
+    """Verify markdown contains the expected h1 and h2 headings."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success
+        md = result.markdown
+        assert "# Welcome" in md or "# Welcome to the Crawl4AI Test Site" in md
+        # h2 heading for Features Overview
+        assert "## Features" in md or "## Features Overview" in md
+
+
+@pytest.mark.asyncio
+async def test_markdown_has_code_block(local_server):
+    """Verify markdown preserves the code block with triple backticks."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success
+        md = result.markdown
+        assert "```" in md
+        assert "AsyncWebCrawler" in md
+
+
+@pytest.mark.asyncio
+async def test_markdown_has_list(local_server):
+    """Verify markdown contains list items from the home page features list."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success
+        md = result.markdown
+        # Markdown list items should contain at least some of these
+        assert "Content extraction" in md or "content extraction" in md
+        assert "Link discovery" in md or "link discovery" in md
+
+
+@pytest.mark.asyncio
+async def test_markdown_citations(local_server):
+    """Access markdown_with_citations and verify it contains numbered citation references."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success
+        citations_md = result.markdown.markdown_with_citations
+        assert isinstance(citations_md, str)
+        assert len(citations_md) > 0
+        # Should have at least one citation reference like [1] or similar
+        has_citation = any(f"[{i}]" in citations_md for i in range(1, 20))
+        # Some implementations use a different format
+        assert has_citation or "⟨" in citations_md or "[" in citations_md
+
+
+@pytest.mark.asyncio
+async def test_markdown_references(local_server):
+    """Access references_markdown and verify it contains URLs."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success
+        refs = result.markdown.references_markdown
+        assert isinstance(refs, str)
+        # References should mention URLs or link targets
+        assert "http" in refs or "/" in refs
+
+
+@pytest.mark.asyncio
+async def test_markdown_string_compat(local_server):
+    """Verify StringCompatibleMarkdown behaves like a string:
+    str() works, equality with raw_markdown, and 'in' operator."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=CrawlerRunConfig())
+        assert result.success
+        md = result.markdown
+        raw = md.raw_markdown
+        # str(result.markdown) should equal raw_markdown
+        assert str(md) == raw
+        # 'in' operator should work on the string content
+        assert "Welcome" in md
+
+
+# ---------------------------------------------------------------------------
+# Content filtering - BM25
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_bm25_fit_markdown(local_server):
+    """Crawl with BM25ContentFilter and verify fit_markdown is shorter
+    than the full raw_markdown (content was filtered)."""
+    gen = DefaultMarkdownGenerator(
+        content_filter=BM25ContentFilter(user_query="features")
+    )
+    config = CrawlerRunConfig(markdown_generator=gen)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=config)
+        assert result.success
+        fit = result.markdown.fit_markdown
+        raw = result.markdown.raw_markdown
+        assert fit is not None
+        assert len(fit) > 0
+        assert len(fit) < len(raw), (
+            "fit_markdown should be shorter than raw_markdown after BM25 filtering"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Content filtering - Pruning
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_pruning_fit_markdown(local_server):
+    """Crawl with PruningContentFilter and verify fit_markdown exists
+    and is shorter than the full raw_markdown."""
+    gen = DefaultMarkdownGenerator(content_filter=PruningContentFilter())
+    config = CrawlerRunConfig(markdown_generator=gen)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=config)
+        assert result.success
+        fit = result.markdown.fit_markdown
+        raw = result.markdown.raw_markdown
+        assert fit is not None
+        assert len(fit) > 0
+        assert len(fit) <= len(raw), (
+            "fit_markdown should not be longer than raw_markdown"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Link extraction
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_links_internal(local_server):
+    """Crawl /links-page and verify internal links are extracted with href keys."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/links-page", config=CrawlerRunConfig())
+        assert result.success
+        internal = result.links.get("internal", [])
+        assert isinstance(internal, list)
+        assert len(internal) > 0, "Expected internal links to be found"
+        # Each link dict should have an href
+        for link in internal:
+            assert "href" in link, f"Link missing 'href' key: {link}"
+
+
+@pytest.mark.asyncio
+async def test_links_external(local_server):
+    """Verify external links include the expected domains."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/links-page", config=CrawlerRunConfig())
+        assert result.success
+        external = result.links.get("external", [])
+        assert len(external) > 0, "Expected external links to be found"
+        hrefs = [link["href"] for link in external]
+        all_hrefs = " ".join(hrefs)
+        assert "example.com" in all_hrefs
+        assert "github.com" in all_hrefs
+        assert "python.org" in all_hrefs
+
+
+@pytest.mark.asyncio
+async def test_links_exclude_external(local_server):
+    """Crawl with exclude_external_links=True and verify no external links remain."""
+    config = CrawlerRunConfig(exclude_external_links=True)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/links-page", config=config)
+        assert result.success
+        external = result.links.get("external", [])
+        assert len(external) == 0, f"Expected no external links, got {len(external)}"
+
+
+@pytest.mark.asyncio
+async def test_links_exclude_social(local_server):
+    """Crawl with exclude_social_media_links=True and verify no social media
+    links appear in the external links list."""
+    config = CrawlerRunConfig(exclude_social_media_links=True)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/links-page", config=config)
+        assert result.success
+        external = result.links.get("external", [])
+        social_domains = ["twitter.com", "facebook.com", "linkedin.com"]
+        for link in external:
+            href = link.get("href", "")
+            for domain in social_domains:
+                assert domain not in href, (
+                    f"Social media link should be excluded: {href}"
+                )
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_links_real_url():
+    """Crawl a real URL (quotes.toscrape.com) and verify internal links are found
+    (pagination links exist on the main page)."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://quotes.toscrape.com",
+            config=CrawlerRunConfig(),
+        )
+        assert result.success
+        internal = result.links.get("internal", [])
+        assert len(internal) > 0, "Expected internal links on quotes.toscrape.com"
+
+
+# ---------------------------------------------------------------------------
+# Image extraction
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_images_extracted(local_server):
+    """Crawl /images-page and verify images are extracted."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/images-page", config=CrawlerRunConfig())
+        assert result.success
+        images = result.media.get("images", [])
+        assert isinstance(images, list)
+        assert len(images) > 0, "Expected images to be extracted"
+
+
+@pytest.mark.asyncio
+async def test_images_have_fields(local_server):
+    """Verify each extracted image dict has src, alt, and score keys."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/images-page", config=CrawlerRunConfig())
+        assert result.success
+        images = result.media.get("images", [])
+        assert len(images) > 0
+        for img in images:
+            assert "src" in img, f"Image missing 'src': {img}"
+            assert "alt" in img, f"Image missing 'alt': {img}"
+            assert "score" in img, f"Image missing 'score': {img}"
+
+
+@pytest.mark.asyncio
+async def test_images_scoring(local_server):
+    """High-quality images (large, with alt text) should score higher
+    than small icons without alt text."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/images-page", config=CrawlerRunConfig())
+        assert result.success
+        images = result.media.get("images", [])
+        assert len(images) >= 2
+
+        # Find the hero/landscape image and the small icon
+        hero = None
+        icon = None
+        for img in images:
+            src = img.get("src", "")
+            if "landscape" in src or "hero" in src:
+                hero = img
+            elif "icon" in src and img.get("alt", "") == "":
+                icon = img
+
+        if hero and icon:
+            assert hero["score"] > icon["score"], (
+                f"Hero score ({hero['score']}) should exceed icon score ({icon['score']})"
+            )
+
+
+@pytest.mark.asyncio
+async def test_images_exclude_all(local_server):
+    """Crawl with exclude_all_images=True and verify no images are returned."""
+    config = CrawlerRunConfig(exclude_all_images=True)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/images-page", config=config)
+        assert result.success
+        images = result.media.get("images", [])
+        assert len(images) == 0, f"Expected no images with exclude_all_images, got {len(images)}"
+
+
+# ---------------------------------------------------------------------------
+# Table extraction
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_tables_extracted(local_server):
+    """Crawl /tables and verify tables appear in the result (either in
+    result.media, result.tables, or markdown pipe formatting)."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/tables", config=CrawlerRunConfig())
+        assert result.success
+        # Tables may appear in result.tables, result.media, or markdown
+        has_tables = (
+            len(getattr(result, "tables", []) or []) > 0
+            or "tables" in result.media
+            or "|" in str(result.markdown)
+        )
+        assert has_tables, "Expected table data to be found in the result"
+
+
+@pytest.mark.asyncio
+async def test_tables_in_markdown(local_server):
+    """Verify the markdown output contains table formatting with pipes and dashes."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/tables", config=CrawlerRunConfig())
+        assert result.success
+        md = str(result.markdown)
+        assert "|" in md, "Expected pipe character in markdown tables"
+        assert "---" in md or "- -" in md, "Expected separator row in markdown tables"
+
+
+# ---------------------------------------------------------------------------
+# Metadata extraction
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_metadata_title(local_server):
+    """Crawl /structured-data and verify the page title is in metadata."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url=f"{local_server}/structured-data", config=CrawlerRunConfig()
+        )
+        assert result.success
+        assert result.metadata is not None
+        # Title should be "Article with Structured Data"
+        title = result.metadata.get("title", "")
+        assert "Article with Structured Data" in title or "Structured Data" in title
+
+
+@pytest.mark.asyncio
+async def test_metadata_og_tags(local_server):
+    """Verify og:title, og:description, og:image are present in metadata."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url=f"{local_server}/structured-data", config=CrawlerRunConfig()
+        )
+        assert result.success
+        meta = result.metadata
+        assert meta is not None
+
+        # Check for og tags -- they may be stored with different key formats
+        og_title = meta.get("og:title", meta.get("og_title", ""))
+        og_desc = meta.get("og:description", meta.get("og_description", ""))
+        og_image = meta.get("og:image", meta.get("og_image", ""))
+
+        assert og_title, f"Missing og:title in metadata: {meta}"
+        assert og_desc, f"Missing og:description in metadata: {meta}"
+        assert og_image, f"Missing og:image in metadata: {meta}"
+
+
+@pytest.mark.asyncio
+async def test_metadata_description(local_server):
+    """Verify meta description is present in metadata."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url=f"{local_server}/structured-data", config=CrawlerRunConfig()
+        )
+        assert result.success
+        meta = result.metadata
+        assert meta is not None
+        desc = meta.get("description", "")
+        assert desc, f"Missing description in metadata: {meta}"
+        assert "web crawling" in desc.lower()
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_metadata_real():
+    """Crawl https://example.com and verify title metadata exists."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://example.com", config=CrawlerRunConfig()
+        )
+        assert result.success
+        assert result.metadata is not None
+        title = result.metadata.get("title", "")
+        assert title, "Expected title metadata from example.com"
+
+
+# ---------------------------------------------------------------------------
+# Excluded tags
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_excluded_tags_nav(local_server):
+    """Crawl / with excluded_tags=["nav"] and verify navigation links are
+    removed from cleaned_html."""
+    config = CrawlerRunConfig(excluded_tags=["nav"])
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=config)
+        assert result.success
+        cleaned = result.cleaned_html or ""
+        # The nav element contained links to Products, Links, Tables
+        # After exclusion these should be absent from cleaned_html
+        assert "<nav" not in cleaned.lower(), "nav tag should be excluded from cleaned_html"
+
+
+@pytest.mark.asyncio
+async def test_excluded_selector(local_server):
+    """Crawl / with excluded_selector='footer' and verify footer content
+    is excluded from cleaned_html."""
+    config = CrawlerRunConfig(excluded_selector="footer")
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=config)
+        assert result.success
+        cleaned = result.cleaned_html or ""
+        assert "Footer content" not in cleaned, (
+            "Footer content should be excluded from cleaned_html"
+        )
+
+
+# ---------------------------------------------------------------------------
+# CSS selector targeting
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_css_selector_main(local_server):
+    """Crawl / with css_selector='main' and verify result focuses on main content."""
+    config = CrawlerRunConfig(css_selector="main")
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=config)
+        assert result.success
+        md = str(result.markdown)
+        assert "Welcome to the Crawl4AI Test Site" in md
+        # Footer should not be in the markdown since we targeted <main>
+        assert "Footer content" not in md
+
+
+@pytest.mark.asyncio
+async def test_css_selector_product(local_server):
+    """Crawl /products with css_selector targeting only product #1 and verify
+    only the first product is extracted."""
+    config = CrawlerRunConfig(css_selector=".product[data-id='1']")
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success
+        md = str(result.markdown)
+        assert "Wireless Mouse" in md
+        # Other products should not appear
+        assert "Mechanical Keyboard" not in md
+        assert "USB-C Hub" not in md
+
+
+# ---------------------------------------------------------------------------
+# Real URL content tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_real_url_markdown_quality():
+    """Crawl https://example.com and verify markdown has reasonable content
+    with more than 50 chars and contains 'Example Domain'."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://example.com", config=CrawlerRunConfig()
+        )
+        assert result.success
+        md = str(result.markdown)
+        assert len(md) > 50, f"Markdown too short ({len(md)} chars)"
+        assert "Example Domain" in md
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_real_url_links():
+    """Crawl https://books.toscrape.com and verify internal links (product links)
+    and images (book covers) are found."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://books.toscrape.com", config=CrawlerRunConfig()
+        )
+        assert result.success
+        internal = result.links.get("internal", [])
+        assert len(internal) > 0, "Expected product links on books.toscrape.com"
+        images = result.media.get("images", [])
+        assert len(images) > 0, "Expected book cover images on books.toscrape.com"
--- a/tests/regression/test_reg_core_crawl.py
+++ b/tests/regression/test_reg_core_crawl.py
@@ -0,0 +1,405 @@
+"""
+Crawl4AI Regression Tests - Core Crawling Functionality
+
+Tests core crawling features including basic crawls, raw HTML, multiple URLs,
+screenshots, JavaScript execution, caching, sessions, hooks, network capture,
+CSS selectors, excluded tags, timeouts, and status codes.
+
+All tests use real browser crawling with no mocking.
+"""
+
+import asyncio
+import base64
+import pytest
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from crawl4ai.cache_context import CacheMode
+
+
+# ---------------------------------------------------------------------------
+# Basic crawl tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_basic_crawl(local_server):
+    """Crawl the local server home page and verify basic result fields."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/")
+        assert result.success, f"Crawl failed: {result.error_message}"
+        assert "<h1>" in result.html, "HTML should contain an <h1> tag"
+        assert isinstance(result.markdown, str), "Markdown should be a string"
+        assert len(result.markdown) > 0, "Markdown should be non-empty"
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_basic_crawl_real_url():
+    """Crawl https://example.com and verify success with real content."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun("https://example.com")
+        assert result.success, f"Crawl failed: {result.error_message}"
+        assert len(result.html) > 100, "HTML should have substantial content"
+        assert len(result.markdown) > 10, "Markdown should have content"
+
+
+# ---------------------------------------------------------------------------
+# Raw HTML crawl tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_raw_html_crawl():
+    """Crawl raw HTML and verify markdown extraction."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun("raw:<html><body><h1>Test</h1><p>Hello world</p></body></html>")
+        assert result.success, f"Raw HTML crawl failed: {result.error_message}"
+        assert "Test" in result.markdown, "Markdown should contain 'Test'"
+        assert "Hello" in result.markdown, "Markdown should contain 'Hello'"
+
+
+@pytest.mark.asyncio
+async def test_raw_html_with_base_url():
+    """Raw HTML with relative links should resolve against base_url."""
+    raw_html = (
+        "raw:<html><body>"
+        '<a href="/page1">Link 1</a>'
+        '<a href="/page2">Link 2</a>'
+        '<a href="https://other.com/abs">Absolute</a>'
+        "</body></html>"
+    )
+    config = CrawlerRunConfig(base_url="http://example.com")
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(raw_html, config=config)
+        assert result.success, f"Raw HTML with base_url failed: {result.error_message}"
+        # Check that links were resolved (they should appear in the result's links or markdown)
+        md_lower = result.markdown.lower() if result.markdown else ""
+        html_lower = result.html.lower() if result.html else ""
+        combined = md_lower + html_lower
+        # At minimum, the link text should appear
+        assert "link 1" in combined, "Link text should be present"
+
+
+# ---------------------------------------------------------------------------
+# Multiple URL crawl tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_arun_many(local_server):
+    """Crawl 3 local server URLs with arun_many and verify all succeed."""
+    urls = [
+        local_server + "/",
+        local_server + "/products",
+        local_server + "/tables",
+    ]
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun_many(urls, config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS))
+        assert isinstance(results, list), "arun_many should return a list"
+        assert len(results) == 3, f"Expected 3 results, got {len(results)}"
+        for i, result in enumerate(results):
+            assert result.success, f"Result {i} failed: {result.error_message}"
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_arun_many_real():
+    """Crawl multiple real URLs together."""
+    urls = ["https://example.com", "https://quotes.toscrape.com"]
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun_many(urls, config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS))
+        assert len(results) == 2, f"Expected 2 results, got {len(results)}"
+        for result in results:
+            assert result.success, f"Real URL crawl failed: {result.error_message}"
+
+
+# ---------------------------------------------------------------------------
+# Screenshot tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_screenshot_capture(local_server):
+    """Crawl with screenshot=True and verify PNG format output."""
+    config = CrawlerRunConfig(screenshot=True)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"Screenshot crawl failed: {result.error_message}"
+        assert result.screenshot, "Screenshot should be a non-empty string"
+        assert isinstance(result.screenshot, str), "Screenshot should be a base64 string"
+        # Decode and verify PNG header
+        raw_bytes = base64.b64decode(result.screenshot)
+        assert raw_bytes[:4] == b"\x89PNG", "Screenshot should be in PNG format"
+
+
+@pytest.mark.asyncio
+async def test_screenshot_not_bmp(local_server):
+    """Verify screenshot is PNG format, NOT BMP (regression for #1758)."""
+    config = CrawlerRunConfig(screenshot=True)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success
+        raw_bytes = base64.b64decode(result.screenshot)
+        # BMP files start with b'BM'
+        assert raw_bytes[:2] != b"BM", "Screenshot should NOT be BMP format"
+        assert raw_bytes[:4] == b"\x89PNG", "Screenshot should be PNG format"
+
+
+# ---------------------------------------------------------------------------
+# JavaScript execution tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_js_execution(local_server):
+    """Crawl /js-dynamic with wait_for to verify JS-generated content loads."""
+    config = CrawlerRunConfig(wait_for="css:.js-loaded")
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/js-dynamic", config=config)
+        assert result.success, f"JS dynamic crawl failed: {result.error_message}"
+        assert "Dynamic content successfully loaded" in result.markdown, (
+            "JS-generated content should appear in markdown"
+        )
+
+
+@pytest.mark.asyncio
+async def test_js_code_execution(local_server):
+    """Execute custom JS code during crawl and verify modification."""
+    config = CrawlerRunConfig(
+        js_code="document.title = 'Modified Title';",
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"JS code execution crawl failed: {result.error_message}"
+        # The JS ran after page load; verify it did not cause errors
+        # (title change may or may not be reflected in html depending on timing)
+
+
+@pytest.mark.asyncio
+async def test_js_code_before_wait(local_server):
+    """Use js_code_before_wait to inject content, then wait_for to verify it."""
+    js_inject = """
+    const div = document.createElement('div');
+    div.id = 'injected-marker';
+    div.className = 'injected';
+    div.textContent = 'Injected by js_code_before_wait';
+    document.body.appendChild(div);
+    """
+    config = CrawlerRunConfig(
+        js_code_before_wait=js_inject,
+        wait_for="css:#injected-marker",
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"js_code_before_wait crawl failed: {result.error_message}"
+        assert "Injected by js_code_before_wait" in result.markdown, (
+            "Injected content should appear in markdown"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Cache mode tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_cache_write_and_read(local_server):
+    """Crawl with ENABLED cache, then crawl again to verify cache hit."""
+    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        # First crawl - writes to cache
+        result1 = await crawler.arun(local_server + "/", config=config)
+        assert result1.success, f"First crawl failed: {result1.error_message}"
+
+        # Second crawl - should read from cache
+        result2 = await crawler.arun(local_server + "/", config=config)
+        assert result2.success, f"Second crawl failed: {result2.error_message}"
+        if result2.cache_status:
+            assert "hit" in result2.cache_status.lower(), (
+                f"Second crawl should be a cache hit, got: {result2.cache_status}"
+            )
+
+
+@pytest.mark.asyncio
+async def test_cache_bypass(local_server):
+    """Crawl with BYPASS cache mode; result should still succeed."""
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"Bypass cache crawl failed: {result.error_message}"
+        assert len(result.html) > 0, "HTML should be non-empty even with bypass"
+
+
+@pytest.mark.asyncio
+async def test_cache_disabled(local_server):
+    """Crawl with DISABLED cache; second crawl should not be cached."""
+    config = CrawlerRunConfig(cache_mode=CacheMode.DISABLED)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result1 = await crawler.arun(local_server + "/", config=config)
+        assert result1.success
+        result2 = await crawler.arun(local_server + "/", config=config)
+        assert result2.success
+        # With DISABLED, there should be no cache hit
+        if result2.cache_status:
+            assert "hit" not in result2.cache_status.lower(), (
+                "DISABLED cache should not produce a cache hit"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Session reuse test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_session_reuse(local_server):
+    """Crawl with a session_id, crawl again with same session_id; both succeed."""
+    config = CrawlerRunConfig(session_id="test-session", cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result1 = await crawler.arun(local_server + "/", config=config)
+        assert result1.success, f"First session crawl failed: {result1.error_message}"
+
+        result2 = await crawler.arun(local_server + "/", config=config)
+        assert result2.success, f"Second session crawl failed: {result2.error_message}"
+
+
+# ---------------------------------------------------------------------------
+# Hooks test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_hooks_fire(local_server):
+    """Verify before_goto and after_goto hooks are called during crawl."""
+    calls = []
+
+    async def before_hook(page, context, url, **kwargs):
+        calls.append(("before_goto", url))
+        return page
+
+    async def after_hook(page, context, url, **kwargs):
+        calls.append(("after_goto", url))
+        return page
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        crawler.crawler_strategy.set_hook("before_goto", before_hook)
+        crawler.crawler_strategy.set_hook("after_goto", after_hook)
+
+        result = await crawler.arun(local_server + "/", config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS))
+        assert result.success, f"Hook crawl failed: {result.error_message}"
+        hook_types = [c[0] for c in calls]
+        assert "before_goto" in hook_types, "before_goto hook should have been called"
+        assert "after_goto" in hook_types, "after_goto hook should have been called"
+
+
+# ---------------------------------------------------------------------------
+# Network capture test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_network_request_capture(local_server):
+    """Crawl with capture_network_requests=True and verify requests are captured."""
+    config = CrawlerRunConfig(capture_network_requests=True, cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"Network capture crawl failed: {result.error_message}"
+        assert result.network_requests is not None, "network_requests should not be None"
+        assert isinstance(result.network_requests, list), "network_requests should be a list"
+        assert len(result.network_requests) >= 1, "Should capture at least 1 network request"
+        # Each entry should have a url key
+        assert "url" in result.network_requests[0], (
+            "Network request entries should have a 'url' key"
+        )
+
+
+# ---------------------------------------------------------------------------
+# CSS selector test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_css_selector(local_server):
+    """Crawl /products with css_selector to narrow content extraction."""
+    config = CrawlerRunConfig(css_selector=".product-list", cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/products", config=config)
+        assert result.success, f"CSS selector crawl failed: {result.error_message}"
+        # The product content should be present
+        assert "Wireless Mouse" in result.html, "Product content should be in HTML"
+        # The h1 "Products" is outside .product-list, should not be in the selected HTML
+        # css_selector filters the HTML sent to content extraction
+        assert "<h1>" not in result.html, (
+            "The h1 outside .product-list should not appear in result.html"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Excluded tags test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_excluded_tags(local_server):
+    """Crawl with excluded_tags to remove nav and footer content."""
+    config = CrawlerRunConfig(excluded_tags=["nav", "footer"], cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"Excluded tags crawl failed: {result.error_message}"
+        cleaned = result.cleaned_html or ""
+        assert "<nav" not in cleaned.lower(), "cleaned_html should not contain nav element"
+        assert "<footer" not in cleaned.lower(), "cleaned_html should not contain footer element"
+
+
+# ---------------------------------------------------------------------------
+# Page timeout test
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_page_timeout(local_server):
+    """Crawl /slow with a 500ms timeout; expect failure or timeout."""
+    config = CrawlerRunConfig(page_timeout=500, cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/slow", config=config)
+        # The slow page takes 2 seconds but we gave only 500ms
+        # It should either fail or have an error
+        if result.success:
+            # Some browsers may still return partial content; that is acceptable
+            pass
+        else:
+            assert result.error_message is not None, (
+                "Failed crawl should have an error message"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Status code tests
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_404_status_code(local_server):
+    """Crawl /not-found and verify 404 status code."""
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/not-found", config=config)
+        assert result.status_code == 404, (
+            f"Expected status code 404, got {result.status_code}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_redirect_status(local_server):
+    """Crawl /redirect and verify it follows the redirect to home."""
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/redirect", config=config)
+        assert result.success, f"Redirect crawl failed: {result.error_message}"
+        # After redirect, the final URL should be the home page
+        if result.redirected_url:
+            assert result.redirected_url.rstrip("/").endswith(
+                local_server.rstrip("/").split(":")[-1]
+            ) or result.redirected_url.endswith("/"), (
+                f"Redirected URL should end with /, got: {result.redirected_url}"
+            )
--- a/tests/regression/test_reg_deep_crawl.py
+++ b/tests/regression/test_reg_deep_crawl.py
@@ -0,0 +1,633 @@
+"""
+Crawl4AI Regression Tests - Deep Crawling
+
+Tests deep crawling strategies (BFS, DFS, BestFirst), URL filters, URL scorers,
+URL normalization, and streaming mode using real browser crawling with no mocking.
+"""
+
+import pytest
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from crawl4ai.deep_crawling import (
+    BFSDeepCrawlStrategy,
+    DFSDeepCrawlStrategy,
+    BestFirstCrawlingStrategy,
+)
+from crawl4ai.deep_crawling.filters import (
+    URLPatternFilter,
+    DomainFilter,
+    ContentTypeFilter,
+    FilterChain,
+)
+from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer, CompositeScorer
+from crawl4ai.utils import (
+    normalize_url_for_deep_crawl,
+    efficient_normalize_url_for_deep_crawl,
+)
+
+
+def _to_ip_url(local_server: str) -> str:
+    """Convert http://localhost:PORT to http://127.0.0.1:PORT.
+
+    Deep crawl strategies reject netlocs without a dot (e.g. 'localhost'),
+    so we use the IP form which contains dots and passes validation.
+    """
+    return local_server.replace("localhost", "127.0.0.1")
+
+
+# ---------------------------------------------------------------------------
+# BFS Deep Crawl
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_bfs_basic(local_server):
+    """BFS deep crawl of /deep/hub at depth 1 should return hub + sub pages."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=10)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        assert len(result_list) >= 1, "Should return at least the hub page"
+
+        # First result should be the hub
+        assert "/deep/hub" in result_list[0].url, "First result should be the hub page"
+
+        # Check sub pages are present
+        sub_urls = [r.url for r in result_list if "/deep/sub" in r.url]
+        assert len(sub_urls) >= 1, "Should discover at least one sub page"
+
+        # Verify metadata has depth key
+        for r in result_list:
+            assert r.metadata is not None, "Each result should have metadata"
+            assert "depth" in r.metadata, "Metadata should contain 'depth' key"
+
+        # Hub should be at depth 0
+        hub_result = result_list[0]
+        assert hub_result.metadata["depth"] == 0, "Hub should be at depth 0"
+
+        # Sub pages should be at depth 1
+        for r in result_list:
+            if "/deep/sub" in r.url:
+                assert r.metadata["depth"] == 1, f"Sub page {r.url} should be at depth 1"
+
+
+@pytest.mark.asyncio
+async def test_bfs_depth_enforcement(local_server):
+    """BFS with max_depth=1 must not include leaf pages at depth 2."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=20)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        leaf_urls = [r.url for r in result_list if "leaf" in r.url]
+        assert len(leaf_urls) == 0, (
+            f"No leaf pages should appear at max_depth=1, but found: {leaf_urls}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_bfs_max_pages(local_server):
+    """BFS with max_pages=3 should return at most 3 results."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=3, max_pages=3)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        assert len(result_list) <= 3, (
+            f"Expected at most 3 results, got {len(result_list)}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_bfs_level_order(local_server):
+    """BFS should return results in level order: depth 0 before depth 1 before depth 2."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=20)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        depths = [r.metadata["depth"] for r in result_list]
+
+        # Verify ordering: once a higher depth appears, no lower depth should follow
+        max_depth_seen = -1
+        for i, d in enumerate(depths):
+            if d < max_depth_seen:
+                pytest.fail(
+                    f"BFS level order violated at index {i}: depth {d} appeared "
+                    f"after depth {max_depth_seen}. Full sequence: {depths}"
+                )
+            max_depth_seen = max(max_depth_seen, d)
+
+
+# ---------------------------------------------------------------------------
+# DFS Deep Crawl
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_dfs_basic(local_server):
+    """DFS deep crawl at depth 2 should find both sub pages and leaf pages."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = DFSDeepCrawlStrategy(max_depth=2, max_pages=10)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        urls = [r.url for r in result_list]
+
+        sub_pages = [u for u in urls if "/deep/sub" in u and "leaf" not in u]
+        leaf_pages = [u for u in urls if "leaf" in u]
+
+        assert len(sub_pages) >= 1, "DFS should visit at least one sub page"
+        assert len(leaf_pages) >= 1, "DFS at depth 2 should visit at least one leaf page"
+
+
+@pytest.mark.asyncio
+async def test_dfs_depth_first_order(local_server):
+    """DFS should explore depth-first: some leaf page should appear before all sub pages are visited."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    # Give enough pages to see the DFS pattern
+    strategy = DFSDeepCrawlStrategy(max_depth=2, max_pages=15)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        urls = [r.url for r in result_list]
+
+        # Find indices of sub pages and leaf pages
+        sub_indices = [i for i, u in enumerate(urls) if "/deep/sub" in u and "leaf" not in u]
+        leaf_indices = [i for i, u in enumerate(urls) if "leaf" in u]
+
+        if sub_indices and leaf_indices:
+            # In DFS, at least one leaf should appear before the last sub page
+            earliest_leaf = min(leaf_indices)
+            latest_sub = max(sub_indices)
+            assert earliest_leaf < latest_sub, (
+                "DFS should explore a branch deeply before exhausting all sub pages. "
+                f"Earliest leaf at index {earliest_leaf}, latest sub at index {latest_sub}."
+            )
+
+
+@pytest.mark.asyncio
+async def test_dfs_max_depth(local_server):
+    """DFS with max_depth=1 should only visit hub and sub pages, no leaves."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = DFSDeepCrawlStrategy(max_depth=1, max_pages=20)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        leaf_urls = [r.url for r in result_list if "leaf" in r.url]
+        assert len(leaf_urls) == 0, (
+            f"DFS with max_depth=1 should not reach leaf pages, found: {leaf_urls}"
+        )
+
+
+# ---------------------------------------------------------------------------
+# BestFirst Deep Crawl
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_bestfirst_basic(local_server):
+    """BestFirst deep crawl should return results from /deep/hub."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BestFirstCrawlingStrategy(max_depth=2, max_pages=10)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        assert len(result_list) >= 1, "BestFirst should return at least the start page"
+        assert result_list[0].success, "First result should be successful"
+
+
+# ---------------------------------------------------------------------------
+# Filters
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_url_pattern_filter_include(local_server):
+    """URLPatternFilter with sub1 pattern should only crawl the sub1 branch."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    url_filter = URLPatternFilter(patterns=["*/sub1*"])
+    chain = FilterChain(filters=[url_filter])
+    strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=10, filter_chain=chain)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        # Hub (depth 0) bypasses filter; subsequent URLs should only match sub1
+        non_hub = [r for r in result_list if r.metadata.get("depth", 0) > 0]
+        for r in non_hub:
+            assert "sub1" in r.url, (
+                f"All non-hub results should be in sub1 branch, but found: {r.url}"
+            )
+
+
+@pytest.mark.asyncio
+async def test_url_pattern_filter_exclude(local_server):
+    """URLPatternFilter with reverse=True should exclude leaf pages."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    url_filter = URLPatternFilter(patterns=["*/leaf*"], reverse=True)
+    chain = FilterChain(filters=[url_filter])
+    strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=15, filter_chain=chain)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        leaf_urls = [r.url for r in result_list if "leaf" in r.url]
+        assert len(leaf_urls) == 0, (
+            f"Reverse pattern filter should exclude leaf pages, found: {leaf_urls}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_domain_filter(local_server):
+    """DomainFilter allowing only 127.0.0.1 should keep local URLs only."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    domain_filter = DomainFilter(allowed_domains=["127.0.0.1"])
+    chain = FilterChain(filters=[domain_filter])
+    strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=10, filter_chain=chain)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        for r in result_list:
+            assert "127.0.0.1" in r.url, (
+                f"All results should be local, but found: {r.url}"
+            )
+
+
+@pytest.mark.asyncio
+async def test_filter_chain(local_server):
+    """FilterChain combining URLPatternFilter and DomainFilter should apply both."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    url_filter = URLPatternFilter(patterns=["*/sub1*"])
+    domain_filter = DomainFilter(allowed_domains=["127.0.0.1"])
+    chain = FilterChain(filters=[url_filter, domain_filter])
+    strategy = BFSDeepCrawlStrategy(max_depth=2, max_pages=10, filter_chain=chain)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        non_hub = [r for r in result_list if r.metadata.get("depth", 0) > 0]
+        for r in non_hub:
+            assert "sub1" in r.url, (
+                f"URL pattern filter not applied: {r.url}"
+            )
+            assert "127.0.0.1" in r.url, (
+                f"Domain filter not applied: {r.url}"
+            )
+
+
+def test_content_type_filter():
+    """ContentTypeFilter should pass HTML URLs and reject image/pdf extensions."""
+    ct_filter = ContentTypeFilter(allowed_types=["text/html"])
+
+    assert ct_filter.apply("http://example.com/page") is True, (
+        "URL with no extension should pass (assumed HTML)"
+    )
+    assert ct_filter.apply("http://example.com/page.html") is True, (
+        ".html should pass text/html filter"
+    )
+    assert ct_filter.apply("http://example.com/photo.jpg") is False, (
+        ".jpg should be rejected by text/html filter"
+    )
+    assert ct_filter.apply("http://example.com/doc.pdf") is False, (
+        ".pdf should be rejected by text/html filter"
+    )
+
+
+# ---------------------------------------------------------------------------
+# Scorers
+# ---------------------------------------------------------------------------
+
+
+def test_keyword_scorer():
+    """KeywordRelevanceScorer should rank URLs containing keywords higher."""
+    scorer = KeywordRelevanceScorer(keywords=["technology", "science"])
+
+    tech_score = scorer.score("http://example.com/technology/article")
+    generic_score = scorer.score("http://example.com/about/contact")
+
+    assert tech_score > generic_score, (
+        f"URL with keyword should score higher: tech={tech_score}, generic={generic_score}"
+    )
+
+    both_score = scorer.score("http://example.com/technology/science-report")
+    assert both_score >= tech_score, (
+        "URL matching both keywords should score at least as high as one keyword"
+    )
+
+
+def test_composite_scorer():
+    """CompositeScorer combining two scorers should produce scores without error."""
+    scorer1 = KeywordRelevanceScorer(keywords=["python"], weight=1.0)
+    scorer2 = KeywordRelevanceScorer(keywords=["crawl"], weight=0.5)
+    composite = CompositeScorer(scorers=[scorer1, scorer2])
+
+    score = composite.score("http://example.com/python-crawl-guide")
+    assert isinstance(score, float), "Composite score should be a float"
+    assert score > 0, "URL matching both scorers' keywords should have positive score"
+
+    zero_score = composite.score("http://example.com/unrelated-page")
+    assert zero_score == 0.0, "URL matching no keywords should score zero"
+
+
+# ---------------------------------------------------------------------------
+# URL normalization in deep crawl context
+# ---------------------------------------------------------------------------
+
+
+def test_deep_crawl_url_normalization():
+    """normalize_url_for_deep_crawl should resolve relative URLs against base."""
+    base = "http://example.com/deep/hub"
+
+    result = normalize_url_for_deep_crawl("/deep/sub1", base)
+    assert result == "http://example.com/deep/sub1", (
+        f"Relative URL not resolved correctly: {result}"
+    )
+
+    result2 = normalize_url_for_deep_crawl("sub2", base)
+    assert "example.com" in result2, "Relative path should resolve against base"
+    assert "sub2" in result2, "Relative path should include the target"
+
+
+def test_deep_crawl_trailing_slash():
+    """Trailing slashes should be preserved during normalization (fix #1520)."""
+    base = "http://example.com/"
+
+    with_slash = normalize_url_for_deep_crawl("/path/", base)
+    without_slash = normalize_url_for_deep_crawl("/path", base)
+
+    # The function uses `parsed.path or '/'` which preserves trailing slashes
+    assert with_slash.endswith("/path/"), (
+        f"Trailing slash should be preserved: {with_slash}"
+    )
+    assert not without_slash.endswith("/"), (
+        f"No trailing slash should be added: {without_slash}"
+    )
+
+
+def test_deep_crawl_deduplication():
+    """Same URL with different fragments should normalize to the same string."""
+    base = "http://example.com/"
+
+    url1 = normalize_url_for_deep_crawl("/page#section1", base)
+    url2 = normalize_url_for_deep_crawl("/page#section2", base)
+    url3 = normalize_url_for_deep_crawl("/page", base)
+
+    assert url1 == url2, (
+        f"Fragment-only difference should normalize to same URL: {url1} vs {url2}"
+    )
+    assert url1 == url3, (
+        f"URL with and without fragment should normalize the same: {url1} vs {url3}"
+    )
+
+
+def test_deep_crawl_efficient_normalization():
+    """efficient_normalize_url_for_deep_crawl should produce consistent results."""
+    base = "http://example.com/deep/hub"
+
+    result = efficient_normalize_url_for_deep_crawl("/deep/sub1", base)
+    assert result == "http://example.com/deep/sub1", (
+        f"Efficient normalization failed: {result}"
+    )
+
+    # Fragments should be removed
+    result_frag = efficient_normalize_url_for_deep_crawl("/page#anchor", base)
+    assert "#" not in result_frag, "Fragments should be stripped"
+
+
+def test_deep_crawl_normalization_none_input():
+    """Normalizing None or empty string should return None."""
+    result_none = normalize_url_for_deep_crawl(None, "http://example.com/")
+    assert result_none is None, "None input should return None"
+
+    result_empty = normalize_url_for_deep_crawl("", "http://example.com/")
+    assert result_empty is None, "Empty string should return None"
+
+
+def test_deep_crawl_normalization_case():
+    """Hostname normalization should be case-insensitive."""
+    base = "http://Example.COM/"
+
+    result = normalize_url_for_deep_crawl("/Page", base)
+    assert "example.com" in result, (
+        f"Hostname should be lowercased: {result}"
+    )
+
+
+# ---------------------------------------------------------------------------
+# Stream mode
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_deep_crawl_stream(local_server):
+    """Deep crawl with stream=True should yield results via async iteration."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=5)
+    config = CrawlerRunConfig(
+        deep_crawl_strategy=strategy,
+        stream=True,
+        verbose=False,
+    )
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = []
+        async for result in await crawler.arun(url=hub_url, config=config):
+            results.append(result)
+
+        assert len(results) > 0, "Stream mode should yield at least one result"
+        assert results[0].success, "First streamed result should be successful"
+
+
+# ---------------------------------------------------------------------------
+# Real URL deep crawl
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_deep_crawl_real():
+    """Deep crawl https://quotes.toscrape.com with BFS to verify real-world usage."""
+    strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=3)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url="https://quotes.toscrape.com", config=config)
+
+        result_list = list(results)
+        assert len(result_list) >= 1, "Should crawl at least the start page"
+        assert result_list[0].success, "Start page should crawl successfully"
+        # The site has links; with max_depth=1 we should find some
+        if len(result_list) > 1:
+            assert result_list[1].metadata.get("depth") == 1, (
+                "Second-level pages should have depth 1"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Edge cases
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_bfs_max_pages_one(local_server):
+    """BFS with max_pages=1 should return exactly 1 result (the start page)."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=5, max_pages=1)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        assert len(result_list) == 1, (
+            f"max_pages=1 should yield exactly 1 result, got {len(result_list)}"
+        )
+        assert "/deep/hub" in result_list[0].url, "The single result should be the hub"
+
+
+@pytest.mark.asyncio
+async def test_dfs_max_pages_one(local_server):
+    """DFS with max_pages=1 should return exactly 1 result."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = DFSDeepCrawlStrategy(max_depth=5, max_pages=1)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        assert len(result_list) == 1, (
+            f"max_pages=1 should yield exactly 1 result, got {len(result_list)}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_bfs_depth_zero(local_server):
+    """BFS with max_depth=0 should only return the start page."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=0, max_pages=100)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        assert len(result_list) == 1, (
+            f"max_depth=0 should yield exactly 1 result, got {len(result_list)}"
+        )
+        assert result_list[0].metadata["depth"] == 0, "Only depth-0 page should exist"
+
+
+@pytest.mark.asyncio
+async def test_bfs_results_have_parent_url(local_server):
+    """Each non-root result should have a parent_url in metadata."""
+    base = _to_ip_url(local_server)
+    hub_url = base + "/deep/hub"
+    strategy = BFSDeepCrawlStrategy(max_depth=1, max_pages=10)
+    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=False)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        results = await crawler.arun(url=hub_url, config=config)
+
+        result_list = list(results)
+        for r in result_list:
+            assert "parent_url" in r.metadata, (
+                f"Result for {r.url} should have 'parent_url' in metadata"
+            )
+            if r.metadata["depth"] == 0:
+                assert r.metadata["parent_url"] is None, (
+                    "Root page should have parent_url=None"
+                )
+            else:
+                assert r.metadata["parent_url"] is not None, (
+                    f"Non-root page {r.url} should have a parent_url"
+                )
+
+
+def test_url_pattern_filter_no_match():
+    """URLPatternFilter should reject URLs that match no patterns."""
+    f = URLPatternFilter(patterns=["*/special/*"])
+    assert f.apply("http://example.com/normal/page") is False
+    assert f.apply("http://example.com/special/page") is True
+
+
+def test_domain_filter_blocked():
+    """DomainFilter with blocked_domains should reject those domains."""
+    f = DomainFilter(blocked_domains=["evil.com"])
+    assert f.apply("http://evil.com/page") is False
+    assert f.apply("http://good.com/page") is True
+
+
+def test_domain_filter_subdomain():
+    """DomainFilter should handle subdomains of allowed domains."""
+    f = DomainFilter(allowed_domains=["example.com"])
+    assert f.apply("http://example.com/page") is True
+    assert f.apply("http://sub.example.com/page") is True
+    assert f.apply("http://other.com/page") is False
+
+
+def test_keyword_scorer_case_insensitive():
+    """KeywordRelevanceScorer should be case-insensitive by default."""
+    scorer = KeywordRelevanceScorer(keywords=["Python"])
+    score_lower = scorer.score("http://example.com/python-guide")
+    score_upper = scorer.score("http://example.com/PYTHON-GUIDE")
+    assert score_lower > 0, "Lowercase URL should match"
+    assert score_upper > 0, "Uppercase URL should match"
+
+
+def test_keyword_scorer_no_match():
+    """KeywordRelevanceScorer should return 0 for URLs with no keyword matches."""
+    scorer = KeywordRelevanceScorer(keywords=["quantum", "physics"])
+    score = scorer.score("http://example.com/cooking/recipes")
+    assert score == 0.0, "No keywords matched should give zero score"
--- a/tests/regression/test_reg_edge_cases.py
+++ b/tests/regression/test_reg_edge_cases.py
@@ -0,0 +1,359 @@
+"""
+Crawl4AI Regression Tests - Edge Cases and Error Handling
+
+Adversarial tests for empty pages, malformed HTML, large pages, unicode,
+concurrent crawls, error recovery, and other boundary conditions.
+
+All tests use real browser crawling with no mocking.
+"""
+
+import asyncio
+import pytest
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from crawl4ai.cache_context import CacheMode
+
+
+# ---------------------------------------------------------------------------
+# Empty and minimal pages
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_empty_page(local_server):
+    """Crawl an empty page and verify no crash. Anti-bot may flag it as blocked."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/empty")
+        # An empty page may be flagged by the anti-bot detector as "near-empty content"
+        # so success may be False. The key thing is no unhandled exception and
+        # we get a result object back.
+        assert result.html is not None, "HTML should not be None for empty page"
+        # Markdown should be empty or minimal
+        md = result.markdown or ""
+        assert len(md.strip()) < 50, (
+            "Empty page should produce little to no markdown"
+        )
+
+
+@pytest.mark.asyncio
+async def test_empty_raw_html():
+    """Crawl raw HTML with empty body; should succeed without crash."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun("raw:<html><body></body></html>")
+        assert result.success, f"Empty raw HTML crawl failed: {result.error_message}"
+
+
+# ---------------------------------------------------------------------------
+# Malformed HTML
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_malformed_html(local_server):
+    """Crawl intentionally broken HTML; should not crash, even if anti-bot flags it."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/malformed")
+        # The malformed HTML is so broken that the browser may put content into
+        # unexpected places (e.g., the title). The anti-bot detector may flag the
+        # result as blocked due to empty body. The key assertion is: no unhandled
+        # exception and we get a result object back with html content.
+        assert result.html is not None, "Should still return HTML even for malformed pages"
+        assert len(result.html) > 0, "HTML should be non-empty for malformed page"
+
+
+@pytest.mark.asyncio
+async def test_raw_html_no_doctype():
+    """Raw HTML without doctype or <html> wrapper should still parse."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun("raw:<body><p>No doctype</p></body>")
+        assert result.success, f"No-doctype raw HTML failed: {result.error_message}"
+        assert "No doctype" in (result.markdown or ""), (
+            "Content should be extracted despite missing doctype"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Large pages
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_large_page(local_server):
+    """Crawl a page with 50 sections and verify content from beginning and end."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/large")
+        assert result.success, f"Large page crawl failed: {result.error_message}"
+        md = result.markdown or ""
+        assert "Section 0" in md, "Markdown should contain content from section 0"
+        assert "Section 49" in md, "Markdown should contain content from section 49"
+
+
+# ---------------------------------------------------------------------------
+# Unicode and special characters
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_unicode_content():
+    """Crawl raw HTML with unicode characters and verify they survive extraction."""
+    raw = "raw:<html><body><p>Unicode: \u00e9\u00e8\u00ea \u4e16\u754c \U0001f600</p></body></html>"
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(raw)
+        assert result.success, f"Unicode crawl failed: {result.error_message}"
+        md = result.markdown or ""
+        assert "\u00e9" in md, "French accented 'e' should be in markdown"
+        assert "\u4e16\u754c" in md, "Chinese characters should be in markdown"
+        # Emoji may or may not survive depending on markdown generator;
+        # at least the other unicode should be present
+
+
+@pytest.mark.asyncio
+async def test_html_entities():
+    """Crawl raw HTML with entities and verify they are decoded in markdown."""
+    raw = "raw:<html><body><p>&amp; &lt; &gt; &quot; &#39;</p></body></html>"
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(raw)
+        assert result.success, f"HTML entities crawl failed: {result.error_message}"
+        md = result.markdown or ""
+        assert "&" in md, "Ampersand entity should be decoded"
+        assert "<" in md, "Less-than entity should be decoded"
+        assert ">" in md, "Greater-than entity should be decoded"
+
+
+# ---------------------------------------------------------------------------
+# Multiple crawls - no state leakage
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_sequential_crawls_no_leakage(local_server):
+    """Crawl 3 different pages sequentially; verify no content bleed."""
+    pages = [
+        (local_server + "/products", "Wireless Mouse"),
+        (local_server + "/tables", "Sales Report"),
+        (local_server + "/js-dynamic", "Static Section"),
+    ]
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        for url, expected_content in pages:
+            result = await crawler.arun(url, config=config)
+            assert result.success, f"Sequential crawl of {url} failed: {result.error_message}"
+            md = result.markdown or ""
+            assert expected_content in md, (
+                f"Expected '{expected_content}' in markdown for {url}, "
+                f"got: {md[:200]}..."
+            )
+
+
+# ---------------------------------------------------------------------------
+# Raw HTML edge cases
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_raw_html_only_whitespace():
+    """Raw HTML with only whitespace body should succeed with empty markdown."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun("raw:<html><body>   \n\t  </body></html>")
+        assert result.success, f"Whitespace-only raw HTML failed: {result.error_message}"
+        md = result.markdown or ""
+        assert len(md.strip()) < 20, "Whitespace-only body should produce minimal markdown"
+
+
+@pytest.mark.asyncio
+async def test_raw_html_script_only():
+    """Raw HTML with only a script tag should produce empty markdown (scripts stripped)."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            "raw:<html><body><script>var x = 1;</script></body></html>"
+        )
+        assert result.success, f"Script-only raw HTML failed: {result.error_message}"
+        md = result.markdown or ""
+        assert "var x" not in md, "Script content should be stripped from markdown"
+
+
+# ---------------------------------------------------------------------------
+# Concurrent crawls
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_concurrent_crawls(local_server):
+    """Use asyncio.gather to crawl 5 pages concurrently with same crawler."""
+    urls = [
+        local_server + "/",
+        local_server + "/products",
+        local_server + "/tables",
+        local_server + "/links-page",
+        local_server + "/images-page",
+    ]
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        tasks = [crawler.arun(url, config=config) for url in urls]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+        for i, result in enumerate(results):
+            assert not isinstance(result, Exception), (
+                f"Concurrent crawl {i} raised exception: {result}"
+            )
+            assert result.success, (
+                f"Concurrent crawl {i} ({urls[i]}) failed: {result.error_message}"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Very long URL
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_long_url(local_server):
+    """Crawl a URL with a very long path (200 chars); catch-all handler serves it."""
+    long_path = "/" + "a" * 200
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + long_path)
+        assert result.success, f"Long URL crawl failed: {result.error_message}"
+
+
+# ---------------------------------------------------------------------------
+# Special URL characters
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_url_with_query_params(local_server):
+    """Crawl a URL with query parameters and verify success."""
+    url = local_server + "/products?page=1&sort=name&filter=electronics"
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url)
+        assert result.success, f"Query params URL crawl failed: {result.error_message}"
+
+
+@pytest.mark.asyncio
+async def test_url_with_fragment(local_server):
+    """Crawl a URL with a fragment identifier and verify success."""
+    url = local_server + "/#section-5"
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url)
+        assert result.success, f"Fragment URL crawl failed: {result.error_message}"
+
+
+# ---------------------------------------------------------------------------
+# Error recovery
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_invalid_url_scheme():
+    """Try crawling an FTP URL; should handle gracefully without crash."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun("ftp://example.com")
+        # Either it fails gracefully with an error or succeeds with empty content
+        # The critical thing is no unhandled exception
+        if not result.success:
+            assert result.error_message is not None, (
+                "Invalid scheme should produce an error message"
+            )
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_nonexistent_domain():
+    """Try crawling a nonexistent domain; should fail gracefully."""
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            "https://this-domain-definitely-does-not-exist-xyz123.com",
+            config=CrawlerRunConfig(page_timeout=10000),
+        )
+        # Should fail but not crash
+        if not result.success:
+            assert result.error_message is not None, (
+                "Nonexistent domain should produce an error message"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Multiple identical crawls (idempotency)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_idempotent_crawl(local_server):
+    """Crawl same URL twice with BYPASS cache; both should succeed with similar content."""
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result1 = await crawler.arun(local_server + "/products", config=config)
+        result2 = await crawler.arun(local_server + "/products", config=config)
+        assert result1.success, f"First crawl failed: {result1.error_message}"
+        assert result2.success, f"Second crawl failed: {result2.error_message}"
+        # Both should have similar content length (within 20% tolerance)
+        len1 = len(result1.markdown or "")
+        len2 = len(result2.markdown or "")
+        if len1 > 0 and len2 > 0:
+            ratio = min(len1, len2) / max(len1, len2)
+            assert ratio > 0.8, (
+                f"Idempotent crawls should produce similar content "
+                f"(len1={len1}, len2={len2}, ratio={ratio:.2f})"
+            )
+
+
+# ---------------------------------------------------------------------------
+# PDF generation
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_pdf_capture(local_server):
+    """Crawl with pdf=True and verify PDF bytes output."""
+    config = CrawlerRunConfig(pdf=True, cache_mode=CacheMode.BYPASS)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/", config=config)
+        assert result.success, f"PDF capture crawl failed: {result.error_message}"
+        assert result.pdf is not None, "PDF should not be None"
+        assert isinstance(result.pdf, bytes), "PDF should be bytes"
+        assert len(result.pdf) > 0, "PDF should be non-empty"
+        # PDF files start with %PDF
+        assert result.pdf[:4] == b"%PDF", "PDF should start with %PDF header"
+
+
+# ---------------------------------------------------------------------------
+# Scan full page
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_scan_full_page(local_server):
+    """Crawl /large with scan_full_page=True to scroll through entire page."""
+    config = CrawlerRunConfig(
+        scan_full_page=True,
+        scroll_delay=0.1,
+        cache_mode=CacheMode.BYPASS,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/large", config=config)
+        assert result.success, f"Scan full page crawl failed: {result.error_message}"
+        md = result.markdown or ""
+        assert len(md) > 100, "Full page scan should produce substantial markdown"
+
+
+# ---------------------------------------------------------------------------
+# Console capture
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_console_capture(local_server):
+    """Crawl /js-dynamic with capture_console_messages=True; verify no error."""
+    config = CrawlerRunConfig(
+        capture_console_messages=True,
+        cache_mode=CacheMode.BYPASS,
+    )
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(local_server + "/js-dynamic", config=config)
+        assert result.success, f"Console capture crawl failed: {result.error_message}"
+        # console_messages should be a list (possibly empty)
+        assert result.console_messages is not None, (
+            "console_messages should not be None when capture_console_messages=True"
+        )
+        assert isinstance(result.console_messages, list), (
+            "console_messages should be a list"
+        )
--- a/tests/regression/test_reg_extraction.py
+++ b/tests/regression/test_reg_extraction.py
@@ -0,0 +1,608 @@
+"""
+Regression tests for Crawl4AI extraction strategies.
+
+Covers JsonCssExtractionStrategy, JsonXPathExtractionStrategy,
+JsonLxmlExtractionStrategy, RegexExtractionStrategy, NoExtractionStrategy,
+and CosineStrategy (optional, requires sklearn).
+
+Run:
+    pytest tests/regression/test_reg_extraction.py -v
+    pytest tests/regression/test_reg_extraction.py -v -m "not network"
+"""
+
+import pytest
+import json
+import time
+
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+from crawl4ai.extraction_strategy import (
+    JsonCssExtractionStrategy,
+    JsonXPathExtractionStrategy,
+    JsonLxmlExtractionStrategy,
+    RegexExtractionStrategy,
+    NoExtractionStrategy,
+)
+
+try:
+    from crawl4ai.extraction_strategy import CosineStrategy
+    # CosineStrategy requires torch and sklearn at instantiation time;
+    # verify they are actually available before declaring it usable.
+    import torch  # noqa: F401
+    HAS_COSINE = True
+except (ImportError, ModuleNotFoundError):
+    HAS_COSINE = False
+
+
+# ---------------------------------------------------------------------------
+# JsonCssExtractionStrategy
+# ---------------------------------------------------------------------------
+
+PRODUCT_CSS_SCHEMA = {
+    "baseSelector": "div.product",
+    "fields": [
+        {"name": "name", "selector": "h2.name", "type": "text"},
+        {"name": "price", "selector": "span.price", "type": "text"},
+        {"name": "description", "selector": "p.description", "type": "text"},
+        {"name": "category", "selector": "span.category", "type": "text"},
+        {
+            "name": "link",
+            "selector": "a.details-link",
+            "type": "attribute",
+            "attribute": "href",
+        },
+    ],
+}
+
+PRODUCT_CSS_SCHEMA_WITH_ID = {
+    "baseSelector": "div.product",
+    "baseFields": [
+        {
+            "name": "product_id",
+            "type": "attribute",
+            "attribute": "data-id",
+        },
+    ],
+    "fields": [
+        {"name": "name", "selector": "h2.name", "type": "text"},
+        {"name": "price", "selector": "span.price", "type": "text"},
+        {"name": "description", "selector": "p.description", "type": "text"},
+        {"name": "category", "selector": "span.category", "type": "text"},
+        {
+            "name": "link",
+            "selector": "a.details-link",
+            "type": "attribute",
+            "attribute": "href",
+        },
+    ],
+}
+
+
+@pytest.mark.asyncio
+async def test_css_extract_products(local_server):
+    """Extract all 5 products from /products using JsonCssExtractionStrategy.
+    Verify count, first product name, price, and product_id."""
+    strategy = JsonCssExtractionStrategy(schema=PRODUCT_CSS_SCHEMA_WITH_ID)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success, f"Crawl failed: {result.error_message}"
+        extracted = json.loads(result.extracted_content)
+        assert isinstance(extracted, list)
+        assert len(extracted) == 5, f"Expected 5 products, got {len(extracted)}"
+
+        first = extracted[0]
+        assert first["name"] == "Wireless Mouse"
+        assert first["price"] == "$29.99"
+        assert first["product_id"] == "1"
+
+
+@pytest.mark.asyncio
+async def test_css_extract_with_default(local_server):
+    """Use a field with a non-existent selector and a default value.
+    Verify the default is used when no element matches."""
+    schema = {
+        "baseSelector": "div.product",
+        "fields": [
+            {"name": "name", "selector": "h2.name", "type": "text"},
+            {
+                "name": "sku",
+                "selector": "span.sku-number",
+                "type": "text",
+                "default": "N/A",
+            },
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) > 0
+        for item in extracted:
+            assert item["sku"] == "N/A", (
+                f"Expected default 'N/A' for missing sku, got: {item.get('sku')}"
+            )
+
+
+@pytest.mark.asyncio
+async def test_css_extract_nested(local_server):
+    """Test nested type extraction using JsonCssExtractionStrategy.
+    Extract a nested object from within each product element."""
+    schema = {
+        "baseSelector": "div.product",
+        "fields": [
+            {"name": "name", "selector": "h2.name", "type": "text"},
+            {
+                "name": "details",
+                "selector": "div.rating",
+                "type": "nested",
+                "fields": [
+                    {
+                        "name": "stars",
+                        "type": "attribute",
+                        "attribute": "data-stars",
+                    },
+                ],
+            },
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) == 5
+        first = extracted[0]
+        assert "details" in first
+        assert first["details"]["stars"] == "4.5"
+
+
+@pytest.mark.asyncio
+async def test_css_extract_empty_results(local_server):
+    """Use a baseSelector that matches nothing and verify an empty list is returned."""
+    schema = {
+        "baseSelector": "div.nonexistent-class-xyz",
+        "fields": [
+            {"name": "text", "selector": "p", "type": "text"},
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert isinstance(extracted, list)
+        assert len(extracted) == 0
+
+
+@pytest.mark.asyncio
+async def test_css_extract_table(local_server):
+    """Extract table rows from /tables using CSS selectors.
+    Verify 4 quarterly rows with correct Q1 revenue."""
+    schema = {
+        "baseSelector": "#sales-table tbody tr",
+        "fields": [
+            {"name": "quarter", "selector": "td:nth-child(1)", "type": "text"},
+            {"name": "revenue", "selector": "td:nth-child(2)", "type": "text"},
+            {"name": "growth", "selector": "td:nth-child(3)", "type": "text"},
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/tables", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) == 4, f"Expected 4 rows, got {len(extracted)}"
+        assert extracted[0]["quarter"] == "Q1 2025"
+        assert extracted[0]["revenue"] == "$1,234,567"
+        assert extracted[0]["growth"] == "12.5%"
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_css_real_quotes():
+    """Crawl quotes.toscrape.com and extract quotes with CSS selectors.
+    Verify multiple quotes are extracted with text and author."""
+    schema = {
+        "baseSelector": "div.quote",
+        "fields": [
+            {"name": "text", "selector": "span.text", "type": "text"},
+            {"name": "author", "selector": "small.author", "type": "text"},
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://quotes.toscrape.com", config=config
+        )
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) > 0, "Expected quotes to be extracted"
+        for quote in extracted:
+            assert "text" in quote and quote["text"], f"Quote missing text: {quote}"
+            assert "author" in quote and quote["author"], f"Quote missing author: {quote}"
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_css_real_books():
+    """Crawl books.toscrape.com and extract book titles and prices."""
+    schema = {
+        "baseSelector": "article.product_pod",
+        "fields": [
+            {"name": "title", "selector": "h3 a", "type": "attribute", "attribute": "title"},
+            {"name": "price", "selector": "p.price_color", "type": "text"},
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://books.toscrape.com", config=config
+        )
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) > 0, "Expected books to be extracted"
+        for book in extracted:
+            assert "title" in book and book["title"]
+            assert "price" in book and book["price"]
+            # Price should start with a currency symbol
+            assert book["price"][0] in ("£", "$", "€") or book["price"].startswith("£")
+
+
+# ---------------------------------------------------------------------------
+# JsonXPathExtractionStrategy
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_xpath_extract_products(local_server):
+    """Extract products using XPath selectors. Verify same results as CSS version."""
+    schema = {
+        # Use exact class match to avoid matching 'product-list' parent
+        "baseSelector": "//div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]",
+        "fields": [
+            {
+                "name": "name",
+                "selector": ".//h2[contains(@class, 'name')]",
+                "type": "text",
+            },
+            {
+                "name": "price",
+                "selector": ".//span[contains(@class, 'price')]",
+                "type": "text",
+            },
+        ],
+    }
+    strategy = JsonXPathExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) == 5, f"Expected 5 products via XPath, got {len(extracted)}"
+        assert extracted[0]["name"] == "Wireless Mouse"
+        assert extracted[0]["price"] == "$29.99"
+
+
+# ---------------------------------------------------------------------------
+# JsonLxmlExtractionStrategy
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_lxml_extract_products(local_server):
+    """Extract products using JsonLxmlExtractionStrategy with the same
+    CSS-style schema. Verify same results as JsonCss."""
+    strategy = JsonLxmlExtractionStrategy(schema=PRODUCT_CSS_SCHEMA)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/products", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) == 5, f"Expected 5 products via lxml, got {len(extracted)}"
+        assert extracted[0]["name"] == "Wireless Mouse"
+        assert extracted[0]["price"] == "$29.99"
+
+
+@pytest.mark.asyncio
+async def test_lxml_caching_performance(local_server):
+    """Extract twice with the same JsonLxmlExtractionStrategy instance.
+    Second extraction should be faster or equal due to caching."""
+    strategy = JsonLxmlExtractionStrategy(schema=PRODUCT_CSS_SCHEMA)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        # First run
+        t0 = time.perf_counter()
+        result1 = await crawler.arun(url=f"{local_server}/products", config=config)
+        t1 = time.perf_counter()
+        first_time = t1 - t0
+
+        # Second run (caching should help)
+        t2 = time.perf_counter()
+        result2 = await crawler.arun(url=f"{local_server}/products", config=config)
+        t3 = time.perf_counter()
+        second_time = t3 - t2
+
+        assert result1.success and result2.success
+        data1 = json.loads(result1.extracted_content)
+        data2 = json.loads(result2.extracted_content)
+        assert len(data1) == len(data2) == 5
+
+        # Allow generous tolerance -- caching may not always be faster due to
+        # browser overhead, but it should certainly not be drastically slower
+        assert second_time < first_time * 3, (
+            f"Second run ({second_time:.3f}s) significantly slower than first ({first_time:.3f}s)"
+        )
+
+
+# ---------------------------------------------------------------------------
+# RegexExtractionStrategy
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_regex_email(local_server):
+    """Extract emails from /regex-test using the Email pattern.
+    Verify both expected addresses are found."""
+    strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        values = [item["value"] for item in extracted]
+        assert any("support@crawl4ai.com" in v for v in values), (
+            f"Expected support@crawl4ai.com in {values}"
+        )
+        assert any("sales@example.org" in v for v in values), (
+            f"Expected sales@example.org in {values}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_regex_phone(local_server):
+    """Extract US phone numbers from /regex-test."""
+    strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.PhoneUS)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        values = [item["value"] for item in extracted]
+        assert len(values) > 0, "Expected at least one phone number"
+        # At least one phone number should contain expected digits
+        all_vals = " ".join(values)
+        assert "555" in all_vals, f"Expected phone with 555 in {values}"
+
+
+@pytest.mark.asyncio
+async def test_regex_url(local_server):
+    """Extract URLs from /regex-test using the Url pattern."""
+    strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Url)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        values = [item["value"] for item in extracted]
+        assert len(values) > 0, "Expected URLs to be extracted"
+        all_vals = " ".join(values)
+        assert "crawl4ai.com" in all_vals
+
+
+@pytest.mark.asyncio
+async def test_regex_all(local_server):
+    """Use RegexExtractionStrategy.All to extract all built-in patterns.
+    Verify it finds emails, phones, URLs, dates, and more."""
+    strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.All)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        labels = {item["label"] for item in extracted}
+        # Should find at least emails, URLs, and dates
+        assert "email" in labels, f"Expected 'email' in labels: {labels}"
+        assert "url" in labels, f"Expected 'url' in labels: {labels}"
+        assert "date_iso" in labels or "date_us" in labels, (
+            f"Expected date patterns in labels: {labels}"
+        )
+
+
+@pytest.mark.asyncio
+async def test_regex_custom(local_server):
+    """Use a custom regex pattern to extract IPv4 addresses.
+    Verify 192.168.1.100 is found."""
+    strategy = RegexExtractionStrategy(
+        custom={"ip_address": r"(?:\d{1,3}\.){3}\d{1,3}"}
+    )
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        values = [item["value"] for item in extracted]
+        assert "192.168.1.100" in values, f"Expected 192.168.1.100 in {values}"
+
+
+@pytest.mark.asyncio
+async def test_regex_output_format(local_server):
+    """Verify each regex extraction result has the expected keys:
+    url, label, value, span."""
+    strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) > 0
+        for item in extracted:
+            assert "url" in item, f"Missing 'url' key in {item}"
+            assert "label" in item, f"Missing 'label' key in {item}"
+            assert "value" in item, f"Missing 'value' key in {item}"
+            assert "span" in item, f"Missing 'span' key in {item}"
+            # Span should be a list/tuple of two ints
+            span = item["span"]
+            assert isinstance(span, (list, tuple)) and len(span) == 2
+
+
+@pytest.mark.asyncio
+async def test_regex_span_accuracy(local_server):
+    """Verify that span[0]:span[1] in the source content equals value.
+    This tests that span offsets are accurate relative to the input text."""
+    strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/regex-test", config=config)
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) > 0
+
+        # The regex runs on the content source (fit_html by default).
+        # We verify the span produces the correct value from that source.
+        # Since we cannot easily get the exact input text the regex ran on,
+        # we verify span[0] < span[1] and the value is non-empty.
+        for item in extracted:
+            span = item["span"]
+            assert span[0] < span[1], f"Invalid span: {span}"
+            assert len(item["value"]) > 0
+            assert span[1] - span[0] == len(item["value"]), (
+                f"Span length ({span[1] - span[0]}) != value length ({len(item['value'])})"
+            )
+
+
+# ---------------------------------------------------------------------------
+# NoExtractionStrategy
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+async def test_no_extraction(local_server):
+    """Crawl with NoExtractionStrategy and verify the framework skips
+    structured extraction (passthrough behavior). The crawler deliberately
+    bypasses extraction for NoExtractionStrategy, leaving extracted_content
+    as None. The actual page content is still available via markdown and html."""
+    strategy = NoExtractionStrategy()
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(url=f"{local_server}/", config=config)
+        assert result.success
+        # The framework explicitly skips extraction for NoExtractionStrategy,
+        # so extracted_content should be None (passthrough -- no processing).
+        assert result.extracted_content is None
+        # But the page content is still fully available
+        assert result.html is not None and len(result.html) > 0
+        assert result.markdown is not None and "Welcome" in result.markdown
+
+
+# ---------------------------------------------------------------------------
+# CosineStrategy (optional - requires sklearn)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.skipif(not HAS_COSINE, reason="CosineStrategy requires sklearn+torch")
+def test_cosine_basic():
+    """Test CosineStrategy extract() directly with pre-chunked text to verify clustering works."""
+    # CosineStrategy.extract() expects text with <|DEL|> or \\n\\n separators.
+    # We test the strategy directly to avoid browser overhead and isolate the logic.
+    topics = [
+        "Machine learning algorithms process large datasets to identify complex patterns "
+        "and make accurate predictions using neural networks and deep learning models.",
+        "Cloud computing provides scalable infrastructure for deploying web applications "
+        "globally across multiple regions and availability zones for high availability.",
+        "Database optimization requires careful indexing strategies and query performance "
+        "tuning to handle millions of transactions per second efficiently.",
+        "Network security involves configuring firewalls intrusion detection systems and "
+        "encrypted communications to protect against cyber threats and attacks.",
+        "Mobile development frameworks enable building cross-platform applications with "
+        "shared codebases that deploy to both iOS and Android platforms.",
+    ]
+    text = "<|DEL|>".join(topics)
+
+    strategy = CosineStrategy(
+        semantic_filter=None,
+        word_count_threshold=5,
+        max_dist=0.5,
+    )
+    result = strategy.extract(url="http://test.com", html=text)
+    assert isinstance(result, list)
+    assert len(result) > 0, "Expected clusters from CosineStrategy"
+    # Each cluster should have 'content' and 'index' keys
+    for item in result:
+        assert "content" in item
+        assert "index" in item
+
+
+# ---------------------------------------------------------------------------
+# Extraction with real URLs
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_extraction_real_quotes_css():
+    """Full pipeline: crawl quotes.toscrape.com, extract with JsonCss,
+    verify structured quote data including text and author."""
+    schema = {
+        "baseSelector": "div.quote",
+        "fields": [
+            {"name": "text", "selector": "span.text", "type": "text"},
+            {"name": "author", "selector": "small.author", "type": "text"},
+            {
+                "name": "tags",
+                "selector": "div.tags",
+                "type": "nested",
+                "fields": [
+                    {
+                        "name": "tag_list",
+                        "selector": "a.tag",
+                        "type": "text",
+                    },
+                ],
+            },
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://quotes.toscrape.com", config=config
+        )
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) >= 5, f"Expected at least 5 quotes, got {len(extracted)}"
+        for quote in extracted:
+            assert quote.get("text"), "Quote text should not be empty"
+            assert quote.get("author"), "Quote author should not be empty"
+
+
+@pytest.mark.asyncio
+@pytest.mark.network
+async def test_extraction_real_books_css():
+    """Crawl books.toscrape.com and extract book listings with titles and prices."""
+    schema = {
+        "baseSelector": "article.product_pod",
+        "fields": [
+            {"name": "title", "selector": "h3 a", "type": "attribute", "attribute": "title"},
+            {"name": "price", "selector": "p.price_color", "type": "text"},
+            {"name": "availability", "selector": "p.availability", "type": "text"},
+        ],
+    }
+    strategy = JsonCssExtractionStrategy(schema=schema)
+    config = CrawlerRunConfig(extraction_strategy=strategy)
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True, verbose=False)) as crawler:
+        result = await crawler.arun(
+            url="https://books.toscrape.com", config=config
+        )
+        assert result.success
+        extracted = json.loads(result.extracted_content)
+        assert len(extracted) >= 10, f"Expected at least 10 books, got {len(extracted)}"
+        for book in extracted:
+            assert book.get("title"), "Book title should not be empty"
+            assert book.get("price"), "Book price should not be empty"
--- a/tests/regression/test_reg_utils.py
+++ b/tests/regression/test_reg_utils.py
@@ -0,0 +1,500 @@
+"""
+Regression tests for Crawl4AI utility functions.
+
+Covers extract_xml_data, URL normalization, CacheContext/CacheMode,
+sanitize_input_encode, content hashing, and image scoring.
+"""
+
+import pytest
+
+from crawl4ai.utils import (
+    extract_xml_data,
+    extract_xml_data_legacy,
+    normalize_url,
+    normalize_url_for_deep_crawl,
+    efficient_normalize_url_for_deep_crawl,
+    sanitize_input_encode,
+    generate_content_hash,
+)
+from crawl4ai.cache_context import CacheContext, CacheMode
+
+
+# ===================================================================
+# extract_xml_data
+# ===================================================================
+
+class TestExtractXmlData:
+    """Verify extract_xml_data correctly parses tag content from strings."""
+
+    def test_basic_single_tag(self):
+        """Basic extraction of a single tag should return its content."""
+        result = extract_xml_data(["blocks"], "<blocks>hello</blocks>")
+        assert result["blocks"] == "hello"
+
+    def test_multiple_tags(self):
+        """Extracting multiple tags should return both."""
+        result = extract_xml_data(["a", "b"], "<a>1</a><b>2</b>")
+        assert result["a"] == "1"
+        assert result["b"] == "2"
+
+    def test_longest_match(self):
+        """When multiple occurrences exist, return the longest content."""
+        text = "<blocks>short</blocks> some text <blocks>this is the longer content here</blocks>"
+        result = extract_xml_data(["blocks"], text)
+        assert result["blocks"] == "this is the longer content here"
+
+    def test_nested_mention_bug_fix_1183(self):
+        """Fix for #1183: nested mention of tag name should not confuse extraction.
+
+        When <think> block mentions <blocks> in prose, the extraction should
+        return the actual <blocks> content, not the prose mention.
+        """
+        text = (
+            "<think>The user wants me to extract <blocks> data from the page.</think>"
+            "<blocks>real extracted data</blocks>"
+        )
+        result = extract_xml_data(["blocks"], text)
+        assert result["blocks"] == "real extracted data"
+
+    def test_missing_tag_returns_empty(self):
+        """Missing tag should return empty string."""
+        result = extract_xml_data(["missing"], "<other>content</other>")
+        assert result["missing"] == ""
+
+    def test_empty_content(self):
+        """Empty tag content should return empty string."""
+        result = extract_xml_data(["blocks"], "<blocks></blocks>")
+        assert result["blocks"] == ""
+
+    def test_multiline_content(self):
+        """Content spanning multiple lines should be extracted."""
+        text = "<blocks>\nline 1\nline 2\nline 3\n</blocks>"
+        result = extract_xml_data(["blocks"], text)
+        assert "line 1" in result["blocks"]
+        assert "line 2" in result["blocks"]
+        assert "line 3" in result["blocks"]
+
+    def test_special_chars_in_content(self):
+        """JSON-like content with special characters should be preserved."""
+        text = '<blocks>{"key": "value", "num": 42}</blocks>'
+        result = extract_xml_data(["blocks"], text)
+        assert '"key": "value"' in result["blocks"]
+        assert '"num": 42' in result["blocks"]
+
+    def test_content_with_angle_brackets(self):
+        """Content with HTML-like angle brackets should work if not same tag."""
+        text = "<blocks>some <b>bold</b> text</blocks>"
+        result = extract_xml_data(["blocks"], text)
+        assert "<b>bold</b>" in result["blocks"]
+
+    def test_multiple_tags_some_missing(self):
+        """Mixed present and missing tags should return values for present, empty for missing."""
+        result = extract_xml_data(["found", "missing"], "<found>yes</found>")
+        assert result["found"] == "yes"
+        assert result["missing"] == ""
+
+    def test_whitespace_stripped(self):
+        """Content should be stripped of leading/trailing whitespace."""
+        result = extract_xml_data(["blocks"], "<blocks>  trimmed  </blocks>")
+        assert result["blocks"] == "trimmed"
+
+
+class TestExtractXmlDataLegacy:
+    """Verify the legacy extract_xml_data function works."""
+
+    def test_basic_extraction(self):
+        """Legacy function should extract basic tag content."""
+        result = extract_xml_data_legacy(["blocks"], "<blocks>hello</blocks>")
+        assert result["blocks"] == "hello"
+
+    def test_missing_tag(self):
+        """Legacy function should return empty string for missing tags."""
+        result = extract_xml_data_legacy(["missing"], "no tags here")
+        assert result["missing"] == ""
+
+
+# ===================================================================
+# URL normalization
+# ===================================================================
+
+class TestNormalizeUrl:
+    """Verify normalize_url handles various URL edge cases."""
+
+    def test_trailing_slash_preserved(self):
+        """Trailing slash should be preserved (fix for #1520)."""
+        result = normalize_url("/foo/bar/", "http://x.com")
+        assert result.endswith("/foo/bar/")
+
+    def test_no_trailing_slash_not_added(self):
+        """URL without trailing slash should NOT have one added."""
+        result = normalize_url("/foo/bar", "http://x.com")
+        assert result.endswith("/foo/bar")
+        assert not result.endswith("/foo/bar/")
+
+    def test_root_path(self):
+        """Root path '/' should be preserved."""
+        result = normalize_url("/", "http://x.com")
+        assert result == "http://x.com/"
+
+    def test_query_param_case_preservation(self):
+        """Query parameter values should NOT be lowercased (fix for #1489).
+
+        cHash=AbCd must remain as-is, not become chash=abcd.
+        """
+        result = normalize_url("/page?cHash=AbCd", "http://x.com")
+        assert "cHash=AbCd" in result
+
+    def test_tracking_params_removed(self):
+        """Common tracking parameters should be removed."""
+        result = normalize_url(
+            "/page?utm_source=google&utm_medium=cpc&real_param=keep",
+            "http://x.com",
+        )
+        assert "utm_source" not in result
+        assert "utm_medium" not in result
+        assert "real_param=keep" in result
+
+    def test_fbclid_removed(self):
+        """fbclid tracking parameter should be removed."""
+        result = normalize_url("/page?fbclid=abc123&keep=yes", "http://x.com")
+        assert "fbclid" not in result
+        assert "keep=yes" in result
+
+    def test_gclid_removed(self):
+        """gclid tracking parameter should be removed."""
+        result = normalize_url("/page?gclid=xyz&keep=yes", "http://x.com")
+        assert "gclid" not in result
+        assert "keep=yes" in result
+
+    def test_tracking_removal_case_insensitive(self):
+        """Tracking parameter removal should be case-insensitive."""
+        # The normalize_url uses k.lower() for comparison
+        result = normalize_url("/page?UTM_SOURCE=test&data=1", "http://x.com")
+        # UTM_SOURCE (uppercase) should be removed since comparison is case-insensitive
+        assert "data=1" in result
+
+    def test_query_sorting(self):
+        """Query parameters should be sorted alphabetically."""
+        result = normalize_url("/page?z=1&a=2&m=3", "http://x.com")
+        # Parameters should appear in alphabetical order
+        idx_a = result.index("a=2")
+        idx_m = result.index("m=3")
+        idx_z = result.index("z=1")
+        assert idx_a < idx_m < idx_z
+
+    def test_fragment_removed_by_default(self):
+        """Fragment (#section) should be removed by default."""
+        result = normalize_url("/page#section", "http://x.com")
+        assert "#section" not in result
+
+    def test_fragment_kept_when_requested(self):
+        """Fragment should be kept when keep_fragment=True."""
+        result = normalize_url("/page#section", "http://x.com", keep_fragment=True)
+        assert "#section" in result
+
+    def test_relative_url_resolution(self):
+        """Relative URLs should be resolved against base_url."""
+        result = normalize_url("page2", "http://x.com/dir/page1")
+        assert result == "http://x.com/dir/page2"
+
+    def test_empty_href_returns_none(self):
+        """Empty href should return None."""
+        result = normalize_url("", "http://x.com")
+        assert result is None
+
+    def test_none_href_returns_none(self):
+        """None href should return None."""
+        result = normalize_url(None, "http://x.com")
+        assert result is None
+
+    def test_hostname_lowercased(self):
+        """Hostname should be lowercased for consistency."""
+        result = normalize_url("/page", "http://EXAMPLE.COM/path")
+        assert "example.com" in result
+
+    def test_no_query_params_still_works(self):
+        """URL without query params should normalize without issue."""
+        result = normalize_url("/simple/path", "http://x.com")
+        assert "http://x.com/simple/path" == result
+
+
+class TestNormalizeUrlForDeepCrawl:
+    """Verify normalize_url_for_deep_crawl handles deep crawl edge cases."""
+
+    def test_trailing_slash_preserved(self):
+        """Trailing slash should be preserved in deep crawl normalization."""
+        result = normalize_url_for_deep_crawl("/foo/bar/", "http://x.com")
+        assert result is not None
+        assert result.endswith("/foo/bar/")
+
+    def test_empty_href_returns_none(self):
+        """Empty href should return None."""
+        result = normalize_url_for_deep_crawl("", "http://x.com")
+        assert result is None
+
+    def test_none_href_returns_none(self):
+        """None href should return None."""
+        result = normalize_url_for_deep_crawl(None, "http://x.com")
+        assert result is None
+
+    def test_fragment_removed(self):
+        """Fragment should be removed in deep crawl normalization."""
+        result = normalize_url_for_deep_crawl("/page#anchor", "http://x.com")
+        assert "#anchor" not in result
+
+    def test_tracking_params_removed(self):
+        """utm_source and similar tracking params should be removed."""
+        result = normalize_url_for_deep_crawl(
+            "/page?utm_source=google&keep=yes", "http://x.com"
+        )
+        assert "utm_source" not in result
+        assert "keep=yes" in result
+
+    def test_hostname_lowercased(self):
+        """Hostname should be lowercased."""
+        result = normalize_url_for_deep_crawl("/page", "http://EXAMPLE.COM")
+        assert "example.com" in result
+
+
+class TestEfficientNormalizeUrlForDeepCrawl:
+    """Verify efficient_normalize_url_for_deep_crawl caching and correctness."""
+
+    def test_trailing_slash_preserved(self):
+        """Trailing slash should be preserved."""
+        result = efficient_normalize_url_for_deep_crawl("/foo/bar/", "http://x.com")
+        assert result is not None
+        assert result.endswith("/foo/bar/")
+
+    def test_cached_results_consistent(self):
+        """Calling twice with same args should return same result (cached)."""
+        result1 = efficient_normalize_url_for_deep_crawl("/cached", "http://x.com")
+        result2 = efficient_normalize_url_for_deep_crawl("/cached", "http://x.com")
+        assert result1 == result2
+
+    def test_empty_href_returns_none(self):
+        """Empty href should return None."""
+        result = efficient_normalize_url_for_deep_crawl("", "http://x.com")
+        assert result is None
+
+    def test_none_href_returns_none(self):
+        """None href should return None."""
+        result = efficient_normalize_url_for_deep_crawl(None, "http://x.com")
+        assert result is None
+
+    def test_fragment_removed(self):
+        """Fragment should be removed."""
+        result = efficient_normalize_url_for_deep_crawl("/page#top", "http://x.com")
+        assert "#top" not in result
+
+    def test_hostname_lowercased(self):
+        """Hostname should be lowercased."""
+        result = efficient_normalize_url_for_deep_crawl("/path", "http://UPPER.COM")
+        assert "upper.com" in result
+
+    def test_relative_url_resolution(self):
+        """Relative URLs should be resolved correctly."""
+        result = efficient_normalize_url_for_deep_crawl(
+            "child", "http://x.com/parent/"
+        )
+        assert result == "http://x.com/parent/child"
+
+
+# ===================================================================
+# CacheContext / CacheMode
+# ===================================================================
+
+class TestCacheMode:
+    """Verify CacheContext behavior for each CacheMode."""
+
+    def test_enabled_reads_and_writes(self):
+        """CacheMode.ENABLED should allow both reads and writes."""
+        ctx = CacheContext("http://example.com", CacheMode.ENABLED)
+        assert ctx.should_read() is True
+        assert ctx.should_write() is True
+
+    def test_disabled_no_reads_no_writes(self):
+        """CacheMode.DISABLED should block both reads and writes."""
+        ctx = CacheContext("http://example.com", CacheMode.DISABLED)
+        assert ctx.should_read() is False
+        assert ctx.should_write() is False
+
+    def test_bypass_no_reads_but_writes(self):
+        """CacheMode.BYPASS should skip reads but allow writes."""
+        ctx = CacheContext("http://example.com", CacheMode.BYPASS)
+        assert ctx.should_read() is False
+        assert ctx.should_write() is False
+
+    def test_read_only_reads_no_writes(self):
+        """CacheMode.READ_ONLY should allow reads, block writes."""
+        ctx = CacheContext("http://example.com", CacheMode.READ_ONLY)
+        assert ctx.should_read() is True
+        assert ctx.should_write() is False
+
+    def test_write_only_no_reads_but_writes(self):
+        """CacheMode.WRITE_ONLY should block reads, allow writes."""
+        ctx = CacheContext("http://example.com", CacheMode.WRITE_ONLY)
+        assert ctx.should_read() is False
+        assert ctx.should_write() is True
+
+    def test_raw_url_not_cacheable(self):
+        """raw:// URLs should not be cacheable regardless of mode."""
+        ctx = CacheContext("raw://<html>test</html>", CacheMode.ENABLED)
+        assert ctx.should_read() is False
+        assert ctx.should_write() is False
+
+    def test_raw_url_is_raw_html(self):
+        """raw:// URLs should be flagged as raw HTML."""
+        ctx = CacheContext("raw://<html>test</html>", CacheMode.ENABLED)
+        assert ctx.is_raw_html is True
+        assert ctx.is_web_url is False
+
+    def test_http_url_is_cacheable(self):
+        """http:// URLs should be cacheable."""
+        ctx = CacheContext("http://example.com", CacheMode.ENABLED)
+        assert ctx.is_cacheable is True
+        assert ctx.is_web_url is True
+
+    def test_https_url_is_cacheable(self):
+        """https:// URLs should be cacheable."""
+        ctx = CacheContext("https://example.com", CacheMode.ENABLED)
+        assert ctx.is_cacheable is True
+
+    def test_file_url_is_cacheable(self):
+        """file:// URLs should be cacheable."""
+        ctx = CacheContext("file:///tmp/test.html", CacheMode.ENABLED)
+        assert ctx.is_cacheable is True
+        assert ctx.is_local_file is True
+
+    def test_always_bypass_overrides_everything(self):
+        """always_bypass=True should force read=False, write=False."""
+        ctx = CacheContext("http://example.com", CacheMode.ENABLED, always_bypass=True)
+        assert ctx.should_read() is False
+        assert ctx.should_write() is False
+
+    def test_display_url_for_web(self):
+        """Display URL for web URLs should be the URL itself."""
+        ctx = CacheContext("http://example.com", CacheMode.ENABLED)
+        assert ctx.display_url == "http://example.com"
+
+    def test_display_url_for_raw(self):
+        """Display URL for raw HTML should be 'Raw HTML'."""
+        ctx = CacheContext("raw://something", CacheMode.ENABLED)
+        assert ctx.display_url == "Raw HTML"
+
+
+# ===================================================================
+# sanitize_input_encode
+# ===================================================================
+
+class TestSanitizeInputEncode:
+    """Verify sanitize_input_encode handles encoding edge cases."""
+
+    def test_normal_utf8_passthrough(self):
+        """Normal UTF-8 text should pass through unchanged."""
+        text = "Hello, world! This is normal text."
+        assert sanitize_input_encode(text) == text
+
+    def test_unicode_text_preserved(self):
+        """Unicode characters should be preserved."""
+        text = "Caf\u00e9 na\u00efve r\u00e9sum\u00e9"
+        assert sanitize_input_encode(text) == text
+
+    def test_empty_string_returns_empty(self):
+        """Empty string should return empty string."""
+        assert sanitize_input_encode("") == ""
+
+    def test_ascii_text_passthrough(self):
+        """Pure ASCII text should pass through."""
+        text = "Simple ASCII text 123"
+        assert sanitize_input_encode(text) == text
+
+    def test_cjk_characters_preserved(self):
+        """CJK characters should be preserved."""
+        text = "\u4f60\u597d\u4e16\u754c"
+        assert sanitize_input_encode(text) == text
+
+    def test_emoji_preserved(self):
+        """Emoji characters should be preserved in UTF-8."""
+        text = "Hello \U0001f600 World"
+        result = sanitize_input_encode(text)
+        assert "Hello" in result
+        assert "World" in result
+
+
+# ===================================================================
+# Content hashing
+# ===================================================================
+
+class TestGenerateContentHash:
+    """Verify generate_content_hash produces consistent results."""
+
+    def test_same_content_same_hash(self):
+        """Same content should produce same hash."""
+        hash1 = generate_content_hash("hello world")
+        hash2 = generate_content_hash("hello world")
+        assert hash1 == hash2
+
+    def test_different_content_different_hash(self):
+        """Different content should produce different hashes."""
+        hash1 = generate_content_hash("hello world")
+        hash2 = generate_content_hash("goodbye world")
+        assert hash1 != hash2
+
+    def test_empty_content_valid_hash(self):
+        """Empty content should produce a valid hash (not an error)."""
+        h = generate_content_hash("")
+        assert isinstance(h, str)
+        assert len(h) > 0
+
+    def test_hash_is_hex_string(self):
+        """Hash should be a hexadecimal string."""
+        h = generate_content_hash("test content")
+        assert all(c in "0123456789abcdef" for c in h)
+
+    def test_hash_deterministic_across_calls(self):
+        """Hash should be deterministic, not random."""
+        content = "The quick brown fox jumps over the lazy dog"
+        hashes = [generate_content_hash(content) for _ in range(10)]
+        assert len(set(hashes)) == 1
+
+    def test_whitespace_sensitive(self):
+        """Hash should be sensitive to whitespace differences."""
+        h1 = generate_content_hash("hello world")
+        h2 = generate_content_hash("hello  world")
+        assert h1 != h2
+
+    def test_case_sensitive(self):
+        """Hash should be case-sensitive."""
+        h1 = generate_content_hash("Hello")
+        h2 = generate_content_hash("hello")
+        assert h1 != h2
+
+    def test_long_content(self):
+        """Long content should hash without error."""
+        content = "x" * 1_000_000
+        h = generate_content_hash(content)
+        assert isinstance(h, str)
+        assert len(h) > 0
+
+
+# ===================================================================
+# Image scoring (import-guarded)
+# ===================================================================
+
+class TestImageScoring:
+    """Test image scoring logic if available.
+
+    score_image_for_usefulness is a nested function, so we test
+    the concept indirectly by checking that the module loads and
+    the scoring constants exist.
+    """
+
+    def test_image_score_threshold_exists(self):
+        """IMAGE_SCORE_THRESHOLD config constant should exist."""
+        from crawl4ai.config import IMAGE_SCORE_THRESHOLD
+        assert isinstance(IMAGE_SCORE_THRESHOLD, (int, float))
+
+    def test_image_description_threshold_exists(self):
+        """IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD should exist."""
+        from crawl4ai.config import IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
+        assert isinstance(IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, (int, float))
				`@@ -0,0 +1 @@`
				`# Crawl4AI Regression Test Suite (crawl4ai-check)`