Website Crawling

Learn how to add content to your chatbot by crawling websites. ChatMaven's website crawler can automatically extract and process content from your website or documentation pages.

Getting Started

Prerequisites

  • Website URL
  • Access permissions
  • Sitemap (optional)
  • Robots.txt compliance

Basic Setup

  1. Go to "Data Sources" → "Website"
  2. Click "Add Website"
  3. Enter the website URL
  4. Configure basic settings:
    • Crawl depth
    • Update frequency
    • Language detection

Configuration Options

URL Settings

  1. Include Patterns

    https://example.com/docs/*
    https://example.com/blog/*
  2. Exclude Patterns

    https://example.com/private/*
    https://example.com/admin/*
  3. Parameters

    • Follow redirects
    • Handle dynamic content
    • Respect nofollow
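
As a sketch of how the include and exclude patterns above interact, glob-style rules like these can be tested with Python's standard `fnmatch` module; the `should_crawl` helper is illustrative, not a ChatMaven API:

```python
from fnmatch import fnmatch

# Patterns copied from the examples above.
INCLUDE = ["https://example.com/docs/*", "https://example.com/blog/*"]
EXCLUDE = ["https://example.com/private/*", "https://example.com/admin/*"]

def should_crawl(url):
    """Crawl a URL only if it matches an include pattern and no exclude pattern."""
    if any(fnmatch(url, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(url, pattern) for pattern in INCLUDE)

print(should_crawl("https://example.com/docs/setup"))   # True
print(should_crawl("https://example.com/admin/users"))  # False
```

Exclude patterns are checked first, so a URL matching both lists is skipped.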

Authentication

  1. Basic Auth

    • Username/password
    • API key
    • Bearer token
  2. Cookie-based

    • Session cookies
    • Authentication tokens
    • Custom headers
  3. OAuth

    • OAuth 2.0 support
    • Token management
    • Refresh handling
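
For illustration, these credential types ultimately travel as `Authorization` (or custom) headers on each request. A minimal Python sketch using only the standard library; the helper names are hypothetical, not ChatMaven functions:

```python
import base64
import urllib.request

def basic_auth_request(url, username, password):
    """Build a request carrying HTTP Basic credentials."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request

def bearer_request(url, token):
    """Build a request carrying a bearer token, e.g. an OAuth 2.0 access token."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    return request

request = bearer_request("https://example.com/docs/", "access-token-123")
print(request.get_header("Authorization"))  # Bearer access-token-123
```

API keys follow the same pattern, though the header name varies by site (often `X-API-Key`).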

Crawling Settings

Depth and Scope

  1. Crawl Depth

    • Surface (1 level)
    • Medium (3 levels)
    • Deep (unlimited)
  2. Content Selection

    • Main content
    • Navigation
    • Footers
    • Sidebars
  3. Media Handling

    • Images
    • PDFs
    • Downloads
    • Embedded content
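
The depth setting bounds how many link hops the crawler follows from the start URL. A minimal breadth-first sketch in Python, using a hypothetical in-memory link graph in place of real fetches:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages: url -> outgoing links.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup", "/docs/api"],
    "/docs/setup": ["/docs/setup/advanced"],
}

def crawl(start, max_depth):
    """Breadth-first crawl down to max_depth link hops (Surface=1, Medium=3)."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)  # a real crawler would fetch and process the page here
        if depth == max_depth:
            continue  # deeper links are out of scope
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl("/", 1))  # ['/', '/docs', '/blog']
```

With Surface depth (1 level), only the start page and its direct links are crawled; Deep removes the bound entirely.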

Rate Limiting

  • Requests per second
  • Concurrent connections
  • Bandwidth limits
  • Crawl window
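
A requests-per-second cap works by spacing out fetches. A minimal Python sketch; the `RateLimiter` class is illustrative, not part of ChatMaven:

```python
import time

class RateLimiter:
    """Space out fetches to honor a requests-per-second cap."""

    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = None

    def wait(self):
        """Sleep just long enough to keep the configured pace, then record the time."""
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter(requests_per_second=5)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # a real crawler would fetch a page here
elapsed = time.monotonic() - start  # roughly 0.4s: two enforced 0.2s gaps
```

Concurrent connections multiply the effective request rate, so lower the per-connection rate when raising concurrency.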

Content Processing

Extraction Rules

  1. Content Selectors

    article.content
    div.documentation
    section.main
  2. Ignore Elements

    .navigation
    .footer
    .ads
  3. Custom Rules

    • XPath queries
    • CSS selectors
    • Regular expressions
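
To illustrate how ignore rules strip elements such as `.navigation` and `.ads` during extraction, here is a sketch using Python's standard `html.parser`. It assumes well-nested tags and matches only on the `class` attribute; a production extractor would use a full selector engine:

```python
from html.parser import HTMLParser

IGNORE_CLASSES = {"navigation", "footer", "ads"}

class ContentExtractor(HTMLParser):
    """Collect visible text while skipping elements on the ignore list."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & IGNORE_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = ('<article class="content"><h1>Setup</h1>'
        '<div class="ads">Buy now</div><p>Install it.</p></article>')
extractor = ContentExtractor()
extractor.feed(page)
print(" ".join(extractor.chunks))  # Setup Install it.
```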

Content Cleaning

  • Remove ads
  • Clean formatting
  • Extract main content
  • Preserve structure

Scheduling

Automatic Updates

  1. Frequency Options

    • Hourly
    • Daily
    • Weekly
    • Monthly
  2. Update Types

    • Full crawl
    • Incremental
    • Changed pages only
  3. Notifications

    • Completion
    • Errors
    • Changes detected
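
Incremental and changed-pages-only updates rely on comparing each page against the previous crawl. One common approach is content hashing, sketched here with a hypothetical `needs_update` helper and stored state:

```python
import hashlib

def page_fingerprint(html):
    """Hash page content so unchanged pages can be skipped on the next crawl."""
    return hashlib.sha256(html.encode()).hexdigest()

# Fingerprints saved from the previous crawl (hypothetical state).
previous = {"/docs": page_fingerprint("<p>old</p>")}

def needs_update(url, html):
    """True when the page is new or its content changed since the last crawl."""
    return previous.get(url) != page_fingerprint(html)

print(needs_update("/docs", "<p>old</p>"))  # False: unchanged, skip re-processing
print(needs_update("/docs", "<p>new</p>"))  # True: content changed
print(needs_update("/blog", "<p>hi</p>"))   # True: new page
```

A full crawl re-processes every page regardless of fingerprints, which is slower but catches structure changes.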

Monitoring

Performance Metrics

  • Pages crawled
  • Success rate
  • Processing time
  • Error count

Content Changes

  • New pages
  • Modified content
  • Deleted pages
  • Structure changes

Error Tracking

  • Connection issues
  • Authentication failures
  • Processing errors
  • Rate limiting

Best Practices

Optimization

  1. Performance

    • Set appropriate delays
    • Use incremental updates
    • Optimize selectors
  2. Resource Usage

    • Limit concurrent requests
    • Schedule during off-peak
    • Monitor bandwidth
  3. Content Quality

    • Verify extracted content
    • Check formatting
    • Test in chatbot

Maintenance

  1. Regular Tasks

    • Review crawl logs
    • Update patterns
    • Check authentication
    • Verify content
  2. Troubleshooting

    • Monitor errors
    • Check access
    • Verify settings
    • Test selectors


FAQ and Troubleshooting

The crawl finds too few pages.

Submit a sitemap, check robots.txt, and make sure pages are reachable through regular links rather than orphaned behind JavaScript-only menus without public URLs.
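
You can verify robots.txt rules yourself with Python's standard `urllib.robotparser` before assuming the crawler is at fault; the user-agent string below is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# robots.txt content, shown inline here; a crawler would fetch it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ChatMavenBot" is an illustrative user-agent string.
print(parser.can_fetch("ChatMavenBot", "https://example.com/docs/setup"))   # True
print(parser.can_fetch("ChatMavenBot", "https://example.com/private/key"))  # False
print(parser.site_maps())  # sitemaps declared in robots.txt
```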

Private or login-only content.

The crawler cannot read content behind a login. Upload those pages as files, or use an API integration with server-side credentials.

How often does the site refresh?

Crawled content refreshes on the schedule you configure (hourly to monthly) or when you trigger a manual refresh. Run a full re-crawl after large site migrations.