Website Crawling

Learn how to add content to your chatbot by crawling websites. ChatMaven's website crawler can automatically extract and process content from your website or documentation pages.

Getting Started

Prerequisites

  • Website URL
  • Access permissions
  • Sitemap (optional)
  • Robots.txt compliance

Basic Setup

  1. Go to "Data Sources" → "Website"
  2. Click "Add Website"
  3. Enter the website URL
  4. Configure basic settings:
    • Crawl depth
    • Update frequency
    • Language detection

Configuration Options

URL Settings

  1. Include Patterns

    https://example.com/docs/*
    https://example.com/blog/*
  2. Exclude Patterns

    https://example.com/private/*
    https://example.com/admin/*
  3. Parameters

    • Follow redirects
    • Handle dynamic content
    • Respect nofollow
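
As a sketch of how the include and exclude patterns above interact, glob-style rules like these can be tested with Python's standard `fnmatch` module; the `should_crawl` helper is illustrative, not a ChatMaven API:

```python
from fnmatch import fnmatch

# Patterns copied from the examples above.
INCLUDE = ["https://example.com/docs/*", "https://example.com/blog/*"]
EXCLUDE = ["https://example.com/private/*", "https://example.com/admin/*"]

def should_crawl(url):
    """Crawl a URL only if it matches an include pattern and no exclude pattern."""
    if any(fnmatch(url, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(url, pattern) for pattern in INCLUDE)

print(should_crawl("https://example.com/docs/setup"))   # True
print(should_crawl("https://example.com/admin/users"))  # False
```

Exclude patterns are checked first, so a URL matching both lists is skipped.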

Authentication

  1. Basic Auth

    • Username/password
    • API key
    • Bearer token
  2. Cookie-based

    • Session cookies
    • Authentication tokens
    • Custom headers
  3. OAuth

    • OAuth 2.0 support
    • Token management
    • Refresh handling
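
For illustration, these credential types ultimately travel as `Authorization` (or custom) headers on each request. A minimal Python sketch using only the standard library; the helper names are hypothetical, not ChatMaven functions:

```python
import base64
import urllib.request

def basic_auth_request(url, username, password):
    """Build a request carrying HTTP Basic credentials."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request

def bearer_request(url, token):
    """Build a request carrying a bearer token, e.g. an OAuth 2.0 access token."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    return request

request = bearer_request("https://example.com/docs/", "access-token-123")
print(request.get_header("Authorization"))  # Bearer access-token-123
```

API keys follow the same pattern, though the header name varies by site (often `X-API-Key`).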

Crawling Settings

Depth and Scope

  1. Crawl Depth

    • Surface (1 level)
    • Medium (3 levels)
    • Deep (unlimited)
  2. Content Selection

    • Main content
    • Navigation
    • Footers
    • Sidebars
  3. Media Handling

    • Images
    • PDFs
    • Downloads
    • Embedded content
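
The depth setting bounds how many link hops the crawler follows from the start URL. A minimal breadth-first sketch in Python, using a hypothetical in-memory link graph in place of real fetches:

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages: url -> outgoing links.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup", "/docs/api"],
    "/docs/setup": ["/docs/setup/advanced"],
}

def crawl(start, max_depth):
    """Breadth-first crawl down to max_depth link hops (Surface=1, Medium=3)."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)  # a real crawler would fetch and process the page here
        if depth == max_depth:
            continue  # deeper links are out of scope
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl("/", 1))  # ['/', '/docs', '/blog']
```

With Surface depth (1 level), only the start page and its direct links are crawled; Deep removes the bound entirely.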

Rate Limiting

  • Requests per second
  • Concurrent connections
  • Bandwidth limits
  • Crawl window
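
A requests-per-second cap works by spacing out fetches. A minimal Python sketch; the `RateLimiter` class is illustrative, not part of ChatMaven:

```python
import time

class RateLimiter:
    """Space out fetches to honor a requests-per-second cap."""

    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = None

    def wait(self):
        """Sleep just long enough to keep the configured pace, then record the time."""
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter(requests_per_second=5)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # a real crawler would fetch a page here
elapsed = time.monotonic() - start  # roughly 0.4s: two enforced 0.2s gaps
```

Concurrent connections multiply the effective request rate, so lower the per-connection rate when raising concurrency.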

Content Processing

Extraction Rules

  1. Content Selectors

    article.content
    div.documentation
    section.main
  2. Ignore Elements

    .navigation
    .footer
    .ads
  3. Custom Rules

    • XPath queries
    • CSS selectors
    • Regular expressions
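
To illustrate how ignore rules strip elements such as `.navigation` and `.ads` during extraction, here is a sketch using Python's standard `html.parser`. It assumes well-nested tags and matches only on the `class` attribute; a production extractor would use a full selector engine:

```python
from html.parser import HTMLParser

IGNORE_CLASSES = {"navigation", "footer", "ads"}

class ContentExtractor(HTMLParser):
    """Collect visible text while skipping elements on the ignore list."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & IGNORE_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = ('<article class="content"><h1>Setup</h1>'
        '<div class="ads">Buy now</div><p>Install it.</p></article>')
extractor = ContentExtractor()
extractor.feed(page)
print(" ".join(extractor.chunks))  # Setup Install it.
```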

Content Cleaning

  • Remove ads
  • Clean formatting
  • Extract main content
  • Preserve structure

Scheduling

Automatic Updates

  1. Frequency Options

    • Hourly
    • Daily
    • Weekly
    • Monthly
  2. Update Types

    • Full crawl
    • Incremental
    • Changed pages only
  3. Notifications

    • Completion
    • Errors
    • Changes detected
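
Incremental and changed-pages-only updates rely on comparing each page against the previous crawl. One common approach is content hashing, sketched here with a hypothetical `needs_update` helper and stored state:

```python
import hashlib

def page_fingerprint(html):
    """Hash page content so unchanged pages can be skipped on the next crawl."""
    return hashlib.sha256(html.encode()).hexdigest()

# Fingerprints saved from the previous crawl (hypothetical state).
previous = {"/docs": page_fingerprint("<p>old</p>")}

def needs_update(url, html):
    """True when the page is new or its content changed since the last crawl."""
    return previous.get(url) != page_fingerprint(html)

print(needs_update("/docs", "<p>old</p>"))  # False: unchanged, skip re-processing
print(needs_update("/docs", "<p>new</p>"))  # True: content changed
print(needs_update("/blog", "<p>hi</p>"))   # True: new page
```

A full crawl re-processes every page regardless of fingerprints, which is slower but catches structure changes.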

Monitoring

Performance Metrics

  • Pages crawled
  • Success rate
  • Processing time
  • Error count

Content Changes

  • New pages
  • Modified content
  • Deleted pages
  • Structure changes

Error Tracking

  • Connection issues
  • Authentication failures
  • Processing errors
  • Rate limiting

Best Practices

Optimization

  1. Performance

    • Set appropriate delays
    • Use incremental updates
    • Optimize selectors
  2. Resource Usage

    • Limit concurrent requests
    • Schedule during off-peak
    • Monitor bandwidth
  3. Content Quality

    • Verify extracted content
    • Check formatting
    • Test in chatbot

Maintenance

  1. Regular Tasks

    • Review crawl logs
    • Update patterns
    • Check authentication
    • Verify content
  2. Troubleshooting

    • Monitor errors
    • Check access
    • Verify settings
    • Test selectors


FAQ and Troubleshooting

The crawl finds too few pages.

Submit a sitemap, check robots.txt, and make sure pages are reachable through regular links rather than orphaned behind JavaScript-only menus without public URLs.
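
You can verify robots.txt rules yourself with Python's standard `urllib.robotparser` before assuming the crawler is at fault; the user-agent string below is hypothetical:

```python
from urllib.robotparser import RobotFileParser

# robots.txt content, shown inline here; a crawler would fetch it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ChatMavenBot" is an illustrative user-agent string.
print(parser.can_fetch("ChatMavenBot", "https://example.com/docs/setup"))   # True
print(parser.can_fetch("ChatMavenBot", "https://example.com/private/key"))  # False
print(parser.site_maps())  # sitemaps declared in robots.txt
```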

Private or login-only content.

The crawler cannot read content behind a login. Upload those pages as files, or use an API integration with server-side credentials.

How often does the site refresh?

Crawled content refreshes on the schedule you configure (hourly to monthly) or when you trigger a manual refresh. Run a full re-crawl after large site migrations.