Web Scraper & Knowledge Training Guide

Use the Web Scraper to pull content from your website, help center, or documentation into your bot's knowledge base. The scraper reads pages, extracts meaningful content, and stores it so your bot can answer questions from it.

---

Creating a Scrape Job

Go to Bots → Training → Web Training and click New Scrape Job.

Required fields

  • Name — label for the job (e.g., "Support Site – May 2026")
  • URLs — one or more starting URLs; the scraper will crawl outward from each one

Key settings

  • Crawl Depth — how many link-levels deep to follow from each starting URL. Depth 1 = only pages directly linked from your start URL. Max depth 5.
  • Max Pages — hard cap on pages scraped per job. Your subscription plan sets the ceiling.
  • Smart Scan — skip pages that haven't changed since the last scrape. Recommended on for recurring jobs to reduce wasted pages.
  • Exclude Navigation URLs — drops login, cart, search, and account pages from the results. Recommended on for most sites.
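
To make Crawl Depth and Max Pages concrete, here is a minimal Python sketch of a depth-limited, breadth-first crawl over an in-memory link graph. The `crawl` function, the `links` dictionary, and the example URLs are hypothetical and only illustrate how the two settings interact; this is not the scraper's actual implementation. The sketch assumes the starting URL counts as depth 0 and counts toward the page cap.

```python
from collections import deque

def crawl(start_urls, links, max_depth, max_pages):
    """Breadth-first crawl over a link graph, limited by depth and page count.

    `links` maps each URL to the URLs it links to (a stand-in for real fetching).
    Assumption: start URLs are depth 0, pages they link to are depth 1.
    """
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    scraped = []
    while queue and len(scraped) < max_pages:
        url, depth = queue.popleft()
        scraped.append(url)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return scraped

# Hypothetical site structure
links = {
    "https://example.com/": ["https://example.com/docs", "https://example.com/faq"],
    "https://example.com/docs": ["https://example.com/docs/setup"],
    "https://example.com/docs/setup": ["https://example.com/docs/setup/advanced"],
}

depth_1 = crawl(["https://example.com/"], links, max_depth=1, max_pages=100)
depth_2 = crawl(["https://example.com/"], links, max_depth=2, max_pages=100)
capped = crawl(["https://example.com/"], links, max_depth=5, max_pages=2)
```

In this sketch, `depth_1` holds the start page plus the two pages it links to, `depth_2` also reaches `/docs/setup`, and `capped` stops after two pages even though the depth limit would allow more.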

---

Advanced Options

Expand Advanced Settings in the job creation panel to access:

  • Stealth Mode — mimics a real browser user agent to bypass bot-blocking
  • Remove Consent Popups — auto-dismisses GDPR/cookie banners before reading page content
  • Process iFrames — extracts content from embedded frames
  • Scroll Full Page — triggers lazy-loaded content to appear before reading
  • Follow Linked PDFs — downloads and reads PDF files linked from crawled pages
  • Page Load Delay — wait N seconds after page load for JavaScript to render (useful for SPAs)
  • URL Exclusion Pattern — regex pattern; matching URLs are skipped
  • URL Filter Pattern — regex pattern; only matching URLs are crawled
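
As an illustration of how the two patterns might interact, the following Python sketch applies a filter pattern and an exclusion pattern to a list of URLs. The `should_crawl` helper and the sample patterns are hypothetical. The sketch assumes a pattern may match anywhere in the URL and that exclusion wins when both patterns match; the scraper's actual regex dialect and matching rules may differ, so test patterns on a small job first.

```python
import re

def should_crawl(url, filter_pattern=None, exclusion_pattern=None):
    """Return True if the URL passes the filter and is not excluded."""
    if exclusion_pattern and re.search(exclusion_pattern, url):
        return False  # assumed: exclusion takes precedence
    if filter_pattern and not re.search(filter_pattern, url):
        return False  # a filter is set and this URL does not match it
    return True

urls = [
    "https://example.com/docs/getting-started",
    "https://example.com/docs/archive/2019-release-notes",
    "https://example.com/blog/launch",
]

# Only crawl /docs/ pages, but skip anything under /archive/
selected = [
    u for u in urls
    if should_crawl(u, filter_pattern=r"/docs/", exclusion_pattern=r"/archive/")
]
```

Here only the getting-started page ends up in `selected`: the archive page is excluded and the blog page fails the filter.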

---

During a Scrape Job

The job list shows real-time progress: pages scraped, pages blocked, pages thin (too little content), and pages failed. Click View Activity Log on a running job to see a live per-URL event stream.

You can stop a running job early. Pages that were already scraped are saved.

---

Reading Job Results

Click View Results on a completed job to browse every scraped page. Filter by success/failure and search by URL or content. Failed pages show the specific reason: blocked by bot protection, thin content, timeout, or HTTP error.

---

Gap Report

After a job completes, click Gap Report to see a coverage analysis:

  • Health score — Good / Fair / Poor based on block rate and sitemap coverage
  • Pages blocked — number of pages your scraper couldn't access
  • Pages thin — pages with too little content to be useful
  • Sitemap coverage — percentage of your sitemap pages that were indexed
  • Missing important pages — pages the system identified as likely important but not indexed
  • Recommendations — specific actions to improve coverage on the next run
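
The two inputs to the health score can be read as simple ratios. The Python sketch below is a hypothetical illustration: the function names, the choice of "pages attempted" as the block-rate denominator, and the sample numbers are all assumptions. The thresholds that map these ratios to Good / Fair / Poor are not documented here, so the sketch leaves them out.

```python
def block_rate(pages_blocked, pages_attempted):
    """Share of attempted pages that were blocked, as a percentage."""
    if pages_attempted == 0:
        return 0.0
    return 100.0 * pages_blocked / pages_attempted

def sitemap_coverage(sitemap_urls, indexed_urls):
    """Percentage of sitemap URLs that ended up indexed."""
    sitemap = set(sitemap_urls)
    if not sitemap:
        return 0.0
    return 100.0 * len(sitemap & set(indexed_urls)) / len(sitemap)

# Hypothetical sample data
sitemap = ["/pricing", "/docs", "/faq", "/contact"]
indexed = ["/docs", "/faq", "/blog/launch"]

coverage = sitemap_coverage(sitemap, indexed)  # 2 of 4 sitemap pages indexed
rate = block_rate(pages_blocked=5, pages_attempted=50)

# Sitemap pages that were not indexed
missing = sorted(set(sitemap) - set(indexed))
```

Pages that are indexed but absent from the sitemap, such as `/blog/launch` above, do not raise coverage in this sketch, because coverage is measured against the sitemap only.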

---

Knowledge Gap Monitor

Go to Bots → Training → Gap Monitor to use Velaro's conversation-driven gap detection.

The Gap Monitor cross-references your scrape coverage against bot conversation signals (low CSAT scores, escalations). It produces a ranked re-crawl queue showing which pages are most likely responsible for bot failures, so you can fix the highest-impact gaps first.

  • Bot Failure Signals — count of low-CSAT conversations in the last 30 days, by channel
  • Priority Re-crawl Queue — pages ranked by a failure signal score combining gap severity and conversation impact
  • Schedule Targeted Re-crawl — select specific pages and create a focused re-crawl job for only those URLs

Use the Gap Monitor after every major content update or when bot CSAT drops unexpectedly.

---

Publishing to Your Bot

Scraping alone does not update your bot. After a job completes:

1. Review the results and confirm the content looks correct

2. Click Publish on the job and select your bot's index

3. The bot's knowledge base updates within a few minutes

You can re-publish the same job multiple times. Each publish replaces the bot's indexed content from that job.

---

File Upload

Use File Training to upload PDFs, Word documents, Excel files, CSVs, and other supported files directly, with no URL needed. The scraper extracts text and tables from each file and indexes the content.

Supported formats: PDF, DOCX, XLSX, CSV, TXT, MD, HTML

---

Auto-Rebuild Schedule

Enterprise plans include automatic recurring scrape jobs that re-crawl your site on a schedule (daily, weekly, or monthly) and publish the results automatically. Check your schedule under Quota in the Web Training tab.

---

Troubleshooting

Pages showing as blocked

The site has bot protection. Enable Stealth Mode and add a Page Load Delay of 2–3 seconds. If blocking persists, the site may require authenticated access.

Thin content on most pages

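The scraper is probably capturing navigation, footer, or cookie consent text instead of the main content. Enable Exclude Navigation URLs and Remove Consent Popups, then re-run the job. If the pages load their content with JavaScript, also try Scroll Full Page or a Page Load Delay.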

Bot still can't answer after publishing

Check the Gap Report first: the page that holds the answer may have been blocked or flagged as thin. Then use the Gap Monitor to identify which specific pages to re-crawl.

Scrape job won't start

You may have hit your daily job limit. Check the quota bar at the top of the Web Training tab.
