
Web Crawler Connector for eDiscovery & Data Collection
Capture and index publicly accessible web content for eDiscovery, investigations, and compliance. The Onna Web Crawler lets your team collect and search web pages at scale, on a defensible, unified platform.
Why Use Onna's Web Crawler for eDiscovery Collections
Publicly accessible web content like forum posts, news articles, corporate websites, and regulatory filings can be critical evidence in litigation, investigations, and compliance reviews. But capturing that content in a structured, searchable, and defensible way isn't something standard browser tools are built to do.
Onna's Web Crawler was purpose-built to index web pages and bring their content into your collection workflow. Point it at a URL and Onna captures the page's text, structure, links, and metadata, making web content as searchable and producible as any other data source in your workspace.
Without the right tools, collecting web content creates real challenges:
Pages change or disappear after they become relevant to a matter
Manual screenshots lack the structure needed for search and review
Links, headings, and embedded content need to be captured alongside page text
Password-protected and bot-blocked pages require careful scoping
The Onna Web Crawler addresses these challenges by systematically indexing the content of specified URLs and preserving that content within a structured, auditable collection.
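To illustrate what structured indexing involves (conceptually only; this is not Onna's implementation), a crawler parses a page's HTML into discrete, searchable fields — headings, paragraphs, links — rather than storing a flat image:

```python
from html.parser import HTMLParser

# Toy extractor showing the kinds of structured fields a crawler can
# index from a page: headings, paragraphs, and links. Illustrative
# only -- not Onna's implementation.
class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []
        self.paragraphs = []
        self.links = []
        self._open_tag = None   # heading/paragraph tag currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "p"):
            self._open_tag, self._buf = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._open_tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            text = "".join(self._buf).strip()
            if text:
                (self.headings if tag.startswith("h")
                 else self.paragraphs).append(text)
            self._open_tag = None

extractor = PageExtractor()
extractor.feed('<h1>Notice</h1>'
               '<p>See <a href="https://example.com/filing">the filing</a>.</p>')
print(extractor.headings)    # ['Notice']
print(extractor.paragraphs)  # ['See the filing.']
print(extractor.links)       # ['https://example.com/filing']
```

Because each field is captured separately, the result can be searched, filtered, and reviewed like any other structured data source — which a screenshot cannot.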
Web Crawler Connector Capabilities
Onna's Web Crawler is designed to bring publicly accessible web content into your eDiscovery and investigation workflows.
Key capabilities include:
URL-based web page indexing
Capture of headings, subheadings, paragraphs, images, and links
Simple, no-credential setup
One-time sync for point-in-time web captures
Audit logs for all collection activity
These capabilities allow organizations to capture and preserve web content as part of a broader collection strategy — no IT involvement or authentication required.
What Data Can Be Collected with the Web Crawler
The connector captures content and structure from publicly accessible web pages including:
Page Content
Headings and subheadings
Paragraphs and body text
Images
Links and text links
Page metadata
These collections preserve the structure and content of web pages so investigators can review and search captured material accurately.
Note: The Web Crawler does not collect files in their native format from links embedded on a web page. Linked files are captured as embedded links only, not as downloadable native files. The Web Crawler also does not collect content from password-protected sites, CAPTCHA-protected sites, or sites that block Onna's user agent.
Web Crawler Metadata Collected
The Web Crawler captures standard page metadata alongside collected content, including page-level metadata embedded within the HTML of each indexed URL. There are no Web Crawler-specific metadata fields beyond what is present on the page itself.
Captured content fields include headings, subheadings, paragraph text, images, and both standard and text links — preserving the full readable structure of each indexed page.
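Page-level metadata of the kind described above typically lives in the page's `<meta>` tags. A minimal sketch of reading those fields (illustrative only, not Onna's implementation; the sample tags are hypothetical):

```python
from html.parser import HTMLParser

# Toy reader for page-level metadata embedded in <meta> tags in the
# HTML head. Illustrative only -- not Onna's implementation.
class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key and a.get("content") is not None:
                self.meta[key] = a["content"]

reader = MetaReader()
reader.feed('<head>'
            '<meta name="description" content="Quarterly filing notice">'
            '<meta property="og:title" content="Acme Corp | Investors">'
            '</head>')
print(reader.meta["description"])  # Quarterly filing notice
```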
How Web Crawler Data Collection Works
Setting up a Web Crawler collection in Onna takes just a few steps and requires no authentication or technical configuration.
Add Web Crawler as a data source
Navigate to your workspace, add the Web Crawler as a source, and specify the URL you want to index.
Start sync
Click Done to begin the collection. Your Web Crawler source will appear alphabetically in your list of connected sources once the sync is underway.
Web Crawler Data Collection Options
The Onna Web Crawler currently supports one sync mode.
One-Time Sync
A point-in-time capture of the specified URL, suited for preserving the current state of a web page for litigation, investigation, or compliance purposes.
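A point-in-time capture is defensible when you can later show exactly what was collected and when. As a rough illustration of the idea (field names here are assumptions, not Onna's record format), a snapshot might pair the collected bytes with a content hash and a UTC timestamp:

```python
import hashlib
from datetime import datetime, timezone

# Sketch of why point-in-time captures are defensible: pairing the
# collected bytes with a content hash and a UTC capture timestamp lets
# you later demonstrate exactly what a page contained when it was
# collected. Field names are assumptions, not Onna's record format.
def snapshot_record(url: str, page_bytes: bytes) -> dict:
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(page_bytes).hexdigest(),
        "size_bytes": len(page_bytes),
    }

record = snapshot_record("https://example.com", b"<html>...</html>")
print(sorted(record))  # ['captured_at', 'sha256', 'size_bytes', 'url']
```

If the page later changes or disappears, the hash and timestamp still attest to what was captured at collection time.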
Common Web Crawler eDiscovery Use Cases
Litigation Response
Capture and preserve the contents of publicly accessible web pages relevant to a legal matter at a specific point in time.
Open Source Intelligence
Support investigations by indexing and searching publicly accessible web content alongside internal data sources in a single platform.
Regulatory Compliance
Archive public-facing web content to document what was publicly available at a given point in time.
Internal Investigations
Collect web content — such as forum posts, public profiles, or external communications — relevant to an incident or policy violation.
Related Data Source Connectors
Onna connects to 29+ data sources — so when a matter spans web content and Slack, Microsoft Teams, Google Workspace, or Salesforce, you can bring it all together in one platform. No duplicating workflows. No switching between tools.
Onna + Web Crawler Connector FAQs
Can the Web Crawler collect content from password-protected or CAPTCHA-protected sites?
No. The Web Crawler does not currently support password-protected or CAPTCHA-protected websites.
Does the Web Crawler download linked files in their native format?
No. Links embedded on a web page are captured as links, not as downloadable native files. If you need to collect files in their native format, those sources should be connected directly through their respective Onna connectors.
Can Onna collect from a site that blocks its user agent?
No. If a site blocks Onna's user agent, Onna will be unable to collect data from it.
Can the Web Crawler run recurring syncs to track page changes over time?
Not at this time. The Web Crawler supports one-time syncs only, capturing a point-in-time snapshot of the specified URL.
Is Web Crawler collection activity logged?
Yes. Onna maintains a full audit log of all collection activity. Every collection has a documented chain of custody.
Start Capturing Web Content for eDiscovery
Add the Web Crawler to your workspace in minutes and begin indexing publicly accessible web pages alongside the rest of your organization's data sources.