
Web Crawler Connector for eDiscovery & Data Collection
Capture and index publicly accessible web content for eDiscovery, investigations, and compliance. The Onna Web Crawler lets your team collect and search web pages at scale, on a defensible, unified platform.
Why Use Onna's Web Crawler for eDiscovery Collections
Publicly accessible web content like forum posts, news articles, corporate websites, and regulatory filings can be critical evidence in litigation, investigations, and compliance reviews. But capturing that content in a structured, searchable, and defensible way isn't something standard browser tools are built to do.
Onna's Web Crawler was purpose-built to index web pages and bring their content into your collection workflow. Point it at a URL and Onna captures the page's text, structure, links, and metadata, making web content as searchable and producible as any other data source in your workspace.
Without the right tools, collecting web content creates real challenges:
Pages change or disappear after they become relevant to a matter
Manual screenshots lack the structure needed for search and review
Links, headings, and embedded content need to be captured alongside page text
Password-protected and bot-blocked pages require careful scoping
The Onna Web Crawler addresses these challenges by systematically indexing the content of specified URLs and preserving that content within a structured, auditable collection.
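To illustrate what structured indexing involves (conceptually only; this is not Onna's implementation), a crawler parses a page's HTML into discrete, searchable fields — headings, paragraphs, links — rather than storing a flat image:

```python
from html.parser import HTMLParser

# Toy extractor showing the kinds of structured fields a crawler can
# index from a page: headings, paragraphs, and links. Illustrative
# only -- not Onna's implementation.
class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []
        self.paragraphs = []
        self.links = []
        self._open_tag = None   # heading/paragraph tag currently being read
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "p"):
            self._open_tag, self._buf = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._open_tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            text = "".join(self._buf).strip()
            if text:
                (self.headings if tag.startswith("h")
                 else self.paragraphs).append(text)
            self._open_tag = None

extractor = PageExtractor()
extractor.feed('<h1>Notice</h1>'
               '<p>See <a href="https://example.com/filing">the filing</a>.</p>')
print(extractor.headings)    # ['Notice']
print(extractor.paragraphs)  # ['See the filing.']
print(extractor.links)       # ['https://example.com/filing']
```

Because each field is captured separately, the result can be searched, filtered, and reviewed like any other structured data source — which a screenshot cannot.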
Web Crawler Connector Capabilities
Onna's Web Crawler is designed to bring publicly accessible web content into your eDiscovery and investigation workflows.
Key capabilities include:
URL-based web page indexing
Capture of headings, subheadings, paragraphs, images, and links
Simple, no-credential setup
One-time sync for point-in-time web captures
Audit logs for all collection activity
These capabilities allow organizations to capture and preserve web content as part of a broader collection strategy — no IT involvement or authentication required.
What Data Can Be Collected with the Web Crawler
The connector captures content and structure from publicly accessible web pages including:
Page Content
Headings and subheadings
Paragraphs and body text
Images
Links and text links
Page metadata
These collections preserve the structure and content of web pages so investigators can review and search captured material accurately.
Note: The Web Crawler does not collect files in their native format from links embedded on a web page. Linked files are captured as embedded links only, not as downloadable native files. The Web Crawler also does not collect content from password-protected sites, CAPTCHA-protected sites, or sites that block Onna's user agent.
Web Crawler Metadata Collected
The Web Crawler captures standard page metadata alongside collected content, including page-level metadata embedded within the HTML of each indexed URL. There are no Web Crawler-specific metadata fields beyond what is present on the page itself.
Captured content fields include headings, subheadings, paragraph text, images, and both standard and text links — preserving the full readable structure of each indexed page.
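Page-level metadata of the kind described above typically lives in the page's `<meta>` tags. A minimal sketch of reading those fields (illustrative only, not Onna's implementation; the sample tags are hypothetical):

```python
from html.parser import HTMLParser

# Toy reader for page-level metadata embedded in <meta> tags in the
# HTML head. Illustrative only -- not Onna's implementation.
class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key and a.get("content") is not None:
                self.meta[key] = a["content"]

reader = MetaReader()
reader.feed('<head>'
            '<meta name="description" content="Quarterly filing notice">'
            '<meta property="og:title" content="Acme Corp | Investors">'
            '</head>')
print(reader.meta["description"])  # Quarterly filing notice
```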
How Web Crawler Data Collection Works
Setting up a Web Crawler collection in Onna takes just a few steps and requires no authentication or technical configuration.
Add Web Crawler as a data source
Navigate to your workspace, add the Web Crawler as a source, and specify the URL you want to index.
Start sync
Click Done to begin the collection. Your Web Crawler source will appear alphabetically in your list of connected sources once the sync is underway.
Web Crawler Data Collection Options
The Onna Web Crawler currently supports one sync mode.
One-Time Sync
A point-in-time capture of the specified URL, suited for preserving the current state of a web page for litigation, investigation, or compliance purposes.
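A point-in-time capture is defensible when you can later show exactly what was collected and when. As a rough illustration of the idea (field names here are assumptions, not Onna's record format), a snapshot might pair the collected bytes with a content hash and a UTC timestamp:

```python
import hashlib
from datetime import datetime, timezone

# Sketch of why point-in-time captures are defensible: pairing the
# collected bytes with a content hash and a UTC capture timestamp lets
# you later demonstrate exactly what a page contained when it was
# collected. Field names are assumptions, not Onna's record format.
def snapshot_record(url: str, page_bytes: bytes) -> dict:
    return {
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(page_bytes).hexdigest(),
        "size_bytes": len(page_bytes),
    }

record = snapshot_record("https://example.com", b"<html>...</html>")
print(sorted(record))  # ['captured_at', 'sha256', 'size_bytes', 'url']
```

If the page later changes or disappears, the hash and timestamp still attest to what was captured at collection time.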
Common Web Crawler eDiscovery Use Cases
Litigation Response
Capture and preserve the contents of publicly accessible web pages relevant to a legal matter at a specific point in time.
Open Source Intelligence
Support investigations by indexing and searching publicly accessible web content alongside internal data sources in a single platform.
Regulatory Compliance
Archive public-facing web content to document what was publicly available at a given point in time.
Internal Investigations
Collect web content — such as forum posts, public profiles, or external communications — relevant to an incident or policy violation.
Related Data Source Connectors
Onna connects to 29+ data sources — so when a matter spans web content and Slack, Microsoft Teams, Google Workspace, or Salesforce, you can bring it all together in one platform. No duplicating workflows. No switching between tools.
Onna + Web Crawler Connector FAQs
Can the Web Crawler collect content from password-protected or CAPTCHA-protected sites?
No. The Web Crawler does not currently support password-protected or CAPTCHA-protected websites.
Does the Web Crawler download linked files in their native format?
No. Links embedded on a web page are captured as links, not as downloadable native files. If you need to collect files in their native format, those sources should be connected directly through their respective Onna connectors.
Can Onna collect from a site that blocks its user agent?
No. If a site blocks Onna's user agent, Onna will be unable to collect data from it.
Can the Web Crawler run recurring syncs to track page changes over time?
Not at this time. The Web Crawler supports one-time syncs only, capturing a point-in-time snapshot of the specified URL.
Is Web Crawler collection activity logged?
Yes. Onna maintains a full audit log of all collection activity. Every collection has a documented chain of custody.
Start Capturing Web Content for eDiscovery
Add the Web Crawler to your workspace in minutes and begin indexing publicly accessible web pages alongside the rest of your organization's data sources.