Overcollection in eDiscovery occurs when organizations gather significantly more data than is proportionate or relevant to the legal matter at hand. This inflates processing costs, extends review timelines, and increases exposure risk, without improving case outcomes. Targeted data collection for investigations addresses this by applying defined parameters, such as custodian scope, date ranges, and keywords, before data enters the processing pipeline.
Why Overcollection Is a Real Business Problem
The volume of enterprise data has grown sharply in recent years. According to IDC's Global DataSphere research, the amount of data created and replicated globally reached 120 zettabytes in 2023 and is forecast to grow significantly through 2027. For legal and compliance teams, this growth translates directly into more data to assess, process, and review during any investigation or litigation.
When collection is broader than it needs to be, the downstream impact is compounded at every stage of the eDiscovery workflow:
- Processing costs scale with data volume, not relevance.
- Review hours increase, raising outside counsel fees.
- Sensitive data from non-relevant custodians is unnecessarily exposed.
- Audit trails become harder to manage and defend.
A 2023 survey by the Association of Certified E-Discovery Specialists (ACEDS) found that data volume and cost management remain among the top challenges practitioners face. Rule 26(b)(1) of the Federal Rules of Civil Procedure requires that discovery be proportional to the needs of the case, which makes overcollection a compliance risk as well as a cost issue.
How Targeted Data Collection for Investigations Works
Reducing overcollection is not about collecting less. It is about collecting more precisely. A structured collection approach applies defined parameters before data enters the review queue, reducing noise while preserving defensibility.
1. Scope Definition Before Collection Begins
Legal and compliance teams define custodians, relevant data sources, date ranges, and keyword lists before any data is pulled. This upstream discipline shapes everything that follows. Skipping this step is the most common cause of overcollection.
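One way to make scope definition concrete is to record the agreed parameters as structured data and check them for completeness before anything is pulled. The sketch below is illustrative only: `CollectionScope`, its field names, and the example values are assumptions made for this article, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CollectionScope:
    """Hypothetical record of the parameters agreed before collection."""
    matter_id: str
    custodians: frozenset
    sources: frozenset
    start: date
    end: date
    keywords: frozenset

    def problems(self):
        """Return reasons the scope is not ready; an empty list means ready."""
        issues = []
        if not self.custodians:
            issues.append("no custodians named")
        if not self.sources:
            issues.append("no data sources named")
        if self.start > self.end:
            issues.append("date range is inverted")
        if not self.keywords:
            issues.append("no keywords listed")
        return issues


# Example values are placeholders, not taken from a real matter.
scope = CollectionScope(
    matter_id="HR-EXAMPLE-001",
    custodians=frozenset({"a.lee@example.com", "b.khan@example.com"}),
    sources=frozenset({"email", "slack"}),
    start=date(2024, 1, 1),
    end=date(2024, 6, 30),
    keywords=frozenset({"reassignment", "complaint"}),
)
```

A scope with an empty custodian list or an inverted date range is reported by `problems()`, which gives legal and IT a shared, checkable definition of "ready to collect".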
2. Targeted Collection from Structured and Unstructured Sources
Modern organizations run on email, file shares, and, increasingly, collaboration platforms such as Slack, Microsoft Teams, Google Workspace, and Zoom. These sources require collection tools that can filter at the source rather than after the fact. Onna's eDiscovery collections capability enables teams to apply custodian, date, and keyword filters directly at the point of collection, before data is moved into the processing pipeline.
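The idea of filtering at the source can be sketched in a few lines. The example below is a simplified illustration, not how any specific connector works: the item dictionaries, the `in_scope` and `collect` functions, and the plain substring keyword match are all assumptions. Real platforms typically expose their own search syntax and indexing.

```python
from datetime import date, datetime


def in_scope(item, custodians, start, end, keywords):
    """Return True only if the item passes custodian, date, and keyword filters."""
    if item["custodian"] not in custodians:
        return False
    if not (start <= item["sent"].date() <= end):
        return False
    text = item["text"].lower()
    return any(k.lower() in text for k in keywords)


def collect(source_items, custodians, start, end, keywords):
    """Yield matching items lazily, so out-of-scope items are never copied."""
    for item in source_items:
        if in_scope(item, custodians, start, end, keywords):
            yield item


# Hypothetical items, shaped as a source connector might expose them.
items = [
    {"custodian": "a.lee@example.com", "sent": datetime(2024, 3, 4, 9, 30),
     "text": "Following up on the complaint from Friday."},
    {"custodian": "a.lee@example.com", "sent": datetime(2023, 11, 2, 14, 0),
     "text": "Complaint about the printer again."},
    {"custodian": "c.ortiz@example.com", "sent": datetime(2024, 3, 5, 10, 0),
     "text": "Lunch order for the team."},
]

collected = list(collect(
    items,
    custodians={"a.lee@example.com"},
    start=date(2024, 1, 1),
    end=date(2024, 6, 30),
    keywords={"complaint"},
))
```

Only the first item survives: the second falls outside the date range and the third belongs to a custodian who is not in scope.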
3. eDiscovery Processing With Deduplication and Filtering
Once collected, eDiscovery processing applies deduplication, near-deduplication, de-NISTing (removing known system and application files by matching their hashes against the NIST National Software Reference Library), and format normalization. These steps further reduce volume before the data reaches the review layer, without removing unique, potentially relevant content.
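Exact deduplication and de-NISTing both come down to comparing file hashes. The sketch below shows the general technique under stated assumptions: `reduce_volume` is a hypothetical helper, the known-file hash set is a small stand-in for a real reference list, and near-deduplication (which relies on similarity scoring rather than exact hashes) is not shown.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def reduce_volume(files, known_system_hashes):
    """Drop known system files and exact duplicates.

    files: iterable of (name, bytes) pairs.
    Returns (kept_names, log), where log records why each file was dropped.
    """
    first_seen = {}  # digest -> name of the first file with that content
    kept, log = [], []
    for name, data in files:
        digest = sha256_hex(data)
        if digest in known_system_hashes:
            log.append((name, "de-NIST: known system file"))
        elif digest in first_seen:
            log.append((name, f"duplicate of {first_seen[digest]}"))
        else:
            first_seen[digest] = name
            kept.append(name)
    return kept, log


# Placeholder content; a real known-file set would come from a reference list.
files = [
    ("report.docx", b"quarterly numbers"),
    ("report-copy.docx", b"quarterly numbers"),
    ("kernel32.dll", b"system bytes"),
    ("notes.txt", b"meeting notes"),
]
known = {sha256_hex(b"system bytes")}
kept, log = reduce_volume(files, known)
```

Keeping a log of every suppressed file, with the reason, is what allows the reduction to be explained later if it is challenged.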
4. Data Preservation That Is Defensible
Targeted collection does not mean skipping data preservation obligations. Legal holds must be issued, tracked, and documented before any filtering decisions are made. Preservation ensures that relevant data is protected from deletion or alteration, while targeted collection determines what actually moves forward into the review workflow.
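The ordering described here, hold first and filtering second, can be enforced as a simple gate in a collection workflow. This is a minimal sketch under assumptions: `LegalHold`, `HoldNotInPlace`, and `assert_holds_in_place` are hypothetical names, and a real hold system would also track acknowledgments, reminders, and releases.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LegalHold:
    """Hypothetical hold record; issued_at stays None until the notice is sent."""
    custodian: str
    issued_at: Optional[datetime] = None


class HoldNotInPlace(Exception):
    """Raised when collection is requested before a hold has been issued."""


def assert_holds_in_place(holds, requested_custodians):
    """Refuse to proceed unless every requested custodian has an issued hold."""
    issued = {h.custodian for h in holds if h.issued_at is not None}
    missing = sorted(set(requested_custodians) - issued)
    if missing:
        raise HoldNotInPlace("no issued hold for: " + ", ".join(missing))


holds = [
    LegalHold("a.lee@example.com", datetime(2024, 7, 1, tzinfo=timezone.utc)),
    LegalHold("b.khan@example.com"),  # drafted but not yet issued
]
```

Calling `assert_holds_in_place(holds, ["a.lee@example.com"])` passes silently, while including `b.khan@example.com` raises `HoldNotInPlace` until that hold is issued.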
Common Challenges in eDiscovery Collection
Even teams with clear processes encounter predictable friction points during data collection for investigations:
- Custodian sprawl: When the scope of relevant personnel is unclear, collection expands to cover everyone, including those with no material involvement.
- Collaboration app fragmentation: Platforms like Slack and Microsoft Teams store data in formats that are difficult to collect and filter without purpose-built data collection software.
- Late-stage filtering: Applying filters after collection rather than before moves the cost burden downstream without reducing the total data processed.
- Defensibility concerns: Legal teams sometimes overcollect out of caution, fearing that narrowed collection will be challenged. Clear documentation of collection decisions addresses this concern more effectively than volume.
- Lack of coordination between IT and legal: When IT executes collection without legal parameters, the result is often over-broad exports that legal teams must then sort through manually.
Practical Use Cases
HR Investigation at a Mid-Size Technology Company
An internal HR matter involves four employees across two departments over a six-month period. Rather than pulling all email and Slack data organization-wide, the legal team defines custodians, sets a date range, and applies keyword filters specific to the matter. Collection is limited to the four individuals across only the relevant data sources. eDiscovery processing then deduplicates thread content across custodians. The resulting review set is a fraction of what an unconstrained collection would have produced.
Regulatory Response for a Financial Services Firm
A regulator requests documentation related to a specific product line over a 12-month window. Using Onna's platform, the compliance team collects from targeted custodians across email, cloud storage, and Microsoft Teams. Filters are applied at the source. The legal team receives a proportionate, defensible dataset rather than a bulk export requiring weeks of manual triage.
eDiscovery Collection Checklist: Reducing Overcollection
Use this checklist to structure targeted data collection for investigations across pre-collection, active collection, and quality assurance stages.
Frequently Asked Questions
What is overcollection in eDiscovery?
Overcollection refers to gathering more data than is proportionate or relevant to a legal or compliance matter. It typically occurs when collection parameters, such as custodians, date ranges, and keywords, are not defined in advance, resulting in broad data pulls that inflate processing and review costs.
How does targeted collection reduce eDiscovery costs?
When filters are applied at the point of collection, only relevant data enters the processing and review pipeline. This reduces the volume of data that must be processed, deduplicated, and reviewed, directly lowering the costs associated with each of those stages.
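A rough cost model shows why volume reduction carries through every stage. Every number below is a made-up placeholder chosen for easy arithmetic, not a market rate, and the model assumes every collected document is reviewed, which overstates review cost where culling or technology-assisted review is in use.

```python
def pipeline_cost(gb, processing_per_gb, docs_per_gb, review_per_doc):
    """Processing plus review cost for a collection of the given size."""
    processing = gb * processing_per_gb
    review = gb * docs_per_gb * review_per_doc
    return processing + review


# Placeholder inputs for illustration only.
RATES = dict(processing_per_gb=10, docs_per_gb=1000, review_per_doc=1)

broad = pipeline_cost(500, **RATES)     # unconstrained collection
targeted = pipeline_cost(50, **RATES)   # same matter, filtered at the source
savings = broad - targeted
```

With these placeholder inputs, the broad collection costs 505,000 units and the targeted one 50,500. Because both processing and review scale with volume, a tenfold reduction at collection produces a tenfold reduction in both stages.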
Can targeted collection put defensibility at risk?
Defensibility is a function of documented process, not data volume. When collection decisions (custodian scope, date parameters, and keyword logic) are clearly documented and aligned with legal hold requirements, a targeted collection is defensible. The Sedona Conference Commentary on Proportionality supports the use of proportionate collection methods in civil litigation.
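Documentation is easier to defend when each collection decision is captured as a structured record at the time it is made. The sketch below is one possible shape for such a record, with a content hash so later edits are detectable. The field names and the `decision_record` and `verify` helpers are assumptions for illustration, not a prescribed format.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body):
    """SHA-256 over a canonical JSON form of the record body."""
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def decision_record(matter_id, decided_by, custodians, start, end,
                    keywords, rationale):
    """Build a timestamped record of a scoping decision, sealed with a hash."""
    record = {
        "matter_id": matter_id,
        "decided_by": decided_by,
        "decided_at": datetime.now(timezone.utc).isoformat(),
        "custodians": sorted(custodians),
        "date_range": [start, end],  # ISO date strings
        "keywords": sorted(keywords),
        "rationale": rationale,
    }
    record["sha256"] = _digest(record)
    return record


def verify(record):
    """Return True if the record still matches the hash it was sealed with."""
    body = {k: v for k, v in record.items() if k != "sha256"}
    return _digest(body) == record["sha256"]


# Placeholder values, not taken from a real matter.
rec = decision_record(
    matter_id="HR-EXAMPLE-001",
    decided_by="counsel@example.com",
    custodians={"a.lee@example.com", "b.khan@example.com"},
    start="2024-01-01",
    end="2024-06-30",
    keywords={"complaint", "reassignment"},
    rationale="Custodians limited to the four named participants.",
)
```

A hash stored inside the record only detects accidental or careless edits. A stronger audit trail would keep the hash in a separate, append-only system.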
How are collaboration apps handled in eDiscovery collection?
Collaboration platforms like Slack, Microsoft Teams, and Google Chat present unique challenges because their data structures differ significantly from traditional email. Purpose-built data collection software that integrates directly with these platforms enables keyword, custodian, and date filtering at the source, before data is exported or processed.
What is the difference between data preservation and data collection in eDiscovery?
Data preservation refers to the legal obligation to protect potentially relevant information from deletion or modification once litigation or investigation is reasonably anticipated. Data collection is the subsequent process of actually gathering that data for processing and review. Preservation is a legal duty; collection is an operational decision about what, specifically, to move forward in the workflow.
Ready to implement targeted eDiscovery collection? Explore how Onna supports defensible, proportionate data collection for investigations across email, cloud storage, and collaboration platforms. Contact the Onna team to learn more.