Some companies are finding that they are the corporate equivalent of hoarders, needlessly holding on to piles and piles of documents and files that can run up e-Discovery costs and make finding important information more difficult.

For example, a large company with a heavy legal caseload recently wanted to assess costs related to legal holds. When it analyzed a random pool of nine cases in a given year, it found that 47 million pages of documents had been retained for the litigation at a price tag of $23 million in discovery costs. When it took a closer look at the contents of the documents, it found that the majority of them didn't need to be kept at all.   

Scenarios like these are common. In an era of Big Data, most companies store “ridiculous amounts of garbage” due to poor information governance habits, says Barclay Blair, president and founder of information governance consulting services firm ViaLumina. “This is an endemic problem that drives huge unnecessary costs and risks for organizations.”

Fear of running afoul of litigation holds and a wide variety of data preservation laws keeps companies from deleting data that is often redundant, obsolete, or holds no business value. They also hold on to it because they don't know what it is. Most electronic data, such as e-mails, text files, customer correspondence, and other “dark data,” is unstructured and not well labeled or indexed, meaning its content is inherently unknown.

“The biggest challenge organizations face is that no attorney is going to sign off on deleting content if they don't know what information it contains,” says Stephen Stewart, chief technology officer of information management firm Nuix.

Employees, too, have their own hoarding habits, stashing files away and holding on to multiple versions of the same record, while others delete everything but a minimal amount. “It's just not being done in an organized way. It's being done in an ad hoc way,” says Anthony Diana, a partner with law firm Mayer Brown.

Traditionally, companies would often delete data following the completion of a lawsuit, or some other big event that would trigger such a decision, says Blair.  “I'm just now starting to see enormous energy around doing this proactively,” he says.

Adding to the challenges is that the exceptions to deleting data, and how long it must be retained, vary widely by industry. While many rules and regulations govern what information companies must keep, nothing says that companies have to keep everything, stresses Stewart. Under the Federal Rules of Civil Procedure, however, “all of that is trumped when you ‘reasonably anticipate’ litigation,” he warns.

Thus, the liberty to delete data also depends on each company's litigation profile—how many legal holds it has in place and how expansive those holds are, says Diana.

This is where a “defensible deletion” strategy comes into play. Defensible deletion is the process within an overall information governance strategy of making values-based decisions about data deletion by segregating what data to keep from what to delete in a way that doesn't run afoul of legal or regulatory data retention requirements.

Getting to Defensible Deletion

It's not as simple, however, as just adopting a policy with a given hold period for certain types of information. “It's a major investment both in time and resources to really get an information governance strategy in place such that you can start doing defensible deletion,” says Diana.

The process requires, in part, that the various business units—IT, legal, compliance, and HR—collaborate with one another to locate the data and decide what data needs to be retained and for what length of time. People on the ground doing records management must also be trained.

The first step toward implementing an effective defensible deletion strategy is to identify what data exists so as to gain transparency into that information. Where is it stored? Who has control over it?

The overall objective is to come up with a solution for migrating a company's unstructured data into a structured database, such that all valuable data is indexed and made searchable for the purpose of complying with any data retention requirements and legal holds.

“An effective legal hold process is the foundation for defensible deletion and gives in-house legal and outside counsel confidence that potentially relevant electronically stored information has been preserved. The standard is a reasonable effort rather than perfection,” says Barry Murphy, co-founder and principal analyst for analyst services and consulting firm eDJ Group.

“Once information is searchable, you can analyze it and present information to the decision makers about how to move forward,” says Stewart. This means getting to a point where you know that all the data that should be preserved is preserved, he says.

Start Small

One of the biggest mistakes companies make when starting a defensible deletion strategy is that they try to take a “boil the ocean” approach, says Diana. They begin with a records management process, for example, “and then they freeze, because they suddenly realize how challenging the process is.”

Diana recommends that companies start small. Once a company begins the process of defensible deletion within one data repository—such as desktops, network servers, e-mail servers, or legacy backup tapes—it's then much easier to expand into other areas, he says.

A good starting point may be e-mail systems. “The biggest target of litigation holds is e-mail,” says Bill Tolson, a consultant with information management software firm Recommind.


Below is an explanation of predictive coding from Recommind.

Predictive Coding is a court-endorsed process that combines people, technology, and workflow to find key documents quickly, irrespective of keyword. Due to its massive accuracy and efficiency gains, Predictive Coding is revolutionizing how Early Case Assessment (ECA), analysis, and document review are done. Predictive Coding has three components:

Recommind developed Predictive Coding in partnership with some of the world's leading enterprises and law firms. Recommind customers have been using Predictive Coding for the past 5 years.

What does Predictive Coding require?

Predictive Coding requires ALL of the following:

Input from a case expert

Keyword-agnostic analytics to find key documents and create seed sets

Proven workflow to deliver statistically certain results

Iterative machine-learning to “find like this” based on meaning not keywords

Integrated sampling for certainty and unparalleled defensibility

What does Predictive Coding not do?

Predictive Coding does not replace human review. It optimizes it. The solution takes all the documents related to an issue, ranks and tags them so that a human reviewer can look over the documents to confirm relevance. The beauty of this technology is that attorneys can use human decisions to teach the computer, making the relevancy suggestions more accurate over time.

Source: Recommind.

Data infrastructures need to be built in a way that any instructions to delete data are done in a centralized fashion. “With most infrastructures today, that centralized control does not exist,” says Tolson.

Part of the challenge is that older technologies are not equipped to handle massive volumes of digital information in an era of Big Data. “A lot of systems were built around capturing and securing data, but not deleting it,” says Tolson.

Newer technologies are solving this problem by arming in-house counsel with the insight they need to make informed decisions. Nuix, for example, can index massive quantities of unstructured data, distinguishing redundant or obsolete data from valuable or sensitive data, and storing or deleting it accordingly.
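To make the idea of separating redundant or obsolete data from data worth keeping more concrete, here is a minimal Python sketch. It is not how Nuix or any other vendor's product works. It assumes, purely for illustration, that "redundant" means a byte-for-byte duplicate of a newer file and that "obsolete" means untouched for longer than a cutoff. The `Document` type, the `classify` function, and the seven-year default are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    path: str
    content: bytes
    last_modified: date


def classify(documents, today, obsolete_after_days=7 * 365):
    """Label each document 'keep', 'redundant', or 'obsolete'.

    Assumptions (illustrative only): a document is redundant if a newer
    document has identical content, and obsolete if it is the newest copy
    of its content but was last modified before the cutoff date.
    """
    cutoff = today - timedelta(days=obsolete_after_days)
    seen_digests = set()
    labels = {}
    # Walk newest-first so the most recent copy of any content is the one kept.
    for doc in sorted(documents, key=lambda d: d.last_modified, reverse=True):
        digest = hashlib.sha256(doc.content).hexdigest()
        if digest in seen_digests:
            labels[doc.path] = "redundant"
            continue
        seen_digests.add(digest)
        labels[doc.path] = "obsolete" if doc.last_modified < cutoff else "keep"
    return labels


docs = [
    Document("a.txt", b"contract", date(2024, 1, 1)),
    Document("b.txt", b"contract", date(2023, 1, 1)),
    Document("c.txt", b"memo", date(2010, 1, 1)),
]
labels = classify(docs, today=date(2025, 1, 1))
for path, label in sorted(labels.items()):
    print(path, label)
```

In a sketch like this, the labels would only be candidates to put in front of counsel, not instructions to delete anything automatically.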

Other vendors like Recommind offer “predictive coding” capabilities that use algorithms to identify certain concepts in documents to determine what information could be subject to litigation holds or data preservation requirements, effectively weeding out unnecessary data.

The sorts of deletion policies that often get companies into trouble are the ones that are not rigorously followed. “The defensibility aspect of deletion is the rigor to which you execute that process,” says Stewart.

The worst thing a company can do is delete data at the wrong time, such as in anticipation of a lawsuit. On the other hand, if a company adheres to a data deletion policy on a consistent basis, a regulator or a court is going to be hard-pressed to call that policy into question, says Stewart.
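A consistently applied policy of this kind can be pictured as a single rule run the same way every time, with legal holds always overriding the retention schedule. The Python sketch below is one hypothetical way to express that. The retention periods, the `Record` fields, and the function names are assumptions made for illustration, not legal guidance or any vendor's implementation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention schedule in days; real periods vary by industry.
RETENTION_DAYS = {"email": 3 * 365, "contract": 7 * 365}


@dataclass
class Record:
    record_id: str
    record_type: str
    created: date
    custodian: str


def may_delete(record, today, custodians_on_hold):
    """Return True only if the record is past retention and not on hold."""
    # A legal hold always overrides the retention schedule.
    if record.custodian in custodians_on_hold:
        return False
    retention = RETENTION_DAYS.get(record.record_type)
    # Unknown record types are kept: nothing is deleted unclassified.
    if retention is None:
        return False
    return today > record.created + timedelta(days=retention)


def run_deletion_pass(records, today, custodians_on_hold):
    """Apply the same rule to every record and log each decision."""
    log = []
    for record in records:
        allowed = may_delete(record, today, custodians_on_hold)
        decision = "delete" if allowed else "retain"
        log.append((today.isoformat(), record.record_id, decision))
    return log


records = [
    Record("r1", "email", date(2015, 1, 1), "alice"),
    Record("r2", "email", date(2015, 1, 1), "bob"),
    Record("r3", "email", date(2024, 6, 1), "alice"),
    Record("r4", "voicemail", date(2000, 1, 1), "alice"),
]
log = run_deletion_pass(records, date(2025, 1, 1), custodians_on_hold={"bob"})
decisions = {record_id: decision for _, record_id, decision in log}
print(decisions)
```

Logging every decision, including the decision to retain, is a design choice made here so that the record shows the same rule being applied each time the pass runs.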

“It's clear that the regulatory environment around information governance continues to get more and more complex,” says Blair. Thus, it is increasingly essential for companies to understand the information they maintain, he says, in order to make informed decisions not only about what information to keep, but also about how to generate value from it from a Big Data perspective.