One of the more promising applications of Big Data is in internal investigations, where it can yield better insights and more accurately identify the employees who might be involved in a matter.

A typical investigation can pull thousands and even millions of documents into its nets, a nearly impossible volume for investigators to manage, particularly considering data privacy restrictions that can apply in the healthcare industry or investigations that involve units in the European Union.

A new breed of tools is emerging that combine multiple, disparate data sources (cross-referencing audio files with written documents, for example) and help ensure that an internal investigation is compliant as well as manageable in scope.

During an internal investigation, sensitive data must also move among several organizations, such as the company, its outside counsel, an audit firm that may be hired to provide forensic accounting, a translation firm if the investigation is multinational, and others. Necessarily, investigation data moves from the investigating company's internal servers into the possession of those external investigation partners, but any handling of the data racks up storage costs and billable hours. Planning how the parties to an internal investigation will manage the data before the investigation begins gives a company an advantage as the investigation moves forward and keeps it from drowning in a sea of poorly organized data.

“First, the team needs to focus on determining the scope of the investigation,” says Matt Miller, a manager with Ernst & Young's Fraud Investigation & Dispute Services practice. Counsel also needs to advise on privacy and legal issues that affect data collection, he says. Companies will also need to consider how moving data among the parties involved in the case could affect whether it is considered “privileged,” meaning whether it can be kept private. Then there are local regulations to consider. Accounting data may cross borders, for example, which can create compliance risks. Another trouble spot is e-mail communications, since some jurisdictions, such as China and the European Union, have laws that may protect some communications, particularly those containing personally identifiable information.

“Once that's done, then you can move forward with performing targeted investigations of the data, moving it onto the analytics platform of choice, and taking a review and analysis approach of determining if something is high-risk enough to warrant further investigation,” says Vince Walden, a partner with Ernst & Young's Fraud Investigation & Dispute Services practice.

Where the Documents Live

In most internal investigations at large or multinational companies, the information is stored on a third-party hosting platform like Relativity, kCura's Web-based e-Discovery software platform. While numerous platforms exist, Relativity has captured a large segment of the market, says Jim Moore, executive vice president of Merrill Brink International, a language solutions provider. kCura claims the U.S. Department of Justice and 95 of the top 100 U.S. law firms among its more than 75,000 active users worldwide.

Typically, the audit firm, not the client company, manages the hosting platform relationship, which in turn helps to ensure country-specific data compliance. An international investigation may require more than one hosting platform: for example, one in the United States and a second in China, because data held there must be kept in country.

Crossing borders with data, however, is no longer the hindrance it once was, says Walden. “There are ways to sanitize or anonymize data, or we'll go in country and do this analytical processing on site,” he says. Walden says companies are also providing more training to compliance managers at units abroad, so they can analyze data locally without it leaving the country.

Unmanaged data can hinder the outcome of an investigation, and costs and duration can quickly get out of hand unless the documents and data are brought under control.

Data management experts advise conducting a deduplication process to remove documents that are the same or nearly the same. A company should, for example, recognize that a 10th-generation e-mail includes the complete string of the nine e-mails before it, and submit only that 10th e-mail for analysis or translation. As Moore describes it, deduplication also ensures that one document is not submitted for translation by more than one entity (for example, by two law firms, plus the auditor).
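
The deduplication idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the process any firm quoted here actually uses: it assumes each e-mail is plain text that quotes earlier messages verbatim, and the function names `normalize` and `deduplicate` are hypothetical. It drops exact copies first, then drops any message whose full text already appears inside a longer message that is being kept.

```python
from __future__ import annotations

import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: dict[str, str]) -> list[str]:
    """Return the IDs of documents worth sending on for review or translation.

    Hypothetical helper: drops exact duplicates, then drops any message whose
    full text is contained in a longer surviving message (for example, an
    earlier e-mail quoted in full inside a later reply).
    """
    # Step 1: exact duplicates, detected by hashing the normalized text.
    unique: dict[str, tuple[str, str]] = {}
    for doc_id, text in documents.items():
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        unique.setdefault(digest, (doc_id, norm))  # keep the first copy seen

    # Step 2: containment check, longest documents first.
    kept: list[tuple[str, str]] = []
    by_length = sorted(unique.values(), key=lambda item: len(item[1]), reverse=True)
    for doc_id, norm in by_length:
        if not any(norm in longer for _, longer in kept):
            kept.append((doc_id, norm))
    return sorted(doc_id for doc_id, _ in kept)


thread = {
    "msg-1": "Can we expedite the permit?",
    "msg-2": "Yes, I will call the office.\n> Can we expedite the permit?",
    "msg-2-copy": "Yes, I will call the office.\n>  Can we expedite the permit?",
    "memo-7": "Quarterly travel budget attached.",
}
print(deduplicate(thread))  # ['memo-7', 'msg-2']
```

Only the final reply and the unrelated memo survive. Real e-mail threads rarely quote earlier messages this cleanly, so production tools typically rely on fuzzier near-duplicate matching than a plain substring test.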

Another good practice is to use automated translation, which Merrill Brink calls “machine translation.” Automated translation is insufficient for a perfectly accurate English-to-Chinese translation, or vice versa, and certainly a wise company would not rely upon it alone to protect itself from an FCPA action, but for many purposes, it's enough to get the job done.

Automated translation can effectively enable a keyword search, for example. A keyword list, created case by case, may include such obvious terms as “gift” and also the names of individuals, locations, and types of business involved. Merrill Brink typically creates a list of up to 100 keywords and translates it into the native languages for a given case.

“An example we're looking at right now is that a client has currently one million pages of Japanese and English documents,” says Moore. “We go through and separate Japanese from English, then run ‘translate keywords’ for them, then run keyword searches, which reduces a million pages to maybe 400,000, a significant reduction.”
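
The triage Moore describes (split pages by language, then search each pile with a translated keyword list) might look roughly like the following Python sketch. Everything named here is an assumption made for illustration: the keyword lists are hand-written stand-ins for the output of a translation step, `detect_language` is a crude script-based check rather than a real language detector, and `flag_pages` is a hypothetical helper, not part of any product mentioned in this article.

```python
from __future__ import annotations

import re

# Hypothetical keyword lists: English terms plus illustrative Japanese
# counterparts that a translation step would normally supply.
KEYWORDS = {
    "en": ["gift", "expedite", "zoning"],
    "ja": ["贈り物", "迅速", "許可"],
}

# Hiragana, katakana, and common CJK ideographs.
JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def detect_language(text: str) -> str:
    """Crude detector: any Japanese script character marks the page as Japanese."""
    return "ja" if JAPANESE_CHARS.search(text) else "en"


def flag_pages(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each page ID that hits at least one keyword to the keywords it matched."""
    flagged: dict[str, list[str]] = {}
    for page_id, text in pages.items():
        lang = detect_language(text)
        lowered = text.lower()
        # Simple substring match, so "gift" also matches "gifts".
        hits = [kw for kw in KEYWORDS[lang] if kw.lower() in lowered]
        if hits:
            flagged[page_id] = hits
    return flagged


pages = {
    "p1": "Please send a gift to the zoning office.",
    "p2": "贈り物を送りました。",
    "p3": "Lunch menu for Friday.",
}
print(flag_pages(pages))  # {'p1': ['gift', 'zoning'], 'p2': ['贈り物']}
```

Pages with no hits, like `p3` here, drop out of the review pile, which is the volume reduction Moore is describing.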

“With automated translation you can get a gist of what the document is about,” says Moore. “To recognize what is and isn't a contract, or identify an e-mail where someone asked someone in the zoning department to expedite an approval to put a store up quicker than a competitor, or some evidence of subversion.”

Only after the documents are sorted by language, deduplicated, and flagged comes the “laying on of hands” with Merrill Brink's live translators, who work with just a subset of the original pile of documents. That upfront work can reduce costs and keep multiple outside organizations from processing the same information.

Querying and Visualizing

The same tools that enable Google searches and business intelligence dashboards are also making internal investigations far easier to manage.


Where translation is part and parcel of an internal investigation, it can be used to define the workflow and sift the data.

The Merrill Brink workflow begins with classic e-Discovery processing to identify any possibly relevant document, and it ends with human translation rather than beginning with it. Automated translation can be used to channel the documents to the right reviewers (for example, a Japanese or a U.S. team) and to drive down the volume of documents so that the human translators work with only a small stack.

Source: Merrill Brink.

Chief among the tools is Apache Hadoop, which an auditing firm like Ernst & Young uses to manage the volume, variety, and velocity of data, says Walden. “That capability has changed the game on how you can take much larger volumes of data and make sense of it all.” The Hadoop MapReduce functionality traces its origins to Google and, like Google, uses large numbers of computers and servers as nodes that map the data and turn the results into answers.
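
The map-and-reduce pattern itself is simple enough to show on a single machine. The sketch below is plain Python, not Hadoop's API, and the function names are hypothetical. It counts how often each word appears across a set of documents. In a real cluster, the map step would run in parallel on many nodes and the framework would handle the grouping step between map and reduce.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word.strip(".,;:!?\"'"), 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key, as the framework would between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were pure punctuation
            groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents: list[str]) -> dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


print(word_count(["Gift sent.", "gift received, gift logged"]))
# {'gift': 3, 'sent': 1, 'received': 1, 'logged': 1}
```

Because each document is mapped independently and each key is reduced independently, the work can be split across as many machines as are available, which is what lets the approach scale to very large document sets.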

And how are the answers presented? However the user chooses or finds most understandable. In classic accounting-driven investigations, the outputs were viewed in rows and columns. “But we now look at things in dashboards that visualize those rows and columns,” says Walden. Those dashboards present the data in bar charts, pie charts, and even geospatial maps.

Advanced visualization tools like Tableau or Spotfire can mine both structured data and unstructured text to identify anomalies, relationships, and patterns. They're also designed for lay users; functionalities like Spotfire's Guided Analytics enable a user to input rules and metrics and format the results into the visuals of their choice. The output looks less like a spreadsheet and more like a marketing infographic.

Data management experts say to avoid one-stop solutions and instead use “best-of-breed” tools for managing data in internal investigations. “I personally have not found one technology that does everything,” says Walden. “There are great text mining tools, data visualization tools, and statistical data mining tools on the market, but no one does them all.”

GRC modules like those from NAVEX Global and The Network, which plug into enterprise systems, can also serve as workflow solutions and ensure that policies and procedures are in place for managing the approvals process and certifications and for enforcing case escalation procedures. These are all vital functions, says Miller.

“There's an absolute need for what these types of systems and processes put in place. They're for compliance and controls monitoring which is a piece of the overall investigative process,” says Miller. “Where transaction testing is testing the effectiveness of the policies, workflow tools are about implementing those policies.”

But don't rely on them to find every case of wrongdoing. By nature, says Walden, GRC suites are rules-based tools, thus insufficient for discovery. “The problem with rules-based tests is that fraud is about going where rules don't exist, or circumventing them. You need more enhanced tools that incorporate data mining, visualization, and text mining working together with rules-based tests.”
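
Walden's point about rules-based tests can be illustrated with a toy Python example. All of it is hypothetical: the approval limit, the payment records, and both function names are invented for illustration, and the "data mining" side is reduced to one simple statistical comparison. The rule flags only payments at or above the limit, while the statistical check flags a payment that sits just under the limit but is far out of line with what that vendor is normally paid.

```python
from __future__ import annotations

from statistics import median

APPROVAL_LIMIT = 10_000  # hypothetical policy threshold


def rule_based_flags(payments: list[dict]) -> list[int]:
    """Rules-based test: flag any payment at or above the approval limit."""
    return [p["id"] for p in payments if p["amount"] >= APPROVAL_LIMIT]


def outlier_flags(payments: list[dict], ratio: float = 3.0) -> list[int]:
    """Statistical test: flag payments at least `ratio` times the vendor's median."""
    by_vendor: dict[str, list[float]] = {}
    for p in payments:
        by_vendor.setdefault(p["vendor"], []).append(p["amount"])
    flagged = []
    for p in payments:
        typical = median(by_vendor[p["vendor"]])
        if typical > 0 and p["amount"] >= ratio * typical:
            flagged.append(p["id"])
    return flagged


payments = [
    {"id": 1, "vendor": "A", "amount": 1_000},
    {"id": 2, "vendor": "A", "amount": 1_100},
    {"id": 3, "vendor": "A", "amount": 900},
    {"id": 4, "vendor": "A", "amount": 9_500},  # just under the limit
    {"id": 5, "vendor": "B", "amount": 12_000},
]
print(rule_based_flags(payments))  # [5]
print(outlier_flags(payments))     # [4]
```

Neither test catches both payments on its own, which is the argument for running the two kinds of test together.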

Nearly all the experts suggest putting a data management system in place before the need arises for an internal investigation. “With all these systems in place it allows you to move from a reactive standpoint to proactively manage information based on all the different media formats.”