E-Discovery and Predictive Coding

Electronic discovery (e-Discovery) is the identification, preservation, collection, processing, and review of electronically stored information (ESI) that has potential evidentiary value in litigation or an investigation.

The use of e-Discovery services has risen sharply, and e-Discovery has become an important element of the business and corporate governance landscape. The main drivers of this uptick are:

  • Increased scrutiny by regulators
  • Transparency and compliance requirements
  • The need to manage huge amounts of data while reducing the legal costs of reviewing it
  • The need to improve turnaround time in producing relevant data to requestors (regulators, internal investigators, courts, etc.)

As e-Discovery reaches new geographies, it is also hitting new technology milestones. Predictive coding is one of them: it is used across the globe, and its acceptance in courts of law is growing.

“Predictive coding” is the use of automation and machine learning techniques to quickly review a large population of documents by developing a training (sample) set of documents. Broadly speaking, predictive coding refers to the use of a software program to identify documents that are relevant to a particular case or matter. It was explored on e-discovery platforms around 2012 and popularized as a means to revolutionize the process of analyzing documents for relevance, potentially replacing the traditional manual review model.

The predictive coding process takes expert judgment on a sample set of documents and uses computer modeling to extrapolate that judgment to the remainder of the document population. In a standard workflow, a reviewer reviews a sample set of documents and marks each one as responsive or non-responsive. Based on the reviewer’s judgment, a computer algorithm builds a set of rules that can be applied to the larger set of data.
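To make the idea of extrapolating a reviewer's judgment concrete, here is a minimal sketch using a Naive Bayes text classifier. The choice of Naive Bayes, the sample documents, and the function names (tokenize, train, predict) are assumptions made for illustration only; commercial predictive coding tools may use quite different models.

```python
import math
from collections import Counter

LABELS = ("responsive", "non-responsive")

def tokenize(text):
    """Lowercase a document and split it into words, dropping edge punctuation."""
    cleaned = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in cleaned if w]

def train(seed_set):
    """Count words per label from (text, label) pairs coded by a reviewer.

    Assumes the seed set contains at least one document for each label.
    """
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set().union(*word_counts.values())
    return word_counts, doc_counts, vocabulary

def predict(model, text):
    """Return the label with the higher Naive Bayes log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical seed set coded by a human reviewer
seed_set = [
    ("Please revise the merger pricing before the board meeting", "responsive"),
    ("Attached is the draft merger agreement for review", "responsive"),
    ("Lunch menu for the cafeteria this week", "non-responsive"),
    ("Reminder: parking garage closed on Friday", "non-responsive"),
]
model = train(seed_set)

# The reviewer's judgment is extrapolated to documents nobody has read yet
unreviewed = [
    "Updated merger agreement pricing attached",
    "Cafeteria closed for lunch on Friday",
]
predictions = [predict(model, doc) for doc in unreviewed]
print(predictions)
```

In practice a seed set would contain far more than four documents; the point here is only that the labels come from a person and the model carries them over to the rest of the population.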

Predictive coding goes by many names globally, including predictive intelligence, technology-assisted review (TAR), computer-assisted review (CAR), automated document review, automated document classification, automatic categorization, predictive categorization, and predictive ranking. The terms differ, but they all refer to essentially the same approach, whose aim is to dramatically reduce review costs.

Predictive coding often works best with a large quantity of documents. Given the up-front cost involved, it may not be an ideal choice for reviewing a small set of documents.

The steps involved in predictive coding are:

  • Start with a set of data, derived or grouped in any number of ways (e.g. through keyword or concept searching)
  • Use a human-in-the-loop, iterative strategy of manually coding a seed or sample set of documents for responsiveness and/or privilege
  • Employ machine learning software to categorize similar documents in the larger set of data
  • Analyze user annotations for quality control feedback and coding consistency
  • Apply the resulting coding to the entire review set, marking all remaining documents as responsive or non-responsive
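The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not a description of any particular product: the word-weight model, the "least certain document first" selection rule, the disagreement counter used as a quality control signal, and every name in the code (keyword_cull, run_review, simulated_reviewer) are hypothetical. The simulated reviewer merely stands in for a human attorney.

```python
from collections import defaultdict

def words(text):
    return text.lower().split()

def keyword_cull(collection, keywords):
    """Step 1: keep only documents that hit at least one search keyword."""
    return [d for d in collection if any(k in words(d) for k in keywords)]

def update_weights(weights, text, responsive):
    """Learning step: nudge each word toward the reviewer's coding decision."""
    for w in set(words(text)):
        weights[w] += 1 if responsive else -1

def score(weights, text):
    """Average word weight: positive leans responsive, negative leans not."""
    ws = words(text)
    if not ws:
        return 0.0
    return sum(weights[w] for w in ws) / len(ws)

def run_review(review_set, reviewer, seed_size=2, batch_size=1, rounds=2):
    weights = defaultdict(int)
    coded = {}          # document -> True/False, as coded by the human
    disagreements = 0   # Step 4: how often the model and the reviewer differ
    # Step 2: the reviewer manually codes a seed set
    for doc in review_set[:seed_size]:
        coded[doc] = reviewer(doc)
        update_weights(weights, doc, coded[doc])
    # Steps 2-4 repeat: the human codes the documents the model is least sure of
    for _ in range(rounds):
        remaining = [d for d in review_set if d not in coded]
        if not remaining:
            break
        remaining.sort(key=lambda d: abs(score(weights, d)))
        for doc in remaining[:batch_size]:
            predicted = score(weights, doc) > 0   # Step 3: model's categorization
            coded[doc] = reviewer(doc)
            if predicted != coded[doc]:
                disagreements += 1
            update_weights(weights, doc, coded[doc])
    # Step 5: apply the model to every document that was never manually coded
    final = dict(coded)
    for doc in review_set:
        if doc not in final:
            final[doc] = score(weights, doc) > 0
    return final, disagreements

# Hypothetical collection; the reviewer treats merger documents as responsive
collection = [
    "merger pricing draft attached",
    "contract parking renewal notice",
    "board approves merger pricing",
    "parking contract notice update",
    "merger draft sent to board",
    "cafeteria lunch menu",
    "renewal notice for parking contract",
]

def simulated_reviewer(doc):
    return "merger" in words(doc)

review_set = keyword_cull(collection, ["merger", "contract"])
final, disagreements = run_review(review_set, simulated_reviewer)
for doc, responsive in final.items():
    print("responsive    " if responsive else "non-responsive", "|", doc)
```

Only four of the six documents in the review set are read by the simulated reviewer; the remaining two are coded by the model, which is where the savings in review effort come from.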

Some of the common algorithms/techniques used in predictive coding include “Concept Searching”, “Contextual Searching”, “Metadata Searching”, “Probability theory”, “Relevance ranking”, “Clustering” and “Sorting documents by issue”.
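As an illustration of one item in this list, relevance ranking, the sketch below orders documents by their similarity to a query using TF-IDF weights and cosine similarity. This is one common way to rank text, chosen here as an assumption; the source does not say which ranking method any given platform uses, and the sample documents and function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(texts):
    """Turn each text into a {word: weight} vector using smoothed TF-IDF."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    # Document frequency: in how many texts does each word appear?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n) / (1 + count)) + 1 for w, count in df.items()}
    return [
        {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_relevance(query, docs):
    """Return (score, document) pairs, most relevant first."""
    vectors = tf_idf_vectors(docs + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(vectors, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical documents and query
docs = [
    "merger agreement pricing terms",
    "office parking schedule",
    "pricing update for the merger",
]
ranked = rank_by_relevance("merger pricing", docs)
for similarity, doc in ranked:
    print(f"{similarity:.3f}  {doc}")
```

Reviewers can then work from the top of the ranked list, so the documents most likely to matter are seen first.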

Advantages of Predictive Coding

  • Can dramatically reduce the number of documents requiring attorney review, which ultimately saves time and money.
  • Can minimize or eliminate the inconsistent production and privilege calls that plague large document reviews, allowing a higher level of consistency in the process.
  • Can identify more relevant documents than traditional linear attorney review, in which documents are reviewed one after another.

At the same time, predictive coding has its downsides and limitations. Most significantly, it requires considerable attention from experienced attorneys during the machine learning process. A flawed seed set or testing process will cascade those flaws throughout a production. To guard against this risk, reviewers must commit substantial time and financial resources at the start of a case.
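One way to catch a flawed seed set before its errors spread is to compare the model's calls with an attorney's calls on a separate sample of documents. The sketch below computes precision and recall for such a sample. The use of these two metrics is an assumption (the text above does not name a specific test), and the sample values are invented for illustration.

```python
def precision_recall(attorney_calls, model_calls):
    """Compare model coding with attorney coding on the same documents.

    Both arguments are lists of booleans (True = responsive), in the same order.
    """
    pairs = list(zip(attorney_calls, model_calls))
    true_pos = sum(1 for a, m in pairs if a and m)
    false_pos = sum(1 for a, m in pairs if not a and m)
    false_neg = sum(1 for a, m in pairs if a and not m)
    # Precision: of the documents the model called responsive, how many were?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the truly responsive documents, how many did the model find?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical validation sample of eight documents
attorney_calls = [True, True, True, False, False, True, False, False]
model_calls    = [True, True, False, False, True, True, False, False]

precision, recall = precision_recall(attorney_calls, model_calls)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low recall on the sample suggests the model is missing responsive documents, which is a signal to code more training documents before applying it to the full population.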

The growth of ESI and the importance of these documents to litigation make finding a manageable way to review such material an urgent need, and predictive coding is at the forefront. Predictive coding is still in its infancy; however, it is likely to become a key approach for legal teams that must review huge data sets.


Authored By 

V V Satyanarayana Jalli

TCS Enterprise Security And Risk Management
