Can ‘Predictive Coding’ Cope with ‘Big Data’?

Discovery has changed, and electronically stored information (ESI) was the facilitator. Though ediscovery matters are no longer the novel issues that they once were, technology is constantly changing. According to Baseline, an estimated 90 percent of the world's data has been created in the last two years. In 2009 there were 988 exabytes of data in existence, an amount that would stretch from the Sun to Pluto and back in paper form. For corporations, simply storing such huge amounts of data is a problem, let alone worrying about the ‘compliance monster’.

Cloud computing may ease the storage burden, yet companies are retaining more information than ever, and lawsuits sometimes require attorneys to review millions of documents. While the judiciary struggles to devise effective proportionality rules, big data keeps growing, as does the litigation industry. Manual review of every document no longer seems to be an option, and technology is rushing to meet the growing needs of document review.

The most important element, and one that is often overlooked, is that human eyeballs are still required to review such documents, because that review is what makes the case defensible; after all, isn’t that the real objective?

Definitions of “predictive coding” vary, but a common form of predictive coding includes the following steps. First, the data is uploaded onto a vendor’s servers. Next, representative samples of the electronic documents are identified. These “seed sets” can be created by counsel familiar with the issues, by the predictive coding software, or both. Counsel then review the seed sets and code each document for responsiveness or other attributes, such as privilege or confidentiality. The predictive coding system analyzes this input and creates a new “training set” reflecting the system’s determinations of responsiveness. Counsel then “train” the computer by evaluating where their decisions differ from the computer’s and then making appropriate adjustments regarding how the computer will analyze future documents.

This process is repeated until the system’s output is deemed reliable. Reliability is determined by statistical methods that measure recall, the percentage of responsive documents in the entire data set that the computer has located, and precision, the percentage of documents within the computer’s output set that are actually responsive. (That is, “recall” tests the extent to which the predictive coding system misses responsive documents, while “precision” tests the extent to which the system is mixing irrelevant documents in with the production set.) The resulting output can be either produced as is or further refined by subsequent human review. Either way, attorneys review a much smaller set of documents. Predictive coding therefore effectively “alleviates the need to review whole masses of records in order to find the relevant few.” Most importantly, predictive coding is estimated to reduce ediscovery costs by as much as 40% to 60% while maintaining search quality.
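To make the two measures concrete, here is a minimal Python sketch of how recall and precision could be computed from two sets of document IDs. The function name, the variable names, and the sample IDs are hypothetical and chosen only for illustration; they do not come from any particular predictive coding product, and in practice the set of truly responsive documents is not known in full and is usually estimated from a human-reviewed sample.

```python
def recall_and_precision(truly_responsive, flagged_by_system):
    """Return (recall, precision) for two collections of document IDs.

    truly_responsive  -- IDs that human reviewers coded as responsive
    flagged_by_system -- IDs that the system placed in its output set
    """
    truly_responsive = set(truly_responsive)
    flagged_by_system = set(flagged_by_system)

    # Documents that are both responsive and flagged by the system.
    hits = truly_responsive & flagged_by_system

    # Recall: share of all responsive documents that the system found.
    recall = len(hits) / len(truly_responsive) if truly_responsive else 0.0
    # Precision: share of the system's output that is actually responsive.
    precision = len(hits) / len(flagged_by_system) if flagged_by_system else 0.0
    return recall, precision


# Hypothetical sample: 8 responsive documents, 10 documents flagged.
responsive = {1, 2, 3, 4, 5, 6, 7, 8}
flagged = {1, 2, 3, 4, 5, 6, 20, 21, 22, 23}

recall, precision = recall_and_precision(responsive, flagged)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

In this made-up sample the system finds 6 of the 8 responsive documents (recall of 0.75) and 6 of its 10 flagged documents are responsive (precision of 0.60), so it misses two responsive documents and mixes four irrelevant ones into the output set.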

A statistic quoted in an IDC and EMC report says that the digital universe is doubling every two years and will reach 40,000 exabytes (40 trillion gigabytes) by 2020. The question is: can predictive coding cope with big data?
