As I mentioned previously, it’s important to test new software to make sure it actually solves the problems in your workflow. No matter where your work falls in the EDRM, some of what you do will seem so old hat it’s not worth thinking about, while other sections of the model may be new to you. One area that burst into view a few years back, and has stayed in the news since, is Technology Assisted Review (TAR). Today I’m going to talk about some aspects of TAR that need clearing up.
I remember when I first heard about Equivio’s Relevance product, back in the summer of 2009. I had been doing product research around the new analytics features that had been getting some press, like Attenex concept clustering with its exotic UI, and the ongoing discussion around what exactly concept search should be. Those were heady times. Just before ILTA in 2009, I was in a meeting with Warwick Sharp from Equivio. He described Relevance as the ability to arrange all the documents for a review in order of how likely they were to be relevant. That was the first I had heard of what is now called Technology Assisted Review (TAR).
Unfortunately it wasn’t until early 2012 that I had an opportunity to test out Relevance, and over the next couple of years I also tested Relativity Assisted Review and Xerox Litigation Services’ CategoriX. All of these are examples of what is now called TAR 1.0: they follow the same basic predictive coding workflow, which happens to be the first widely used workflow for TAR in eDiscovery. When TAR is referenced in eDiscovery, software in this TAR 1.0 model is usually what is meant.
TAR 1.0 software. Essentially, TAR 1.0 software uses supervised learning algorithms, which means the user needs to tell the software which documents are good examples and which are bad examples of the category it should learn. For eDiscovery this usually means relevant documents, or in some cases privileged ones. The TAR software takes these example documents, builds a model of what a relevant document looks like, and tries to match all of the rest of the documents against that model. It then gives each document a score (usually between 0 and 100) for how closely it matches, which lets the documents be ranked in order.
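The tag-then-score-then-rank idea above can be sketched in a few lines of plain Python. This is a toy illustration only: the documents, tags, and the naive-Bayes-style word weighting are all invented for the example, and real TAR engines use their own far more sophisticated proprietary classifiers.

```python
# Toy sketch of the TAR 1.0 workflow: learn word weights from
# reviewer-tagged examples, then score unreviewed documents 0-100
# and rank them. Everything here is illustrative, not any vendor's
# actual algorithm.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(tagged_docs):
    """tagged_docs: list of (text, is_relevant) pairs."""
    rel, irr = Counter(), Counter()
    for text, is_relevant in tagged_docs:
        (rel if is_relevant else irr).update(tokenize(text))
    vocab = set(rel) | set(irr)
    n_rel = sum(rel.values()) + len(vocab)
    n_irr = sum(irr.values()) + len(vocab)
    # Per-word log-odds of relevance, with add-one smoothing
    return {w: math.log((rel[w] + 1) / n_rel) - math.log((irr[w] + 1) / n_irr)
            for w in vocab}

def score(weights, text):
    """0-100 score: logistic squash of the summed word log-odds."""
    total = sum(weights.get(w, 0.0) for w in tokenize(text))
    return 100 / (1 + math.exp(-total))

# Reviewer-tagged training examples (made up)
training = [
    ("merger agreement draft terms", True),
    ("merger price negotiation schedule", True),
    ("office holiday party planning", False),
    ("cafeteria menu for next week", False),
]
weights = train(training)

# Score and rank two unreviewed documents
new_docs = ["notes on the merger terms", "party menu suggestions"]
ranked = sorted(new_docs, key=lambda d: score(weights, d), reverse=True)
print([(d, round(score(weights, d))) for d in ranked])
```

The document sharing words with the positive examples ("merger", "terms") lands at the top of the ranked list; the one sharing words with the negative examples falls to the bottom. That ranking, scaled up to millions of documents, is the whole point of the TAR 1.0 model.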
What this means is that TAR 1.0 software learns one definition of relevance at a time. Some products, like Equivio’s Relevance, let you train multiple issues on the same documents, but you need to tag each document once per issue. Others will only learn one issue at a time. So if you are hoping TAR 1.0 software will find a smoking gun that’s unlike all the other documents, maybe one that uses rare code phrases, you’re out of luck. If such a document exists but differs from the documents you’ve been tagging as positive examples, it won’t turn up at the top of the list. That’s not how this software works.
Machine Learning. The actual underlying algorithm varies by vendor. Each uses its own machine learning classification algorithm, usually starting from a standard algorithm and then customizing it for improved results. One thing they have in common is that they break the contents of the file down into words (whatever a “word” is defined to be) and track, among other things, how frequently each word occurs in the file. This information feeds the decision about what makes a document a good or bad example, and from it the TAR software builds a classification model.
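That word-frequency step is easy to see in miniature. A minimal sketch, using a simple letters-only definition of a “word” (real TAR engines use more elaborate tokenization and weighting schemes):

```python
# Break a document's text into "words" and count how often each
# occurs. Here a "word" is simply a run of letters, lowercased;
# the sample sentence is made up for illustration.
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, pull out runs of letters, count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

doc = "The merger terms changed; see the revised merger agreement."
counts = word_counts(doc)
print(counts.most_common(3))
```

Note how much rides on the definition of a “word”: this version silently drops numbers, hyphenated terms, and punctuation, and a different tokenizer would produce different counts, and therefore a different model.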
Keep in mind that these machine learning algorithms are used for classifying everything from spam to fraud detection to identifying cat photos on the internet. They use real math – probability and statistics and calculus – to solve real problems. Find out what you can about how your TAR software works, and keep that in mind when preparing to use it for review.
Test and Training sets. Whichever classification method TAR uses, there is still the question of how to tell whether the resulting classification model gives accurate results. A standard way for validating the model built by a machine learning algorithm involves using test and training sets.
The training sets usually start out as random documents; if additional training sets are used, they are tailored as learning progresses, in order to clarify special cases or speed up the overall learning. The user tags every document in the test set and the training sets as either a good example (e.g. relevant) or a bad example (e.g. not relevant).
The TAR software uses the documents and their tags from the training set to build its classification model, which encodes its definition of the type of document to focus on. TAR will use this model to tag new documents based on how closely each one matches this internal definition, and will assign each new document a number between 0 and 100 that reflects how closely the document matches the model. In order to see how well the model will work, the TAR software applies the model to the test set.
The test set (also sometimes called a control set) is created by taking a random sample of the total document set. The random part is important, as this is where math comes in. Each document has the same chance of being chosen as any other, which means the random sample is characteristic of the document set as a whole. Since the test set has already been tagged by the users, the software knows what the right tags should be, so it can give a good idea of how well its new classification model will do at tagging the rest of the document set.
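The validation arithmetic itself is straightforward. A toy sketch, where the reviewer tags, model scores, and score cutoff are all invented for illustration (real TAR software reports its own quality metrics, typically including precision and recall):

```python
# Validate a model against a tagged test (control) set.
# Each pair is (reviewer_tag, model_score): tag 1 = relevant, 0 = not;
# score is the model's 0-100 relevance score. Values are made up.
test_set = [
    (1, 92), (1, 85), (0, 77), (1, 40), (0, 30),
    (0, 22), (1, 64), (0, 12), (0, 55), (0, 8),
]

cutoff = 50  # score at or above which we call a document "relevant"

tp = sum(1 for tag, s in test_set if tag == 1 and s >= cutoff)  # true positives
fp = sum(1 for tag, s in test_set if tag == 0 and s >= cutoff)  # false positives
fn = sum(1 for tag, s in test_set if tag == 1 and s < cutoff)   # missed relevant docs

precision = tp / (tp + fp)  # of docs the model calls relevant, how many really are
recall = tp / (tp + fn)     # of truly relevant docs, how many the model found
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the test set was drawn at random, those precision and recall numbers are a reasonable estimate of how the model will perform on the millions of documents it hasn’t seen.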
What else? There are many more questions to consider, such as the richness level of the document set, and what the confidence interval and margin of error will be, but what I’ve described are a couple of the basic machine learning steps of TAR 1.0. These TAR engines provide a way to tag documents based on how they interpret the contents of the documents in the training sets. They can save you weeks of time and thousands of dollars, but they aren’t magic. Like any other computer program, TAR software does what the user tells it to do, not necessarily what the user wants it to do. You need to understand what’s happening well enough to integrate TAR into your regular workflow to create a successful review.
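As one concrete example of where the margin of error comes in: the textbook sample-size formula for estimating a proportion, n = z² · p(1−p) / E², tells you roughly how big a randomly drawn test set needs to be for a given confidence level. This is just the standard statistics calculation under a simple-random-sampling assumption, not any particular product’s method:

```python
# Back-of-the-envelope test-set sizing: n = z^2 * p(1-p) / E^2,
# where z is the z-score for the confidence level, p the assumed
# proportion of relevant documents (0.5 is the conservative worst
# case), and E the desired margin of error.
import math

def sample_size(z=1.96, p=0.5, margin=0.05):
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# 95% confidence (z = 1.96) with a +/-5% margin of error:
print(sample_size())  # 385 documents
```

Tightening the margin of error drives the required sample size up fast (it grows with 1/E²), which is why highly precise validation numbers come at a real cost in reviewer tagging time.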