Gray Tabby Analyzes

Let’s Stop Using the Meaningless Terms TAR 1.0 and TAR 2.0

Long ago I knew a very good artist who had been hired to create the art for a video game. One day he came to work and was surprised that his software worked differently than it had the day before. “Oh,” said one of the engineers, “your software was upgraded. Now you’re running version 3.2.” The artist had no idea what the engineer was talking about, because he hadn’t realized that the numbers after the software’s name meant anything. He thought they were like the letters and numbers for cars, so that the numbers after ArtPro 3.2 were like the designation after Lexus RX 350. For attorneys new to Technology Assisted Review (TAR), terms like TAR 1.0 and TAR 2.0 can be just as confusing. Unlike version numbers for a single software program, these names refer to multiple different software programs, all of which do the same thing: use machine learning techniques to help attorneys determine which documents in a review are relevant.

So what is TAR 1.0? Lumped together under this name are most predictive coding programs, including the earliest ones. These tools, in general, first create a random sample of documents, called a control set, that humans tag as relevant or non-relevant. Then the software creates smaller training sets of documents that humans also tag for relevance. After each training round, the software builds a model of what it thinks a relevant document is, then tests that model against the human-tagged control set to see how many of the relevant documents in the control set it tagged correctly. These training rounds continue until the best possible computer model has been created. That’s roughly the routine these programs follow, with some variations — some use human-chosen documents, called “seed sets”, in their training rounds; how they decide that training is complete varies; each has different tools for validating success — but that is how they work in general.
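For the technically curious, here is a minimal sketch in Python of that control-set/training-round loop. It uses scikit-learn, and the function names, batch sizes, and stopping rule are my own placeholders, not how Relevance, Assisted Review, or any other product actually implements the workflow:

```python
# A minimal sketch of the control-set / training-round loop described above.
# Assumes scikit-learn; the sizes, classifier, and stopping rule are illustrative only.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def predictive_coding(documents, human_tag, rounds=10, control_size=500, batch_size=50):
    """documents: list of raw text; human_tag(text) returns 1 (relevant) or 0 (not)."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)

    # 1. Draw a random control set and have humans tag it up front.
    control_ids = random.sample(range(len(documents)), control_size)
    control_labels = [human_tag(documents[i]) for i in control_ids]

    # 2. Run training rounds until the model stops improving on the control set.
    model = LogisticRegression(max_iter=1000)
    control = set(control_ids)
    remaining = [i for i in range(len(documents)) if i not in control]
    trained_ids, trained_labels, best_score = [], [], 0.0
    for _ in range(rounds):
        batch = random.sample(remaining, min(batch_size, len(remaining)))
        picked = set(batch)
        remaining = [i for i in remaining if i not in picked]
        trained_ids += batch
        trained_labels += [human_tag(documents[i]) for i in batch]
        if len(set(trained_labels)) < 2:
            continue  # need at least one relevant and one non-relevant example

        model.fit(X[trained_ids], trained_labels)
        # Test the model against the human-tagged control set.
        score = f1_score(control_labels, model.predict(X[control_ids]))
        if score <= best_score:
            break     # no improvement this round: stop training
        best_score = score
    return model, vectorizer
```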

So what term should we use instead of TAR 1.0? I’m not an attorney; I’m a product manager with years of software development in my background. I’ve tested several of these programs in depth, including Equivio’s Relevance and Xerox’s Omnix. I kept trying to come up with something pithy, but “control/training set” was the best I could do. Then I realized that Predictive Coding, which is how I first described these tools, is a good name for them.

What is TAR 2.0, then? TAR 2.0 refers to a program that uses Continuous Active Learning (CAL). CAL differs from Predictive Coding in that, first, there are no control or training sets. The human reviewer can tag as few as one document as relevant, and CAL can use that document to build its first model and tag the remaining documents. Humans review a set of the newly CAL-tagged documents, and if they disagree with the tags, CAL rebuilds its model and retags the documents. CAL also feeds human reviewers documents that may be “edge cases”, to help improve its understanding of what a relevant document is.
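Again as a rough sketch only: the loop below shows the CAL idea of starting from a single relevant document, ranking everything else, and folding human feedback (including “edge cases”) back into the model each round. The seeding, batch sizes, and edge-case selection are my own illustrative assumptions, not how Insight or any other product actually works:

```python
# A minimal CAL-style loop, assuming scikit-learn. Parameters are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(documents, seed_relevant_id, human_tag, batch_size=20, rounds=50):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)

    reviewed = {seed_relevant_id: 1}          # start from a single relevant document
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        unreviewed = [i for i in range(len(documents)) if i not in reviewed]
        if not unreviewed:
            break
        if len(set(reviewed.values())) < 2:
            # until a non-relevant example exists, just ask humans about more documents
            for i in unreviewed[:batch_size]:
                reviewed[i] = human_tag(documents[i])
            continue

        ids = list(reviewed)
        model.fit(X[ids], [reviewed[i] for i in ids])
        scores = model.predict_proba(X[unreviewed])[:, 1]

        # Feed humans the highest-ranked documents plus a few "edge cases"
        # (scores near 0.5) so their tags keep refining the model.
        top = [unreviewed[i] for i in np.argsort(-scores)[:batch_size]]
        edge = [unreviewed[i] for i in np.argsort(np.abs(scores - 0.5))[:5]]
        for i in set(top + edge):
            reviewed[i] = human_tag(documents[i])   # model is rebuilt next round
    return reviewed
```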

As far as I know there are only two CAL software products out there, and John Tredennick of Catalyst was kind enough not only to give me a demo of their CAL product, Insight, but also to set up an account for me to test it. I was impressed with the demo, not only with the results shown but in particular with how quickly Insight built the model. Unfortunately my schedule meant I couldn’t actually test Insight, so I can’t speak with any authority on the results, but Insight does solve the dreariness of having to tag 500 documents before any model is built.

So I recommend we refer to these TAR products as either Predictive Coding or Continuous Active Learning. These aren’t perfect terms: Equivio’s Relevance (a Predictive Coding program), for example, would add documents it considered edge cases, both relevant and non-relevant, to its training rounds in order to improve its algorithm, much as CAL feeds edge cases to human reviewers, so the distinction blurs. The best solution is to know what you want to accomplish with your review, and to make sure you understand the software you’re going to use.

Oh, and the artist? The video game flopped (though it won an award for its beautiful art), and he went on to have a distinguished career as a US Marshal.

How to Know When You’re Right in a TAR-eat-TAR World

The classification functions in ediscovery’s Technology Assisted Review (TAR) tools, which rank documents by how closely they match the relevant documents used for training, are usually judged by their recall and precision. Dr. Herbert L. Roitblat has a great discussion on this topic here, which I agree with wholeheartedly. One key point that he skips over, though, is how to actually verify whether the recall percentage is correct. This in turn brings up the perennial question: What can we use to validate the results of a TAR review? Which leads us to: What can we use to compare the results between different TAR tools and workflows?
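For reference, recall and precision are simple arithmetic once you have a set of “true” tags to compare against, which is exactly the gold standard discussed below. A tiny illustration (the function and the numbers are hypothetical):

```python
# Recall and precision computed against a hypothetical gold-standard tag set.
def recall_precision(predicted_relevant, truly_relevant):
    """Both arguments are sets of document IDs."""
    true_positives = len(predicted_relevant & truly_relevant)
    recall = true_positives / len(truly_relevant)         # share of relevant documents found
    precision = true_positives / len(predicted_relevant)  # share of retrieved documents that are relevant
    return recall, precision

# e.g. a production of 200 documents that captures 80 of the 100 truly relevant
# documents has recall 0.80 and precision 0.40.
```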

Ideally, we would have available at least one document set for a civil case, containing the complete productions of both parties, in which each document carries a tag marking it as relevant or not relevant for the matter at hand. This tagged document set would be used to compare the TAR software’s predictions against the actual tags. With this information, ediscovery professionals could build the best workflow for their needs, choosing the best TAR software to use as part of that workflow. So why don’t we have this?

The nature of ediscovery document sets is that they are private. Each document set contains the relevant documents in the case, which neither party likely wants to have public. Each set also contains documents which aren’t relevant, but which also likely contain business or personal secrets. When Enron executives Jeffrey Skilling and Kenneth Lay were found guilty of conspiracy, fraud, and insider trading in 2006, the documents of the case were made public. The EDRM project made these documents available online, and Nuix provided a cleansed version of the set. So why isn’t a tagged version of Enron ready for use?

When I tested TAR software, I spent time tagging documents. If you’ve done the same, you know that reading each document and determining its relevance can be slow, painstaking work. Even if you’re quickly tagging obviously not relevant documents — Amazon receipts, blank documents, random characters — the process is dull. Very, very dull. To tag all 18 gigabytes of Enron email, you would need to find enough qualified people willing to do the work. You will also likely need to pay them. Possibly a lot.

Since, unfortunately, the tags from the Enron case didn’t come along with the documents themselves, any new tagging initiative will need to take all of this into consideration. The issues are considerable, but not insurmountable. Even knowing all of this, I still wanted to find a set of tagged documents that could be used as a gold standard for comparing the results of TAR review. In my next post, I’ll tell you what my quest turned up.

Under the Hood: Driving the TAR 1.0 Car

As I mentioned previously, it’s very important to test new software to make sure the software works to solve your problems with your workflow. No matter where your work falls in the EDRM, some of what you do will seem so old hat it’s not worth thinking about, and other sections in the model may be new to you. One area that burst into view a few years back, and has stayed in the news since then, is Technology Assisted Review (TAR). Today I’m going to talk about some aspects of TAR that need some clearing up.

I remember when I first heard about Equivio’s Relevance product, back in the summer of 2009. I had been doing product research around the new analytics features that had been getting some press, like Attenex concept clustering with its exotic UI, and the ongoing discussion around what exactly concept search should be. Those were heady times. Just before ILTA in 2009, I was in a meeting with Warwick Sharp from Equivio. He described Relevance as the ability to arrange all the documents for a review in order of how likely they were to be relevant. That was the first I had heard of what is now called Technology Assisted Review (TAR).

Unfortunately it wasn’t until early 2012 that I had an opportunity to test out Relevance, and over the next couple of years I also tested Relativity Assisted Review and Xerox Litigation Services’ CategoriX. All of these are what is now referred to as TAR 1.0, meaning they all follow the same basic predictive coding workflow, which happens to be the first widely seen and used workflow for TAR in eDiscovery. When TAR is referenced in eDiscovery, software in this TAR 1.0 model is usually what is meant.

TAR 1.0 software. Essentially, TAR 1.0 software uses supervised learning algorithms, which means the user needs to tell the software which documents are good examples and which are bad examples of what the software should be classifying. For eDiscovery this means relevant or, in some cases, privileged documents. The TAR software takes these example documents, creates a model of what a relevant document looks like, and tries to match all of the rest of the documents to this model. The software then gives each document a score (usually between 0 and 100) for how closely it matches, which lets the documents be ranked in order.
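As a rough illustration of that train-score-rank step, here is a short Python sketch using scikit-learn; the vectorizer, classifier, and 0-100 scaling are stand-ins for whatever a given vendor actually uses:

```python
# Sketch of the supervised step: train on human-tagged examples, then give
# every other document a 0-100 score and rank. Assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_by_relevance(tagged_docs, tags, untagged_docs):
    """tags: 1 = relevant, 0 = not relevant; must include at least one of each."""
    vectorizer = TfidfVectorizer()
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(tagged_docs), tags)

    # Probability of relevance, rescaled to the 0-100 scores described above.
    scores = model.predict_proba(vectorizer.transform(untagged_docs))[:, 1] * 100
    return sorted(zip(untagged_docs, scores), key=lambda pair: -pair[1])
```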

What this means is that TAR 1.0 software learns one definition of relevant at a time. Some, like Equivio’s Relevance, let you train multiple issues using the same documents, but you need to tag each document multiple times. Others will only learn one issue at a time. So if you are interested in having TAR 1.0 software find a smoking gun that’s different from all the other documents, maybe one that uses rare code phrases, you’re out of luck. If such a document exists but is different from the documents you’ve been tagging as positive examples, it won’t turn up at the top of the list. That’s not how this software works.

Machine Learning. The underlying algorithm used by TAR software varies by vendor. Each typically starts with a standard machine learning classification algorithm and then customizes its own version for improved results. One thing they have in common is that they break the contents of the file down into words (whatever a “word” is defined to be), and track, among other things, how frequently the words occur in the file. This information is used to decide what makes a document a good or bad example, and from this the TAR software builds a classification model.
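Stripped to its bare bones, that word-counting step looks something like the snippet below; real products tokenize and weight terms far more carefully than a simple split on whitespace:

```python
# A bare-bones version of the "break the file into words and count them" step.
from collections import Counter

def word_frequencies(text):
    # Whatever a "word" is defined to be -- here, lowercased whitespace-separated tokens.
    return Counter(text.lower().split())

print(word_frequencies("the quick brown fox jumps over the lazy dog and the cat"))
# Counter({'the': 3, 'quick': 1, 'brown': 1, ...}) becomes one row of the model's input
```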

Keep in mind that these machine learning algorithms are used for classifying everything from spam to fraud detection to identifying cat photos on the internet. They use real math – probability and statistics and calculus – to solve real problems. Find out what you can about how your TAR software works, and keep that in mind when preparing to use it for review.

Test and Training sets. Whichever classification method TAR uses, there is still the question of how to tell whether the resulting classification model gives accurate results. A standard way for validating the model built by a machine learning algorithm involves using test and training sets.

The training sets usually start out as random documents; if additional training sets are used, they are tailored as learning progresses, in order to clarify special cases or speed up the overall learning. The user tags all of the documents in the test set and the training sets as either good examples (e.g. relevant) or bad examples (e.g. not relevant).

The TAR software uses the documents and their classifications from the training set to build its classification model, which contains its definition of the type of document to focus on. TAR will use this model to tag new documents, assigning each one a score between 0 and 100 that ranks how closely it matches the model. To see how well the model will work, the TAR software applies it to the test set.

The test set (also sometimes called a control set) is created by taking a random sample of the total document set. The random part is important, as this is where the math comes in. Each document has the same chance of being chosen as any other, which means the random sample is characteristic of the document set as a whole. Since the users have already tagged the test set, the software knows what the right tags should be, so it can give a good estimate of how well its new classification model will do at tagging new documents in the whole document set.
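Here is a sketch of that validation step, again assuming scikit-learn, with the sample size as a placeholder for whatever your protocol calls for and the function names my own invention:

```python
# Sketch of test-set (control-set) validation: sample at random, compare the
# model's predictions to the reviewers' tags for the sampled documents.
import random
from sklearn.metrics import precision_score, recall_score

def validate_on_test_set(documents, human_tags, model, vectorizer, sample_size=500):
    """human_tags: the reviewers' 1/0 tags for the sampled documents, keyed by index."""
    # Random sampling: every document has the same chance of being chosen,
    # so the sample is characteristic of the collection as a whole.
    test_ids = random.sample(range(len(documents)), sample_size)
    truth = [human_tags[i] for i in test_ids]
    predicted = model.predict(vectorizer.transform([documents[i] for i in test_ids]))
    return recall_score(truth, predicted), precision_score(truth, predicted)
```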

What else? There are many more questions to consider, such as the richness level of the document set, and what the confidence interval and margin of error will be, but what I’ve described are a couple of the basic machine learning steps of TAR 1.0. These TAR engines provide a way to tag documents based on how they interpret the contents of the documents in the training sets. That ability can save you weeks of time and thousands of dollars, but it isn’t magic. Like any other computer program, TAR software does what the user tells it to do, not necessarily what the user wants it to do. You need to understand what’s happening well enough to integrate TAR into your regular workflow if you want to create a successful review.
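For the margin-of-error piece specifically, the standard normal-approximation formula for a sampled proportion is easy to compute; the numbers below are a hypothetical example, not from any real review:

```python
# Normal-approximation margin of error for a sampled proportion (e.g. estimating
# richness or recall from a random sample); z = 1.96 for 95% confidence.
import math

def margin_of_error(p_hat, n, z=1.96):
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# A 500-document sample that is 10% relevant: 0.10 +/- ~0.026 at 95% confidence.
print(margin_of_error(0.10, 500))   # ~0.0263
```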

Finding the best software for your workflow

Having been the Product Manager for Search, Analytics, and Review at Clearwell Systems, I had a fairly comprehensive idea of what features were important for completing a defensible review, but when I began as Director of Innovation at Discovia, my focus changed. I was no longer concerned with defining a feature to appeal to a number of different kinds of users. I now needed to find the best eDiscovery products to solve particular user and workflow problems for a single eDiscovery service provider. That meant changing my own focus, and, in turn, helping them change theirs.

First, everyone knows not to trust vendor claims, right? I’ve been a vendor, so I know. Each vendor does their testing in an environment that is best suited for their own software, so of course their software is the fastest and the easiest to use. Chances are very good that the vendor didn’t test in an environment that is best suited for you. That means you need to do that testing, but how?

You need to decide what you want to accomplish, and how you would like this software to fit into your existing workflow (or how you can modify your existing workflow to include this software). For example, I looked at possible solutions for speeding up the processing of the email and attachments used in cases. Before we began looking at specific software, I asked the Discovia operations team what they considered to be fast; that is, what did we really mean when we said we wanted to speed up processing? The operations team wanted to speed up document ingestion, the time it takes from when the documents arrive on the server hard drive to when the text is searchable in our review tool. They also wanted to compare how much “people time” the software took: how much time an operations engineer needed to spend on setup and configuration. With clear criteria for comparison, I was able to pare down the options to three possibilities, and, using our prepared testing document set, our testing went quickly. No, I won’t tell you what won, though I will tell you that one processing tool gave different results each time we processed the EDRM Enron document set. I would call that a major bug.

But processing documents for eDiscovery is very well understood, you say. What about something more controversial, like Technology Assisted Review (TAR), which classifies documents as relevant based on training? I would say the first step is still the same: identify what you want to accomplish, and how it will fit into (or change) your existing workflow. You’ll need to think about what you want to achieve by using the software. These tools are designed to take a set of tagged documents as input, then output a ranking for each of the remaining documents that orders them by how closely they match the relevant documents in the tagged set. How do you want to use those ranked documents? You’ll also need to think about process issues: Which document set are you going to test the software with? Who is going to do the training? How will you identify success?

What ultimately matters, when it comes to new software, is whether the software works for your purposes and in the way you need it to work. You can’t trust the vendor to have your best interests at heart, so make sure you know clearly what you need.
