Image generated by DALL-E 2
Have you worried about whether you are missing a trick in incorporating AI into your business? I recently did an experiment that revealed AI is not yet ready for prime time for a key function: legal document summarization. While you may not be interested in doing that, what I learned applies to summarizing any important document.
As a lawyer, I frequently read cases to extract their important points. It would be beneficial if AI could summarize cases reliably without missing key points.
I worked with a computer science Ph.D. candidate to try to develop a case-law summarizer. Machine learning is a subset of AI and encompasses generative AIs, such as ChatGPT.
We used TextRank and ChatGPT both individually and in combination. TextRank is an algorithm designed to identify significant elements in a network, such as important text in a document.
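To make the TextRank idea concrete, here is a toy sketch of how such a graph-based summarizer works. This is illustrative only, not the actual code from our project: each sentence becomes a node, word overlap between sentences becomes edge weight, and a PageRank-style power iteration scores each sentence by how central it is to the rest of the document. The sample sentences are invented.

```python
import math

def similarity(a, b):
    """Normalized word overlap between two sentences (TextRank-style)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa) + 1) + math.log(len(wb) + 1)
    return len(wa & wb) / denom if denom else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality via a PageRank-style power iteration."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            # Each sentence j passes a share of its score to sentence i,
            # proportional to their similarity.
            rank = sum(weights[j][i] / (sum(weights[j]) or 1) * scores[j]
                       for j in range(n) if j != i)
            new.append((1 - damping) / n + damping * rank)
        scores = new
    return scores

sentences = [
    "The court held the trademark was infringed.",
    "The weather that day was unremarkable.",
    "The court held the registration was valid and the trademark infringed.",
]
scores = textrank(sentences)
best = sentences[scores.index(max(scores))]
```

The two sentences about the holding reinforce each other and outrank the irrelevant one, which is the core intuition, and also hints at the repetition problem described below: mutually similar sentences all score highly.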
Ultimately, the project didn’t produce useful summaries, but this failure taught me about the state of AI.
The Experiment
Here’s how the experiments fared: We started with TextRank. Its summaries frequently repeated the same information, sometimes several times. They also sometimes captured the essential elements of the case holding, but often not accurately enough to be useful.
We then used ChatGPT standing alone, and afterward had it refine TextRank’s output. ChatGPT usually eliminated the repetition. It produced a reasonably accurate summary but boiled things down too far to be useful.
Also, we could not stop ChatGPT from infusing its summaries with others’ commentary on the case. Unfortunately, including such commentary shrinks the portion of the output that actually summarizes the opinion. In theory, including commentary is tolerable as long as the output distinguishes between summary and commentary, but I worry that distinction may not always be clear.
In the end, neither TextRank nor ChatGPT, separately or in combination, produced a summary with sufficient detail and accuracy to be helpful.
Why Didn’t It Work?
I asked my technical consultant why he felt it didn’t work out. He said that to do a good job at document summarization, the only widely available and practical tools are either a large language model (LLM in geekspeak) tuned to your need or, for the TextRank approach, an extensive computer dictionary of relevant important terms. Each approach has significant limitations.
A just machine to make big decisions
Programmed by fellows with compassion and vision
We'll be clean when their work is done
We'll be eternally free, yes, and eternally young
- Steely Dan, from “I.G.Y. (What a Beautiful World)”
Why ChatGPT Didn’t Get the Job Done Well
An LLM works like autocomplete on steroids. Generative AIs, such as ChatGPT, are built on LLMs. An LLM trains on a massive volume of data to learn to predict the best next word to follow a string of words. It needs that enormous volume of data to do a reasonably good job. Currently, the biggest and best publicly available LLM is ChatGPT running GPT-4.
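The “autocomplete on steroids” idea can be illustrated with a deliberately tiny stand-in: a bigram model that predicts the most likely next word from counts in its training text. Real LLMs use neural networks with billions of parameters, but the core task, predicting the next word given the preceding ones, is the same, and so is the dependence on training data. The training snippet here is invented.

```python
from collections import Counter, defaultdict

# Toy "training corpus" (invented legal-flavored text).
training_text = (
    "the court held the motion was denied . "
    "the court held the claim was dismissed . "
    "the court granted the motion ."
)

# Count which word follows which.
follows = defaultdict(Counter)
words = training_text.split()
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("court"))  # "held" appears twice after "court", "granted" once
```

Notice that the prediction is entirely a product of the training text: feed this model non-legal prose and its “legal” predictions degrade, which is a miniature version of the training-mix problem discussed next.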
But ChatGPT is not trained only on legal documents, such as case law, so its output when summarizing a legal case will be heavily influenced by the non-legal writing it encountered in training.
Also, one step of LLM training is for humans to review and rate the quality of the output, so it can learn what output is good. Its human training wasn’t done solely by knowledgeable attorneys seeking to produce the kind of output attorneys want. For that reason, it doesn’t reliably produce an output an attorney can rely upon. That’s crucial for any document summarization, even outside the law: if you can’t rely on it to report essential stuff accurately and completely, it won’t save you time.
Worse yet, it has a low limit on how much text it can summarize in one bite. If you want anything bigger summarized, you have to break it into parts and have it summarize each part separately. That won’t work for certain kinds of long documents, such as legal cases, because no part-by-part pass sees the opinion as a whole.
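The chunking workaround just described looks like this in outline: split the long document into word-budgeted pieces, summarize each piece in isolation, then stitch the partial summaries together. The `summarize` argument is a placeholder for whatever model you call; the sketch exists to show the structural weakness, namely that each chunk is summarized without seeing the others.

```python
def chunk(text, max_words=3000):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_long_document(text, summarize, max_words=3000):
    """Summarize each chunk separately, then concatenate the results.

    `summarize` is any callable taking one chunk and returning its summary.
    Note the flaw: a holding stated in chunk 3 that qualifies reasoning in
    chunk 1 can never be connected, because no single call sees both.
    """
    partial = [summarize(c) for c in chunk(text, max_words)]
    return " ".join(partial)
```

For a court opinion, where the holding often depends on facts and reasoning spread across the whole document, this per-chunk blindness is exactly why the workaround falls short.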
Why TextRank Didn’t Do Well Either
Instead of using an LLM for summarization, you can run a summarization program such as TextRank, usually supplemented with an extensive dictionary of important terms. A summarizer such as TextRank doesn’t strictly need that dictionary to work, but supplementing it with one helps the summarizer understand what information matters to the user. Without a dictionary, the summary might capture useless information while omitting information important to the reader.
By “dictionary,” I don’t mean a literal English language dictionary of all English words. I mean a database of the key terms most important to the reader when summarizing the document.
That leads to a problem: the dictionary must be tuned to focus on the particular issues that concern the audience for the summary. That vocabulary list will change based on the document’s subject matter and what you want to summarize from it.
For example, the list of key terms for summarizing a trademark decision will differ from those for summarizing a bankruptcy decision. In fact, the key terms will vary even within those fields depending upon the issue addressed in a particular case.
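Here is a minimal sketch of how such a domain dictionary steers an extractive summarizer: sentences earn extra weight for each key term they contain, so the dictionary’s contents decide what the summary keeps. The trademark term list below is a tiny invented example; swap in a bankruptcy term list and the same document would yield a different summary, which is exactly the tuning problem.

```python
# Hypothetical, deliberately tiny domain dictionary for trademark cases.
TRADEMARK_TERMS = {"infringement", "likelihood", "confusion", "dilution", "mark"}

def score_sentence(sentence, key_terms):
    """Count how many domain key terms the sentence contains."""
    words = set(sentence.lower().replace(".", "").split())
    return len(words & key_terms)

def pick_top(sentences, key_terms, k=1):
    """Keep the k sentences richest in domain key terms."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, key_terms),
                    reverse=True)
    return ranked[:k]
```

The fragility is visible even here: a sentence about “marks” scores nothing against the singular “mark”, and a term coined after the dictionary was built scores nothing at all, which is the maintenance problem described below.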
Also, what if you want a case summary without tuning the output to just one type of law? Either the dictionary will be too narrow for such general usage or, if it’s really broad, it won’t be honed to summarize the important aspects of the document well.
In addition, terminology grows and evolves, even for specific areas of the law. For example, 40 years ago, concepts such as cybersquatting, Internet data privacy breach, and trademark dilution didn’t yet exist. Someone would have to track legal developments and keep the dictionaries current.
TextRank is a good solution if it’s geared to a specific summarization task in which the key terms are known and don’t change, but that rigidity makes it insufficiently useful for broader use cases.
In sum, an LLM needs a ton of data to produce good output, but big LLMs such as ChatGPT haven’t been trained for specific-industry use, such as legal analysis. Conversely, a summarizer-plus-dictionary approach, such as TextRank, requires customization to fit a specific situation, so it isn’t flexible enough to cover a wide range of summarization needs, such as summarizing a variety of types of legal decisions.
Current Technological Limitations on Finding a Better Tool
It’s possible companies are offering specialized generative AI products or summarizers tuned toward summarizing court opinions that might do a better job. That also may be true for business fields other than the law.
Still, regarding LLMs, there will still be problems with the size of the language model and the amount of text it can assimilate and summarize. Ultimately, you will face one of these problems: not enough input data to analyze or spending too much money on human training of the system.
The quality of an LLM’s output is determined by the quality, quantity, and relevance of its training data. LLMs need a massive amount of data to perform well. If you limit the training data to high-quality, relevant documents (such as published court opinions when the LLM’s job is summarizing court opinions), there likely would not be enough data for the system to do a good job.
In theory, you could make up for this by having knowledgeable human beings provide a lot of feedback to the LLM regarding the quality of its outputs, such as an army of lawyers reviewing the quality of case summaries. But that can make producing the LLM extraordinarily expensive, which might make it commercially not viable.
Complicating things, there are areas of law for probably every type of human and business activity. Would case law dealing with violent crime cases provide good training for summarizing case law concerning esoteric tax situations?
And this presumes the LLM could access all needed training data at low or no cost, to make it economic. Some important legal information that should probably be part of the training data set is in paywalled databases, such as treatises and subscription law journal articles. Even public documents, such as case law and regulations, may be best accessed only in paywalled databases due to the poor quality of public ones.
Also, there is a fundamental trade-off between cutting repetition from the summary and missing important details in what is summarized. This applies both to generative AIs such as ChatGPT and summarizers such as TextRank.
Usually, the repetition in the output stems from the legal opinion or other source document addressing the same issue several times. If the source addresses the same issue repeatedly in different wording, the summarizer might not recognize the repetition and will reproduce it.
This is a tough-to-crack problem because sometimes the repetition in the source document is only verbosity, which could be distilled to a single statement, while in other cases, each statement might contain important new details or nuance that should be included in a proper summary. An LLM or text summarizer will struggle to avoid repetition in the summary while still gathering essential details.
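The trade-off can be made concrete with a simple deduplication sketch: a word-overlap (Jaccard) filter that drops sentences too similar to ones already kept. This is not what either tool does internally, just an illustration of why the problem is hard: the threshold is the trade-off knob, and set low enough to cut real repetition it may also drop a restatement that carried a new fact.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def dedupe(sentences, threshold=0.6):
    """Keep a sentence only if it is not too similar to any kept so far."""
    kept = []
    for s in sentences:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

A sentence that mostly repeats an earlier one but adds the single word that changes its meaning still measures as near-identical, so a purely surface-level filter like this cannot tell verbosity from nuance.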
Thus, due to how LLMs and dictionary-based summarizers currently work, I’m skeptical that even specialized legal products are ready to do this job well enough to be useful to lawyers. I think we need further technological advancement to get there.
Sometimes a Failure Produces Valuable Information
That sounds like a failure, but it’s informative. We lawyers want to be efficient and competitive, so we want to use technological tools when they can help us produce better work products or become more time-efficient or, ideally, both.
I learned AI isn’t yet ready to be a great case-law summarizer. Perhaps we’ll get there in a few years. For now, at least I learned that I’m not missing a trick yet.
Written on August 16, 2023
by John B. Farmer
© 2023 Leading-Edge Law Group, PLC. All rights reserved.