Will Copyright Law Wipe Out Generative AIs Such As ChatGPT?
It's the War of the Mind Worlds: AI vs. Human Creators
Do you fear generative AIs such as ChatGPT will destroy your earning power? Do you worry about AIs turning on mankind?
If you’re hoping for a savior to smite generative AI, could copyright law be it?
By analogy, think of H. G. Wells’ The War of the Worlds. In it, Martians, who have superior military technology, invade the Earth. The Martians begin laying waste to Earth’s cities. It appears human civilization is doomed.
But the Martians ultimately were laid low, not by Earthling military weapons, but by something in the background: Earth germs. The Martians died off, and the threat disappeared, because the environment was inhospitable.
Is copyright law inhospitable to generative AI?
How Generative AI Works
First, some background is necessary. Generative AIs such as ChatGPT work by studying a massive volume of data to map connections between the language in their prompts and the features of their output. They do this by converting words into sequences of numbers and learning how those numbers should optimally influence parts of the final product.
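To make the "words become numbers" step concrete, here is a minimal, purely illustrative Python sketch. The tiny vocabulary below is invented for this example; real systems use tokenizers with vocabularies of tens of thousands of entries.

```python
# A toy illustration (not any real AI's tokenizer): text is converted into
# a sequence of numbers before the model ever sees it.
vocab = {"the": 0, "martians": 1, "invade": 2, "earth": 3, "copyright": 4, "law": 5}

def encode(text: str) -> list[int]:
    """Turn a sentence into the sequence of numbers a model would process."""
    return [vocab[word] for word in text.lower().split() if word in vocab]

print(encode("The Martians invade Earth"))  # -> [0, 1, 2, 3]
```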
In training, these AIs are fed millions (perhaps billions) of documents and other media. While generative AI providers don’t fully divulge where they get training data, they admit the sources are mainly on the Internet. Likely sources include Wikipedia and news articles. OpenAI, the provider of ChatGPT, admits it hasn’t purchased licenses to use much of its training data.
When AIs are trained, they make a copy of every item of training data, such as every online article studied. But a generative AI does not store the training data in its original form and retrieve it to generate answers to user requests (called "prompts"). It's not like a librarian listening to your needs and finding the perfect book. Instead, a generative AI is a massive, many-layered neural network somewhat analogous to the network of neurons in your brain.
By studying training data, it learns about connections between words and the structure of human speech (in addition to connections in non-verbal media). It uses that learning to set “weights” in its neural network. Each piece of data, such as each article, affects where the weights are set. In effect, once training is completed, the final setting of the weights is the summation of the AI’s study of a massive volume of information.
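As a rough sketch of how each piece of training data nudges the weights and can then be discarded, consider the toy training loop below. It fits a single weight to a handful of made-up examples; a real neural network does the same thing with billions of weights and vastly more data, but the principle is the same: what survives training is the weights, not the documents.

```python
# A toy illustration of training: each example nudges a shared weight slightly,
# and afterward only the weight remains; no individual example is stored.
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # made-up (input, target) pairs

weight = 0.0          # the model's lone "weight"
learning_rate = 0.01

for _ in range(1000):                        # repeated passes over the training data
    for x, target in examples:
        error = weight * x - target          # how far off the current weight is
        weight -= learning_rate * error * x  # nudge the weight toward a better fit

print(round(weight, 2))  # roughly 2.0: a summary of all the examples, not a copy of any one
```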
For that reason, if you ask an AI to generate something, it’s highly unlikely its output will match or be nearly identical to any one piece of training data, such as an article. That could happen by chance, with the AI just happening to generate text highly similar to a single work. It also could happen if a prompt asks for an ultra-niche answer found only in a single item of training data. That should be rare, if it happens at all.
The Relevant Copyright Law
How does this generative AI process measure up under copyright law? To do this analysis, we must first understand some basic copyright principles.
Copyright is a set of exclusive rights of an author in his or her creative work, such as an article, painting, photograph, video, or computer code. Among those exclusive rights are the right to copy the creative work and the right to build upon it; a work built upon another is called a "derivative work."
Almost everything on the Internet is someone’s copyright property. Copyright comes into existence just by an author creating a work. Both copyright registration and using a copyright ownership notice strengthen an author’s rights but are not necessary for copyright ownership.
Copyright protects only specific expressions, not ideas. Rewriting someone else’s ideas in your own words is not copyright infringement.
If someone commits technical copyright infringement, such as copying someone else's copyright property without permission, sometimes such activity is excused as fair use. Fair use is a squishy concept. It's a case-by-case analysis based on four factors: (1) the purpose and character of the use, including whether the use is commercial or for nonprofit educational purposes, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion of the copyrighted work being used, and (4) the effect of the use on the copyrighted work's market value. A classic example of fair use is showing a copyrighted painting in an art class to critique the artistry.
The Training Phase of Generative AI: Copying For Sure, But Is It Fair Use?
Generative AI gets copyright scrutiny at two stages: training and output.
As noted above, in AI training, the copyright property of third parties is copied without their permission. That’s copyright infringement unless it’s fair use. In theory, operators of generative AI could buy licenses to use training data, but given the massive volume they need and the difficulty of identifying copyright owners for some massive online databases, that’s probably impractical.
Is this use of the copyright property of others fair use? It probably is, but because generative AI is so novel, past legal precedents don't fit this situation cleanly.
Perhaps the most applicable precedent is the Google Books project, which got Google sued by the Authors Guild.
Google copied the collections of large physical libraries to generate a books database. It did so without permission from the books' copyright owners. The database can be used to find books that address a search topic, and the search results show a small snippet of the most relevant part of each book. It is a tool to find books, but it doesn't replace acquiring a book to get its information.
The Second Circuit ruled that this copying was fair use. It held that the use didn't give the public usable copies of the books, so the conduct didn't damage the economic value of the books. Whether the behavior affects the market value of the copyright property is a big factor in these decisions.
In training, generative AI doesn't go as far in copying as Google Books. With generative AI, a copy of copyright property is made to affect the weights in the AI system, and then that copy can be discarded. In the Google Books case, the copy is kept permanently to generate search results listing specific books and providing snippets. This comparison creates a strong argument that the intake part of running a generative AI is a fair use of others' copyright property.
The Output Phase of Generative AI: Real Harm to Authors, But Where’s the Infringement?
Now, let’s look at the generative AI output.
Here, the output is not a copy of any single item of someone else's copyright property, such as a particular article (subject to the possible rare exceptions noted above).
Under copyright law, if AI output is substantially similar to a specific item of someone’s copyright property, and if that copyrighted property was in the training data, such “access plus substantial similarity” creates a presumption of copyright infringement. But, as noted above, such substantial similarity will rarely happen.
Producing something “in the style of” a particular copyright owner might happen frequently, but such style mimicry isn’t copyright infringement or a violation of any other right held by that owner. An author or other artist has no copyright on a style.
For example, let's say I ask an AI to write a story in the style of Tom Wolfe about a young lawyer in Richmond in the 1990s. That story might read as if Tom Wolfe could have written it. The ability of an AI to generate such output might depress the demand for buying Tom Wolfe books because you can get a good free substitute from a generative AI. But what the AI produces likely won't be similar enough to any single Tom Wolfe story to be copyright infringement.
The Disconnect Between AI Copying and AI Harm to Authors
Here’s the overall problem with attacking generative AIs with copyright law: Search services such as Google Books don’t substantially harm the market demand for the works produced by human authors. They might enhance it.
Generative AI can hurt or destroy the market for some human-generated works. For example, why buy a license for a photograph from Getty Images if you can generate the stock photo you want for free using an art-generating AI, such as Midjourney or DALL-E?
But this destruction of market value happens at the output stage, and, with rare exceptions, the output isn't a copyright infringement. The copying occurs at the learning stage, and no court has held that this copying for study purposes is copyright infringement.
How Will the Courts and Maybe Legislatures Handle This Issue?
Ultimately, copyright law isn’t suited to address this harm.
A federal court could greatly extend the current boundaries of copyright law to hold that the generative AI process adds up to copyright infringement. It could hold that the marketplace effect of the outputs means the copying occurring in training is not fair use – that you have to consider training and output collectively. That would be a big stretch and is unlikely. It would be like saying you can’t take a picture of a copyrighted painting if you intend to use that picture to try to paint something similar.
Also, the major generative AI providers, such as Microsoft and Google, are huge technology companies with zillions of dollars to spend on lawyers. Microsoft owns a major stake in OpenAI (the maker of ChatGPT) and has incorporated that technology in its Bing search engine. Google has its Bard generative AI.
Congress could amend the copyright laws to require generative AIs to buy licenses to use the information of others as training data. But nowadays, Congress usually lets the courts deal with challenging IP issues rather than getting involved. Federal copyright law likely preempts states from creating copyright-like protection against generative AIs.
Other Possible Legal Attacks on Generative AI
Because a generative AI might occasionally generate something substantially similar to a copyright owner's work in its training data, you might see some successful lawsuits against generative AI operators over particular outputs. But that won't be enough to take down the whole generative AI system.
There are other possible attacks on generative AI.
If an AI generates something mimicking a person's name, image, likeness, or voice so well that people would believe it is that person, the output might violate the "right of publicity." That will be a rare case, however.
Sometimes generative AIs pull data from websites whose online terms prohibit doing so. Perhaps there would be a breach-of-contract or similar claim in such situations; that question hasn't been resolved legally. But that likely isn't a way to universally stop AIs from harvesting information across the Internet for training data. Also, the operators of such websites would have to prove that the AI got the data from their websites, which may not be easy.
In some cases, data privacy laws may crimp the efforts of generative AI but not stop it entirely. Various states, including Virginia and California, have strong individual data privacy laws, and Europe has its GDPR data privacy law. That’s a subject for another day.
* * *
So, in this movie in which we live, it appears the copyright germs cannot kill off the Martian generative AIs, so the generative AIs will thrive and shape (at the least) humanity’s future. To quote Kent Brockman from The Simpsons, “I, for one, welcome our new alien overlords.”
Written on May 17, 2023
by John B. Farmer
© 2023 Leading-Edge Law Group, PLC. All rights reserved.