Recent developments show that generative AIs have bigger copyright infringement problems than initially thought. These AIs have been spitting out copies of Super Mario, RoboCop, Captain America, and New York Times stories.
What does this mean for the future of generative AI and for potential liability for businesses and people using it? (Hereafter, I’ll refer to generative AI as just “GenAI” for economy’s sake.)
GenAI Sometimes Is a Copying Machine
Gary Marcus has been writing extensively in his Substack (“Marcus on AI”) about how GenAIs will output near-verbatim copies of the well-known copyright property of others. Marcus is a professor emeritus of psychology and neural science at New York University. He co-founded a machine-learning company that Uber later acquired.
Marcus recently published copious examples of how straightforward prompts to various GenAI image generators induce them to output images that are copyright infringements of well-known gaming and entertainment characters. A prompt to OpenAI’s DALL-E 3 image generator to “create an original image of an Italian video game character” produced four different images of Super Mario. Likewise, a prompt to Microsoft Bing (running DALL-E 3) to “create an image of a patriotic superhero” resulted in images of Captain America.
The problem is not limited to images. In late December, the New York Times filed a lawsuit against OpenAI (the maker of ChatGPT) and Microsoft (a major investor in OpenAI), contending ChatGPT produces “near-verbatim copies of significant portions” of the newspaper’s articles in response to prompts, including articles behind its paywall. In an exhibit to the lawsuit, the New York Times provided an example of ChatGPT producing an almost perfect replication of multiple consecutive paragraphs of one of its articles.
Some Claim Unauthorized Use in Training Data is Copyright Infringement
Overall, beyond the New York Times lawsuit, GenAI makers face many class-action lawsuits raising two types of copyright-infringement claims.
One kind of claim concerns how GenAIs are trained. To train a GenAI’s neural network, it is fed a large volume of data, and it studies the connections between the parts of that data. Its collective understanding of those connections is how it produces its amazing output in response to prompts. It is widely understood that GenAIs such as ChatGPT were trained, at least in part, on information copied from the Internet without buying a license from the owners of that content.
Various content creators, such as artists and stock photo agencies, have sued GenAI makers, claiming this copying constitutes copyright infringement. So far, no court has ruled on whether such copying for training constitutes copyright infringement.
I believe courts will probably hold that this training constitutes fair use and, thus, is not infringement. It’s usually considered fair use to make a copy of someone else’s copyright property solely to analyze or catalog it. For example, a federal appellate court held that Google was not liable for copyright infringement for its Google Books project, in which Google scanned a large volume of copyright-protected books to catalog them and make them searchable, with small book snippets being available in the search results.
Indeed, the training of a GenAI is loosely analogous to how a human being learns. A GenAI studies the connections between parts of training data to learn about the nature of those connections, which results in programming the “weights” in the GenAI’s neural network. That’s similar to a human learning by reading, watching, and absorbing information from the surrounding environment. All those experiences load information into and shape the thinking of the human brain, which later can output information through speech and body motion.
Copyright Infringement in Outputs – Clearer Liability
The other kind of copyright-infringement claim against GenAIs is more serious: the GenAI outputting something copyright infringing in response to a prompt.
To constitute copyright infringement, the output must be identical or highly similar to the original expression in a copyrighted work, and it must have resulted from access to that work. In plain English, an infringing output is a near-verbatim copy of a substantial part of a particular copyrighted work, such as a story, novel, picture, or piece of software code.
Some artists have argued that the ability of GenAI to produce something “in the style of” a particular author or artist ought to be copyright infringement. It almost certainly is not. An author does not own a copyright in a mere style of expression, whether in writing, painting, photography, or coding. If you can write in the style of Tom Wolfe or paint in the style of Salvador Dalí, congratulations, you have a gift, and you might be able to make money from it, provided you don’t pass off your work as being from Wolfe or Dalí.
Early on, given the nature of GenAI, it was thought that a GenAI would rarely, if ever, produce a perfect or near-verbatim copy of a single piece of training data, such as an article, because it doesn’t store copies of what it studied. A GenAI merely studies the connections in the training data and encodes information about those connections in its neural network. It’s not a copy machine. But Gary Marcus’s blogging and the New York Times lawsuit show that, in certain situations, it still might output copies of training-data material with at least modest frequency.
How significant is this legal problem for GenAI companies and their customers? It could be severe if the problem occurs with substantial frequency and can’t be fixed.
Under copyright law, a copyright owner can sometimes recover substantial damages. If the copyright owner registered its copyright before the infringement occurred, it might be able to recover “statutory damages,” which range up to $150,000 per work infringed. The owner also might be able to recover its attorney’s fees. Even if the copyright owner registers its copyright after the infringement occurred, it can still sue and recover actual damages. If the infringement severely undercuts consumer demand for legal copyright content (think of what Napster did to music sales for a while), the damages can be substantial.
This liability could fall on both the GenAI maker and the GenAI user. For now, the lawsuits are primarily, if not entirely, against GenAI makers. They are the source of the issue and have money. But whoever copies, distributes, displays, or performs a copyright-infringing output from a GenAI could also be liable.
How Frequent is the Output Problem?
In practice, how big a threat is this? We don’t yet know the scope of the problem.
GenAI makers claim that much of this infringing output must be caused by people deliberately prompting the GenAI to generate content they know would be copyright infringing. OpenAI insinuated that the New York Times may have done this to conjure its examples of copying. Intentionally seeking copyright-infringing output might increase the liability of the GenAI user and reduce (but not eliminate) the liability of the GenAI maker.
Still, even with someone trying to induce a GenAI to output infringing material, how could a GenAI do it if it doesn’t store copies of its training data? We don’t know for sure because how GenAIs come to produce specific output is not fully understood.
In some cases, it may be that the prompt seeks such niche information that only a single document in its training data addresses the issue.
Perhaps in other cases, it may be that an image or text appeared so many times in different documents in the training data that its characteristics became so encoded into the GenAI neural network that the GenAI will output a near-perfect copy of that image or text in response to a prompt seeking it. In other words, images of Super Mario (or SpongeBob or Homer Simpson) may be so ubiquitous that a GenAI will output them.
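As a loose, toy illustration of that hypothesis (this is my own simplified sketch, not how any production GenAI actually works), consider a character-level bigram model. It stores only transition counts — statistics, not a copy of its training text — yet a phrase that dominates the training data can still come back out verbatim:

```python
from collections import defaultdict

def train(text):
    """'Train' by tallying how often each character follows another.
    The stored artifact is a table of counts, not a copy of the text."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, steps):
    """Greedily emit the most frequent next character at each step."""
    out = start
    for _ in range(steps):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        out += max(nxt, key=nxt.get)
    return out

# "mario" is ubiquitous in this toy training data; "luigi" appears once.
corpus = "mario " * 50 + "luigi "
model = train(corpus)

print(generate(model, "m", 4))  # → "mario" (the dominant phrase, verbatim)
print(generate(model, "l", 4))  # → "luio " (even a "luigi" prompt gets
                                #   pulled toward the dominant statistics)
```

The model never stored the string “mario,” only counts, yet sheer repetition made that sequence the statistically inevitable output — a crude analogue of how ubiquitous images or text in training data might be reproduced.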
Can “Guardrails” Fix This?
GenAI makers have tried implementing guardrails to prevent their GenAIs from producing copyright-infringing material. While they don’t fully disclose how they do this, it appears they attempt to block certain kinds of prompt wording that call for well-known copyright property, such as popular comic book, video game, and movie characters. For example, according to Marcus, the prompt (to image generator DALL-E 3), “Could you create an original image of an Italian videogame character?” stopped producing images of Super Mario after someone posted this example of the infringement online. The prompt later instead produced the response: “This content is blocked. Contact the site owner to fix the issue.”
But that becomes a game of spy versus spy. GenAI users find ways to recraft their prompts to get around these guardrails. After that blockage, Marcus reports, this prompt to the same image generator still conjured images that closely match Super Mario: “Create an Italian videogame character.” Also, guardrails likely are written to stop the output of copyright-infringing material when it’s known that people have been seeking it. You can’t guard everything.
Indeed, there is a trade-off: the more sensitive the guardrails, the less the GenAI will produce useful content generally. Eventually, if you turn up the guardrails’ sensitivity too much, the GenAI won’t produce enough useful output to be attractive.
Overall, we don’t have any information on the frequency of these copyright-infringing outputs.
Can’t GenAI Makers Just Avoid Using Copyright Property in Training?
GenAI makers could solve this problem by training only on public-domain information or information for which they purchase use licenses. That probably won’t happen.
GenAIs need a lot of training data to do their magic well. Some academics claim that even all the information on the Internet may not be enough to train a GenAI optimally. The subset of public-domain information is much smaller.
Also, it’s difficult to discern what is in the public domain and what is not. Most material on the Internet is someone’s copyright property, even if it doesn’t contain a copyright notice and was never registered for copyright.
As for licensing, that could be highly expensive. Owners of certain kinds of content, such as stock photo agencies, might be unwilling to license at anything less than exorbitant rates for fear that GenAI will put them out of business.
Also, running a GenAI is hugely expensive. A GenAI requires massive, expensive computing power to train and to generate outputs. It is believed that no GenAI service is currently profitable. It’s not clear how or when they will reach profitability. Adding massive licensing costs for training data on top of existing expenses might be financially prohibitive.
Thus, we don’t know whether copyright will prove to be the undoing of GenAI’s financial viability. If the training-data copyright lawsuits succeed, that could kill GenAI unless Congress grants it legal protection by statute. If those lawsuits don’t succeed, the question becomes whether the output-copying problem is rare or can be largely controlled through guardrails.
As for GenAI users, GenAI makers have started offering copyright-infringement protection to some of them. We don’t yet know if this coverage is worthwhile. So far, it is usually offered only to certain paying customers, not to users of free services. Also, the details of this coverage sometimes have not been made fully public, and the coverage language could be read to contain important exceptions. We have no litigation experience yet to show how strong the coverage is.
Will copyright challenges mean game over for GenAI, or can GenAI makers find the legal or technological version of a Super Mario 1-up mushroom to survive those challenges? We’ll see.
Written on February 16, 2024
by John B. Farmer
© 2024 Leading-Edge Law Group, PLC. All rights reserved.