One of the more contentious parts of the rise of AI is its relationship to the rightsholders for the content on which it is trained. Many consider it blatant copyright infringement.
I’m not so sure.
Before I dive into my musings, I want to be clear: I don’t know the answer. I’m not discounting the value of the content that’s being consumed, and I’m certainly not saying that content creators should be ignored in this process.
Let’s start by looking at how humans consume content.
Throughout our lives we read books, magazines, love letters, emails, essays, scribbled notes, and more. These come from a variety of sources. They might be things we’ve purchased, things we’ve borrowed, or things that are available for free and without restriction.
While we remember only some of it, all of it has the potential to affect how and what we think. In a sense, we build an internal model of thought based on everything we've consumed throughout our lives.
When we create, we use that internal model to generate our ideas, words, or other creations. While those creations might have similarities to what we've consumed in the past, they are our unique and original creations. Striking similarity is not uncommon; we do occasionally hear of musicians being accused of theft when the music in question was legitimately and independently created. Given the quantity of things being created, occasional synchronicity seems inevitable.
Now, compare that to how large language models (LLMs), and so-called AI systems, work.
They consume immense amounts of content. Far more than any human could in a lifetime. That content is publicly accessible. (Unless laws were actually broken, or other arrangements were made.)
They “remember” every word. Their internal model, the thing that mimics thought, is based on everything they've consumed.
When they create, they use that internal model to mimic creating ideas, writing words, drawing images, or producing something else. While those creations might have similarities to what the LLM has consumed in the past, they are unique and original creations. Striking similarity is, I believe, rare, but much like the music example above, we do occasionally hear of a series of words strung together by AI that happens to be identical to words written by a prior author. Again, given the quantity of things being created, occasional synchronicity seems inevitable.
My question is this: given the strong similarities between the processes used by LLMs and by humans, where is the plagiarism? Where is the copyright violation?
What's the difference between me reading your book and writing an essay on a similar topic, and an AI doing something similar?
I’m not saying that there isn’t something important to be explored here, but the existing concepts just don’t seem to me to apply.
If I asked ChatGPT to write me a story about a sea captain obsessed with killing the whale that cost him his leg, and it responded by spitting out Moby Dick verbatim, that's plagiarism, no question. Indeed, even if only a few paragraphs of the result were unattributed word-for-word copies, it would be a problem.
But that’s not what’s happening. Something else is happening, and we need to come to grips with whatever that is, and build some kind of consensus about what is and what is not fair game.
Today’s copyright law and calls of plagiarism aren’t it.
I asked ChatGPT. Its response is insightful, yet in some ways it misses the crux of my concern. You'll find that here.