Generative AI Has a Visual Plagiarism Problem

@[email protected] · 1 year ago

Generative AI is based on “predicting” and generating the next token. Tune it one way and it will regurgitate its training data exactly. Tune it the other way and the words it comes up with are nonsense. Tune it just right and it comes up with something that seems creative.
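
To make the "tuning" point concrete, here is a minimal sketch of temperature sampling, which is one common knob of this kind. The candidate words and their scores are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not any particular model's API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the word that follows "To be or not to"
candidates = ["be", "sleep", "banana"]
logits = [5.0, 2.0, 0.0]

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, dict(zip(candidates, (round(p, 3) for p in probs))))
```

At a low temperature nearly all the probability lands on the top-scoring word, so the output tends to repeat what was seen in training. At a high temperature the three words become almost equally likely, which is where the nonsense comes from.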

The problem is that the training data is always in there somewhere. It can’t generate something in the style of Shakespeare without containing Shakespeare as reference. That’s probably fine for Shakespeare, which is out of copyright, but if it contains, say, Stephen King’s entire collected works, that’s another issue.

If a human writer read all of Stephen King’s books then tried to write in the style of King, that would be OK, but that’s because a human can’t memorize everything King has written word-for-word. When a human reads King, they don’t build up a database of “probable next word frequency”. Instead, they build heuristics having to do with how he approaches dialogue, how he reveals character, how he builds tension, etc. They may remember one especially memorable line or two, but the bits they remember, even if written down word-for-word, would probably not be enough to infringe copyright on their own.

I would bet that we’ve come too far to completely scrap generative AI. Too many billions have been invested, and the companies have too much political power. So, the question is whether there will be significant changes to copyright law. On one side of that fight will be the trillions of dollars behind the entertainment industry. On the other side of that fight will be the trillions of dollars behind the tech industry. Of course, individual artists will be trampled in the process.

@[email protected] · 1 year ago

It seems, though, that in the long run, the line between a human reading Shakespeare and coming up with their own version and a computer doing the same will get thinner and thinner. After all, we are really just biological computers. One could imagine a computer “thinking” of things the same “way” that we do. What then?

@[email protected] · edit-2 1 year ago

One could imagine a computer “thinking” of things the same “way” that we do.

One can imagine it, but that’s been the impossible nut to crack ever since the first computers. People have been saying that artificial intelligence (what we now want to call AGI instead) is 5 years away since the 1970s, if not earlier.

The new generative systems seem intelligent, but they’re just really good at predicting the next word. There’s no consciousness there. As good as LLMs are, they can’t plan for the future. They don’t have goals.

The only interesting twist here is that consciousness / free will might not really exist, at least not in the form most people think of it. So, maybe LLMs are closer to being “thinking” computers not because they’re getting closer to consciousness / free will, but because we’re starting to realize free will was an illusion all along.

@[email protected] · edit-2 1 year ago

That’s what I mean. We elevate the human thought process as if what we come up with is more valid than what a (future) computer could think up. But is it?

So if a computer synthesizing Shakespeare is stealing, maybe so is a human doing it. But maybe then we could never create anything at all. And if we must not be blocked from it, must a machine?

@[email protected] · 1 year ago

So if a computer synthesizing Shakespeare is stealing

Copyright infringement is never stealing. But, as to whether it’s infringing copyright, the difference is that current laws were designed based on human capabilities. If memorizing hundreds of books word for word were a typical human ability, copyright would probably look very different. Instead, normal humans are only capable of memorizing short passages, but they’re capable of spotting patterns, understanding rhythms, and so on.

The human brain contains something like 100 billion neurons, and many of them are dedicated to things like hearing, seeing, eating, walking, sex, etc. Only a tiny fraction are available for a task like learning to write like Shakespeare or Stephen King. GPT-4 reportedly contains about 2 trillion parameters, and every one of them is dedicated to “writing”. So, we have to think differently about whether what it’s storing is “fair” when it comes to infringing someone’s copyright.

Personally, I think copyright is currently more harmful than helpful, so I like that LLMs are challenging the system. OTOH, I can understand how it’s upsetting for an artist or a writer to see that SALAMI can reproduce their stuff almost exactly, or produce something in their style so well that it effectively makes them obsolete.

propapanda · 1 year ago

deleted by creator

@Thermal_shocked · 1 year ago

You, sir, are a Turing machine.

@[email protected] · 1 year ago

I’d be careful with the plagiarism argument. People uploaded content to Meta, Reddit, and tons of other sites whose terms of use allow them to make commercial use of the content you uploaded and its derivatives. I’m not sure whether these terms have been challenged in court, but Meta and others have a massive database of images they can use to train their AI and reuse commercially. Artists knew it was going to happen when they started posting content on Insta.

We do still have an issue, though: being able to generate Sailor Moon or the Simpsons means they have copyrighted data in their training dataset, and it’s a serious risk for free/libre models (while Meta is big enough to tell Toei Animation to fuck off).

@[email protected] · 1 year ago

deleted by creator