In the end, though, the crux of this lawsuit is the same as all the others: a false belief that reading something (whether by human or machine) somehow implicates copyright. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real-world problems.
Part of the Times’ complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. It is run by some great people (though the lawsuit here attacks them).
But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.
(Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise I imagine that Common Crawl would be a defendant in this case as well.)
Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not a right limited by copyright. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are now rights that simply do not exist.
I see what you mean, but I thought copyright is a protection against copying something (even with some modifications).
Techdirt traditionally has a very clear view on copyright and its restrictions, so I am familiar with their bias. Their argument here boils down to the difference between copying something and learning from something. If reading something and learning from it is copyright infringement, any educational institution should be very worried, because that’s exactly what’s going on in there.
I do understand the difference between a student reading dozens/hundreds of NYT articles (for free in the library) and a computer program doing the same, but for orders of magnitude more articles. So I’m curious to see how this is going to turn out.
You raise an interesting issue with learning. I would say that as humans we have the capacity to add creative input into our works, whereas a program can only restructure and regurgitate the information entered.
And to be clear, I don’t oppose the creation of AI in general. I just think that creatives, especially independent artists, deserve to be justly compensated if they choose to allow AI to train on their works.