In the end, though, the crux of this lawsuit is the same as all the others: the false belief that reading something (whether by human or machine) somehow implicates copyright. It does not. If the courts (or the legislature) decide otherwise, it would upset pretty much all of the history of copyright and create some significant real-world problems.
Part of the Times complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. Common Crawl is a fantastic resource run by some great people (though the lawsuit here attacks them).
But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.
(Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise I imagine that they would be a defendant in this case as well).
Either way, so much of the lawsuit is claiming that GPT learning from this data is infringement. And, as we’ve noted repeatedly, reading/processing data is not a right limited by copyright. We’ve already seen this in multiple lawsuits, but this rush of plaintiffs is hoping that maybe judges will be wowed by this newfangled “generative AI” technology into ignoring the basics of copyright law and pretending that there are now rights that simply do not exist.
That depends on the nature of the derivative and the license the original work was released under. Fair use is an exception to copyright law, and its applicability depends on several factors.
If you publish a photo under a non-commercial use license, the NY Times can’t just publish a cropped black-and-white version of it in their paper without arranging a deal with you.
But someone else could write a blog post critiquing your photo, and show the photo in the process.
The contention is over whether AI tools meet the fair use standard.
Your initial example is poorly constructed as it implies that, much like republishing a cropped section of an original photo, AI is “generating” its results by merely stitching quotes together. That could not be further from the truth, and perpetuating that misconception is irresponsible and unhelpful.
A more accurate analogy would describe an original photograph as one item in a compiled volume used to refine specific visual details — the building styles of a particular place, the fashion of an era, a photographic style, and so on — to better inform the LLM’s text-to-image mechanics.