LLMs are likely going to scrape no matter the license. I doubt OpenAI got a copyright license from Reddit to ingest it. In fact, I’m not even sure they need one if ingestion can be made similar enough to “reading the web site.” So making content CC probably won’t affect LLM use of public posts.
Yeah, I understand that screen scraping is a thing, and that once a robot can simply read an entire website there’s nothing you can do to stop it short of taking the website offline.
I was talking about something more structured and proactive: “We know that AI will read our site and ingest it for LLM training. Instead of simply accepting that as an inevitability, we’re extending this offer: for a nominal fee we will provide the entirety of our site’s information, with all screen names redacted to protect the identity of the content creators, in exchange for them not simply using AI to scrape our site.”
Or something to that effect. Accept that it will happen and that there’s nothing you can really do to stop it, but package the data in a clean way so that they don’t have to scrape, and can simply ingest it into their LLM data sets directly.