Does a license like this exist?

  • slazer2au · 23 hours ago

    Licensing is only as effective as your ability to enforce it. How do you show that an LLM consumed your code as part of its training data?

    • lobut@lemmy.ca · 22 hours ago

      Some authors typed the first few sentences of their book and the LLM spat out the rest.

      • FaceDeer@fedia.io · 19 hours ago

        That generally only happens in cases of overfitting, where the model was trained on a poorly de-duplicated data set that contains many copies of that book (or excerpts, quotes, and so forth). This is considered a flaw by AI trainers and a lot of work goes into sanitizing the training data to prevent it.
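The de-duplication step described above can be illustrated with a toy sketch. This is a minimal, hypothetical example of exact-match de-duplication only; real training pipelines also use fuzzy techniques such as MinHash to catch near-duplicates.

```python
# Toy sketch: exact-match de-duplication of a training corpus.
# Lightly normalizes whitespace and case so trivial variants collide;
# real pipelines go much further (near-duplicate detection, shingling).
import hashlib

def dedup_exact(documents):
    """Keep only the first occurrence of each normalized document."""
    seen = set()
    unique = []
    for doc in documents:
        # Collapse whitespace and lowercase before hashing.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Call me Ishmael.", "call me  Ishmael.", "It was a dark night."]
print(dedup_exact(corpus))  # the duplicated opening line is kept only once
```

Repeated passages that survive this kind of filtering are exactly what lets a model memorize and regurgitate them.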

          • FaceDeer@fedia.io · 19 hours ago

            You went digging through my Reddit comments to find a two-month-old thread; that must have taken a lot of effort. But I’m afraid I don’t see its relevance, aside from a general “it’s about AI”. The bulk of the comments I wrote there were about water usage.

            I’m genuinely puzzled. Are you saying that deduplicating data is “hiding unethical behaviour”? It’s actually intended to improve the model’s performance: having a model spit out exact copies of its training data means you’ve produced a hugely expensive and wasteful re-implementation of copy-and-paste rather than a generative AI. The whole point of generative AI is to produce novel outputs.