Does a license like this exist?

  • slazer2au · 23 hours ago

    Licensing is only as effective as your ability to enforce it. How do you show that an LLM consumed your code as part of its training data?

    • lobut@lemmy.ca · 22 hours ago

      Some authors typed the first few sentences of their book and the LLM spat out the rest.

      • FaceDeer@fedia.io · 19 hours ago

        That generally only happens in cases of overfitting, where the model was trained on a poorly de-duplicated data set that contains many copies of that book (or excerpts, quotes, and so forth). This is considered a flaw by AI trainers and a lot of work goes into sanitizing the training data to prevent it.
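The de-duplication step described above can be illustrated with a toy sketch. This is a minimal, hypothetical example of exact-match de-duplication only; real training pipelines also use fuzzy techniques such as MinHash to catch near-duplicates.

```python
# Toy sketch: exact-match de-duplication of a training corpus.
# Lightly normalizes whitespace and case so trivial variants collide;
# real pipelines go much further (near-duplicate detection, shingling).
import hashlib

def dedup_exact(documents):
    """Keep only the first occurrence of each normalized document."""
    seen = set()
    unique = []
    for doc in documents:
        # Collapse whitespace and lowercase before hashing.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Call me Ishmael.", "call me  Ishmael.", "It was a dark night."]
print(dedup_exact(corpus))  # the duplicated opening line is kept only once
```

Repeated passages that survive this kind of filtering are exactly what lets a model memorize and regurgitate them.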

          • FaceDeer@fedia.io · 19 hours ago

            You went digging through my Reddit comments to find a two-month-old thread; that must have taken a lot of effort. But I’m afraid I don’t see its relevance, aside from a general “it’s about AI”. The bulk of the comments I wrote there were about water usage.

            I’m genuinely puzzled. Are you saying that deduplicating data is “hiding unethical behaviour”? It’s actually intended to improve the model’s performance: having a model spit out exact copies of its training data means you’ve produced a hugely expensive and wasteful re-implementation of copy-and-paste rather than a generative AI. The whole point of generative AI is to produce novel outputs.