ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • @[email protected]
    link
    fedilink
    English
    451 year ago

    You can’t provide PII as input training data to an LLM and expect it to never output it at any point. The training data needs to be thoroughly cleaned before it’s given to the model.