Source: https://front-end.social/@fox/110846484782705013

Text in the screenshot from Grammarly says:

We develop data sets to train our algorithms so that we can improve the services we provide to customers like you. We have devoted significant time and resources to developing methods to ensure that these data sets are anonymized and de-identified.

To develop these data sets, we sample snippets of text at random, disassociate them from a user’s account, and then use a variety of different methods to strip the text of identifying information (such as identifiers, contact details, addresses, etc.). Only then do we use the snippets to train our algorithms, and the original text is deleted. In other words, we don’t store any text in a manner that can be associated with your account or used to identify you or anyone else.

We currently offer a feature that permits customers to opt out of this use for Grammarly Business teams of 500 users or more. Please let me know if you might be interested in a license of this size, and I’ll forward your request to the corresponding team.

  • @QuaternionsRock

    It’s all code, the people coding it are 100% capable of programming it to keep track of where the information comes from. Even if it’s transformative, that doesn’t prevent it from keeping track of what was transformed.

    This is a fundamental misunderstanding of how LLMs actually work. Given a list of previous tokens, a complicated set of linear algebra and normalization operations is applied to yield the “probability” (in quotes because this is a dubious application of the word imo) that each known token will follow them. The model is trained using an equally complicated regression algorithm that slowly adjusts the billions of linear algebra coefficients to more closely match the training data. RLHF is then used to make further adjustments that allow the AI to fulfill its intended purpose (e.g., to reinforce the question-answer format expected of ChatGPT).
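The next-token step described above can be sketched in a few lines. This is a toy illustration with a made-up four-token vocabulary and invented scores, not any real model's code: the final normalization in an LLM is a softmax over one raw score ("logit") per known token.

```python
import numpy as np

# Hypothetical vocabulary and the raw scores a trained model might emit
# for "which token comes next" — both are made up for illustration.
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([2.0, 0.5, 1.0, -1.0])

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# The softmax turns the scores into the "probability" distribution
# the comment describes: non-negative, summing to 1.
probs = softmax(logits)
print(dict(zip(vocab, probs.round(3))))
```

Generation then just samples (or picks the argmax) from this distribution and repeats; nothing in the distribution itself records which training examples pushed any given coefficient.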

    You may recall regression from your first statistics class. Even in the case of simple linear regression, when the input consists of millions of data points, it is essentially impossible to determine which point should be “credited” for any given aspect of the output line. The same is true for AI: you could maybe compile a list of training data that makes a token “likely” to appear after another token, but nothing more complex than that. It is very rare for a small set of sources to be responsible for a sequence longer than a few tokens.
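A minimal numeric version of that point, using synthetic data of my own (not from the thread): fit a simple linear regression to many points, drop a single point, and refit. The coefficients barely move, which is exactly why no one point can be "credited" for the output line.

```python
import numpy as np

# Synthetic data: 100,000 points on the line y = 3x + 2, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100_000)
y = 3.0 * x + 2.0 + rng.normal(0, 1, x.size)

# Fit once with all points, then again with one point left out.
slope_all, intercept_all = np.polyfit(x, y, 1)
slope_loo, intercept_loo = np.polyfit(x[1:], y[1:], 1)

# The change from removing one point is negligible.
print(abs(slope_all - slope_loo))
```

Scale the same intuition up to billions of coefficients and a vastly larger training set, and per-source attribution of a model's output becomes even less tractable.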

    I do, however, believe they should be required to provide a very specific list of sources used for training the model. I think it’s ridiculous to claim that generative AI is transformative in a practical sense: I can’t imagine it would be legal for companies to make endless photocopies of copyrighted material and have a computer make fancy scrapbooks out of it, even if “it’s a fledgling industry” or whatever.