Study finds that ChatGPT will cheat when given the opportunity and lie to cover it up later.

@yesman · 1 year ago

Study finds that ChatGPT will cheat when given the opportunity and lie to cover it up later.

@kromem · 1 year ago

I said, as the paper stated, the model is encoding trueness into its internal weights during training.

So how is this not what I originally said, that LLMs are capable of abstracting the concepts of truth vs falsehood into linear representations? Which again, is the key point of the paper:

Probes trained on likely have some effect, but it is small and inconsistent. For instance, in the false→true case, intervening along the logistic regression direction of likely has the opposite of the intended effect, so we leave it unreported. This reinforces our case that LLMs represent truth and not only text likelihood. […]

In this work we conduct a detailed investigation of the structure of LLM representations of truth. Drawing on simple visualizations, correlational evidence, and causal evidence, we find strong reason to believe that there is a “truth direction” in LLM representations.
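For anyone unsure what "probe" and "intervening along the logistic regression direction" mean in those quotes, here is a rough toy sketch in plain Python. It is not the paper's code and uses no real model: the activations are synthetic, and `TRUE_DIR`, `make_activation`, `train_probe`, and `intervene` are hypothetical names made up for illustration. The idea is just that a linear probe is fit to separate "true" from "false" activations, and then a "false" activation is pushed along the learned direction to see whether the probe's readout flips (the false→true case).

```python
import math
import random

random.seed(0)

DIM = 4
# Assumption: a planted "truth direction". In a real LLM this is unknown
# and is exactly what the probe is trying to recover.
TRUE_DIR = [1.0, -0.5, 0.0, 0.25]


def make_activation(is_true):
    """Toy stand-in for a hidden state: Gaussian noise shifted along TRUE_DIR."""
    sign = 1.0 if is_true else -1.0
    return [random.gauss(0.0, 0.5) + sign * d for d in TRUE_DIR]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit a logistic regression probe with plain stochastic gradient descent."""
    w = [0.0] * DIM
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = sigmoid(dot(w, x) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def intervene(x, w, alpha):
    """Shift an activation by alpha along the unit-normalized probe direction."""
    norm = math.sqrt(dot(w, w))
    return [xi + alpha * wi / norm for xi, wi in zip(x, w)]


# Balanced synthetic dataset: label 1 = "true" statement, 0 = "false".
labels = [i % 2 for i in range(200)]
acts = [make_activation(bool(y)) for y in labels]
w, b = train_probe(acts, labels)


def predict(x):
    return sigmoid(dot(w, x) + b) > 0.5


# Correlational check: how well does the probe separate the two classes?
accuracy = sum(predict(x) == bool(y) for x, y in zip(acts, labels)) / len(acts)

# Causal-style check (false -> true): push "false" activations along the
# probe direction and count how many the probe now reads as "true".
false_acts = [x for x, y in zip(acts, labels) if y == 0]
flipped = sum(predict(intervene(x, w, 4.0)) for x in false_acts) / len(false_acts)

print("probe accuracy:", accuracy)
print("fraction flipped by intervention:", flipped)
```

One caveat on the toy: here the flip is measured with the probe itself, which is close to circular. The paper's causal claim is stronger, since it looks at how the model's own output changes after the intervention.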