Study finds that ChatGPT will cheat when given the opportunity and lie to cover it up later.

@yesman · edit-2 1 year ago

Study finds that ChatGPT will cheat when given the opportunity and lie to cover it up later.

@kromem · 1 year ago

I would strongly encourage starting with the Othello-GPT work because it strips down a lot of the complexity.

Suppose we had a toy model that was fed only the a, b, and c from valid Pythagorean equations and was evaluated on its ability to predict c given a and b. It’s pretty obvious that a network that stumbles upon an internal representation of a^2 + b^2 = c^2, and can use it to solve for c, would outperform a model that simply built statistical correlations between the various a, b, and c values, right?
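To make the Pythagorean thought experiment concrete, here is a rough Python sketch. It is not a neural network: `rule_predict` stands in for a model that has found the actual relationship, and `make_memorizer` is a nearest-neighbor lookup standing in for "statistical correlations". Both names, the side limit of 200, and the 75/25 split are assumptions made up for illustration.

```python
import math
import random


def make_triples(limit):
    """Return all integer (a, b, c) with a <= b, c <= limit and a^2 + b^2 = c^2."""
    triples = []
    for a in range(1, limit + 1):
        for b in range(a, limit + 1):
            c = math.isqrt(a * a + b * b)
            if c <= limit and c * c == a * a + b * b:
                triples.append((a, b, c))
    return triples


def rule_predict(a, b):
    # Stand-in for a model that internally represents a^2 + b^2 = c^2.
    return math.sqrt(a * a + b * b)


def make_memorizer(train):
    # Stand-in for surface statistics: answer with the c of the closest
    # (a, b) pair seen during training.
    def predict(a, b):
        nearest = min(train, key=lambda t: (t[0] - a) ** 2 + (t[1] - b) ** 2)
        return nearest[2]
    return predict


def mean_abs_error(predict, data):
    return sum(abs(predict(a, b) - c) for a, b, c in data) / len(data)


random.seed(0)
triples = make_triples(200)
random.shuffle(triples)
split = len(triples) * 3 // 4
train, test = triples[:split], triples[split:]

rule_error = mean_abs_error(rule_predict, test)
memo_error = mean_abs_error(make_memorizer(train), test)
print(f"rule error on unseen triples:     {rule_error:.3f}")
print(f"memorizer error on unseen triples: {memo_error:.3f}")
```

On triples it has never seen, the rule is exact while the lookup is only as good as its nearest memorized neighbor, which is the gap the argument relies on.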

By focusing on a toy model fed only millions of legal Othello moves, they were able to inspect the model that performed best at outputting valid moves and discover that it had developed an internal representation of an Othello board in the network, despite never being fed anything that explicitly described or laid one out.

That finding was then replicated by a separate researcher, who found the model was doing this through linear representations.
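For anyone unsure what a "linear representation" means in practice, here is a minimal sketch of a linear probe on purely synthetic data. Nothing here comes from Othello-GPT itself: the 8-dimensional "activations", the fixed `direction`, the noise level, and the mean-difference probe are all assumptions chosen to show the idea that a hidden state (say, whether one square is "mine" or "theirs") can be read off with a single weighted sum.

```python
import random

random.seed(0)

# Hypothetical direction in activation space along which the state is written.
direction = [1.0, -0.5, 0.8, 0.0, -1.0, 0.3, 0.6, -0.7]


def sample(n):
    """Synthetic (activation, state) pairs: activation = state * direction + noise."""
    data = []
    for _ in range(n):
        state = random.choice([-1, 1])
        act = [state * d + random.gauss(0.0, 0.3) for d in direction]
        data.append((act, state))
    return data


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_probe(data):
    # Mean-difference probe: weights point from the -1 class mean to the +1
    # class mean, and the bias puts the decision boundary at their midpoint.
    pos = mean_vector([x for x, s in data if s == 1])
    neg = mean_vector([x for x, s in data if s == -1])
    weights = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


probe_train, probe_test = sample(500), sample(200)
weights, bias = fit_probe(probe_train)
correct = sum(probe_predict(weights, bias, x) == s for x, s in probe_test)
accuracy = correct / len(probe_test)
print(f"probe accuracy on held-out samples: {accuracy:.2f}")
```

If a probe this simple recovers the state from real activations, that is evidence the network encodes it linearly; here it works by construction, so treat it only as an illustration of the method.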

Once it clicks that replicated research has shown this to be possible in a toy model, it becomes easier to process the harder efforts to demonstrate that the same thing is happening in larger and more complex, though still small, LLMs (which in turn suggests it is happening in the much larger and more complex SotA LLMs).