We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.
No, that’s not what it is about, and I’m really not sure where you are picking that perspective up. It is discussing the limits on the ability to model the representations, not the inherent ability of the model to classify. Tegmark’s recent interest has been entirely in linear representations of world models in LLMs, such as the other paper he coauthored a few weeks before this one, looking at representations of space and time: Language Models Represent Space and Time
That’s not how they work. You are confusing their training with their operation. They are trained to predict the next token, but how they accomplish that is much more complex and opaque. Training is well understood; operation is not, especially in the largest models. Though Anthropic has been making good headway in the past few months with the perspective of virtual neurons mapped onto the lower-dimensional actual nodes, looking at activations around features instead of individual nodes.
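For a rough picture of what that “features instead of nodes” framing looks like in practice, here is a generic dictionary-learning / sparse-autoencoder sketch on synthetic data. This is not Anthropic’s actual code, and the dimensions, learning rate, and sparsity weight are all made-up placeholders; the point is just the shape of the idea: learn an overcomplete set of directions so each activation vector decomposes into a sparse combination of candidate features.

```python
import torch
import torch.nn as nn

d_act, n_features = 64, 512          # more candidate features than actual activation dimensions
acts = torch.randn(10_000, d_act)    # synthetic stand-in for recorded MLP activations

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_act, n_features)
        self.decoder = nn.Linear(n_features, d_act)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # each unit is one candidate "virtual neuron"
        return self.decoder(features), features

sae = SparseAutoencoder(d_act, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3                                  # sparsity pressure on the feature activations

for step in range(200):
    recon, features = sae(acts)
    # Reconstruct the activations while keeping only a few features active at a time.
    loss = ((recon - acts) ** 2).mean() + l1_weight * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each learned feature direction then gets inspected for interpretability, rather than inspecting the raw neurons one by one.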
It’s definitely not the best and I’m not sure where you got that impression.
All LLM activations are multidimensional. That’s how the networks work, with multidimensional vectors in a virtual network fuzzily mapping onto the underlying network nodes and layers. But you seem to think that because it’s a complex modeling of language relationships, it can’t be modeling world models? I’m not really clear what point you are trying to make here.
Again, there are many papers pointing to how LLMs establish world models abstracted from their input, from the Othello-GPT paper and the follow-up by a DeepMind researcher to Tegmark’s two recent papers. This isn’t an isolated paper but part of a broader trend. To say that this isn’t actually happening means claiming that multiple different researchers across Harvard, MIT, and the institutions leading the development of the tech are all getting it wrong.
And none of the LLM papers these days are peer reviewed, because no one is waiting months to publish in a field moving so quickly that your findings will likely be secondary or uninteresting by the time you publish. For example, both Stanford’s model collapse paper and Are Emergent Abilities of Large Language Models a Mirage? were published to arXiv rather than peer-reviewed journals, and both got a ton of attention, in part because negative takes on LLMs get more press coverage these days. Go ahead and point to an influential LLM paper from the last year published in a peer-reviewed journal and not on arXiv. Even Wei’s CoT paper, probably the most influential of the past two years, was published there.
I could be wrong, I’ll keep reading, thanks for the feedback and the citations.
I would strongly encourage starting with the Othello-GPT work because it strips down a lot of the complexity.
If we had a toy model that was only fed the a, b, and c values from valid Pythagorean triples and evaluated on its ability to predict c given an a and b, it’s pretty obvious that a network that stumbles upon an internal representation of a^2 + b^2 = c^2 and can use it to solve for c would outperform a model that simply built statistical correlations between the various a, b, and c values it saw, right?
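To make that concrete, here is a minimal sketch of the thought experiment (not from any paper, and the memorizer is just a caricature of “statistical correlations”): a predictor that has found the rule a^2 + b^2 = c^2 generalizes perfectly to triples it has never seen, while a pure memorizer that falls back to the nearest triple it was trained on does not.

```python
import math
import random

def make_triples(n, max_m=60):
    """Generate Pythagorean triples (a, b, c) via Euclid's formula."""
    triples = set()
    for m in range(2, max_m):
        for k in range(1, m):
            a, b, c = m * m - k * k, 2 * m * k, m * m + k * k
            triples.add((min(a, b), max(a, b), c))
    triples = sorted(triples)
    random.seed(0)
    random.shuffle(triples)
    return triples[:n]

triples = make_triples(500)
train, test = triples[:400], triples[400:]

# "Rule" model: uses the internal representation a^2 + b^2 = c^2 to solve for c.
def rule_model(a, b):
    return round(math.sqrt(a * a + b * b))

# "Correlation" model: pure memorization, falling back to the c of the nearest (a, b) it has seen.
memory = {(a, b): c for a, b, c in train}
def memorize_model(a, b):
    if (a, b) in memory:
        return memory[(a, b)]
    nearest = min(memory, key=lambda ab: (ab[0] - a) ** 2 + (ab[1] - b) ** 2)
    return memory[nearest]

def accuracy(model, data):
    return sum(model(a, b) == c for a, b, c in data) / len(data)

print("rule model on unseen triples:    ", accuracy(rule_model, test))      # 1.0
print("memorize model on unseen triples:", accuracy(memorize_model, test))  # well below 1.0
```

The training signal never mentions the equation; a network only gets rewarded for predicting c, and representing the rule internally is simply the best way to do that.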
By focusing on a toy model fed only millions of legal Othello moves, they were able to introspect the model that performed best at outputting valid moves and discover it had developed an internal representation of an Othello board within the network, despite never being fed anything that explicitly described or laid one out.
And then that finding was replicated by a separate researcher, who found it was doing this through linear representations.
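For a sense of what “linear representation” means operationally: a linear probe is just a linear classifier trained on frozen activations, and if it can read a feature (like whether a given board square is occupied) straight out of the hidden states, that feature is linearly encoded. Here is a rough sketch with synthetic vectors standing in for the real Othello-GPT activations; it only shows the mechanics of probing, not the actual experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, d_model = 2000, 128

# Pretend hidden states: one direction linearly encodes whether a particular
# board square is occupied, buried in noise standing in for everything else.
square_occupied = rng.integers(0, 2, size=n_samples)           # the "world state" label
encoding_direction = rng.normal(size=d_model)
activations = (
    rng.normal(size=(n_samples, d_model))
    + np.outer(square_occupied - 0.5, encoding_direction)       # linear contribution of the square
)

X_train, X_test, y_train, y_test = train_test_split(
    activations, square_occupied, test_size=0.25, random_state=0
)

# The probe itself is nothing more than a linear classifier on frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out activations:", probe.score(X_test, y_test))
# High accuracy means the feature is linearly readable from the activations,
# which is the kind of result the replication reported for Othello board squares.
```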
Once it clicks that this has been shown, in replicated research, to be possible in a toy model, it becomes easier to process the more difficult efforts at demonstrating that the same thing is happening in smaller LLMs that are far larger and more complex than the toy model (which in turn suggests it is happening in the much larger and more complex SotA LLMs).