Is ChatGPT proof that standard tests are bad measures of intelligence?

@randoot · 11 months ago

Is ChatGPT proof that standard tests are bad measures of intelligence?

@kromem · edited 11 months ago

Standardized tests were always a poor measure of comprehensive intelligence.

But the idea, popular on Lemmy, that “LLMs aren’t intelligent” seems to be based on a misinformed understanding of LLMs.

At this point there have been multiple replications of the finding that transformers build world models abstracted from the training data and aren’t just relying on surface statistics.

The free version of ChatGPT (what I’m guessing most people have direct experience with) is several-year-old tech that is (and always has been) pretty dumb. But something like Claude 3 Opus is far more capable at critical thinking than GPT-3.5.

A lot of the word problem examples that models ‘fail’ are evaluating the wrong thing. When you give an LLM a variation of a classic word problem, the frequency of the normal form in the training data biases the answer back towards it unless you take measures to break the token similarities. If you do that, though, most modern models actually get the variation completely correct.

So for example, if you ask it to get a vegetarian wolf, a carnivorous goat, and a cabbage across a river, it will mess up even with standard prompting techniques. But if you ask it to get a vegetarian 🐺, a carnivorous 🐐, and a 🥬 across, it will get it correct.
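A minimal sketch of that substitution, in case it helps anyone try it. The function name and the mapping are hypothetical and only cover the three nouns in the example; this just builds the prompt text and does not call any model.

```python
import re

# Hypothetical word-to-emoji mapping; these are the three emoji from the example above.
EMOJI = {"wolf": "🐺", "goat": "🐐", "cabbage": "🥬"}

def break_token_similarity(prompt: str, mapping: dict[str, str] = EMOJI) -> str:
    """Replace each whole-word key (case-insensitive) with its emoji."""
    # Build one alternation pattern such as \b(wolf|goat|cabbage)\b
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(1).lower()], prompt)

prompt = "Get a vegetarian wolf, a carnivorous goat, and a cabbage across a river."
print(break_token_similarity(prompt))
```

The idea is only to make the prompt look less like the classic puzzle at the token level while keeping the meaning; whether that changes the answer depends on the model.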

GPT-3.5 will always fail it, but GPT-4 and more advanced models will get it correct. And recently I’ve started seeing models get it correct even without the emoji substitution, and trip up less on variations in general.

The field is moving rapidly and much of what was true about LLMs a few years ago with GPT-3 is no longer true with modern models.

@okamiueru · edited 11 months ago

I don’t know… I’ve been using ChatGPT-4. I use it only where the accuracy of the knowledge it outputs is not important. It’s good when I need help with language-related things, as more of a writing assistant. Creative stuff is also OK, sometimes even impressive.

With facts? On moderately complicated topics? I’d say it gets something subtly wrong about 80% of the time, and very obviously wrong the other 20%. The latter isn’t the problem.

I don’t understand where the “intelligent” part would even come in. Sure, it requires a fair level of intelligence to understand and generate human language responses. But to me, everything I’ve seen fits one description: generating responses that seem plausible given the input.

If intelligence requires some deeper understanding of the world, of facts and the relationships between them, then I don’t see it. When it looks like that understanding is there, it’s just a coincidence. It’s impressive how often that happens, but that’s still all it is.