Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

brianpeiris@lemmy.ca · edited 1 day ago


davidgro · 24 hours ago

Those are the high scores.

lath · 23 hours ago

🤔 So this is a visual comparison between peak performance of some humans and peak performance of current LLMs in a controlled environment?

floquant@lemmy.dbzer0.com · 22 hours ago

Is this a gotcha? Not sure where you got the “visual” from, but yes, it is best human performance vs. best LLM performance.

lath · 22 hours ago

I don’t know why you assume there has to be a gotcha; maybe it’s the competitive background… Anyway, it’s visual because you look at it to see it. And it’s not the best human performance vs. best LLM performance, it’s best controlled performance, because the testing is limited to a set of parameters.

floquant@lemmy.dbzer0.com · 22 hours ago

That’s what games are? I really don’t see how it is an unfair comparison to you. How would you change it?

lath · 20 hours ago

Stress test it. Low, average, high, impairment conditions, safeguards off, order, chaos and everything in between.

gnufuu@infosec.pub · edited 19 hours ago

I haven’t read all of their Benchmark introduction and Technical Documentation. I assume you have and didn’t find any of the tests you’re asking for?

EDIT: Their methodology seems to be good enough for the big LLM players:

lath · 14 hours ago

