- A new OpenAI study using its SimpleQA benchmark shows that even the most advanced AI language models fail more often than they succeed when answering factual questions, with OpenAI’s best model achieving only a 42.7% success rate.
- The SimpleQA test contains 4,326 questions across science, politics, and art, with each question designed to have one clear correct answer. Anthropic’s Claude models performed worse than OpenAI’s, but smaller Claude models more often declined to answer when uncertain (which is good!).
- The study also shows that AI models significantly overestimate their capabilities, consistently giving inflated confidence scores. OpenAI has made SimpleQA publicly available to support the development of more reliable language models.
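For anyone wondering what "one clear correct answer" plus "declining when uncertain" looks like in practice, here's a rough Python sketch of a SimpleQA-style scoring loop. To be clear, this is not OpenAI's actual grader: the two sample questions and the `ask_model` placeholder are made up for illustration, and the exact-match grading is a simplification of the benchmark's correct / incorrect / not-attempted scoring.

```python
# Toy SimpleQA-style scoring loop (illustration only, not OpenAI's grader).
# ask_model() and the sample questions are placeholders; swap in a real model call.

from dataclasses import dataclass

@dataclass
class Item:
    question: str
    answer: str  # the single accepted correct answer

ITEMS = [
    Item("In what year was the Eiffel Tower completed?", "1889"),
    Item("What is the chemical symbol for gold?", "Au"),
]

def ask_model(question: str) -> str:
    """Placeholder for a real model call; replace with your API of choice."""
    return "I don't know"

def grade(predicted: str, expected: str) -> str:
    """Toy grader: exact, case-insensitive match, with a crude check for refusals.
    The real benchmark buckets answers as CORRECT, INCORRECT, or NOT_ATTEMPTED."""
    p = predicted.strip().lower()
    if p in ("", "i don't know", "not sure"):
        return "NOT_ATTEMPTED"
    return "CORRECT" if p == expected.strip().lower() else "INCORRECT"

def main() -> None:
    counts = {"CORRECT": 0, "INCORRECT": 0, "NOT_ATTEMPTED": 0}
    for item in ITEMS:
        counts[grade(ask_model(item.question), item.answer)] += 1

    total = sum(counts.values())
    attempted = counts["CORRECT"] + counts["INCORRECT"]
    print(f"accuracy (overall): {counts['CORRECT'] / total:.1%}")
    if attempted:
        print(f"accuracy (when attempted): {counts['CORRECT'] / attempted:.1%}")
    print(f"declined to answer: {counts['NOT_ATTEMPTED'] / total:.1%}")

if __name__ == "__main__":
    main()
```

The "declined to answer" rate is the interesting part: a model that abstains instead of guessing hurts its overall score but keeps its accuracy-when-attempted honest, which is exactly why the smaller Claude models' behavior gets called out as a good thing above.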
Dude, at least read the part of the article that OP pasted for you in the body of their post.
You need to improve your prompting, otherwise that AI bot won’t follow your instructions.