Fyself News

Did xAI lie about Grok 3’s benchmarks?

By user, February 22, 2025

Debates over AI benchmarks, and over how AI labs report them, are spilling into public view.

This week, OpenAI employees accused xAI, Elon Musk’s AI company, of publishing misleading benchmark results for its latest AI model, Grok 3. xAI co-founder Igor Babushkin insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and earlier versions of the test are commonly used to probe a model’s mathematical ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. However, the graph did not include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64? It stands for “consensus@64,” and it basically gives a model 64 tries to answer each question in a benchmark, then takes the answer generated most frequently as the final answer. As you can imagine, cons@64 tends to boost a model’s benchmark score significantly, and omitting it from a graph can make it appear as though one model surpasses another when that is not actually the case.
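To make the difference concrete, here is a minimal Python sketch of the two scoring rules. It is an illustration only, not xAI’s or OpenAI’s evaluation code: the function names and the toy answers are hypothetical, and it uses five attempts per question rather than 64 to stay readable.

```python
from collections import Counter


def consensus_answer(attempts):
    """Return the answer that appears most often among the attempts."""
    return Counter(attempts).most_common(1)[0][0]


def first_attempt_accuracy(attempts_per_question, correct_answers):
    """'@1' style score: only the first attempt on each question counts."""
    hits = sum(
        attempts[0] == truth
        for attempts, truth in zip(attempts_per_question, correct_answers)
    )
    return hits / len(correct_answers)


def consensus_accuracy(attempts_per_question, correct_answers):
    """'cons@k' style score: the majority answer on each question counts."""
    hits = sum(
        consensus_answer(attempts) == truth
        for attempts, truth in zip(attempts_per_question, correct_answers)
    )
    return hits / len(correct_answers)


# Hypothetical toy data: 3 questions, 5 attempts each (cons@64 would use 64).
attempts_per_question = [
    ["204", "204", "113", "204", "204"],  # first attempt right, majority right
    ["17", "42", "42", "42", "9"],        # first attempt wrong, majority right
    ["8", "8", "65", "8", "65"],          # first attempt wrong, majority wrong
]
correct_answers = ["204", "42", "65"]

at_1 = first_attempt_accuracy(attempts_per_question, correct_answers)
cons = consensus_accuracy(attempts_per_question, correct_answers)
print(f"@1: {at_1:.2f}, cons@5: {cons:.2f}")  # prints "@1: 0.33, cons@5: 0.67"
```

On this toy data the same set of attempts scores twice as high under the consensus rule as under the first-attempt rule, which is why a chart that mixes the two settings across models is hard to read fairly.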

The AIME 2025 scores at “@1” for Grok 3 Reasoning Beta and Grok 3 mini Reasoning (meaning the first score the models got on the benchmark) fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is promoting Grok 3 as “the smartest AI in the world.”

Babushkin argued that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A participant in the discussion put together a more “precise” graph showing the performance of almost every model at cons@64:

Hilarious how some people see my plot as an attack on OpenAI and others see it as an attack on Grok.
(I actually believe Grok looks good there, and OpenAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/djqljpcjh8 pic.twitter.com/3wh8foufic

– Teortaxes▶️ (deepseek special🐋Kiro 2023–∞) (@teortaxestex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost each model incurred to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations and their strengths.



