Meta launched its much-hyped Llama 4 models last weekend, touting significant performance improvements and new multimodal capabilities. But the rollout is not going as planned. What was supposed to mark a new chapter in Meta’s AI playbook is now mired in allegations of benchmark manipulation, triggering a wave of skepticism across the tech community.
Llama 4 hits the scene, then the headlines
Meta has introduced three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth, which is still in training. According to Meta, Scout and Maverick are already available on Hugging Face and Llama.com, and are integrated into Meta AI products across Messenger, Instagram Direct, WhatsApp, and the web.
Scout is a compact model with 17B active parameters and 16 experts that fits on a single NVIDIA H100 GPU. Meta claims it beats Mistral 3.1, Gemini 2.0 Flash-Lite, and Gemma 3 on widely reported benchmarks. Maverick, also a 17B-active-parameter model but with 128 experts, is said to beat GPT-4o and Gemini 2.0 Flash.
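For readers who want to try the released models themselves, here is a minimal sketch of loading Scout through the Hugging Face Transformers library. The repo id below, and the assumption that a standard text-generation pipeline handles this checkpoint, are guesses on our part, not Meta documentation; check the official model card for the exact identifier and access terms.

```python
# Minimal sketch: loading Llama 4 Scout via Hugging Face Transformers.
# The repo id below is an ASSUMPTION; verify it on the official model card.
# The weights are gated, so you must accept Meta's license on Hugging Face first.
from transformers import pipeline

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed repo id

generator = pipeline(
    "text-generation",
    model=MODEL_ID,
    device_map="auto",  # shard across available GPUs (requires `accelerate`)
)

messages = [
    {"role": "user", "content": "In one sentence, what is a mixture-of-experts model?"}
]
print(generator(messages, max_new_tokens=60)[0]["generated_text"])
```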
Towering over both is Llama 4 Behemoth, a 288B-active-parameter model still in training. Meta says Behemoth already outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on a range of STEM benchmarks.
On paper, it’s all impressive. But soon after the release, the questions started piling up.
Benchmark issues
“We developed a new training technique, which we refer to as MetaP, that allows us to reliably set critical model hyperparameters such as per-layer learning rates and initialization scales. Overall, Llama 4 was trained on ten times more multilingual tokens than Llama 3,” Meta said in a blog post.
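Meta hasn’t published MetaP’s internals, but the knob the quote refers to, per-layer learning rates, is a standard idea. As a rough illustration (explicitly not Meta’s method), here is how per-layer learning rates look in plain PyTorch using optimizer parameter groups:

```python
# Rough illustration of per-layer learning rates via PyTorch parameter groups.
# This is NOT MetaP (which Meta has not published); it only demonstrates the
# hyperparameter the quote mentions: giving each layer its own learning rate.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# One parameter group per layer, with a made-up depth-scaled learning rate.
param_groups = [
    {"params": layer.parameters(), "lr": 1e-3 / (depth + 1)}
    for depth, layer in enumerate(model)
    if any(p.requires_grad for p in layer.parameters())  # skip ReLU (no params)
]
optimizer = torch.optim.AdamW(param_groups)
print([group["lr"] for group in optimizer.param_groups])  # one lr per layer
```

The point of MetaP, as Meta describes it, is to set such values reliably rather than by hand-tuning; how it does so remains undisclosed.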
Llama 4 Maverick, the stronger of the two released models, sits at the heart of the controversy. Meta showcased its performance on LM Arena, but critics noticed something strange: the version tested was not the same as the published release. Meta turned out to have benchmarked a custom-tuned variant, leading to claims that the results were inflated.
Ahmad Al-Dahle, Meta’s vice president of generative AI, denied any foul play. He said the company has not trained on test sets and that the inconsistencies are merely platform-specific quirks. Still, the damage was done. Social media erupted, with posters accusing Meta of “benchmark hacking” and of manipulating test conditions to make Llama 4 look stronger than it is.
Behind the accusations
An anonymous user claiming to be a former Meta engineer posted on a Chinese forum that the team behind Llama 4 tuned the post-training dataset to get better scores. That post triggered a firestorm on X and Reddit. Users started connecting the dots: inconsistencies in internal tests, claims that leadership pressured the team to ship despite known issues, and a general sentiment that optics took priority over accuracy.
The term “Maverick tactics” began circulating as shorthand for chasing headline numbers while playing fast and loose with testing protocols.
Meta’s response, and what’s missing
Meta addressed the concerns in an April 7 interview with TechCrunch, calling the accusations false and standing by its benchmarks. However, critics say the company hasn’t provided enough evidence to support its claims: no detailed methodology, no white paper, no access to raw test data. In an industry under increasing scrutiny, that silence has only made things worse.
Why it matters
Benchmarks are a big deal in AI. They help developers, researchers, and companies compare models on neutral ground. But the system isn’t bulletproof: test sets can be overfit and results massaged. That’s why transparency matters. Without it, trust erodes quickly.
According to Meta, Llama 4 offers “best in class” performance, but chunks of the community aren’t buying it right now. And for a company betting big on AI as a core pillar of its future, doubts like these are hard to shake.
The bigger picture
This isn’t just about Meta. Across the AI space, there is growing concern that benchmark results are more about marketing than science. The Llama 4 episode is only the latest example of a company being called out when the numbers don’t add up.
Whether these allegations hold up remains to be seen. For now, it’s Meta’s statements against a flood of speculation. The company has ambitious plans for Llama 4, and the models themselves may well be solid. But the rollout raised more questions than it answered, and those questions won’t go away until there’s more transparency.
Llama 4 could be a big win for Meta. Or it could be remembered as the launch that set off another round of trust issues in AI.