A new AI coding challenge has published its first results, and they are not pretty

A new AI coding challenge has revealed its first winner and set a new bar for AI-powered software engineers.

At 5 p.m. on Wednesday, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We’re glad we built a benchmark that is actually hard,” says Konwinski. “Benchmarks should be hard if they’re going to matter,” he continues. “Scores would be different if the big labs had entered with their biggest models. But that’s kind of the point. K Prize runs offline with limited compute.”

Konwinski has pledged $1 million to the first open source model that can score higher than 90% on the test.

Like the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub to measure how well they can handle real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against benchmark-specific training. For round one, models were due by March 12. The K Prize organizers then built the test using only GitHub issues flagged after that date.
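The timed entry idea can be illustrated with a short sketch. This is not the organizers' code: the `Issue` record, the `contamination_free` helper, the sample issues, and the year in the deadline are all assumptions made for illustration (the article gives only "March 12").

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for a flagged GitHub issue."""
    repo: str
    number: int
    flagged_on: date


def contamination_free(issues: list[Issue], deadline: date) -> list[Issue]:
    """Keep only issues flagged strictly after the submission deadline."""
    return [issue for issue in issues if issue.flagged_on > deadline]


# Illustrative data only; the year is assumed, not taken from the article.
DEADLINE = date(2025, 3, 12)
candidates = [
    Issue("example/repo", 101, date(2025, 2, 28)),  # before the deadline
    Issue("example/repo", 102, date(2025, 3, 12)),  # on the deadline, excluded
    Issue("example/repo", 103, date(2025, 4, 2)),   # after the deadline
]

test_set = contamination_free(candidates, DEADLINE)
print([issue.number for issue in test_set])  # prints [103]
```

Because every submitted model is frozen before any test issue exists, none of the test issues can have appeared in its training data.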

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a top score of 75% on its easier “Verified” test and 34% on its harder “Full” test. Konwinski is still not sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch.

It may seem like an odd place to fall short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”
