A new AI coding challenge has revealed its first winner, and it sets a new bar for AI software engineers.
At 5pm on Wednesday, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who received $50,000 for taking the top spot. But more surprising than the win itself was his final score: he answered just 7.5% of the test questions correctly.
“I’m glad we built a benchmark that’s actually difficult,” says Konwinski. “Benchmarks should be hard if they’re going to matter,” he continues. “Scores would be different if a big lab entered with its biggest model, but that’s kind of the point: the K Prize runs offline with limited compute.”
Konwinski has pledged $1 million to the first open-source model that can score more than 90% on the test.
Like the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub to see how well they can handle real-world programming problems. But while SWE-Bench is based on a fixed set of issues that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
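To make the timed-entry idea concrete, here is a minimal sketch (not the K Prize organizers' actual pipeline) of how an evaluation set could be restricted to issues opened after a cutoff date, using GitHub's public search API; the repository, token, and cutoff values are illustrative assumptions.

```python
# Hypothetical sketch of a "contamination-free" collection step: keep only
# GitHub issues created after the model-submission deadline, so no frozen
# model could have trained on them. Not the K Prize's real code.
from datetime import date

import requests

CUTOFF = date(2025, 3, 12)  # round-one submission deadline (per the article)


def fetch_post_cutoff_issues(repo: str, token: str) -> list[dict]:
    """Return issues in `repo` opened strictly after the cutoff date."""
    query = f"repo:{repo} is:issue created:>{CUTOFF.isoformat()}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 100},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # Issues gathered this way postdate every submitted model, which is what
    # makes the resulting benchmark contamination-free in the sense above.
    return resp.json()["items"]
```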
The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a top score of 75% on its easier “Verified” test and 34% on its harder “Full” test. Konwinski says it remains to be seen whether the disparity is due to contamination on SWE-Bench or to the challenge of collecting fresh issues from GitHub, but he hopes the K Prize project will answer that question soon.
“As we get more runs of this, we’ll have a better sense,” he told TechCrunch.
It might seem like a strange place to fall short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.
“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor. “Without such experiments, we can’t actually tell whether the issue is contamination, or just targeting the SWE-Bench leaderboard with a human in the loop.”
For Konwinski, it’s not just a better benchmark but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors, AI lawyers, and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”