Last Friday, OpenAI introduced a new coding system called Codex, designed to perform complex programming tasks from natural language commands. Codex moves OpenAI into a new cohort of agentic coding tools that are just beginning to take shape.
From the early days of GitHub Copilot to modern tools like Cursor and Windsurf, most AI coding assistants have worked as highly intelligent autocomplete. The tools generally live in an integrated development environment, where users interact directly with AI-generated code. The prospect of simply assigning a task and coming back when it's finished is largely out of reach.
These new agentic coding tools, led by products such as Devin, SWE-agent, OpenHands, and the aforementioned OpenAI Codex, are designed so that users never have to look at the code. The goal is to operate like the manager of an engineering team: assign issues through workplace systems such as Asana or Slack, then check in when a solution has been reached.
For those who believe in highly capable forms of AI, it is the next logical step in a natural progression of automation taking over more and more software tasks.
“Initially, people wrote every keystroke of the code themselves,” explains Kilian Lieret, a Princeton researcher and member of the SWE-agent team. “GitHub Copilot was the first product to offer genuine autocomplete; that was stage two. You’re still in the loop, but sometimes you can take a shortcut.”
The goal of agentic systems is to move beyond the developer environment entirely, instead presenting the coding agent with a problem and leaving it to resolve the issue on its own. “We pull things back to the management layer, where I just assign a bug report and the bot tries to fix it completely autonomously,” says Lieret.
It is an ambitious goal, and so far it has proven difficult.
After Devin became generally available at the end of 2024, it drew fierce criticism from YouTube pundits, along with more measured criticism from early clients such as Answer.AI. The overall impression will be familiar to veterans of vibe coding: with so many errors, supervising the model takes as much work as doing the task manually. (Devin’s rocky start hasn’t scared off fundraisers. In March, Devin’s parent company, Cognition AI, reportedly raised hundreds of millions of dollars at a $4 billion valuation.)
Even advocates of the technology caution against unsupervised vibe coding, viewing the new coding agents as powerful elements in a human-supervised development process.
“For now and for the foreseeable future, a human has to step in at code-review time to look over the code that’s been written,” says Robert Brennan, CEO of All Hands AI, which maintains OpenHands. “I’ve seen several people work themselves into a mess by auto-approving all the code the agents write.”
Hallucinations are an ongoing issue as well. Brennan recalls one incident in which, asked about an API released after the OpenHands agents’ training-data cutoff, the agent fabricated API details that fit the description. All Hands AI says it is working on systems to catch these hallucinations before they cause harm, but there is no easy fix.
Perhaps the best measure of progress in agentic programming is the SWE-bench leaderboard, where developers can test their models against a set of unresolved issues from open GitHub repositories. OpenHands currently holds the top spot on the verified leaderboard, solving 65.8 percent of the problem set. OpenAI claims that codex-1, one of the models powering Codex, does better, listing a 72.1 percent score in its announcement, though the score came with some caveats and hasn’t been independently verified.
The concern for many in the tech industry is that high benchmark scores don’t necessarily translate to truly hands-off agentic coding. If an agentic coder can only solve three out of every four problems, it will still require significant oversight from human developers, particularly when tackling complex, multi-stage systems.
As with most AI tools, the hope is that improvements in the foundation models will come at a steady pace, eventually growing agentic coding systems into reliable developer tools. But finding ways to manage the hallucinations and other reliability issues will be crucial to getting there.
“I think it has a somewhat healthy limiting effect,” says Brennan. “The question is how much trust you can transfer to the agents.”