OpenAI launched a new model family on Monday called GPT-4.1. Yes, “4.1” – as if the company’s naming conventions weren’t confusing enough already.
The family includes GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, all of which OpenAI says “excel” at coding and instruction following. Available through OpenAI’s API but not ChatGPT, the multimodal models have a 1-million-token context window, meaning they can take in roughly 750,000 words in one go (longer than “War and Peace”).
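Since access is through the API rather than ChatGPT, trying the models means writing a small amount of code. A minimal sketch using OpenAI’s official Python SDK might look like the following; the prompt is illustrative, and it assumes an `OPENAI_API_KEY` environment variable is already set.

```python
# Minimal example call to GPT-4.1 via OpenAI's Python SDK (`pip install openai`).
# The prompt is illustrative; the model names match those announced.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # also available: "gpt-4.1-mini", "gpt-4.1-nano"
    messages=[
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
)

print(response.choices[0].message.content)
```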
GPT-4.1 arrives as OpenAI rivals like Google and Anthropic ratchet up their efforts to build sophisticated programming models. Google recently released Gemini 2.5 Pro, which also has a 1-million-token context window and ranks highly on popular coding benchmarks. So do Anthropic’s Claude 3.7 Sonnet and Chinese AI startup DeepSeek’s upgraded V3.
Training AI coding models that can perform complex software engineering tasks is a goal shared by many tech giants, OpenAI included. OpenAI’s grand ambition is to create an “agentic software engineer,” as CFO Sarah Friar put it at a tech summit in London last month. The company asserts that its future models will be able to program entire apps end-to-end, handling aspects like quality assurance, bug testing, and documentation writing.
GPT-4.1 is a step in this direction.
“We’ve optimized GPT-4.1 for real-world use based on real-world feedback, improving in areas developers care most about: frontend coding, making fewer extraneous edits, following formats reliably, adhering to response structure and ordering, consistent tool usage, and more,” OpenAI said. “These improvements enable developers to build agents that are considerably better at real-world software engineering tasks.”
OpenAI claims that the full GPT-4.1 model outperforms its GPT-4o and GPT-4o mini models on coding benchmarks, including SWE-bench. GPT-4.1 mini and nano are said to be more efficient and faster at the cost of some accuracy, with OpenAI saying GPT-4.1 nano is its fastest and cheapest model yet.
GPT-4.1 costs $2 per million input tokens and $8 per million output tokens. GPT-4.1 mini is $0.40 per million input tokens and $1.60 per million output tokens, while GPT-4.1 nano is $0.10 per million input tokens and $0.40 per million output tokens.
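To get a feel for how those per-million-token rates translate into a bill, here is a rough back-of-the-envelope sketch; the helper function and the token counts in the example are hypothetical, not taken from OpenAI’s documentation.

```python
# Hypothetical cost estimate using the published per-million-token prices (USD).
# The request sizes below are made up for illustration.
PRICES = {
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for a single request."""
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

# Example: a 100,000-token prompt with a 5,000-token reply on the full model.
print(f"${estimate_cost('gpt-4.1', 100_000, 5_000):.4f}")  # roughly $0.24
```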
According to OpenAI’s internal testing, GPT-4.1, which can generate more tokens at once than GPT-4o (32,768 versus 16,384), scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench. (OpenAI noted in a blog post that some solutions to SWE-bench Verified problems couldn’t run on its infrastructure, hence the range of scores.) Those figures are slightly below the scores Google and Anthropic reported for Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%) on the same benchmark, respectively.
In a separate evaluation, OpenAI probed GPT-4.1 using Video-MME, which is designed to measure a model’s ability to “understand” the content of videos. GPT-4.1 reached a chart-topping 72% accuracy in the “long, no subtitles” video category, OpenAI claims.
While GPT-4.1 scores reasonably well on benchmarks and has a more recent knowledge cutoff, giving it a better frame of reference for current events (up to June 2024), it’s important to keep in mind that even some of today’s best models struggle with tasks that wouldn’t trip up experts. For example, many studies have shown that code-generating models often fail to fix, and even introduce, security vulnerabilities and bugs.
OpenAI also acknowledges that GPT-4.1 becomes less reliable (i.e., more likely to make mistakes) the more input tokens it has to deal with. On one of the company’s own tests, OpenAI-MRCR, the model’s accuracy dropped from around 84% with 8,000 input tokens to about 50% at 1 million tokens. GPT-4.1 also tended to be more “literal” than GPT-4o, the company says.