Large language models (LLMs) landed on Europe’s digital sovereignty agenda last week, as news emerged of a new programme to develop a series of “truly” open source LLMs covering all European Union languages.
This includes the current 24 official EU languages, as well as the languages of countries currently negotiating entry to the EU market, such as Albania. Future-proofing is the name of the game.
OpenEuroLLM is a collaboration between around 20 organizations, co-led by Jan Hajič, a computational linguist at Charles University in Prague, and Peter Sarlin, CEO and co-founder of Finnish AI lab Silo AI, which AMD acquired for $665 million last year.
The project fits a broader narrative that has seen Europe push digital sovereignty as a priority, bringing mission-critical infrastructure and tools closer to home. While most of the cloud giants have invested in local infrastructure to ensure that EU data stays local, AI darling OpenAI recently announced a new product that allows customers to process and store data in Europe.
Elsewhere, the EU recently signed an $11 billion deal to create a sovereign satellite constellation comparable to Elon Musk’s Starlink.
So OpenEuroLLM is certainly on-brand.
However, the stated budget solely for building the models themselves is 37.4 million euros, with about 20 million euros coming from the EU’s Digital Europe programme. This is a drop in the ocean compared to what the giants of the corporate AI world are investing. The actual budget is larger once the funds allocated for tangential and related work are factored in, and compute is arguably the largest cost. Partners in the OpenEuroLLM project include EuroHPC supercomputer centres in Spain, Italy, Finland and the Netherlands, and the broader EuroHPC project has a budget of around 7 billion euros.
However, the sheer number of disparate participating parties, spanning academia, research and corporations, has led many to question whether the goal is achievable. Anastasia Stasenko, co-founder of LLM company Pleias, questioned whether “a vast consortium of over 20 organizations” could have the same measured focus as a homegrown private AI company.
“Europe’s recent successes in AI shine through small, focused teams such as Mistral AI and LightOn,” Stasenko said. “They carry immediate responsibility for their choices, whether in finances, market positioning or reputation.”
Up to scratch
The OpenEuroLLM project is either starting from scratch or has a head start, depending on how you look at it.
Since 2022, Hajič has also coordinated the High Performance Language Technologies (HPLT) project, which aims to develop free and reusable datasets, models and workflows using high-performance computing (HPC). That project is expected to close in the second half of 2025, but according to Hajič it can be considered a “predecessor” to OpenEuroLLM, given that most of HPLT’s partners (except for the UK partners) are also participating here.
“This [OpenEuroLLM] is really a broader participation, but it’s more focused on generative LLMs,” Hajič said. “So it’s not starting from scratch in terms of data, expertise, tools and compute experience. We’ve brought together people who know what they’re doing. We should be able to get up to speed quickly.”
Hajič said he expects the first version to be released by mid-2026, with the final iteration arriving by the project’s conclusion in 2028. For now, though, there is little to look at beyond a bare-bones GitHub profile.
“In that respect, we are starting from scratch. The project began on Saturday [February 1],” Hajič said. “But we’ve been preparing the project for a year [the tender process opened in February 2024].”
From academia and research, organizations spanning the Czech Republic, the Netherlands, Germany, Sweden, Finland and Norway are part of the OpenEuroLLM cohort, in addition to the EuroHPC centres. From the corporate world, Finland’s AMD-owned AI lab Silo AI is on board, as are Aleph Alpha (Germany), Ellamind (Germany), Prompsit Language Engineering (Spain) and LightOn (France).
One notable omission from the list is French AI unicorn Mistral, which has positioned itself as an open source alternative to incumbents such as OpenAI. No one at Mistral responded to TechCrunch’s request for comment, but Hajič confirmed that he tried to start a conversation with the startup, to no avail.
“I tried to approach them, but it hasn’t resulted in a focused discussion about their participation,” Hajič said.
The project could still attract new participants as part of the EU programme that is providing the funding, though it is limited to EU organizations. This means that UK and Swiss entities will not be able to participate. This stands in contrast to the Horizon R&D programme, which the UK rejoined in 2023 after a long Brexit stalemate and which funded HPLT.
Build up
The project’s top-line goal, as per its tagline, is to create “a series of foundation models for transparent AI in Europe.” Furthermore, these models must preserve the “linguistic and cultural diversity” of all EU languages, both current and future.
What this means in terms of deliverables is still being ironed out, but it will likely mean a core multilingual LLM designed for general-purpose tasks where accuracy is most important. There will also be smaller “quantized” versions, perhaps for edge applications where efficiency and speed are more important.
“This is something we still have to plan in detail,” Hajič said. “We want it to be as small but as high quality as possible. From a European perspective, the stakes are high and there’s a lot of money coming from the European Commission, so it’s public money. Because of this, we don’t want to release something half-baked.”
The goal is to make the model as proficient as possible in all languages, but achieving full parity across them could be difficult.
“That’s the goal, but the question is how successful we can be with languages that lack digital resources,” Hajič said. “But that’s also why we want to have true benchmarks for these languages, and not be swayed toward benchmarks that are perhaps not representative of the languages and the culture behind them.”
As for data, this is where much of the work of the HPLT project will prove fruitful, with version 2.0 of its dataset released four months ago. The dataset was built from 4.5 petabytes of web crawls and more than 20 billion documents, and Hajič said the project will add data from Common Crawl (an open repository of web-crawled data) to the mix.
Open source definition
In traditional software, the perennial struggle between open source and proprietary revolves around the “true” meaning of “open source.” This can be resolved by deferring to the formal “definition” maintained by the Open Source Initiative (OSI), the industry steward of what counts as a legitimate open source license.
More recently, the OSI has formulated a definition of “open source AI,” but not everyone is happy with the outcome. Open source AI proponents argue that not only the models but also the datasets, pretrained models and weights, the full shebang, should be available. The OSI’s definition does not make training data mandatory, because AI models are often trained on proprietary data or on data with redistribution restrictions.
Suffice it to say that OpenEuroLLM faces these same difficulties, and despite its intention to be “truly open,” it will likely have to make some compromises in order to fulfill its “quality” obligations.
“The goal is to have everything open. Now, of course, there are some limitations,” Hajič said. “We want to have the best possible models, and based on the European copyright directive we can use whatever we can get. Some of it cannot be redistributed, but some of it can be stored for future inspection.”
What this means is that the OpenEuroLLM project may need to keep some of the training data under wraps, while making it available to auditors on request, as required for high-risk AI systems under the terms of the EU AI Act.
“We hope that most of the data [will be open], especially the data that comes from Common Crawl,” Hajič said. “We want it all to be fully open, but we will see. Either way, we have to comply with AI regulations.”
Two for one
Another criticism that emerged in the aftermath of OpenEuroLLM’s official announcement was that a very similar project had launched in Europe just a few months earlier. EuroLLM, which launched its first model in September and a follow-up in December, is co-funded by the EU alongside a consortium of nine partners. These include academic institutions such as the University of Edinburgh, and companies such as Unbabel, which last year won millions of GPU training hours on EU supercomputers.
EuroLLM shares similar goals with its near-namesake: “To build an open source European large language model that supports 24 official European languages, and a few other strategically important languages.”
Andre Martins, head of research at Unbabel, took to social media to highlight these similarities. “We hope that the different communities will collaborate openly, share their expertise and not decide to reinvent the wheel every time a new project is funded,” Martins wrote.
Hajič called the situation “unfortunate” and said he hoped the two projects could cooperate, though he emphasized that because of its EU funding source, OpenEuroLLM is limited in terms of cooperation with non-EU entities, including UK universities.
Funding gap
The arrival of China’s DeepSeek, and the cost-to-performance ratio it promises, has given some encouragement that AI initiatives might be able to do far more with much less than originally thought. However, over the past few weeks, many have questioned the real costs involved in building DeepSeek.
“As for DeepSeek, we really don’t know much about what exactly went into building it,” Peter Sarlin, technical co-lead of the OpenEuroLLM project, told TechCrunch.
Regardless, Sarlin believes that OpenEuroLLM will have sufficient funding available, as it is mostly needed to cover people. Indeed, a large share of the cost of building AI systems is compute, and most of that should be covered through the partnership with the EuroHPC centres.
“I would say OpenEuroLLM actually has quite a significant budget,” Sarlin said. “EuroHPC has invested billions in AI and compute infrastructure, and has committed billions more to expand it over the next few years.”
It is also worth noting that the OpenEuroLLM project is not building toward a consumer- or enterprise-grade product. It’s purely about the models, which is why Sarlin thinks the budget it has should be ample.
“The intention here is not to build a chatbot or an AI assistant. That would be a product initiative that requires a lot of effort, and that’s what ChatGPT did so well,” Sarlin said. “What we are contributing is an open source foundation model that acts as the AI infrastructure for European companies to build on. We know what it takes to build models, and it’s not something you need billions for.”
Since 2017, Sarlin has led AI lab Silo AI, which has launched, in collaboration with others including the HPLT project, the Poro and Viking family of open models. Although these already support several European languages, the company is now preparing the next-iteration “Europa” models, which will cover all European languages.
And this ties in with the whole “not starting from scratch” notion that Hajič espouses: there is already a foundation in place, in terms of both expertise and technology.
Sovereign nation
As critics have pointed out, OpenEuroLLM has many moving parts, which Hajič acknowledges, albeit with a positive outlook.
“I’ve been involved in a lot of collaborative projects, and I think it has its advantages over a single company,” Hajič said. “Of course they’ve done great things at the likes of OpenAI and Mistral, but I hope that the combination of academic expertise and corporate focus can bring something new.”
And in many ways, it’s not about trying to beat Big Tech or billion-dollar AI startups. The ultimate goal is digital sovereignty: (mostly) open foundation LLMs built by Europe.
“I hope this won’t be the case, but if in the end we have a ‘good’ model rather than the number one model, then we will still have a model with all its components based in Europe,” Hajič said. “This will be a positive outcome.”