High-performance large language models for Europe

March 6, 2026

The High-Performance Language Technologies (HPLT) project is developing very large-scale multilingual resources for large language models and machine translation.

Massive text collections for pre-training are the ‘crude oil’ of the large language model (LLM) era. ‘Refining’ high-quality datasets from web data at scale presupposes computational infrastructure and technological muscle more characteristic of corporate environments, as evidenced by some notable generally available pre-training datasets: C4 [1], FineWeb 1 & 2 [2, 3], MADLAD-400 [4], and Nemotron-CC [5]. With a few notable exceptions, this line of work tends to capitalise on the English language.

Here, we present the open-source results [6, 9, 10] of the European R&D consortium HPLT, a project funded under the auspices of the Horizon Europe programme in 2022–2025. Alongside a host of additional results, HPLT has produced massive pre-training datasets of high-quality texts in close to 200 distinct language–script combinations. Its 2025 monolingual data release, HPLT 3.0, comprises some 30 trillion sub-word tokens in total, of which close to half represent languages other than English. We make this resource publicly available under the most permissive terms of use possible. We further share a state-of-the-art, open-source data preparation pipeline, an innovative multilingual evaluation framework, and hundreds of language models pre-trained on HPLT data.

Fig. 1: The HPLT data curation pipeline (described under ‘Data curation’ below)

Furthermore, the project has produced novel bilingual datasets for more than 50 language pairs, hundreds of associated machine translation models, and open-source pipelines for data preparation, model training, and evaluation, and it has synthesised additional pre-training data for underrepresented languages by machine-translating very high-quality English documents. In our view, it is the totality of these generally available, very large-scale resources and the documentation of the underlying processes that holds the promise of ‘democratising’ the current LLM and MT landscape.

Organisation

The HPLT consortium comprised partners from across Europe: five universities (Charles University in Prague and the Universities of Edinburgh, Helsinki, Oslo, and Turku), two national HPC centres (CESNET in the Czech Republic and Sigma2 in Norway), and a language engineering company (Prompsit). The project received about €4.1m from the Horizon Europe programme and £960,000 from UK Research and Innovation, and ran from September 2022 through December 2025. It was coordinated by Jan Hajič (Charles University), with technical coordination by Kenneth Heafield (Edinburgh) in its first half and Stephan Oepen (Oslo) in its second.

Data curation

HPLT has gathered and processed more than ten petabytes of raw web data. The project has released more than 30 trillion tokens (word-like units) of high-quality textual data, accompanied by rich metadata, for close to 200 distinct languages. The process of extracting, cleaning, annotating, and filtering texts from raw web archives, composed of about a dozen modules, is schematically depicted in Fig. 1.

Raw web archives were drawn from three sources: the Internet Archive (IA), host of the iconic Wayback Machine; the non-profit Common Crawl Foundation (CC); and the ArchiveBot volunteer infrastructure for long-term web archiving. Sub-tasks such as the extraction of ‘running text’ from marked-up document formats, language identification at the document and paragraph levels, ‘fuzzy’ near-deduplication, annotation with a wealth of text-quality and regulatory-compliance signals, and final filtering based on all available information each directly impact the practical utility of the final datasets. Here, text quality and overall volume are separate and typically antithetical dimensions of optimisation, creating a rich space of design choices and trade-offs; this remains an active area of research. The open-source HPLT processing pipelines are highly flexible and parameterisable, with default values representing the current state of knowledge.
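To make the shape of such a pipeline concrete, here is a minimal, self-contained Python sketch of document-level filtering with crude near-deduplication. All names and thresholds are illustrative assumptions; the actual HPLT pipeline is far richer and is available from the project's repositories.

```python
import hashlib
import re
from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str
    lang: str         # document-level language prediction
    lang_conf: float  # classifier confidence

def near_dedup_key(text: str, n: int = 8) -> str:
    # Crude stand-in for MinHash-style near-deduplication: normalise
    # whitespace/case and hash the first n five-word "shingles".
    words = re.sub(r"\s+", " ", text.lower()).split()
    shingles = [" ".join(words[i:i + 5]) for i in range(0, min(len(words), n * 5), 5)]
    return hashlib.sha1(" ".join(shingles).encode()).hexdigest()

def keep(doc: Document, min_chars: int = 500, min_conf: float = 0.8) -> bool:
    # Document-level quality gates, loosely mirroring the kinds of
    # signals described above (length, language-ID confidence).
    return len(doc.text) >= min_chars and doc.lang_conf >= min_conf

def filter_corpus(docs):
    seen = set()
    for doc in docs:
        if not keep(doc):
            continue
        key = near_dedup_key(doc.text)
        if key in seen:  # drop near-duplicates
            continue
        seen.add(key)
        yield doc
```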

Monolingual statistics

To put the HPLT monolingual data into perspective, Table 1 (below) presents document and token counts (see note) for the English and multilingual (non-English) partitions of the data, as well as counts for a small sample of individual languages. For ease of comparison, these statistics are accompanied by average document lengths and per-language proportions, and contrasted with corresponding figures for the three other publicly available multilingual datasets mentioned above.

Table 1. Note: for the purpose of comparable statistics across languages and datasets, all token counts are computed using the Gemma-3 tokenizer [8], a SentencePiece model with a vocabulary of 256K sub-words that provides good coverage of all target languages.
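As a hedged illustration of how such comparable counts can be produced, the sketch below tallies sub-word tokens with a Hugging Face tokenizer; the checkpoint name is an assumption, and any SentencePiece tokenizer with broad multilingual coverage would serve the same comparative purpose.

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; substitute the Gemma-3 tokenizer
# actually used for Table 1 (or any broadly multilingual SentencePiece model).
tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

def count_tokens(documents):
    # Sum sub-word token counts over an iterable of raw text documents.
    return sum(len(tok(doc)["input_ids"]) for doc in documents)

print(count_tokens(["Hei maailma!", "Hallo verden!"]))  # a tiny toy corpus
```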

As is evident from these numbers, HPLT 3.0 is by far the largest publicly available dataset of its kind, and its multilingual breadth compares favourably to other widely used resources. In Gemma-3 tokens, the multilingual HPLT 3.0 partition is about two and three times larger than FineWeb and the earlier HPLT 2.0, respectively, and five times larger than the older MADLAD-400. In terms of average document length, which is often correlated with text quality, HPLT 3.0 and 2.0 pattern alike, markedly ahead of FineWeb but well behind MADLAD-400. For a small selection of European languages, the table shows available data ranging from a ‘mere’ billion tokens to hundreds of billions.

In-depth analytics

Training data quality is arguably the most important factor in model quality, but in-depth data inspection at scale is a challenging endeavour. HPLT has developed an open-source tool, HPLT Analytics, to compute a broad range of fine-grained statistics and to enable interactive visualisation and exploration. The datasets are internally structured into documents, paragraph-like segments, and tokens. Descriptive frequency and length statistics, combined with basic correlation analysis against metadata such as internet domains or predicted text-register labels, can reveal distributional trends and outliers. Annotations are predominantly available at the document level, but in some cases also for smaller units. Contrasting the distributions of document-level versus segment-level language predictions, for example, offers insight both into the degree of in-document ‘code switching’ and into uncertainty in language identification, typically among closely related languages.
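The document-versus-segment comparison lends itself to a compact illustration. The sketch below assumes per-document and per-segment language labels are available as plain strings and computes an agreement ratio plus a profile of disagreeing labels; HPLT Analytics itself computes far more.

```python
from collections import Counter

def language_agreement(doc_lang: str, segment_langs: list[str]) -> float:
    # Fraction of segment-level language predictions that agree with the
    # document-level prediction; low values flag code switching or
    # uncertain language identification.
    if not segment_langs:
        return 1.0
    return sum(l == doc_lang for l in segment_langs) / len(segment_langs)

def disagreement_profile(docs):
    # docs: iterable of (doc_lang, [segment_langs]) pairs. Tally which
    # languages the disagreeing segments are assigned to, typically
    # revealing confusions among closely related languages.
    profile = Counter()
    for doc_lang, seg_langs in docs:
        profile.update(l for l in seg_langs if l != doc_lang)
    return profile
```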

Multilingual evaluation

As an additional tool to gauge data quality and experimentally inform design choices in training data preparation (as well as in language model training), the project has developed a framework for automated large-scale multilingual evaluation, dubbed HPLT-e. In its current state of development, the framework comprises 127 language understanding and generation tasks across the nine European languages highlighted in Table 1.

This selection reflected both the availability of native speakers in the project team and a minimum level of diversity in terms of language resources, families, and scripts. Tasks in HPLT-e are often drawn from pre-existing benchmark suites, but with an emphasis on natively constructed (rather than translated) tasks, and each task is extended with three to seven human-written prompts to mitigate the methodological challenge of prompt sensitivity. Similar to Penedo et al. [2, 3], we pre-train separate ‘smallish’ (2B-parameter) GPT-like models per language using an otherwise fixed pre-training setup and evaluate them at regular checkpoint intervals in a zero-shot regime, carefully selecting tasks that meet a range of evaluation-signal criteria, i.e. that can be expected to act as informative and reliable indicators of training data quality. Such criteria include monotonicity and relative stability of model performance as pre-training progresses, ranking consistency across pre-training intervals, and multiple indicators of limited prompt sensitivity.

Fig. 2 shows a comparison of the four datasets introduced above using HPLT-e. To aggregate scores across different prompts, tasks, and languages, per-task scores are maximised across prompts and min-max normalised relative to a task-specific random baseline. They are then averaged across task categories within each language and, finally, across languages. An alternative overall aggregation is the Borda count, computed with Vote’n’Rank [7], which is essentially the average of per-language counts of a model outranking all the others. Models trained on all four datasets for up to 100B tokens show monotonic performance improvement on our selected tasks. Models pre-trained on the (comparatively smaller) MADLAD-400 achieve the highest multilingual score, followed by HPLT 3.0, while HPLT 2.0 and FineWeb perform on par. These results are corroborated by rank-based aggregation across tasks and languages, which yields the same ordering: MADLAD-400, then HPLT 3.0, then HPLT 2.0 and FineWeb.
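For concreteness, this aggregation scheme can be sketched in a few lines of Python. The data layout is an assumption, per-category averaging is simplified to a direct per-task average, scores are assumed to lie in [0, 1], and the rank-based function follows the description above rather than the exact Vote’n’Rank implementation.

```python
import numpy as np

def aggregate(results, baselines):
    # results[lang][task] -> list of per-prompt scores for one model;
    # baselines[task]     -> that task's random-baseline score (in [0, 1]).
    per_lang = []
    for tasks in results.values():
        per_task = [
            (max(scores) - baselines[t]) / (1.0 - baselines[t])  # max over prompts,
            for t, scores in tasks.items()                       # min-max vs. baseline
        ]
        per_lang.append(np.mean(per_task))  # average within a language
    return float(np.mean(per_lang))         # then across languages

def borda(per_lang_scores):
    # per_lang_scores[model][lang] -> aggregated score. A model earns one
    # point per competitor it outranks in a language; points are averaged
    # across languages (a simplified reading of the Borda-style count).
    models = list(per_lang_scores)
    langs = list(next(iter(per_lang_scores.values())))
    return {
        m: sum(
            per_lang_scores[m][l] > per_lang_scores[o][l]
            for l in langs for o in models if o != m
        ) / len(langs)
        for m in models
    }
```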

Language models

While training data creation has taken centre stage in the HPLT work plan, the project has also developed a wealth of language models of different sizes and architectures supporting various languages and language groups.

In addition to large language models trained from scratch for Finnish and Norwegian, a common theme in this work was a strong emphasis on smaller, specialised models that are efficient to run. In total, the publicly available project results comprise hundreds of language models, including the following sub-groups (a brief loading sketch follows the list):

  • 55 monolingual encoder-only (BERT-like) models for a typologically diverse set of languages. When fine-tuned as embedders for ‘classic’ language understanding tasks, these models uniformly show superior performance to standard multilingual models.
  • 57 monolingual encoder–decoder (T5-like) models, again for a typologically broad set of languages. These models exhibit competitive performance in both embedding and generation benchmarks, thus offering a novel platform for experimentation.
  • 38 monolingual decoder-only (GPT-like) reference models, each with 2.15B parameters and trained to 100B tokens. These models can serve a number of purposes, including as baselines for mono- and multilingual training, references for the comparison of HPLT and other data, and tools for contrasting HPLT data quality across different languages.
  • Two larger (13B-parameter) continually pre-trained generative models, for Finnish and Norwegian, built on the fully open-source OLMo 2 platform. These models compare favourably to language-specific adaptations of the Mistral NeMo model, suggesting that fully transparent foundation models can yield results competitive with their merely open-weight counterparts.
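Loading one of the decoder-only reference models for experimentation would look roughly as follows. Note that the repository ID below is a hypothetical placeholder, not a confirmed model name; the actual identifiers should be taken from the HPLT model cards.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository ID; look up the real 2.15B reference-model
# names on the HPLT project pages / Hugging Face organisation.
model_id = "HPLT/gpt-2b-nob"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Hovedstaden i Norge er"  # "The capital of Norway is"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```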

Mining for bilingual text

A further wealth of open-source results from HPLT relates to machine translation (MT), notably large collections of parallel texts derived by mining the monolingual datasets for translational correspondences at the sentence or document levels. These resources are created using the additional processing block labelled Bitextor Pipeline in Fig. 1. The pipeline applies a multi-stage text extraction procedure that identifies documents with identical content in different languages, using various matching and alignment techniques implemented as an open-source toolbox. Heavy parallel computing makes it possible to run such bitext mining at the scale of the monolingual web crawls coming from HPLT. Traditionally, parallel texts are provided as sentence-aligned bitexts that can be fed directly into machine translation training. HPLT provides three releases of parallel text corpora covering 57 language pairs. The data is collected in an English-centric manner, aligning documents with their English counterparts in our dataset. Pivoting on those English documents, we can then also derive multilingual parallel text collections spanning 1,446 language pairs. In total, HPLT provides 2.7 million sentence alignments, released through our repository of parallel corpora, OPUS.
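The core idea behind such mining can be illustrated with multilingual sentence embeddings and mutual nearest-neighbour matching. This is a simplified stand-in, not the Bitextor pipeline itself; LaBSE is one publicly available encoder suited to the task, and the similarity threshold is an arbitrary assumption.

```python
from sentence_transformers import SentenceTransformer

# One publicly available multilingual sentence encoder; the HPLT
# pipeline itself relies on the Bitextor toolchain instead.
encoder = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(src_sents, tgt_sents, threshold=0.8):
    # Embed both sides (L2-normalised), then keep mutual nearest
    # neighbours whose cosine similarity clears the (assumed) threshold.
    S = encoder.encode(src_sents, normalize_embeddings=True)
    T = encoder.encode(tgt_sents, normalize_embeddings=True)
    sim = S @ T.T                            # cosine similarity matrix
    s2t, t2s = sim.argmax(1), sim.argmax(0)
    for i, j in enumerate(s2t):
        if t2s[j] == i and sim[i, j] >= threshold:  # mutual best match
            yield src_sents[i], tgt_sents[j], float(sim[i, j])
```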

Fig. 2: HPLT-e comparison of models trained on the four pre-training datasets (see ‘Multilingual evaluation’)

Machine translation

Mirroring the interplay of data creation and model building in the LLM track, HPLT has worked intensively on the development and evaluation of new translation models for 100 language pairs, combined with novel infrastructure for automated training at scale and the integration of benchmarking results into the OPUS dashboard. A special focus is on efficiency, emphasising the need for compact translation models that can run locally on edge devices. Specialised models that are several orders of magnitude smaller than common general-purpose language models enable fast inference without sacrificing translation performance, and they enable secure deployments that are independent of external services and online connections. Translation models trained on data that includes HPLT show competitive performance, especially for lesser-resourced languages. To further reduce computational costs, we also developed a pipeline for systematic multilingual knowledge distillation that supports the transfer from expensive teacher models to compact student models as small as 20 megabytes; a sketch of the idea follows.
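One common form of this idea is sequence-level knowledge distillation, where the compact student learns from the teacher's translations rather than the original references. The teacher checkpoint below is a real OPUS-MT model chosen purely for illustration; the HPLT distillation pipeline differs in its details.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Teacher: an off-the-shelf Finnish-to-English OPUS-MT model.
teacher_id = "Helsinki-NLP/opus-mt-fi-en"
tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_id).eval()

@torch.no_grad()
def teacher_translate(sentences):
    # Generate teacher translations for a batch of source sentences.
    batch = tok(sentences, return_tensors="pt", padding=True)
    out = teacher.generate(**batch, max_new_tokens=64)
    return tok.batch_decode(out, skip_special_tokens=True)

# (source, teacher_output) pairs form the student's synthetic training
# corpus; the student is then trained as an ordinary, much smaller
# translation model on this data.
sources = ["Hyvää huomenta!", "Kiitos paljon avusta."]
synthetic_corpus = list(zip(sources, teacher_translate(sources)))
print(synthetic_corpus)
```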

Computational infrastructure

All work in HPLT has been exceedingly compute- and storage-intensive, made possible through a combination of resources covered by the project grant and substantial additional resources allocated to consortium members from national (Czech, Finnish, and Norwegian) quotas and through the EuroHPC system. ‘Bulk’ storage for very large-scale web data, close to 21 petabytes in total, was distributed over facilities in the Czech Republic (CESNET), Norway (Sigma2), and Finland (LUMI). Exclusive access to dedicated compute nodes tightly integrated with the storage systems made possible a first stage of lightweight document and metadata extraction (see Fig. 1), reducing the data volume for further processing by about a factor of three.

In addition to some experimentation on national superclusters, the EuroHPC LUMI system served as the main ‘workhorse’ for HPLT, where the consortium used combined allocations of around 60 million CPU hours and about 11.5 million GPU hours over the 40-month project duration; spread over the roughly 29,000 wall-clock hours of the project, the CPU allocation alone is the theoretical equivalent, on average, of more than 2,000 active CPUs at all times.



References

1. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67
2. Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849
3. Guilherme Penedo, Hynek Kydlíček, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf, et al. 2025. FineWeb2: One pipeline to scale them all – adapting pre-training data processing to every language
4. Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. MADLAD-400: A multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA
5. Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2025. Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2459–2475, Vienna, Austria
6. Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann. 2024. A new massive multilingual dataset for high-performance language technologies. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1116–1128, Torino, Italy
7. Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova. 2023. Vote’n’Rank: Revision of benchmarking with social choice theory. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 670–686, Dubrovnik, Croatia
8. Gemma Team. 2025. Gemma 3. Google Technical Report
9. Laurie Burchell, Ona De Gibert Bonet, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, and Jaume Zaragoza-Bernabeu. 2025. An expanded massive multilingual dataset for high-performance language technologies (HPLT). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17452–17485, Vienna, Austria
10. Stephan Oepen, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez-Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tereza Vojtěchová, and Jaume Zaragoza. 2025. HPLT 3.0: Very large-scale multilingual resources for LLMs and MT. Mono- and bi-lingual data, multilingual evaluation, and pre-trained models. arXiv:2511.01066 [cs.CL]

Please note, this article will also appear in the 25th edition of our quarterly publication.

