SYNTHEMA is creating a secure, federated platform for the generation and validation of synthetic clinical data, enabling GDPR-compliant research and innovation in rare haematological diseases.
Haematological diseases comprise a diverse group of over 450 conditions affecting blood cells, the lymphatic system, and coagulation functions. These disorders – whether malignant or not – are often linked to genetic abnormalities and present significant challenges to public health systems. Haematological cancers alone account for approximately 5% of all cancer cases, while many non-oncological conditions lead to chronic complications and demand lifelong care.
In 2016, the European Haematology Association (EHA) estimated that haematological diseases, taken together, place a financial burden of €22.5bn on European society. Among them, Sickle Cell Disease (SCD) and Acute Myeloid Leukaemia (AML) exemplify both the medical and societal burden of these conditions. SCD causes recurrent pain episodes, progressive organ damage and shortened life expectancy, disproportionately impacting underserved populations. Similarly, AML, an aggressive and biologically heterogeneous cancer, often affects older adults and is associated with poor prognoses and complex treatment needs.
Yet, despite the urgency, research and healthcare planning remain severely limited by a lack of accessible, structured and interoperable data. Rare haematological diseases suffer from both data scarcity and fragmentation, which hinders large-scale clinical research and obstructs the development of AI-powered diagnostics and novel therapeutic tools. These challenges are further compounded by the underrepresentation of rare diseases in clinical coding systems, making it difficult to trace patient journeys across healthcare systems and undermining the long-term viability of national and European registries. SYNTHEMA addresses these critical challenges through a €7m Horizon Europe-funded initiative that pioneers the use of synthetic data and privacy-preserving AI to unlock new research potential in rare haematological diseases.
The SYNTHEMA project
SYNTHEMA’s mission is to advance research and innovation in rare haematological diseases by developing and validating AI-driven methods for synthetic data generation and anonymisation. The project leverages federated learning to enable secure, decentralised AI training across multimodal clinical datasets – such as laboratory results, imaging, and electronic health records – without moving sensitive patient data from its original location. In parallel, SYNTHEMA generates high-quality synthetic datasets that replicate the patterns of real data to overcome limitations caused by missing or scarce information. These artificial datasets support personalised diagnosis, treatment evaluation, and outcome prediction, ultimately enhancing patient care and enabling cross-border collaboration in rare disease research.
The project brings together a multidisciplinary consortium of 16 partners from ten European countries, uniting leading institutions in clinical research, AI, data privacy, and healthcare innovation. This partnership includes top hospitals, universities, SMEs and research centres from Spain, Italy, France, Germany, the Netherlands, Austria, Portugal, Estonia, Belgium and the United Kingdom. The project also benefits from the strong involvement of ERN-EuroBloodNet, a joint effort between the EHA, the European Network on Rare and Congenital Anaemias (ENERCA), the EURORDIS European Patient Advocacy Groups (ePAGS) and the EHA Patient Organisations Workgroup. Through EuroBloodNet’s clinical partners and resources, SYNTHEMA can align its work with real-world healthcare needs, ensuring the clinical relevance of its use cases and maximising the impact of its outcomes across Europe’s rare disease community.
The SYNTHEMA platform
At the heart of SYNTHEMA is a federated, privacy-preserving infrastructure designed to enable the development, training, and validation of AI models for synthetic data generation, without the need to centralise and share sensitive clinical data. Instead of moving data across borders or institutions, the platform brings computation to the data, using federated learning to train algorithms directly within local hospital premises.
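To make the idea of bringing computation to the data more concrete, the following is a minimal sketch of federated averaging, a common federated learning scheme. It is not SYNTHEMA's implementation: the toy linear model, the function names, the learning rate and the simulated "hospital" datasets are all assumptions chosen for illustration. Each node trains on its own records and only the model weights, never the records, are sent for aggregation.

```python
import random

def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few least-squares gradient steps on one node's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    # Only the updated weights and the sample count leave the node.
    return (w, b), n

def federated_average(updates):
    """Combine node weights, weighting each node by its number of samples."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return (w, b)

def train(nodes, rounds=200):
    """Alternate local training and central averaging for a number of rounds."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in nodes]
        global_weights = federated_average(updates)
    return global_weights

def make_node(n, rng):
    """Hypothetical toy data following y = 2x + 1 plus small noise (not clinical data)."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.01)) for x in xs]

rng = random.Random(0)
nodes = [make_node(20, rng), make_node(50, rng), make_node(30, rng)]
w, b = train(nodes)
print(f"learned slope={w:.3f}, intercept={b:.3f}")
```

In this toy setting the shared model recovers the underlying relationship even though no node ever sees another node's data. A real deployment would add authentication, secure transport and the privacy safeguards discussed in this article.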
This architecture consists of multiple clinical nodes – located at participating hospitals and clinical centres – and computing nodes managed by technical partners. Clinical nodes retain all real patient data locally, while computing nodes provide the processing power and AI toolkits needed to train models securely across sites. The system integrates advanced privacy-preserving techniques such as Secure Multiparty Computation (SMPC), Differential Privacy (DP), and Homomorphic Encryption, ensuring that no sensitive information is exposed during collaborative computation processes.
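As one illustration of how such techniques can keep individual contributions hidden, the sketch below simulates additive secret sharing, a basic building block of Secure Multiparty Computation. It is an assumption-laden toy, not the protocol used in the project: the modulus, the function names and the example counts are invented, and all parties are simulated inside a single process. Each site splits its private number into random shares, so that only the combined total can be reconstructed.

```python
import secrets

# Modulus for the arithmetic; an illustrative choice, large enough for small counts.
PRIME = 2**61 - 1

def share(value, n_parties):
    """Split a value into n random shares that sum to the value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def secure_sum(private_values):
    """Simulate parties summing their values without revealing any single one."""
    n = len(private_values)
    # Party i creates n shares of its value and would send share j to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds the shares it received; each partial sum looks random on its own.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Publishing only the partial sums reveals the total and nothing else.
    return sum(partial_sums) % PRIME

# Hypothetical per-site patient counts (invented numbers).
counts = [12, 7, 30]
total = secure_sum(counts)
print(total)
```

Any single share, or any single partial sum, is uniformly random and carries no information about a site's value. Differential Privacy and Homomorphic Encryption address related but different risks and are not shown here.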

The platform also includes dedicated modules for data harmonisation, quality control, and anonymisation, following FAIR (Findable, Accessible, Interoperable, and Reusable) principles. These modules ensure that data, once synthetically generated, is of high utility while remaining fully compliant with GDPR and ethical standards. Importantly, SYNTHEMA’s platform is scalable and interoperable, allowing other institutions or disease use cases to be onboarded in the future.
Embedding ethics-by-design principles in synthetic data research
At the core of SYNTHEMA lies a firm commitment to integrating ethics-by-design directives into every stage of synthetic data generation and AI model development. In the healthcare realm – particularly for rare haematological diseases where patient numbers are small and data sensitivity is high – ensuring legal and ethical compliance is not just a regulatory requirement, but a moral imperative.
Synthetic data plays a central role in SYNTHEMA, as it offers a solution to one of the biggest challenges in rare disease research: data scarcity. Synthetic data is artificially generated from real-world datasets using advanced algorithms, and it mirrors the statistical properties of the original data without revealing any sensitive or identifiable information. This allows researchers to augment limited datasets, address data gaps, and improve the robustness of AI models without compromising patient privacy. By generating realistic and privacy-preserving synthetic datasets, SYNTHEMA enhances the performance of machine learning systems and enables meaningful insights, even in contexts where real data is scarce or fragmented.
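The principle of mirroring statistical properties can be shown with a deliberately simple sketch. It swaps the project's AI-driven generators for a much more basic technique, fitting a two-variable Gaussian and sampling from it, and every name and number in it is a hypothetical placeholder rather than clinical data. Fitting summary statistics alone does not guarantee privacy, which is why dedicated anonymisation and validation steps are still needed.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def sample_synthetic(params, n, rng):
    """Draw new rows from the fitted distribution; no original row is copied."""
    mx, my, sx, sy, rho = params
    residual = math.sqrt(max(0.0, 1 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # preserves the correlation
        rows.append((x, y))
    return rows

rng = random.Random(42)
# Hypothetical "real" table of two correlated measurements (invented toy values).
real = []
for _ in range(200):
    a = rng.gauss(9.0, 1.5)
    real.append((a, 0.5 * a + rng.gauss(0, 0.5)))

real_params = fit_gaussian(real)
synthetic = sample_synthetic(real_params, 5000, rng)
synthetic_params = fit_gaussian(synthetic)
print("real:     ", [round(p, 2) for p in real_params])
print("synthetic:", [round(p, 2) for p in synthetic_params])
```

Comparing the two printed summaries is a crude form of utility validation: the synthetic table should reproduce the means, spreads and correlation of the source without containing any of its rows.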
SYNTHEMA also goes beyond technical safeguards by developing dedicated legal and ethical frameworks tailored to the sensitive nature of health-related personal data. These include governance models that support responsible data use, risk management strategies for privacy breaches, and co-creation of algorithms with ethical oversight to mitigate bias and ensure transparency. By engaging stakeholders from healthcare, legal, academic, and technical fields, SYNTHEMA fosters a participatory approach to the design of trustworthy AI systems. This is particularly relevant for rare disease populations, who are often underrepresented in data-driven research and may face added risks of bias or exclusion.
In parallel, the project supports interoperability and standardisation efforts, ensuring that anonymisation methods and synthetic data outputs are aligned with existing regulatory frameworks, including the GDPR. All data processed within the project adheres to strict ethical review procedures and conforms to FAIR principles, facilitating transparent audit trails and long-term reusability. Ultimately, SYNTHEMA’s ethics-first approach is designed to enable trustworthy innovation in healthcare AI – empowering researchers while protecting patients and setting a benchmark for ethical synthetic data research in Europe and beyond.
Use cases: Tackling data scarcity in Sickle Cell Disease and Acute Myeloid Leukaemia
To demonstrate the applicability and scalability of synthetic data generation in rare haematological diseases, SYNTHEMA focuses on two high-impact use cases: SCD and AML. These conditions were carefully selected to reflect the broad spectrum of challenges in rare disease research, covering both non-oncological and oncological profiles. Their inclusion ensures the project’s capacity to test and validate synthetic data methods across diverse data types, clinical workflows, and privacy requirements.
SCD is a rare but globally prevalent genetic blood disorder that primarily affects individuals of African, Mediterranean, Middle Eastern, and Indian ancestry. Characterised by the production of abnormal haemoglobin, SCD leads to chronic anaemia, severe pain episodes, organ damage, and reduced life expectancy. Although early diagnosis and better management have improved outcomes, patients often face limited treatment options and a fragmented care experience. From a data perspective, SCD is underrepresented in European registries, and the sensitive nature of patient information poses challenges to data sharing across borders. SYNTHEMA aims to address these barriers by enabling the generation of realistic synthetic datasets that preserve the statistical and clinical integrity of real patient records while ensuring full compliance with privacy standards.
AML, in contrast, is a rapidly progressing and heterogeneous blood cancer, typically affecting older adults. AML is marked by the accumulation of immature white blood cells in the bone marrow, leading to impaired blood production and systemic complications. The disease’s biological complexity, combined with its aggressive nature, makes it particularly challenging to diagnose and treat. Despite advances in genomics and targeted therapies, survival rates remain low and vary widely depending on patient characteristics. AML datasets are often limited in size, fragmented across institutions, and constrained by strict privacy regulations. Within SYNTHEMA, AML serves as a critical testbed for synthetic data generation that must accommodate multimodal inputs, such as genomics, clinical history, and laboratory values, while respecting the high-risk nature of the patient population.
Both use cases are central to SYNTHEMA’s ambition to create GDPR-compliant synthetic datasets that are both clinically valuable and statistically robust. Moreover, by focusing on two contrasting conditions – one chronic and genetically inherited, the other acute and genetically diverse – SYNTHEMA strengthens the generalisability and scalability of its methods for application across a wider range of rare haematological diseases.
Ultimately, the SCD and AML use cases will serve not only as proof-of-concept for SYNTHEMA’s technical innovations but also as catalysts for change in how Europe approaches data sharing and AI-driven research in rare disease domains. The insights and tools generated through these pilots will be openly shared with the wider community, contributing to registries, clinical research initiatives, and the long-term vision of ethical, AI-enabled personalised medicine in haematology.
SYNTHEMA’s legacy: The power of AI in rare haematological diseases
Rare haematological diseases present one of the most pressing gaps in modern healthcare, where the scarcity of high-quality, usable data continues to slow down progress in diagnosis, treatment, and research. SYNTHEMA directly tackles this challenge by creating the foundations for a future in which synthetic, privacy-preserving data can drive innovation without compromising patient rights. By advancing federated learning infrastructures, robust anonymisation strategies, and the clinical validation of synthetic datasets, the project is not only solving a technical problem – it is shaping a new model for ethically grounded health data use in Europe.
The project’s impact will extend beyond its initial use cases in SCD and AML. SYNTHEMA’s flexible architecture and standardised protocols are designed for scalability, enabling future adoption by other clinical centres and adaptation to additional rare or underrepresented diseases. Through its alignment with ERN-EuroBloodNet and contribution to broader data harmonisation efforts, the project is already setting the stage for long-term sustainability and integration with future digital health frameworks.
By building trust in synthetic data and providing practical tools for its generation and validation, SYNTHEMA is contributing to a paradigm shift in health research, where secure, decentralised collaboration can thrive without the need to compromise on ethics, legal compliance, or data protection. In doing so, the project supports Europe’s vision of a more connected, responsible, and patient-centric health ecosystem – one where data can empower, rather than restrict, meaningful medical advances.
Disclaimer
Funded by the European Union
SYNTHEMA has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101095530.
Please note, this article will also appear in the 23rd edition of our quarterly publication.