On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia’s rich knowledge more accessible to AI models.
Called the Wikidata Embedding Project, the system applies vector-based semantic search, a technique that helps computers understand the meaning and relationships between words, to nearly 120 million entries of existing data from Wikipedia and its sister platforms.
Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural-language queries from LLMs.
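To give a sense of what MCP access could look like in practice, here is a minimal, hypothetical sketch using the open-source Python MCP client SDK. The server URL and the tool name are invented placeholders for illustration, not the project's published interface.

```python
# Hypothetical sketch: querying a Wikidata embedding service exposed over MCP.
# The server URL ("https://wikidata-embeddings.example/mcp") and the tool name
# ("semantic_search") are placeholders, not the project's actual endpoints.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def main() -> None:
    async with sse_client("https://wikidata-embeddings.example/mcp") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover which tools the server exposes to the model.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Ask a natural-language question; the server handles the embedding
            # lookup and returns structured Wikidata results.
            result = await session.call_tool(
                "semantic_search",
                arguments={"query": "Which scientists worked at Bell Labs?"},
            )
            print(result.content)


asyncio.run(main())
```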
The project was carried out by Wikimedia's German branch in collaboration with the neural search company Jina.ai and DataStax, a real-time training data company owned by IBM.
Wikidata has provided machine-readable data from Wikimedia properties for years, but existing tools only allowed keyword searches and queries in SPARQL, a specialized query language. The new system is better suited to retrieval-augmented generation (RAG) setups, which let AI models pull in external information, giving developers the chance to ground their models in knowledge verified by Wikipedia editors.
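As a rough illustration of the RAG pattern described above, the sketch below retrieves Wikidata-backed entries from a vector search service and folds them into a grounded prompt for an LLM. The endpoint URL, response shape, and field names are assumptions made for this example; the project's real API may differ.

```python
# Minimal RAG sketch, under stated assumptions: a hypothetical vector-search
# endpoint returns Wikidata entries relevant to the question, and those
# entries ground the prompt sent to an LLM. Endpoint and fields are invented.
import requests

SEARCH_URL = "https://wikidata-embeddings.example/search"  # placeholder URL


def retrieve_facts(question: str, top_k: int = 5) -> list[str]:
    """Fetch the top-k semantically similar Wikidata entries (hypothetical API)."""
    response = requests.get(SEARCH_URL, params={"q": question, "limit": top_k})
    response.raise_for_status()
    # Assumed response shape: {"results": [{"label": ..., "description": ...}, ...]}
    return [
        f"{item['label']}: {item['description']}"
        for item in response.json()["results"]
    ]


def build_grounded_prompt(question: str) -> str:
    """Assemble a prompt that grounds the model in editor-verified Wikidata facts."""
    facts = "\n".join(f"- {fact}" for fact in retrieve_facts(question))
    return (
        "Answer the question using only the verified facts below.\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    prompt = build_grounded_prompt("Which scientists worked at Bell Labs?")
    print(prompt)  # The prompt would then be passed to the LLM of your choice.
```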
The data is also structured to provide important semantic context. Querying the database for the term "scientist," for example, returns lists of prominent nuclear scientists as well as scientists who worked at Bell Labs, along with translations of the word "scientist" into different languages, Wikimedia-cleared images of scientists at work, and extrapolations to related concepts such as "researcher" and "scholar."
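To see why an embedding-based search surfaces related concepts like "researcher" and "scholar," here is a small sketch using the open-source sentence-transformers library with a generic model, standing in for whichever embedding model the project actually uses.

```python
# Sketch of semantic similarity with embeddings, using a generic open-source
# model as a stand-in for the project's actual embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

terms = ["scientist", "researcher", "scholar", "banana"]
vectors = model.encode(terms, convert_to_tensor=True)

# Cosine similarity between "scientist" and each other term: related concepts
# like "researcher" and "scholar" score high, unrelated ones like "banana" low.
scores = util.cos_sim(vectors[0], vectors[1:])
for term, score in zip(terms[1:], scores[0]):
    print(f"scientist ~ {term}: {float(score):.2f}")
```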
The database is publicly available on Toolforge. Wikidata is also hosting a webinar for interested developers on October 9th.
The new project arrives as AI developers scramble for high-quality data sources they can use to fine-tune their models. Training systems themselves have grown more sophisticated, often assembled as complex training environments rather than simple datasets, but they still require closely curated data to function well. The need for reliable data is particularly urgent for deployments that require high accuracy, and while some may look down on Wikipedia, its data is significantly more fact-oriented than catch-all datasets like Common Crawl, a massive collection of web pages scraped from across the internet.
In some cases, the push for high-quality data can have expensive consequences for AI labs. In August, Anthropic offered to settle a lawsuit with a group of authors whose works had been used as training material, agreeing to pay $1.5 billion to end allegations of wrongdoing.
In a statement to the press, Wikidata AI project manager Philippe Saadé emphasized his project's independence from major AI labs and large tech companies. "This Embedding Project launch shows that powerful AI doesn't have to be controlled by a handful of companies," Saadé told reporters. "It can be open, collaborative, and built to serve everyone."