Last week, Chinese lab DeepSeek released an updated version of its R1 reasoning AI model that performs well on a number of math and coding benchmarks. The company didn’t reveal the source of the data it used to train the model, but some AI researchers speculate that at least a portion came from Google’s Gemini family of AI.
Sam Paech, a Melbourne-based developer who creates “emotional intelligence” evaluations for AI, published what he claims is evidence that DeepSeek’s latest model was trained on outputs from Gemini. The DeepSeek model, called R1-0528, prefers words and expressions similar to those that Google’s Gemini 2.5 Pro favors, Paech said in a post on X.
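Paech didn’t lay out a full methodology in the post, but stylistic comparisons of this kind generally come down to measuring how often each model uses particular words. The sketch below is a hypothetical illustration of one such approach, comparing word-frequency profiles with cosine similarity; the sample outputs, function names, and metric are assumptions for demonstration, not Paech’s actual analysis.

```python
# Illustrative sketch only: a toy way to compare two models' lexical
# preferences. The sample texts and the overlap metric are assumptions
# for demonstration, not a reproduction of Paech's methodology.
from collections import Counter
import math

def top_word_profile(texts, n=50):
    """Count word frequencies across a model's outputs and keep the top n."""
    counts = Counter(word.lower().strip('.,!?"\'')
                     for text in texts for word in text.split())
    return dict(counts.most_common(n))

def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency profiles."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical outputs; a real study would sample thousands of responses
# from each model on identical prompts.
model_a_outputs = ["Let's delve into the multifaceted landscape of this topic."]
model_b_outputs = ["We should delve into the multifaceted nature of the issue."]

profile_a = top_word_profile(model_a_outputs)
profile_b = top_word_profile(model_b_outputs)
print(f"lexical similarity: {cosine_similarity(profile_a, profile_b):.3f}")
```

A high similarity score between two models’ profiles wouldn’t prove anything on its own, which is why observers treat this sort of evidence as suggestive rather than conclusive.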
It’s not a smoking gun. But another developer, the pseudonymous creator of a “free speech” evaluation for AI called SpeechMap, noted that the DeepSeek model’s traces, the “thoughts” the model generates as it works toward a conclusion, “read like Gemini traces.”
DeepSeek has been accused of training on data from rival AI models before. In December, developers observed that DeepSeek’s V3 model often identified itself as ChatGPT, OpenAI’s AI-powered chatbot platform, suggesting it may have been trained on ChatGPT chat logs.
Earlier this year, OpenAI told the Financial Times that it had found evidence linking DeepSeek to the use of distillation, a technique for training AI models by extracting data from bigger, more capable ones. And according to Bloomberg, Microsoft, a close OpenAI collaborator and investor, detected large amounts of data being exfiltrated through OpenAI developer accounts in late 2024, accounts that OpenAI believes are affiliated with DeepSeek.
Distillation isn’t an uncommon practice, but OpenAI’s terms of service prohibit customers from using the company’s model outputs to build competing AI.
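For context, the basic shape of distillation is simple: sample a large number of responses from a stronger “teacher” model, then fine-tune a smaller “student” model to imitate them. The sketch below is a minimal, hypothetical illustration of the data-collection step; `query_teacher`, the prompts, and the JSONL format are assumptions for demonstration and don’t reflect any lab’s actual pipeline.

```python
# Minimal sketch of distillation via synthetic data. `query_teacher` is a
# stand-in for a call to any large "teacher" model's API; the JSONL layout
# is one common shape for supervised fine-tuning data. All hypothetical.
import json

def query_teacher(prompt: str) -> str:
    """Placeholder for a call to a larger, more capable model's API."""
    return f"<teacher model's answer to: {prompt}>"

prompts = [
    "Prove that the square root of 2 is irrational.",
    "Write a function that reverses a linked list.",
]

# Collect (prompt, teacher-response) pairs as synthetic training data.
with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        record = {"prompt": prompt, "response": query_teacher(prompt)}
        f.write(json.dumps(record) + "\n")

# A smaller "student" model would then be fine-tuned on synthetic_train.jsonl,
# learning to reproduce the teacher's style and reasoning.
```

The appeal is economic: generating synthetic data from a frontier model’s API is far cheaper than producing equivalent training data from scratch, which is also why API providers try to restrict it contractually.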
To be clear, many models misidentify themselves and converge on the same words and turns of phrase. That’s because the open web, where AI companies source the bulk of their training data, is becoming littered with AI slop. Content farms are using AI to create clickbait, and bots are flooding Reddit and X.
This “contamination,” if you will, has made it quite challenging to thoroughly filter AI outputs from training datasets.
Still, AI experts like Nathan Lambert, a researcher at the nonprofit AI research institute AI2, don’t think it’s out of the question that DeepSeek trained on data from Google’s Gemini.
“If I were DeepSeek, I would definitely create a ton of synthetic data from the best API model out there,” Lambert wrote in a post on X. “[DeepSeek is] short on GPUs and flush with cash. It’s literally effectively more compute for them.”
Partly in an effort to prevent distillation, AI companies have been ramping up security measures.
In April, OpenAI began requiring organizations to complete an ID verification process in order to access certain advanced models. The process requires a government-issued ID from one of the countries supported by OpenAI’s API; China isn’t on the list.
Elsewhere, Google recently began summarizing the traces generated by models available through its AI Studio developer platform, a step that makes it harder to train rival models on raw Gemini traces. In May, Anthropic said it would begin summarizing its own models’ traces, citing a need to protect its “competitive advantages.”
We’ve reached out to Google for comment and will update this article if we hear back.