OpenAI stands accused by many parties of training its AI on copyrighted content without permission. Now a new paper from an AI watchdog organization makes the serious accusation that the company is increasingly relying on non-public books it didn't license to train its more sophisticated AI models.
AI models are essentially complex prediction engines. Trained on vast amounts of data, including books, films, and TV shows, they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” a Ghibli-style image, it is simply pulling from its vast knowledge and approximating; it isn't arriving at anything new.
Many AI labs, including OpenAI, have begun using AI-generated data to train AI as they exhaust real-world sources (mainly the public web), but few have eschewed real-world data entirely. That's because training on purely synthetic data comes with risks, such as degraded model performance.
A new paper from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
GPT-4o is the default model in ChatGPT. O'Reilly does not have a licensing agreement with OpenAI, the paper notes.
“GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content compared with OpenAI's earlier model, GPT-3.5 Turbo,” the paper's co-authors wrote. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples.”
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same texts. If it can, it suggests that the model may have prior knowledge of the text from its training data.
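The core of the DE-COP idea can be sketched as a multiple-choice quiz: show the model a verbatim excerpt shuffled among paraphrased decoys and ask it to pick the original; accuracy well above the chance baseline across many excerpts hints at memorization. The function names below are hypothetical illustrations, and the real method includes debiasing steps (e.g., averaging over option orderings) that this simplified sketch omits.

```python
import random

def decop_score(model_pick, excerpts, n_options=4, seed=0):
    """Simplified DE-COP-style membership test.

    excerpts: list of (original, [decoy paraphrases]) pairs.
    model_pick: callable that takes a list of candidate texts and
        returns the index it believes is the verbatim original.
    Returns (accuracy, chance baseline). Accuracy far above chance
    suggests the originals may have been in the training data.
    """
    rng = random.Random(seed)
    hits = 0
    for original, decoys in excerpts:
        options = [original] + list(decoys)
        rng.shuffle(options)                 # hide the original's position
        picked = model_pick(options)         # index chosen by the model
        if options[picked] == original:
            hits += 1
    return hits / len(excerpts), 1.0 / n_options

# Toy stand-in for querying a real model: here, a "memorizing" picker
# that recognizes the originals (labeled with an "orig" prefix).
def memorizing_picker(options):
    return next(i for i, text in enumerate(options) if text.startswith("orig"))
```

In the paper's setting, `model_pick` would be a call to an OpenAI model asked which option is the verbatim passage; a model that has never seen the paywalled text should hover near the 1/4 chance baseline.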
The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say they probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training data set.
According to the paper's results, GPT-4o “recognized” far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo, even after accounting for potential confounding factors such as improvements in newer models' ability to determine whether a text was written by a human.
“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date,” the co-authors wrote.
It's not a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI might have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent crop of models, including GPT-4.5 and “reasoning” models such as o3-mini and o1. It's possible these models weren't trained on paywalled O'Reilly book data, or were trained on less of it than GPT-4o.
That said, it's no secret that OpenAI, which has advocated for looser restrictions on developing models with copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs, part of a broader industry trend of AI companies recruiting experts in domains such as science and physics to effectively feed their knowledge into AI systems.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright holders to flag content they'd prefer not be used for training purposes.
Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in US courts, the O'Reilly paper isn't the most flattering look.
OpenAI did not respond to a request for comment.