New research appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.
OpenAI is caught up in lawsuits brought by authors, programmers, and other rights holders, who accuse the company of developing its models using their works, including books and codebases, without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there is no carve-out in U.S. copyright law for training data.
The study was co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford University, and proposes a new method for identifying training data “memorized” by models behind APIs, like OpenAI’s.
Models are prediction engines. Trained on lots of data, they learn patterns, which is how they generate essays, photos, and more. Most outputs are not verbatim copies of the training data, but some inevitably are, owing to the way models “learn.” Image models have been found to regurgitate screenshots from movies they were trained on, while language models have been observed effectively plagiarizing news articles.
The study’s method relies on words the co-authors call “high-surprisal,” that is, words that stand out as uncommon in the context of a larger body of work. For example, the word “radar” in the sentence “Jack and I sat perfectly still with the radar humming” would be considered high-surprisal, because it is statistically less likely than words like “engine” or “radio” to appear before “humming.”
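Surprisal has a simple information-theoretic definition: the negative log-probability of a word given its context. The sketch below illustrates the idea with an invented toy probability table; the numbers are placeholders, not figures from the study, which would instead use probabilities estimated by a language model.

```python
import math

# Toy conditional probabilities for the word appearing before "humming".
# Illustrative numbers only -- NOT taken from the study.
p_before_humming = {"engine": 0.20, "radio": 0.10, "radar": 0.005}

def surprisal(word: str, dist: dict) -> float:
    """Surprisal in bits: -log2 of the word's probability in context.
    Rarer words in a given context have higher surprisal."""
    return -math.log2(dist[word])

for w in ("engine", "radio", "radar"):
    print(f"{w}: {surprisal(w, p_before_humming):.2f} bits")
```

Under this toy distribution, “radar” gets the highest surprisal of the three candidates, which is what makes it a useful probe word: a model is unlikely to guess it from context alone.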
The co-authors probed several models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to “guess” the masked words. If a model managed to guess correctly, it likely memorized the snippet during training, the co-authors concluded.
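The masking test described above can be sketched as follows. This is a minimal illustration, not the study’s actual code: `query_model` is a hypothetical stand-in for a call to a model API, and the stub below simply hard-codes an answer so the sketch is self-contained.

```python
def mask_high_surprisal(snippet: str, word: str) -> str:
    """Replace the first occurrence of a high-surprisal word with a mask token."""
    return snippet.replace(word, "[MASK]", 1)

def memorization_hit(snippet: str, word: str, query_model) -> bool:
    """True if the model recovers the exact masked word -- the signal the
    co-authors treat as evidence the snippet was memorized in training."""
    guess = query_model(mask_high_surprisal(snippet, word))
    return guess.strip().lower() == word.lower()

# Usage with a stub "model" that happens to produce the masked word:
stub = lambda prompt: "radar"
hit = memorization_hit(
    "Jack and I sat perfectly still with the radar humming.",
    "radar",
    stub,
)
print(hit)
```

Because high-surprisal words are hard to infer from context, a correct guess is more informative than it would be for a common word like “engine,” where the model could succeed by prediction alone.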

The test results showed that GPT-4 displayed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted e-book samples called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the “contentious data” models may have been trained on.
“In order to have large language models that are trustworthy, we need models that we can probe, audit, and examine scientifically,” Ravichander said. “Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency across the whole ecosystem.”
OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that allow copyright owners to flag content they would prefer not be used for training purposes, it has lobbied several governments to codify “fair use” rules around AI training approaches.