OpenAI’s models “memorized” copyrighted content, new study suggests

A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.

OpenAI is embroiled in lawsuits brought by authors, programmers, and other rights holders, who accuse the company of using their works (books, codebases, and so on) to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that there is no carve-out in U.S. copyright law for training data.

The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford University, proposes a new method for identifying training data “memorized” by models that sit behind an API, like OpenAI’s.

Models are prediction engines. Trained on large amounts of data, they learn patterns, which is how they are able to generate essays, photos, and more. Most of their outputs are not verbatim copies of the training data, but because of the way models “learn,” some inevitably are. Image models have been found to regurgitate screenshots from movies they were trained on, while language models have been observed effectively plagiarizing news articles.

The study’s method relies on words that the co-authors call “high-surprisal,” that is, words that stand out as uncommon in the context of a larger body of work. For example, the word “radar” in the sentence “Jack and I sat perfectly still with the radar humming” would be considered high-surprisal because it is statistically less likely than words such as “engine” or “radio” to appear before “humming.”
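How unexpected a word is in its context is commonly quantified as surprisal, the negative base-2 logarithm of the word's probability. The sketch below is illustrative only: the probabilities are invented for the "radar" example rather than taken from the study, the function name `surprisal_bits` is hypothetical, and the study's own procedure for selecting words may differ.

```python
import math

# Hypothetical probabilities a language model might assign to the word that
# precedes "humming" in the example sentence. These numbers are invented for
# illustration and do not come from the study.
candidate_probs = {"engine": 0.30, "radio": 0.20, "radar": 0.01}


def surprisal_bits(prob: float) -> float:
    """Return the surprisal, in bits, of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in the interval (0, 1]")
    return -math.log2(prob)


# Rank the candidate words from most to least surprising.
ranked = sorted(
    candidate_probs,
    key=lambda word: surprisal_bits(candidate_probs[word]),
    reverse=True,
)

for word in ranked:
    print(f"{word}: {surprisal_bits(candidate_probs[word]):.2f} bits")
```

Under these assumed numbers, "radar" ranks as the most surprising of the three candidates, which is the kind of word such a test would choose to hide.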

The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing these rare, high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to “guess” which words had been masked. If a model managed to guess correctly, it likely memorized the snippet during training, the co-authors concluded.

An example of having a model “guess” a high-surprisal word. Image credit: OpenAI
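The masked-word test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the researchers' implementation: `mask_word`, `guess_accuracy`, and `toy_guess` are hypothetical names, `toy_guess` is a stand-in for a call to a real model, the first snippet is modeled on the article's radar example, and the second snippet is invented.

```python
from typing import Callable, Sequence, Tuple

MASK = "[MASK]"
PUNCTUATION = ".,;:!?\"'"


def mask_word(snippet: str, target: str) -> str:
    """Replace the first whole-word occurrence of `target` with MASK."""
    words = snippet.split()
    for i, word in enumerate(words):
        core = word.strip(PUNCTUATION)
        if core and core.lower() == target.lower():
            # Keep any punctuation attached to the word, e.g. "radar," -> "[MASK],".
            words[i] = word.replace(core, MASK, 1)
            return " ".join(words)
    raise ValueError(f"{target!r} not found in snippet")


def guess_accuracy(
    examples: Sequence[Tuple[str, str]],
    guess: Callable[[str], str],
) -> float:
    """Return the fraction of masked words that `guess` recovers exactly."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = 0
    for snippet, target in examples:
        prediction = guess(mask_word(snippet, target))
        if prediction.strip().lower() == target.lower():
            hits += 1
    return hits / len(examples)


def toy_guess(masked_snippet: str) -> str:
    """Hypothetical stand-in for a model: it 'remembers' exactly one snippet."""
    memorized = {
        "Jack and I sat perfectly still with the [MASK] humming.": "radar",
    }
    # For unseen text, fall back to a common, low-surprisal word.
    return memorized.get(masked_snippet, "engine")


examples = [
    ("Jack and I sat perfectly still with the radar humming.", "radar"),
    ("The old kettle began whistling on the stove.", "kettle"),  # invented
]

accuracy = guess_accuracy(examples, toy_guess)
print(f"Recovered {accuracy:.0%} of masked words")
```

The intuition is that a model has little chance of recovering a rare word from context alone, so a high hit rate on text from a particular source hints that the text was seen during training. It is a signal of memorization rather than proof.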

According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in BookMIA, a dataset containing samples of copyrighted e-books. The results also suggest that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the “contentious data” that models might have been trained on.

“In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically,” Ravichander said. “Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem.”

OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has certain content licensing deals in place and offers opt-out mechanisms that allow copyright owners to flag content they would prefer not be used for training, it has lobbied several governments to codify “fair use” rules around AI training approaches.
