For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to court documents unsealed Thursday.
The documents were filed by the plaintiffs in Kadrey v. Meta, one of many AI copyright disputes slowly working their way through the US court system. The defendant, Meta, argues that training models on IP-protected works, particularly books, is “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.
Previous materials filed in the lawsuit alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the go-ahead to train on copyrighted content and that Meta halted talks with book publishers over licensing AI training data. However, the new filings, most of which show portions of internal work chats between Meta staffers, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including models in the company’s Llama family.
In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta’s Llama model research team, discussed training models on works they knew might be legally fraught.
“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a research engineer at Meta, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”
Martinet floated the idea of buying e-books at retail prices to build a training set, rather than cutting licensing deals with individual book publishers. After another staff member pointed out that using unauthorized, copyrighted materials could be grounds for a legal challenge, Martinet doubled down, arguing that a great many startups were probably already using pirated books for training.
“I mean, worst case: we found out it is finally ok,” Martinet wrote, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time…”
In the same chat, Kambadur, who noted that Meta was in talks with the document-hosting platform Scribd “and others” for licenses, cautioned that using “publicly available data” for model training would still require approvals, although Meta’s lawyers were being less conservative about such approvals than they had been in the past.
“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur said, according to the filings. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”
Libgen’s story
In another work chat relayed in the filings, Kambadur discusses the possibility of using Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.
Libgen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. One of Kambadur’s colleagues responded with a screenshot of a Google search result for Libgen containing the snippet “No, Libgen is not legal.”
Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously undermine Meta’s competitiveness in the AI race.
In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, Meta’s director of product management, called Libgen “essential to meet SOTA numbers across all categories,” referring to matching or beating the best, state-of-the-art (SOTA) AI models across benchmark categories.
In the email, Theakanath also outlined “mitigations” intended to reduce Meta’s legal exposure, such as removing data from Libgen “clearly marked as pirated/stolen” and simply not citing its use publicly. As Theakanath put it, “We would not disclose use of Libgen datasets used to train.”
In practice, these mitigations entailed combing through Libgen files for words such as “stolen” and “pirated,” according to the filings.
In a work chat, Kambadur said that Meta’s AI team also tuned models to “avoid IP risky prompts.” In other words, the models were configured to refuse requests such as “Reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “Tell me which e-books you were trained on.”
The filings contain other revelations, implying that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit said in April 2023 that it planned to begin charging AI companies for access to data for model training.
In one chat dated March 2024, Chaya Nayak, director of product management at Meta’s generative AI org, said that Meta’s leadership was considering “overriding” past decisions about training sets, including decisions not to use Quora content or licensed books and scientific articles, to ensure that the company’s models had sufficient training data.
Nayak implied that Meta’s first-party training datasets (Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages) simply were not sufficient. “[W]e need more data,” she wrote.
The plaintiffs in Kadrey v. Meta have amended their complaint several times since the case was filed in 2023 in the US District Court for the Northern District of California, San Francisco Division. The latest version alleges, among other things, that Meta compared certain pirated books against copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.
In a sign of how high Meta considers the legal stakes to be, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.
Meta did not immediately respond to requests for comment.