The AI industry has been grappling with training data issues in the wake of Anthropic’s $1.5 billion copyright settlement. As many as 40 other pending cases seek damages for unlicensed use of data, including one that takes Midjourney to court for generating images of Superman.
Without some kind of licensing system, AI companies could face an avalanche of copyright lawsuits.
Now, a group of engineers and web publishers has launched a system that enables data licensing at scale, provided that AI companies adopt it. The system, known as Really Simple Licensing (RSL), is already supported by leading web publishers such as Reddit, Quora, and Yahoo. The question is whether that momentum is enough to bring the major AI labs to the negotiating table.
According to RSL co-founder Eckart Walther, who also co-created the RSS standard, the goal was to create a training data licensing system that could scale across the internet. “We need a machine-readable licensing agreement for the Internet,” Walther told TechCrunch. “That’s really what RSL solves.”
For years, groups such as the Dataset Providers Alliance have pushed for clearer data collection practices, but RSL is the first attempt to build the technical and legal infrastructure that could make licensing work in practice. On the technical side, the RSL protocol defines specific licensing terms that publishers can attach to their content, whether they require custom licenses or adopt Creative Commons terms. Participating websites include the terms as part of their “robots.txt” file in a predefined format, making it easier to identify which data falls under which terms.
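As a rough illustration of what "machine-readable" could mean here, the sketch below scans a robots.txt body for a licensing directive and returns the URLs it points to. The directive name `License`, the sample file, and the `example.com` URL are assumptions made for illustration; the article does not describe the exact syntax, so the RSL specification should be treated as the authority on the real format.

```python
# Minimal sketch: pull licensing pointers out of a robots.txt body.
# ASSUMPTION: the directive is spelled "License" and its value is a URL.
# Neither detail comes from the article; check the RSL specification.

SAMPLE_ROBOTS_TXT = """\
# Hypothetical robots.txt, for illustration only
License: https://example.com/license.xml

User-agent: *
Allow: /
"""


def find_license_urls(robots_txt: str, directive: str = "license") -> list[str]:
    """Return the value of every `<directive>: <value>` line, in file order."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (A simplification: a "#" inside a URL would also be cut off.)
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so "https://..." stays intact.
        key, value = line.split(":", 1)
        if key.strip().lower() == directive:
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    print(find_license_urls(SAMPLE_ROBOTS_TXT))
```

A crawler built along these lines would still need to fetch and interpret the linked license terms; this only shows the discovery step.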
On the legal side, the RSL team has founded the RSL Collective, a collective licensing organization that can negotiate terms and collect royalties, similar to ASCAP for musicians or MPLC for films. As in music and film, the goal is to give licensees a single point of contact for paying royalties, and to give rights holders a way to set terms with a large number of potential licensees at once.
Many web publishers have already joined the collective, including Yahoo, Reddit, Medium, O’Reilly Media, Ziff Davis (owner of Mashable and CNET), Internet Brands (owner of WebMD), People Inc. and The Daily Beast. Others, like Fastly, Quora and Adweek, support the standard without joining the collective.
Notably, the RSL Collective includes publishers that already have licensing deals in place, most prominently Reddit, which receives an estimated $60 million a year from Google for the use of its data in training. Nothing in the RSL system prevents companies from cutting their own deals, just as Taylor Swift can set special terms for licensing her music while still collecting royalties through ASCAP. But for publishers too small to strike deals of their own, the RSL Collective’s terms may be their only option.
But while it’s easy to determine when a song has been played, AI models pose unique challenges when it comes to knowing when royalties are due for specific training data. The problem is easiest for products like Google’s AI search summaries, which pull data from the web in real time and maintain strict attribution for each fact.
If no logs were kept when training took place, however, it is almost impossible to confirm that a particular document was ingested by an LLM. It is especially difficult when publishers ask to be paid per inference rather than through a blanket fee, an option offered among the stock RSL licenses.
Still, RSL’s creators believe that AI companies can manage the difficulty. “Some of the licensing agreements they’ve already made require them to report on it, so that’s possible,” says Doug Leeds, co-founder of RSL and former CEO of IAC Publishing. “It doesn’t have to be perfect, it has to be enough to pay people.”
The bigger question is whether AI companies will accept the system. As the success of companies like Scale AI and Mercor shows, frontier labs have no problem paying for data, but the web has traditionally been seen as a source of cheap, low-quality data. With datasets like Common Crawl already available, it may be difficult to extract royalties for what labs are used to getting for free. And as the recent dust-up between Cloudflare and Perplexity shows, it’s not easy to draw the line between web scraping and machine-assisted browsing.
When I put this question to Leeds, he pointed to recent comments from AI leaders calling for a system like RSL, most notably from Sundar Pichai at last year’s DealBook Summit. Whether or not those calls for a licensing system were serious, the RSL team plans to hold them to it. “They’ve been saying outwardly to everyone, this kind of thing needs to exist,” Leeds told me. “We need a protocol. We need a system.”
Now they might get it.