Denas Grybauskas, Chief Governance and Strategy Officer at Oxylabs, outlines the key considerations in the EU AI Act that need to be taken into account from both a legal and an ethical perspective to ensure that best practices for web data collection are followed.
Web scraping today faces an interesting dichotomy. It’s a critical part of the internet experience that powers major sites, but the sheer amount of data collected for AI training purposes has put it under scrutiny.
As the AI boom is changing the entire nature of the web, old debates about how public data is accessed are also being rekindled. Add in headlines about AI piracy, and the picture of how data is used becomes cloudy, making it difficult for companies to navigate.
As discussed in the OxyCon session I chaired this year, the EU AI Act introduces additional challenges for the industry to grapple with. Data-intensive companies have clearly not been given a “highway code” for web scraping, and many elements of the law remain unclear, creating easy traps for companies to fall into.
Uncertain legal situation
There are recurring legal issues that companies need to be aware of when collecting web data.
Breach of contract: The most common legal claims related to web data collection are for breach of contract, which occurs when a party fails to do what it committed to when it accepted a site's Terms of Use. Let’s say a company has accounts on a particular website, such as a social media site, and decides to scrape that site at the same time. In that case, the company is naturally exposed to greater risk. Scraping content from social media sites after agreeing to their Terms of Service is one of the main causes of litigation in this area. It can still be argued (and in some cases it has been) that the act of scraping is unrelated to the purpose of the social media site or to the creation of the account, and that the Terms of Service should therefore not govern the scraping of public data. However, proving this point requires effort.
Copyright infringement: The most headline-grabbing legal claims today involve copyright infringement, particularly those that lead to high-profile class action lawsuits. These cases have caused the most controversy, with protests erupting in London earlier this year over claims that Meta had stolen books. Media outlets are currently reporting on a music publisher embroiled in a legal battle with Anthropic over AI copyright claims. These types of lawsuits reflect the ongoing debate about what data can be used for AI training purposes and how creators should be involved.
Personal data: In some cases, publicly available data may include personal information. Even though it is technically “public,” personal data is still protected by privacy laws, typically subject to exceptions and conditions such as those outlined in the CCPA. Companies should therefore thoroughly evaluate whether collecting such information is necessary and ethical. Issues of privacy and data ownership are very likely to remain the main areas of focus in courts and public debates about web data for some time to come.
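As a rough illustration of the personal data point, a collection pipeline could flag records that appear to contain personal information so that someone can decide whether keeping them is necessary. The sketch below is only an assumption about how such a check might look: the function names are hypothetical, and the simplified e-mail pattern stands in for what would in practice be a much broader personal-data review.

```python
import re

# Simplified, illustrative e-mail pattern (an assumption, not a complete
# personal-data detector). Real reviews cover far more than e-mail addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def flag_personal_data(record: str) -> list[str]:
    """Return the e-mail-like strings found in one scraped text record."""
    return EMAIL_RE.findall(record)


def needs_review(records: list[str]) -> list[int]:
    """Return the indexes of records that should be reviewed before storage."""
    return [i for i, record in enumerate(records) if flag_personal_data(record)]


if __name__ == "__main__":
    sample = [
        "Contact jane.doe@example.com for details",
        "Opening hours: 9 to 5",
    ]
    print(needs_review(sample))
```

A flag here is only a prompt for human judgment: whether a record may be collected still depends on the applicable privacy law and on whether the data is actually needed.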
The underlying perception that web scraping practices exist in a “gray area” often stems from a lack of clarity. Today’s legal landscape lacks a clear and easy-to-understand “one-stop-shop” guide to “unclouding” the issue and achieving full compliance.
Despite good intentions, the EU AI Act does not provide such a guide.
How AI impacts web scraping
The AI boom has once again brought the need for legal clarity into the spotlight. It has increased the demand for data, and the term “data scraping” has entered mainstream conversation. The amount of web scraping conducted by companies has skyrocketed, and unsurprisingly, this has put copyright issues in the spotlight.
However, there are some legitimate arguments in the US legal system, for example that the aggregation of public (copyrighted) data may fall under the fair use doctrine. If a company is transparent about the public data it uses and transforms it into something new, this could be considered fair use. According to a recent US case (the Anthropic case), one of the key conditions is that the work (in which public data is aggregated and used) is transformative.
Currently, fair use in the United States cannot be completely ruled out by contract. Fair use allows copyrighted material to be reused in completely new ways, where the result has been transformed from its original copyrighted state.
When doing this, companies need to be aware of several factors in order to act ethically and within the confines of current law. For example, courts consider the following to define fair use and rule on copyright infringement:
The nature of the copyrighted work: is it private or personal in any way?
How much of the copyrighted work is used?
Has the work been transformed?
What is the economic impact on the copyrighted work? Has the original work been affected?
When scraping public data to train an AI model, it’s important to remain vigilant and aware, regardless of location. The EU has both a database rights regime and the DSM Directive, which includes exemptions for text and data mining. Although legal systems vary, it is always important to evaluate the source of the data being used and the jurisdiction of your company to understand what rules apply and what the best course of action is to stay within them.
How can companies prepare for training on public data?
To ensure vigilance, all AI system adopters and providers should conduct a thorough risk assessment before deploying web data collection. This assessment should include understanding specific local regulations and ensuring that key personnel are fully aware of copyright, privacy, and other relevant laws.
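One way to make such a risk assessment repeatable is to record the answers for each planned data source in a simple structure. The following is a minimal sketch built around the three recurring legal issues described earlier (breach of contract, copyright, and personal data). The class, field names, and wording of the findings are hypothetical, and the output is a list of points to raise with legal counsel rather than a legal conclusion.

```python
from dataclasses import dataclass


@dataclass
class CollectionPlan:
    """Hypothetical description of one planned web data collection job."""
    source: str
    jurisdiction: str                 # e.g. "EU" or "US"
    accepted_terms_of_use: bool       # company holds accounts or agreed to terms
    contains_copyrighted_works: bool
    use_is_transformative: bool
    may_contain_personal_data: bool


def assess(plan: CollectionPlan) -> list[str]:
    """Return the risk areas that need legal review for this plan."""
    risks = []
    if plan.accepted_terms_of_use:
        risks.append("breach of contract")
    if plan.contains_copyrighted_works and not plan.use_is_transformative:
        risks.append("copyright infringement")
    if plan.may_contain_personal_data:
        risks.append("personal data")
    return risks


if __name__ == "__main__":
    plan = CollectionPlan(
        source="example-social-site",
        jurisdiction="EU",
        accepted_terms_of_use=True,
        contains_copyrighted_works=True,
        use_is_transformative=False,
        may_contain_personal_data=False,
    )
    print(plan.jurisdiction, assess(plan))
```

Keeping the jurisdiction alongside each plan reflects the point that the applicable rules depend on both the source of the data and where the company operates.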
Current laws and regulations regarding AI are incredibly fragmented, creating a difficult environment to navigate. A comprehensive understanding of these laws, including the EU AI Act and broader EU regulations, will enable businesses to implement seamless web data collection practices.
At the end of the day, the companies whose AI models will stand the test of time are those that not only build with compliance in mind, but truly build systems that can easily adapt to regulations.
Practical implementation of the EU AI Act
Unfortunately, businesses in the European Union still lack a comprehensive guide to web scraping. Instead, the Act sets out specific obligations for foundation model providers. The result is fragmented and unstable, with no clear path to success.
A thorough understanding of best practices along with risk assessment is key to success in this legal environment.
For technology in today’s world to remain as fair, ethical, and representative as possible, we must strive to ensure that public data remains open for AI training purposes. The Internet as a whole is a diverse dataset that, with appropriate legal guidance, can be harnessed to foster innovation.