The Wikimedia Foundation, the umbrella organization behind Wikipedia, said Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged 50% since January 2024.
The reason, the organization wrote in a blog post Tuesday, isn't growing demand from knowledge-hungry humans, but automated, data-hungry scrapers looking to train AI models.
“Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs,” the post reads.
Wikimedia Commons is a freely accessible repository of images, videos and audio files available under an open license or in the public domain.
Digging into the numbers, Wikimedia says that almost two-thirds (65%) of the most “expensive” traffic, meaning the most resource-intensive in terms of the kind of content consumed, comes from bots. Yet these bots account for only 35% of overall pageviews. According to Wikimedia, the reason for this disparity is that frequently accessed content stays close to users in its caches, while less frequently accessed content is stored farther away in “core data centers,” from which content is more expensive to serve. This is the kind of content bots typically go looking for.
“Human readers tend to focus on specific, often similar, topics, while crawler bots tend to ‘bulk read’ larger numbers of pages and also visit less popular pages,” Wikimedia writes. “This means these types of requests are more likely to be forwarded to the core data center, which makes them much more expensive in terms of resource consumption.”
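To see why a minority of pageviews can dominate costs, consider a minimal sketch in Python. The cache size, costs, and access distributions below are invented for illustration and do not reflect Wikimedia's actual infrastructure; the point is only that a crawler reading uniformly across the long tail misses the cache far more often than humans clustered on popular pages, so its per-request cost is much higher.

    import random

    # Illustrative sketch only: cache size, costs, and access distributions
    # are invented for this example, not Wikimedia's real numbers.
    CACHE_SIZE = 100                 # assume the 100 most popular pages sit in edge caches
    CACHED_COST, CORE_COST = 1, 10   # assume a core-data-center hit costs ~10x a cache hit

    def simulate(requests, skew):
        """Total serving cost for a given access pattern.
        skew > 1 concentrates requests on popular pages (human-like);
        skew == 1 spreads them uniformly across the long tail (bot-like)."""
        total_cost = 0
        for _ in range(requests):
            if skew > 1:
                page = int(random.paretovariate(skew))  # humans cluster on popular pages
            else:
                page = random.randint(1, 10_000)        # bots "bulk read" the long tail
            total_cost += CACHED_COST if page <= CACHE_SIZE else CORE_COST
        return total_cost

    random.seed(0)
    human_cost = simulate(65_000, skew=1.3)  # humans: 65% of pageviews
    bot_cost = simulate(35_000, skew=1)      # bots: 35% of pageviews
    print("bots' share of pageviews: 35%")
    print(f"bots' share of serving cost: {bot_cost / (human_cost + bot_cost):.0%}")

Under these made-up parameters, the bots' one-third of pageviews ends up responsible for the large majority of serving cost, the same qualitative disparity Wikimedia reports.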
The long and short of all this is that the Wikimedia Foundation's site reliability team must spend considerable time and resources blocking crawlers to avoid disruption for regular users. And that's before considering the cloud costs the foundation faces.
In truth, this is part of a fast-growing trend that threatens the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault lamented that AI crawlers ignore “robots.txt” files, which are designed to ward off automated traffic. And Gergely Orosz, who writes The Pragmatic Engineer newsletter, complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.
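For context, robots.txt is a plain-text file served from a site's root that asks crawlers to avoid some or all paths; compliance is entirely voluntary, which is exactly the failure DeVault describes. A minimal example (the bot name here is just one real-world illustration) looks like this:

    # Ask one AI crawler to stay out entirely
    User-agent: GPTBot
    Disallow: /

    # All other crawlers may fetch everything
    User-agent: *
    Disallow:

A well-behaved crawler fetches this file first and honors it; scrapers that ignore it can only be stopped by active blocking, which is where the site reliability work described above comes in.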
Open source infrastructure in particular is in the firing line, but as TechCrunch wrote last week, developers are fighting back with “cleverness and vengeance.” Some tech companies are doing their bit to address the issue, too. Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down.
But it’s very much a cat-and-mouse game, one that could ultimately force many publishers to duck for cover behind logins and paywalls.