AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

The Wikimedia Foundation, Wikipedia’s umbrella organization, said Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged 50% since January 2024.

The reason, the organization wrote in a blog post on Tuesday, is not growing demand from knowledge-hungry humans, but automated, data-hungry scrapers looking to train AI models.

“Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs,” the post reads.

Wikimedia Commons is a freely accessible repository of images, videos and audio files available under an open license or in the public domain.

Digging into the numbers, Wikimedia says that almost two-thirds (65%) of its most “expensive” traffic, that is, the most resource-intensive in terms of the type of content consumed, comes from bots. However, only 35% of overall pageviews come from these bots. According to Wikimedia, the reason for this disparity is that frequently accessed content stays close to users in a cache, while less frequently accessed content is stored further away in “core data centers,” where content is more expensive to serve. This is the kind of content that bots typically go looking for.

“Human readers tend to focus on specific (often similar) topics, while crawler bots tend to ‘read’ more pages and visit the less popular pages as well,” Wikimedia writes. “This means that these types of requests are more likely to be forwarded to the core data center, which makes them much more expensive in terms of resource consumption.”

The upshot of all this is that the Wikimedia Foundation’s site reliability team must spend a lot of time and resources blocking crawlers to avert disruption for regular users. And all of this is before we consider the cloud costs that the foundation faces.

In fact, this is part of a fast-growing trend that threatens the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault lamented the fact that AI crawlers ignore “robots.txt” files, which are designed to ward off automated traffic. And the “Pragmatic Engineer,” Gergely Orosz, also complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.

Open source infrastructure, in particular, is in the firing line, but developers are fighting back with “cleverness and vengeance,” as TechCrunch wrote last week. Some tech companies are doing their bit to address the issue, too. Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down.

But it is very much a cat-and-mouse game, one that could ultimately force many publishers to duck for cover behind logins and paywalls.
