The past 12 months have demonstrated the enormous capabilities enabled by public web data collection. However, it is clear that there is still room for growth in this industry in 2026.
It will be interesting to see how this year unfolds, with legal changes expected and legal battles looming in the AI industry, which depends on this data. One thing we can count on is that the basics of data collection remain more important than ever.
Below, top technology experts come together to share insights into how the data collection landscape is expected to evolve based on their industry expertise, revealing what 2026 could bring to business and AI around the world.
Fair use of copyrighted material
Denas Grybauskas, chief governance and strategy officer at Oxylabs, explained that “U.S. legal discussions, and potentially legal practice, will increasingly focus on the transformation of copyrighted works. The fair use doctrine allows for transformative uses of copyrighted works, meaning uses that add something new or have a different purpose or character.
“Many legal discussions will therefore focus on whether the use of content, including web content, for AI training constitutes sufficient transformative use to qualify as fair use.
“At the same time, where fair use principles do not apply (in jurisdictions such as the EU), the industry will need technical mechanisms for credit attribution and viable ways to compensate creators without compromising the openness of the web and the seamlessness of access to public information.”
Agent systems for data collection
Julius Černiauskas, CEO of Oxylabs, said: “The next year could see interesting developments in comprehensive agent systems for public data collection. Consider the process of web scraping, which consists of many small tasks. AI agents can automate these tasks.
“Together, they form a multi-agent system that can handle much of the process, reducing costs and democratizing access to public data by making it available without specialized skills or engineering teams.
“New tools and features that automate certain tasks are already coming to market constantly, and there will be more in the coming year.”
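To make the idea of splitting scraping into small tasks concrete, here is a minimal sketch in which each "agent" is a plain Python function that handles one step and passes shared state to the next. Everything in it is a hypothetical illustration rather than any vendor's product: the names `Job`, `fetch_agent`, `parse_agent`, `validate_agent` and `run_pipeline` are invented, the fetch step returns a hard-coded page instead of making a network request, and no real AI model is called.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    """State handed from one agent to the next (hypothetical structure)."""
    url: str
    html: str = ""
    record: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def fetch_agent(job: Job) -> Job:
    # Placeholder: a real agent would download job.url here.
    job.html = "<title>Example product</title>"
    job.log.append("fetch")
    return job


def parse_agent(job: Job) -> Job:
    # Placeholder: a real agent might ask an LLM to extract the fields.
    start = job.html.find("<title>") + len("<title>")
    end = job.html.find("</title>")
    job.record = {"title": job.html[start:end]}
    job.log.append("parse")
    return job


def validate_agent(job: Job) -> Job:
    # Reject jobs where the parsing step produced nothing usable.
    if not job.record.get("title"):
        raise ValueError(f"no title extracted from {job.url}")
    job.log.append("validate")
    return job


def run_pipeline(url: str, agents: list[Callable[[Job], Job]]) -> Job:
    """Run each agent in order, passing the job state along."""
    job = Job(url=url)
    for agent in agents:
        job = agent(job)
    return job


result = run_pipeline(
    "https://example.com/item",
    [fetch_agent, parse_agent, validate_agent],
)
print(result.record, result.log)
```

Keeping each step behind the same function signature is one possible design choice: it lets a single step be swapped for an automated tool without touching the others.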
Using LLMs for analysis
“Over the next 12 months, we will see an increase in the use of LLMs for analysis. Over the past few years, data analysis has been one of the most impactful AI use cases in public data collection,” said Juras Juršėnas, COO at Oxylabs.
“However, we were still limited by the price of LLM tokens and by prompt size constraints. Developers and data teams always had to clean up the HTML and reduce its size before passing it to an LLM for analysis. This required additional resources. Now they may only need to do this in certain cases.
“The choice of tools on the market that can do this is growing rapidly, so it is reasonable to expect that the use of LLMs for analysis will increase.”
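The clean-up step described above, shrinking HTML before it is sent to an LLM, can be illustrated with a minimal sketch that uses only Python's standard-library `html.parser`. The class `TextExtractor`, the function `reduce_html` and the list of skipped tags are assumptions made for illustration; real pipelines typically need more careful handling of page structure.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips tags that rarely carry content."""

    # Assumed list of tags to drop; adjust for the pages being processed.
    SKIPPED_TAGS = {"script", "style", "noscript", "svg"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped tag.
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def reduce_html(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Price</h1><p>  $19.99 </p></body></html>"
)
print(reduce_html(page))
```

The reduced text is much shorter than the raw markup, which is the point of the step: fewer tokens to pay for and less pressure on prompt size limits.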
Quality over quantity
Rytis Ulys, Head of Data and AI at Oxylabs, commented, “In 2026, data retrieval will focus on quality over quantity. Recent studies have shown that even small amounts of low-quality data can ruin an entire dataset.
“Furthermore, we found that beyond a certain point, adding low-quality data yields minimal benefit or even degrades performance compared to using a more targeted and relevant subset.
“That’s why the fundamentals of data collection will remain more important than ever. Robust tables and catalogs, quality and lineage, and low-latency query engines are now prerequisites for agentic retrieval rather than afterthoughts. Retrieval augmented with graphs and vectors is moving from blog posts to established patterns, observability extends to prompts, tools, and cost, and compliance is on the same plane as performance. Data isn’t going away; it is being promoted to the control surface of AI.”
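As a rough illustration of preferring a smaller, cleaner subset over raw volume, the sketch below drops very short records and near-verbatim duplicates from a list of text snippets. The function names `normalize` and `select_quality_subset` and the `min_words` threshold are hypothetical. This is a simple heuristic chosen for illustration, not a method described by the experts quoted here.

```python
def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def select_quality_subset(records, min_words=5):
    """Keep records that are long enough and not already seen."""
    seen = set()
    kept = []
    for text in records:
        key = normalize(text)
        if len(key.split()) < min_words:
            continue  # too short to be informative (assumed threshold)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(text)
    return kept


records = [
    "Buy now",
    "The laptop ships with 16 GB of RAM.",
    "the laptop  ships with 16 GB of RAM.",
    "Battery life is rated at ten hours.",
]
subset = select_quality_subset(records)
print(subset)
```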
Gain a better understanding of online data collection
Based on these insights, we can expect interesting developments in comprehensive agent systems for public data collection, growth in LLMs for analysis, and a shift toward quality over quantity in data retrieval.
In parallel, legal decisions on copyright will need to be made in both the U.S. and Europe over the next 12 months, as the current situation leaves many in uncertain territory.
In 2026, we hope to see new tools and features introduced that automate processes and improve understanding of web data collection and its role in everyday business operations.
