For one week this summer, Taylor and her roommate wore GoPro cameras on their foreheads as they painted, sculpted, and did chores. The footage they collected was used to train an AI vision model, and they carefully synchronized their cameras so the system could capture the same action from multiple angles. It was a difficult job in many ways, but they were well compensated for it. And that allowed Taylor to spend most of her day creating art.
“We woke up, went through our usual routine, then strapped the camera to our heads and synced the time,” she told me. “Then we made breakfast and washed the dishes. Then we went our separate ways and worked on our art.”
They were hired to produce five hours of synchronized footage each day, but Taylor soon realized they needed to set aside seven hours a day to leave enough time for rest and physical recovery.
“It would give you a headache,” she said. “When you take it off, all you’re left with is a red square on your forehead.”
Taylor, who asked that her last name not be used, worked as a data freelancer for Turing, the AI company that connected her with TechCrunch. Turing’s goal was not to teach AI how to make oil paintings, but rather to teach it more abstract skills in sequential problem solving and visual reasoning. Unlike large language models, Turing’s vision models are trained entirely on video, most of it collected directly by Turing.
Along with artists like Taylor, Turing contracts with chefs, construction workers, electricians and just about anyone who uses their hands. Sudarshan Sivaraman, chief AGI officer at Turing, told TechCrunch that manual collection is the only way to obtain a sufficiently diverse dataset.
“We’re doing this for so many different types of blue-collar jobs that we can get diverse data in the pre-training phase,” Sivaraman told TechCrunch. “Having all this information allows the model to understand how a particular task is performed.”
Turing’s work on vision models is part of a growing shift in the way AI companies handle data. Where training sets were once scraped freely from the web or assembled by poorly paid annotators, companies now pay top dollar for carefully curated data.
With the raw power of AI models already established, companies are turning to unique training data as a competitive advantage. And rather than relying on outside contractors to do the work, companies often take it on themselves.
One example is Fyxer, an email company that uses AI models to categorize emails and draft replies.
After some early experiments, founder Richard Hollingsworth discovered that the best approach was to use a series of small models with focused training data. Unlike Turing, Fyxer builds on others’ foundation models, but the underlying insight is the same.
“We realized that the quality of data, not the quantity, is what really defines performance,” Hollingsworth told me.
In practical terms, this meant unconventional personnel choices. In the early days, Hollingsworth said, the executive assistants Fyxer hired to train its models sometimes outnumbered engineers and managers by four to one.
“We hired a lot of experienced executive assistants because we needed them to teach the model the basics of whether or not to respond to an email,” he told TechCrunch. “This is a very people-oriented problem. It’s very difficult to find good people.”
Data collection never slowed down, but over time Hollingsworth became more selective about it, preferring smaller, more tightly curated datasets for post-training. As he puts it, it’s the quality of data, not the quantity, that really defines performance.
This is especially true when synthetic data is used, since it amplifies both the range of possible training scenarios and any defects in the original dataset. On the vision side, Turing estimates that 75% to 80% of its data is synthetic, extrapolated from the original GoPro video, which makes it all the more important to keep the original dataset as high quality as possible.
“If the pre-training data itself isn’t good quality, then whatever you do with the synthetic data won’t be good quality,” Sivaraman says.
Beyond quality concerns, there is a strong competitive logic behind keeping data collection in-house. For Fyxer, its data collection efforts are one of the company’s best moats against competition. In Hollingsworth’s view, anyone can incorporate an open source model into their product, but not everyone can find the expert annotators needed to train it into a viable product.
“We believe the best way to do that is through data,” he told TechCrunch. “Through custom model building and human-driven, high-quality data training.”
Correction: An earlier version of this article incorrectly referred to Turing by name. TechCrunch regrets this mistake.