The promise and dangers of synthetic data
Is it possible for an AI to be trained only on data generated by another AI? It may seem like a foolhardy idea. But it has been around for some time, and as new real data becomes more difficult to obtain, it has gained momentum.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for its upcoming Orion.
But why does AI need data in the first place, and what kind of data does it need? And can that data really be replaced by synthetic data?
The importance of annotations
AI systems are statistical machines. Trained on lots of examples, they learn the patterns in those examples to make predictions, such as that the phrase “to whom” in an email typically precedes “it may concern.”
Annotations, usually text describing the meaning of or the parts in the data these systems ingest, are a key piece of these examples. They serve as guideposts, “teaching” a model to distinguish among things, places, and ideas.
Consider a photo classification model that’s shown lots of pictures of kitchens labeled with the word “kitchen.” As it trains, the model will begin to make associations between “kitchen” and general characteristics of kitchens (e.g., that they contain refrigerators and countertops). After training, given a photo of a kitchen that wasn’t included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which underscores the importance of good annotation.)
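To make the mechanics concrete, here is a minimal, self-contained sketch (in Python, using PyTorch) of a classifier learning from labeled examples. The toy data and tiny network are illustrative stand-ins, not any vendor’s actual model; the point is that the model learns whatever association the labels encode.

```python
# A minimal sketch of how labels steer what a classifier learns.
# Random tensors stand in for real photos; the model only ever sees
# (image, label) pairs and learns whatever the labels encode.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "photos": 64 fake kitchen images and 64 fake non-kitchen images.
# In a real pipeline these would come from an annotated dataset.
kitchens = torch.randn(64, 3, 32, 32) + 0.5   # shifted so the classes are separable
others = torch.randn(64, 3, 32, 32) - 0.5
images = torch.cat([kitchens, others])
labels = torch.cat([torch.ones(64, dtype=torch.long),    # 1 = "kitchen"
                    torch.zeros(64, dtype=torch.long)])  # 0 = "other"

# A deliberately tiny classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()

# The model associates whatever pattern co-occurs with the label "kitchen".
# Swap the labels (label the kitchens "cow") and it learns exactly that
# association instead: garbage labels in, garbage predictions out.
test = torch.randn(1, 3, 32, 32) + 0.5
print(model(test).argmax(dim=1))  # expected: tensor([1]), i.e. "kitchen"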
The demand for AI, and the need to provide labeled data for its development, has caused the market for annotation services to explode. Dimension Market Research estimates that it’s worth $838.2 million today, and that it will be worth $10.34 billion in the next 10 years. While there are no precise estimates of how many people engage in labeling work, a 2022 paper pegs the number in the “millions.”
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g., math expertise). Others can be backbreaking: annotators in developing countries are paid only a few dollars per hour on average, without any benefits or guarantees of future gigs.
A drying data well
So there are humanistic reasons to seek out alternatives to human-generated labels. For example, Uber is expanding its fleet of gig workers to work on AI annotation and data labeling. But there are also practical ones.
Humans can only label so fast. Annotators also have biases that can manifest in their annotations, and, in turn, in any models trained on them. Annotators make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.
Data in general is expensive, for that matter. Shutterstock is charging AI vendors tens of millions of dollars to access its archives, while Reddit has made hundreds of millions from licensing data to Google, OpenAI, and others.
Finally, data is also becoming more difficult to obtain.
Most models are trained on massive collections of public data, data that owners are increasingly choosing to gate over fears it will be plagiarized or that they won’t receive credit or attribution for it. More than 35% of the world’s top 1,000 websites now block OpenAI’s web scraper. And around 25% of data from “high-quality” sources has been restricted from the major datasets used to train models, one recent study found.
Should the current access-blocking trend continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright lawsuits and objectionable material making their way into open datasets, has forced a reckoning for AI vendors.
Synthetic alternatives
At first glance, synthetic data may seem like the answer to all these problems. Need annotations? Generate them. More example data? No problem. The sky’s the limit.
To some extent, this is true.
“If data is the new oil, then synthetic data presents itself as a biofuel, one that can be created without the negative externalities of the real thing,” Os Keyes, a doctoral student at the University of Washington who studies the ethical impact of emerging technologies, told TechCrunch. “You can take a small starting set of data and simulate and extrapolate new entries from it.”
The AI industry has taken this concept and run with it.
This month, Writer, an enterprise-focused generative AI company, released a model, Palmyra X 004, trained almost entirely on synthetic data. Writer claims its development cost just $700,000, compared to estimates of $4.6 million for a comparably sized OpenAI model.
Microsoft’s Phi open models were trained using synthetic data, in part. So were Google’s Gemma models. Nvidia this summer unveiled a model family designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.
Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.
Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format that’s not easily obtained through scraping (or even content licensing). For example, in training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, like descriptions of the lighting.
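As an illustration of that general recipe (an LLM drafts captions, humans refine them), here is a hypothetical sketch. The model name, prompt, and helper function are assumptions made for the example, not details of Meta’s actual pipeline.

```python
# A hypothetical sketch of the general technique described above: an LLM
# drafts a rich caption from terse shot notes, and a human reviewer then
# refines it before the (footage, caption) pair enters the training set.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed; any instruct model works
)

def draft_caption(shot_notes: str) -> str:
    """Expand sparse shot notes into a detailed draft caption."""
    prompt = (
        "Write a detailed one-sentence caption for a video shot, "
        "covering subjects, motion, framing, and lighting.\n"
        f"Shot notes: {shot_notes}\nCaption:"
    )
    out = generator(prompt, max_new_tokens=80, return_full_text=False)
    return out[0]["generated_text"].strip()

draft = draft_caption("woman walking dog, park, morning")
# A human annotator then edits the draft (e.g., correcting the lighting
# description) before it is used for training.
print(draft)
```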
In the same vein, OpenAI says it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said that it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.
“Synthetic data models can be used to quickly expand upon human intuition of which data is needed to achieve a specific model behavior,” Soldaini said.
Synthetic risks
Synthetic data is no panacea, however. It suffers from the same “garbage in, garbage out” problem as all AI. Models create synthetic data, and if the data used to train these models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be poorly represented in the synthetic data as well.
“The problem is, there’s only so much you can do,” Keyes said. “Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class, or all light-skinned, that’s what the ‘representative’ data will all look like.”
To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” Sampling bias (poor representation of the real world) causes a model’s diversity to worsen after a few generations of training, according to the researchers, although they also found that mixing in a bit of real-world data helps to mitigate this.
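That mitigation is easy to picture in code. Below is a minimal sketch of the idea, with the mixing ratio chosen arbitrarily for illustration rather than taken from the study.

```python
# A minimal sketch of the mitigation the researchers describe: each
# generation of training data keeps a fixed fraction of fresh real-world
# samples alongside synthetic ones, rather than training on synthetic
# output alone.
import random

def build_training_set(real_pool, synthetic_pool, size, real_fraction=0.2):
    """Mix real and synthetic examples. real_fraction is an assumed knob
    for illustration, not a value from the study."""
    n_real = int(size * real_fraction)
    n_synth = size - n_real
    batch = random.sample(real_pool, n_real) + random.sample(synthetic_pool, n_synth)
    random.shuffle(batch)
    return batch

real = [f"real_{i}" for i in range(1000)]
synthetic = [f"synth_{i}" for i in range(5000)]
train = build_training_set(real, synthetic, size=500)
print(sum(x.startswith("real_") for x in train))  # exactly 100 of 500 are real
```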
Keyes sees additional risks in complex models such as OpenAI’s o1, which he thinks could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on the data, especially if the hallucinations’ sources aren’t easy to identify.
“Complex models hallucinate; data produced by complex models contain hallucinations,” Keyes added. “And with a model like o1, the developers themselves can’t necessarily explain why artifacts appear.”
Compounded hallucinations can lead to gibberish-spewing models. A study published in the journal Nature reveals how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Models lose their grasp of more esoteric knowledge over generations, the researchers found, becoming more generic and often producing answers irrelevant to the questions they’re asked.

A follow-up study shows that other types of models, such as image generators, aren’t immune to this sort of collapse.

Soldaini agrees that “raw” synthetic data isn’t to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it “safely,” he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data, just as you’d do with any other dataset.
Failing to do so could eventually lead to model collapse, where a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality. Though this process could be identified and arrested before it gets serious, it is a risk.
“Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not a self-improving machine; their output must be carefully inspected and improved before being used for training.”
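To give a flavor of what such safeguards might look like, here is a deliberately simple sketch of rule-based filtering applied to generated text before training. The thresholds and heuristics are assumptions; production pipelines typically add model-based quality scoring and fuzzy deduplication on top.

```python
# A sketch of the kind of safeguards Soldaini describes: simple quality
# filters applied to generated text before it is admitted to a training
# set. Thresholds here are illustrative assumptions, not known values.
def passes_filters(sample: str, seen: set) -> bool:
    words = sample.split()
    if len(words) < 5:                       # too short to be informative
        return False
    if len(set(words)) / len(words) < 0.5:   # degenerate repetition
        return False
    if sample in seen:                       # exact duplicate
        return False
    return True

generated = [
    "The quick brown fox jumps over the lazy dog near the river.",
    "yes yes yes yes yes yes",
    "Too short.",
]
seen: set = set()
kept = []
for s in generated:
    if passes_filters(s, seen):
        seen.add(s)
        kept.append(s)
print(kept)  # only the first sample survives
```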
OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself. But, assuming that’s even feasible, the tech doesn’t exist yet. No major AI lab has released a model trained on synthetic data alone.
At least for the foreseeable future, it seems we’ll need humans in the loop somewhere to make sure a model’s training doesn’t go awry.
Update: This story was originally published on October 23 and was updated on December 24 with more information.