Elon Musk agrees that we have exhausted AI training data
Elon Musk agrees with other AI experts that there is little real-world data left to train AI models on.
“We’ve now basically exhausted the cumulative sum of human knowledge … in AI training,” Musk said during a conversation with Stagwell CEO Mark Penn live-streamed on X late Wednesday. “That happened basically last year.”
Musk, who owns the artificial intelligence company xAI, echoed themes that former OpenAI chief scientist Ilya Sutskever touched on at NeurIPS, the machine learning conference, during a talk in December. Sutskever, who said the AI industry had reached what he called “peak data,” predicted that the lack of training data will force a shift away from the way models are developed today.
Indeed, Musk suggested that synthetic data, data generated by AI models themselves, is the way forward. “The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” he said. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning.”
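The loop Musk describes, in which a model generates its own training data and then learns from it, can be sketched with a toy bigram "model." Everything below, from the seed corpus to the function names, is purely illustrative and bears no relation to any real training pipeline (and it omits the self-grading step):

```python
import random
from collections import defaultdict

def train_bigrams(sentences):
    """Count word-to-next-word transitions ("training" the toy model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = s.split()
        for a, b in zip(words, words[1:] + ["<end>"]):
            counts[a][b] += 1
    return counts

def generate(counts, start, max_len=10, rng=None):
    """Sample a synthetic sentence from the toy model."""
    rng = rng or random
    words = [start]
    for _ in range(max_len):
        nxt = counts.get(words[-1])
        if not nxt:
            break
        choices, weights = zip(*nxt.items())
        word = rng.choices(choices, weights=weights)[0]
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

seed_corpus = [
    "models learn from data",
    "models learn from examples",
    "synthetic data trains models",
]
rng = random.Random(0)
model = train_bigrams(seed_corpus)
# The model generates its own training data...
synthetic = [generate(model, "models", rng=rng) for _ in range(5)]
# ...and is retrained on the combined real + synthetic corpus.
model = train_bigrams(seed_corpus + synthetic)
```

Note that the synthetic sentences can only recombine words already present in the seed corpus, which hints at the diversity problem discussed further below.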
Other companies, including technology giants such as Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated.
Microsoft's Phi-4, which was open-sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google's Gemma models. Anthropic used some synthetic data to develop one of its best-performing systems, Claude 3.5 Sonnet. And Meta fine-tuned its latest Llama series of models using AI-generated data.
Training on synthetic data has other advantages, such as cost savings. AI startup Writer claims that its Palmyra X 004 model, built almost entirely from synthetic sources, cost just $700,000 to develop, compared with estimates of $4.6 million for a comparably sized OpenAI model.
But there are drawbacks as well. Some research suggests that synthetic data can lead to model collapse, where a model becomes less “creative,” and more biased, in its outputs, eventually seriously compromising its functionality. Because models create the synthetic data, if the data used to train those models contains biases and limitations, their outputs will be similarly tainted.
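The collapse mechanism can be illustrated with a deterministic toy: treat a model's output as a probability distribution over tokens, and assume that at each "generation" of retraining on its own samples, rare tokens are under-represented (modeled here by simply dropping anything below a cutoff and renormalizing). The distribution, cutoff value, and token list are all invented for illustration:

```python
def next_generation(dist, cutoff=0.05):
    """Drop tokens the model rarely emits, then renormalize.

    This mimics how tail events vanish when a model retrains on its own
    samples: the head of the distribution is re-amplified each round.
    """
    survivors = {tok: p for tok, p in dist.items() if p >= cutoff}
    total = sum(survivors.values())
    return {tok: p / total for tok, p in survivors.items()}

# A toy output distribution with a long tail of rare tokens.
dist = {"the": 0.4, "a": 0.3, "some": 0.15, "quaint": 0.08,
        "lugubrious": 0.04, "sesquipedalian": 0.03}
history = [dist]
for _ in range(5):
    dist = next_generation(dist)
    history.append(dist)

print(len(history[0]), "tokens initially ->", len(history[-1]), "after 5 generations")
# → 6 tokens initially -> 4 after 5 generations
```

The rare tokens never come back once lost, and the most common token's share only grows, which is the "less creative, more biased" failure mode the research describes.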