OpenAI’s o3 suggests AI models are scaling in new ways, but so are the costs

Last month, AI founders and investors told TechCrunch that we are now in a “second era of scaling laws,” noting that established approaches to improving AI models were showing diminishing returns. One promising new method they suggested could keep the gains coming is “test-time scaling,” which seems to be what’s behind the performance of OpenAI’s o3 model, but it comes with drawbacks of its own.

Much of the AI world took the announcement of OpenAI’s o3 model as proof that progress in AI scaling has not “hit a wall.” The o3 model does well on benchmarks, significantly outperforming all other models on the ARC-AGI general ability test and scoring 25% on a difficult math test on which no other AI model scored more than 2%.

Of course, we at TechCrunch are taking all of this with a grain of salt until we can test o3 for ourselves (very few outside OpenAI have tried it so far). But even before o3’s release, the AI world was already convinced that something big had shifted.

Noam Brown, a co-creator of OpenAI’s o-series models, noted on Friday that the startup is announcing o3’s impressive gains just three months after it announced o1, a relatively short window for such a jump in performance.

“We have every reason to believe this path will continue,” Brown said in a tweet.

Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence that AI progress will be faster in 2025 than in 2024. (Keep in mind that it benefits Anthropic, especially its ability to raise capital, to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.)

Next year, Clark says, the AI world will combine test-time scaling with traditional pre-training scaling methods to squeeze even more gains out of AI models. Perhaps he’s suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.

Test-time scaling means that OpenAI is using more compute during ChatGPT’s inference phase, the period after you hit enter on a prompt. It’s not clear exactly what is happening behind the scenes: OpenAI is either using more computer chips to answer a user’s question, running more powerful inference chips, or running those chips for longer periods of time, 10 to 15 minutes in some cases, before the AI produces an answer. We don’t know all the details of how o3 was made, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.
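OpenAI hasn’t detailed how o3 spends that extra inference compute. But to give a concrete flavor of the general idea, one widely used form of test-time scaling is best-of-n sampling: generate several candidate answers and keep the one a scoring function rates best. The Python sketch below illustrates that generic pattern only; `generate` and `score` are hypothetical stand-ins for a model’s sampler and answer-ranker, not anything OpenAI has published about o3.

```python
import random
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    # Spend roughly n times the inference compute: sample n candidate
    # answers, then keep the one the scorer rates highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy demo: the "model" guesses a number and the scorer prefers answers
# close to 42. Raising n buys a better answer at a higher compute cost.
toy_generate = lambda p: str(random.randint(0, 100))
toy_score = lambda p, a: -abs(int(a) - 42)
print(best_of_n("pick a number near 42", toy_generate, toy_score, n=32))
```

The trade-off is the same one the benchmarks hint at: every extra candidate (or extra minute of “thinking”) is extra compute spent per answer.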

While o3 may give some renewed faith in the progress of AI scaling laws, OpenAI’s latest model also uses an unprecedented level of computing, which means a higher price per answer.

“Perhaps the only important caveat here is to understand that one of the reasons o3 is so much better is that it costs more money to run at inference time: the ability to use test-time compute means that on some problems you can turn compute into a better answer,” Clark wrote on his blog. “This is interesting because it has made the running costs of AI systems somewhat less predictable; previously, you could work out the cost of serving a generative model just by looking at the model and the cost to generate a given output.”

Clark and others pointed to o3’s performance on the ARC-AGI benchmark, a difficult test used to assess breakthroughs in artificial general intelligence, as an indicator of its progress. It’s worth noting that, according to its creators, passing this test does not mean an AI model has achieved AGI; rather, it’s one way to measure progress toward that nebulous goal. That said, the o3 model blew past the scores of all previous AI models that took the test, scoring 88% in one of its attempts. OpenAI’s next best AI model, o1, scored just 32%.

Chart showing the performance of OpenAI’s o-series models on the ARC-AGI test. Image credits: ARC Prize

But the logarithmic x-axis on this chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.

François Chollet, the creator of the ARC-AGI benchmark, wrote in a blog post that OpenAI used roughly 170 times more compute to generate that 88% score than the high-efficiency version of o3, which scored just 12% lower. The high-scoring version of o3 used more than $10,000 in resources to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
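For a rough sense of what that trade-off means per task, here is a back-of-the-envelope calculation using only the figures quoted above (the ~$1,000-per-task cost, the ~170x compute gap, and the 12-point score difference). The per-task dollar figures are approximations read off the ARC Prize chart, not official OpenAI pricing.

```python
# Back-of-the-envelope numbers from the figures quoted in this article;
# per-task costs are approximations from the ARC Prize chart, not
# official OpenAI pricing.
high_cost_per_task = 1000.0          # high-compute o3: ~$1,000+ per task
compute_ratio = 170                  # ~170x the efficient config's compute
low_cost_per_task = high_cost_per_task / compute_ratio  # roughly $6/task

high_score, low_score = 88.0, 76.0   # efficient o3 scored ~12 points lower
extra_cost_per_point = (high_cost_per_task - low_cost_per_task) / (high_score - low_score)

print(f"Efficient o3:    ~${low_cost_per_task:.2f}/task at {low_score:.0f}%")
print(f"High-compute o3: ~${high_cost_per_task:,.0f}/task at {high_score:.0f}%")
print(f"~${extra_cost_per_point:,.0f} of extra compute per task per benchmark point")
```

By this rough math, each additional benchmark point cost on the order of $80 of extra compute per task, which is exactly the kind of unpredictable inference bill Clark is describing.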

Despite the cost, Chollet says o3 was still a major breakthrough for AI models.

“o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain,” Chollet said in the blog post. “Of course, such generality comes at a steep cost and wouldn’t quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy.”

It’s premature to talk about exact pricing for any of this: we’ve seen prices for AI models plummet in the last year, and OpenAI has yet to announce how much o3 will actually cost. However, these figures indicate just how much compute is required to break, even slightly, past the performance barriers set by today’s leading AI models.

All of this raises some questions. What is o3 actually for? And how much more compute will be necessary to make further gains around inference with o4, o5, or whatever else OpenAI names its next reasoning models?

It doesn’t seem like o3, or its successors, would be anyone’s “daily driver” the way GPT-4o or Google Search might be. These models simply use too much compute to answer the small questions that come up throughout your day, such as “How can the Cleveland Browns still make the 2024 playoffs?”

Instead, it seems that AI models with scaled-up test-time compute may only be good for big-picture prompts such as “How can the Cleveland Browns become a Super Bowl franchise in 2027?” Even then, it’s probably only worth the high compute costs if you’re the general manager of the Cleveland Browns and you’re using these tools to make big decisions.

Institutions with deep pockets may be the only ones that can afford o3, at least to start, as Wharton professor Ethan Mollick noted in a tweet.

We’ve already seen OpenAI release a $200 tier for access to a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing up to $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it.

But there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails at some very easy tasks that a human would handle with ease.

That isn’t necessarily surprising, as large language models still have a serious hallucination problem, one that o3 and test-time compute don’t seem to have solved. It’s why ChatGPT and Gemini include disclaimers below every answer they produce, asking users not to trust their answers at face value. Presumably AGI, if it’s ever reached, would not need such a disclaimer.

One way to unlock further gains in test-time scaling could be better AI inference chips. There’s no shortage of startups tackling exactly this, such as Groq and Cerebras, while others are designing more cost-efficient AI chips, such as MatX. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch that he expects these startups to play a bigger role in test-time scaling going forward.

While o3 is a notable improvement to the performance of AI models, it raises plenty of new questions around usage and costs. That said, o3’s performance does add credence to the claim that test-time compute is the tech industry’s next best way to scale AI models.
