Test-Time Compute: Thinking, (Fast and) Slow
Written by Will Horyn

Artificial intelligence continues to evolve at an astounding pace, with new models, applications, and research breakthroughs arriving seemingly every day. Among the hotly debated subjects within this space is the question of scaling laws – that is, the notion that throwing more data and compute at a larger model increases performance – and whether performance is asymptoting as we run out of abundant real data for pre-training. While this remains something of an open question (larger models continue to push performance and synthetic data may play a key role in addressing the shortage of fresh real data), what has become apparent is that the pre-training phase is only one of multiple axes upon which models can be scaled to improve performance. In fact, the driver behind the next wave of breakthroughs might not be a bigger model or more training data at all, but rather how we use compute after the model is trained.
In this post, we’ll explore this concept of Test-Time Compute (TTC, also known as Inference-Time Compute), the paradigm that’s reshaping AI and powering the recent slew of “reasoning” models such as Google’s Gemini 2.5 Pro, OpenAI’s o1, and Anthropic’s Claude 3.7 Sonnet by adding a layer of “think time” during inference. We’ll break down what TTC is, how it contrasts with traditional Training-Time Compute, why it’s gaining traction now, what it means for startups and industry leaders alike, and how it may evolve going forward. We’ll also highlight performance data and real-world examples, including a brief look at how companies like DeepSeek are innovating in this space.
What is Test-Time Compute?
Imagine if every time you answered a question, you could take a moment to double-check your answer, revise it if necessary, or even run through several possible solutions before deciding on the best one. That’s exactly what test-time compute allows an AI model to do. Instead of generating an answer in one fast, single pass (often compared to a quick “gut reaction”), TTC gives the model extra computation time during the inference phase – when it’s actually processing your question – to refine its output.
This “extra thinking time” is sometimes described as the model engaging in “System 2” thinking – a cognitive psychology concept popularized in Daniel Kahneman’s book Thinking, Fast and Slow, where slow, deliberate reasoning complements quick, intuitive judgments (or “System 1” thinking). With TTC, an AI might generate several candidate answers, review them internally, and even adjust its final output based on a deeper analysis of the problem. For example, instead of answering a math problem in one go, the model may first outline the required steps and then use a verification mechanism to ensure accuracy.
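As a concrete illustration of that last example, here is a minimal sketch of a generate-then-verify loop. The llm() helper, the prompts, and the retry count are hypothetical assumptions rather than any specific vendor’s implementation; the point is simply that a second “checking” pass spends extra inference compute before an answer is accepted.

```python
def llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for a call to whichever model API you use."""
    raise NotImplementedError

def answer_with_verification(question: str, max_attempts: int = 3) -> str:
    draft = ""
    for _ in range(max_attempts):
        # "System 2" pass: ask for the reasoning steps, not just the answer.
        draft = llm(f"Solve step by step, then state a final answer.\n\n{question}")
        # Verification pass: a second call critiques the draft before it is accepted.
        verdict = llm(
            f"Question: {question}\n\nProposed solution:\n{draft}\n\n"
            "Check each step. Reply VALID or INVALID with a short reason."
        )
        if verdict.strip().upper().startswith("VALID"):
            return draft
    return draft  # fall back to the last draft if no attempt passes verification
```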
Research has shown that when models use additional inference steps, their performance on complex tasks improves significantly. Look no further than OpenAI’s original o1 paper, which demonstrated material improvement across a variety of benchmarks by using Reinforcement Learning to tune the model for sophisticated reasoning capabilities, allowing it to think and review before answering. This dramatic boost in performance underscores the potential of test-time compute to make AI not just bigger, but smarter.
How Does Test-Time Compute Differ from Training-Time Compute?
Historically, the AI community has largely focused on scaling up models through training-time compute, pouring massive amounts of data and computational power into the training process – one that can cost millions of dollars and require weeks or even months of GPU time. At a high level, you can think of training as building an engine, and inference as the engine running. Once trained, these models serve predictions in a single, rapid pass during inference. Think of it like cramming for an exam: you absorb as much information as possible during study time, and then on exam day you regurgitate what you learned as quickly as possible.
Test-Time Compute, by contrast, shifts some of that computational effort to the inference phase, or the process during which a model generates a response to a query based upon how it has been trained. Instead of relying solely on pre-trained knowledge and a single pass, models are given extra “brainpower” on the spot to work through complex problems. This means that a model can:
- Allocate additional computation dynamically: For a simple query, it might answer immediately, but for a tougher problem, it can take extra steps to reason through the answer (for example by using a technique such as Chain-of-Thought prompting)
- Adjust based on difficulty: Harder questions trigger more internal processing, similar to how a person might pause longer on a challenging math problem
- Incorporate feedback loops: The model can generate multiple candidate responses and then use a verification step (or even external tools) to choose the best answer (see the sketch after this list)
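That last point is easy to picture in code. Below is a minimal sketch of best-of-N sampling with a simple majority vote over final answers, in the spirit of self-consistency; the llm() helper, the prompt format, and the sample count are illustrative assumptions, not any particular vendor’s implementation.

```python
from collections import Counter

def llm(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for a call to whichever model API you use."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 8) -> str:
    finals = []
    for _ in range(n_samples):
        # Sample with nonzero temperature so the candidate reasoning paths differ.
        draft = llm(f"Solve step by step. End with 'ANSWER: <value>'.\n\n{question}")
        finals.append(draft.rsplit("ANSWER:", 1)[-1].strip())
    # The answer reached by the most independent reasoning paths wins the vote.
    return Counter(finals).most_common(1)[0][0]
```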
The trade-offs are clear. While training a giant model is generally a non-recurring, upfront expense (albeit a costly one), test-time compute introduces ongoing operational costs (comparable to shifting dollars from capex to opex). Each query might require more processing power and time, which can increase latency and potentially drive up cloud computing bills. However, proponents argue that the benefits – such as dramatically improved performance on difficult tasks and more adaptable, context-aware responses – can outweigh these downsides.
In essence, training-time compute is about building a vast repository of knowledge, while test-time compute is about making that knowledge work smarter when it matters most. This shift allows developers to get more mileage out of models without needing to continuously invest in ever-larger training runs.
Why Test-Time Compute Matters Today
The era of “bigger is always better” in AI is showing signs of strain. Despite scaling up models to hundreds of billions (and even >1 trillion) of parameters, many of the gains from these enormous training runs are starting to plateau. There’s a growing recognition that simply adding more data and compute during training isn’t yielding the explosive improvements we once saw, and many are reasonably questioning whether these marginal improvements justify the required investment.
Diminishing Returns on Training
Recent sentiment within the AI industry reflects that while increasing training compute has historically led to significant performance gains, these improvements tend to level off over time. OpenAI Co-Founder Ilya Sutskever echoed this notion late last year: “The 2010s were the age of scaling, now we’re back in the age of wonder and discovery once again. Everyone is looking for the next thing. Scaling the right thing matters more now than ever.” Similarly, Sam Altman himself acknowledged that their most recent model (GPT-4.5) “won’t crush benchmarks.” To be fair, GPT-4.5 is not a reasoning model and, as such, shouldn’t be compared to models such as Gemini 2.5 Pro, but its performance is not dramatically improved over other comparable models such as Gemini 2.0 Pro Experimental, and the model was reportedly expensive to train. This is where test-time compute enters the picture – by investing compute at inference, models can continue to improve their accuracy and reliability even when further pre-training gains have stalled.
Real-Time Adaptability and Efficiency
In today’s fast-paced world, adaptability is key. Modern AI systems need to respond to new information, adjust to context, and make decisions in real time. Test-time compute offers a way for models to remain agile, drawing on extra compute power when facing unfamiliar or complex queries. This dynamic approach is particularly important for applications like virtual assistants or autonomous systems, such as AI agents, where every second of additional reasoning can make a meaningful difference in performance.
Industry Endorsements and Trends
The shift toward test-time compute isn’t just an academic idea. Industry leaders like OpenAI, Anthropic, xAI, Google DeepMind, and NVIDIA are investing heavily in this area. Anthropic’s recent Claude 3.7 Sonnet model, for example, employs multi-step reasoning visible to the user during inference to boost accuracy, setting a new benchmark for how inference-time processes can enhance AI capabilities. Many of these model developers now have lines of reasoning models dedicated to this technological breakthrough (e.g., Google’s Flash Thinking and OpenAI’s o-Series).
Data-Driven Success Stories
In addition to the o1 paper referenced above, research published earlier this year indicates that smaller models scaled at inference with TTC can outperform vastly larger models. The authors specifically demonstrate that a 1B parameter model can outperform a 405B parameter model on certain benchmarks, such as MATH-500, when TTC is utilized. Another research paper shows that introducing a simple delay in a model’s thinking by appending the word “wait” to its generation causes it to double-check its work and sometimes revise answers. This enabled a 32B model to outperform o1-preview (estimated to be ~10x as large) on competition math questions by up to 27%.
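A rough sketch of that “wait” trick (referred to as budget forcing in the paper) might look like the following. The generate() helper, the stop string, and the token budget are hypothetical, illustrative assumptions; the core idea is simply to keep nudging the model to continue thinking until a minimum reasoning budget has been spent.

```python
def generate(text: str, stop: str, max_tokens: int) -> str:
    """Placeholder for a raw-completion API call that continues `text` until `stop`."""
    raise NotImplementedError

def answer_with_budget_forcing(question: str, min_thinking_tokens: int = 512) -> str:
    trace = f"Question: {question}\nReasoning:\n"
    spent = 0
    while spent < min_thinking_tokens:
        chunk = generate(trace, stop="Final answer", max_tokens=min_thinking_tokens)
        trace += chunk
        spent += len(chunk.split())  # crude word-count proxy for tokens in this sketch
        if spent < min_thinking_tokens:
            trace += "\nWait,"       # nudge the model to re-examine its work and continue
    # Once the thinking budget is spent, ask for the answer conditioned on the full trace.
    return generate(trace + "\nFinal answer:", stop="\n", max_tokens=64)
```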
DeepSeek, a Chinese AI startup, provides another example of the power of leveraging TTC with the R1 model it released earlier this year, which (similar to OpenAI with o1) used Reinforcement Learning techniques (such as Group Relative Policy Optimization, combined with Supervised Fine-Tuning) to fine-tune a pre-trained base model for reasoning capabilities. While the headline cost of ~$6M is broadly understood to be understated, it is nonetheless noteworthy that DeepSeek was able to significantly improve model performance to approach that of OpenAI’s o1 while using fewer resources and ultimately open-sourcing key elements.
Google’s recently released Gemini 2.5 Pro reasoning model cements the power of TTC, topping popular LLM leaderboards such as LiveBench and LMArena. These examples are part of a broader trend where the focus is shifting from merely scaling model size to improving inference-time strategies.
The Business and Startup Perspective
For startups, the promise of test-time compute is compelling. Training enormous AI models from scratch is not only expensive but also requires access to cutting-edge hardware and vast amounts of data – resources that are often out of reach for newer companies. Instead, by leveraging pre-trained models and enhancing them with test-time compute, startups can achieve competitive performance without the astronomical upfront costs (see DeepSeek above).
Lower Upfront Costs and Greater Flexibility
Startups can adopt an approach where they take a robust pre-trained system and then invest in smart inference algorithms that allow the model to “think on the fly.” This approach is more cost-effective because it avoids the massive one-time expense of training a gigantic model from scratch. Instead, compute is allocated dynamically – only when needed, and only to the extent required by each specific query.
This model also lends itself to faster iteration. When the need arises to address a new type of query or improve performance on an edge case, startups can update their inference strategies (like refining the candidate generation or tweaking the verification process) rather than undergoing a full re-training cycle. This agile methodology can lead to shorter development cycles and quicker time-to-market.
To expand on a point from the Data-Driven Success Stories section above, companies can start with a smaller model that is both cheaper to train and serve, and by leveraging TTC, achieve comparable, if not better, performance than a larger pre-trained model. A “mid-sized” model such as Claude 3.5 Sonnet may cost tens of millions of dollars to train, while a smaller model (e.g. 3B parameters) can be trained for less than $1M. Further, the output token cost of a smaller model will be cheaper than that of a larger one due to the reduced computational resources required to generate a response. For example, o3-mini output token cost is ~34x cheaper than that of GPT-4.5 per OpenAI’s pricing page. What is striking is that despite the significantly cheaper per-token costs, o3-mini outperforms GPT-4.5 because it is a reasoning model that leverages TTC principles and techniques.
The catch, however, is the number of output tokens required for a reasoning model to serve a response. While the upfront training cost and per-token output cost may be significantly cheaper with a smaller model, the number of tokens generated at inference with TTC is multiples higher than with a traditional pre-trained model. In his recent GTC keynote, NVIDIA CEO Jensen Huang illustrated with a simple example (starts at 59:20) that a reasoning model generated 20x as many tokens (while using 150x as much compute) as a traditional LLM to answer the same question. It is also worth noting, however, that the traditional LLM got it wrong. Therein lies a key takeaway: the ultimate cost will vary greatly depending upon a multitude of factors, but leveraging TTC offers a new lever to unlock meaningfully superior performance and potentially better economics as well.
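To put rough numbers on that trade-off, the back-of-the-envelope math below uses illustrative per-token prices consistent with the ~34x ratio cited above and the 20x token multiplier from the GTC example. The actual figures will vary by provider and workload, so treat every constant here as an assumption.

```python
# Back-of-the-envelope cost comparison: a large non-reasoning model vs. a small
# reasoning model that emits many more output tokens per answer. All constants
# are illustrative placeholders, not quoted prices.

PRICE_LARGE_PER_M = 150.00   # $ per 1M output tokens, large non-reasoning model (illustrative)
PRICE_SMALL_PER_M = 4.40     # $ per 1M output tokens, small reasoning model (~34x cheaper, illustrative)
TOKEN_MULTIPLIER = 20        # reasoning model generates ~20x the output tokens per answer

base_tokens = 1_000          # assumed output length for a single non-reasoning answer

cost_large = base_tokens * PRICE_LARGE_PER_M / 1_000_000
cost_small = base_tokens * TOKEN_MULTIPLIER * PRICE_SMALL_PER_M / 1_000_000

print(f"Large model answer:     ${cost_large:.4f}")  # ~$0.15
print(f"Small reasoning answer: ${cost_small:.4f}")  # ~$0.09, still cheaper despite 20x tokens
```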
Enhancing Customer Experience and Operational Efficiency
User experience continues to be paramount for the successful adoption of AI applications, perhaps more so than ever as foundation model performance becomes increasingly competitive. For example, virtual assistants and chatbots enhanced with TTC can provide more thoughtful, context-aware responses, making them more reliable and engaging. In enterprise settings, systems that can reason through complex decisions in real time are highly valued – for instance, an AI that can generate multiple scheduling solutions and select the optimal one based on current constraints.
While increased inference compute might mean slightly slower responses in some cases, the trade-off is often worth it for the improved accuracy and reliability. As noted by OpenAI’s Noam Brown, “It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer.”
Infrastructure and Scalability Challenges
Of course, there are challenges. Adding extra compute at inference inevitably increases operational costs, and the additional latency may not be tenable for certain use cases or applications. This can complicate infrastructure needs, such as for a service that must handle millions of queries, each potentially requiring a variable amount of processing time. One research paper posits that not all inference is created equal and that the increased costs associated with TTC scaling may not justify the perceived performance gains. Huang acknowledged during his GTC keynote that power is the ultimate constraint to maximizing inference performance today across FLOPS (“Floating-Point Operations per Second,” or the computational performance of processors such as CPUs and GPUs), bandwidth, and memory. Memory requirements become particularly important as TTC dramatically increases the output tokens generated (echoing the cost point made above) and stored to formulate a final response (e.g., a model may generate and consider multiple responses to a query before finalizing its output). Cloud providers and AI infrastructure companies are already working on solutions, such as adaptive compute scheduling and specialized inference chips, to handle these demands more efficiently.
For startups, careful design of the inference pipeline is essential. It’s not just about adding compute power; it’s about smartly allocating it. Some emerging techniques involve dynamically assessing the difficulty of each query and only applying extra reasoning steps when necessary. This adaptive approach aims to strike a balance between quality and efficiency.
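One way such an adaptive pipeline might look in practice is sketched below: a cheap triage call estimates difficulty, and only queries judged hard receive the expensive multi-sample treatment. The llm() helper, the EASY/HARD rubric, and the sample count are hypothetical, illustrative choices rather than a reference design.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to whichever model API you use."""
    raise NotImplementedError

def score(rating_text: str) -> int:
    """Pull a numeric rating out of the rater's reply; default to 0 if none found."""
    digits = "".join(ch for ch in rating_text if ch.isdigit())
    return int(digits) if digits else 0

def answer_adaptively(question: str, n_hard_samples: int = 5) -> str:
    # Cheap triage pass: decide whether this query deserves extra inference compute.
    triage = llm(f"Classify this question as EASY or HARD. Reply with one word.\n\n{question}")
    if triage.strip().upper().startswith("EASY"):
        return llm(question)  # fast, single-pass answer for simple queries
    # Hard queries: sample several reasoned solutions and keep the highest-rated one.
    candidates = [llm(f"Solve step by step.\n\n{question}") for _ in range(n_hard_samples)]
    return max(
        candidates,
        key=lambda c: score(llm("Rate this solution's correctness from 1 to 10. "
                                f"Reply with a single integer.\n\n{c}")),
    )
```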
The Future of Test-Time Compute: Hybrid Models and Smarter Inference
Looking ahead, the most promising vision for AI combines the best of both worlds: robust training-time learning paired with agile test-time reasoning. We can imagine systems where a model not only draws on its vast pre-trained knowledge but also dynamically adjusts its processing based on the specific demands of each query. Such hybrid models could advance AI by ensuring high accuracy while maintaining operational efficiency. This is a case for the continued scaling of pre-training and combining it with agile TTC for optimal performance.
One intriguing possibility is Test-Time Training, where models continue to learn from each inference (as opposed to the model weights remaining fixed until fine-tuned or retrained). In this approach, every query is not just a question to be answered but also an opportunity to refine the model’s understanding. Early research in this area suggests that models could fine-tune themselves on the fly, becoming more robust and adaptable as they interact with real-world data.
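For illustration only, a conceptual sketch of that idea, assuming a Hugging Face-style causal language model in PyTorch, could look like the following: take a few self-supervised gradient steps on a copy of the model using the incoming query itself, then generate. The objective, step count, and learning rate are all assumptions for the sake of the sketch, not a description of any released system.

```python
import copy
import torch

def answer_with_test_time_training(model, tokenizer, query: str, steps: int = 3) -> str:
    tuned = copy.deepcopy(model)  # keep the base weights untouched between queries
    optimizer = torch.optim.SGD(tuned.parameters(), lr=1e-5)
    batch = tokenizer(query, return_tensors="pt")
    for _ in range(steps):
        # Next-token prediction on the query itself as a simple self-supervised signal.
        out = tuned(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # Generate the answer with the per-query adapted weights.
    generated = tuned.generate(**batch, max_new_tokens=256)
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```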
The hardware landscape is also evolving in tandem. Companies like NVIDIA are developing systems designed to maximize performance for the incoming explosion in inference-heavy workloads, addressing the unique challenges of variable compute needs, such as the tradeoff between throughput and latency (Huang gets into this around the 54-minute mark during his keynote at GTC 2025). These advances in hardware, combined with smarter inference algorithms, could lead to AI systems that are both high-performing and energy-efficient.
Conclusion: The Case for Thinking More
The AI revolution is at a turning point. For years, the focus was on making models bigger and training them on ever-larger datasets. But as those gains begin to plateau, the industry is pivoting toward a new frontier – Test-Time Compute. By giving models extra “think time” during inference, we’re opening up new avenues for performance improvements, smarter reasoning, and real-time adaptability.
As the industry continues to innovate in this space, one thing is clear: the future of AI isn’t just about building bigger brains – it’s about teaching them to think better. Hybrid models that combine robust training with dynamic, on-demand reasoning will likely be key to unlocking the next wave of performance breakthroughs and AI capabilities.
If you’re building in this space or just want to compare notes, please reach out – we’d love to chat!
——
Appendix
KEY TAKEAWAYS
- Test-Time Compute allows models to dynamically allocate more computation at the inference phase when needed, much like how a person might pause to double-check a difficult problem
- It offers a compelling alternative to the traditional focus on training-time compute, particularly in an era where further scaling up may not yield proportional benefits
- The effect of TTC on model performance is now both obvious and compelling, as demonstrated by Gemini 2.5 Pro, o1, Claude 3.7 Sonnet, Grok 3, and DeepSeek R1, among other models and bodies of research (note that as of this writing the top 3 models on LiveBench are all reasoning models)
- For businesses and startups, TTC means potentially better performance with lower upfront costs and the ability to iterate quickly – making cutting-edge AI more accessible
- Hardware and software vendors alike are reorienting their businesses and product strategies around TTC
- We are still early in exploring TTC and newer concepts such as test-time training remain exciting possibilities