The Billion-Dollar Question: Why Are AI Companies Still Obsessed with Bigger Models?
Unpacking the Reasons Behind the Relentless Pursuit of Larger Foundation Models – In a 3-Minute Read
The Gist:
AI companies are locked in a race to build ever-larger foundation models, spending billions on training. Why? It's not just hype. These large foundation models are the essential building blocks for the future of AI, powering both efficiency (distillation) and breakthrough capabilities (reasoning). The race for bigger models is a race for the future of AI itself, and it is directly tied to the pursuit of agentic systems.
What Needs to be Understood:
Billions on the Line: Tech giants like Meta, Amazon, Alphabet, and Microsoft are projected to invest up to a combined $320 billion in AI and data centers in 2025. Microsoft alone plans to spend $80 billion in its fiscal year 2025.
The Indispensable Ingredient: Foundation models, pretrained on vast data with massive compute, are the starting point for everything else in AI. They're the raw material, the foundation on which all other applications are built.
Pretraining Scaling Laws: These laws describe the relationship between model size, dataset size, and the compute used for pretraining (the initial, broad training phase). Generally, bigger models trained on more data with more compute perform better, which is why companies keep building larger and larger foundation models (a rough numerical sketch follows this list).
Two Paths, One Foundation:
Distillation: A large "teacher" model (often a foundation model) trains a smaller "student" model. The better the teacher, the better the student (a minimal sketch of the training loss appears after this list).
Reasoning: DeepSeek-R1 demonstrates how foundation models (like DeepSeek-V3-Base) serve as the essential starting point for reasoning models.
Base Model as the Engine: In both cases, the quality, capability, and size of the initial foundation model are paramount. It's the raw potential that's either transferred (distillation) or unlocked (reasoning).
The Rise of AI Agents: Systems that can plan, reason, and act autonomously depend heavily on both reasoning capabilities and the deployment of efficient models. As these agents become more sophisticated, they drive demand for both larger foundation models (to improve reasoning) and more efficient models (for practical deployment).
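To make the scaling-law point above concrete, here is a rough numerical sketch using the parametric loss form reported by Hoffmann et al. (2022, the "Chinchilla" paper). The constants are that paper's published fits and are purely illustrative; they are not a prediction for any specific model named in this piece.

```python
# Illustrative sketch of a pretraining scaling law in the spirit of
# Hoffmann et al. (2022, "Chinchilla"). The constants are the fits reported
# in that paper; treat the outputs as illustrative trends, not forecasts.

def chinchilla_loss(params: float, tokens: float) -> float:
    """Estimated pretraining loss L(N, D) = E + A / N**alpha + B / D**beta,
    where N is parameter count and D is training tokens."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / params**alpha + B / tokens**beta

# Holding the data budget fixed and growing the model still lowers the
# estimated loss -- the "bigger is better" trend described above.
for n_params in (7e9, 70e9, 700e9):
    print(f"{n_params:.0e} params, 1.4T tokens -> estimated loss "
          f"{chinchilla_loss(n_params, 1.4e12):.3f}")
```

The same formula also shows diminishing returns: each 10x in parameters buys a smaller absolute drop in loss, which is exactly the tension raised in the "Good Enough" question further down.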
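And here is a minimal sketch of the distillation path described under "Two Paths, One Foundation": the soft-label recipe from Hinton et al. (2015), written as illustrative PyTorch. The `teacher` and `student` objects, the `train_step` helper, and the temperature value are assumptions made for the example, not anyone's production setup.

```python
# Minimal knowledge-distillation sketch (soft-label variant, Hinton et al. 2015).
# `teacher` and `student` are assumed to be PyTorch modules mapping a batch
# of inputs to logits over the same vocabulary; everything here is illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.
    The more capable the teacher, the more informative these soft targets are."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

def train_step(student, teacher, optimizer, batch, temperature=2.0):
    with torch.no_grad():              # the teacher is frozen
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = distillation_loss(student_logits, teacher_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this loss is usually mixed with the standard next-token objective on real labels, but the sketch captures the core idea: the student learns from the teacher's full output distribution, not just hard answers.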
Observations:
Microsoft's Phi Family: Microsoft's focus on Small Language Models (SLMs) like the Phi family, with over 20 million downloads, shows the demand for distilled, efficient models.
Nvidia's Bet on Reasoning: On Nvidia's Q4 2025 earnings call, CFO Colette Kress explicitly mentioned "distillation" and "reasoning models" as drivers of demand for their infrastructure. Jensen Huang highlighted Blackwell's design specifically for "reasoning AI models".
Synthetic Data's Role: Teacher models can generate synthetic data, which is crucial for the "cold start" problem in reasoning models, allowing the creation of vast datasets of step-by-step reasoning examples (a minimal sketch follows this list).
Importance of Base Model Intelligence: A larger, more capable base model provides a richer foundation. Its ability to capture complex patterns and generalize well gives reinforcement learning on chain-of-thought (CoT) traces a stronger starting point, enabling the discovery of more sophisticated reasoning strategies.
Grok 3 Proves Scaling Laws Persist: Trained on 200,000 Nvidia H100 GPUs (a 10x increase in compute), Grok 3 achieved an Elo score of 1402 in the Chatbot Arena, showcasing leading capabilities. The principle that "bigger models, trained on more data, perform better" still holds, and that superior performance is the raw material for subsequent advancements.
Beyond Revenue: Even if these frontier models are not generating direct revenue yet, they are still needed: they are the raw material from which the distilled and reasoning models downstream are built.
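As a companion to the "Synthetic Data's Role" observation above, here is a minimal sketch of how a teacher model might bootstrap a "cold start" reasoning dataset. `teacher_generate` is a hypothetical stand-in for whatever inference API serves the teacher, and the answer-matching filter is deliberately simplified; this is an illustration of the idea, not DeepSeek's actual pipeline.

```python
# Illustrative "cold start" data bootstrapping with a teacher model.
# `teacher_generate` is a hypothetical callable: prompt string in, text out.
# The verification step is simplified to exact-match against a reference answer.

COT_PROMPT = (
    "Solve the problem step by step, then give the final answer after '####'.\n\n"
    "Problem: {q}"
)

def build_cold_start_dataset(problems, teacher_generate, samples_per_problem=4):
    """problems: list of {'question': str, 'answer': str} dicts.
    Returns verified (question, reasoning trace) pairs."""
    dataset = []
    for item in problems:
        for _ in range(samples_per_problem):
            trace = teacher_generate(COT_PROMPT.format(q=item["question"]))
            # Keep only traces whose final answer matches the reference.
            final_answer = trace.split("####")[-1].strip()
            if final_answer == item["answer"].strip():
                dataset.append({"question": item["question"], "reasoning": trace})
                break  # one verified trace per problem is enough for this sketch
    return dataset
```

The verified traces would then be written out (for example as JSONL) to seed supervised fine-tuning before RL on CoT, which is the "cold start" role described above.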
Something to Think About:
The "Good Enough" Point: We're seeing these models get crazy big. But at what point is "good enough" actually good enough? Will there be a point where the extra performance just isn't worth the insane cost of training even larger models?
Show Me the Money: Billions are being poured into these foundation models. But where's the revenue going to come from? And how long until these investments start paying off?
Will Enterprises Actually Use This Stuff?: Distillation is supposed to make these powerful models accessible for enterprise use cases. But will they actually use them? Will we see a whole ecosystem of specialized, "mini-me" models, or will a few giant models dominate?