Forget training. The real AI war is about running models at scale—and a new generation of infrastructure companies is racing to win it.
The AI narrative has been dominated by training for the past three years. Bigger models. More parameters. Trillion-dollar compute clusters. OpenAI, Anthropic, and Google locked in an arms race to build the most capable foundation models.
But that narrative is about to flip.
This week, Modal Labs entered talks to raise at a $2.5 billion valuation—more than doubling its $1.1 billion valuation from just five months ago. General Catalyst is leading the round. The company's annualized revenue run rate sits at approximately $50 million.
Modal isn't building AI models. It's building the infrastructure to run them.
Welcome to the AI inference revolution—and it's going to reshape how every company deploys artificial intelligence.
The Shift Nobody Saw Coming
For most of 2023 and 2024, investors poured billions into companies training large language models. The assumption was straightforward: whoever builds the best model wins. Training was the hard part. Running the model? A detail.
That assumption was wrong.
By late 2025, the market began to correct. Not because training doesn't matter—it absolutely does—but because training is a one-time cost. Inference is forever.
When you train a model, you pay once. When you run that model to answer millions of user queries, process documents, generate images, or power autonomous agents, you pay every single time. And as AI moves from demos to production, inference costs have become the dominant line item on every AI company's P&L.
The numbers tell the story. According to Deloitte's 2026 predictions, inference workloads now account for roughly two-thirds of all AI compute—up from one-third in 2023 and half in 2025. The market for inference-optimized chips alone will exceed $50 billion this year.
The AI inference market overall is projected to grow from $106 billion in 2025 to $255 billion by 2030, a 19.2% CAGR, according to MarketsandMarkets. That's not a niche. That's an entire industry emerging in real time.
What Modal Labs Actually Does
Modal Labs occupies a specific and increasingly critical position in the AI infrastructure stack: serverless GPU compute for AI workloads.
Here's the problem Modal solves. Let's say you're an AI company—or any company deploying AI features. You've fine-tuned a model or you're using an open-source model like Llama, Mistral, or Qwen. Now you need to run it.
You have three traditional options:
Option 1: Cloud providers (AWS, GCP, Azure). Reserve GPU instances. Pay whether you use them or not. Manage containers, orchestration, scaling, and cold starts yourself. Wait weeks for quota approvals during capacity crunches. Watch your infrastructure team grow faster than your product team.
Option 2: Dedicated hardware. Buy or lease GPUs. Build out a data center presence. Hire a team to maintain it. Commit to years of depreciation on hardware that becomes obsolete in 18 months.
Option 3: API providers (OpenAI, Anthropic, etc.). Easy to start. Zero control over cost, latency, or data privacy. Complete dependency on another company's infrastructure and pricing decisions.
Modal offers a fourth path: serverless GPU infrastructure defined entirely in code.
With Modal, you write Python. Your code declares what GPU it needs (A100, H100, whatever), what container environment it requires, and what functions should run. Modal handles everything else—provisioning, scaling, load balancing, cold starts, and shutdowns.
There's no YAML. No Kubernetes manifests. No reserved capacity. You pay per second of actual compute usage. When traffic spikes, Modal scales to hundreds of GPUs automatically. When traffic drops, it scales to zero. You pay nothing.
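Here's a minimal sketch of what that looks like in practice. The decorators and parameters follow Modal's public Python SDK, but treat the details as illustrative: the model name is just an example, and the exact API surface varies between SDK versions.

```python
import modal

# Declare the container image in code: a slim base plus the libraries this
# function needs. No Dockerfile, no YAML.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

app = modal.App("llama-inference")

@app.function(gpu="H100", image=image)
def generate(prompt: str) -> str:
    # Sketch only: production code would load the model once per container,
    # not on every call.
    from transformers import pipeline
    pipe = pipeline(
        "text-generation",
        model="meta-llama/Llama-3.1-8B-Instruct",
        device_map="auto",
    )
    return pipe(prompt, max_new_tokens=128)[0]["generated_text"]

@app.local_entrypoint()
def main():
    # Modal provisions the GPU, runs the function remotely, bills per second,
    # and scales back to zero when the call returns.
    print(generate.remote("Explain KV-cache paging in one paragraph."))
```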
This is what serverless was supposed to be, but for GPU workloads. And in the AI era, GPU workloads are what matter.
Why Inference Efficiency Is the New Moat
Let's do some math.
A typical LLM inference request costs between $0.001 and $0.02 in compute, depending on model size, request length, and infrastructure efficiency. That seems trivial—until you scale.
At 1 million requests per day, that's roughly $30,000 to $600,000 a month on inference alone. At 100 million requests per day (the scale of a successful B2C AI application), you're looking at $36 million to over $700 million annually.
At that scale, a 30% improvement in inference efficiency isn't a nice-to-have. It's the difference between a viable business and a cash incinerator.
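The back-of-envelope math is simple enough to write down. The figures below just restate the per-request range above; real costs depend on model size, token counts, and how well the serving stack batches requests.

```python
# Illustrative inference spend at the per-request costs quoted above.
def monthly_cost(requests_per_day: float, cost_per_request: float) -> float:
    return requests_per_day * cost_per_request * 30

def annual_cost(requests_per_day: float, cost_per_request: float) -> float:
    return requests_per_day * cost_per_request * 365

for per_request in (0.001, 0.02):
    print(f"1M req/day   @ ${per_request}/req -> ${monthly_cost(1e6, per_request):,.0f} per month")
    print(f"100M req/day @ ${per_request}/req -> ${annual_cost(1e8, per_request):,.0f} per year")

# What a 30% efficiency gain is worth at the high end of the 100M req/day case:
print(f"30% saved: ${0.30 * annual_cost(1e8, 0.02):,.0f} per year")
```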
This is why inference optimization has become existential. Every percentage point of latency reduction, every improvement in GPU utilization, every clever batching strategy—it all flows directly to the bottom line.
And it's why companies like Modal are suddenly worth billions.
The infrastructure layer captures margin that model providers and application developers cannot. OpenAI can charge whatever the market will bear for API calls, but its costs are set by how efficiently the underlying infrastructure runs. Application developers can raise prices, but they're competing against alternatives. Infrastructure providers sit in the middle, improving unit economics for everyone above them while building defensible technical moats.
The Inference Arms Race
Modal isn't alone. The inference infrastructure market has exploded over the past six months, with valuations rising faster than in almost any other sector of tech.
Baseten raised $300 million at a $5 billion valuation in January 2026—more than doubling its $2.1 billion valuation from September 2025. IVP, CapitalG, and Nvidia led the round. Baseten focuses on production ML infrastructure, optimizing the journey from trained model to deployed service.
Fireworks AI secured $250 million at a $4 billion valuation in October 2025. Fireworks positions itself as an inference cloud, providing API access to open-source models running on optimized infrastructure.
Inferact, the commercialized version of the open-source vLLM project, emerged in January 2026 with $150 million in seed funding at an $800 million valuation. Andreessen Horowitz led. vLLM has become the de facto standard for efficient LLM serving, and Inferact is betting it can capture commercial value from that position.
RadixArk, spun out of the SGLang project, also launched in January with seed funding led by Accel at a reported $400 million valuation. SGLang pioneered radix attention and other techniques for faster inference, and RadixArk is commercializing that research.
These valuations would have been unthinkable 18 months ago. What changed?
The market finally understood that AI's bottleneck isn't models—it's deployment. Everyone has access to capable models now. Open-source alternatives like Llama 3.3 and Mistral Large approach proprietary model performance at a fraction of the cost. The differentiation isn't in what model you use; it's in how efficiently you run it.
The Technical Battlefield
Under the hood, inference optimization is a surprisingly deep technical problem. Companies are competing on multiple fronts simultaneously.
Batching strategies: The more requests you can process simultaneously on a single GPU, the lower your cost per request. But naive batching introduces latency. The best inference systems dynamically adjust batch sizes based on current load, request characteristics, and latency requirements.
Memory management: LLMs are memory-bound, not compute-bound. Efficient key-value cache management can dramatically reduce memory pressure and increase throughput. This is where techniques like PagedAttention (pioneered by vLLM) and continuous batching have transformed the field.
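To make that concrete, here is a hedged sketch of serving an open-source model with vLLM, which handles PagedAttention and continuous batching internally. The model name is just an example, and the exact API varies between vLLM releases.

```python
from vllm import LLM, SamplingParams

# The engine manages the KV cache in paged blocks (PagedAttention) and batches
# in-flight requests continuously, keeping GPU memory and compute busy.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=128)

# Submitting many prompts at once lets the scheduler interleave them instead of
# serving one request at a time.
prompts = [f"Summarize support ticket #{i} in one sentence." for i in range(32)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```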
Quantization and compression: Running models in lower precision (INT8, INT4, even INT2) reduces memory requirements and increases throughput. The trick is doing this without degrading output quality. The best inference platforms make quantization transparent—you deploy a model, they handle the optimization.
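As one hedged example of what this looks like from the user's side, Hugging Face transformers can load a model in 4-bit precision via bitsandbytes; platforms that handle quantization for you are doing a more sophisticated version of the same trade-off. Flags and defaults vary by library version.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 4-bit precision: a large memory reduction, with output quality
# depending on the model and the quantization scheme (NF4, AWQ, GPTQ, ...).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```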
Speculative decoding: Generate multiple tokens speculatively, then verify them in parallel. This can dramatically reduce latency for certain workloads without changing the output distribution.
Infrastructure optimization: Cold starts are death for serverless GPU platforms. Modal has invested heavily in reducing container startup times to subsecond levels—a non-trivial achievement when you're loading multi-gigabyte model weights.
Multi-tenancy: Running multiple customers' workloads efficiently on shared infrastructure requires sophisticated isolation, scheduling, and resource allocation. This is where hyperscaler experience would seem to matter most, yet it's where startups like Modal hold a surprising advantage: they're building from scratch, without legacy assumptions.
Each of these areas represents years of engineering work. The compounding effect of optimizing across all of them is what creates genuine infrastructure moats.
What This Means for Companies Deploying AI
If you're a company deploying AI—and increasingly, every company is—the inference revolution has direct implications for your strategy.
1. Don't overbuild internal infrastructure.
The temptation to build internal ML infrastructure teams is strong. Resist it. The best inference platforms are advancing faster than any internal team can match. Their R&D budgets exceed what you can dedicate to infrastructure. Their scale gives them data on optimization that you can't replicate.
Unless AI infrastructure is your core product, use a platform. The build-versus-buy calculation has decisively shifted toward buy.
2. Design for portability from day one.
The inference market is still maturing. Today's leader may not be tomorrow's. Design your AI systems to be infrastructure-agnostic. Use abstraction layers. Keep your model serving code decoupled from platform-specific APIs.
Modal, Baseten, Fireworks, and others all have proprietary interfaces. Build a thin abstraction layer that lets you switch between them. This isn't premature optimization—it's risk management.
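A thin layer can be as simple as the hypothetical sketch below. The class and method names are made up for illustration; the point is that application code depends only on the interface, so swapping providers is a contained change.

```python
from typing import Protocol

class InferenceBackend(Protocol):
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class ModalBackend:
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # Call your deployed Modal function here (e.g. via its .remote() handle).
        raise NotImplementedError

class FireworksBackend:
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # Call the provider's HTTP API here (many expose OpenAI-compatible endpoints).
        raise NotImplementedError

def answer_user(backend: InferenceBackend, question: str) -> str:
    # Application code never imports a vendor SDK directly.
    return backend.generate(f"Answer concisely: {question}")
```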
3. Monitor inference costs obsessively.
In production AI systems, inference costs can scale superlinearly with usage if you're not careful. A poorly optimized prompt that doubles token count doubles your costs. A missing cache layer that recomputes embeddings on every request incinerates margin.
Build cost observability into your AI systems from the start. Track cost per request. Monitor GPU utilization. Understand where your inference spend goes. The companies that win in AI will be the ones that understand their unit economics at a granular level.
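Even a crude tracker beats flying blind. The sketch below is hypothetical: the prices are placeholders and the "logging" is a print statement, but it shows the shape of per-request cost accounting.

```python
import time
from dataclasses import dataclass

PRICE_PER_1K_INPUT = 0.0005    # placeholder $/1K input tokens; use your real rates
PRICE_PER_1K_OUTPUT = 0.0015   # placeholder $/1K output tokens

@dataclass
class RequestCost:
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def dollars(self) -> float:
        return (self.input_tokens * PRICE_PER_1K_INPUT
                + self.output_tokens * PRICE_PER_1K_OUTPUT) / 1000

def tracked(generate):
    """Wrap an inference call that returns (text, input_tokens, output_tokens)
    so every request emits its cost and latency."""
    def wrapper(prompt: str, **kwargs):
        start = time.time()
        text, in_tok, out_tok = generate(prompt, **kwargs)
        cost = RequestCost(in_tok, out_tok, time.time() - start)
        print(f"tokens={in_tok}+{out_tok} cost=${cost.dollars:.5f} latency={cost.latency_s:.2f}s")
        return text
    return wrapper
```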
4. Consider open-source models seriously.
The inference revolution has leveled the playing field between proprietary and open-source models. When you control your inference infrastructure, you can optimize open-source models far more aggressively than API providers can.
A well-optimized Llama 3.3 deployment can approach GPT-4 performance at a fraction of the cost. The gap is closing. For many applications, open-source models running on optimized infrastructure are now the economically rational choice.
5. Latency matters more than you think.
For user-facing AI applications, latency directly impacts conversion and engagement. Every 100 milliseconds of latency in an AI response correlates with measurable drops in user satisfaction.
The best inference platforms can cut latency by 50% or more compared to naive deployments. That's not just a technical improvement—it's a product advantage.
The Bigger Picture: Infrastructure as the AI Endgame
Zoom out, and Modal's $2.5 billion valuation—along with Baseten's $5 billion, Fireworks' $4 billion, and the rest—suggests something profound about where AI value will ultimately accrue.
The AI stack has three layers:
- Models: The foundation models themselves (GPT-4, Claude, Llama, etc.)
- Applications: Products built on top of models
- Infrastructure: The compute and tooling that runs everything
For the past three years, attention and capital concentrated in models and applications. Infrastructure was an afterthought—necessary, but boring.
That's changing. Infrastructure is emerging as the durable value layer.
Models commoditize. Today's state-of-the-art becomes tomorrow's baseline. Open-source catches up. New architectures emerge. Betting on a single model is betting on a depreciating asset.
Applications compete on distribution and user experience, not technology. Most AI applications are thin wrappers around model APIs. The defensibility comes from brand, data, and network effects—not from the AI itself.
Infrastructure, by contrast, is sticky. Once you've built your deployment pipeline on a platform, switching costs are real. Infrastructure providers improve continuously, passing efficiency gains to customers while maintaining margin. And infrastructure is model-agnostic—whether you run GPT, Claude, or Llama, you need compute.
This is why investors are suddenly paying up for inference infrastructure. It's not hype. It's a structural bet on where AI profits will concentrate as the market matures.
What Comes Next
Modal Labs' reported $2.5 billion valuation—if the round closes at those terms—will mark another milestone in the inference infrastructure boom. But this is still early.
The market is heading toward consolidation. Not every inference platform will survive. The winners will be those who:
- Execute on technical depth: Marginal improvements in inference efficiency compound. The platforms that push the boundary consistently will pull ahead.
- Build genuine scale: Inference infrastructure has massive economies of scale. More customers means more data on optimization, more bargaining power with GPU suppliers, and more ability to invest in R&D.
- Integrate into developer workflows: The best infrastructure is invisible. Platforms that make deployment effortless—that feel like magic—will win developer mindshare.
- Navigate the hyperscaler relationship: AWS, GCP, and Azure are all investing heavily in AI inference. Infrastructure startups must find positions that complement rather than directly compete with hyperscaler offerings.
Modal is well-positioned on most of these dimensions. Erik Bernhardsson, the CEO, built data infrastructure at Spotify and served as CTO at Better.com before founding Modal. The company has genuine technical depth. Its Python-first, serverless approach has resonated with developers.
But the competition is fierce. Baseten has more capital and Nvidia as a strategic investor. Fireworks has model optimization expertise. The vLLM and SGLang commercialization efforts bring deep open-source communities.
The next 18 months will determine which platforms emerge as category leaders. For everyone building with AI, this is the layer to watch.
Key Takeaways
- Modal Labs in talks to raise at $2.5B valuation, more than doubling its valuation in five months
- Inference, not training, is the new AI battleground as production deployment costs dominate
- The inference market is exploding: $106B in 2025, projected to reach $255B by 2030
- Valuations have skyrocketed: Baseten ($5B), Fireworks ($4B), Modal ($2.5B), Inferact ($800M), RadixArk ($400M)
- For companies deploying AI: Use platforms, design for portability, monitor costs obsessively, consider open-source models, prioritize latency
- Infrastructure is the durable value layer in AI—model-agnostic, sticky, and improving continuously
The AI inference revolution isn't coming. It's here. And for companies that understand it, it's an opportunity to build faster, cheaper, and more efficiently than ever before.
Webaroo helps companies build and deploy AI systems that actually work. If you're navigating the inference landscape and need guidance, get in touch.
