The Hardware Behind the Headlines
When OpenAI announced GPT-5.3-Codex on February 5, 2026, NVIDIA was quick to point out a key detail: the model was co-designed for, trained with, and is served on NVIDIA GB200 NVL72 systems.
This isn’t just marketing synergy. The GB200 NVL72 represents a fundamental shift in how AI infrastructure is architected—moving from individual GPUs to rack-scale computing where 72 GPUs act as a single, unified processing unit.
Let’s break down what makes this hardware significant.
What Is the GB200 NVL72?
At its core, the GB200 NVL72 is an exascale computer in a single rack. It connects:
- 36 NVIDIA Grace CPUs (Arm-based, purpose-built for AI workloads)
- 72 NVIDIA Blackwell GPUs (the B200, NVIDIA’s latest architecture)
- 13.4 TB of HBM3e memory with 576 TB/s aggregate bandwidth
- 130 TB/s NVLink bandwidth connecting all 72 GPUs
The entire system operates as a liquid-cooled unit, drawing approximately 120,000 watts of power—or about $30/hour at California residential electricity rates.
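Those headline figures are easy to sanity-check. Here is a quick back-of-the-envelope sketch in Python using the numbers above, with the $0.25/kWh electricity rate as an assumed stand-in for California residential pricing:

```python
# Back-of-the-envelope check of the rack-level figures quoted above.
# The $0.25/kWh electricity rate is an assumption for illustration only.

gpus = 72
superchips = 36
total_hbm_tb = 13.4        # aggregate HBM3e across the rack
rack_power_kw = 120        # approximate total draw
price_per_kwh = 0.25       # assumed, roughly California residential

hbm_per_gpu_gb = total_hbm_tb * 1000 / gpus               # ~186 GB of HBM3e per GPU
hbm_per_superchip_gb = total_hbm_tb * 1000 / superchips   # ~372 GB per superchip
cost_per_hour = rack_power_kw * price_per_kwh             # ~$30/hour

print(f"~{hbm_per_gpu_gb:.0f} GB HBM3e per GPU, ~{hbm_per_superchip_gb:.0f} GB per superchip")
print(f"~${cost_per_hour:.0f}/hour at ${price_per_kwh:.2f}/kWh")
```

Dividing the rack total across 36 superchips is also where the 372 GB per-superchip figure in the next section comes from.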
The GB200 Grace Blackwell Superchip
The building block of the NVL72 is the GB200 Grace Blackwell Superchip. Each superchip combines:
- 1 Grace CPU with 72 Arm Neoverse V2 cores
- 2 Blackwell B200 GPUs
- 372 GB of HBM3e memory per superchip
- NVLink-C2C interconnect providing tight CPU-to-GPU coupling
What makes the Blackwell GPUs remarkable is their die architecture. Two separate silicon dies are fused together with a gap thinner than a human hair, allowing them to function as a single logical GPU. This isn’t consumer SLI—it’s a manufacturing and interconnect breakthrough that eliminates the communication overhead that plagued previous multi-die approaches.
The NVL72 rack contains 18 compute nodes, each housing two superchips (4 GPUs per node). Ten nodes sit above and eight below a bank of 9 NVLink switch trays in the middle of the rack. This layout isn't arbitrary; it keeps cable runs short and electrical signal timing consistent across the massive interconnect fabric.
Performance: The Numbers That Matter
NVIDIA is positioning the GB200 NVL72 against its own H100-based systems, and the improvements are substantial:
| Metric | Improvement |
|---|---|
| Real-time LLM Inference | 30x faster vs. H100 |
| Mixture-of-Experts (MoE) | 10x faster vs. H100 |
| LLM Training | 4x faster vs. H100 |
| Energy Efficiency | 25x better vs. H100 |
| Database Queries | 18x faster vs. CPU |
The inference figure assumes trillion-parameter models served with real-time latency requirements (roughly 50 ms per generated token). The training figure compares a 1.8-trillion-parameter MoE model on 4,096 HGX H100 systems versus 456 GB200 NVL72 racks, both scaled over InfiniBand and measured per GPU.
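To make those footnotes concrete, here is a small worked calculation: the per-stream token rate implied by a 50 ms per-token budget, and a check that the training comparison pits two clusters of roughly equal GPU counts against each other (each HGX H100 system carries eight GPUs):

```python
# Inference: what a 50 ms per-token budget means for one interactive stream.
token_latency_s = 0.050
tokens_per_second = 1 / token_latency_s        # 20 tokens/s per stream

# Training: the 4x claim is a per-GPU comparison between similarly sized clusters.
h100_systems = 4096                            # HGX H100 systems, 8 GPUs each
h100_gpus = h100_systems * 8                   # 32,768 H100 GPUs
nvl72_racks = 456                              # GB200 NVL72 racks, 72 GPUs each
blackwell_gpus = nvl72_racks * 72              # 32,832 Blackwell GPUs

print(f"{tokens_per_second:.0f} tokens/s per stream at 50 ms/token")
print(f"{h100_gpus:,} H100 GPUs vs. {blackwell_gpus:,} Blackwell GPUs")
```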
Raw Compute Specs
| Precision | GB200 NVL72 | Per Superchip |
|---|---|---|
| FP4 (Tensor Core) | 1,440 PFLOPS | 40 PFLOPS |
| FP8/FP6 (Tensor Core) | 720 PFLOPS | 20 PFLOPS |
| FP16/BF16 (Tensor Core) | 360 PFLOPS | 10 PFLOPS |
| FP64 (Tensor Core) | 2,880 TFLOPS | 80 TFLOPS |
The new second-generation Transformer Engine enables FP4 inference with microscaling formats optimized for high-throughput, low-latency workloads. This is a key enabler for serving trillion-parameter models in real time.
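Blackwell's FP4 pipeline isn't publicly documented at this level of detail, but the core idea behind microscaling formats is easy to illustrate: values are quantized in small blocks, and each block shares one scale factor so 4-bit codes can still span a wide dynamic range. The NumPy sketch below is a simplified illustration of that idea only; the block size, value grid, and power-of-two scaling are assumptions for clarity, not Blackwell's exact implementation.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) value grid, as used by MX-style formats.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_microscaled_fp4(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Quantize a 1-D tensor with one shared power-of-two scale per block.

    Simplified illustration: real hardware packs 4-bit codes plus a per-block
    exponent; here we just return the dequantized approximation.
    """
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Pick a power-of-two scale so the block's largest value fits the grid.
        max_abs = float(np.max(np.abs(block))) or 1.0
        scale = 2.0 ** np.ceil(np.log2(max_abs / FP4_GRID[-1]))
        # Snap each scaled value to the nearest representable FP4 magnitude.
        scaled = block / scale
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
        out[start:start + block_size] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out

weights = np.random.default_rng(0).normal(size=128)
approx = quantize_microscaled_fp4(weights)
print("mean absolute quantization error:", np.mean(np.abs(weights - approx)))
```

The shared per-block scale is what lets a 4-bit code stay useful for tensors whose values vary over many orders of magnitude.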
Why Liquid Cooling Matters
Air cooling can't keep up at these power densities. But liquid cooling provides benefits beyond thermal management:
- Higher compute density — More GPUs per square foot of data center space
- Lower latency — GPUs can be packed closer together, reducing interconnect distances
- Reduced water consumption — Counterintuitively, liquid cooling reduces overall water usage compared to air-cooled systems that rely on evaporative cooling towers
- NVLink at scale — The 130 TB/s NVLink domain wouldn’t be achievable with air-cooled systems spread across multiple racks
The rack’s internal NVLink “spine” contains over 5,000 cables totaling roughly 2 miles of copper. NVIDIA chose copper over fiber optics specifically because fiber would have required optical transceivers drawing roughly 20,000 additional watts.
The Smaller Option: GB200 NVL4
Not every workload needs 72 GPUs. The GB200 NVL4 offers a more accessible entry point:
- 4 Blackwell GPUs connected via NVLink bridge
- 2 Grace CPUs
- Compatible with liquid-cooled NVIDIA MGX modular servers
- 2x performance over prior generation for HPC and AI inference
This configuration targets converged HPC and AI workloads—scientific computing, AI for science training, and inference applications that don’t require rack-scale systems.
What This Means for AI Development
The GB200 NVL72’s design philosophy is significant: NVIDIA thinks of the entire 72-GPU rack as a single GPU. This has implications for how models are trained and served:
Near-Perfect Linear Scaling
When you add GPUs to a training cluster, you don’t typically get proportional speedups. Communication overhead, synchronization delays, and memory bandwidth limits create diminishing returns. The NVL72’s tightly integrated NVLink domain achieves near-perfect linear scaling—doubling GPUs nearly doubles throughput.
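One way to see why the interconnect matters is a toy strong-scaling model: if a fixed fraction of each step goes to communication that doesn't shrink as GPUs are added, speedup flattens out quickly. The overhead percentages below are made up for illustration, not measured NVL72 data.

```python
def speedup(num_gpus: int, comm_fraction: float) -> float:
    """Toy strong-scaling model: compute time divides across GPUs, while a
    fixed communication fraction of the single-GPU step time does not."""
    compute = (1.0 - comm_fraction) / num_gpus
    return 1.0 / (compute + comm_fraction)

for n in (8, 16, 32, 72):
    loose = speedup(n, comm_fraction=0.05)    # e.g., traffic crossing slower links
    tight = speedup(n, comm_fraction=0.005)   # e.g., staying inside one NVLink domain
    print(f"{n:>2} GPUs: {loose:5.1f}x with 5% comm overhead, "
          f"{tight:5.1f}x with 0.5% comm overhead")
```

In this toy model, 72 GPUs with 5% communication overhead deliver only about a 16x speedup, while cutting that overhead by an order of magnitude recovers most of the ideal 72x.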
Real-Time Trillion-Parameter Inference
Previous infrastructure could train trillion-parameter models but struggled to serve them at interactive latencies. The 30x inference improvement enables real-time applications with models that were previously batch-only.
Mixture-of-Experts Optimization
MoE architectures, reportedly used in GPT-4 and common among recent frontier models, route tokens to specialized “expert” sub-networks. This creates irregular communication patterns that stress traditional interconnects. The NVL72’s all-to-all NVLink fabric handles these patterns efficiently, delivering the claimed 10x MoE performance.
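Here is a rough sketch of why that routing is hard on the network: each token picks its top-k experts via a gating function, those experts live on different GPUs, and the resulting all-to-all exchange changes shape with every batch. The expert count, k, and random gate below are placeholders, not the routing of any particular model.

```python
import numpy as np

num_tokens, num_experts, top_k = 8, 4, 2
rng = np.random.default_rng(0)

# Stand-in for the learned gating network: one score per (token, expert) pair.
gate_scores = rng.normal(size=(num_tokens, num_experts))

# Each token picks its top-k experts; different tokens pick different experts,
# so the traffic pattern is irregular and input-dependent.
chosen = np.argsort(gate_scores, axis=1)[:, -top_k:]

# Count how many tokens each expert (i.e., each expert's GPU) must receive.
tokens_per_expert = np.bincount(chosen.ravel(), minlength=num_experts)

for token, experts in enumerate(chosen):
    print(f"token {token} -> experts {sorted(experts.tolist())}")
print("tokens per expert:", tokens_per_expert.tolist())
```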
The Cost Reality
NVIDIA hasn’t published official pricing, but estimates place a single NVL72 rack at $3+ million USD. Operating costs add up quickly:
- ~$250,000/year in electricity at $0.20/kWh, once facility and cooling overhead are added to the rack’s ~120 kW draw (see the quick calculation after this list)
- Data center space, cooling infrastructure, and maintenance
- InfiniBand networking to connect multiple racks
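The electricity figure follows directly from the 120 kW draw quoted earlier: about $210,000 per year at the rack itself, and roughly $250,000 once you assume a modest facility overhead. The sketch below makes the arithmetic explicit; the 1.2 PUE value is an assumption for illustration, not a published figure.

```python
rack_power_kw = 120
hours_per_year = 24 * 365            # 8,760 hours
price_per_kwh = 0.20
pue = 1.2                            # assumed overhead for cooling and power delivery

rack_only = rack_power_kw * hours_per_year * price_per_kwh   # ~$210,000/year
with_overhead = rack_only * pue                              # ~$252,000/year

print(f"rack only:     ${rack_only:,.0f}/year")
print(f"with overhead: ${with_overhead:,.0f}/year")
```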
For context, training GPT-5.3-Codex likely required hundreds of these racks running for months. This is infrastructure accessible only to hyperscalers and well-funded AI labs.
Looking Ahead: GB300 NVL72
NVIDIA has already announced the GB300 NVL72, featuring Blackwell Ultra GPUs. It promises:
- 50x overall AI factory output vs. Hopper-based platforms
- Purpose-built for test-time scaling inference and AI reasoning tasks
- Integration with Quantum-X800 InfiniBand or Spectrum-X Ethernet
- NVIDIA Mission Control management software for hyperscale efficiency
The roadmap is clear: rack-scale integration is the future, and each generation will push the boundaries of what’s possible with tightly coupled GPU systems.
The Bottom Line
The GB200 NVL72 isn’t just another GPU announcement—it’s a new category of computing infrastructure. By treating 72 GPUs as a single unified system, NVIDIA has removed many of the scaling bottlenecks that limited previous generations.
When you hear that GPT-5.3-Codex was “trained on Blackwell,” this is what that means: thousands of these racks, connected via high-speed networking, operating as a distributed but tightly integrated supercomputer.
For most organizations, direct access to NVL72 systems will come through cloud providers and NVIDIA’s own DGX Cloud. But understanding the underlying hardware helps contextualize why certain capabilities—like real-time trillion-parameter inference—are suddenly becoming possible.
The era of rack-scale AI computing has arrived.
