If you’ve spent any time architecting modern infrastructure, you’ve probably felt the frustration: your storage is fast, your compute is fast, but the network between them feels like it’s stuck in 2010. That bottleneck has a name, and for decades it was SCSI. The solution reshaping enterprise data centers right now is NVMe over Fabrics (NVMe-oF), and understanding it isn’t optional anymore—it’s table stakes for anyone building serious infrastructure.
In this post, we’ll break down what NVMe-oF actually is, explore the transport options available, examine the infrastructure requirements for deployment, and look at why this matters for AI workloads and Kubernetes environments.
The Problem with Legacy Storage Protocols
For over thirty years, the Small Computer System Interface (SCSI) was the language of storage. It was designed when hard disk drives ruled the data center—an era where latency was measured in milliseconds and the mechanical seek time of a drive head across a magnetic platter was the primary constraint. The protocol was serial, command processing was sequential, and none of that mattered much when your storage was physically slow.
Then NAND flash happened.
Solid State Drives changed everything about storage performance. Suddenly, the drives themselves could respond in microseconds, not milliseconds. The bottleneck shifted from the physical media to the protocol stack. The Advanced Host Controller Interface (AHCI) and SCSI couldn’t keep pace with what the silicon was capable of delivering.
Enter NVMe—Non-Volatile Memory Express. This wasn’t an incremental improvement; it was a ground-up redesign built specifically for flash storage. NVMe communicates directly with the CPU via the PCIe bus, bypassing the legacy storage stack entirely. Where AHCI supported a single command queue with a depth of 32 commands, NVMe supports up to 65,535 queues, each capable of holding 65,535 commands. That’s not a typo. This architecture allows modern multi-core processors to work with storage in true parallel fashion, eliminating lock contention and unlocking performance that was previously impossible.
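The payoff of that queue model is easiest to see in miniature. Here's a toy Python sketch (not real driver code) of per-core queue pairs: each core submits and completes I/O on its own private queues, so no cross-core lock is ever taken.

```python
from collections import deque

# Toy model of NVMe's per-core queue pairs: each core owns a private
# submission queue (SQ) and completion queue (CQ), so cores never
# contend on a shared lock the way they did with AHCI's single queue.
class QueuePair:
    MAX_DEPTH = 65_535  # per-queue command limit from the NVMe spec

    def __init__(self, qid):
        self.qid = qid
        self.sq = deque()  # commands submitted by the host
        self.cq = deque()  # completions posted by the controller

    def submit(self, command):
        if len(self.sq) >= self.MAX_DEPTH:
            raise RuntimeError("submission queue full")
        self.sq.append(command)

    def process(self):
        # The controller drains the SQ and posts completions to the CQ.
        while self.sq:
            self.cq.append(("done", self.sq.popleft()))

# One queue pair per core: no cross-core synchronization needed.
queues = {core: QueuePair(core) for core in range(4)}
queues[0].submit("READ lba=0 len=8")
queues[1].submit("WRITE lba=128 len=16")
for qp in queues.values():
    qp.process()

print(len(queues[0].cq), len(queues[1].cq))
```

The real driver pins queue pairs to CPU cores and interrupt vectors, but the shape of the win is the same: parallelism comes from ownership, not from locking.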
But here’s the catch: early NVMe was landlocked. Those incredible speeds were confined to direct-attached storage. Your blazing-fast NVMe drive was stuck inside a single server chassis, creating isolated islands of performance that couldn’t be shared across the network.
What NVMe-oF Actually Does
NVMe over Fabrics solves this problem by extending native NVMe commands across network fabrics. Instead of translating NVMe into some other protocol (like iSCSI does with SCSI), NVMe-oF encapsulates the native commands and ships them over the network. The result? Remote storage targets that perform with near-local latency.
This isn’t just faster networked storage—it enables true disaggregation of compute and storage. You can build composable infrastructures where resources are pooled and allocated dynamically based on workload demands. Need more storage for your analytics cluster? Provision it from the shared pool. Need to scale compute independently? No problem. This flexibility is essential for modern cloud architectures, AI training pipelines, and high-performance computing environments.
An NVMe-oF subsystem has three components: the host (initiator), the transport fabric, and the storage target. The host submits commands, the fabric carries them, and the target processes I/O requests against namespaces—logical block devices that function like traditional LUNs. Communication happens through submission and completion queue capsules that travel across the fabric asynchronously, so the host CPU never blocks waiting on a storage operation to complete.
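As a rough mental model of those three roles (the field names below are invented for illustration; they are not the spec's capsule wire format), the flow can be sketched in a few lines of Python:

```python
from dataclasses import dataclass

# Toy sketch of the three NVMe-oF roles: the host builds a command capsule,
# the fabric carries it, and the target executes it against a namespace.

@dataclass
class CommandCapsule:
    opcode: str      # e.g. "read" or "write"
    nsid: int        # namespace ID: the logical block device (like a LUN)
    lba: int
    length: int

@dataclass
class CompletionCapsule:
    status: str
    data: bytes

class Target:
    def __init__(self):
        # Namespaces are logical block devices; model them as byte arrays.
        self.namespaces = {1: bytearray(4096)}

    def handle(self, cap: CommandCapsule) -> CompletionCapsule:
        ns = self.namespaces[cap.nsid]
        if cap.opcode == "read":
            return CompletionCapsule("ok", bytes(ns[cap.lba:cap.lba + cap.length]))
        ns[cap.lba:cap.lba + cap.length] = b"\xff" * cap.length
        return CompletionCapsule("ok", b"")

# Host side: submit a write, then read it back through the "fabric"
# (a direct function call standing in for the transport).
target = Target()
target.handle(CommandCapsule("write", nsid=1, lba=0, length=4))
done = target.handle(CommandCapsule("read", nsid=1, lba=0, length=4))
print(done.status, done.data)  # ok b'\xff\xff\xff\xff'
```

The key point the sketch captures: the same command capsule works regardless of what sits in the middle, which is exactly why the spec can be transport-agnostic.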
Transport Options: TCP, RoCE, Fibre Channel
The NVMe-oF specification is deliberately transport-agnostic. You choose the fabric that fits your infrastructure, expertise, and performance requirements. Let’s examine the three primary options.
NVMe over Fibre Channel (FC-NVMe)
If you’ve got an existing Fibre Channel SAN, FC-NVMe lets you run NVMe alongside your legacy FCP traffic on the same infrastructure. This provides a non-disruptive migration path—you don’t have to rip and replace your entire storage network to adopt NVMe.
FC-NVMe leverages Fibre Channel’s inherent stability and credit-based flow control. The protocol transfers NVMe structures directly without translation. The second generation (FC-NVMe v2) introduced advanced error recovery that enables retransmissions up to 1000 times faster than previous upper-layer protocol methods.
Latency typically falls in the 50 to 100 microsecond range—not as low as RDMA solutions, but the operational maturity and existing toolchains make this attractive for mission-critical enterprise workloads where stability trumps raw speed.
NVMe over RDMA (RoCE and iWARP)
For absolute minimum latency, RDMA-based transports are the answer. Remote Direct Memory Access bypasses the traditional TCP/IP stack and the host operating system kernel entirely, enabling direct memory-to-memory communication between nodes.
RoCE (RDMA over Converged Ethernet) is currently the dominant RDMA variant for NVMe-oF. RoCEv2 uses UDP as its transport, with InfiniBand headers carrying the NVMe payload. The critical requirement: RoCE demands a lossless Ethernet fabric. Without robust congestion control, packet loss will devastate performance. This means implementing Data Center Bridging features like Priority Flow Control.
iWARP runs RDMA over standard TCP, which allows operation on lossy networks without complex lossless configuration. It supports selective retransmission and out-of-order packet handling. However, ecosystem adoption has been limited compared to RoCE.
InfiniBand provides the ultimate performance profile with native RDMA support and credit-based flow control. The trade-off is specialized, non-Ethernet hardware that restricts deployment to demanding HPC and AI training clusters where the cost is justified.
NVMe over TCP (NVMe/TCP)
Here’s where things get interesting for mainstream adoption. NVMe/TCP encapsulates NVMe commands within standard TCP/IP packets. You can use your existing Ethernet switches and NICs—no specialized RDMA hardware required.
NVMe/TCP frames commands into protocol data units (PDUs) that ride the ordinary TCP byte stream; the receiving subsystem walks the stream, reassembling PDUs to extract commands. This introduces slightly higher CPU utilization and latency (typically 100 to 200 microseconds) compared to RDMA. But it still delivers a massive performance leap over iSCSI, and for many organizations, the simplicity of a plug-and-play deployment on existing infrastructure outweighs the modest latency trade-off.
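To make the byte-stream framing concrete, here is a deliberately simplified sketch. A 4-byte length prefix stands in for the real PDU common header, and the receiver recovers whole commands even when TCP segmentation splits them at arbitrary points:

```python
import struct

# Toy illustration of NVMe/TCP-style framing: each PDU is length-prefixed,
# and the receiver walks the TCP byte stream to recover whole commands.
# The 4-byte length header is a simplification of the real PDU header.

def encode_pdu(payload: bytes) -> bytes:
    return struct.pack(">I", len(payload)) + payload

def extract_pdus(stream: bytes):
    pdus, offset = [], 0
    while offset + 4 <= len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        if offset + 4 + length > len(stream):
            break  # partial PDU: wait for more bytes to arrive
        pdus.append(stream[offset + 4 : offset + 4 + length])
        offset += 4 + length
    return pdus

# Two commands arrive as one contiguous byte stream.
stream = encode_pdu(b"READ nsid=1 lba=0") + encode_pdu(b"WRITE nsid=1 lba=64")
print(extract_pdus(stream))  # [b'READ nsid=1 lba=0', b'WRITE nsid=1 lba=64']
```

That per-byte walking and reassembly is where NVMe/TCP's extra CPU cost comes from; RDMA transports skip it by placing data directly into registered memory.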
Infrastructure Requirements: Building a Storage-Ready Fabric
Deploying NVMe-oF successfully requires infrastructure that can handle consistent low latency and, for RDMA transports, a zero-loss environment.
Leaf-Spine Architecture
Modern storage fabrics are built on leaf-spine topologies. This design ensures any storage target and compute node are separated by a consistent, predictable path of at most three switches (leaf, spine, leaf). That predictability is critical for microsecond-level NVMe-oF performance.
Horizontal scaling is straightforward: add spine switches for more backbone bandwidth, add leaf switches for more server/storage ports. By optimizing the blocking factor (the ratio of downlink to uplink ports), you prevent the oversubscription that causes congestion.
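The oversubscription math is simple enough to sanity-check on a napkin. The port counts and speeds below are illustrative, not from any particular switch:

```python
# Back-of-the-envelope oversubscription check for a leaf switch.

def oversubscription(downlinks: int, down_gbps: float,
                     uplinks: int, up_gbps: float) -> float:
    """Ratio of server-facing bandwidth to spine-facing bandwidth."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# 48 x 25G server ports feeding 6 x 100G uplinks -> 2:1 oversubscribed.
ratio = oversubscription(48, 25, 6, 100)
print(f"{ratio}:1")  # 2.0:1 -- acceptable for general traffic, risky for storage

# Storage fabrics often target 1:1 (non-blocking): 48 x 25G needs 12 x 100G.
print(f"{oversubscription(48, 25, 12, 100)}:1")  # 1.0:1
```

A 2:1 or 3:1 blocking factor is common for general-purpose leafs; dedicated storage fabrics typically aim for 1:1 so sustained NVMe-oF flows never queue behind the uplinks.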
Data Center Bridging for Lossless Ethernet
For RoCE deployments, your Ethernet fabric needs to behave like Fibre Channel—guaranteed delivery with no dropped packets. The Data Center Bridging suite provides this:
Priority Flow Control (PFC) pauses only specific traffic classes when congestion occurs, rather than all traffic on a link. NVMe/RoCE traffic gets assigned to a high-priority queue, protecting it from packet loss while less-sensitive traffic continues to flow.
Explicit Congestion Notification (ECN) allows switches to mark packets when they detect impending congestion. Endpoints see the marks and signal senders to reduce transmission rates—avoiding drops before they happen.
Enhanced Transmission Selection (ETS) guarantees bandwidth allocation to different traffic classes, ensuring storage traffic gets its share even during heavy general network utilization.
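Of the three mechanisms, ECN is the easiest to picture as code. This toy model of a switch queue shows the ordering that matters: marking kicks in well before the tail-drop limit, so senders slow down while there is still headroom. The thresholds are illustrative.

```python
# Toy model of ECN behaviour on a switch queue: once occupancy crosses the
# marking threshold, packets are marked rather than dropped, and senders
# react by reducing their rate before the buffer actually overflows.

ECN_THRESHOLD = 80   # mark when the queue holds this many packets
QUEUE_LIMIT = 100    # tail-drop only past this point

def enqueue(queue_depth: int) -> str:
    if queue_depth >= QUEUE_LIMIT:
        return "dropped"    # the outcome PFC/ECN exist to prevent
    if queue_depth >= ECN_THRESHOLD:
        return "marked"     # congestion signalled, nothing lost
    return "forwarded"

print(enqueue(40))   # forwarded
print(enqueue(85))   # marked
print(enqueue(100))  # dropped -- only if senders ignore the marks
```

Real deployments tune these thresholds per traffic class (via WRED-style profiles) so the RoCE queue signals early while bulk traffic tolerates deeper occupancy.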
Buffer Considerations
Storage traffic is notorious for “incast” events—multiple senders simultaneously overwhelming a single receiver, creating microbursts that can overrun switch buffers. High-performance NVMe-oF deployments often require switches with deep buffers to absorb these bursts. Shallow-buffer switches designed for web traffic frequently struggle with sustained, high-bandwidth storage flows.
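A rough sizing exercise shows why incast stresses buffers so badly. All figures below are illustrative:

```python
# Rough incast sizing: N senders each burst B bytes at a single receiver
# whose egress port drains at line rate; the switch buffer must absorb
# whatever arrives faster than it drains.

def buffer_needed_bytes(senders: int, burst_kb: int,
                        ingress_gbps: float, egress_gbps: float) -> float:
    arriving = senders * burst_kb * 1024
    # Fraction of the burst the egress port can drain while it arrives.
    drained = arriving * (egress_gbps / (senders * ingress_gbps))
    return arriving - drained

# 32 storage nodes each bursting a 64 KB reply at 25G toward one 25G port.
need = buffer_needed_bytes(32, 64, 25, 25)
print(f"{need / 1024:.0f} KB of buffer for one microburst")  # 1984 KB
```

Nearly 2 MB of buffer for a single port, from one modest read fan-out: multiply by the ports on a busy leaf and the case for deep-buffer silicon in storage fabrics makes itself.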
The AI and Kubernetes Connection
This is where NVMe-oF stops being a “nice to have” and becomes mandatory.
Kubernetes and CSI Integration
Stateful applications on Kubernetes need high-performance shared storage. Modern Container Storage Interface drivers for NVMe-oF (from vendors like Simplyblock and Lightbits) provide dynamic provisioning of NVMe/TCP volumes with sub-100 microsecond latency.
Intelligent orchestration uses node affinity and topology-aware scheduling to place pods on compute nodes with the lowest network distance to storage resources. When a storage node or switch fails, the system automatically reconnects NVMe devices on different paths, reducing recovery times by 80 to 90 percent compared to legacy approaches.
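For a sense of what consuming such a driver looks like, here is a hypothetical StorageClass. The provisioner name and parameters are placeholders; every vendor's CSI driver defines its own, so check the driver's documentation.

```yaml
# Hypothetical StorageClass for an NVMe/TCP CSI driver. The provisioner
# name and the "transport" parameter are illustrative placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-tcp-fast
provisioner: csi.example-nvmeof-vendor.com
parameters:
  transport: tcp
volumeBindingMode: WaitForFirstConsumer  # defer binding so the scheduler can pick a topology-aware node
allowVolumeExpansion: true
```

The `WaitForFirstConsumer` binding mode is the standard Kubernetes lever behind topology-aware placement: the volume isn't provisioned until the pod is scheduled, so the driver can create it close to the node that will mount it.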
AI Training Pipelines
AI workloads are inherently parallel and latency-sensitive. Thousands of GPUs must synchronize multiple times per second during training. If the storage fabric is congested, synchronization updates are delayed, creating “stalled epochs” where expensive GPU resources sit idle.
NVMe-oF over RoCE or high-speed Ethernet ensures the data pipeline keeps pace with compute: faster epoch times, more efficient checkpoint writing, accelerated model development. Within the next few years, NVMe-based tiered storage supporting parallel file systems like GPFS or Lustre is likely to be a baseline requirement for any serious AI initiative.
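The economics behind that pressure can be sketched with back-of-the-envelope arithmetic. Every figure here is illustrative, but the shape of the math is what matters: checkpoint stalls scale with cluster size, so fabric bandwidth pays for itself quickly.

```python
# Back-of-the-envelope cost of a stalled checkpoint: while the fabric
# drains a synchronous checkpoint write, every GPU in the job idles.
# All figures are illustrative.

def stall_cost(checkpoint_gb: float, fabric_gbps: float,
               gpus: int, gpu_hourly_usd: float):
    seconds = checkpoint_gb * 8 / fabric_gbps       # time to push the checkpoint
    idle_gpu_hours = gpus * seconds / 3600
    return seconds, idle_gpu_hours * gpu_hourly_usd

# A 2 TB checkpoint over a 100 Gb/s fabric, 1,024 GPUs at $2 per GPU-hour...
s, usd = stall_cost(2048, 100, 1024, 2.0)
print(f"{s:.0f}s stall, ~${usd:.0f} of idle GPU time per checkpoint")

# ...versus the same checkpoint over a 400 Gb/s NVMe-oF fabric.
s, usd = stall_cost(2048, 400, 1024, 2.0)
print(f"{s:.0f}s stall, ~${usd:.0f} of idle GPU time per checkpoint")
```

At checkpoint intervals of every few minutes, the difference between those two stall times compounds into hours of reclaimed GPU time per training day.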
What’s Coming Next
The NVMe specification continues to evolve. The 2.0 release introduced features that will further optimize fabric efficiency:
Zoned Namespaces (ZNS) lets the host collaborate with storage on data placement, organizing data into zones that align with NAND flash characteristics. This reduces write amplification and eliminates complex internal garbage collection, resulting in more predictable latency and longer-lasting drives.
Computational Storage enables processing within the storage device itself. Database filtering, compression, and AI inference can be offloaded to the storage controller, reducing data movement across the fabric.
Ultra Ethernet is being developed by the Ultra Ethernet Consortium specifically for AI and storage workloads, aiming to combine Ethernet’s scalability and cost-effectiveness with the adaptive routing and ultra-low latency previously exclusive to InfiniBand.
Conclusion
NVMe over Fabrics has reached a critical point of maturity. It’s no longer specialized technology for edge cases—it’s the primary architecture for modern enterprise storage.
The key takeaways: NVMe/TCP will likely dominate mainstream deployments due to its simplicity and “good enough” latency. RoCE provides maximum performance when you can invest in lossless fabric infrastructure. Fibre Channel remains the migration path for organizations with deep FC expertise and existing investments.
Whatever transport you choose, the underlying requirement is the same: storage-ready network infrastructure with leaf-spine topologies, proper QoS configuration, and awareness of the unique demands storage traffic places on your fabric.
The standard for data center storage is shifting from raw capacity to the intelligence and speed of the fabric that connects it. NVMe-oF is the bridge that allows silicon-speed storage to finally be realized at network scale.
