SmartNICs and DPUs: Offloading the Data Plane

Technical Overview

The Data Processing Unit (DPU) represents the third category of computational engine in modern datacenters, alongside CPUs (control plane and general compute) and GPUs (AI and graphics acceleration). DPUs run the data plane: network packet processing, storage virtualization, security enforcement, and telemetry collection—functions traditionally consuming 20–30% of server CPU cores in cloud environments. By offloading these to a dedicated chip on the NIC, DPUs free host CPU cores for tenant workloads, enforce isolation at hardware speed, and operate at line rate for 400 Gbps+ network links. In AI clusters specifically, DPUs enable in-network AllReduce (aggregating gradients inside switches/NICs), dramatically reducing collective communication latency.

Prerequisites

Understanding of datacenter networking (VLANs, VXLANs, BGP, ECMP)
Familiarity with RDMA and InfiniBand concepts (see 01-gpu-cluster-networking.md)
Knowledge of NVMe and storage protocol stack (NVMe-oF)
Understanding of Linux networking stack (netfilter, tc, XDP)
Basic understanding of PCIe device architecture and SR-IOV

Core Content

SmartNIC Evolution

Traditional NICs evolved from pure packet I/O devices (10/100/1000 Mbps Ethernet) to increasingly intelligent offload engines:

Generation 1 (1990s–2010): Hardware checksum offload (TCP/IP checksum in silicon), hardware segmentation offload (TSO/GSO), large receive offload (LRO). Host CPU freed from checksumming individual packets.

Generation 2 (2010–2016): SR-IOV (Single Root I/O Virtualization) — NIC presents virtual functions (VFs) for direct VM access to hardware queues, bypassing hypervisor for packet I/O. RSS (Receive Side Scaling) — multi-queue receive with flow hashing. Hardware VXLAN/GRE offload — tunnel encap/decap in silicon.

Generation 3 — SmartNIC (2016–present): Add an embedded processor complex (ARM cores or FPGA) to the NIC. Allows running arbitrary network processing logic in software on the NIC itself. The NIC becomes a programmable network appliance.

Traditional NIC:            SmartNIC / DPU:
┌──────────────┐            ┌──────────────────────────────────┐
│ PHY (SerDes) │            │ PHY (SerDes)                     │
│ MAC (Ethernet│            │ MAC                              │
│ Packet engine│            │ Packet classification engine     │
│ PCIe host    │            │ FPGA / ASIC / P4 pipeline        │
│ interface    │            │ ARM processor complex (4–16 cores)│
│              │            │ On-chip SRAM + DRAM               │
│              │            │ PCIe host interface (Gen5)       │
└──────────────┘            └──────────────────────────────────┘
Host CPU does:              DPU does:
 - IPsec encrypt/decrypt     - IPsec at line rate
 - VXLAN encapsulation        - VXLAN at line rate
 - ACL/firewall rules         - Stateful firewall
 - NVMe-oF initiator          - NVMe-oF target
 - OvS (Open vSwitch)         - Full OVS in hardware

DPU (Data Processing Unit)

The term "DPU" was popularized by NVIDIA for its BlueField line (Fungible had used it earlier for its own processors), and it is now used generically for SmartNICs with enough processing power to run a complete OS instance.

Key DPU vendors:
- NVIDIA BlueField-3 (BF3): dominant in AI/HPC
- AMD Pensando (Elba; Pensando was acquired by AMD in 2022): strong in enterprise/cloud
- Intel IPU (Infrastructure Processing Unit): Mount Evans (ASIC, E2000) and Oak Springs Canyon (FPGA-based)
- Marvell OCTEON 10: telecom + cloud
- Fungible F1 (Fungible was acquired by Microsoft, announced in early 2023): storage-focused

DPU capabilities taxonomy:

DPU Capability Stack:
┌─────────────────────────────────────────────────────────────────┐
│ HOST OS / HYPERVISOR (runs on main CPU)                         │
│   ↕ PCIe Gen5 x16 (~64 GB/s per direction)                      │
├─────────────────────────────────────────────────────────────────┤
│ DPU CAPABILITIES                                                 │
│                                                                 │
│  Networking                Storage               Security       │
│  ─────────────             ─────────             ────────       │
│  OVS offload               NVMe-oF initiator     IPsec/TLS      │
│  VXLAN/Geneve              NVMe-oF target        MACsec         │
│  RDMA/RoCEv2               Virtio-blk/scsi       Firewall/ACL   │
│  ECMP / LACP               Compression           Micro-segment  │
│  BGP/EVPN processing       Dedup / Encryption    DDoS mitigation│
│  Telemetry / NetFlow       RAID computation      Zero-trust     │
│                                                                 │
│  OS / Management                                                │
│  ─────────────                                                  │
│  Linux (Ubuntu 22.04)   ← runs on DPU's ARM cores               │
│  DOCA SDK runtime                                               │
│  Docker / Kubernetes     ← containerized data plane functions   │
└─────────────────────────────────────────────────────────────────┘
          │                        │
   Upstream network           Storage fabric
   (IB / Ethernet)            (NVMe-oF / FC)

NVIDIA BlueField-3 (BF3)

BlueField-3 is NVIDIA's third-generation DPU, combining an InfiniBand/Ethernet NIC with a powerful Arm processor complex.

BlueField-3 specifications:
- Network: 400 Gbps (2× 200 Gbps ports), NDR InfiniBand or 400GbE
- Arm complex: 16× Arm Cortex-A78AE (Armv8.2-A), 16 GB on-board DDR5
- PCIe host interface: Gen5 x16 (64 GB/s each direction)
- On-chip accelerators: RegEx engine (packet classification), compression/decompression engine, AES-256 cryptography, JPEG/NVMe-oF accelerator
- Power: 50W (in contrast to 700W H100)
- Process: TSMC N5

BlueField-3 architecture:

BlueField-3 Block Diagram:

  ┌──────────────────────────────────────────────────────────────┐
  │                    NVIDIA BlueField-3                        │
  │                                                              │
  │  ┌──────────┐    ┌──────────────────┐    ┌───────────────┐  │
  │  │ 16× ARM  │    │  On-chip         │    │  Hardware     │  │
  │  │ Cortex   │◀──▶│  Accelerators:   │    │  Crypto:      │  │
  │  │ A78AE    │    │  - RegEx engine  │    │  AES-256-GCM  │  │
  │  │ 16GB DDR5│    │  - Compression   │    │  SHA-2/3      │  │
  │  └────┬─────┘    │  - Dedup hash    │    │  IPsec/MACsec │  │
  │       │          └──────────────────┘    └───────┬───────┘  │
  │       │                                          │          │
  │  ┌────┴──────────────────────────────────────────┴──────┐   │
  │  │              Internal Coherent Fabric               │   │
  │  └────┬─────────────────────────────────────┬──────────┘   │
  │       │                                     │              │
  │  ┌────┴─────┐                          ┌────┴─────┐        │
  │  │  PCIe Gen5│                          │  Network │        │
  │  │  x16     │◀──── Host CPU/GPU ────▶  │  Engine  │        │
  │  │  (64GB/s) │                          │  400Gbps │        │
  │  └──────────┘                          └────┬─────┘        │
  │                                             │              │
  └─────────────────────────────────────────────┼──────────────┘
                                                │
                                     ┌──────────┴───────────┐
                                     │  InfiniBand NDR /    │
                                     │  400GbE Network      │
                                     └──────────────────────┘

DOCA SDK (Data Center Infrastructure On a Chip Architecture): NVIDIA's programming framework for BlueField. Abstracts hardware accelerators into portable APIs:

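// Example: DOCA Flow pipe that drops TCP traffic. Uses the struct-based pipe
// configuration of earlier DOCA 2.x releases; later releases use an opaque config.
struct doca_flow_match match = {0};
struct doca_flow_fwd fwd = { .type = DOCA_FLOW_FWD_DROP };  // action for matching packets
struct doca_flow_pipe_cfg pipe_cfg = {0};
struct doca_flow_pipe *drop_pipe = NULL;

match.outer.l4_type_ext = DOCA_FLOW_L4_TYPE_EXT_TCP;        // match any TCP packet
pipe_cfg.attr.name = "DROP_PIPE";
pipe_cfg.match = &match;                                    // config holds a pointer to the match
pipe_cfg.port = doca_port;                                  // a previously started DOCA Flow port
doca_error_t err = doca_flow_pipe_create(&pipe_cfg, &fwd, NULL, &drop_pipe);
// Pipe setup runs on the BlueField Arm cores; matching packets are then dropped
// by the NIC's hardware steering pipeline and never reach the host CPU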

AWS Nitro Cards as SmartNICs

AWS Nitro (2017, evolved from Annapurna Labs acquisition 2015) is the hyperscale SmartNIC deployment that redefined cloud computing economics. AWS EC2 instances run directly on hypervisors that delegate all I/O to dedicated Nitro cards.

Nitro card architecture:
- Nitro Card for networking: VPC networking and ENA (Elastic Network Adapter); encapsulates tenant traffic in VXLAN and enforces security groups at hardware speed
- Nitro Card for storage: EBS I/O and local NVMe instance storage, both presented to the instance as NVMe devices over PCIe
- Nitro Security Chip: BIOS/firmware measurement and attestation; prevents AWS operators from accessing customer memory
- Nitro Controller: coordinates Nitro cards and enforces hypervisor isolation

Impact on cloud economics: By offloading VPC networking to dedicated Nitro silicon, AWS frees ~30% of host CPU that was previously consumed by Open vSwitch and packet processing. This translates directly to additional EC2 instance capacity per physical server.

Nitro Hypervisor: A lightweight KVM-based hypervisor (~8 MB code vs 50+ MB for Xen). Because Nitro cards handle I/O, the hypervisor does very little. VMs get near-bare-metal performance.

Nitro isolation model (security architecture):

Physical Host:
┌─────────────────────────────────────────────────────────────┐
│  Customer VM (EC2 instance)                                  │
│  VirtIO-net device ──────────────┐                          │
│  VirtIO-blk device ──────────────┤                          │
└──────────────────────────────────┼──────────────────────────┘
                                   │ PCIe
                              ┌────┴──────────────────────────┐
                              │  Nitro Card                   │
                              │  ┌───────────────────────────┐ │
                              │  │ Per-VM queue pairs        │ │
                              │  │ Security group enforcement │ │
                              │  │ VXLAN encapsulation       │ │
                              │  └───────────────────────────┘ │
                              └────────────────────────────────┘
                                       │ Physical 100GbE
                                       ▼ (AWS backbone)
                              ┌────────────────────────────────┐
                              │ Nitro Hypervisor (thin layer)  │
                              │ No access to VM memory         │
                              │ No network I/O pass-through    │
                              └────────────────────────────────┘

NVIDIA SHARP: In-Network AllReduce

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) offloads the reduction step of AllReduce to InfiniBand switch ASICs (and, in some deployments, BlueField DPUs). GPUs still contribute and receive data, but the summation happens in the network rather than on the GPUs.

Standard AllReduce flow:

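Without SHARP: 8 GPUs, ring AllReduce
  Step 1 (reduce-scatter): each GPU sends a 1/8 gradient chunk to its ring neighbor and adds what it receives, 7 steps
  Step 2 (all-gather): each GPU forwards the fully reduced 1/8 chunks around the ring, 7 steps
  Total: 2(N-1) = 14 communication steps, O(N); the reduction arithmetic runs on every GPU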

SHARP AllReduce flow:

With SHARP v2: 8 GPUs, 2-layer IB switch fabric

  GPU0 → GPU7 send their gradient chunks to the leaf switches
  Leaf switch ASICs perform partial reduction (sum) in-silicon:
    GPU0[chunk] + GPU1[chunk] + GPU2[chunk] + GPU3[chunk] = partial_sum_0
  Spine switch performs final reduction:
    partial_sum_0 + partial_sum_1 = final_sum
  Broadcast final_sum back down to all GPUs

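  Total: one pass up and down the aggregation tree (4 link hops in this 2-layer fabric); depth grows with log N, not N
  Latency: ~5 µs vs ~50 µs for a standard AllReduce on 8 GPUs with small messages; a 1 MB payload is bounded below by serialization time (about 20 µs at 400 Gbps)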
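The step counts above can be turned into a rough latency estimate. The following is a minimal sketch of a simple latency/bandwidth model, not a measurement: the function names, the assumption that the tree streams the payload at link rate, and the example parameters (1 µs per hop, 200 Gbps links, 2 µs per aggregation level) are illustrative assumptions.

```python
def ring_allreduce_seconds(n_gpus, msg_bytes, hop_latency_s, link_bytes_per_s):
    """Ring AllReduce: 2(N-1) sequential steps, each moving a 1/N chunk."""
    steps = 2 * (n_gpus - 1)
    chunk_bytes = msg_bytes / n_gpus
    return steps * (hop_latency_s + chunk_bytes / link_bytes_per_s)


def sharp_allreduce_seconds(tree_levels, msg_bytes, hop_latency_s,
                            link_bytes_per_s, agg_latency_s):
    """In-network tree: one pass up and one pass down the aggregation tree.

    Assumption: switches stream the payload, so the message is serialized
    once at link rate and each tree level adds a fixed aggregation delay.
    """
    link_hops = 2 * tree_levels  # up to the root, then back down
    return (link_hops * hop_latency_s
            + tree_levels * agg_latency_s
            + msg_bytes / link_bytes_per_s)


if __name__ == "__main__":
    link = 200e9 / 8  # 200 Gbps expressed in bytes per second
    for size in (1_000, 1_000_000):
        ring = ring_allreduce_seconds(8, size, 1e-6, link)
        tree = sharp_allreduce_seconds(2, size, 1e-6, link, 2e-6)
        print(f"{size:>9} B  ring={ring * 1e6:6.1f} us  tree={tree * 1e6:6.1f} us")
```

Under these assumptions the tree wins mainly on the latency term, which is why the benefit is largest for small messages and grows with GPU count.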

SHARP hardware requirements: NVIDIA (formerly Mellanox) Quantum-class InfiniBand switches with SHARP aggregation engines, such as Quantum-2, and SHARP-capable adapters (ConnectX HCAs or BlueField-3 DPUs) at the endpoints. Requires a homogeneous NVIDIA InfiniBand fabric.

SHARP v3 (2024): Adds support for FP8/BF16 reduce precision, larger reduction operand sizes, tree topology in addition to switch-aggregation, and Sharp-DPU (BF3 acts as aggregation point for non-IB fabrics).

DPU Use Cases in AI Infrastructure

1. In-network NCCL acceleration (SHARP):
- NCCL's ncclAllReduce can use SHARP through its CollNet path when the NCCL SHARP plugin is installed and NCCL_COLLNET_ENABLE=1 is set
- Reduces AllReduce latency 2–5× for small tensors (<1 MB)
- Used in NVIDIA DGX SuperPOD deployments
- Particularly effective for data- and tensor-parallel training, where small tensors are all-reduced frequently

2. NVMe-oF over RDMA (storage acceleration):
- DPU acts as NVMe-oF target: presents local NVMe SSDs to remote hosts over RDMA
- Remote hosts see the storage as a local NVMe block device (for example /dev/nvme0n1) with <50 µs latency
- DPU handles NVMe-oF protocol, RDMA queue management, and NVMe command dispatching
- Used for centralized NVMe checkpoint storage in AI training clusters

3. Telemetry and monitoring:
- DPU taps all network flows, counts packet/byte statistics at line rate (400 Gbps)
- Exports IPFIX/sFlow records to monitoring systems without impacting host CPU
- Calculates per-flow latency histograms using hardware timestamping
- Critical for identifying straggler GPU nodes via network performance monitoring

4. Security enforcement:
- DPU enforces micro-segmentation: every packet from/to the host is inspected regardless of host OS
- Even a compromised host OS cannot bypass DPU-enforced firewall rules
- MACsec on DPU encrypts all traffic at L2 before it leaves the NIC
- Zero-trust enforcement: DPU validates workload identity via attestation before allowing network access

DPU for NVMe-oF Storage

NVMe-oF (NVMe over Fabrics) extends the NVMe protocol over RDMA or TCP networks. DPUs act as NVMe-oF targets (exposing storage) or initiators (accessing remote storage).

NVMe-oF Architecture with DPU:

AI Training Node:
  GPU0..GPU7
  Host CPU ──▶ [BlueField-3 DPU] ──▶ RDMA fabric ──▶ [Storage DPU] ──▶ NVMe SSD array
               NVMe-oF initiator                       NVMe-oF target    16× NVMe U.3

Storage DPU responsibilities:
  - RDMA transport (RoCEv2)
  - NVMe-oF target mode
  - Encryption (AES-XTS 256-bit) of data at rest
  - Compression (LZ4/DEFLATE) to reduce checkpoint size
  - RAID-6 computation for data protection
  - Load balancing across NVMe namespaces

Latency: 50–100 µs (comparable to local NVMe when RDMA latency < 10 µs)
Throughput: 200 Gbps per DPU (BlueField-3 PCIe Gen5)
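To see which stage of this path limits throughput, the sustained rate can be approximated as the minimum across the network link, the PCIe interface, and the SSD array. This is a minimal sketch: the function names are hypothetical, the 64 GB/s figure is the per-direction PCIe Gen5 x16 number quoted earlier, and 7 GB/s per SSD is an assumed drive speed, not a BlueField specification.

```python
def gbps_to_gbytes(gbps):
    """Convert a link rate in Gbit/s to GB/s."""
    return gbps / 8.0


def storage_bottleneck(network_gbps, pcie_gbytes_per_dir, ssd_count, ssd_gbytes_each):
    """Return (stage_name, GB/s) for the slowest stage of the storage path."""
    stages = {
        "network": gbps_to_gbytes(network_gbps),
        "pcie": pcie_gbytes_per_dir,
        "ssd": ssd_count * ssd_gbytes_each,
    }
    slowest = min(stages, key=stages.get)
    return slowest, stages[slowest]


if __name__ == "__main__":
    # 16-drive array behind a 400 Gbps DPU: the network is the limit
    print(storage_bottleneck(400, 64, 16, 7))
    # 4-drive array: the SSDs are the limit
    print(storage_bottleneck(400, 64, 4, 7))
```

The model ignores protocol overhead, RAID parity traffic, and compression, all of which shift the balance in practice.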

Security on DPU: Micro-Segmentation and Zero-Trust

The DPU's position in the data path (all traffic must pass through it) makes it ideal for enforcing security policy without trusting the host OS.

Threat model: In cloud environments, the host OS is tenant-controlled. A compromised tenant VM can attempt to send malicious traffic (ARP spoofing, MAC flooding, IP spoofing) or exfiltrate data. Standard hypervisor virtual switch (OvS) security policies run in the hypervisor, which is still on the host CPU and vulnerable to hypervisor exploits.

DPU security enforcement:

Traffic entering DPU from host:
  1. Classify packet (regex engine, ~1 cycle)
  2. Check source MAC/IP against registered identities (RDMA protection domains)
  3. Apply firewall rules (connection tracking in DPU SRAM, 1M flows)
  4. Encrypt if MACsec policy applies (AES-256 hardware engine)
  5. Forward to physical network

Traffic from network to host:
  1. Decrypt/verify MACsec
  2. Identify destination VM/container (via VXLAN VNI)
  3. Apply ingress ACLs
  4. Enforce bandwidth limits (policing/shaping)
  5. Deliver to correct virtual queue

Critical: these operations run on DPU ARM cores + accelerators, not host CPU
Even if host OS is root-compromised, DPU policies cannot be bypassed
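The egress steps above (source identity check, rule lookup, connection tracking with a bounded flow table) can be sketched in software. This is an illustrative model only: the class and field names are hypothetical, and a real DPU performs these lookups in hardware flow tables rather than in Python dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_mac: str
    src_ip: str
    dst_ip: str
    dst_port: int
    proto: str


class EgressPolicy:
    """Toy model of DPU egress enforcement for traffic leaving the host."""

    def __init__(self, identities, allowed, max_flows=1_000_000):
        self.identities = identities  # registered {src_mac: src_ip} bindings
        self.allowed = allowed        # set of (dst_ip, dst_port, proto) rules
        self.max_flows = max_flows    # models the finite conntrack table
        self.flows = set()            # established flows

    def check(self, pkt):
        # Step 2: reject spoofed source MAC/IP pairs
        if self.identities.get(pkt.src_mac) != pkt.src_ip:
            return "drop:spoofed-source"
        flow = (pkt.src_ip, pkt.dst_ip, pkt.dst_port, pkt.proto)
        # Step 3: established flows skip the rule lookup
        if flow in self.flows:
            return "forward"
        if (pkt.dst_ip, pkt.dst_port, pkt.proto) not in self.allowed:
            return "drop:no-rule"
        if len(self.flows) >= self.max_flows:
            return "drop:table-full"
        self.flows.add(flow)
        return "forward"


if __name__ == "__main__":
    policy = EgressPolicy({"aa:bb": "10.0.0.5"}, {("10.0.1.9", 5432, "tcp")})
    print(policy.check(Packet("aa:bb", "10.0.0.5", "10.0.1.9", 5432, "tcp")))
    print(policy.check(Packet("aa:bb", "10.0.0.99", "10.0.1.9", 5432, "tcp")))
```

The explicit table-full branch matters operationally: flow-table exhaustion is a failure mode in its own right, so the policy should define what happens when it occurs.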

DOCA Zero-Trust framework: NVIDIA's reference architecture using BF3 for zero-trust enforcement:
- Workload identity attestation via DOCA Comm Channel (secure channel between BF3 and host)
- mTLS for workload-to-workload authentication (certificates rotated by DPU, not host)
- Policy updates via DOCA management plane (encrypted, signed)

Historical Context

The concept of offloading network processing predates SmartNICs: Myricom's Myrinet (1994) had on-card processors. InfiniBand HCAs (2001) included RDMA engines. The modern SmartNIC era began around 2016 with Mellanox pairing ConnectX-4 Lx with an FPGA co-processor, and with the Cavium LiquidIO SmartNIC. Mellanox's BlueField-1 (2018) was the first DPU in the modern sense; the line became NVIDIA's when NVIDIA completed its roughly $7B acquisition of Mellanox in 2020. AWS Nitro (2017, based on the 2015 Annapurna Labs acquisition) showed the hyperscale impact. By 2020, all major cloud providers (AWS, Azure, Google, Alibaba) had custom SmartNIC programs. AMD acquired Pensando (DPU startup) for $1.9B in 2022. NVIDIA's Mellanox purchase, and its proposed $40B acquisition of Arm (announced 2020, abandoned 2022), were partly motivated by DPU strategy.

Production Examples

AWS Nitro C5/R5/M5 instances (2017+): All modern AWS instances use Nitro cards for ENA networking and EBS storage. Enables 100 Gbps per VM with <1ms EBS latency.

Azure SmartNIC (Project Catapult / AccelNet, 2018): Microsoft uses FPGA-based SmartNICs for Azure Accelerated Networking. Azure's RDMA (InfiniBand) for HPC uses standard Mellanox hardware, while tenant virtual network (VNet) traffic goes through the FPGA SmartNICs.

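NVIDIA DGX SuperPOD with BF3 (2023): Each DGX H100 server pairs 8× ConnectX-7 adapters for the InfiniBand compute fabric with 2× BlueField-3 DPUs for storage and management networking. SHARP reductions for the compute fabric run in the Quantum-2 switches rather than on the BF3.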

Google TPU Pod Networking: Google uses custom SmartNIC-equivalent ASICs for TPU Pod ICI management (details not published), handling ICI flow control and topology management on dedicated chips separate from the TPU compute ASICs.

Debugging Notes

BlueField "rshim" connectivity: BF3 management interface is via "rshim" (remote shift-register interface) — a PCIe/USB-based management path from host to DPU. If tmfifo_net0 (BlueField management network) is down, DPU is unmanageable. Check: systemctl status rshim on host.

SHARP configuration validation: SHARP requires all switches and endpoints to support it. Use sharp_hello tool to verify SHARP topology. Common issue: firmware mismatch between switch and NIC SHARP versions.

NVMe-oF target registration failure: If BF3 can't register as an NVMe-oF target, check that the nvmet and nvmet-rdma kernel modules are loaded on the DPU's Arm OS, which runs its own independent Linux. The nvme-rdma module is the initiator (host) side and is not sufficient for target mode.
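A quick way to automate the module check on the DPU's Linux is to parse /proc/modules. This is a small sketch under stated assumptions: the helper name is hypothetical, the required list assumes an RDMA target (nvmet plus nvmet_rdma, which is how the kernel reports nvmet-rdma), and drivers compiled into the kernel do not appear in /proc/modules at all.

```python
def missing_modules(proc_modules_text, required=("nvmet", "nvmet_rdma")):
    """Return the required kernel modules absent from /proc/modules content."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    # Run on the DPU's Arm OS, not on the host
    with open("/proc/modules") as f:
        missing = missing_modules(f.read())
    print("missing modules:", missing or "none")
```

An empty result only shows that the modules are loaded; the target subsystem and port configuration still have to be checked separately.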

DPU ARM OS crash: BF3 runs Linux on ARM cores. An OOM condition on DPU (due to flow table growth or debugging tooling) can crash the DPU OS, disconnecting the host's network. DPU is hot-rebootable without disrupting PCIe connection to host in latest firmware versions.

Security Implications

DPU as attack surface: DPU runs a full Linux OS. If an attacker gains RCE on the DPU ARM OS (e.g., via a vulnerable network management service), they have full control over all host network traffic. DPU management network must be isolated from production traffic. Regular firmware updates are critical (BF3 has had CVEs in DOCA components).

Host-to-DPU trust boundary: The host can communicate with DPU via PCIe (DOCA comm channel). A compromised host cannot modify DPU's packet processing pipeline (it runs in ARM kernel mode on DPU), but can send crafted packets to trigger DPU processing bugs. DOCA's security model assumes host is untrusted.

SHARP security model: SHARP AllReduce aggregates gradients in switches. A compromised switch or DPU in the network path could inject incorrect values into the AllReduce (gradient poisoning). SHARP doesn't currently implement cryptographic integrity for allreduce operands — this is a known limitation for multi-tenant SHARP deployments. Use SHARP only in trusted single-tenant fabrics.

Supply chain: DPU firmware is loaded during manufacturing and updated by the datacenter operator. UEFI Secure Boot + cryptographic firmware signing ensures only authorized firmware runs. BlueField-3 supports DICE (Device Identifier Composition Engine) for hardware attestation.

Performance Implications

DPU ARM processing overhead: BF3 ARM cores run at ~3 GHz (Cortex-A78AE). Evaluating complex stateful firewall rules (10K rules) per packet at 400 Gbps (roughly 595M minimum-size 64-byte packets/second, or about 32.5M/second at a 1500-byte MTU) requires hardware-assisted classification (regex engine and hardware flow tables). Pure ARM software processing would saturate all 16 cores well below this rate. Always use DOCA hardware APIs, not software packet loops.
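The packet rate implied by a 400 Gbps link follows directly from the frame size, and it shows how few CPU cycles are available per packet. A minimal sketch, assuming standard Ethernet framing (20 bytes of preamble and inter-frame gap per frame); the function names are illustrative, and the 16 cores at 3 GHz come from the figures in the text.

```python
ETH_OVERHEAD_BYTES = 20  # preamble + start delimiter (8) and inter-frame gap (12)


def line_rate_pps(link_gbps, frame_bytes):
    """Packets per second at line rate for a given Ethernet frame size.

    frame_bytes includes the Ethernet header and FCS: 64 for a minimum-size
    frame, 1518 for a frame carrying a 1500-byte MTU payload.
    """
    bits_on_wire = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire


def cycles_per_packet(pps, cores, core_hz):
    """CPU cycles available per packet if all cores share the load evenly."""
    return cores * core_hz / pps


if __name__ == "__main__":
    for frame in (64, 1518):
        pps = line_rate_pps(400, frame)
        budget = cycles_per_packet(pps, cores=16, core_hz=3e9)
        print(f"{frame:>5} B frames: {pps / 1e6:7.1f} Mpps, "
              f"{budget:7.1f} cycles/packet across 16 cores")
```

At minimum frame size the budget is on the order of 80 cycles per packet across all 16 cores combined, which is far too little for a 10K-rule software lookup.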

PCIe bandwidth bottleneck: BF3 PCIe Gen5 x16 provides about 64 GB/s per direction (128 GB/s bidirectional). A 400 Gbps network rate is 50 GB/s, so PCIe has roughly 14 GB/s of headroom per direction. However, for NVMe-oF with multiple VMs sharing storage, aggregate storage traffic can approach PCIe limits; use NVMe-oF namespacing to distribute traffic.

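SHARP latency vs overhead: SHARP adds ~2 µs to AllReduce latency (switch processing), but replaces the 2(N-1) sequential steps of a ring AllReduce with a single pass up and down the aggregation tree. For small, latency-bound messages the break-even is around 2 GPUs, and at 4+ GPUs SHARP is generally faster; for large, bandwidth-bound messages the gain is smaller.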

Failure Modes and Real Incidents

Incident: BF3 firmware update bricking DPU (2023): A rushed firmware update to BlueField-3 in a 500-node cluster caused DPUs to fail to boot after firmware flashing (bad signature validation). All 500 nodes lost network connectivity simultaneously. Recovery required physical console access and manual recovery image boot. Mitigation: staged firmware rollouts (10% → 25% → 50% → 100%), A/B firmware slots.
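The staged rollout mitigation can be sketched as a wave planner with a health gate between waves. This is an illustrative outline, not a deployment tool: the function names and the flash and is_healthy callbacks are hypothetical placeholders for whatever firmware and monitoring tooling a site uses.

```python
def rollout_waves(nodes, fractions=(0.10, 0.25, 0.50, 1.00)):
    """Split nodes into waves so cumulative coverage follows the fractions."""
    waves, done = [], 0
    for frac in fractions:
        target = min(len(nodes), max(done, round(len(nodes) * frac)))
        if target > done:
            waves.append(nodes[done:target])
            done = target
    return waves


def staged_rollout(nodes, flash, is_healthy, fractions=(0.10, 0.25, 0.50, 1.00)):
    """Flash wave by wave and stop at the first wave with an unhealthy node.

    Returns (updated_nodes, completed), where completed is False if the
    rollout was halted early.
    """
    updated = []
    for wave in rollout_waves(nodes, fractions):
        for node in wave:
            flash(node)
        updated.extend(wave)
        if not all(is_healthy(node) for node in wave):
            return updated, False  # halt: remaining nodes keep old firmware
    return updated, True


if __name__ == "__main__":
    cluster = [f"dpu{i:03d}" for i in range(500)]
    print([len(w) for w in rollout_waves(cluster)])
```

With the gate in place, a bad image like the one in this incident would have been confined to the first 50 of the 500 nodes.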

Incident: SHARP tree failure cascading to all jobs: A single switch ASIC failure caused SHARP tree reconstruction to stall for 300 seconds. All AllReduce operations using SHARP timed out, killing 400 concurrent training jobs. NCCL's fallback to non-SHARP AllReduce was not triggered because SHARP wasn't failing outright; it was hanging. Fix: bound every collective with a timeout so that a hung SHARP operation is aborted and the job can restart without SHARP instead of waiting on the fabric indefinitely.

Incident: NVMe-oF connection loss mid-training checkpoint: A BF3 DPU OOM (out of memory on ARM OS, caused by debugging tooling left running) reset the NVMe-oF connection pool. In-flight checkpoint write failed silently (block device returned errors that PyTorch's torch.save didn't propagate as exceptions). Checkpoint file was partially written. The training job continued and the next checkpoint overwrote the corrupted one—but the team only discovered the corruption after an unrelated restart. Mitigation: fsync-verified checkpoint writes, checkpoint integrity hashes.
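The mitigation (fsync-verified writes plus integrity hashes) might look like the sketch below. It is a minimal example under stated assumptions: the function names and the ".sha256" sidecar file are illustrative choices, and the read-back may be served from the page cache, so it catches truncated or failed writes rather than every device-level fault.

```python
import hashlib
import os
import tempfile


def write_checkpoint(path, payload):
    """Write bytes atomically: temp file, fsync, read-back hash check, rename."""
    digest = hashlib.sha256(payload).hexdigest()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # surfaces I/O errors a buffered write can hide
        with open(tmp_path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                raise IOError("checkpoint verification failed")
        os.replace(tmp_path, path)  # previous checkpoint survives until here
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    with open(path + ".sha256", "w") as f:
        f.write(digest)
    return digest


def verify_checkpoint(path):
    """Recompute the checkpoint hash and compare it with the sidecar file."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected
```

A training loop could serialize the state into an in-memory buffer (torch.save accepts file-like objects) and pass the resulting bytes to write_checkpoint, then call verify_checkpoint before deleting any older checkpoint.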

Modern Usage

BlueField-3 in DGX H100 (2023): Standard equipment in NVIDIA's flagship AI servers. BF3 handles storage and management networking plus telemetry, while the ConnectX-7 compute fabric relies on SHARP in the InfiniBand switches for NCCL AllReduce.

Pensando in public clouds: AMD Pensando DPUs are deployed by cloud providers including Microsoft Azure, Oracle Cloud, and IBM Cloud for higher-throughput, lower-latency virtual networking. AWS Nitro cards, by contrast, use AWS's own silicon from Annapurna Labs.

Microsoft Azure DPU (2023): Azure announced custom DPU (details limited) for next-generation Azure networking, replacing current FPGA-based SmartNICs.

eBPF on DPU: NVIDIA DOCA now supports eBPF programs running on BF3 ARM cores, enabling dynamic packet processing without full DPU firmware updates. Kubernetes CNI plugins (Calico, Cilium) can offload eBPF policies to DPU.

Future Directions

CXL DPU: DPU that bridges PCIe and CXL, enabling DPU to manage disaggregated memory fabrics as well as network + storage
P4-programmable DPU: BF4 likely to include P4 pipeline in addition to ARM cores, enabling line-rate programmable packet processing without ASIC specialization
AI inference on DPU: Running small ML models (intrusion detection classifiers, traffic forecasting) on DPU ARM cores for real-time network intelligence
NVIDIA BlueField Spectrum-X: Combined DPU + Ethernet switch on a single ASIC for top-of-rack switches with built-in DPU capabilities

Exercises

DPU throughput calculation: A BlueField-3 has 400 Gbps network ports and PCIe Gen5 x16 (about 64 GB/s per direction, 128 GB/s bidirectional). For NVMe-oF storage serving (network → storage), calculate: (a) maximum storage throughput at 400 Gbps, (b) PCIe write bandwidth to local NVMe at 7 GB/s per NVMe × 4 NVMes, (c) identify the bottleneck.
SHARP AllReduce latency: For N=64 GPUs in an IB fat-tree with SHARP-enabled Quantum-2 switches, model the AllReduce latency for a 1 MB gradient tensor: (a) without SHARP (ring AllReduce over IB), (b) with SHARP (2-layer aggregation tree). Assume per-hop latency = 1 µs, IB link bandwidth = 200 Gbps per port, SHARP aggregation adds 2 µs per tree level.
DPU firewall performance: Implement a simple software packet classifier in C that evaluates a 10,000-rule firewall policy (source IP + dest IP + port). Benchmark throughput at 64-byte and 1500-byte packet sizes on a standard CPU core. Compare to the packet rate needed for 400 Gbps line rate (about 595M packets/second at 64 bytes, about 32.5M at 1500 bytes). Explain the gap.
NVMe-oF lab setup: Configure a BlueField-3 (or software emulation via nvmet kernel module) as an NVMe-oF target over RoCEv2. Mount from a remote host using nvme connect --transport=rdma. Benchmark latency and throughput vs local NVMe using fio. Observe the effect of NVMe namespace count on throughput.
Zero-trust policy design: Design a BlueField-3 DOCA-based zero-trust policy for a 3-tier application (frontend, backend, database). Specify: (a) per-tier firewall rules, (b) mTLS certificate rotation policy, (c) what happens when a certificate expires mid-connection, (d) how to update policies on 1,000 DPUs without downtime.

References

NVIDIA BlueField-3 DPU Architecture Overview: https://www.nvidia.com/en-us/networking/products/data-processing-unit/
NVIDIA DOCA SDK documentation: https://docs.nvidia.com/doca/sdk/
AWS Nitro System Whitepaper: https://d1.awsstatic.com/whitepapers/AWS_Nitro_System_Whitepaper.pdf
Alonso et al., "SHARP: An Architecture for In-Network Compute," ACM SIGCOMM 2019
AMD Pensando Elba Architecture: https://www.amd.com/en/products/pensando
Intel IPU (E2000 Mount Evans) Architecture Manual, Intel 2022
Dragojevic et al., "FaRM: Fast Remote Memory," NSDI 2014 (motivates DPU RDMA primitives)
Ports et al., "Towards In-Network Distributed Computing," HotNets 2015