Research

Turbo: SmartNIC-enabled Dynamic Load Balancing of µs-scale RPCs (published at HPCA’23)

Insight: µs-scale RPCs require an immediate load-imbalance detection and remediation mechanism that operates at per-packet granularity. Existing NIC-based mechanisms (RSS, RSS++) are static and too coarse-grained, while software-based mechanisms introduce significant overhead for µs-scale RPCs.

Turbo’s load balancing intelligently steers packets into user-space queues at line rate using two adaptive policies, JSQ (Join Shortest Queue) and JLQ (Join Lightest Queue), improving throughput under tight tail-latency SLOs and reducing tail response latency.
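As an illustration, below is a minimal sketch of what a per-packet JSQ/JLQ steering decision could look like in the NIC datapath. The queue-state fields, metrics, and function names are assumptions for exposition (in particular, the "lightest queue" load metric is modeled as outstanding bytes); this is not Turbo's actual implementation.

```c
#include <stdint.h>

#define NUM_QUEUES 8

/* Hypothetical per-queue state the NIC could track per user-space queue. */
struct queue_state {
    uint32_t outstanding_pkts;   /* packets steered but not yet consumed (JSQ metric)   */
    uint32_t outstanding_bytes;  /* estimated pending work, used here as the JLQ metric */
};

/* JSQ: steer the packet to the queue with the fewest outstanding packets. */
static inline int pick_queue_jsq(const struct queue_state q[NUM_QUEUES]) {
    int best = 0;
    for (int i = 1; i < NUM_QUEUES; i++)
        if (q[i].outstanding_pkts < q[best].outstanding_pkts)
            best = i;
    return best;
}

/* JLQ: steer the packet to the queue with the least estimated pending work. */
static inline int pick_queue_jlq(const struct queue_state q[NUM_QUEUES]) {
    int best = 0;
    for (int i = 1; i < NUM_QUEUES; i++)
        if (q[i].outstanding_bytes < q[best].outstanding_bytes)
            best = i;
    return best;
}
```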

Turbo was implemented on a Mellanox Innova Flex-4 FPGA SmartNIC and evaluated with an RDMA UD microbenchmark and a Masstree-RDMA key-value store under various service time distributions.


NFSlicer: Data movement optimization for shallow network functions (on ArXiv)

Insight: Moving full packets across PCIe increases contention on the PCIe interface and thus its latency. Shallow network functions only require the packet header for processing, so they stand to benefit from reduced data movement.

NFSlicer implements a NIC-based mechanism that slices packets into header and payload on ingress, sending only the header to the host while storing the payload on the NIC. On egress, the payload is spliced back onto the header.
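The following is a software sketch of the slice/splice idea, assuming a fixed header length forwarded to the host and a tag that identifies the payload held in NIC memory. Names, sizes, and the tagging scheme are illustrative assumptions, not the actual hardware design.

```c
#include <stdint.h>
#include <string.h>

#define HDR_LEN      128   /* bytes forwarded to the host per packet (assumption)     */
#define MAX_PKT_LEN 1514
#define STORE_SLOTS 4096   /* tag must be < STORE_SLOTS                               */

/* Hypothetical on-NIC payload store, indexed by a tag carried with the header. */
static uint8_t  payload_store[STORE_SLOTS][MAX_PKT_LEN];
static uint16_t payload_len[STORE_SLOTS];

/* Ingress: keep the payload on the NIC, forward only the header (plus a tag) to the host. */
static uint16_t slice_on_ingress(const uint8_t *pkt, uint16_t len,
                                 uint16_t tag, uint8_t *hdr_to_host) {
    uint16_t hdr = len < HDR_LEN ? len : HDR_LEN;
    memcpy(hdr_to_host, pkt, hdr);                       /* crosses PCIe to the host  */
    payload_len[tag] = len - hdr;
    memcpy(payload_store[tag], pkt + hdr, len - hdr);    /* stays in NIC memory       */
    return hdr;
}

/* Egress: splice the stored payload back onto the (possibly modified) header. */
static uint16_t splice_on_egress(const uint8_t *hdr_from_host, uint16_t hdr,
                                 uint16_t tag, uint8_t *pkt_out) {
    memcpy(pkt_out, hdr_from_host, hdr);
    memcpy(pkt_out + hdr, payload_store[tag], payload_len[tag]);
    return hdr + payload_len[tag];
}
```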

NFSlicer’s slice/splice mechanism was designed in Vivado HLS and synthesized with the Synopsys Design Compiler on an open-source 15nm technology node library to extract power, area, and timing estimates and to ensure line-rate performance.

Enabling payload slicing eliminates data movement bottlenecks between the NIC and server, reducing tail response latency.


SmartNIC-based notification protocol to assist the CPU in handling RDMA connections at scale (in progress)

Online services are increasingly decomposed into microservices distributed across several tiers of the compute hierarchy within the datacenter. Leaf nodes, responsible for back-end microservices (e.g., key-value stores), often establish thousands of connections with front-end or mid-tier nodes to serve requests.

RDMA has gained traction as the low-latency transport in datacenters. With RDMA RC, the application typically interacts with each connection through its completion queue (CQ). Locating work across these queues becomes a problem at the scale of hundreds of connections, and idly polling empty queues is prohibitive when handling µs-scale services.
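To make the polling cost concrete, here is a minimal sketch of the naive per-connection pattern using standard verbs calls; the connection count and batch size are illustrative.

```c
#include <infiniband/verbs.h>

#define NUM_CONNS 1000
#define CQ_BATCH  16

/* Naive pattern: one CQ per RC connection, scanned round-robin by a core.
 * With hundreds of mostly-empty CQs, the core burns cycles just locating work,
 * which dominates service times that are themselves only a few microseconds. */
static void poll_all_connections(struct ibv_cq *cqs[NUM_CONNS]) {
    struct ibv_wc wc[CQ_BATCH];

    for (;;) {
        for (int c = 0; c < NUM_CONNS; c++) {
            int n = ibv_poll_cq(cqs[c], CQ_BATCH, wc);   /* usually returns 0 */
            for (int i = 0; i < n; i++) {
                if (wc[i].status == IBV_WC_SUCCESS) {
                    /* handle the completed request for connection c */
                }
            }
        }
    }
}
```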

One approach to this problem is shared CQs, where multiple connections share one endpoint within the application (and cores may share these CQs). However, because datacenter traffic is non-uniform and unpredictable, this leaves cores idle and limits the system’s throughput. Moreover, sharing CQs across threads introduces inter-core synchronization overhead, which is again prohibitive when handling µs-scale services.

In this work, I design and implement a SmartNIC-based host notification protocol: a mechanism that notifies cores of active RDMA connections and balances active connections across CPU cores to boost throughput and improve the service’s tail latency.
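One possible shape such a notification mechanism could take is sketched below: the SmartNIC appends the ids of connections with pending completions to a single host-memory ring, and a core drains that ring instead of scanning every per-connection CQ. All structures and names here are hypothetical assumptions for illustration only, not the protocol described in the forthcoming paper.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 1024

/* Hypothetical host-memory notification ring, written by the SmartNIC and
 * consumed by a host core. Polling one ring replaces scanning hundreds of CQs. */
struct notif_ring {
    _Atomic uint32_t head;               /* advanced by the NIC                 */
    uint32_t         tail;               /* advanced by the consuming host core */
    uint32_t         conn_id[RING_SLOTS];
};

/* Host side: drain newly notified connections and hand each to a core. */
static void drain_notifications(struct notif_ring *r,
                                void (*dispatch)(uint32_t conn_id)) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    while (r->tail != head) {
        uint32_t conn = r->conn_id[r->tail % RING_SLOTS];
        dispatch(conn);                  /* e.g., assign conn to the least-loaded core */
        r->tail++;
    }
}
```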

Full paper coming soon!